Commit Graph

56429 Commits

Author SHA1 Message Date
Nikolay Borisov
07e21c4dad btrfs: Document locking requirement via lockdep_assert_held
Remove stale comment since there is no longer an eb->eb_lock and
document the locking expectation with a lockdep_assert_held statement.
No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:54 +02:00
David Sterba
55ac01396a btrfs: rename btrfs_release_extent_buffer_page
The function used to release one page (and always the first one), but
not anymore since a50924e3a4 ("btrfs: drop constant param
from btrfs_release_extent_buffer_page").  Update the name and comment.

Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:54 +02:00
Nikolay Borisov
d64766fdf9 btrfs: Refactor loop in btrfs_release_extent_buffer_page
The purpose of the function is to free all the pages comprising an
extent buffer. This can be achieved with a simple for loop rather than
the slightly more involved 'do {} while' construct. So rewrite the
loop using a 'for' construct. Additionally we can never have an
extent_buffer that has 0 pages so remove the check for index == 0. No
functional changes.

The reversed order used to have a meaning in the past where the first
page served as a blocking point for several callers. See eg
4f2de97ace ("Btrfs: set page->private to the eb").

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:54 +02:00
Nikolay Borisov
b16d011e79 btrfs: Reword dodgy comments in alloc_extent_buffer
Commit eb14ab8ed2 ("Btrfs: fix page->private races") fixed a genuine
race between extent buffer initialisation and btree_releasepage.
Unfortunately as the code has evolved the comments weren't changed which
made them slightly wrong and they weren't very clear in the fist place.
Fix this by (hopefully) rewording them in a more approachable manner.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:54 +02:00
Nikolay Borisov
28187ae569 btrfs: Simplify page unlocking in alloc_extent_buffer
Current version of the page unlocking code was added in
727011e07c ("Btrfs: allow metadata blocks larger than the page size")
but even in this commit that particular flag was never used per-se. In
fact, btrfs only uses PageChecked for data pages to identify pages
which have been dirtied but don't have ORDERED bit set. For more
information see 247e743cbe ("Btrfs: Use async helpers to deal with
pages that have been improperly dirtied").

However, this doesn't apply to extent buffer pages. The important bit
here is that the pages are unlocked AFTER the extent buffer has been
properly recorded in the radix tree to avoid races with
btree_releasepage. Let's exploit this fact and simplify the page
unlocking sequence by unlocking the pages in-order and removing the
redundant PageChecked flag setting/clearing.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:54 +02:00
Qu Wenruo
8b9b6f2554 btrfs: scrub: cleanup the remaining nodatasum fixup code
Remove the remaining code that misused the page cache pages during
device replace and could cause data corruption for compressed nodatasum
extents. Such files do not normally exist but there's a bug that allows
this combination and the corruption was exposed by device replace fixup
code.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:53 +02:00
David Sterba
46df06b85e btrfs: refactor block group replication factor calculation to a helper
There are many places that open code the duplicity factor of the block
group profiles, create a common helper. This can be easily extended for
more copies.

Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:53 +02:00
Anand Jain
321a4bf72b btrfs: use the assigned fs_devices instead of the dereference
We have assigned the %fs_info->fs_devices in %fs_devices as its not
modified just use it for the mutex_lock().

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:53 +02:00
Lu Fengqi
62088ca742 btrfs: qgroup: Drop fs_info parameter from qgroup_rescan_leaf
It can be fetched from the transaction handle.

Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:53 +02:00
Lu Fengqi
a937742250 btrfs: qgroup: Drop fs_info parameter from btrfs_qgroup_inherit
It can be fetched from the transaction handle.

Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:53 +02:00
Lu Fengqi
280f8bd2cb btrfs: qgroup: Drop fs_info parameter from btrfs_run_qgroups
It can be fetched from the transaction handle.

Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:52 +02:00
Lu Fengqi
8696d76045 btrfs: qgroup: Drop fs_info parameter from btrfs_qgroup_account_extent
It can be fetched from the transaction handle.

Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:52 +02:00
Lu Fengqi
deb4062743 btrfs: qgroup: Drop root parameter from btrfs_qgroup_trace_subtree
The fs_info can be fetched from the transaction handle directly.

Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:52 +02:00
Lu Fengqi
8d38d7eb7b btrfs: qgroup: Drop fs_info parameter from btrfs_qgroup_trace_leaf_items
It can be fetched from the transaction handle.

Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:52 +02:00
Lu Fengqi
a95f3aafd6 btrfs: qgroup: Drop fs_info parameter from btrfs_qgroup_trace_extent
It can be fetched from the transaction handle. In addition, remove the
WARN_ON(trans == NULL) because it's not possible to hit this condition.

Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:52 +02:00
Lu Fengqi
f0042d5e92 btrfs: qgroup: Drop fs_info parameter from btrfs_limit_qgroup
It can be fetched from the transaction handle.

Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:51 +02:00
Lu Fengqi
3efbee1d00 btrfs: qgroup: Drop fs_info parameter from btrfs_remove_qgroup
It can be fetched from the transaction handle.

Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:51 +02:00
Lu Fengqi
49a05ecde3 btrfs: qgroup: Drop fs_info parameter from btrfs_create_qgroup
It can be fetched from the transaction handle.

Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:51 +02:00
Lu Fengqi
39616c2735 btrfs: qgroup: Drop fs_info parameter from btrfs_del_qgroup_relation
It can be fetched from the transaction handle.

Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:51 +02:00
Lu Fengqi
6b36f1aa5c btrfs: qgroup: Drop fs_info parameter from __del_qgroup_relation
It can be fetched from the transaction handle.

Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:51 +02:00
Lu Fengqi
9f8a6ce6ba btrfs: qgroup: Drop fs_info parameter from btrfs_add_qgroup_relation
It can be fetched from the transaction handle.

Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:51 +02:00
Lu Fengqi
2e980acdd8 btrfs: qgroup: Drop quota_root and fs_info parameters from update_qgroup_status_item
They can be fetched from the transaction handle.

Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:50 +02:00
Lu Fengqi
3e07e9a09f btrfs: qgroup: Drop root parameter from update_qgroup_info_item
It can be fetched from the transaction handle.

Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:50 +02:00
Lu Fengqi
ac8a866af1 btrfs: qgroup: Drop root parameter from update_qgroup_limit_item
It can be fetched from the transaction handle.

Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:50 +02:00
Lu Fengqi
69104618f4 btrfs: qgroup: Drop quota_root parameter from del_qgroup_item
It can be fetched from the transaction handle.

Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:50 +02:00
Lu Fengqi
99d7f09ac0 btrfs: qgroup: Drop quota_root parameter from del_qgroup_relation_item
It can be fetched from the transaction handle.

Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:50 +02:00
Lu Fengqi
711169c40f btrfs: qgroup: Drop quota_root parameter from add_qgroup_relation_item
It can be fetched from the transaction handle.

Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:49 +02:00
Anand Jain
fa59f27c8c btrfs: rename btrfs_parse_early_options
Rename btrfs_parse_early_options() to btrfs_parse_device_options(). As
btrfs_parse_early_options() parses the -o device options and scan the
device provided. So this rename specifies its action. Also the function
name is in line with btrfs_parse_subvol_options().
No functional changes.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:49 +02:00
Lu Fengqi
c8389d4c0d btrfs: qgroup: cleanup the unused srcroot from btrfs_qgroup_inherit
Since commit 0b246afa62 ("btrfs: root->fs_info cleanup, add fs_info
convenience variables"), the srcroot is no longer used to get
fs_info::nodesize.  In fact, it can be dropped after commit 707e8a0715
("btrfs: use nodesize everywhere, kill leafsize").

Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:49 +02:00
Qu Wenruo
031f24da2c btrfs: Use btrfs_mark_bg_unused to replace open code
Introduce a small helper, btrfs_mark_bg_unused(), to acquire locks and
add a block group to unused_bgs list.

No functional modification, and only 3 callers are involved.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:49 +02:00
Nikolay Borisov
2556fbb0be btrfs: Rewrite retry logic in do_chunk_alloc
do_chunk_alloc implements logic to detect whether there is currently
pending chunk allocation (by means of space_info->chunk_alloc being
set) and if so it loops around to the 'again' label. Additionally,
based on the state of the space_info (e.g. whether it's full or not)
and the return value of should_alloc_chunk() it decides whether this
is a "hard" error (ENOSPC) or we can just return 0.

This patch refactors all of this:

1. Put order to the scattered ifs handling the various cases in an
easy-to-read if {} else if{} branches. This makes clear the various
cases we are interested in handling.

2. Call should_alloc_chunk only once and use the result in the
if/else if constructs. All of this is done under space_info->lock, so
even before multiple calls of should_alloc_chunk were unnecessary.

3. Rewrite the "do {} while()" loop currently implemented via label
into an explicit loop construct.

4. Move the mutex locking for the case where the caller is the one doing
the allocation. For the case where the caller needs to wait a concurrent
allocation, introduce a pair of mutex_lock/mutex_unlock to act as a
barrier and reword the comment.

5. Switch local vars to bool type where pertinent.

All in all this shouldn't introduce any functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:49 +02:00
Ethan Lien
dec59fa3a7 btrfs: use customized batch size for total_bytes_pinned
In commit b150a4f10d ("Btrfs: use a percpu to keep track of possibly
pinned bytes") we use total_bytes_pinned to track how many bytes we are
going to free in this transaction. When we are close to ENOSPC, we check it
and know if we can make the allocation by commit the current transaction.
For every data/metadata extent we are going to free, we add
total_bytes_pinned in btrfs_free_extent() and btrfs_free_tree_block(), and
release it in unpin_extent_range() when we finish the transaction. So this
is a variable we frequently update but rarely read - just the suitable
use of percpu_counter. But in previous commit we update total_bytes_pinned
by default 32 batch size, making every update essentially a spin lock
protected update. Since every spin lock/unlock operation involves syncing
a globally used variable and some kind of barrier in a SMP system, this is
more expensive than using total_bytes_pinned as a simple atomic64_t.

So fix this by using a customized batch size. Since we only read
total_bytes_pinned when we are close to ENOSPC and fail to allocate new
chunk, we can use a really large batch size and have nearly no penalty
in most cases.

[Test]
We tested the patch on a 4-cores x86 machine:

1. fallocate a 16GiB size test file
2. take snapshot (so all following writes will be COW)
3. run a 180 sec, 4 jobs, 4K random write fio on test file

We also added a temporary lockdep class on percpu_counter's spin lock
used by total_bytes_pinned to track it by lock_stat.

[Results]
unpatched:
lock_stat version 0.4
-----------------------------------------------------------------------
                              class name    con-bounces    contentions
waittime-min   waittime-max waittime-total   waittime-avg    acq-bounces
acquisitions   holdtime-min   holdtime-max holdtime-total   holdtime-avg

               total_bytes_pinned_percpu:            82             82
        0.21           0.61          29.46           0.36         298340
      635973           0.09          11.01      173476.25           0.27

patched:
lock_stat version 0.4
-----------------------------------------------------------------------
                              class name    con-bounces    contentions
waittime-min   waittime-max waittime-total   waittime-avg    acq-bounces
acquisitions   holdtime-min   holdtime-max holdtime-total   holdtime-avg

               total_bytes_pinned_percpu:             1              1
        0.62           0.62           0.62           0.62          13601
       31542           0.14           9.61       11016.90           0.35

[Analysis]
Since the spin lock only protects a single in-memory variable, the
contentions (number of lock acquisitions that had to wait) in both
unpatched and patched version are low. But when we see acquisitions and
acq-bounces, we get much lower counts in patched version. Here the most
important metric is acq-bounces. It means how many times the lock gets
transferred between different cpus, so the patch can really reduce
cacheline bouncing of spin lock (also the global counter of percpu_counter)
in a SMP system.

Fixes: b150a4f10d ("Btrfs: use a percpu to keep track of possibly pinned bytes")
Signed-off-by: Ethan Lien <ethanlien@synology.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:48 +02:00
Ethan Lien
d814a49198 btrfs: use correct compare function of dirty_metadata_bytes
We use customized, nodesize batch value to update dirty_metadata_bytes.
We should also use batch version of compare function or we will easily
goto fast path and get false result from percpu_counter_compare().

Fixes: e2d845211e ("Btrfs: use percpu counter for dirty metadata count")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Ethan Lien <ethanlien@synology.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:48 +02:00
Gu Jinxiang
36350e95a2 btrfs: return device pointer from btrfs_scan_one_device
Return device pointer (with the IS_ERR semantics) from
btrfs_scan_one_device so we don't have to return in through pointer.

And since btrfs_fs_devices can be obtained from btrfs_device, return that.

Signed-off-by: Gu Jinxiang <gujx@cn.fujitsu.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ fixed conflics after recent changes to btrfs_scan_one_device ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:48 +02:00
Gu Jinxiang
d64dcbd183 btrfs: make fs_devices a local variable in btrfs_parse_early_options
fs_devices is always passed to btrfs_scan_one_device which overrides it.
In the call stack below fs_devices is passed to btrfs_scan_one_device
from btrfs_mount_root.  In btrfs_mount_root the output fs_devices of
this call stack is not used.

btrfs_mount_root
  btrfs_parse_early_options
    btrfs_scan_one_device

So, it is not necessary to pass fs_devices from btrfs_mount_root, using
a local variable in btrfs_parse_early_options is enough.

Signed-off-by: Gu Jinxiang <gujx@cn.fujitsu.com>
Reviewed-by: Anand Jain <Anand.Jain@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:48 +02:00
David Sterba
81ffd56b57 btrfs: fix mount and ioctl device scan ioctl race
Technically this extends the critical section covered by uuid_mutex to:

- parse early mount options -- here we can call device scan on paths
  that can be passed as 'device=/dev/...'

- scan the device passed to mount

- open the devices related to the fs_devices -- this increases
  fs_devices::opened

The race can happen when mount calls one of the scans and there's
another one called eg. by mkfs or 'btrfs dev scan':

Mount                                  Scan
-----                                  ----
scan_one_device (dev1, fsid1)
                                       scan_one_device (dev2, fsid1)
				           add the device
					   free stale devices
					       fsid1 fs_devices::opened == 0
					           find fsid1:dev1
					           free fsid1:dev1
					           if it's the last one,
					            free fs_devices of fsid1
						    too

open_devices (dev1, fsid1)
   dev1 not found

When fixed, the uuid mutex will make sure that mount will increase
fs_devices::opened and this will not be touched by the racing scan
ioctl.

Reported-and-tested-by: syzbot+909a5177749d7990ffa4@syzkaller.appspotmail.com
Reported-and-tested-by: syzbot+ceb2606025ec1cc3479c@syzkaller.appspotmail.com
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:48 +02:00
David Sterba
399f7f4c42 btrfs: reorder initialization before the mount locks uuid_mutex
In preparation to take a big lock, move resource initialization before
the critical section. It's not obvious from the diff, the desired order
is:

- initialize mount security options
- allocate temporary fs_info
- allocate superblock buffers

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:48 +02:00
David Sterba
5139cff598 btrfs: lift uuid_mutex to callers of btrfs_parse_early_options
Prepartory work to fix race between mount and device scan.

btrfs_parse_early_options calls the device scan from mount and we'll
need to let mount completely manage the critical section.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:47 +02:00
David Sterba
f5194e34ca btrfs: lift uuid_mutex to callers of btrfs_open_devices
Prepartory work to fix race between mount and device scan.

The callers will have to manage the critical section, eg. mount wants to
scan and then call btrfs_open_devices without the ioctl scan walking in
and modifying the fs devices in the meantime.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:47 +02:00
David Sterba
899f9307c3 btrfs: lift uuid_mutex to callers of btrfs_scan_one_device
Prepartory work to fix race between mount and device scan.

The callers will have to manage the critical section, eg. mount wants to
scan and then call btrfs_open_devices without the ioctl scan walking in
and modifying the fs devices in the meantime.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:47 +02:00
Anand Jain
7bcb8164ad btrfs: use device_list_mutex when removing stale devices
btrfs_free_stale_devices() finds a stale (not opened) device matching
path in the fs_uuid list. We are already under uuid_mutex so when we
check for each fs_devices, hold the device_list_mutex too.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:47 +02:00
Anand Jain
fa6d2ae540 btrfs: rename local devices for fs_devices in btrfs_free_stale_devices(
Over the years we named %fs_devices and %devices to represent the
struct btrfs_fs_devices and the struct btrfs_device. So follow the same
scheme here too. No functional changes.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:47 +02:00
Anand Jain
9c6d173ea6 btrfs: extend locked section when adding a new device in device_list_add
Make sure the device_list_lock is held the whole time:

* when the device is being looked up
* new device is initialized and put to the list
* the list counters are updated (fs_devices::opened, fs_devices::total_devices)

Signed-off-by: Anand Jain <anand.jain@oracle.com>
[ update changelog ]
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:46 +02:00
Anand Jain
4306a97449 btrfs: do btrfs_free_stale_devices outside of device_list_add
btrfs_free_stale_devices() looks for device path reused for another
filesystem, and deletes the older fs_devices::device entry.

In preparation to handle locking in device_list_add, move
btrfs_free_stale_devices outside as these two functions serve a
different purpose.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:46 +02:00
Nikolay Borisov
959b1c0467 btrfs: close devices without offloading to a temporary list
Since commit 88c14590cd ("btrfs: use RCU in btrfs_show_devname for
device list traversal") btrfs_show_devname no longer takes
device_list_mutex. As such the deadlock that 0ccd05285e ("btrfs: fix a
possible umount deadlock") aimed to fix no longer exists, we can free
the devices immediatelly and remove the code that does the pending work.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
[ update changelog ]
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:46 +02:00
Qu Wenruo
621567a28c btrfs: Remove unused function btrfs_account_dev_extents_size
This function is not used since the alloc_start parameter has been
obsoleted in commit 0d0c71b317 ("btrfs: obsolete and remove
mount option alloc_start").

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:46 +02:00
Gu Jinxiang
93b9bcdf9f btrfs: remove unused parameter from btrfs_parse_subvol_options
Since parameter flags is no more used since commit d740760656 ("btrfs:
split parse_early_options() in two"), remove it.

Signed-off-by: Gu Jinxiang <gujx@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:46 +02:00
Anand Jain
b4993e64f7 btrfs: fix in-memory value of total_devices after seed device deletion
In case of deleting the seed device the %cur_devices (seed) and the
%fs_devices (parent) are different. Now, as the parent
fs_devices::total_devices also maintains the total number of devices
including the seed device, so decrement its in-memory value for the
successful seed delete. We are already updating its corresponding
on-disk btrfs_super_block::number_devices value.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:45 +02:00
Nikolay Borisov
340f1aa27f btrfs: qgroups: Move transaction management inside btrfs_quota_enable/disable
Commit 5d23515be6 ("btrfs: Move qgroup rescan on quota enable to
btrfs_quota_enable") not only resulted in an easier to follow code but
it also introduced a subtle bug. It changed the timing when the initial
transaction rescan was happening:

- before the commit: it would happen after transaction commit had occured
- after the commit: it might happen before the transaction was committed

This results in failure to correctly rescan the quota since there could
be data which is still not committed on disk.

This patch aims to fix this by moving the transaction creation/commit
inside btrfs_quota_enable, which allows to schedule the quota commit
after the transaction has been committed.

Fixes: 5d23515be6 ("btrfs: Move qgroup rescan on quota enable to btrfs_quota_enable")
Reported-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com>
Link: https://marc.info/?l=linux-btrfs&m=152999289017582
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:45 +02:00
David Sterba
c7b562c548 btrfs: raid56: catch errors from full_stripe_write
Add fall-back code to catch failure of full_stripe_write. Proper error
handling from inside run_plug would need more code restructuring as it's
called at arbitrary points by io scheduler.

Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:45 +02:00
David Sterba
176571a1f6 btrfs: raid56: merge rbio_is_full helpers
There's only one call site of the unlocked helper so it can be folded
into the caller.

Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:45 +02:00
David Sterba
a81b747d0f btrfs: raid56: use new helper for async_scrub_parity
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:45 +02:00
David Sterba
e66d8d5a41 btrfs: raid56: use new helper for async_read_rebuild
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:45 +02:00
David Sterba
cf6a4a7587 btrfs: raid56: use new helper for async_rmw_stripe
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:44 +02:00
David Sterba
ac63885907 btrfs: raid56: add new helper for starting async work
Add helper that schedules a given function to run on the rmw workqueue.
This will replace several standalone helpers.

Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:44 +02:00
David Sterba
ebcc326316 btrfs: open-code bio_set_op_attrs
The helper is trivial and marked as deprecated.

Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:44 +02:00
David Sterba
cc5e31a477 btrfs: switch types to int when counting eb pages
The loops iterating eb pages use unsigned long, that's an overkill as
we know that there are at most 16 pages (64k / 4k), and 4 by default
(with nodesize 16k).

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:44 +02:00
David Sterba
8791d43207 btrfs: use round_up wrapper in num_extent_pages
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:44 +02:00
David Sterba
65ad010488 btrfs: pass only eb to num_extent_pages
Almost all callers pass the start and len as 2 arguments but this is not
necessary, all the information is provided by the eb. By reordering the
calls to num_extent_pages, we don't need the local variables with
start/len.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:43 +02:00
David Sterba
d7f663fa3f btrfs: prune unused includes
Remove includes if none of the interfaces and exports is used in the
given source file.

Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:43 +02:00
David Sterba
69d2480456 btrfs: use copy_page for copying pages instead of memcpy
Use the helper that's possibly optimized for full page copies.

Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:43 +02:00
David Sterba
3ffbd68c48 btrfs: simplify pointer chasing of local fs_info variables
Functions that get btrfs inode can simply reach the fs_info by
dereferencing the root and this looks a bit more straightforward
compared to the btrfs_sb(...) indirection.

If the transaction handle is available and not NULL it's used instead.

Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:43 +02:00
David Sterba
3750851562 btrfs: simplify some assignments of inode numbers
There are several places when the btrfs inode is converted to the
generic inode, back to btrfs and then passed to btrfs_ino. We can remove
the extra back and forth conversions.

Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:43 +02:00
Zhihui Zhang
8f6c72a9e0 Btrfs: free space cache: make sure there is always room for generation number
io_ctl_set_generation() assumes that the generation number shares
the same page with inline CRCs. Let's make sure this is always true.

Signed-off-by: Zhihui Zhang <zzhsuny@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:42 +02:00
Anand Jain
694c51fb2e btrfs: drop unnecessary variable in btrfs_init_new_device
There is only usage of the declared devices variable, instead use its
value directly.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:42 +02:00
Anand Jain
5da54bc138 btrfs: use a temporary variable for fs_devices in btrfs_init_new_device
There are many instances of the %fs_info->fs_devices pointer
dereferences, use a temporary variable instead.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:42 +02:00
Qu Wenruo
389305b2aa btrfs: relocation: Only remove reloc rb_trees if reloc control has been initialized
Invalid reloc tree can cause kernel NULL pointer dereference when btrfs
does some cleanup of the reloc roots.

It turns out that fs_info::reloc_ctl can be NULL in
btrfs_recover_relocation() as we allocate relocation control after all
reloc roots have been verified.
So when we hit: note, we haven't called set_reloc_control() thus
fs_info::reloc_ctl is still NULL.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=199833
Reported-by: Xu Wen <wen.xu@gatech.edu>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Tested-by: Gu Jinxiang <gujx@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:42 +02:00
Qu Wenruo
ba480dd4db btrfs: tree-checker: Detect invalid and empty essential trees
A crafted image has empty root tree block, which will later cause NULL
pointer dereference.

The following trees should never be empty:
1) Tree root
   Must contain at least root items for extent tree, device tree and fs
   tree

2) Chunk tree
   Or we can't even bootstrap as it contains the mapping.

3) Fs tree
   At least inode item for top level inode (.).

4) Device tree
   Dev extents for chunks

5) Extent tree
   Must have corresponding extent for each chunk.

If any of them is empty, we are sure the fs is corrupted and no need to
mount it.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=199847
Reported-by: Xu Wen <wen.xu@gatech.edu>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Tested-by: Gu Jinxiang <gujx@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:42 +02:00
Qu Wenruo
fce466eab7 btrfs: tree-checker: Verify block_group_item
A crafted image with invalid block group items could make free space cache
code to cause panic.

We could detect such invalid block group item by checking:
1) Item size
   Known fixed value.
2) Block group size (key.offset)
   We have an upper limit on block group item (10G)
3) Chunk objectid
   Known fixed value.
4) Type
   Only 4 valid type values, DATA, METADATA, SYSTEM and DATA|METADATA.
   No more than 1 bit set for profile type.
5) Used space
   No more than the block group size.

This should allow btrfs to detect and refuse to mount the crafted image.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=199849
Reported-by: Xu Wen <wen.xu@gatech.edu>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Gu Jinxiang <gujx@cn.fujitsu.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Tested-by: Gu Jinxiang <gujx@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:41 +02:00
David Sterba
6d8ff4e458 btrfs: annotate unlikely branches after V0 extent type removal
The v0 extent type checks are the right case for the unlikely
annotations as we don't expect to ever see them, so let's give the
compiler some hint.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:41 +02:00
Nikolay Borisov
ba3c2b196b btrfs: Add graceful handling of V0 extents
Following the removal of the v0 handling code let's be courteous and
print an error message when such extents are handled. In the cases
where we have a transaction just abort it, otherwise just call
btrfs_handle_fs_error. Both cases result in the FS being re-mounted RO.

In case the error handling would be too intrusive, leave the BUG_ON in
place, like extent_data_ref_count, other proper handling would catch
that earlier.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:41 +02:00
Nikolay Borisov
a79865c680 btrfs: Remove V0 extent support
The v0 compat code was introduced in commit 5d4f98a28c
("Btrfs: Mixed back reference  (FORWARD ROLLING FORMAT CHANGE)") 9
years ago, which was merged in 2.6.31. This means that the code is
there to support filesystems which are _VERY_ old and if you are using
btrfs on such an old kernel, you have much bigger problems. This coupled
with the fact that no one is likely testing/maintining this code likely
means it has bugs lurking. All things considered I think 43 kernel
releases later it's high time this remnant of the past got removed.

This patch removes all code wrapped in #ifdefs but leaves the BUG_ONs in case
we have a v0 with no support intact as a sort of safety-net.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:41 +02:00
Chengguang Xu
4de426cd39 btrfs: remove unnecessary curly braces in btrfs_get_acl
It's only coding style fix not functinal change.  When if/else has only
one statement then the braces are not needed.

Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:41 +02:00
Chengguang Xu
dc7789ef87 btrfs: avoid error code override in btrfs_get_acl
It's not good to override the error code when failing from
btrfs_getxattr() in btrfs_get_acl() because it hides the real reason of
the failure.

Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:40 +02:00
Chengguang Xu
5ee552da50 btrfs: remove unnecessary -ERANGE check in btrfs_get_acl
There is no chance to get into -ERANGE error condition because we first
call btrfs_getxattr to get the length of the attribute, then we do a
subsequent call with the size from the first call.  Between the 2 calls
the size shouldn't change. So remove the unnecessary -ERANGE error
check.

Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:40 +02:00
Chengguang Xu
7e35eab958 btrfs: replace empty string with NULL when getting attribute length in btrfs_get_acl
In btrfs_get_acl() the first call of btr_getxattr() is for getting the
length of attribute, the value buffer is never used in this case. So
it's better to replace empty string with NULL.

Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:40 +02:00
Chengguang Xu
ab3629ed86 btrfs: return error instead of crash when detecting unexpected type in btrfs_get_acl
The caller of btrfs_get_acl() checks error condition so there is no
impact from this change. In practice there is no chance to get into
default case of switch statement because VFS has already checked the
type.

Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:40 +02:00
Su Yue
af431dcb24 btrfs: return EUCLEAN if extent_inline_ref type is invalid
If type of extent_inline_ref found is not expected, filesystem may have
been corrupted, should return EUCLEAN instead of EINVAL.

Signed-off-by: Su Yue <suy.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:40 +02:00
Goldwyn Rodrigues
e4af400a9c btrfs: Use iocb to derive pos instead of passing a separate parameter
struct kiocb carries the ki_pos, so there is no need to pass it as
a separate function parameter.

generic_file_direct_write() increments ki_pos, so we now assign pos
after the function.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com>
[ rename to btrfs_buffered_write ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:40 +02:00
Su Yue
893bf4b115 btrfs: print more details when checking tree block finds a problem
For easier debugging, print eb->start if level is invalid.  Also make
clear if bytenr found is not expected.

Signed-off-by: Su Yue <suy.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:39 +02:00
Nikolay Borisov
7b4284de93 btrfs: Streamline memory allocation failure handling in btrfs_add_delayed_tree_ref
Currently the function uses 2 goto labels to properly handle allocation
failures. This could be simplified by simply re-arranging the code so
that allocations are the in the beginning of the function. This allows
to use simple return statements. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Su Yue <suy.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:39 +02:00
Qu Wenruo
4379444654 btrfs: Don't remove block group that still has pinned down bytes
[BUG]
Under certain KVM load and LTP tests, it is possible to hit the
following calltrace if quota is enabled:

BTRFS critical (device vda2): unable to find logical 8820195328 length 4096
BTRFS critical (device vda2): unable to find logical 8820195328 length 4096

WARNING: CPU: 0 PID: 49 at ../block/blk-core.c:172 blk_status_to_errno+0x1a/0x30
CPU: 0 PID: 49 Comm: kworker/u2:1 Not tainted 4.12.14-15-default #1 SLE15 (unreleased)
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/01/2014
Workqueue: btrfs-endio-write btrfs_endio_write_helper [btrfs]
task: ffff9f827b340bc0 task.stack: ffffb4f8c0304000
RIP: 0010:blk_status_to_errno+0x1a/0x30
Call Trace:
 submit_extent_page+0x191/0x270 [btrfs]
 ? btrfs_create_repair_bio+0x130/0x130 [btrfs]
 __do_readpage+0x2d2/0x810 [btrfs]
 ? btrfs_create_repair_bio+0x130/0x130 [btrfs]
 ? run_one_async_done+0xc0/0xc0 [btrfs]
 __extent_read_full_page+0xe7/0x100 [btrfs]
 ? run_one_async_done+0xc0/0xc0 [btrfs]
 read_extent_buffer_pages+0x1ab/0x2d0 [btrfs]
 ? run_one_async_done+0xc0/0xc0 [btrfs]
 btree_read_extent_buffer_pages+0x94/0xf0 [btrfs]
 read_tree_block+0x31/0x60 [btrfs]
 read_block_for_search.isra.35+0xf0/0x2e0 [btrfs]
 btrfs_search_slot+0x46b/0xa00 [btrfs]
 ? kmem_cache_alloc+0x1a8/0x510
 ? btrfs_get_token_32+0x5b/0x120 [btrfs]
 find_parent_nodes+0x11d/0xeb0 [btrfs]
 ? leaf_space_used+0xb8/0xd0 [btrfs]
 ? btrfs_leaf_free_space+0x49/0x90 [btrfs]
 ? btrfs_find_all_roots_safe+0x93/0x100 [btrfs]
 btrfs_find_all_roots_safe+0x93/0x100 [btrfs]
 btrfs_find_all_roots+0x45/0x60 [btrfs]
 btrfs_qgroup_trace_extent_post+0x20/0x40 [btrfs]
 btrfs_add_delayed_data_ref+0x1a3/0x1d0 [btrfs]
 btrfs_alloc_reserved_file_extent+0x38/0x40 [btrfs]
 insert_reserved_file_extent.constprop.71+0x289/0x2e0 [btrfs]
 btrfs_finish_ordered_io+0x2f4/0x7f0 [btrfs]
 ? pick_next_task_fair+0x2cd/0x530
 ? __switch_to+0x92/0x4b0
 btrfs_worker_helper+0x81/0x300 [btrfs]
 process_one_work+0x1da/0x3f0
 worker_thread+0x2b/0x3f0
 ? process_one_work+0x3f0/0x3f0
 kthread+0x11a/0x130
 ? kthread_create_on_node+0x40/0x40
 ret_from_fork+0x35/0x40

BTRFS critical (device vda2): unable to find logical 8820195328 length 16384
BTRFS: error (device vda2) in btrfs_finish_ordered_io:3023: errno=-5 IO failure
BTRFS info (device vda2): forced readonly
BTRFS error (device vda2): pending csums is 2887680

[CAUSE]
It's caused by race with block group auto removal:

- There is a meta block group X, which has only one tree block
  The tree block belongs to fs tree 257.
- In current transaction, some operation modified fs tree 257
  The tree block gets COWed, so the block group X is empty, and marked
  as unused, queued to be deleted.
- Some workload (like fsync) wakes up cleaner_kthread()
  Which will call btrfs_delete_unused_bgs() to remove unused block
  groups.
  So block group X along its chunk map get removed.
- Some delalloc work finished for fs tree 257
  Quota needs to get the original reference of the extent, which will
  read tree blocks of commit root of 257.
  Then since the chunk map gets removed, the above warning gets
  triggered.

[FIX]
Just let btrfs_delete_unused_bgs() skip block group which still has
pinned bytes.

However there is a minor side effect: currently we only queue empty
blocks at update_block_group(), and such empty block group with pinned
bytes won't go through update_block_group() again, such block group
won't be removed, until it gets new extent allocated and removed.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:39 +02:00
Geert Uytterhoeven
bc931c0ef8 btrfs: Refactor count handling in btrfs_unpin_free_ino
With gcc 4.1.2:

    fs/btrfs/inode-map.c: In function ‘btrfs_unpin_free_ino’:
    fs/btrfs/inode-map.c:241: warning: ‘count’ may be used uninitialized in this function

While this warning is a false-positive, it can easily be killed by
refactoring the code.

Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:39 +02:00
Arnd Bergmann
d3c6be6fda btrfs: use timespec64 for i_otime
While the regular inode timestamps all use timespec64 now, the i_otime
field is btrfs specific and still needs to be converted to correctly
represent times beyond 2038.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:39 +02:00
Arnd Bergmann
afd48513f0 btrfs: use monotonic time for transaction handling
The transaction times were changed to ktime_get_real_seconds to avoid
the y2038 overflow, but they still have a minor problem when they go
backwards or jump due to settimeofday() or leap seconds.

This changes the transaction handling to instead use ktime_get_seconds(),
which returns a CLOCK_MONOTONIC timestamp that has neither of those
problems.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:38 +02:00
Qu Wenruo
e41ca58974 btrfs: Get rid of the confusing btrfs_file_extent_inline_len
We used to call btrfs_file_extent_inline_len() to get the uncompressed
data size of an inlined extent.

However this function is hiding evil, for compressed extent, it has no
choice but to directly read out ram_bytes from btrfs_file_extent_item.
While for uncompressed extent, it uses item size to calculate the real
data size, and ignoring ram_bytes completely.

In fact, for corrupted ram_bytes, due to above behavior kernel
btrfs_print_leaf() can't even print correct ram_bytes to expose the bug.

Since we have the tree-checker to verify all EXTENT_DATA, such mismatch
can be detected pretty easily, thus we can trust ram_bytes without the
evil btrfs_file_extent_inline_len().

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:38 +02:00
Nikolay Borisov
bc877d285c btrfs: Deduplicate extent_buffer init code
When a new extent buffer is allocated there are a few mandatory fields
which need to be set in order for the buffer to be sane: level,
generation, bytenr, backref_rev, owner and FSID/UUID. Currently this
is open coded in the callers of btrfs_alloc_tree_block, meaning it's
fairly high in the abstraction hierarchy of operations. This patch
solves this by simply moving this init code in btrfs_init_new_buffer,
since this is the function which initializes a newly allocated
extent buffer. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:38 +02:00
Qu Wenruo
9912bbf644 btrfs: check-integrity: Fix NULL pointer dereference for degraded mount
Commit f8f84b2dfd ("btrfs: index check-integrity state hash by a dev_t")
changed how btrfsic indexes device state.

Now we need to access device->bdev->bd_dev, while for degraded mount
it's completely possible to have device->bdev as NULL, thus it will
trigger a NULL pointer dereference at mount time.

Fix it by checking if the device is degraded before accessing
device->bdev->bd_dev.

There are a lot of other places accessing device->bdev->bd_dev, however
the other call sites have either checked device->bdev, or the
device->bdev is passed from btrfsic_map_block(), so it won't cause harm.

Fixes: f8f84b2dfd ("btrfs: index check-integrity state hash by a dev_t")
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:38 +02:00
Nikolay Borisov
43a7e99db6 btrfs: Remove fs_info from btrfs_force_chunk_alloc
It can be referenced from the passed transaction handle.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:38 +02:00
Nikolay Borisov
c83488afc5 btrfs: Remove fs_info from btrfs_inc_block_group_ro
It can be referenced from the passed bg cache.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:37 +02:00
Nikolay Borisov
61da2abfca btrfs: Remove fs_info from btrfs_alloc_logged_file_extent
It can be referenced from trans since the function is always called
within a valid transaction.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:37 +02:00
Nikolay Borisov
87cc7a8a2a btrfs: Remove fs_info from remove_extent_backref
It can be referenced directly from the transaction handle since it's
always valid.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:37 +02:00
Nikolay Borisov
5fac7f9ee1 btrfs: Remove fs_info from run_one_delayed_ref
It can be referenced from the passed transaction handle, since it's
always valid.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:37 +02:00
Nikolay Borisov
a639cdeba3 btrfs: Remove fs_info from insert_inline_extent_backref
It can be referenced from the passed transaction handle, since it's
always valid.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:37 +02:00
Nikolay Borisov
3c4da6574e btrfs: Remove fs_info from exclude_super_stripes
It can be referenced from the passed block group.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:36 +02:00
Nikolay Borisov
9e715da860 btrfs: Remove fs_info from free_excluded_extents
It can be referenced from the passed block group.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:36 +02:00
Nikolay Borisov
451a2c1303 btrfs: Remove fs_info from check_system_chunk
It can be referenced from trans since the function is always called
within a transaction.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:36 +02:00
Nikolay Borisov
c216b2039a btrfs: Remove fs_info from btrfs_alloc_chunk
It can be referenced from trans since the function is always called
within a transaction.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:36 +02:00
Nikolay Borisov
01458828bb btrfs: Remove fs_info from do_chunk_alloc
This function is always called with a valid transaction handle from
where fs_info can be referenced. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:36 +02:00
Nikolay Borisov
f97806f2ee btrfs: Remove fs_info from run_delayed_tree_ref
It can always be referneced from the passed transaction handle since
it's always valid. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:35 +02:00
Nikolay Borisov
f9871eddd9 btrfs: Remove fs_info from cleanup_ref_head
fs_info can be refenreced from the transaction handle, since it's always
valid. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:35 +02:00
Nikolay Borisov
c4d56d4a16 btrfs: Remove unused fs_info from cleanup_extent_op
The argument is no longer used so remove it.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:35 +02:00
Nikolay Borisov
20b9a2d670 btrfs: Remove fs_info from run_delayed_extent_op
This function is always called with a valid transaction handle so
fs_info can be referenced from there. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:35 +02:00
Nikolay Borisov
2bf98ef35f btrfs: Remove fs_info from run_delayed_data_ref
This function is always called with a valid transaction from where
fs_info can be referenced. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:35 +02:00
Nikolay Borisov
2590d0f155 btrfs: Remove fs_info argument from __btrfs_inc_extent_ref
This function already takes a transaction which holds a reference to
the fs_info struct. Use that reference and remove the extra arg. No
functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:35 +02:00
Nikolay Borisov
ef89b8245b btrfs: Remove fs_info from alloc_reserved_file_extent
fs_info can be referenced from the transaction handle, which is always
valid. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:34 +02:00
Nikolay Borisov
e72cb9235d btrfs: Remove fs_info from __btrfs_free_extent
This function is always called with a valid transaction handle so we
can reference the fs_info from there. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:34 +02:00
Nikolay Borisov
5a98ec0141 btrfs: Remove fs_info from btrfs_remove_block_group
This function is always called with a valid transaction handle from
where we can reference fs_info. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:34 +02:00
Nikolay Borisov
e7e02096d9 btrfs: Remove fs_info from btrfs_make_block_group
This function is always called with a valid transaction handle from
where we can reference the fs_info. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:34 +02:00
Nikolay Borisov
88a979c615 btrfs: Remove fs_info from btrfs_add_delayed_data_ref
This function is always called with a valid transaction handle from
where fs_info can be referenced. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:34 +02:00
Nikolay Borisov
44e1c47d5c btrfs: Remove fs_info from btrfs_add_delayed_tree_ref
This function is always called with a valid transaction handle from
where fs_info can be referenced. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:33 +02:00
Nikolay Borisov
fbe4801b26 btrfs: Remove fs_info from lookup_extent_backref
This argument is unused. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:33 +02:00
Nikolay Borisov
bd1d53ef35 btrfs: Remove fs_info argument from lookup_extent_data_ref
This function is always called with a valid transaction handle from
where fs_info can be referenced. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:33 +02:00
Nikolay Borisov
b8582eeabb btrfs: Remove fs_info argument from lookup_tree_block_ref
This function is always called with a valid transaction handle from
where the fs_info can be referenced. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:33 +02:00
Nikolay Borisov
61a18f1c66 btrfs: Remove fs_info argument from update_inline_extent_backref
This function always uses the leaf's extent_buffer which already
contains a reference to the fs_info. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:33 +02:00
Nikolay Borisov
867cc1fbeb btrfs: Remove fs_info from lookup_inline_extent_backref
This function is always called with a valid transaction handle from
where the fs_info can be referenced. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:32 +02:00
Nikolay Borisov
b167fa9152 btrfs: Remove fs_info from fixup_low_keys
This argument is unused. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:32 +02:00
Nikolay Borisov
e9f6290d59 btrfs: Remove fs_info from remove_extent_data_ref
This function is always called with a valid transaction from where the
fs_info can be referenced. No functional change.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:32 +02:00
Nikolay Borisov
375934105c btrfs: Remove fs_info argument from insert_extent_backref
This function is always called with a valid transaction handle from
where fs_info can be referenced. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:32 +02:00
Nikolay Borisov
62b895af40 btrfs: Remove fs_info from insert_extent_data_ref
This function is always called with a valid transaction handle from
where fs_info can be referenced. So remove the redundant argument.
No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:32 +02:00
Nikolay Borisov
10728404c6 btrfs: Remove fs_info from insert_tree_block_ref
This function is always called with a valid transaction so there is no
need to duplicate the fs_info, we can reference it directly from the
trans handle. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:31 +02:00
Bart Van Assche
edf57cbf2b btrfs: Fix a C compliance issue
The C programming language does not allow to use preprocessor statements
inside macro arguments (pr_info() is defined as a macro). Hence rework
the pr_info() statement in btrfs_print_mod_info() such that it becomes
compliant. This patch allows tools like sparse to analyze the BTRFS
source code.

Fixes: 62e855771d ("btrfs: convert printk(KERN_* to use pr_* calls")
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:31 +02:00
Bart Van Assche
acd43e3cdf btrfs: Annotate fall-through when parsing mount option
This patch avoids that the compiler complains that a fall-through
annotation is missing when building with W=1.

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:31 +02:00
Bart Van Assche
bece2e8239 btrfs: Fix misleading indentation reported by smatch
This patch avoids that building the BTRFS source code with smatch
triggers complaints about inconsistent indenting.

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:31 +02:00
Nikolay Borisov
a9ecb653b0 btrfs: Streamline log_extent_csums a bit
Currently this function takes the root as an argument only to get the
log_root from it. Simplify this by directly passing the log root from
the caller. Also eliminate the fs_info local variable, since it's used
only once, so directly reference it from the transaction handle.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:31 +02:00
David Sterba
ca5788aba3 btrfs: remove remaing full_sync logic from btrfs_sync_file
The logic to check if the inode is already in the log can now be
simplified since we always wait for the ordered extents to complete
before deciding whether the inode needs to be logged. The big comment
about it can go away too.

CC: Filipe Manana <fdmanana@suse.com>
Suggested-by: Filipe Manana <fdmanana@suse.com>
[ code and changelog copied from mail discussion ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:31 +02:00
Josef Bacik
5636cf7d6d btrfs: remove the logged extents infrastructure
This is no longer used anywhere, remove all of it.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:30 +02:00
Josef Bacik
a2120a473a btrfs: clean up the left over logged_list usage
We no longer use this list we've passed around so remove it everywhere.
Also remove the extra checks for ordered/filemap errors as this is
handled higher up now that we're waiting on ordered_extents before
getting to the tree log code.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:30 +02:00
Josef Bacik
e7175a6927 btrfs: remove the wait ordered logic in the log_one_extent path
Since we are waiting on all ordered extents at the start of the fsync()
path we don't need to wait on any logged ordered extents, and we don't
need to look up the checksums on the ordered extents as they will
already be on disk prior to getting here.  Rework this so we're only
looking up and copying the on-disk checksums for the extent range we
care about.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:30 +02:00
Josef Bacik
b5e6c3e170 btrfs: always wait on ordered extents at fsync time
There's a priority inversion that exists currently with btrfs fsync.  In
some cases we will collect outstanding ordered extents onto a list and
only wait on them at the very last second.  However this "very last
second" falls inside of a transaction handle, so if we are in a lower
priority cgroup we can end up holding the transaction open for longer
than needed, so if a high priority cgroup is also trying to fsync()
it'll see latency.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:30 +02:00
Nikolay Borisov
16d1c062c7 btrfs: Fix comment in lookup_inline_extent_backref
The comment wrongfully states that the owner parameter is the level of
the parent block. In fact owner is the level of the current block and
by adding 1 to it we can eventually get to the parent/root.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:30 +02:00
Nikolay Borisov
bd3c685ed9 btrfs: Document __btrfs_inc_extent_ref
Here is a doc-only patch which tires to deobfuscate the terra-incognita
that arguments for delayed refs are.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:29 +02:00
Qu Wenruo
9bebe665c3 btrfs: scrub: Remove unused copy_nocow_pages and its callchain
Since commit ac0b4145d6 ("btrfs: scrub: Don't use inode pages
for device replace") the function is not used and we can remove all
functions down the call chain.

There was an optimization that reused inode pages to speed up device
replace, but broke when there was nodatasum and compressed page. The
potential performance gain is small so we don't loose much by removing
it and using scrub_pages same as the other pages.

Signed-off-by: Qu Wenruo <wqu@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:29 +02:00
Allen Pais
a944442c2b btrfs: replace get_seconds with new 64bit time API
The get_seconds() function is deprecated as it truncates the timestamp
to 32 bits. Change it to or ktime_get_real_seconds().

Signed-off-by: Allen Pais <allen.lkml@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:29 +02:00
Christoph Hellwig
e8693bcfa0 aio: allow direct aio poll comletions for keyed wakeups
If we get a keyed wakeup for a aio poll waitqueue and wake can acquire the
ctx_lock without spinning we can just complete the iocb straight from the
wakeup callback to avoid a context switch.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Avi Kivity <avi@scylladb.com>
2018-08-06 10:24:39 +02:00
Christoph Hellwig
bfe4037e72 aio: implement IOCB_CMD_POLL
Simple one-shot poll through the io_submit() interface.  To poll for
a file descriptor the application should submit an iocb of type
IOCB_CMD_POLL.  It will poll the fd for the events specified in the
the first 32 bits of the aio_buf field of the iocb.

Unlike poll or epoll without EPOLLONESHOT this interface always works
in one shot mode, that is once the iocb is completed, it will have to be
resubmitted.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Avi Kivity <avi@scylladb.com>
2018-08-06 10:24:33 +02:00
Christoph Hellwig
9018ccc453 aio: add a iocb refcount
This is needed to prevent races caused by the way the ->poll API works.
To avoid introducing overhead for other users of the iocbs we initialize
it to zero and only do refcount operations if it is non-zero in the
completion path.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Avi Kivity <avi@scylladb.com>
2018-08-06 10:24:28 +02:00
Christoph Hellwig
7dda712818 timerfd: add support for keyed wakeups
This prepares timerfd for use with aio poll.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Avi Kivity <avi@scylladb.com>
2018-08-06 10:24:08 +02:00
Jens Axboe
05b9ba4b55 Linux 4.18-rc6
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAltU8z0eHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiG5X8H/2fJr7m3k242+t76
 sitwvx1eoPqTgryW59dRKm9IuXAGA+AjauvHzaz1QxomeQa50JghGWefD0eiJfkA
 1AphQ/24EOiAbbVk084dAI/C2p122dE4D5Fy7CrfLnuouyrbFaZI5STbnrRct7sR
 9deeYW0GDHO1Uenp4WDCj0baaqJqaevZ+7GG09DnWpya2nQtSkGBjqn6GpYmrfOU
 mqFuxAX8mEOW6cwK16y/vYtnVjuuMAiZ63/OJ8AQ6d6ArGLwAsdn7f8Fn4I4tEr2
 L0d3CRLUyegms4++Dmlu05k64buQu46WlPhjCZc5/Ts4kjrNxBuHejj2/jeSnUSt
 vJJlibI=
 =42a5
 -----END PGP SIGNATURE-----

Merge tag 'v4.18-rc6' into for-4.19/block2

Pull in 4.18-rc6 to get the NVMe core AEN change to avoid a
merge conflict down the line.

Signed-of-by: Jens Axboe <axboe@kernel.dk>
2018-08-05 19:32:09 -06:00
David S. Miller
c1c8626fce Merge ra.kernel.org:/pub/scm/linux/kernel/git/davem/net
Lots of overlapping changes, mostly trivial in nature.

The mlxsw conflict was resolving using the example
resolution at:

https://github.com/jpirko/linux_mlxsw/blob/combined_queue/drivers/net/ethernet/mellanox/mlxsw/core_acl_flex_actions.c

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-05 13:04:31 -07:00
Gustavo A. R. Silva
7964410fcf fs: dcache: Use true and false for boolean values
Return statements in functions returning bool should use true or false
instead of an integer value.

This issue was detected with the help of Coccinelle.

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-08-05 15:52:44 -04:00
Al Viro
808aa6c5e3 Merge branch 'work.hpfs' into work.lookup 2018-08-05 15:51:10 -04:00
Al Viro
1401a0fc2d afs_try_auto_mntpt(): return NULL instead of ERR_PTR(-ENOENT)
simpler logics in callers that way

Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-08-05 15:50:59 -04:00
Al Viro
34b2a88fb4 afs_lookup(): switch to d_splice_alias()
->lookup() methods can (and should) use d_splice_alias() instead of
d_add().  Even if they are not going to be hit by open_by_handle(),
code does get copied around; besides, d_splice_alias() has better
calling conventions for use in ->lookup(), so the code gets simpler.

Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-08-05 15:44:14 -04:00
Al Viro
855371bd01 afs: switch dynroot lookups to d_splice_alias()
->lookup() methods can (and should) use d_splice_alias() instead of
d_add().  Even if they are not going to be hit by open_by_handle(),
code does get copied around...

Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-08-05 15:41:16 -04:00
Linus Torvalds
60f5a21736 - Fix JFS usercopy whitelist (it needed to cover neighboring field too) for
"overflow" inline inode data.
 -----BEGIN PGP SIGNATURE-----
 Comment: Kees Cook <kees@outflux.net>
 
 iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAltlv70WHGtlZXNjb29r
 QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJj3eD/9+goWbs9U62tlO2cIqT5lgTFX9
 3vvVpqNJ3atw/fU6SZoa0Z2nRa9TIpZEfEljgARGhCyd2p2MplLJalWo6bq/gUUA
 5aWQbVxhyVXUp8kh8m3OnZsAZz658Y5geLMk8vakXdyQ//PF43wO0cZyOFYdG3ec
 sYuWA318UKaIsxqB9tT/K8YBRjjBgJ8wjgtpoSAr+FUhgg9Qqp3NL4fjr6SOEQrv
 XWXLVLLhyUo8kQJ29E/VyzIfLysgiA67O4ClW+DEyD6rr/ZK9XOeG6yv3vwLGSHI
 06/4BXMJce23iGIYf57Jz7b5tfAaZntdzNqUiTW6up0TuvG09auwjv/hKkwfjQv2
 fNf0TVnYOCN4ZWbCm4FTEU5q31u+pZDHiRxOTJG9EbuBVEsFiV5X9B9VTLoA+dWK
 SqWmE2X9YdONt9q6K8TGpxxrdQnQXGKOOtEY23KoF04XGYQtutIfj7cdFCDDX/eC
 AjcOIfV3u8dHWjc/wKIYS5XRUzbkdpEeNOaiCQ3RN79JN2UrA0/w7lAxZTICmYCs
 HtMtaFTER5weKQGzTzg0SP3M95qKPdWhlIq5lspdhGedonZAKBNfGahMltTH2UoY
 vIb1qGT8uz8tQSzGsu0uvrR92ZIJZ0qUVjB/nhE9HxTz26xqtDDlLXpdIXTGrnU4
 hM+Omud//MNgNbsmrg==
 =pwq8
 -----END PGP SIGNATURE-----

Merge tag 'usercopy-fix-v4.18-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Pull usercopy whitelisting fix from Kees Cook:
 "Bart Massey discovered that the usercopy whitelist for JFS was
  incomplete: the inline inode data may intentionally "overflow" into
  the neighboring "extended area", so the size of the whitelist needed
  to be raised to include the neighboring field"

* tag 'usercopy-fix-v4.18-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  jfs: Fix usercopy whitelist for inline inode data
2018-08-04 18:34:55 -07:00
Linus Torvalds
f639bef55d Changes since last update:
- Fix incorrect shifting in the iomap bmap functions.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAltj7lcACgkQ+H93GTRK
 tOutzw//U9yOrUhuOTFrkmA2x+CB1ArncFZysUTXDFUsLRrMp0dwNaRV0WafyypO
 tCr4nvqgwg2VTFf21zWblOSAajstgJzH+x3FdxTsU5Gktd/BH9gCHSDV0lORnFhp
 jbqNOsBRPJjNIBzaIIAnyDGMuOF/MeWpGsbGF4eDcsgj58lr098rir7grEVgCkKM
 +eCMtqJwUXUZ2bo/grBWvZXnLms+8LMPYeRQMJSnw59DVunwxRdSbWumJwkPt9DD
 mz3Upa/qFJCZw2n9kluU1b/tnrpxkNWYuSjzp9iu0cMdo52HF+yDNHriUPCHa4PB
 3KfyGrhOvg+iC2pUAJ5mI3Dpv+NNAk7j/+4mOVtlUYIf0mi5gszIjNHbLiONKVH5
 5x8G59wM/ae6a1WcHe7tJRma7r984G+JLTIxAQnWaWhFq5AsYMOplk4oq6X7Lmir
 wAR7laDN0Xgl3WCFj6SXaKcnuzZFsE4A3SnZILtlgO/WxAujUyFEnICx1cO4Q9OY
 64txWii6ora9kdBtalolxjfGHwyScbhx6FiBQdKIznxGgBQR89X8hzxghdp7KTIx
 kxkC1hAM3KXXtokArjad2hgd8QG23jUshyBwKpfHnEwL75GiZQ3qtPdDm/oJlVWV
 ItnRSt+tGIT/fsT5szNPmLKgtIW4kQHflVuih0KR4IsGPMmZ8dU=
 =3omy
 -----END PGP SIGNATURE-----

Merge tag 'xfs-4.18-fixes-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull xfs bugfix from Darrick Wong:
 "One more patch for 4.18 to fix a coding error in the iomap_bmap()
  function introduced in -rc1: fix incorrect shifting"

* tag 'xfs-4.18-fixes-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  fs: fix iomap_bmap position calculation
2018-08-04 18:30:58 -07:00
zhong jiang
863c37fcb1 ext4: remove unneeded variable "err" in ext4_mb_release_inode_pa()
The err is not used after initalization. So just remove the variable.

Signed-off-by: zhong jiang <zhongjiang@huawei.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2018-08-04 17:34:07 -04:00
Kees Cook
961b33c244 jfs: Fix usercopy whitelist for inline inode data
Bart Massey reported what turned out to be a usercopy whitelist false
positive in JFS when symlink contents exceeded 128 bytes. The inline
inode data (i_inline) is actually designed to overflow into the "extended
area" following it (i_inline_ea) when needed. So the whitelist needed to
be expanded to include both i_inline and i_inline_ea (the whole size
of which is calculated internally using IDATASIZE, 256, instead of
sizeof(i_inline), 128).

$ cd /mnt/jfs
$ touch $(perl -e 'print "B" x 250')
$ ln -s B* b
$ ls -l >/dev/null

[  249.436410] Bad or missing usercopy whitelist? Kernel memory exposure attempt detected from SLUB object 'jfs_ip' (offset 616, size 250)!

Reported-by: Bart Massey <bart.massey@gmail.com>
Fixes: 8d2704d382 ("jfs: Define usercopy region in jfs_ip slab cache")
Cc: Dave Kleikamp <shaggy@kernel.org>
Cc: jfs-discussion@lists.sourceforge.net
Cc: stable@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
2018-08-04 07:53:46 -07:00
Geliang Tang
1021bcf44d pstore: add zstd compression support
This patch added the 6th compression algorithm support for pstore: zstd.

Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
2018-08-03 18:12:18 -07:00
Al Viro
c7b15a8657 jfs: don't bother with make_bad_inode() in ialloc()
We hit that when inumber allocation has failed.  In that case
the in-core inode is not hashed and since its ->i_nlink is 1
the only place where jfs checks is_bad_inode() won't be reached.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-08-03 16:03:33 -04:00
Al Viro
d8e78da868 adfs: don't put inodes into icache
We never look them up in there; inode_fake_hash() will make them appear
hashed for mark_inode_dirty() purposes.  And don't leave them around
until memory pressure kicks them out - we never look them up again.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-08-03 16:03:33 -04:00
Al Viro
5bef915104 new helper: inode_fake_hash()
open-coded in a quite a few places...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-08-03 16:03:32 -04:00
Miklos Szeredi
e950564b97 vfs: don't evict uninitialized inode
iput() ends up calling ->evict() on new inode, which is not yet initialized
by owning fs.  So use destroy_inode() instead.

Add to sb->s_inodes list only if inode is not in I_CREATING state (meaning
that it wasn't allocated with new_inode(), which already does the
insertion).

Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Fixes: 80ea09a002 ("vfs: factor out inode_insert5()")
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-08-03 16:03:32 -04:00
Al Viro
a6cbedfa87 jfs: switch to discard_new_inode()
we don't want open-by-handle to pick an in-core inode that
has failed setup halfway through.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-08-03 16:03:31 -04:00
Al Viro
2e5afe54e0 ext2: make sure that partially set up inodes won't be returned by ext2_iget()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-08-03 16:03:31 -04:00
Al Viro
5c1a68a358 udf: switch to discard_new_inode()
we don't want open-by-handle to pick an in-core inode that
has failed setup halfway through.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-08-03 16:03:30 -04:00
Al Viro
dd54992776 ufs: switch to discard_new_inode()
we don't want open-by-handle to pick an in-core inode that
has failed setup halfway through.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-08-03 16:03:30 -04:00
Al Viro
32955c5422 btrfs: switch to discard_new_inode()
Make sure that no partially set up inodes can be returned by
open-by-handle.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-08-03 16:03:29 -04:00
Al Viro
c2b6d621c4 new primitive: discard_new_inode()
We don't want open-by-handle picking half-set-up in-core
struct inode from e.g. mkdir() having failed halfway through.
In other words, we don't want such inodes returned by iget_locked()
on their way to extinction.  However, we can't just have them
unhashed - otherwise open-by-handle immediately *after* that would've
ended up creating a new in-core inode over the on-disk one that
is in process of being freed right under us.

	Solution: new flag (I_CREATING) set by insert_inode_locked() and
removed by unlock_new_inode() and a new primitive (discard_new_inode())
to be used by such halfway-through-setup failure exits instead of
unlock_new_inode() / iput() combinations.  That primitive unlocks new
inode, but leaves I_CREATING in place.

	iget_locked() treats finding an I_CREATING inode as failure
(-ESTALE, once we sort out the error propagation).
	insert_inode_locked() treats the same as instant -EBUSY.
	ilookup() treats those as icache miss.

[Fix by Dan Carpenter <dan.carpenter@oracle.com> folded in]

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-08-03 15:55:30 -04:00
David Howells
eb9950eb31 rxrpc: Push iov_iter up from rxrpc_kernel_recv_data() to caller
Push iov_iter up from rxrpc_kernel_recv_data() to its caller to allow
non-contiguous iovs to be passed down, thereby permitting file reading to
be simplified in the AFS filesystem in a future patch.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-03 12:46:20 -07:00
Linus Torvalds
310810ae19 NFS client bugfixes for Linux 4.18
Highlights include:
 
 Bugfixes:
 - Fix a NFSv4 file locking regression
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJbZFG/AAoJEA4mA3inWBJcyo8P/0MN6eRpDU2VCZheM7DgsN2L
 a7W/Phc1PjHoaqksQB4Zb3uMYQQCWAyvE/VB5nRJSF1ZQbKvv7kgkyqX+PxEG8LI
 Pp3igaVXzqkRj3eA33G1HXZCrymjf7sTb4CEdMgSUBe3wLXjrFPP4Og0RJ8YGhCG
 QBG0ZENwcseUk5bmbSpp9Ac60URy1si1pD1nB3z+zSQT2ViA8QHgzg3Hpwgm+0R7
 WG/QzIFoGoJ8J9sDX/tXdkwUT2Z3yusxXcwK7Be5dtUcu1codf+EaPxRNG55myRW
 J2fY+KIocgQK8lo3w9ok1sGTyN+YkS8eIQqTeZzg0Gty/LX7bwH/3ScCeQtbu9RH
 nAR2OJQkc/wJ8sJojmUmDnBgskvgWzdfxfxRGQwlnRMD0W3t0LUDCeIUZ/1OL69l
 4pgvFLaR5MRD/DS4sSftKcOpgH5KDTlfuUXA+PamELLAk93FWJEZTVI4hmUR02+h
 /0QoRE6FAraQ7IY9TuLd/Jj3wWmqvataL6JGuWSdmhd35PbRxxBun+5zCyj62BAM
 /h0SjrCMD+dhotcdiekHINNbNYRG6ukbswgP6zCtuq4icTCW8SMVNyI3mXUVQwF3
 hAc3FylKpdGkgSrK3unLnBSeBgGwnCy1PYtusx0MgJf/qhdPYsl0bwgZhcR1U01y
 WfyGrwoNhLEmxL6+zECQ
 =gMVX
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-4.18-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS client bugfix from Trond Myklebust:
 "Fix a NFSv4 file locking regression"

* tag 'nfs-for-4.18-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
  NFSv4: Fix _nfs4_do_setlk()
2018-08-03 10:42:01 -07:00
Huang Chong
a0e336ba3e xfs: fix a comment in xfs_log_reserve
Fix the comment in xfs_log_reserve to avoid confusing.

Signed-of-by: Huang Chong <huang.chong@zte.com.cn>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-08-03 08:17:54 -07:00
Darrick J. Wong
1f31c98d65 xfs: only validate summary counts on primary superblock
Skip the summary counter checks for secondary superblocks and inprogress
primary superblocks because mkfs has always written those out with
zeroed summary counters.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
2018-08-03 08:17:35 -07:00
Andreas Gruenbacher
21e2156f3c gfs2: Get rid of gfs2_ea_strlen
Function gfs2_ea_strlen is only called from ea_list_i, so inline it
there.  Remove the duplicate switch statement and the creative use of
memcpy to set a null byte.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Andrew Price <anprice@redhat.com>
Reviewed-by: Bob Peterson <rpeterso@redhat.com>
2018-08-03 13:20:02 +01:00
Thomas Bianchi
c2b6e1591b xfs: substitute spaces with tabs
Inside xfs_attr_shortform_list removes spaces at the beginnig of the line
and replaces with tabs.
Issue found by checkpatch.

ERROR: code indent should use tabs where possible

Signed-off-by: Thomas Bianchi <thomas.bianchi8@gmail.com>
Reviewed-by: Bill O'Donnell <billodo@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-08-02 23:05:14 -07:00
Brian Foster
9d9e623385 xfs: fold dfops into the transaction
struct xfs_defer_ops has now been reduced to a single list_head. The
external dfops mechanism is unused and thus everywhere a (permanent)
transaction is accessible the associated dfops structure is as well.

Remove the xfs_defer_ops structure and fold the list_head into the
transaction. Also remove the last remnant of external dfops in
xfs_trans_dup().

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-08-02 23:05:14 -07:00
Brian Foster
c03edc9e49 xfs: always defer agfl block frees
The AGFL fixup code conditionally defers block frees from the free
list based on whether the current transaction has an associated
xfs_defer_ops structure. Now that dfops is embedded in the
transaction and the internal dfops is used unconditionally, this
invariant is always true.

Remove the now dead logic to check for ->t_dfops in
xfs_alloc_fix_freelist() and unconditionally defer AGFL block frees.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-08-02 23:05:14 -07:00
Brian Foster
0f37d1780c xfs: pass transaction to xfs_defer_add()
The majority of remaining references to struct xfs_defer_ops in XFS
are associated with xfs_defer_add(). At this point, there are no
more external xfs_defer_ops users left. All instances of
xfs_defer_ops are embedded in the transaction, which means we can
safely pass the transaction down to the dfops add interface.

Update xfs_defer_add() to receive the transaction as a parameter.
Various subsystems implement wrappers to allocate and construct the
context specific data structures for the associated deferred
operation type. Update these to also carry the transaction down as
needed and clean up unused dfops parameters along the way.

This removes most of the remaining references to struct
xfs_defer_ops throughout the code and facilitates removal of the
structure.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
[darrick: fix unused variable warnings with ftrace disabled]
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-08-02 23:05:14 -07:00
Brian Foster
1ae093cbea xfs: replace xfs_defer_ops ->dop_pending with on-stack list
The xfs_defer_ops ->dop_pending list is used to track active
deferred operations once intents are logged. These items must be
aborted in the event of an error. The list is populated as intents
are logged and items are removed as they complete (or are aborted).

Now that xfs_defer_finish() cancels on error, there is no need to
ever access ->dop_pending outside of xfs_defer_finish(). The list is
only ever populated after xfs_defer_finish() begins and is either
completed or cancelled before it returns.

Remove ->dop_pending from xfs_defer_ops and replace it with a local
list in the xfs_defer_finish() path. Pass the local list to the
various helpers now that it is not accessible via dfops. Note that
we have to check for NULL in the abort case as the final tx roll
occurs outside of the scope of the new local list (once the dfops
has completed and thus drained the list).

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-08-02 23:05:14 -07:00
Brian Foster
9b1f4e9831 xfs: cancel dfops on xfs_defer_finish() error
The current semantics of xfs_defer_finish() require the caller to
call xfs_defer_cancel() on error. This is slightly inconsistent with
transaction commit error handling where a failed commit cleans up
the transaction before returning.

More significantly, the only requirement for exposure of
->dop_pending outside of xfs_defer_finish() is so that
xfs_defer_cancel() can drain it on error. Since the only recourse of
xfs_defer_finish() errors is cancellation, mirror the transaction
logic and cancel remaining dfops before returning from
xfs_defer_finish() with an error.

Beside simplifying xfs_defer_finish() semantics, this ensures that
xfs_defer_finish() always returns with an empty ->dop_pending and
thus facilitates removal of the list from xfs_defer_ops.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-08-02 23:05:14 -07:00
Brian Foster
60f31a609e xfs: clean out superfluous dfops dop params/vars
The dfops code still passes around the xfs_defer_ops pointer
superfluously in a few places. Clean this up wherever the
transaction will suffice.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-08-02 23:05:14 -07:00
Brian Foster
7dbddbaccd xfs: drop dop param from xfs_defer_op_type ->finish_item() callback
The dfops infrastructure ->finish_item() callback passes the
transaction and dfops as separate parameters. Since dfops is always
part of a transaction, the latter parameter is no longer necessary.
Remove it from the various callbacks.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-08-02 23:05:14 -07:00
Brian Foster
a8198666fb xfs: automatic dfops inode relogging
Inodes that are held across deferred operations are explicitly
joined to the dfops structure to ensure appropriate relogging.
While inodes are currently joined explicitly, we can detect the
conditions that require relogging at dfops finish time by inspecting
the transaction item list for inodes with ili_lock_flags == 0.

Replace the xfs_defer_ijoin() infrastructure with such detection and
automatic relogging of held inodes. This eliminates the need for the
per-dfops inode list, replaced by an on-stack variant in
xfs_defer_trans_roll().

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-08-02 23:05:14 -07:00
Brian Foster
82ff27bc52 xfs: automatic dfops buffer relogging
Buffers that are held across deferred operations are explicitly
joined to the dfops structure to ensure appropriate relogging.
While buffers are currently joined explicitly, we can detect the
conditions that require relogging at dfops finish time by inspecting
the transaction item list for held buffers.

Replace the xfs_defer_bjoin() infrastructure with such detection and
automatic relogging of held buffers. This eliminates the need for
the per-dfops buffer list, replaced by an on-stack variant in
xfs_defer_trans_roll().

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-08-02 23:05:13 -07:00
Brian Foster
488c919a5b xfs: add missing defer ijoins for held inodes
Log items that require relogging during deferred operations
processing are explicitly joined to the associated dfops via the
xfs_defer_*join() helpers. These calls imply that the associated
object is "held" by the transaction such that when rolled, the item
can be immediately joined to a follow up transaction. For buffers,
this means the buffer remains locked and held after each roll. For
inodes, this means that the inode remains locked.

Failure to join a held item to the dfops structure means the
associated object pins the tail of the log while dfops processing
completes, because the item never relogs and is not unlocked or
released until deferred processing completes.

Currently, all buffers that are held in transactions (XFS_BLI_HOLD)
with deferred operations are explicitly joined to the dfops. This is
not the case for inodes, however, as various contexts defer
operations to transactions with held inodes without explicit joins
to the associated dfops (and thus not relogging).

While this is not a catastrophic problem, it is not ideal. Given
that we want to eventually relog such items automatically during
dfops processing, start by explicitly adding these missing
xfs_defer_ijoin() calls. A call is added everywhere an inode is
joined to a transaction without transferring lock ownership and
said transaction runs deferred operations.

All xfs_defer_ijoin() calls will eventually be replaced by automatic
dfops inode relogging. This patch essentially implements the
behavior change that would otherwise occur due to automatic inode
dfops relogging.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-08-02 23:05:13 -07:00
Brian Foster
1214f1cf66 xfs: replace dop_low with transaction flag
The dop_low field enables the low free space allocation mode when a
previous allocation has detected difficulty allocating blocks. It
has historically been part of the xfs_defer_ops structure, which
means if enabled, it remains enabled across a set of transactions
until the deferred operations have completed and the dfops is reset.

Now that the dfops is embedded in the transaction, we can save a bit
more space by using a transaction flag rather than a standalone
boolean. Drop the ->dop_low field and replace it with a transaction
flag that is set at the same points, carried across rolling
transactions and cleared on completion of deferred operations. This
essentially emulates the behavior of ->dop_low and so should not
change behavior.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-08-02 23:05:13 -07:00
Brian Foster
ce356d6477 xfs: pass transaction to dfops reset/move helpers
All callers pass ->t_dfops of the associated transactions. Refactor
the helpers to receive the transactions and facilitate further
cleanups between xfs_defer_ops and xfs_trans.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-08-02 23:05:13 -07:00
Brian Foster
7279aa13b8 xfs: remove unused __xfs_defer_cancel() internal helper
With no more external dfops users, there is no need for an
xfs_defer_ops cancel wrapper.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-08-02 23:05:13 -07:00
Brian Foster
fbfa977d25 xfs: use transaction for intent recovery instead of raw dfops
Log intent recovery is the last user of an external (on-stack)
dfops. The pattern exists because the dfops is used to collect
additional deferred operations queued during the whole recovery
sequence. The dfops is finished with a new transaction after intent
recovery completes.

We already have a mechanism to create an empty, container-like
transaction to support the scrub infrastructure. We can reuse that
mechanism here to drop the final user of external dfops. This
facilitates folding dfops state (i.e., dop_low) into the
transaction, the elimination of now unused external dfops support
and also eliminates the only caller of __xfs_defer_cancel().

Replace the on-stack dfops with an empty transaction and pass it
around to the various helpers that queue and finish deferred
operations during intent recovery.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-08-02 23:05:13 -07:00
Brian Foster
98719051e7 xfs: refactor internal dfops initialization
The current transaction allocation code conditionally initializes
the ->t_dfops indirection pointer. Transaction commit/cancel check
the validity of the pointer to determine whether to finish/cancel
the internal dfops.

This disallows the ability to use the internal dfops list as a
temporary container (via xfs_trans_alloc_empty()). Refactor
transaction allocation to always initialize ->t_dfops and check
permanent reservation state on transaction commit/cancel.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-08-02 23:05:13 -07:00
Mike Rapoport
31e810aa10 userfaultfd: remove uffd flags from vma->vm_flags if UFFD_EVENT_FORK fails
The fix in commit 0cbb4b4f4c ("userfaultfd: clear the
vma->vm_userfaultfd_ctx if UFFD_EVENT_FORK fails") cleared the
vma->vm_userfaultfd_ctx but kept userfaultfd flags in vma->vm_flags
that were copied from the parent process VMA.

As the result, there is an inconsistency between the values of
vma->vm_userfaultfd_ctx.ctx and vma->vm_flags which triggers BUG_ON
in userfaultfd_release().

Clearing the uffd flags from vma->vm_flags in case of UFFD_EVENT_FORK
failure resolves the issue.

Link: http://lkml.kernel.org/r/1532931975-25473-1-git-send-email-rppt@linux.vnet.ibm.com
Fixes: 0cbb4b4f4c ("userfaultfd: clear the vma->vm_userfaultfd_ctx if UFFD_EVENT_FORK fails")
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Reported-by: syzbot+121be635a7a35ddb7dcb@syzkaller.appspotmail.com
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Eric Biggers <ebiggers3@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-02 16:03:40 -07:00
Eric Sandeen
79b3dbe4ad fs: fix iomap_bmap position calculation
The position calculation in iomap_bmap() shifts bno the wrong way,
so we don't progress properly and end up re-mapping block zero
over and over, yielding an unchanging physical block range as the
logical block advances:

# filefrag -Be file
 ext:   logical_offset:     physical_offset: length:   expected: flags:
   0:      0..       0:      21..        21:      1:             merged
   1:      1..       1:      21..        21:      1:         22: merged
Discontinuity: Block 1 is at 21 (was 22)
   2:      2..       2:      21..        21:      1:         22: merged
Discontinuity: Block 2 is at 21 (was 22)
   3:      3..       3:      21..        21:      1:         22: merged

This breaks the FIBMAP interface for anyone using it (XFS), which
in turn breaks LILO, zipl, etc.

Bug-actually-spotted-by: Darrick J. Wong <darrick.wong@oracle.com>
Fixes: 89eb1906a9 ("iomap: add an iomap-based bmap implementation")
Cc: stable@vger.kernel.org
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-08-02 13:09:27 -07:00
Bart Van Assche
2afc9166f7 scsi: sysfs: Introduce sysfs_{un,}break_active_protection()
Introduce these two functions and export them such that the next patch
can add calls to these functions from the SCSI core.

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2018-08-02 15:53:15 -04:00
Chengguang Xu
8687a3e2c7 ceph: add additional offset check in ceph_write_iter()
If the offset is larger or equal to both real file size and
max file size, then return -EFBIG.

Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-08-02 21:33:28 +02:00
Chengguang Xu
0671e9968d ceph: add additional range check in ceph_fallocate()
If the range is larger than both real file size and limit of
max file size, then return -EFBIG.

Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-08-02 21:33:28 +02:00
Chengguang Xu
719784ba70 ceph: add new field max_file_size in ceph_fs_client
In order to not bother to VFS and other specific filesystems,
we decided to do offset validation inside ceph kernel client,
so just simply set sb->s_maxbytes to MAX_LFS_FILESIZE so that
it can successfully pass VFS check. We add new field max_file_size
in ceph_fs_client to store real file size limit and doing proper
check based on it.

Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-08-02 21:33:27 +02:00
Ilya Dryomov
6daca13d2e libceph: add authorizer challenge
When a client authenticates with a service, an authorizer is sent with
a nonce to the service (ceph_x_authorize_[ab]) and the service responds
with a mutation of that nonce (ceph_x_authorize_reply).  This lets the
client verify the service is who it says it is but it doesn't protect
against a replay: someone can trivially capture the exchange and reuse
the same authorizer to authenticate themselves.

Allow the service to reject an initial authorizer with a random
challenge (ceph_x_authorize_challenge).  The client then has to respond
with an updated authorizer proving they are able to decrypt the
service's challenge and that the new authorizer was produced for this
specific connection instance.

The accepting side requires this challenge and response unconditionally
if the client side advertises they have CEPHX_V2 feature bit.

This addresses CVE-2018-1128.

Link: http://tracker.ceph.com/issues/24836
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>
2018-08-02 21:33:24 +02:00
Souptick Joarder
24499847e4 ceph: adding new return type vm_fault_t
Use new return type vm_fault_t for page_mkwrite
and fault handler.

Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-08-02 21:33:20 +02:00
Arnd Bergmann
0ed1e90a09 ceph: use timespec64 for r_stamp
The ceph_mds_request stamp still uses the deprecated timespec structure,
this converts it over as well.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-08-02 21:33:19 +02:00
Arnd Bergmann
fac02ddf91 libceph: use timespec64 for r_mtime
The request mtime field is used all over ceph, and is currently
represented as a 'timespec' structure in Linux. This changes it to
timespec64 to allow times beyond 2038, modifying all users at the
same time.

[ Remove now redundant ts variable in writepage_nounlock(). ]

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-08-02 21:33:14 +02:00
Arnd Bergmann
9bbeab41ce ceph: use timespec64 for inode timestamp
Since the vfs structures are all using timespec64, we can now
change the internal representation, using ceph_encode_timespec64 and
ceph_decode_timespec64.

In case of ceph_aux_inode however, we need to avoid doing a memcmp()
on uninitialized padding data, so the members of the i_mtime field get
copied individually into 64-bit integers.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-08-02 21:26:12 +02:00
Arnd Bergmann
63ecae7e43 ceph: stop using current_kernel_time()
ceph_mdsc_create_request() is one of the last callers of the
deprecated current_kernel_time() as well as timespec_trunc().

This changes it to use the timespec64 based interfaces instead,
though we still need to convert the result until we are ready to
change over req->r_stamp.

The output of the two functions, ktime_get_coarse_real_ts64() and
current_kernel_time() is the same coarse-granular timestamp,
the only difference here is that ktime_get_coarse_real_ts64()
doesn't overflow in 2038.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-08-02 21:26:12 +02:00
Chengguang Xu
67fcd15140 ceph: add d_drop for some error cases in ceph_symlink()
When file num exceeds quota limit, should call d_drop to drop
dentry from cache as well.

Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-08-02 21:26:12 +02:00
Chengguang Xu
0459871c49 ceph: add d_drop for some error cases in ceph_mknod()
When file num exceeds quota limit or fails from ceph_per_init_acls()
should call d_drop to drop dentry from cache as well.

Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-08-02 21:26:12 +02:00
Chengguang Xu
61ad36d47d ceph: return errors from posix_acl_equiv_mode() correctly
In order to return correct error code should replace variable ret
using err in error case.

Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-08-02 21:26:12 +02:00
Yan, Zheng
dfeb84d4ad ceph: fix incorrect use of strncpy
GCC8 prints following warning:

 fs/ceph/mds_client.c:3683:2: warning: ‘strncpy’ output may be truncated
 copying 64 bytes from a string of length 64 [-Wstringop-truncation]

[ Change to strscpy() while at it. ]

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-08-02 21:26:11 +02:00
Ilya Dryomov
2f56b6bae7 libceph: amend "bad option arg" error message
Don't mention "mount" -- in the rbd case it is "mapping".

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-08-02 21:26:11 +02:00
Chengguang Xu
93d35c754d ceph: restore ctime as well in the case of restoring old mode
It's better to restore ctime as well in the case of restoring old mode
in ceph_set_acl().

Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-08-02 21:26:11 +02:00
Chengguang Xu
f017754d69 ceph: add retry logic for error -ERANGE in ceph_get_acl()
When the size of acl extended attribution is larger than pre-allocated
value buffer size, we will hit error '-ERANGE' and it's probabaly caused
by concurrent get/set acl from different clients. In this case, current
logic just sets acl to NULL so that we cannot get proper information but
the operation looks successful.

This patch adds retry logic for error -ERANGE and return -EIO if fail
from the retry. Additionally, print real errno when failing from
__ceph_getxattr().

Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-08-02 21:26:11 +02:00
David S. Miller
89b1698c93 Merge ra.kernel.org:/pub/scm/linux/kernel/git/davem/net
The BTF conflicts were simple overlapping changes.

The virtio_net conflict was an overlap of a fix of statistics counter,
happening alongisde a move over to a bonafide statistics structure
rather than counting value on the stack.

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-02 10:55:32 -07:00
Phillip Lougher
a3f94cb99a Squashfs: Compute expected length from inode size rather than block length
Previously in squashfs_readpage() when copying data into the page
cache, it used the length of the datablock read from the filesystem
(after decompression).  However, if the filesystem has been corrupted
this data block may be short, which will leave pages unfilled.

The fix for this is to compute the expected number of bytes to copy
from the inode size, and use this to detect if the block is short.

Signed-off-by: Phillip Lougher <phillip@squashfs.org.uk>
Tested-by: Willy Tarreau <w@1wt.eu>
Cc: Анатолий Тросиненко <anatoly.trosinenko@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-02 09:34:02 -07:00
Linus Torvalds
71755ee535 squashfs: more metadata hardening
The squashfs fragment reading code doesn't actually verify that the
fragment is inside the fragment table.  The end result _is_ verified to
be inside the image when actually reading the fragment data, but before
that is done, we may end up taking a page fault because the fragment
table itself might not even exist.

Another report from Anatoly and his endless squashfs image fuzzing.

Reported-by: Анатолий Тросиненко <anatoly.trosinenko@gmail.com>
Acked-by:: Phillip Lougher <phillip.lougher@gmail.com>,
Cc: Willy Tarreau <w@1wt.eu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-02 09:32:23 -07:00
Liu Song
bc71652346 ext4: improve code readability in ext4_iget()
Merge the duplicated complex conditions to improve code readability.

Signed-off-by: Liu Song <liu.song11@zte.com.cn>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jiang Biao <jiang.biao2@zte.com.cn>
2018-08-02 00:11:16 -04:00
Jeremy Cline
1a5d5e5d51 ext4: fix spectre gadget in ext4_mb_regular_allocator()
'ac->ac_g_ex.fe_len' is a user-controlled value which is used in the
derivation of 'ac->ac_2order'. 'ac->ac_2order', in turn, is used to
index arrays which makes it a potential spectre gadget. Fix this by
sanitizing the value assigned to 'ac->ac2_order'.  This covers the
following accesses found with the help of smatch:

* fs/ext4/mballoc.c:1896 ext4_mb_simple_scan_group() warn: potential
  spectre issue 'grp->bb_counters' [w] (local cap)

* fs/ext4/mballoc.c:445 mb_find_buddy() warn: potential spectre issue
  'EXT4_SB(e4b->bd_sb)->s_mb_offsets' [r] (local cap)

* fs/ext4/mballoc.c:446 mb_find_buddy() warn: potential spectre issue
  'EXT4_SB(e4b->bd_sb)->s_mb_maxs' [r] (local cap)

Suggested-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Jeremy Cline <jcline@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2018-08-02 00:03:40 -04:00
Al Viro
c971e6a006 kill d_instantiate_no_diralias()
The only user is fuse_create_new_entry(), and there it's used to
mitigate the same mkdir/open-by-handle race as in nfs_mkdir().
The same solution applies - unhash the mkdir argument, then
call d_splice_alias() and if that returns a reference to preexisting
alias, dput() and report success.  ->mkdir() argument left unhashed
negative with the preexisting alias moved in the right place is just
fine from the ->mkdir() callers point of view.

Cc: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-08-01 23:18:53 -04:00
Trond Myklebust
6ea76bf513 NFSv4: Fix _nfs4_do_setlk()
The patch to fix the case where a lock request was interrupted ended up
changing default handling of errors such as NFS4ERR_DENIED and caused the
client to immediately resend the lock request. Let's do a partial revert
of that request so that the default is now to exit, but change the way
we handle resends to take into account the fact that the user may have
interrupted the request.

Reported-by: Kenneth Johansson <ken@kenjo.org>
Fixes: a3cf9bca2a ("NFSv4: Don't add a new lock on an interrupted wait..")
Cc: Benjamin Coddington <bcodding@redhat.com>
Cc: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
2018-08-01 23:17:06 -04:00
Christoph Hellwig
006477f40d kconfig: move the "Executable file formats" menu to fs/Kconfig.binfmt
No need to have this in the top-level Kconfig.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Randy Dunlap <rdunlap@infradead.org>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
2018-08-02 08:06:55 +09:00
Chao Yu
82cf4f132e f2fs: fix to active page in lru list for read path
If config CONFIG_F2FS_FAULT_INJECTION is on, for both read or write path
we will call find_lock_page() to get the page, but for read path, it
missed to passing FGP_ACCESSED to allocator to active the page in LRU
list, result in being reclaimed in advance incorrectly, fix it.

Reported-by: Xianrong Zhou <zhouxianrong@huawei.com>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-08-01 11:52:36 -07:00
Chao Yu
18767e6263 f2fs: don't keep meta pages used for block migration
For migration of encrypted inode's block, we load data of encrypted block
into meta inode's page cache, after checkpoint, those all intermediate
pages should be clean, and no one will read them again, so let's just
release them for more memory.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-08-01 11:52:36 -07:00
Chao Yu
4ddc1b28aa f2fs: fix to restrict mount condition when without CONFIG_QUOTA
Like quota_ino feature, we need to reject mounting RDWR with image
which enables project_quota feature when there is no CONFIG_QUOTA
be set in kernel.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-08-01 11:52:36 -07:00
Sheng Yong
00960c2cd8 f2fs: quota: do not mount as RDWR without QUOTA if quota feature enabled
If quota feature is enabled, quota is on by default. However, if
CONFIG_QUOTA is not built in kernel, dquot entries will not get updated,
which leads to quota inconsistency.

Signed-off-by: Sheng Yong <shengyong1@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-08-01 11:52:36 -07:00
Sheng Yong
76cf05d79c f2fs: quota: fix incorrect comments
Signed-off-by: Sheng Yong <shengyong1@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-08-01 11:52:36 -07:00
Sheng Yong
955ac6e523 f2fs: quota: decrease the lock granularity of statfs_project
According to fs/quota/dquot.c, `dq_data_lock' protects mem_dqinfo
structures and modifications of dquot pointers in the inode, and
`dquot->dq_dqb_lock' protects data from dq_dqb.

We should use dquot->dq_dqb_lock in statfs_project instead of
dq_dat_lock.

Signed-off-by: Sheng Yong <shengyong1@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-08-01 11:52:36 -07:00
Yunlong Song
970e348d98 f2fs: add proc entry to show victim_secmap bitmap
This patch adds a new proc entry to show victim_secmap information in
more detail, which is very helpful to know the get_victim candidate
status clearly, and helpful to debug problems (e.g., some sections can
not gc all of its blocks, since some blocks belong to atomic file,
leaving victim_secmap with section bit setting, in extrem case, this
will lead all bytes of victim_secmap setting with 0xff).

Signed-off-by: Yunlong Song <yunlong.song@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-08-01 11:52:36 -07:00
Chao Yu
fd8c8caf7e f2fs: let checkpoint flush dnode page of regular
Fsyncer will wait on all dnode pages of regular writeback before flushing,
if there are async dnode pages blocked by IO scheduler, it may decrease
fsync's performance.

In this patch, we choose to let f2fs_balance_fs_bg() to trigger checkpoint
to flush these dnode pages of regular, so async IO of dnode page can be
elimitnated, making fsyncer only need to wait for sync IO.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-08-01 11:52:36 -07:00
Yunlong Song
ad6672bbc5 f2fs: issue discard align to section in LFS mode
For the case when sbi->segs_per_sec > 1 with lfs mode, take
section:segment = 5 for example, if the section prefree_map is
...previous section | current section (1 1 0 1 1) | next section...,
then the start = x, end = x + 1, after start = start_segno +
sbi->segs_per_sec, start = x + 5, then it will skip x + 3 and x + 4, but
their bitmap is still set, which will cause duplicated
f2fs_issue_discard of this same section in the next write_checkpoint:

round 1: section bitmap : 1 1 1 1 1, all valid, prefree_map: 0 0 0 0 0
then rm data block NO.2, block NO.2 becomes invalid, prefree_map: 0 0 1 0 0
write_checkpoint: section bitmap: 1 1 0 1 1, prefree_map: 0 0 0 0 0,
prefree of NO.2 is cleared, and no discard issued

round 2: rm data block NO.0, NO.1, NO.3, NO.4
all invalid, but prefree bit of NO.2 is set and cleared in round 1, then
prefree_map: 1 1 0 1 1
write_checkpoint: section bitmap: 0 0 0 0 0, prefree_map: 0 0 0 1 1, no
valid blocks of this section, so discard issued, but this time prefree
bit of NO.3 and NO.4 is skipped due to start = start_segno + sbi->segs_per_sec;

round 3:
write_checkpoint: section bitmap: 0 0 0 0 0, prefree_map: 0 0 0 1 1 ->
0 0 0 0 0, no valid blocks of this section, so discard issued,
this time prefree bit of NO.3 and NO.4 is cleared, but the discard of
this section is sent again...

To fix this problem, we can align the start and end value to section
boundary for fstrim and real-time discard operation, and decide to issue
discard only when the whole section is invalid, which can issue discard
aligned to section size as much as possible and avoid redundant discard.

Signed-off-by: Yunlong Song <yunlong.song@huawei.com>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-08-01 11:52:36 -07:00
Jaegeuk Kim
455e3a5887 f2fs: don't allow any writes on aborted atomic writes
In order to prevent abusing atomic writes by abnormal users, we've added a
threshold, 20% over memory footprint, which disallows further atomic writes.
Previously, however, SQLite doesn't know the files became normal, so that
it could write stale data and commit on revoked normal database file.

Once f2fs detects such the abnormal behavior, this patch tries to avoid further
writes in write_begin().

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-08-01 11:52:36 -07:00
Chao Yu
797c1cb56b f2fs: restrict setting up inode.i_advise
In order to give advise to f2fs to recognize hot/cold file, it is possible
that we can set specific bit in inode.i_advise through setxattr(), but
there are several bits which are used internally, such as encrypt_bit,
keep_size_bit, they should never be changed through setxattr().

So that this patch 1) adds FADVISE_MODIFIABLE_BITS to filter modifiable
bits user given, 2) supports to clear {hot,cold}_file bits.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-08-01 11:52:36 -07:00
Yunlei He
e6b0b159cf f2fs: fix wrong kernel message when recover fsync data on ro fs
This patch fix wrong message info for recover fsync data
on readonly fs.

Signed-off-by: Yunlei He <heyunlei@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-08-01 11:52:36 -07:00
Chao Yu
059c0648c6 f2fs: clean up ioctl interface naming
Romve redundant prefix 'f2fs_' in the middle of f2fs_ioc_f2fs_write_checkpoint().

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-08-01 11:52:36 -07:00
Chao Yu
2079f115e7 f2fs: clean up with f2fs_is_{atomic,volatile}_file()
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-08-01 11:52:36 -07:00
Chao Yu
5b72d5e0df f2fs: clean up with f2fs_encrypted_inode()
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-08-01 11:52:36 -07:00
Chao Yu
80551d1773 f2fs: clean up with get_current_nat_page
Just cleanup, no logic change.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-08-01 11:52:36 -07:00
Chao Yu
6122003a1a f2fs: kill EXT_TREE_VEC_SIZE
Since commit 201ef5e080 ("f2fs: improve shrink performance of extent nodes"),
there is no user of EXT_TREE_VEC_SIZE, just kill it for cleanup.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-08-01 11:52:36 -07:00
Hyunchul Lee
5d3ce4f701 f2fs: avoid duplicated permission check for "trusted." xattrs
Because xattr_permission already checks CAP_SYS_ADMIN
capability, we don't need to check it.

Signed-off-by: Hyunchul Lee <cheol.lee@lge.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-08-01 11:52:36 -07:00
Chao Yu
7735730d39 f2fs: fix to propagate error from __get_meta_page()
If caller of __get_meta_page() can handle error, let's propagate error
from __get_meta_page().

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-08-01 11:52:36 -07:00
Chao Yu
18dd6470c2 f2fs: fix to do sanity check with i_extra_isize
If inode.i_extra_isize was fuzzed to an abnormal value, when
calculating inline data size, the result will overflow, result
in accessing invalid memory area when operating inline data.

Let's do sanity check with i_extra_isize during inode loading
for fixing.

https://bugzilla.kernel.org/show_bug.cgi?id=200421

- Reproduce

- POC (poc.c)
    #define _GNU_SOURCE
    #include <sys/types.h>
    #include <sys/mount.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <sys/xattr.h>

    #include <dirent.h>
    #include <errno.h>
    #include <error.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    #include <linux/falloc.h>
    #include <linux/loop.h>

    static void activity(char *mpoint) {

      char *foo_bar_baz;
      char *foo_baz;
      char *xattr;
      int err;

      err = asprintf(&foo_bar_baz, "%s/foo/bar/baz", mpoint);
      err = asprintf(&foo_baz, "%s/foo/baz", mpoint);
      err = asprintf(&xattr, "%s/foo/bar/xattr", mpoint);

      rename(foo_bar_baz, foo_baz);

      char buf2[113];
      memset(buf2, 0, sizeof(buf2));
      listxattr(xattr, buf2, sizeof(buf2));
      removexattr(xattr, "user.mime_type");

    }

    int main(int argc, char *argv[]) {
      activity(argv[1]);
      return 0;
    }

- Kernel message
Umount the image will leave the following message
[ 2910.995489] F2FS-fs (loop0): Mounted with checkpoint version = 2
[ 2918.416465] ==================================================================
[ 2918.416807] BUG: KASAN: slab-out-of-bounds in f2fs_iget+0xcb9/0x1a80
[ 2918.417009] Read of size 4 at addr ffff88018efc2068 by task a.out/1229

[ 2918.417311] CPU: 1 PID: 1229 Comm: a.out Not tainted 4.17.0+ #1
[ 2918.417314] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
[ 2918.417323] Call Trace:
[ 2918.417366]  dump_stack+0x71/0xab
[ 2918.417401]  print_address_description+0x6b/0x290
[ 2918.417407]  kasan_report+0x28e/0x390
[ 2918.417411]  ? f2fs_iget+0xcb9/0x1a80
[ 2918.417415]  f2fs_iget+0xcb9/0x1a80
[ 2918.417422]  ? f2fs_lookup+0x2e7/0x580
[ 2918.417425]  f2fs_lookup+0x2e7/0x580
[ 2918.417433]  ? __recover_dot_dentries+0x400/0x400
[ 2918.417447]  ? legitimize_path.isra.29+0x5a/0xa0
[ 2918.417453]  __lookup_slow+0x11c/0x220
[ 2918.417457]  ? may_delete+0x2a0/0x2a0
[ 2918.417475]  ? deref_stack_reg+0xe0/0xe0
[ 2918.417479]  ? __lookup_hash+0xb0/0xb0
[ 2918.417483]  lookup_slow+0x3e/0x60
[ 2918.417488]  walk_component+0x3ac/0x990
[ 2918.417492]  ? generic_permission+0x51/0x1e0
[ 2918.417495]  ? inode_permission+0x51/0x1d0
[ 2918.417499]  ? pick_link+0x3e0/0x3e0
[ 2918.417502]  ? link_path_walk+0x4b1/0x770
[ 2918.417513]  ? _raw_spin_lock_irqsave+0x25/0x50
[ 2918.417518]  ? walk_component+0x990/0x990
[ 2918.417522]  ? path_init+0x2e6/0x580
[ 2918.417526]  path_lookupat+0x13f/0x430
[ 2918.417531]  ? trailing_symlink+0x3a0/0x3a0
[ 2918.417534]  ? do_renameat2+0x270/0x7b0
[ 2918.417538]  ? __kasan_slab_free+0x14c/0x190
[ 2918.417541]  ? do_renameat2+0x270/0x7b0
[ 2918.417553]  ? kmem_cache_free+0x85/0x1e0
[ 2918.417558]  ? do_renameat2+0x270/0x7b0
[ 2918.417563]  filename_lookup+0x13c/0x280
[ 2918.417567]  ? filename_parentat+0x2b0/0x2b0
[ 2918.417572]  ? kasan_unpoison_shadow+0x31/0x40
[ 2918.417575]  ? kasan_kmalloc+0xa6/0xd0
[ 2918.417593]  ? strncpy_from_user+0xaa/0x1c0
[ 2918.417598]  ? getname_flags+0x101/0x2b0
[ 2918.417614]  ? path_listxattr+0x87/0x110
[ 2918.417619]  path_listxattr+0x87/0x110
[ 2918.417623]  ? listxattr+0xc0/0xc0
[ 2918.417637]  ? mm_fault_error+0x1b0/0x1b0
[ 2918.417654]  do_syscall_64+0x73/0x160
[ 2918.417660]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 2918.417676] RIP: 0033:0x7f2f3a3480d7
[ 2918.417677] Code: f0 ff ff 73 01 c3 48 8b 0d be dd 2b 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 b8 c2 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 91 dd 2b 00 f7 d8 64 89 01 48
[ 2918.417732] RSP: 002b:00007fff4095b7d8 EFLAGS: 00000206 ORIG_RAX: 00000000000000c2
[ 2918.417744] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f2f3a3480d7
[ 2918.417746] RDX: 0000000000000071 RSI: 00007fff4095b810 RDI: 000000000126a0c0
[ 2918.417749] RBP: 00007fff4095b890 R08: 000000000126a010 R09: 0000000000000000
[ 2918.417751] R10: 00000000000001ab R11: 0000000000000206 R12: 00000000004005e0
[ 2918.417753] R13: 00007fff4095b990 R14: 0000000000000000 R15: 0000000000000000

[ 2918.417853] Allocated by task 329:
[ 2918.418002]  kasan_kmalloc+0xa6/0xd0
[ 2918.418007]  kmem_cache_alloc+0xc8/0x1e0
[ 2918.418023]  mempool_init_node+0x194/0x230
[ 2918.418027]  mempool_init+0x12/0x20
[ 2918.418042]  bioset_init+0x2bd/0x380
[ 2918.418052]  blk_alloc_queue_node+0xe9/0x540
[ 2918.418075]  dm_create+0x2c0/0x800
[ 2918.418080]  dev_create+0xd2/0x530
[ 2918.418083]  ctl_ioctl+0x2a3/0x5b0
[ 2918.418087]  dm_ctl_ioctl+0xa/0x10
[ 2918.418092]  do_vfs_ioctl+0x13e/0x8c0
[ 2918.418095]  ksys_ioctl+0x66/0x70
[ 2918.418098]  __x64_sys_ioctl+0x3d/0x50
[ 2918.418102]  do_syscall_64+0x73/0x160
[ 2918.418106]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

[ 2918.418204] Freed by task 0:
[ 2918.418301] (stack is not available)

[ 2918.418521] The buggy address belongs to the object at ffff88018efc0000
                which belongs to the cache biovec-max of size 8192
[ 2918.418894] The buggy address is located 104 bytes to the right of
                8192-byte region [ffff88018efc0000, ffff88018efc2000)
[ 2918.419257] The buggy address belongs to the page:
[ 2918.419431] page:ffffea00063bf000 count:1 mapcount:0 mapping:ffff8801f2242540 index:0x0 compound_mapcount: 0
[ 2918.419702] flags: 0x17fff8000008100(slab|head)
[ 2918.419879] raw: 017fff8000008100 dead000000000100 dead000000000200 ffff8801f2242540
[ 2918.420101] raw: 0000000000000000 0000000000030003 00000001ffffffff 0000000000000000
[ 2918.420322] page dumped because: kasan: bad access detected

[ 2918.420599] Memory state around the buggy address:
[ 2918.420764]  ffff88018efc1f00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 2918.420975]  ffff88018efc1f80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 2918.421194] >ffff88018efc2000: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 2918.421406]                                                           ^
[ 2918.421627]  ffff88018efc2080: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 2918.421838]  ffff88018efc2100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 2918.422046] ==================================================================
[ 2918.422264] Disabling lock debugging due to kernel taint
[ 2923.901641] BUG: unable to handle kernel paging request at ffff88018f0db000
[ 2923.901884] PGD 22226a067 P4D 22226a067 PUD 222273067 PMD 18e642063 PTE 800000018f0db061
[ 2923.902120] Oops: 0003 [#1] SMP KASAN PTI
[ 2923.902274] CPU: 1 PID: 1231 Comm: umount Tainted: G    B             4.17.0+ #1
[ 2923.902490] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
[ 2923.902761] RIP: 0010:__memset+0x24/0x30
[ 2923.902906] Code: 90 90 90 90 90 90 66 66 90 66 90 49 89 f9 48 89 d1 83 e2 07 48 c1 e9 03 40 0f b6 f6 48 b8 01 01 01 01 01 01 01 01 48 0f af c6 <f3> 48 ab 89 d1 f3 aa 4c 89 c8 c3 90 49 89 f9 40 88 f0 48 89 d1 f3
[ 2923.903446] RSP: 0018:ffff88018ddf7ae0 EFLAGS: 00010206
[ 2923.903622] RAX: 0000000000000000 RBX: ffff8801d549d888 RCX: 1ffffffffffdaffb
[ 2923.903833] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88018f0daffc
[ 2923.904062] RBP: ffff88018efc206c R08: 1ffff10031df840d R09: ffff88018efc206c
[ 2923.904273] R10: ffffffffffffe1ee R11: ffffed0031df65fa R12: 0000000000000000
[ 2923.904485] R13: ffff8801d549dc98 R14: 00000000ffffc3db R15: ffffea00063bec80
[ 2923.904693] FS:  00007fa8b2f8a840(0000) GS:ffff8801f3b00000(0000) knlGS:0000000000000000
[ 2923.904937] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2923.910080] CR2: ffff88018f0db000 CR3: 000000018f892000 CR4: 00000000000006e0
[ 2923.914930] Call Trace:
[ 2923.919724]  f2fs_truncate_inline_inode+0x114/0x170
[ 2923.924487]  f2fs_truncate_blocks+0x11b/0x7c0
[ 2923.929178]  ? f2fs_truncate_data_blocks+0x10/0x10
[ 2923.933834]  ? dqget+0x670/0x670
[ 2923.938437]  ? f2fs_destroy_extent_tree+0xd6/0x270
[ 2923.943107]  ? __radix_tree_lookup+0x2f/0x150
[ 2923.947772]  f2fs_truncate+0xd4/0x1a0
[ 2923.952491]  f2fs_evict_inode+0x5ab/0x610
[ 2923.957204]  evict+0x15f/0x280
[ 2923.961898]  __dentry_kill+0x161/0x250
[ 2923.966634]  shrink_dentry_list+0xf3/0x250
[ 2923.971897]  shrink_dcache_parent+0xa9/0x100
[ 2923.976561]  ? shrink_dcache_sb+0x1f0/0x1f0
[ 2923.981177]  ? wait_for_completion+0x8a/0x210
[ 2923.985781]  ? migrate_swap_stop+0x2d0/0x2d0
[ 2923.990332]  do_one_tree+0xe/0x40
[ 2923.994735]  shrink_dcache_for_umount+0x3a/0xa0
[ 2923.999077]  generic_shutdown_super+0x3e/0x1c0
[ 2924.003350]  kill_block_super+0x4b/0x70
[ 2924.007619]  deactivate_locked_super+0x65/0x90
[ 2924.011812]  cleanup_mnt+0x5c/0xa0
[ 2924.015995]  task_work_run+0xce/0xf0
[ 2924.020174]  exit_to_usermode_loop+0x115/0x120
[ 2924.024293]  do_syscall_64+0x12f/0x160
[ 2924.028479]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 2924.032709] RIP: 0033:0x7fa8b2868487
[ 2924.036888] Code: 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 31 f6 e9 09 00 00 00 66 0f 1f 84 00 00 00 00 00 b8 a6 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d e1 c9 2b 00 f7 d8 64 89 01 48
[ 2924.045750] RSP: 002b:00007ffc39824d58 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
[ 2924.050190] RAX: 0000000000000000 RBX: 00000000008ea030 RCX: 00007fa8b2868487
[ 2924.054604] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 00000000008f4360
[ 2924.058940] RBP: 00000000008f4360 R08: 0000000000000000 R09: 0000000000000014
[ 2924.063186] R10: 00000000000006b2 R11: 0000000000000246 R12: 00007fa8b2d7183c
[ 2924.067418] R13: 0000000000000000 R14: 00000000008ea210 R15: 00007ffc39824fe0
[ 2924.071534] Modules linked in: snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep snd_pcm snd_timer joydev input_leds serio_raw snd soundcore mac_hid i2c_piix4 ib_iser rdma_cm iw_cm ib_cm ib_core configfs iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi btrfs zstd_decompress zstd_compress xxhash raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear 8139too qxl ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel psmouse aes_x86_64 8139cp crypto_simd cryptd mii glue_helper pata_acpi floppy
[ 2924.098044] CR2: ffff88018f0db000
[ 2924.102520] ---[ end trace a8e0d899985faf31 ]---
[ 2924.107012] RIP: 0010:__memset+0x24/0x30
[ 2924.111448] Code: 90 90 90 90 90 90 66 66 90 66 90 49 89 f9 48 89 d1 83 e2 07 48 c1 e9 03 40 0f b6 f6 48 b8 01 01 01 01 01 01 01 01 48 0f af c6 <f3> 48 ab 89 d1 f3 aa 4c 89 c8 c3 90 49 89 f9 40 88 f0 48 89 d1 f3
[ 2924.120724] RSP: 0018:ffff88018ddf7ae0 EFLAGS: 00010206
[ 2924.125312] RAX: 0000000000000000 RBX: ffff8801d549d888 RCX: 1ffffffffffdaffb
[ 2924.129931] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88018f0daffc
[ 2924.134537] RBP: ffff88018efc206c R08: 1ffff10031df840d R09: ffff88018efc206c
[ 2924.139175] R10: ffffffffffffe1ee R11: ffffed0031df65fa R12: 0000000000000000
[ 2924.143825] R13: ffff8801d549dc98 R14: 00000000ffffc3db R15: ffffea00063bec80
[ 2924.148500] FS:  00007fa8b2f8a840(0000) GS:ffff8801f3b00000(0000) knlGS:0000000000000000
[ 2924.153247] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2924.158003] CR2: ffff88018f0db000 CR3: 000000018f892000 CR4: 00000000000006e0
[ 2924.164641] BUG: Bad rss-counter state mm:00000000fa04621e idx:0 val:4
[ 2924.170007] BUG: Bad rss-counter
tate mm:00000000fa04621e idx:1 val:2

- Location
https://elixir.bootlin.com/linux/v4.18-rc3/source/fs/f2fs/inline.c#L78
	memset(addr + from, 0, MAX_INLINE_DATA(inode) - from);
Here the length can be negative.

Reported-by Wen Xu <wen.xu@gatech.edu>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-08-01 11:52:36 -07:00
Yunlong Song
66415cee3d f2fs: blk_finish_plug of submit_bio in lfs mode
Expand the blk_finish_plug action from blkzoned to normal lfs mode,
since plug will cause the out-of-order IO submission, which is not
friendly to flash in lfs mode.

Signed-off-by: Yunlong Song <yunlong.song@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-08-01 11:52:36 -07:00
Yunlong Song
3611ce9911 f2fs: do not set free of current section
For the case when sbi->segs_per_sec > 1, take section:segment = 5 for
example, if segment 1 is just used and allocate new segment 2, and the
blocks of segment 1 is invalidated, at this time, the previous code will
use __set_test_and_free to free the free_secmap and free_sections++,
this is not correct since it is still a current section, so fix it.

Signed-off-by: Yunlong Song <yunlong.song@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-08-01 11:52:36 -07:00
Daniel Rosenberg
36b877af79 f2fs: Keep alloc_valid_block_count in sync
If we attempt to request more blocks than we have room for, we try to
instead request as much as we can, however, alloc_valid_block_count
is not decremented to match the new value, allowing it to drift higher
until the next checkpoint. This always decrements it when the requested
amount cannot be fulfilled.

Signed-off-by: Daniel Rosenberg <drosen@google.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-08-01 11:52:36 -07:00
Chao Yu
20ee438232 f2fs: issue small discard by LBA order
For small granularity discard which size is smaller than 64KB, if we
issue those kind of discards orderly by size, their IOs will be spread
into entire logical address, so that in FTL, L2P table will be updated
randomly, result bad wear rate in the table.

In this patch, we choose to issue small discard by LBA order, by this
way, we can expect that L2P table updates from adjacent discard IOs can
be merged in the cache, so it can reduce lifetime wearing of flash.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-08-01 11:52:36 -07:00
Chao Yu
522d1711d6 f2fs: stop issuing discard immediately if there is queued IO
For background discard policy, even if there is queued user IO, still
we will check max_requests times for next discard entry, it is unneeded,
let's just stop this round submission immediately.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-08-01 11:52:36 -07:00
Chao Yu
4c6b56c002 f2fs: clean up with IS_INODE()
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-08-01 11:52:36 -07:00
Chao Yu
2482c4325d f2fs: detect bug_on in f2fs_wait_discard_bios
Add bug_on to detect potential non-empty discard wait list.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-08-01 11:52:36 -07:00
Randy Dunlap
cb15d1e43d f2fs: fix defined but not used build warnings
Fix build warnings in f2fs when CONFIG_PROC_FS is not enabled
by marking the unused functions as __maybe_unused.

../fs/f2fs/sysfs.c:519:12: warning: 'segment_info_seq_show' defined but not used [-Wunused-function]
../fs/f2fs/sysfs.c:546:12: warning: 'segment_bits_seq_show' defined but not used [-Wunused-function]
../fs/f2fs/sysfs.c:570:12: warning: 'iostat_info_seq_show' defined but not used [-Wunused-function]

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Chao Yu <yuchao0@huawei.com>
Cc: linux-f2fs-devel@lists.sourceforge.net
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-08-01 11:52:36 -07:00
Chao Yu
a39e536583 f2fs: enable real-time discard by default
f2fs is focused on flash based storage, so let's enable real-time
discard by default, if user don't want to enable it, 'nodiscard'
mount option should be used on mount.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-08-01 11:52:36 -07:00
Chao Yu
82902c06bd f2fs: fix to detect looped node chain correctly
Below dmesg was printed when testing generic/388 of fstest:

F2FS-fs (zram1): find_fsync_dnodes: detect looped node chain, blkaddr:526615, next:526616
F2FS-fs (zram1): Cannot recover all fsync data errno=-22
F2FS-fs (zram1): Mounted with checkpoint version = 22300d0e
F2FS-fs (zram1): find_fsync_dnodes: detect looped node chain, blkaddr:526615, next:526616
F2FS-fs (zram1): Cannot recover all fsync data errno=-22

The reason is that we initialize free_blocks with free blocks of
filesystem, so if filesystem is full, free_blocks can be zero,
below condition will be true, so that, it will fail recovery.

if (++loop_cnt >= free_blocks ||
	blkaddr == next_blkaddr_of_node(page))

To fix this issue, initialize free_blocks with correct value which
includes over-privision blocks.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-08-01 11:52:36 -07:00
Chao Yu
c9b60788fc f2fs: fix to do sanity check with block address in main area
This patch add to do sanity check with below field:
- cp_pack_total_block_count
- blkaddr of data/node
- extent info

- Overview
BUG() in verify_block_addr() when writing to a corrupted f2fs image

- Reproduce (4.18 upstream kernel)

- POC (poc.c)

static void activity(char *mpoint) {

  char *foo_bar_baz;
  int err;

  static int buf[8192];
  memset(buf, 0, sizeof(buf));

  err = asprintf(&foo_bar_baz, "%s/foo/bar/baz", mpoint);

  int fd = open(foo_bar_baz, O_RDWR | O_TRUNC, 0777);
  if (fd >= 0) {
    write(fd, (char *)buf, sizeof(buf));
    fdatasync(fd);
    close(fd);
  }
}

int main(int argc, char *argv[]) {
  activity(argv[1]);
  return 0;
}

- Kernel message
[  689.349473] F2FS-fs (loop0): Mounted with checkpoint version = 3
[  699.728662] WARNING: CPU: 0 PID: 1309 at fs/f2fs/segment.c:2860 f2fs_inplace_write_data+0x232/0x240
[  699.728670] Modules linked in: snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hwdep snd_hda_core snd_pcm snd_timer snd mac_hid i2c_piix4 soundcore ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx raid1 raid0 multipath linear 8139too crct10dif_pclmul crc32_pclmul qxl drm_kms_helper syscopyarea aesni_intel sysfillrect sysimgblt fb_sys_fops ttm drm aes_x86_64 crypto_simd cryptd 8139cp glue_helper mii pata_acpi floppy
[  699.729056] CPU: 0 PID: 1309 Comm: a.out Not tainted 4.18.0-rc1+ #4
[  699.729064] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
[  699.729074] RIP: 0010:f2fs_inplace_write_data+0x232/0x240
[  699.729076] Code: ff e9 cf fe ff ff 49 8d 7d 10 e8 39 45 ad ff 4d 8b 7d 10 be 04 00 00 00 49 8d 7f 48 e8 07 49 ad ff 45 8b 7f 48 e9 fb fe ff ff <0f> 0b f0 41 80 4d 48 04 e9 65 fe ff ff 90 66 66 66 66 90 55 48 8d
[  699.729130] RSP: 0018:ffff8801f43af568 EFLAGS: 00010202
[  699.729139] RAX: 000000000000003f RBX: ffff8801f43af7b8 RCX: ffffffffb88c9113
[  699.729142] RDX: 0000000000000003 RSI: dffffc0000000000 RDI: ffff8802024e5540
[  699.729144] RBP: ffff8801f43af590 R08: 0000000000000009 R09: ffffffffffffffe8
[  699.729147] R10: 0000000000000001 R11: ffffed0039b0596a R12: ffff8802024e5540
[  699.729149] R13: ffff8801f0335500 R14: ffff8801e3e7a700 R15: ffff8801e1ee4450
[  699.729154] FS:  00007f9bf97f5700(0000) GS:ffff8801f6e00000(0000) knlGS:0000000000000000
[  699.729156] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  699.729159] CR2: 00007f9bf925d170 CR3: 00000001f0c34000 CR4: 00000000000006f0
[  699.729171] Call Trace:
[  699.729192]  f2fs_do_write_data_page+0x2e2/0xe00
[  699.729203]  ? f2fs_should_update_outplace+0xd0/0xd0
[  699.729238]  ? memcg_drain_all_list_lrus+0x280/0x280
[  699.729269]  ? __radix_tree_replace+0xa3/0x120
[  699.729276]  __write_data_page+0x5c7/0xe30
[  699.729291]  ? kasan_check_read+0x11/0x20
[  699.729310]  ? page_mapped+0x8a/0x110
[  699.729321]  ? page_mkclean+0xe9/0x160
[  699.729327]  ? f2fs_do_write_data_page+0xe00/0xe00
[  699.729331]  ? invalid_page_referenced_vma+0x130/0x130
[  699.729345]  ? clear_page_dirty_for_io+0x332/0x450
[  699.729351]  f2fs_write_cache_pages+0x4ca/0x860
[  699.729358]  ? __write_data_page+0xe30/0xe30
[  699.729374]  ? percpu_counter_add_batch+0x22/0xa0
[  699.729380]  ? kasan_check_write+0x14/0x20
[  699.729391]  ? _raw_spin_lock+0x17/0x40
[  699.729403]  ? f2fs_mark_inode_dirty_sync.part.18+0x16/0x30
[  699.729413]  ? iov_iter_advance+0x113/0x640
[  699.729418]  ? f2fs_write_end+0x133/0x2e0
[  699.729423]  ? balance_dirty_pages_ratelimited+0x239/0x640
[  699.729428]  f2fs_write_data_pages+0x329/0x520
[  699.729433]  ? generic_perform_write+0x250/0x320
[  699.729438]  ? f2fs_write_cache_pages+0x860/0x860
[  699.729454]  ? current_time+0x110/0x110
[  699.729459]  ? f2fs_preallocate_blocks+0x1ef/0x370
[  699.729464]  do_writepages+0x37/0xb0
[  699.729468]  ? f2fs_write_cache_pages+0x860/0x860
[  699.729472]  ? do_writepages+0x37/0xb0
[  699.729478]  __filemap_fdatawrite_range+0x19a/0x1f0
[  699.729483]  ? delete_from_page_cache_batch+0x4e0/0x4e0
[  699.729496]  ? __vfs_write+0x2b2/0x410
[  699.729501]  file_write_and_wait_range+0x66/0xb0
[  699.729506]  f2fs_do_sync_file+0x1f9/0xd90
[  699.729511]  ? truncate_partial_data_page+0x290/0x290
[  699.729521]  ? __sb_end_write+0x30/0x50
[  699.729526]  ? vfs_write+0x20f/0x260
[  699.729530]  f2fs_sync_file+0x9a/0xb0
[  699.729534]  ? f2fs_do_sync_file+0xd90/0xd90
[  699.729548]  vfs_fsync_range+0x68/0x100
[  699.729554]  ? __fget_light+0xc9/0xe0
[  699.729558]  do_fsync+0x3d/0x70
[  699.729562]  __x64_sys_fdatasync+0x24/0x30
[  699.729585]  do_syscall_64+0x78/0x170
[  699.729595]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  699.729613] RIP: 0033:0x7f9bf930d800
[  699.729615] Code: 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 83 3d 49 bf 2c 00 00 75 10 b8 4b 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 31 c3 48 83 ec 08 e8 be 78 01 00 48 89 04 24
[  699.729668] RSP: 002b:00007ffee3606c68 EFLAGS: 00000246 ORIG_RAX: 000000000000004b
[  699.729673] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f9bf930d800
[  699.729675] RDX: 0000000000008000 RSI: 00000000006010a0 RDI: 0000000000000003
[  699.729678] RBP: 00007ffee3606ca0 R08: 0000000001503010 R09: 0000000000000000
[  699.729680] R10: 00000000000002e8 R11: 0000000000000246 R12: 0000000000400610
[  699.729683] R13: 00007ffee3606da0 R14: 0000000000000000 R15: 0000000000000000
[  699.729687] ---[ end trace 4ce02f25ff7d3df5 ]---
[  699.729782] ------------[ cut here ]------------
[  699.729785] kernel BUG at fs/f2fs/segment.h:654!
[  699.731055] invalid opcode: 0000 [#1] SMP KASAN PTI
[  699.732104] CPU: 0 PID: 1309 Comm: a.out Tainted: G        W         4.18.0-rc1+ #4
[  699.733684] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
[  699.735611] RIP: 0010:f2fs_submit_page_bio+0x29b/0x730
[  699.736649] Code: 54 49 8d bd 18 04 00 00 e8 b2 59 af ff 41 8b 8d 18 04 00 00 8b 45 b8 41 d3 e6 44 01 f0 4c 8d 73 14 41 39 c7 0f 82 37 fe ff ff <0f> 0b 65 8b 05 2c 04 77 47 89 c0 48 0f a3 05 52 c1 d5 01 0f 92 c0
[  699.740524] RSP: 0018:ffff8801f43af508 EFLAGS: 00010283
[  699.741573] RAX: 0000000000000000 RBX: ffff8801f43af7b8 RCX: ffffffffb88a7cef
[  699.743006] RDX: 0000000000000007 RSI: dffffc0000000000 RDI: ffff8801e3e7a64c
[  699.744426] RBP: ffff8801f43af558 R08: ffffed003e066b55 R09: ffffed003e066b55
[  699.745833] R10: 0000000000000001 R11: ffffed003e066b54 R12: ffffea0007876940
[  699.747256] R13: ffff8801f0335500 R14: ffff8801e3e7a600 R15: 0000000000000001
[  699.748683] FS:  00007f9bf97f5700(0000) GS:ffff8801f6e00000(0000) knlGS:0000000000000000
[  699.750293] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  699.751462] CR2: 00007f9bf925d170 CR3: 00000001f0c34000 CR4: 00000000000006f0
[  699.752874] Call Trace:
[  699.753386]  ? f2fs_inplace_write_data+0x93/0x240
[  699.754341]  f2fs_inplace_write_data+0xd2/0x240
[  699.755271]  f2fs_do_write_data_page+0x2e2/0xe00
[  699.756214]  ? f2fs_should_update_outplace+0xd0/0xd0
[  699.757215]  ? memcg_drain_all_list_lrus+0x280/0x280
[  699.758209]  ? __radix_tree_replace+0xa3/0x120
[  699.759164]  __write_data_page+0x5c7/0xe30
[  699.760002]  ? kasan_check_read+0x11/0x20
[  699.760823]  ? page_mapped+0x8a/0x110
[  699.761573]  ? page_mkclean+0xe9/0x160
[  699.762345]  ? f2fs_do_write_data_page+0xe00/0xe00
[  699.763332]  ? invalid_page_referenced_vma+0x130/0x130
[  699.764374]  ? clear_page_dirty_for_io+0x332/0x450
[  699.765347]  f2fs_write_cache_pages+0x4ca/0x860
[  699.766276]  ? __write_data_page+0xe30/0xe30
[  699.767161]  ? percpu_counter_add_batch+0x22/0xa0
[  699.768112]  ? kasan_check_write+0x14/0x20
[  699.768951]  ? _raw_spin_lock+0x17/0x40
[  699.769739]  ? f2fs_mark_inode_dirty_sync.part.18+0x16/0x30
[  699.770885]  ? iov_iter_advance+0x113/0x640
[  699.771743]  ? f2fs_write_end+0x133/0x2e0
[  699.772569]  ? balance_dirty_pages_ratelimited+0x239/0x640
[  699.773680]  f2fs_write_data_pages+0x329/0x520
[  699.774603]  ? generic_perform_write+0x250/0x320
[  699.775544]  ? f2fs_write_cache_pages+0x860/0x860
[  699.776510]  ? current_time+0x110/0x110
[  699.777299]  ? f2fs_preallocate_blocks+0x1ef/0x370
[  699.778279]  do_writepages+0x37/0xb0
[  699.779026]  ? f2fs_write_cache_pages+0x860/0x860
[  699.779978]  ? do_writepages+0x37/0xb0
[  699.780755]  __filemap_fdatawrite_range+0x19a/0x1f0
[  699.781746]  ? delete_from_page_cache_batch+0x4e0/0x4e0
[  699.782820]  ? __vfs_write+0x2b2/0x410
[  699.783597]  file_write_and_wait_range+0x66/0xb0
[  699.784540]  f2fs_do_sync_file+0x1f9/0xd90
[  699.785381]  ? truncate_partial_data_page+0x290/0x290
[  699.786415]  ? __sb_end_write+0x30/0x50
[  699.787204]  ? vfs_write+0x20f/0x260
[  699.787941]  f2fs_sync_file+0x9a/0xb0
[  699.788694]  ? f2fs_do_sync_file+0xd90/0xd90
[  699.789572]  vfs_fsync_range+0x68/0x100
[  699.790360]  ? __fget_light+0xc9/0xe0
[  699.791128]  do_fsync+0x3d/0x70
[  699.791779]  __x64_sys_fdatasync+0x24/0x30
[  699.792614]  do_syscall_64+0x78/0x170
[  699.793371]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  699.794406] RIP: 0033:0x7f9bf930d800
[  699.795134] Code: 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 83 3d 49 bf 2c 00 00 75 10 b8 4b 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 31 c3 48 83 ec 08 e8 be 78 01 00 48 89 04 24
[  699.798960] RSP: 002b:00007ffee3606c68 EFLAGS: 00000246 ORIG_RAX: 000000000000004b
[  699.800483] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f9bf930d800
[  699.801923] RDX: 0000000000008000 RSI: 00000000006010a0 RDI: 0000000000000003
[  699.803373] RBP: 00007ffee3606ca0 R08: 0000000001503010 R09: 0000000000000000
[  699.804798] R10: 00000000000002e8 R11: 0000000000000246 R12: 0000000000400610
[  699.806233] R13: 00007ffee3606da0 R14: 0000000000000000 R15: 0000000000000000
[  699.807667] Modules linked in: snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hwdep snd_hda_core snd_pcm snd_timer snd mac_hid i2c_piix4 soundcore ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx raid1 raid0 multipath linear 8139too crct10dif_pclmul crc32_pclmul qxl drm_kms_helper syscopyarea aesni_intel sysfillrect sysimgblt fb_sys_fops ttm drm aes_x86_64 crypto_simd cryptd 8139cp glue_helper mii pata_acpi floppy
[  699.817079] ---[ end trace 4ce02f25ff7d3df6 ]---
[  699.818068] RIP: 0010:f2fs_submit_page_bio+0x29b/0x730
[  699.819114] Code: 54 49 8d bd 18 04 00 00 e8 b2 59 af ff 41 8b 8d 18 04 00 00 8b 45 b8 41 d3 e6 44 01 f0 4c 8d 73 14 41 39 c7 0f 82 37 fe ff ff <0f> 0b 65 8b 05 2c 04 77 47 89 c0 48 0f a3 05 52 c1 d5 01 0f 92 c0
[  699.822919] RSP: 0018:ffff8801f43af508 EFLAGS: 00010283
[  699.823977] RAX: 0000000000000000 RBX: ffff8801f43af7b8 RCX: ffffffffb88a7cef
[  699.825436] RDX: 0000000000000007 RSI: dffffc0000000000 RDI: ffff8801e3e7a64c
[  699.826881] RBP: ffff8801f43af558 R08: ffffed003e066b55 R09: ffffed003e066b55
[  699.828292] R10: 0000000000000001 R11: ffffed003e066b54 R12: ffffea0007876940
[  699.829750] R13: ffff8801f0335500 R14: ffff8801e3e7a600 R15: 0000000000000001
[  699.831192] FS:  00007f9bf97f5700(0000) GS:ffff8801f6e00000(0000) knlGS:0000000000000000
[  699.832793] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  699.833981] CR2: 00007f9bf925d170 CR3: 00000001f0c34000 CR4: 00000000000006f0
[  699.835556] ==================================================================
[  699.837029] BUG: KASAN: stack-out-of-bounds in update_stack_state+0x38c/0x3e0
[  699.838462] Read of size 8 at addr ffff8801f43af970 by task a.out/1309

[  699.840086] CPU: 0 PID: 1309 Comm: a.out Tainted: G      D W         4.18.0-rc1+ #4
[  699.841603] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
[  699.843475] Call Trace:
[  699.843982]  dump_stack+0x7b/0xb5
[  699.844661]  print_address_description+0x70/0x290
[  699.845607]  kasan_report+0x291/0x390
[  699.846351]  ? update_stack_state+0x38c/0x3e0
[  699.853831]  __asan_load8+0x54/0x90
[  699.854569]  update_stack_state+0x38c/0x3e0
[  699.855428]  ? __read_once_size_nocheck.constprop.7+0x20/0x20
[  699.856601]  ? __save_stack_trace+0x5e/0x100
[  699.857476]  unwind_next_frame.part.5+0x18e/0x490
[  699.858448]  ? unwind_dump+0x290/0x290
[  699.859217]  ? clear_page_dirty_for_io+0x332/0x450
[  699.860185]  __unwind_start+0x106/0x190
[  699.860974]  __save_stack_trace+0x5e/0x100
[  699.861808]  ? __save_stack_trace+0x5e/0x100
[  699.862691]  ? unlink_anon_vmas+0xba/0x2c0
[  699.863525]  save_stack_trace+0x1f/0x30
[  699.864312]  save_stack+0x46/0xd0
[  699.864993]  ? __alloc_pages_slowpath+0x1420/0x1420
[  699.865990]  ? flush_tlb_mm_range+0x15e/0x220
[  699.866889]  ? kasan_check_write+0x14/0x20
[  699.867724]  ? __dec_node_state+0x92/0xb0
[  699.868543]  ? lock_page_memcg+0x85/0xf0
[  699.869350]  ? unlock_page_memcg+0x16/0x80
[  699.870185]  ? page_remove_rmap+0x198/0x520
[  699.871048]  ? mark_page_accessed+0x133/0x200
[  699.871930]  ? _cond_resched+0x1a/0x50
[  699.872700]  ? unmap_page_range+0xcd4/0xe50
[  699.873551]  ? rb_next+0x58/0x80
[  699.874217]  ? rb_next+0x58/0x80
[  699.874895]  __kasan_slab_free+0x13c/0x1a0
[  699.875734]  ? unlink_anon_vmas+0xba/0x2c0
[  699.876563]  kasan_slab_free+0xe/0x10
[  699.877315]  kmem_cache_free+0x89/0x1e0
[  699.878095]  unlink_anon_vmas+0xba/0x2c0
[  699.878913]  free_pgtables+0x101/0x1b0
[  699.879677]  exit_mmap+0x146/0x2a0
[  699.880378]  ? __ia32_sys_munmap+0x50/0x50
[  699.881214]  ? kasan_check_read+0x11/0x20
[  699.882052]  ? mm_update_next_owner+0x322/0x380
[  699.882985]  mmput+0x8b/0x1d0
[  699.883602]  do_exit+0x43a/0x1390
[  699.884288]  ? mm_update_next_owner+0x380/0x380
[  699.885212]  ? f2fs_sync_file+0x9a/0xb0
[  699.885995]  ? f2fs_do_sync_file+0xd90/0xd90
[  699.886877]  ? vfs_fsync_range+0x68/0x100
[  699.887694]  ? __fget_light+0xc9/0xe0
[  699.888442]  ? do_fsync+0x3d/0x70
[  699.889118]  ? __x64_sys_fdatasync+0x24/0x30
[  699.889996]  rewind_stack_do_exit+0x17/0x20
[  699.890860] RIP: 0033:0x7f9bf930d800
[  699.891585] Code: Bad RIP value.
[  699.892268] RSP: 002b:00007ffee3606c68 EFLAGS: 00000246 ORIG_RAX: 000000000000004b
[  699.893781] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f9bf930d800
[  699.895220] RDX: 0000000000008000 RSI: 00000000006010a0 RDI: 0000000000000003
[  699.896643] RBP: 00007ffee3606ca0 R08: 0000000001503010 R09: 0000000000000000
[  699.898069] R10: 00000000000002e8 R11: 0000000000000246 R12: 0000000000400610
[  699.899505] R13: 00007ffee3606da0 R14: 0000000000000000 R15: 0000000000000000

[  699.901241] The buggy address belongs to the page:
[  699.902215] page:ffffea0007d0ebc0 count:0 mapcount:0 mapping:0000000000000000 index:0x0
[  699.903811] flags: 0x2ffff0000000000()
[  699.904585] raw: 02ffff0000000000 0000000000000000 ffffffff07d00101 0000000000000000
[  699.906125] raw: 0000000000000000 0000000000240000 00000000ffffffff 0000000000000000
[  699.907673] page dumped because: kasan: bad access detected

[  699.909108] Memory state around the buggy address:
[  699.910077]  ffff8801f43af800: 00 f1 f1 f1 f1 00 f4 f4 f4 f3 f3 f3 f3 00 00 00
[  699.911528]  ffff8801f43af880: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[  699.912953] >ffff8801f43af900: 00 00 00 00 00 00 00 00 f1 01 f4 f4 f4 f2 f2 f2
[  699.914392]                                                              ^
[  699.915758]  ffff8801f43af980: f2 00 f4 f4 00 00 00 00 f2 00 00 00 00 00 00 00
[  699.917193]  ffff8801f43afa00: 00 00 00 00 00 00 00 00 00 f3 f3 f3 00 00 00 00
[  699.918634] ==================================================================

- Location
https://elixir.bootlin.com/linux/v4.18-rc1/source/fs/f2fs/segment.h#L644

Reported-by Wen Xu <wen.xu@gatech.edu>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-08-01 11:52:32 -07:00
Linus Torvalds
cdbb65c4c7 squashfs metadata 2: electric boogaloo
Anatoly continues to find issues with fuzzed squashfs images.

This time, corrupt, missing, or undersized data for the page filling
wasn't checked for, because the squashfs_{copy,read}_cache() functions
did the squashfs_copy_data() call without checking the resulting data
size.

Which could result in the page cache pages being incompletely filled in,
and no error indication to the user space reading garbage data.

So make a helper function for the "fill in pages" case, because the
exact same incomplete sequence existed in two places.

[ I should have made a squashfs branch for these things, but I didn't
  intend to start doing them in the first place.

  My historical connection through cramfs is why I got into looking at
  these issues at all, and every time I (continue to) think it's a
  one-off.

  Because _this_ time is always the last time. Right?   - Linus ]

Reported-by: Anatoly Trosinenko <anatoly.trosinenko@gmail.com>
Tested-by: Willy Tarreau <w@1wt.eu>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Phillip Lougher <phillip@squashfs.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-01 10:38:43 -07:00
Theodore Ts'o
7d95178c77 ext4: check for NUL characters in extended attribute's name
Extended attribute names are defined to be NUL-terminated, so the name
must not contain a NUL character.  This is important because there are
places when remove extended attribute, the code uses strlen to
determine the length of the entry.  That should probably be fixed at
some point, but code is currently really messy, so the simplest fix
for now is to simply validate that the extended attributes are sane.

https://bugzilla.kernel.org/show_bug.cgi?id=200401

Reported-by: Wen Xu <wen.xu@gatech.edu>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2018-08-01 12:36:52 -04:00
Wang Shilong
5ef2a69993 ext4: use ext4_warning() for sb_getblk failure
Out of memory should not be considered as critical errors; so replace
ext4_error() with ext4_warnig().

Signed-off-by: Wang Shilong <wshilong@ddn.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2018-08-01 12:02:31 -04:00
Darrick J. Wong
56830d6cc1 xfs: check da node magic in _node_lookup_int
Before we start processing what we /think/ is a da3 node block, actually
check the magic to make sure that we're looking at a node block.  This
way we won't blow the asserts in _node_hdr_from_disk on corrupted
metadata.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
2018-08-01 07:42:43 -07:00
Darrick J. Wong
611995db2c xfs: use a local variable for magic number in xfs_da3_node_lookup_int
Use a local variable for the block magic number checks instead of
abusing blk->magic.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
2018-08-01 07:42:18 -07:00
Darrick J. Wong
0c60d3aa0e xfs: refactor log recovery check
Add a predicate to decide if the log is actively in recovery and use
that instead of open-coding a pagf_init check in the attr leaf verifier.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
2018-08-01 07:40:48 -07:00
Darrick J. Wong
ff23f4af7e xfs: move extent busy tree initialization to xfs_initialize_perag
Move the per-AG busy extent tree initialization to the per-ag structure
initialization since we don't want online repair to leak the old tree.
We only deconstruct the tree at unmount time, so this should be safe.
This also enables us to eliminate the commented out initialization in
the xfsprogs libxfs.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2018-07-31 13:18:09 -07:00
Christoph Hellwig
e666aa37f4 xfs: avoid COW fork extent lookups in writeback if the fork didn't change
Used the per-fork sequence counter to avoid lookups in the writeback code
unless the COW fork actually changed.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-31 13:18:09 -07:00
Christoph Hellwig
745b3f76d1 xfs: maintain a sequence count for inode fork manipulations
Add a simple 32-bit unsigned integer as the sequence count for
modifications to the extent list in the inode fork.  This will be
used to optimize away extent list lookups in the writeback code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-31 13:18:09 -07:00
Darrick J. Wong
9e037cb797 xfs: check for unknown v5 feature bits in superblock write verifier
Make sure we never try to write the superblock with unknown feature bits
set.  We checked those at mount time, so if they're set now then memory
is corrupt.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
2018-07-31 13:18:09 -07:00
Darrick J. Wong
69775fd15d xfs: verify icount in superblock write
Add a helper predicate to check the inode count for sanity, then use it
in the superblock write verifier to inspect sb_icount.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Bill O'Donnell <billodo@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
2018-07-31 13:18:09 -07:00
Bill O'Donnell
8756a5af18 libxfs: add more bounds checking to sb sanity checks
Current sb verifier doesn't check bounds on sb_fdblocks and sb_ifree.
Add sanity checks for these parameters.

Signed-off-by: Bill O'Donnell <billodo@redhat.com>
[darrick: port to refactored sb validation predicates]
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
2018-07-31 13:18:09 -07:00
Darrick J. Wong
eca383fcd6 xfs: refactor superblock verifiers
Split the superblock verifier into the common checks, the read-time
checks, and the write-time check functions.  No functional changes, but
we're setting up to add more write-only checks.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
2018-07-31 13:18:09 -07:00
Darrick J. Wong
86d969b425 xfs: refactor the xrep_extent_list into xfs_bitmap
As mentioned previously, the xrep_extent_list basically implements a
bitmap with two functions: set and disjoint union.  Rename all these
functions to xfs_bitmap to shorten the name and make it more obvious
what we're doing.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2018-07-31 13:18:08 -07:00
Bill Baker
0f90be132c NFSv4 client live hangs after live data migration recovery
After a live data migration event at the NFS server, the client may send
I/O requests to the wrong server, causing a live hang due to repeated
recovery events.  On the wire, this will appear as an I/O request failing
with NFS4ERR_BADSESSION, followed by successful CREATE_SESSION, repeatedly.
NFS4ERR_BADSSESSION is returned because the session ID being used was
issued by the other server and is not valid at the old server.

The failure is caused by async worker threads having cached the transport
(xprt) in the rpc_task structure.  After the migration recovery completes,
the task is redispatched and the task resends the request to the wrong
server based on the old value still present in tk_xprt.

The solution is to recompute the tk_xprt field of the rpc_task structure
so that the request goes to the correct server.

Signed-off-by: Bill Baker <bill.baker@oracle.com>
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Helen Chao <helen.chao@oracle.com>
Fixes: fb43d17210 ("SUNRPC: Use the multipath iterator to assign a ...")
Cc: stable@vger.kernel.org # v4.9+
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-07-31 12:53:40 -04:00
Olga Kornievskaia
32cd3ee511 NFSv4.0 fix client reference leak in callback
If there is an error during processing of a callback message, it leads
to refrence leak on the client structure and eventually an unclean
superblock.

Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-07-31 12:53:40 -04:00
Dan Carpenter
379ebf0796 NFS: silence a harmless uninitialized variable warning
kstrtoul() can return -ERANGE so Smatch complains that "num" can be
uninitialized.  We check that it's within bounds so it's not a huge
deal.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-07-31 12:53:40 -04:00
Dave Wysochanski
016583d703 sunrpc: Change rpc_print_iostats to rpc_clnt_show_stats and handle rpc_clnt clones
The existing rpc_print_iostats has a few shortcomings.  First, the naming
is not consistent with other functions in the kernel that display stats.
Second, it is really displaying stats for an rpc_clnt structure as it
displays both xprt stats and per-op stats.  Third, it does not handle
rpc_clnt clones, which is important for the one in-kernel tree caller
of this function, the NFS client's nfs_show_stats function.

Fix all of the above by renaming the rpc_print_iostats to
rpc_clnt_show_stats and looping through any rpc_clnt clones via
cl_parent.

Once this interface is fixed, this addresses a problem with NFSv4.
Before this patch, the /proc/self/mountstats always showed incorrect
counts for NFSv4 lease and session related opcodes such as SEQUENCE,
RENEW, SETCLIENTID, CREATE_SESSION, etc.  These counts were always 0
even though many ops would go over the wire.  The reason for this is
there are multiple rpc_clnt structures allocated for any given NFSv4
mount, and inside nfs_show_stats() we callled into rpc_print_iostats()
which only handled one of them, nfs_server->client.  Fix these counts
by calling sunrpc's new rpc_clnt_show_stats() function, which handles
cloned rpc_clnt structs and prints the stats together.

Note that one side-effect of the above is that multiple mounts from
the same NFS server will show identical counts in the above ops due
to the fact the one rpc_clnt (representing the NFSv4 client state)
is shared across mounts.

Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-07-31 12:53:35 -04:00
Zubin Mithra
5248ee8560 tracefs: Annotate tracefs_ops with __ro_after_init
tracefs_ops is initialized inside tracefs_create_instance_dir and not
modified after. tracefs_create_instance_dir allows for initialization
only once, and is called from create_trace_instances(marked __init),
which is called from tracer_init_tracefs(marked __init). Also, mark
tracefs_create_instance_dir as __init.

Link: http://lkml.kernel.org/r/20180725171901.4468-1-zsm@chromium.org

Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Zubin Mithra <zsm@chromium.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2018-07-31 11:32:44 -04:00
Linus Torvalds
d512584780 squashfs: more metadata hardening
Anatoly reports another squashfs fuzzing issue, where the decompression
parameters themselves are in a compressed block.

This causes squashfs_read_data() to be called in order to read the
decompression options before the decompression stream having been set
up, making squashfs go sideways.

Reported-by: Anatoly Trosinenko <anatoly.trosinenko@gmail.com>
Acked-by: Phillip Lougher <phillip.lougher@gmail.com>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-07-30 17:29:17 -07:00
Mauro Carvalho Chehab
d21c249b26 media: dvb/audio.h: get rid of unused APIs
There are a number of other ioctls that aren't used anywhere
inside the Kernel tree.

Get rid of them.

Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
2018-07-30 16:21:49 -04:00
Mauro Carvalho Chehab
b41e44b4cb media: dvb/video.h: get rid of unused APIs
There are a number of other ioctls that aren't used anywhere
inside the Kernel tree.

Get rid of them.

Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
2018-07-30 15:43:47 -04:00
Dan Carpenter
0914bb965e pnfs/blocklayout: off by one in bl_map_stripe()
"dev->nr_children" is the number of children which were parsed
successfully in bl_parse_stripe().  It could be all of them and then, in
that case, it is equal to v->stripe.volumes_count.  Either way, the >
should be >= so that we don't go beyond the end of what we're supposed
to.

Fixes: 5c83746a0c ("pnfs/blocklayout: in-kernel GETDEVICEINFO XDR parsing")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: stable@vger.kernel.org # 3.17+
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-07-30 13:19:40 -04:00
Calum Mackay
23a88ade71 nfs: Referrals not inheriting proto setting from parent
Commit 530ea42192 ("nfs: Referrals should use the same proto setting
as their parent") encloses the fix with #ifdef CONFIG_SUNRPC_XPRT_RDMA.

CONFIG_SUNRPC_XPRT_RDMA is a tristate option, so it should be tested
with #if IS_ENABLED().

Fixes: 530ea42192 ("nfs: Referrals should use the same proto setting as their parent")
Reported-by: Helen Chao <helen.chao@oracle.com>
Tested-by: Helen Chao <helen.chao@oracle.com>
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Bill Baker <bill.baker@oracle.com>
Signed-off-by: Calum Mackay <calum.mackay@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-07-30 13:19:40 -04:00
Jeff Layton
8b199e58d4 nfs: initiate returning delegation when reclaiming one that's been recalled
When reclaiming a delegation via CLAIM_PREVIOUS open, the server can
indicate that the delegation has been recalled since it was issued by
setting the "recalled" flag in the delegation.

Ensure that we respect the flag by initiating a delegation return when
it is set.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-07-30 13:19:40 -04:00
Souptick Joarder
01a368441f fs: nfs: Adding new return type vm_fault_t
Use new return type vm_fault_t for fault handler
in struct vm_operations_struct. For now, this is
just documenting that the function returns a
VM_FAULT value rather than an errno.  Once all
instances are converted, vm_fault_t will become
a distinct type.

see commit 1c8f422059 ("mm: change return type to
vm_fault_t") for reference.

Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-07-30 13:19:40 -04:00
Chengguang Xu
12b289cfac nfs: add error check in nfs_idmap_prepare_message()
Even though the caller of nfs_idmap_prepare_message() checks return
code in their side but it's better to add an error check for match_int()
so that we can avoid unnecessary operations when bad int arg is
detected.

Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-07-30 13:19:40 -04:00
Huaisheng Ye
86ed913b0e filesystem-dax: Do not request kaddr and pfn when not required
Some functions within fs/dax don't need to get local pointer kaddr
or variable pfn from direct_access. Using NULL instead of having to
pass in useless pointer or variable that caller then just throw away.

Signed-off-by: Huaisheng Ye <yehs1@lenovo.com>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
2018-07-30 09:40:42 -07:00
Christoph Hellwig
51d6269030 xfs: introduce a new xfs_inode_has_cow_data helper
We have a few places that already check if an inode has actual data in
the COW fork to avoid work on reflink inodes that do not actually have
outstanding COW blocks.  There are a few more places that can avoid
working if doing the same check, so add a documented helper for this
condition and use it in all places where it makes sense.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-30 07:57:48 -07:00
Christoph Hellwig
3ba738df25 xfs: remove the xfs_ifork_t typedef
We only have a few more callers left, so seize the opportunity and kill
it off.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-30 07:57:48 -07:00
Christoph Hellwig
1216b58b35 xfs: simplify xfs_idata_realloc
Streamline the code and take advantage of the fact that kmem_realloc
through krealloc will be have like a normal allocation if passing in a
NULL old pointer.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-30 07:57:48 -07:00
Christoph Hellwig
fcacbc3f51 xfs: remove if_real_bytes
The field is only used for asserts, and to track if we really need to do
realloc when growing the inode fork data.  But the krealloc function
already performs this check internally, so there is no need to keep track
of the real allocation size.

This will free space in the inode fork for keeping a sequence counter of
changes to the extent list.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-30 07:57:48 -07:00
Greg Kroah-Hartman
d2fc88a61b Merge 4.18-rc7 into driver-core-next
We need the driver core changes in here as well for testing.

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-07-30 10:08:09 +02:00
Darrick J. Wong
bc270b53e6 xfs: move the repair extent list into its own file
Move the xrep_extent_list code into a separate file.  Logically, this
data structure is really just a clumsy bitmap, and in the next patch
we'll make this more obvious.  No functional changes.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2018-07-29 22:37:09 -07:00
Darrick J. Wong
ebcbef3a61 xfs: pass transaction lock while setting up agresv on cyclic metadata
Pass a tranaction pointer through to all helpers that calculate the
per-AG block reservation.  Online repair will use this to reinitialize
per-ag reservations while it still holds all the AG headers locked to
the repair transaction.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2018-07-29 22:37:08 -07:00
Wang Shilong
9af0b3d125 ext4: fix race when setting the bitmap corrupted flag
Whenever we hit block or inode bitmap corruptions we set
bit and then reduce this block group free inode/clusters
counter to expose right available space.

However some of ext4_mark_group_bitmap_corrupted() is called
inside group spinlock, some are not, this could make it happen
that we double reduce one block group free counters from system.

Always hold group spinlock for it could fix it, but it looks
a little heavy, we could use test_and_set_bit() to fix race
problems here.

Signed-off-by: Wang Shilong <wshilong@ddn.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2018-07-29 17:27:45 -04:00
Eric Sandeen
f39b3f45db ext4: reset error code in ext4_find_entry in fallback
When ext4_find_entry() falls back to "searching the old fashioned
way" due to a corrupt dx dir, it needs to reset the error code
to NULL so that the nonstandard ERR_BAD_DX_DIR code isn't returned
to userspace.

https://bugzilla.kernel.org/show_bug.cgi?id=199947

Reported-by: Anatoly Trosinenko <anatoly.trosinenko@yandex.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2018-07-29 17:13:42 -04:00
Ross Zwisler
430657b6be ext4: handle layout changes to pinned DAX mappings
Follow the lead of xfs_break_dax_layouts() and add synchronization between
operations in ext4 which remove blocks from an inode (hole punch, truncate
down, etc.) and pages which are pinned due to DAX DMA operations.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
2018-07-29 17:00:22 -04:00
Ross Zwisler
cdbf8897cb dax: dax_layout_busy_page() warn on !exceptional
Inodes using DAX should only ever have exceptional entries in their page
caches.  Make this clear by warning if the iteration in
dax_layout_busy_page() ever sees a non-exceptional entry, and by adding a
comment for the pagevec_release() call which only deals with struct page
pointers.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2018-07-29 16:59:16 -04:00
Linus Torvalds
3cfb6772d4 Some miscellaneous ext4 fixes for 4.18; one fix is for a regression
introduced in 4.18-rc4.
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAlteF34ACgkQ8vlZVpUN
 gaMQugf+LjlbbncSEuPxZ+C3CnSGkEzjrg8IRylZA2uf04Z5Bax8K5gqvXLx7ZtF
 Qz3vzmrYpaUV8UiaMy0SGLCRWebwoxPEN7ZX3/W1PfeymP3wQ4DLw37059AzLfsq
 Vzh9w3N1At1plUee7iJ2MDBU830Q0a917jjnpZ+M0AtQx/BzP8QEISuzp4JWICqe
 NbJDVybMWoW2YOSpMPiihxSFqCDx5rMyAJ1vllboopZK+XAjpQ/visnLh3aT3o71
 7cTPl9gI2rbwYbJk8kM5fmXhWqSARHARV1bpZNOUnCAUU1E2Se7aETjggQ0QzJE/
 mIc7wCzFLrrY8+iakwdhb5Aw3qOPyg==
 =ZdXo
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 fixes from Ted Ts'o:
 "Some miscellaneous ext4 fixes for 4.18; one fix is for a regression
  introduced in 4.18-rc4.

  Sorry for the late-breaking pull. I was originally going to wait for
  the next merge window, but Eric Whitney found a regression introduced
  in 4.18-rc4, so I decided to push out the regression plus the other
  fixes now. (The other commits have been baking in linux-next since
  early July)"

* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: fix check to prevent initializing reserved inodes
  ext4: check for allocation block validity with block group locked
  ext4: fix inline data updates with checksums enabled
  ext4: clear mmp sequence number when remounting read-only
  ext4: fix false negatives *and* false positives in ext4_check_descriptors()
2018-07-29 13:13:45 -07:00
Gustavo A. R. Silva
62bbdd9974 ext4: use swap macro in mext_page_double_lock
Make use of the swap macro and remove unnecessary variable *tmp*.
This makes the code easier to read and maintain.

This code was detected with the help of Coccinelle.

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2018-07-29 16:11:59 -04:00
Tyler Hicks
d175339027 sysfs: Fix regression when adding a file to an existing group
Commit 5f81880d52 ("sysfs, kobject: allow creating kobject belonging
to arbitrary users") incorrectly changed the argument passed as the
parent parameter when calling sysfs_add_file_mode_ns(). This caused some
sysfs attribute files to not be added correctly to certain groups.

Fixes: 5f81880d52 ("sysfs, kobject: allow creating kobject belonging to arbitrary users")
Signed-off-by: Tyler Hicks <tyhicks@canonical.com>
Reported-by: Heiner Kallweit <hkallweit1@gmail.com>
Tested-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-29 13:11:28 -07:00
Chengguang Xu
21ac738ede ext4: check allocation failure when duplicating "data" in ext4_remount()
There is no check for allocation failure when duplicating
"data" in ext4_remount(). Check for failure and return
error -ENOMEM in this case.

Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
2018-07-29 15:51:54 -04:00
Junichi Uekawa
7f144fd046 ext4: fix warning message in ext4_enable_quotas()
Output the warning message before we clobber type and be -1 all the time.
The error message would now be

[    1.519791] EXT4-fs warning (device vdb): ext4_enable_quotas:5402:
Failed to enable quota tracking (type=0, err=-3). Please run e2fsck to fix.

Signed-off-by: Junichi Uekawa <uekawa@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
2018-07-29 15:51:52 -04:00
Arnd Bergmann
6a0678a79b ext4: super: extend timestamps to 40 bits
The inode timestamps use 34 bits in ext4, but the various timestamps in
the superblock are limited to 32 bits. If every user accesses these as
'unsigned', then this is good until year 2106, but it seems better to
extend this a bit further in the process of removing the deprecated
get_seconds() function.

This adds another byte for each timestamp in the superblock, making
them long enough to store timestamps beyond what is in the inodes,
which seems good enough here (in ocfs2, they are already 64-bit wide,
which is appropriate for a new layout).

I did not modify e2fsprogs, which obviously needs the same change to
actually interpret future timestamps correctly.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2018-07-29 15:51:48 -04:00
Arnd Bergmann
b42d1d6b5b jbd2: replace current_kernel_time64 with ktime equivalent
jbd2 is one of the few callers of current_kernel_time64(), which
is a wrapper around ktime_get_coarse_real_ts64(). This calls the
latter directly for consistency with the rest of the kernel that
is moving to the ktime_get_ family of time accessors.

Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2018-07-29 15:51:47 -04:00
Arnd Bergmann
7b62b29320 ext4: use timespec64 for all inode times
This is the last missing piece for the inode times on 32-bit systems:
now that VFS interfaces use timespec64, we just need to stop truncating
the tv_sec values for y2038 compatibililty.

Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2018-07-29 15:51:00 -04:00
Arnd Bergmann
5ffff83432 ext4: use ktime_get_real_seconds for i_dtime
We only care about the low 32-bit for i_dtime as explained in commit
b5f515735b ("ext4: avoid Y2038 overflow in recently_deleted()"), so
the use of get_seconds() is correct here, but that function is getting
removed in the process of the y2038 fixes, so let's use the modern
ktime_get_real_seconds() here.

Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2018-07-29 15:50:00 -04:00
Arnd Bergmann
af123b3718 ext4: use 64-bit timestamps for mmp_time
The mmp_time field is 64 bits wide, which is good, but calling
get_seconds() results in a 32-bit value on 32-bit architectures. Using
ktime_get_real_seconds() instead returns 64 bits everywhere.

Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2018-07-29 15:49:00 -04:00
Arnd Bergmann
a4d2aadca1 ext4: sysfs: print ext4_super_block fields as little-endian
While working on extended rand for last_error/first_error timestamps,
I noticed that the endianess is wrong; we access the little-endian
fields in struct ext4_super_block as native-endian when we print them.

This adds a special case in ext4_attr_show() and ext4_attr_store()
to byteswap the superblock fields if needed.

In older kernels, this code was part of super.c, it got moved to
sysfs.c in linux-4.4.

Cc: stable@vger.kernel.org
Fixes: 52c198c682 ("ext4: add sysfs entry showing whether the fs contains errors")
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2018-07-29 15:48:00 -04:00
Linus Torvalds
01cfb7937a squashfs: be more careful about metadata corruption
Anatoly Trosinenko reports that a corrupted squashfs image can cause a
kernel oops.  It turns out that squashfs can end up being confused about
negative fragment lengths.

The regular squashfs_read_data() does check for negative lengths, but
squashfs_read_metadata() did not, and the fragment size code just
blindly trusted the on-disk value.  Fix both the fragment parsing and
the metadata reading code.

Reported-by: Anatoly Trosinenko <anatoly.trosinenko@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Phillip Lougher <phillip@squashfs.org.uk>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-07-29 12:44:46 -07:00
Theodore Ts'o
5012284700 ext4: fix check to prevent initializing reserved inodes
Commit 8844618d8a: "ext4: only look at the bg_flags field if it is
valid" will complain if block group zero does not have the
EXT4_BG_INODE_ZEROED flag set.  Unfortunately, this is not correct,
since a freshly created file system has this flag cleared.  It gets
almost immediately after the file system is mounted read-write --- but
the following somewhat unlikely sequence will end up triggering a
false positive report of a corrupted file system:

   mkfs.ext4 /dev/vdc
   mount -o ro /dev/vdc /vdc
   mount -o remount,rw /dev/vdc

Instead, when initializing the inode table for block group zero, test
to make sure that itable_unused count is not too large, since that is
the case that will result in some or all of the reserved inodes
getting cleared.

This fixes the failures reported by Eric Whiteney when running
generic/230 and generic/231 in the the nojournal test case.

Fixes: 8844618d8a ("ext4: only look at the bg_flags field if it is valid")
Reported-by: Eric Whitney <enwlinux@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2018-07-29 15:34:00 -04:00
Chao Yu
10d255c354 f2fs: fix to skip GC if type in SSA and SIT is inconsistent
If segment type in SSA and SIT is inconsistent, we will encounter below
BUG_ON during GC, to avoid this panic, let's just skip doing GC on such
segment.

The bug is triggered with image reported in below link:

https://bugzilla.kernel.org/show_bug.cgi?id=200223

[  388.060262] ------------[ cut here ]------------
[  388.060268] kernel BUG at /home/y00370721/git/devf2fs/gc.c:989!
[  388.061172] invalid opcode: 0000 [#1] SMP
[  388.061773] Modules linked in: f2fs(O) bluetooth ecdh_generic xt_tcpudp iptable_filter ip_tables x_tables lp ttm drm_kms_helper drm intel_rapl sb_edac crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel fb_sys_fops ppdev aes_x86_64 syscopyarea crypto_simd sysfillrect parport_pc joydev sysimgblt glue_helper parport cryptd i2c_piix4 serio_raw mac_hid btrfs hid_generic usbhid hid raid6_pq psmouse pata_acpi floppy
[  388.064247] CPU: 7 PID: 4151 Comm: f2fs_gc-7:0 Tainted: G           O    4.13.0-rc1+ #26
[  388.065306] Hardware name: Xen HVM domU, BIOS 4.1.2_115-900.260_ 11/06/2015
[  388.066058] task: ffff880201583b80 task.stack: ffffc90004d7c000
[  388.069948] RIP: 0010:do_garbage_collect+0xcc8/0xcd0 [f2fs]
[  388.070766] RSP: 0018:ffffc90004d7fc68 EFLAGS: 00010202
[  388.071783] RAX: ffff8801ed227000 RBX: 0000000000000001 RCX: ffffea0007b489c0
[  388.072700] RDX: ffff880000000000 RSI: 0000000000000001 RDI: ffffea0007b489c0
[  388.073607] RBP: ffffc90004d7fd58 R08: 0000000000000003 R09: ffffea0007b489dc
[  388.074619] R10: 0000000000000000 R11: 0052782ab317138d R12: 0000000000000018
[  388.075625] R13: 0000000000000018 R14: ffff880211ceb000 R15: ffff880211ceb000
[  388.076687] FS:  0000000000000000(0000) GS:ffff880214fc0000(0000) knlGS:0000000000000000
[  388.083277] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  388.084536] CR2: 0000000000e18c60 CR3: 00000001ecf2e000 CR4: 00000000001406e0
[  388.085748] Call Trace:
[  388.086690]  ? find_next_bit+0xb/0x10
[  388.088091]  f2fs_gc+0x1a8/0x9d0 [f2fs]
[  388.088888]  ? lock_timer_base+0x7d/0xa0
[  388.090213]  ? try_to_del_timer_sync+0x44/0x60
[  388.091698]  gc_thread_func+0x342/0x4b0 [f2fs]
[  388.092892]  ? wait_woken+0x80/0x80
[  388.094098]  kthread+0x109/0x140
[  388.095010]  ? f2fs_gc+0x9d0/0x9d0 [f2fs]
[  388.096043]  ? kthread_park+0x60/0x60
[  388.097281]  ret_from_fork+0x25/0x30
[  388.098401] Code: ff ff 48 83 e8 01 48 89 44 24 58 e9 27 f8 ff ff 48 83 e8 01 e9 78 fc ff ff 48 8d 78 ff e9 17 fb ff ff 48 83 ef 01 e9 4d f4 ff ff <0f> 0b 66 0f 1f 44 00 00 0f 1f 44 00 00 55 48 89 e5 41 56 41 55
[  388.100864] RIP: do_garbage_collect+0xcc8/0xcd0 [f2fs] RSP: ffffc90004d7fc68
[  388.101810] ---[ end trace 81c73d6e6b7da61d ]---

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-07-28 18:26:08 -07:00
Chao Yu
4b270a8cc5 f2fs: try grabbing node page lock aggressively in sync scenario
In synchronous scenario, like in checkpoint(), we are going to flush
dirty node pages to device synchronously, we can easily failed
writebacking node page due to trylock_page() failure, especially in
condition of intensive lock competition, which can cause long latency
of checkpoint(). So let's use lock_page() in synchronous scenario to
avoid this issue.

Signed-off-by: Yunlei He <heyunlei@huawei.com>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-07-28 18:26:08 -07:00
Sahitya Tummala
dc1328027b f2fs: show the fsync_mode=nobarrier mount option
This patch shows the fsync_mode=nobarrier mount option in
f2fs_show_options().

Signed-off-by: Sahitya Tummala <stummala@codeaurora.org>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-07-28 18:26:08 -07:00
Yunlei He
68c43a235e f2fs: check the right return value of memory alloc function
This patch check the right return value of memory alloc function

Signed-off-by: Yunlei He <heyunlei@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-07-28 18:26:08 -07:00
Guenter Roeck
b138547818 f2fs: Replace strncpy with memcpy
gcc 8.1.0 complains:

fs/f2fs/namei.c: In function 'f2fs_update_extension_list':
fs/f2fs/namei.c:257:3: warning:
	'strncpy' output truncated before terminating nul copying
	as many bytes from a string as its length
fs/f2fs/namei.c:249:3: warning:
	'strncpy' output truncated before terminating nul copying
	as many bytes from a string as its length

Using strncpy() is indeed less than perfect since the length of data to
be copied has already been determined with strlen(). Replace strncpy()
with memcpy() to address the warning and optimize the code a little.

Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-07-28 18:26:08 -07:00
Gao Xiang
2d3a58566f f2fs: avoid the global name 'fault_name'
Non-prefix global name 'fault_name' will pollute global
namespace, fix it.

Refer to:
https://lists.01.org/pipermail/kbuild-all/2018-June/049660.html

To: Jaegeuk Kim <jaegeuk@kernel.org>
To: Chao Yu <yuchao0@huawei.com>
Cc: linux-f2fs-devel@lists.sourceforge.net
Cc: linux-kernel@vger.kernel.org
Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Gao Xiang <gaoxiang25@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-07-28 18:26:08 -07:00
Chao Yu
4dbe38dc38 f2fs: fix to do sanity check with reserved blkaddr of inline inode
As Wen Xu reported in bugzilla, after image was injected with random data
by fuzzing, inline inode would contain invalid reserved blkaddr, then
during inline conversion, we will encounter illegal memory accessing
reported by KASAN, the root cause of this is when writing out converted
inline page, we will use invalid reserved blkaddr to update sit bitmap,
result in accessing memory beyond sit bitmap boundary.

In order to fix this issue, let's do sanity check with reserved block
address of inline inode to avoid above condition.

https://bugzilla.kernel.org/show_bug.cgi?id=200179

[ 1428.846352] BUG: KASAN: use-after-free in update_sit_entry+0x80/0x7f0
[ 1428.846618] Read of size 4 at addr ffff880194483540 by task a.out/2741

[ 1428.846855] CPU: 0 PID: 2741 Comm: a.out Tainted: G        W         4.17.0+ #1
[ 1428.846858] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
[ 1428.846860] Call Trace:
[ 1428.846868]  dump_stack+0x71/0xab
[ 1428.846875]  print_address_description+0x6b/0x290
[ 1428.846881]  kasan_report+0x28e/0x390
[ 1428.846888]  ? update_sit_entry+0x80/0x7f0
[ 1428.846898]  update_sit_entry+0x80/0x7f0
[ 1428.846906]  f2fs_allocate_data_block+0x6db/0xc70
[ 1428.846914]  ? f2fs_get_node_info+0x14f/0x590
[ 1428.846920]  do_write_page+0xc8/0x150
[ 1428.846928]  f2fs_outplace_write_data+0xfe/0x210
[ 1428.846935]  ? f2fs_do_write_node_page+0x170/0x170
[ 1428.846941]  ? radix_tree_tag_clear+0xff/0x130
[ 1428.846946]  ? __mod_node_page_state+0x22/0xa0
[ 1428.846951]  ? inc_zone_page_state+0x54/0x100
[ 1428.846956]  ? __test_set_page_writeback+0x336/0x5d0
[ 1428.846964]  f2fs_convert_inline_page+0x407/0x6d0
[ 1428.846971]  ? f2fs_read_inline_data+0x3b0/0x3b0
[ 1428.846978]  ? __get_node_page+0x335/0x6b0
[ 1428.846987]  f2fs_convert_inline_inode+0x41b/0x500
[ 1428.846994]  ? f2fs_convert_inline_page+0x6d0/0x6d0
[ 1428.847000]  ? kasan_unpoison_shadow+0x31/0x40
[ 1428.847005]  ? kasan_kmalloc+0xa6/0xd0
[ 1428.847024]  f2fs_file_mmap+0x79/0xc0
[ 1428.847029]  mmap_region+0x58b/0x880
[ 1428.847037]  ? arch_get_unmapped_area+0x370/0x370
[ 1428.847042]  do_mmap+0x55b/0x7a0
[ 1428.847048]  vm_mmap_pgoff+0x16f/0x1c0
[ 1428.847055]  ? vma_is_stack_for_current+0x50/0x50
[ 1428.847062]  ? __fsnotify_update_child_dentry_flags.part.1+0x160/0x160
[ 1428.847068]  ? do_sys_open+0x206/0x2a0
[ 1428.847073]  ? __fget+0xb4/0x100
[ 1428.847079]  ksys_mmap_pgoff+0x278/0x360
[ 1428.847085]  ? find_mergeable_anon_vma+0x50/0x50
[ 1428.847091]  do_syscall_64+0x73/0x160
[ 1428.847098]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 1428.847102] RIP: 0033:0x7fb1430766ba
[ 1428.847103] Code: 89 f5 41 54 49 89 fc 55 53 74 35 49 63 e8 48 63 da 4d 89 f9 49 89 e8 4d 63 d6 48 89 da 4c 89 ee 4c 89 e7 b8 09 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 56 5b 5d 41 5c 41 5d 41 5e 41 5f c3 0f 1f 00
[ 1428.847162] RSP: 002b:00007ffc651d9388 EFLAGS: 00000246 ORIG_RAX: 0000000000000009
[ 1428.847167] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007fb1430766ba
[ 1428.847170] RDX: 0000000000000001 RSI: 0000000000001000 RDI: 0000000000000000
[ 1428.847173] RBP: 0000000000000003 R08: 0000000000000003 R09: 0000000000000000
[ 1428.847176] R10: 0000000000008002 R11: 0000000000000246 R12: 0000000000000000
[ 1428.847179] R13: 0000000000001000 R14: 0000000000008002 R15: 0000000000000000

[ 1428.847252] Allocated by task 2683:
[ 1428.847372]  kasan_kmalloc+0xa6/0xd0
[ 1428.847380]  kmem_cache_alloc+0xc8/0x1e0
[ 1428.847385]  getname_flags+0x73/0x2b0
[ 1428.847390]  user_path_at_empty+0x1d/0x40
[ 1428.847395]  vfs_statx+0xc1/0x150
[ 1428.847401]  __do_sys_newlstat+0x7e/0xd0
[ 1428.847405]  do_syscall_64+0x73/0x160
[ 1428.847411]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

[ 1428.847466] Freed by task 2683:
[ 1428.847566]  __kasan_slab_free+0x137/0x190
[ 1428.847571]  kmem_cache_free+0x85/0x1e0
[ 1428.847575]  filename_lookup+0x191/0x280
[ 1428.847580]  vfs_statx+0xc1/0x150
[ 1428.847585]  __do_sys_newlstat+0x7e/0xd0
[ 1428.847590]  do_syscall_64+0x73/0x160
[ 1428.847596]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

[ 1428.847648] The buggy address belongs to the object at ffff880194483300
                which belongs to the cache names_cache of size 4096
[ 1428.847946] The buggy address is located 576 bytes inside of
                4096-byte region [ffff880194483300, ffff880194484300)
[ 1428.848234] The buggy address belongs to the page:
[ 1428.848366] page:ffffea0006512000 count:1 mapcount:0 mapping:ffff8801f3586380 index:0x0 compound_mapcount: 0
[ 1428.848606] flags: 0x17fff8000008100(slab|head)
[ 1428.848737] raw: 017fff8000008100 dead000000000100 dead000000000200 ffff8801f3586380
[ 1428.848931] raw: 0000000000000000 0000000000070007 00000001ffffffff 0000000000000000
[ 1428.849122] page dumped because: kasan: bad access detected

[ 1428.849305] Memory state around the buggy address:
[ 1428.849436]  ffff880194483400: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 1428.849620]  ffff880194483480: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 1428.849804] >ffff880194483500: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 1428.849985]                                            ^
[ 1428.850120]  ffff880194483580: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 1428.850303]  ffff880194483600: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 1428.850498] ==================================================================

Reported-by: Wen Xu <wen.xu@gatech.edu>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-07-28 18:26:08 -07:00
Chao Yu
e34438c903 f2fs: fix to do sanity check with node footer and iblocks
This patch adds to do sanity check with below fields of inode to
avoid reported panic.
- node footer
- iblocks

https://bugzilla.kernel.org/show_bug.cgi?id=200223

- Overview
BUG() triggered in f2fs_truncate_inode_blocks() when un-mounting a mounted f2fs image after writing to it

- Reproduce

- POC (poc.c)

static void activity(char *mpoint) {

  char *foo_bar_baz;
  int err;

  static int buf[8192];
  memset(buf, 0, sizeof(buf));

  err = asprintf(&foo_bar_baz, "%s/foo/bar/baz", mpoint);

  // open / write / read
  int fd = open(foo_bar_baz, O_RDWR | O_TRUNC, 0777);
  if (fd >= 0) {
    write(fd, (char *)buf, 517);
    write(fd, (char *)buf, sizeof(buf));
    close(fd);
  }

}

int main(int argc, char *argv[]) {
  activity(argv[1]);
  return 0;
}

- Kernel meesage
[  552.479723] F2FS-fs (loop0): Mounted with checkpoint version = 2
[  556.451891] ------------[ cut here ]------------
[  556.451899] kernel BUG at fs/f2fs/node.c:987!
[  556.452920] invalid opcode: 0000 [#1] SMP KASAN PTI
[  556.453936] CPU: 1 PID: 1310 Comm: umount Not tainted 4.18.0-rc1+ #4
[  556.455213] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
[  556.457140] RIP: 0010:f2fs_truncate_inode_blocks+0x4a7/0x6f0
[  556.458280] Code: e8 ae ea ff ff 41 89 c7 c1 e8 1f 84 c0 74 0a 41 83 ff fe 0f 85 35 ff ff ff 81 85 b0 fe ff ff fb 03 00 00 e9 f7 fd ff ff 0f 0b <0f> 0b e8 62 b7 9a 00 48 8b bd a0 fe ff ff e8 56 54 ae ff 48 8b b5
[  556.462015] RSP: 0018:ffff8801f292f808 EFLAGS: 00010286
[  556.463068] RAX: ffffed003e73242d RBX: ffff8801f292f958 RCX: ffffffffb88b81bc
[  556.464479] RDX: 0000000000000000 RSI: 0000000000000004 RDI: ffff8801f3992164
[  556.465901] RBP: ffff8801f292f980 R08: ffffed003e73242d R09: ffffed003e73242d
[  556.467311] R10: 0000000000000001 R11: ffffed003e73242c R12: 00000000fffffc64
[  556.468706] R13: ffff8801f3992000 R14: 0000000000000058 R15: 00000000ffff8801
[  556.470117] FS:  00007f8029297840(0000) GS:ffff8801f6f00000(0000) knlGS:0000000000000000
[  556.471702] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  556.472838] CR2: 000055f5f57305d8 CR3: 00000001f18b0000 CR4: 00000000000006e0
[  556.474265] Call Trace:
[  556.474782]  ? f2fs_alloc_nid_failed+0xf0/0xf0
[  556.475686]  ? truncate_nodes+0x980/0x980
[  556.476516]  ? pagecache_get_page+0x21f/0x2f0
[  556.477412]  ? __asan_loadN+0xf/0x20
[  556.478153]  ? __get_node_page+0x331/0x5b0
[  556.478992]  ? reweight_entity+0x1e6/0x3b0
[  556.479826]  f2fs_truncate_blocks+0x55e/0x740
[  556.480709]  ? f2fs_truncate_data_blocks+0x20/0x20
[  556.481689]  ? __radix_tree_lookup+0x34/0x160
[  556.482630]  ? radix_tree_lookup+0xd/0x10
[  556.483445]  f2fs_truncate+0xd4/0x1a0
[  556.484206]  f2fs_evict_inode+0x5ce/0x630
[  556.485032]  evict+0x16f/0x290
[  556.485664]  iput+0x280/0x300
[  556.486300]  dentry_unlink_inode+0x165/0x1e0
[  556.487169]  __dentry_kill+0x16a/0x260
[  556.487936]  dentry_kill+0x70/0x250
[  556.488651]  shrink_dentry_list+0x125/0x260
[  556.489504]  shrink_dcache_parent+0xc1/0x110
[  556.490379]  ? shrink_dcache_sb+0x200/0x200
[  556.491231]  ? bit_wait_timeout+0xc0/0xc0
[  556.492047]  do_one_tree+0x12/0x40
[  556.492743]  shrink_dcache_for_umount+0x3f/0xa0
[  556.493656]  generic_shutdown_super+0x43/0x1c0
[  556.494561]  kill_block_super+0x52/0x80
[  556.495341]  kill_f2fs_super+0x62/0x70
[  556.496105]  deactivate_locked_super+0x6f/0xa0
[  556.497004]  deactivate_super+0x5e/0x80
[  556.497785]  cleanup_mnt+0x61/0xa0
[  556.498492]  __cleanup_mnt+0x12/0x20
[  556.499218]  task_work_run+0xc8/0xf0
[  556.499949]  exit_to_usermode_loop+0x125/0x130
[  556.500846]  do_syscall_64+0x138/0x170
[  556.501609]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  556.502659] RIP: 0033:0x7f8028b77487
[  556.503384] Code: 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 31 f6 e9 09 00 00 00 66 0f 1f 84 00 00 00 00 00 b8 a6 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d e1 c9 2b 00 f7 d8 64 89 01 48
[  556.507137] RSP: 002b:00007fff9f2e3598 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
[  556.508637] RAX: 0000000000000000 RBX: 0000000000ebd030 RCX: 00007f8028b77487
[  556.510069] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000000000ec41e0
[  556.511481] RBP: 0000000000ec41e0 R08: 0000000000000000 R09: 0000000000000014
[  556.512892] R10: 00000000000006b2 R11: 0000000000000246 R12: 00007f802908083c
[  556.514320] R13: 0000000000000000 R14: 0000000000ebd210 R15: 00007fff9f2e3820
[  556.515745] Modules linked in: snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hwdep snd_hda_core snd_pcm snd_timer snd mac_hid i2c_piix4 soundcore ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx raid1 raid0 multipath linear 8139too crct10dif_pclmul crc32_pclmul qxl drm_kms_helper syscopyarea aesni_intel sysfillrect sysimgblt fb_sys_fops ttm drm aes_x86_64 crypto_simd cryptd 8139cp glue_helper mii pata_acpi floppy
[  556.529276] ---[ end trace 4ce02f25ff7d3df5 ]---
[  556.530340] RIP: 0010:f2fs_truncate_inode_blocks+0x4a7/0x6f0
[  556.531513] Code: e8 ae ea ff ff 41 89 c7 c1 e8 1f 84 c0 74 0a 41 83 ff fe 0f 85 35 ff ff ff 81 85 b0 fe ff ff fb 03 00 00 e9 f7 fd ff ff 0f 0b <0f> 0b e8 62 b7 9a 00 48 8b bd a0 fe ff ff e8 56 54 ae ff 48 8b b5
[  556.535330] RSP: 0018:ffff8801f292f808 EFLAGS: 00010286
[  556.536395] RAX: ffffed003e73242d RBX: ffff8801f292f958 RCX: ffffffffb88b81bc
[  556.537824] RDX: 0000000000000000 RSI: 0000000000000004 RDI: ffff8801f3992164
[  556.539290] RBP: ffff8801f292f980 R08: ffffed003e73242d R09: ffffed003e73242d
[  556.540709] R10: 0000000000000001 R11: ffffed003e73242c R12: 00000000fffffc64
[  556.542131] R13: ffff8801f3992000 R14: 0000000000000058 R15: 00000000ffff8801
[  556.543579] FS:  00007f8029297840(0000) GS:ffff8801f6f00000(0000) knlGS:0000000000000000
[  556.545180] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  556.546338] CR2: 000055f5f57305d8 CR3: 00000001f18b0000 CR4: 00000000000006e0
[  556.547809] ==================================================================
[  556.549248] BUG: KASAN: stack-out-of-bounds in arch_tlb_gather_mmu+0x52/0x170
[  556.550672] Write of size 8 at addr ffff8801f292fd10 by task umount/1310

[  556.552338] CPU: 1 PID: 1310 Comm: umount Tainted: G      D           4.18.0-rc1+ #4
[  556.553886] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
[  556.555756] Call Trace:
[  556.556264]  dump_stack+0x7b/0xb5
[  556.556944]  print_address_description+0x70/0x290
[  556.557903]  kasan_report+0x291/0x390
[  556.558649]  ? arch_tlb_gather_mmu+0x52/0x170
[  556.559537]  __asan_store8+0x57/0x90
[  556.560268]  arch_tlb_gather_mmu+0x52/0x170
[  556.561110]  tlb_gather_mmu+0x12/0x40
[  556.561862]  exit_mmap+0x123/0x2a0
[  556.562555]  ? __ia32_sys_munmap+0x50/0x50
[  556.563384]  ? exit_aio+0x98/0x230
[  556.564079]  ? __x32_compat_sys_io_submit+0x260/0x260
[  556.565099]  ? taskstats_exit+0x1f4/0x640
[  556.565925]  ? kasan_check_read+0x11/0x20
[  556.566739]  ? mm_update_next_owner+0x322/0x380
[  556.567652]  mmput+0x8b/0x1d0
[  556.568260]  do_exit+0x43a/0x1390
[  556.568937]  ? mm_update_next_owner+0x380/0x380
[  556.569855]  ? deactivate_super+0x5e/0x80
[  556.570668]  ? cleanup_mnt+0x61/0xa0
[  556.571395]  ? __cleanup_mnt+0x12/0x20
[  556.572156]  ? task_work_run+0xc8/0xf0
[  556.572917]  ? exit_to_usermode_loop+0x125/0x130
[  556.573861]  rewind_stack_do_exit+0x17/0x20
[  556.574707] RIP: 0033:0x7f8028b77487
[  556.575428] Code: Bad RIP value.
[  556.576106] RSP: 002b:00007fff9f2e3598 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
[  556.577599] RAX: 0000000000000000 RBX: 0000000000ebd030 RCX: 00007f8028b77487
[  556.579020] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000000000ec41e0
[  556.580422] RBP: 0000000000ec41e0 R08: 0000000000000000 R09: 0000000000000014
[  556.581833] R10: 00000000000006b2 R11: 0000000000000246 R12: 00007f802908083c
[  556.583252] R13: 0000000000000000 R14: 0000000000ebd210 R15: 00007fff9f2e3820

[  556.584983] The buggy address belongs to the page:
[  556.585961] page:ffffea0007ca4bc0 count:0 mapcount:0 mapping:0000000000000000 index:0x0
[  556.587540] flags: 0x2ffff0000000000()
[  556.588296] raw: 02ffff0000000000 0000000000000000 dead000000000200 0000000000000000
[  556.589822] raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
[  556.591359] page dumped because: kasan: bad access detected

[  556.592786] Memory state around the buggy address:
[  556.593753]  ffff8801f292fc00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[  556.595191]  ffff8801f292fc80: 00 00 00 00 00 00 00 00 f1 f1 f1 f1 00 00 00 00
[  556.596613] >ffff8801f292fd00: 00 00 f3 00 00 00 00 f3 f3 00 00 00 00 f4 f4 f4
[  556.598044]                          ^
[  556.598797]  ffff8801f292fd80: f3 f3 f3 f3 00 00 00 00 00 00 00 00 00 00 00 00
[  556.600225]  ffff8801f292fe00: 00 00 00 00 00 00 00 00 f1 f1 f1 f1 00 f4 f4 f4
[  556.601647] ==================================================================

- Location
https://elixir.bootlin.com/linux/v4.18-rc1/source/fs/f2fs/node.c#L987
		case NODE_DIND_BLOCK:
			err = truncate_nodes(&dn, nofs, offset[1], 3);
			cont = 0;
			break;

		default:
			BUG(); <---
		}

Reported-by Wen Xu <wen.xu@gatech.edu>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-07-28 18:26:06 -07:00
Yunlei He
e15d54d500 f2fs: Allocate and stat mem used by free nid bitmap more accurately
This patch used f2fs_bitmap_size macro to calculate mem used by
free nid bitmap, and stat used mem including aligned part.

Signed-off-by: Yunlei He <heyunlei@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-07-28 18:23:26 -07:00
Chao Yu
9dc956b2c8 f2fs: fix to do sanity check with user_block_count
This patch fixs to do sanity check with user_block_count.

- Overview
Divide zero in utilization when mount() a corrupted f2fs image

- Reproduce (4.18 upstream kernel)

- Kernel message
[  564.099503] F2FS-fs (loop0): invalid crc value
[  564.101991] divide error: 0000 [#1] SMP KASAN PTI
[  564.103103] CPU: 1 PID: 1298 Comm: f2fs_discard-7: Not tainted 4.18.0-rc1+ #4
[  564.104584] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
[  564.106624] RIP: 0010:issue_discard_thread+0x248/0x5c0
[  564.107692] Code: ff ff 48 8b bd e8 fe ff ff 41 8b 9d 4c 04 00 00 e8 cd b8 ad ff 41 8b 85 50 04 00 00 31 d2 48 8d 04 80 48 8d 04 80 48 c1 e0 02 <48> f7 f3 83 f8 50 7e 16 41 c7 86 7c ff ff ff 01 00 00 00 41 c7 86
[  564.111686] RSP: 0018:ffff8801f3117dc0 EFLAGS: 00010206
[  564.112775] RAX: 0000000000000384 RBX: 0000000000000000 RCX: ffffffffb88c1e03
[  564.114250] RDX: 0000000000000000 RSI: dffffc0000000000 RDI: ffff8801e3aa4850
[  564.115706] RBP: ffff8801f3117f00 R08: 1ffffffff751a1d0 R09: fffffbfff751a1d0
[  564.117177] R10: 0000000000000001 R11: fffffbfff751a1d0 R12: 00000000fffffffc
[  564.118634] R13: ffff8801e3aa4400 R14: ffff8801f3117ed8 R15: ffff8801e2050000
[  564.120094] FS:  0000000000000000(0000) GS:ffff8801f6f00000(0000) knlGS:0000000000000000
[  564.121748] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  564.122923] CR2: 000000000202b078 CR3: 00000001f11ac000 CR4: 00000000000006e0
[  564.124383] Call Trace:
[  564.124924]  ? __issue_discard_cmd+0x480/0x480
[  564.125882]  ? __sched_text_start+0x8/0x8
[  564.126756]  ? __kthread_parkme+0xcb/0x100
[  564.127620]  ? kthread_blkcg+0x70/0x70
[  564.128412]  kthread+0x180/0x1d0
[  564.129105]  ? __issue_discard_cmd+0x480/0x480
[  564.130029]  ? kthread_associate_blkcg+0x150/0x150
[  564.131033]  ret_from_fork+0x35/0x40
[  564.131794] Modules linked in: snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hwdep snd_hda_core snd_pcm snd_timer snd mac_hid i2c_piix4 soundcore ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx raid1 raid0 multipath linear 8139too crct10dif_pclmul crc32_pclmul qxl drm_kms_helper syscopyarea aesni_intel sysfillrect sysimgblt fb_sys_fops ttm drm aes_x86_64 crypto_simd cryptd 8139cp glue_helper mii pata_acpi floppy
[  564.141798] ---[ end trace 4ce02f25ff7d3df5 ]---
[  564.142773] RIP: 0010:issue_discard_thread+0x248/0x5c0
[  564.143885] Code: ff ff 48 8b bd e8 fe ff ff 41 8b 9d 4c 04 00 00 e8 cd b8 ad ff 41 8b 85 50 04 00 00 31 d2 48 8d 04 80 48 8d 04 80 48 c1 e0 02 <48> f7 f3 83 f8 50 7e 16 41 c7 86 7c ff ff ff 01 00 00 00 41 c7 86
[  564.147776] RSP: 0018:ffff8801f3117dc0 EFLAGS: 00010206
[  564.148856] RAX: 0000000000000384 RBX: 0000000000000000 RCX: ffffffffb88c1e03
[  564.150424] RDX: 0000000000000000 RSI: dffffc0000000000 RDI: ffff8801e3aa4850
[  564.151906] RBP: ffff8801f3117f00 R08: 1ffffffff751a1d0 R09: fffffbfff751a1d0
[  564.153463] R10: 0000000000000001 R11: fffffbfff751a1d0 R12: 00000000fffffffc
[  564.154915] R13: ffff8801e3aa4400 R14: ffff8801f3117ed8 R15: ffff8801e2050000
[  564.156405] FS:  0000000000000000(0000) GS:ffff8801f6f00000(0000) knlGS:0000000000000000
[  564.158070] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  564.159279] CR2: 000000000202b078 CR3: 00000001f11ac000 CR4: 00000000000006e0
[  564.161043] ==================================================================
[  564.162587] BUG: KASAN: stack-out-of-bounds in from_kuid_munged+0x1d/0x50
[  564.163994] Read of size 4 at addr ffff8801f3117c84 by task f2fs_discard-7:/1298

[  564.165852] CPU: 1 PID: 1298 Comm: f2fs_discard-7: Tainted: G      D           4.18.0-rc1+ #4
[  564.167593] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
[  564.169522] Call Trace:
[  564.170057]  dump_stack+0x7b/0xb5
[  564.170778]  print_address_description+0x70/0x290
[  564.171765]  kasan_report+0x291/0x390
[  564.172540]  ? from_kuid_munged+0x1d/0x50
[  564.173408]  __asan_load4+0x78/0x80
[  564.174148]  from_kuid_munged+0x1d/0x50
[  564.174962]  do_notify_parent+0x1f5/0x4f0
[  564.175808]  ? send_sigqueue+0x390/0x390
[  564.176639]  ? css_set_move_task+0x152/0x340
[  564.184197]  do_exit+0x1290/0x1390
[  564.184950]  ? __issue_discard_cmd+0x480/0x480
[  564.185884]  ? mm_update_next_owner+0x380/0x380
[  564.186829]  ? __sched_text_start+0x8/0x8
[  564.187672]  ? __kthread_parkme+0xcb/0x100
[  564.188528]  ? kthread_blkcg+0x70/0x70
[  564.189333]  ? kthread+0x180/0x1d0
[  564.190052]  ? __issue_discard_cmd+0x480/0x480
[  564.190983]  rewind_stack_do_exit+0x17/0x20

[  564.192190] The buggy address belongs to the page:
[  564.193213] page:ffffea0007cc45c0 count:0 mapcount:0 mapping:0000000000000000 index:0x0
[  564.194856] flags: 0x2ffff0000000000()
[  564.195644] raw: 02ffff0000000000 0000000000000000 dead000000000200 0000000000000000
[  564.197247] raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
[  564.198826] page dumped because: kasan: bad access detected

[  564.200299] Memory state around the buggy address:
[  564.201306]  ffff8801f3117b80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[  564.202779]  ffff8801f3117c00: 00 00 00 00 00 00 00 00 00 00 00 f3 f3 f3 f3 f3
[  564.204252] >ffff8801f3117c80: f3 f3 f3 00 00 00 00 00 00 00 00 00 f1 f1 f1 f1
[  564.205742]                    ^
[  564.206424]  ffff8801f3117d00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[  564.207908]  ffff8801f3117d80: f3 f3 f3 f3 f3 f3 f3 f3 00 00 00 00 00 00 00 00
[  564.209389] ==================================================================
[  564.231795] F2FS-fs (loop0): Mounted with checkpoint version = 2

- Location
https://elixir.bootlin.com/linux/v4.18-rc1/source/fs/f2fs/segment.h#L586
	return div_u64((u64)valid_user_blocks(sbi) * 100,
					sbi->user_block_count);
Missing checks on sbi->user_block_count.

Reported-by: Wen Xu <wen.xu@gatech.edu>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-07-28 18:23:23 -07:00
Linus Torvalds
eb181a814c for-linus-20180727
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAltbc20QHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpgh4D/9GYQcjk9qLVFxkv5ucAUvCuxEL6gjsMf4W
 M/QdxVIrwh3zpvsH++2IXXn+xH+UjujMA5NkzhsSr4+hsSO2iAGOYMJbroNfhsTD
 onvQQ6NTaHPu/+PZs0otVK4KMWHwZGWOV6YU00TWTfRgzRmGEsSMe91oeBIXVv9w
 v6d09twaLSY0lUkAAbcdu5fuFBtXu4Bxy60qyHEKkAdWWHEUYaZLrODhVjoGg2V4
 KdAWS5X4A6kJMcPcoOvG6RFtpf71boaip9o/DRLUWhGdIQnI38UgSCUmz1XMYnik
 Sq8r74vqCm8IhIOLTlxnPrMHHbKv7JZhY3Ow9fxnS6HZRNI0aPX31Yml6NULqnWh
 MsQh+6gZXd3xC1O7txEQn4a15Lk0OLXa8HJcIn5ADNxqz5/r/g0mPUG9HmPSIalO
 ISFF/9UKQFcAd0RjHR+bEEH2VMznz59UWKfdOsmwFZtZSCmR1ucj0xAKDj+oP1JS
 ZsgZ09K2GezrL4GEueocISo9ACIWgDWH8T7/bTxlBok0IYbybAfmOe+MZInL1Tf4
 pklmoXm3ntgV3Pq8Ptk05LYyIgAaUIltuSiR3AFaXIADX0wNtV0ZgysIWgHf3BSA
 18j+I1yPG1IwBdM8xNwxi56xMQR84uY5tsIyafbfj+laRI2nH5OIYjNZnrKpm957
 4xZUgIECBA==
 =2ogY
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-20180727' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:
 "Bigger than usual at this time, mostly due to the O_DIRECT corruption
  issue and the fact that I was on vacation last week. This contains:

   - NVMe pull request with two fixes for the FC code, and two target
     fixes (Christoph)

   - a DIF bio reset iteration fix (Greg Edwards)

   - two nbd reply and requeue fixes (Josef)

   - SCSI timeout fixup (Keith)

   - a small series that fixes an issue with bio_iov_iter_get_pages(),
     which ended up causing corruption for larger sized O_DIRECT writes
     that ended up racing with buffered writes (Martin Wilck)"

* tag 'for-linus-20180727' of git://git.kernel.dk/linux-block:
  block: reset bi_iter.bi_done after splitting bio
  block: bio_iov_iter_get_pages: pin more pages for multi-segment IOs
  blkdev: __blkdev_direct_IO_simple: fix leak in error case
  block: bio_iov_iter_get_pages: fix size of last iovec
  nvmet: only check for filebacking on -ENOTBLK
  nvmet: fixup crash on NULL device path
  scsi: set timed out out mq requests to complete
  blk-mq: export setting request completion state
  nvme: if_ready checks to fail io to deleting controller
  nvmet-fc: fix target sgl list on large transfers
  nbd: handle unexpected replies better
  nbd: don't requeue the same request twice.
2018-07-27 12:51:00 -07:00
Linus Torvalds
864af0d40c Merge branch 'akpm' (patches from Andrew)
Merge misc fixes from Andrew Morton:
 "11 fixes"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
  kvm, mm: account shadow page tables to kmemcg
  zswap: re-check zswap_is_full() after do zswap_shrink()
  include/linux/eventfd.h: include linux/errno.h
  mm: fix vma_is_anonymous() false-positives
  mm: use vma_init() to initialize VMAs on stack and data segments
  mm: introduce vma_init()
  mm: fix exports that inadvertently make put_page() EXPORT_SYMBOL_GPL
  ipc/sem.c: prevent queue.status tearing in semop
  mm: disallow mappings that conflict for devm_memremap_pages()
  kasan: only select SLUB_DEBUG with SYSFS=y
  delayacct: fix crash in delayacct_blkio_end() after delayacct init failure
2018-07-27 10:30:47 -07:00
Linus Torvalds
f636d300cd Changes since last update:
- Fix some uninitialized variable errors
 - Fix an incorrect check in metadata verifiers
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAltYpkoACgkQ+H93GTRK
 tOsMAA//Tyt2rjGGrvtPUiI9xhDDbYM+Eds19IWhye9LyNQCHXdmrCicsBvoEyCC
 5XSAT5lofeLNIbiTS88aC0b4sr2LLban6YsTBHGTlRxUTrnCSpCCDIgXJswxLjmT
 jivIumvKL3sxgmXubwe6gnjoLCNGIy3JrdCu4vFf6JGWAj6U5HyZ5hjtj74nuPtg
 w6BMEptJIOmQwGzSjQY76dQ5ekliVuOtYISY6gRAfVPVvwURgIzZdQPi4qV5Kw/d
 n2nA6rvMBUcMUSVvXWS1ryOWsy4HrB9LXzbr5Kb0NgaVKnAqSCYGIGMJSEsiO/7Y
 P83Doo6N8fYh8QEUOLqJ76XTkkrzoo3fvo7IZXUGMERXx90UliEAI/k6hWy6awtT
 cCQatAcOp+8r5PvMJ9ZIivAwDId06PwpuDntOATIamGkNEo4vo0LO189fQP+i8RD
 LIbEcLcGOHVjjTZgGqJCfDWVPiFtG8ZdZp9bvmpW9aREzMGl/tXnvI2QsSwZu+lU
 87efBqztYGm4U4D5grdV/ynbT1E4E9ggtI2pVHG2ipJnZ+UeTiOCw68lDcUDT0JA
 lU2fPUKzUR3v+U6s26AJFKcX2HCG4G75cJozBuH82xcPnUT0m3PMde0ZhFzVnvg4
 w8T+bIS0Q/f310SSAitu1qfG5cx2f6I5j107jhldvcibRmqEZLE=
 =Ovtv
 -----END PGP SIGNATURE-----

Merge tag 'xfs-4.18-fixes-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull xfs fixes from Darrick Wong:

 - Fix some uninitialized variable errors

 - Fix an incorrect check in metadata verifiers

* tag 'xfs-4.18-fixes-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  xfs: properly handle free inodes in extent hint validators
  xfs: Initialize variables in xfs_alloc_get_rec before using them
2018-07-27 09:25:09 -07:00
Mauro Carvalho Chehab
d0dd962d8a media: dvb: get rid of VIDEO_SET_SPU_PALETTE
No upstream drivers use it. It doesn't make any sense to have
a compat32 code for something that nobody uses upstream.

Reported-by: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
2018-07-27 06:41:35 -04:00
Chao Yu
76d56d4ab4 f2fs: fix to do sanity check with extra_attr feature
If FI_EXTRA_ATTR is set in inode by fuzzing, inode.i_addr[0] will be
parsed as inode.i_extra_isize, then in __recover_inline_status, inline
data address will beyond boundary of page, result in accessing invalid
memory.

So in this condition, during reading inode page, let's do sanity check
with EXTRA_ATTR feature of fs and extra_attr bit of inode, if they're
inconsistent, deny to load this inode.

- Overview
Out-of-bound access in f2fs_iget() when mounting a corrupted f2fs image

- Reproduce

The following message will be got in KASAN build of 4.18 upstream kernel.
[  819.392227] ==================================================================
[  819.393901] BUG: KASAN: slab-out-of-bounds in f2fs_iget+0x736/0x1530
[  819.395329] Read of size 4 at addr ffff8801f099c968 by task mount/1292

[  819.397079] CPU: 1 PID: 1292 Comm: mount Not tainted 4.18.0-rc1+ #4
[  819.397082] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
[  819.397088] Call Trace:
[  819.397124]  dump_stack+0x7b/0xb5
[  819.397154]  print_address_description+0x70/0x290
[  819.397159]  kasan_report+0x291/0x390
[  819.397163]  ? f2fs_iget+0x736/0x1530
[  819.397176]  check_memory_region+0x139/0x190
[  819.397182]  __asan_loadN+0xf/0x20
[  819.397185]  f2fs_iget+0x736/0x1530
[  819.397197]  f2fs_fill_super+0x1b4f/0x2b40
[  819.397202]  ? f2fs_fill_super+0x1b4f/0x2b40
[  819.397208]  ? f2fs_commit_super+0x1b0/0x1b0
[  819.397227]  ? set_blocksize+0x90/0x140
[  819.397241]  mount_bdev+0x1c5/0x210
[  819.397245]  ? f2fs_commit_super+0x1b0/0x1b0
[  819.397252]  f2fs_mount+0x15/0x20
[  819.397256]  mount_fs+0x60/0x1a0
[  819.397267]  ? alloc_vfsmnt+0x309/0x360
[  819.397272]  vfs_kern_mount+0x6b/0x1a0
[  819.397282]  do_mount+0x34a/0x18c0
[  819.397300]  ? lockref_put_or_lock+0xcf/0x160
[  819.397306]  ? copy_mount_string+0x20/0x20
[  819.397318]  ? memcg_kmem_put_cache+0x1b/0xa0
[  819.397324]  ? kasan_check_write+0x14/0x20
[  819.397334]  ? _copy_from_user+0x6a/0x90
[  819.397353]  ? memdup_user+0x42/0x60
[  819.397359]  ksys_mount+0x83/0xd0
[  819.397365]  __x64_sys_mount+0x67/0x80
[  819.397388]  do_syscall_64+0x78/0x170
[  819.397403]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  819.397422] RIP: 0033:0x7f54c667cb9a
[  819.397424] Code: 48 8b 0d 01 c3 2b 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 49 89 ca b8 a5 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d ce c2 2b 00 f7 d8 64 89 01 48
[  819.397483] RSP: 002b:00007ffd8f46cd08 EFLAGS: 00000202 ORIG_RAX: 00000000000000a5
[  819.397496] RAX: ffffffffffffffda RBX: 0000000000dfa030 RCX: 00007f54c667cb9a
[  819.397498] RDX: 0000000000dfa210 RSI: 0000000000dfbf30 RDI: 0000000000e02ec0
[  819.397501] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000013
[  819.397503] R10: 00000000c0ed0000 R11: 0000000000000202 R12: 0000000000e02ec0
[  819.397505] R13: 0000000000dfa210 R14: 0000000000000000 R15: 0000000000000003

[  819.397866] Allocated by task 139:
[  819.398702]  save_stack+0x46/0xd0
[  819.398705]  kasan_kmalloc+0xad/0xe0
[  819.398709]  kasan_slab_alloc+0x11/0x20
[  819.398713]  kmem_cache_alloc+0xd1/0x1e0
[  819.398717]  dup_fd+0x50/0x4c0
[  819.398740]  copy_process.part.37+0xbed/0x32e0
[  819.398744]  _do_fork+0x16e/0x590
[  819.398748]  __x64_sys_clone+0x69/0x80
[  819.398752]  do_syscall_64+0x78/0x170
[  819.398756]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

[  819.399097] Freed by task 159:
[  819.399743]  save_stack+0x46/0xd0
[  819.399747]  __kasan_slab_free+0x13c/0x1a0
[  819.399750]  kasan_slab_free+0xe/0x10
[  819.399754]  kmem_cache_free+0x89/0x1e0
[  819.399757]  put_files_struct+0x132/0x150
[  819.399761]  exit_files+0x62/0x70
[  819.399766]  do_exit+0x47b/0x1390
[  819.399770]  do_group_exit+0x86/0x130
[  819.399774]  __x64_sys_exit_group+0x2c/0x30
[  819.399778]  do_syscall_64+0x78/0x170
[  819.399782]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

[  819.400115] The buggy address belongs to the object at ffff8801f099c680
                which belongs to the cache files_cache of size 704
[  819.403234] The buggy address is located 40 bytes to the right of
                704-byte region [ffff8801f099c680, ffff8801f099c940)
[  819.405689] The buggy address belongs to the page:
[  819.406709] page:ffffea0007c26700 count:1 mapcount:0 mapping:ffff8801f69a3340 index:0xffff8801f099d380 compound_mapcount: 0
[  819.408984] flags: 0x2ffff0000008100(slab|head)
[  819.409932] raw: 02ffff0000008100 ffffea00077fb600 0000000200000002 ffff8801f69a3340
[  819.411514] raw: ffff8801f099d380 0000000080130000 00000001ffffffff 0000000000000000
[  819.413073] page dumped because: kasan: bad access detected

[  819.414539] Memory state around the buggy address:
[  819.415521]  ffff8801f099c800: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  819.416981]  ffff8801f099c880: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  819.418454] >ffff8801f099c900: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
[  819.419921]                                                           ^
[  819.421265]  ffff8801f099c980: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb
[  819.422745]  ffff8801f099ca00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  819.424206] ==================================================================
[  819.425668] Disabling lock debugging due to kernel taint
[  819.457463] F2FS-fs (loop0): Mounted with checkpoint version = 3

The kernel still mounts the image. If you run the following program on the mounted folder mnt,

(poc.c)

static void activity(char *mpoint) {

  char *foo_bar_baz;
  int err;

  static int buf[8192];
  memset(buf, 0, sizeof(buf));

  err = asprintf(&foo_bar_baz, "%s/foo/bar/baz", mpoint);
    int fd = open(foo_bar_baz, O_RDONLY, 0);
  if (fd >= 0) {
      read(fd, (char *)buf, 11);
      close(fd);
  }
}

int main(int argc, char *argv[]) {
  activity(argv[1]);
  return 0;
}

You can get kernel crash:
[  819.457463] F2FS-fs (loop0): Mounted with checkpoint version = 3
[  918.028501] BUG: unable to handle kernel paging request at ffffed0048000d82
[  918.044020] PGD 23ffee067 P4D 23ffee067 PUD 23fbef067 PMD 0
[  918.045207] Oops: 0000 [#1] SMP KASAN PTI
[  918.046048] CPU: 0 PID: 1309 Comm: poc Tainted: G    B             4.18.0-rc1+ #4
[  918.047573] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
[  918.049552] RIP: 0010:check_memory_region+0x5e/0x190
[  918.050565] Code: f8 49 c1 e8 03 49 89 db 49 c1 eb 03 4d 01 cb 4d 01 c1 4d 8d 63 01 4c 89 c8 4d 89 e2 4d 29 ca 49 83 fa 10 7f 3d 4d 85 d2 74 32 <41> 80 39 00 75 23 48 b8 01 00 00 00 00 fc ff df 4d 01 d1 49 01 c0
[  918.054322] RSP: 0018:ffff8801e3a1f258 EFLAGS: 00010202
[  918.055400] RAX: ffffed0048000d82 RBX: ffff880240006c11 RCX: ffffffffb8867d14
[  918.056832] RDX: 0000000000000000 RSI: 0000000000000002 RDI: ffff880240006c10
[  918.058253] RBP: ffff8801e3a1f268 R08: 1ffff10048000d82 R09: ffffed0048000d82
[  918.059717] R10: 0000000000000001 R11: ffffed0048000d82 R12: ffffed0048000d83
[  918.061159] R13: ffff8801e3a1f390 R14: 0000000000000000 R15: ffff880240006c08
[  918.062614] FS:  00007fac9732c700(0000) GS:ffff8801f6e00000(0000) knlGS:0000000000000000
[  918.064246] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  918.065412] CR2: ffffed0048000d82 CR3: 00000001df77a000 CR4: 00000000000006f0
[  918.066882] Call Trace:
[  918.067410]  __asan_loadN+0xf/0x20
[  918.068149]  f2fs_find_target_dentry+0xf4/0x270
[  918.069083]  ? __get_node_page+0x331/0x5b0
[  918.069925]  f2fs_find_in_inline_dir+0x24b/0x310
[  918.070881]  ? f2fs_recover_inline_data+0x4c0/0x4c0
[  918.071905]  ? unwind_next_frame.part.5+0x34f/0x490
[  918.072901]  ? unwind_dump+0x290/0x290
[  918.073695]  ? is_bpf_text_address+0xe/0x20
[  918.074566]  __f2fs_find_entry+0x599/0x670
[  918.075408]  ? kasan_unpoison_shadow+0x36/0x50
[  918.076315]  ? kasan_kmalloc+0xad/0xe0
[  918.077100]  ? memcg_kmem_put_cache+0x55/0xa0
[  918.077998]  ? f2fs_find_target_dentry+0x270/0x270
[  918.079006]  ? d_set_d_op+0x30/0x100
[  918.079749]  ? __d_lookup_rcu+0x69/0x2e0
[  918.080556]  ? __d_alloc+0x275/0x450
[  918.081297]  ? kasan_check_write+0x14/0x20
[  918.082135]  ? memset+0x31/0x40
[  918.082820]  ? fscrypt_setup_filename+0x1ec/0x4c0
[  918.083782]  ? d_alloc_parallel+0x5bb/0x8c0
[  918.084640]  f2fs_find_entry+0xe9/0x110
[  918.085432]  ? __f2fs_find_entry+0x670/0x670
[  918.086308]  ? kasan_check_write+0x14/0x20
[  918.087163]  f2fs_lookup+0x297/0x590
[  918.087902]  ? f2fs_link+0x2b0/0x2b0
[  918.088646]  ? legitimize_path.isra.29+0x61/0xa0
[  918.089589]  __lookup_slow+0x12e/0x240
[  918.090371]  ? may_delete+0x2b0/0x2b0
[  918.091123]  ? __nd_alloc_stack+0xa0/0xa0
[  918.091944]  lookup_slow+0x44/0x60
[  918.092642]  walk_component+0x3ee/0xa40
[  918.093428]  ? is_bpf_text_address+0xe/0x20
[  918.094283]  ? pick_link+0x3e0/0x3e0
[  918.095047]  ? in_group_p+0xa5/0xe0
[  918.095771]  ? generic_permission+0x53/0x1e0
[  918.096666]  ? security_inode_permission+0x1d/0x70
[  918.097646]  ? inode_permission+0x7a/0x1f0
[  918.098497]  link_path_walk+0x2a2/0x7b0
[  918.099298]  ? apparmor_capget+0x3d0/0x3d0
[  918.100140]  ? walk_component+0xa40/0xa40
[  918.100958]  ? path_init+0x2e6/0x580
[  918.101695]  path_openat+0x1bb/0x2160
[  918.102471]  ? __save_stack_trace+0x92/0x100
[  918.103352]  ? save_stack+0xb5/0xd0
[  918.104070]  ? vfs_unlink+0x250/0x250
[  918.104822]  ? save_stack+0x46/0xd0
[  918.105538]  ? kasan_slab_alloc+0x11/0x20
[  918.106370]  ? kmem_cache_alloc+0xd1/0x1e0
[  918.107213]  ? getname_flags+0x76/0x2c0
[  918.107997]  ? getname+0x12/0x20
[  918.108677]  ? do_sys_open+0x14b/0x2c0
[  918.109450]  ? __x64_sys_open+0x4c/0x60
[  918.110255]  ? do_syscall_64+0x78/0x170
[  918.111083]  ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  918.112148]  ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  918.113204]  ? f2fs_empty_inline_dir+0x1e0/0x1e0
[  918.114150]  ? timespec64_trunc+0x5c/0x90
[  918.114993]  ? wb_io_lists_depopulated+0x1a/0xc0
[  918.115937]  ? inode_io_list_move_locked+0x102/0x110
[  918.116949]  do_filp_open+0x12b/0x1d0
[  918.117709]  ? may_open_dev+0x50/0x50
[  918.118475]  ? kasan_kmalloc+0xad/0xe0
[  918.119246]  do_sys_open+0x17c/0x2c0
[  918.119983]  ? do_sys_open+0x17c/0x2c0
[  918.120751]  ? filp_open+0x60/0x60
[  918.121463]  ? task_work_run+0x4d/0xf0
[  918.122237]  __x64_sys_open+0x4c/0x60
[  918.123001]  do_syscall_64+0x78/0x170
[  918.123759]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  918.124802] RIP: 0033:0x7fac96e3e040
[  918.125537] Code: 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 83 3d 09 27 2d 00 00 75 10 b8 02 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 31 c3 48 83 ec 08 e8 7e e0 01 00 48 89 04 24
[  918.129341] RSP: 002b:00007fff1b37f848 EFLAGS: 00000246 ORIG_RAX: 0000000000000002
[  918.130870] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fac96e3e040
[  918.132295] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 000000000122d080
[  918.133748] RBP: 00007fff1b37f9b0 R08: 00007fac9710bbd8 R09: 0000000000000001
[  918.135209] R10: 000000000000069d R11: 0000000000000246 R12: 0000000000400c20
[  918.136650] R13: 00007fff1b37fab0 R14: 0000000000000000 R15: 0000000000000000
[  918.138093] Modules linked in: snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hwdep snd_hda_core snd_pcm snd_timer snd mac_hid i2c_piix4 soundcore ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx raid1 raid0 multipath linear 8139too crct10dif_pclmul crc32_pclmul qxl drm_kms_helper syscopyarea aesni_intel sysfillrect sysimgblt fb_sys_fops ttm drm aes_x86_64 crypto_simd cryptd 8139cp glue_helper mii pata_acpi floppy
[  918.147924] CR2: ffffed0048000d82
[  918.148619] ---[ end trace 4ce02f25ff7d3df5 ]---
[  918.149563] RIP: 0010:check_memory_region+0x5e/0x190
[  918.150576] Code: f8 49 c1 e8 03 49 89 db 49 c1 eb 03 4d 01 cb 4d 01 c1 4d 8d 63 01 4c 89 c8 4d 89 e2 4d 29 ca 49 83 fa 10 7f 3d 4d 85 d2 74 32 <41> 80 39 00 75 23 48 b8 01 00 00 00 00 fc ff df 4d 01 d1 49 01 c0
[  918.154360] RSP: 0018:ffff8801e3a1f258 EFLAGS: 00010202
[  918.155411] RAX: ffffed0048000d82 RBX: ffff880240006c11 RCX: ffffffffb8867d14
[  918.156833] RDX: 0000000000000000 RSI: 0000000000000002 RDI: ffff880240006c10
[  918.158257] RBP: ffff8801e3a1f268 R08: 1ffff10048000d82 R09: ffffed0048000d82
[  918.159722] R10: 0000000000000001 R11: ffffed0048000d82 R12: ffffed0048000d83
[  918.161149] R13: ffff8801e3a1f390 R14: 0000000000000000 R15: ffff880240006c08
[  918.162587] FS:  00007fac9732c700(0000) GS:ffff8801f6e00000(0000) knlGS:0000000000000000
[  918.164203] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  918.165356] CR2: ffffed0048000d82 CR3: 00000001df77a000 CR4: 00000000000006f0

Reported-by: Wen Xu <wen.xu@gatech.edu>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-07-27 18:03:59 +09:00
Chao Yu
01f9cf6db7 f2fs: fix to correct return value of f2fs_trim_fs
We should account trimmed block number from __wait_all_discard_cmd
in __issue_discard_cmd_range, otherwise trimmed blocks returned
by f2fs_trim_fs will be wrong, this patch fixes it.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-07-27 18:03:59 +09:00
Chao Yu
c77ec61ca0 f2fs: fix to do sanity check with {sit,nat}_ver_bitmap_bytesize
This patch adds to do sanity check with {sit,nat}_ver_bitmap_bytesize
during mount, in order to avoid accessing across cache boundary with
this abnormal bitmap size.

- Overview
buffer overrun in build_sit_info() when mounting a crafted f2fs image

- Reproduce

- Kernel message
[  548.580867] F2FS-fs (loop0): Invalid log blocks per segment (8201)

[  548.580877] F2FS-fs (loop0): Can't find valid F2FS filesystem in 1th superblock
[  548.584979] ==================================================================
[  548.586568] BUG: KASAN: use-after-free in kmemdup+0x36/0x50
[  548.587715] Read of size 64 at addr ffff8801e9c265ff by task mount/1295

[  548.589428] CPU: 1 PID: 1295 Comm: mount Not tainted 4.18.0-rc1+ #4
[  548.589432] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
[  548.589438] Call Trace:
[  548.589474]  dump_stack+0x7b/0xb5
[  548.589487]  print_address_description+0x70/0x290
[  548.589492]  kasan_report+0x291/0x390
[  548.589496]  ? kmemdup+0x36/0x50
[  548.589509]  check_memory_region+0x139/0x190
[  548.589514]  memcpy+0x23/0x50
[  548.589518]  kmemdup+0x36/0x50
[  548.589545]  f2fs_build_segment_manager+0x8fa/0x3410
[  548.589551]  ? __asan_loadN+0xf/0x20
[  548.589560]  ? f2fs_sanity_check_ckpt+0x1be/0x240
[  548.589566]  ? f2fs_flush_sit_entries+0x10c0/0x10c0
[  548.589587]  ? __put_user_ns+0x40/0x40
[  548.589604]  ? find_next_bit+0x57/0x90
[  548.589610]  f2fs_fill_super+0x194b/0x2b40
[  548.589617]  ? f2fs_commit_super+0x1b0/0x1b0
[  548.589637]  ? set_blocksize+0x90/0x140
[  548.589651]  mount_bdev+0x1c5/0x210
[  548.589655]  ? f2fs_commit_super+0x1b0/0x1b0
[  548.589667]  f2fs_mount+0x15/0x20
[  548.589672]  mount_fs+0x60/0x1a0
[  548.589683]  ? alloc_vfsmnt+0x309/0x360
[  548.589688]  vfs_kern_mount+0x6b/0x1a0
[  548.589699]  do_mount+0x34a/0x18c0
[  548.589710]  ? lockref_put_or_lock+0xcf/0x160
[  548.589716]  ? copy_mount_string+0x20/0x20
[  548.589728]  ? memcg_kmem_put_cache+0x1b/0xa0
[  548.589734]  ? kasan_check_write+0x14/0x20
[  548.589740]  ? _copy_from_user+0x6a/0x90
[  548.589744]  ? memdup_user+0x42/0x60
[  548.589750]  ksys_mount+0x83/0xd0
[  548.589755]  __x64_sys_mount+0x67/0x80
[  548.589781]  do_syscall_64+0x78/0x170
[  548.589797]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  548.589820] RIP: 0033:0x7f76fc331b9a
[  548.589821] Code: 48 8b 0d 01 c3 2b 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 49 89 ca b8 a5 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d ce c2 2b 00 f7 d8 64 89 01 48
[  548.589880] RSP: 002b:00007ffd4f0a0e48 EFLAGS: 00000206 ORIG_RAX: 00000000000000a5
[  548.589890] RAX: ffffffffffffffda RBX: 000000000146c030 RCX: 00007f76fc331b9a
[  548.589892] RDX: 000000000146c210 RSI: 000000000146df30 RDI: 0000000001474ec0
[  548.589895] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000013
[  548.589897] R10: 00000000c0ed0000 R11: 0000000000000206 R12: 0000000001474ec0
[  548.589900] R13: 000000000146c210 R14: 0000000000000000 R15: 0000000000000003

[  548.590242] The buggy address belongs to the page:
[  548.591243] page:ffffea0007a70980 count:0 mapcount:0 mapping:0000000000000000 index:0x0
[  548.592886] flags: 0x2ffff0000000000()
[  548.593665] raw: 02ffff0000000000 dead000000000100 dead000000000200 0000000000000000
[  548.595258] raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
[  548.603713] page dumped because: kasan: bad access detected

[  548.605203] Memory state around the buggy address:
[  548.606198]  ffff8801e9c26480: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
[  548.607676]  ffff8801e9c26500: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
[  548.609157] >ffff8801e9c26580: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
[  548.610629]                                                                 ^
[  548.612088]  ffff8801e9c26600: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
[  548.613674]  ffff8801e9c26680: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
[  548.615141] ==================================================================
[  548.616613] Disabling lock debugging due to kernel taint
[  548.622871] WARNING: CPU: 1 PID: 1295 at mm/page_alloc.c:4065 __alloc_pages_slowpath+0xe4a/0x1420
[  548.622878] Modules linked in: snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hwdep snd_hda_core snd_pcm snd_timer snd mac_hid i2c_piix4 soundcore ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx raid1 raid0 multipath linear 8139too crct10dif_pclmul crc32_pclmul qxl drm_kms_helper syscopyarea aesni_intel sysfillrect sysimgblt fb_sys_fops ttm drm aes_x86_64 crypto_simd cryptd 8139cp glue_helper mii pata_acpi floppy
[  548.623217] CPU: 1 PID: 1295 Comm: mount Tainted: G    B             4.18.0-rc1+ #4
[  548.623219] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
[  548.623226] RIP: 0010:__alloc_pages_slowpath+0xe4a/0x1420
[  548.623227] Code: ff ff 01 89 85 c8 fe ff ff e9 91 fc ff ff 41 89 c5 e9 5c fc ff ff 0f 0b 89 f8 25 ff ff f7 ff 89 85 8c fe ff ff e9 d5 f2 ff ff <0f> 0b e9 65 f2 ff ff 65 8b 05 38 81 d2 47 f6 c4 01 74 1c 65 48 8b
[  548.623281] RSP: 0018:ffff8801f28c7678 EFLAGS: 00010246
[  548.623284] RAX: 0000000000000000 RBX: 00000000006040c0 RCX: ffffffffb82f73b7
[  548.623287] RDX: 1ffff1003e518eeb RSI: 000000000000000c RDI: 0000000000000000
[  548.623290] RBP: ffff8801f28c7880 R08: 0000000000000000 R09: ffffed0047fff2c5
[  548.623292] R10: 0000000000000001 R11: ffffed0047fff2c4 R12: ffff8801e88de040
[  548.623295] R13: 00000000006040c0 R14: 000000000000000c R15: ffff8801f28c7938
[  548.623299] FS:  00007f76fca51840(0000) GS:ffff8801f6f00000(0000) knlGS:0000000000000000
[  548.623302] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  548.623304] CR2: 00007f19b9171760 CR3: 00000001ed952000 CR4: 00000000000006e0
[  548.623317] Call Trace:
[  548.623325]  ? kasan_check_read+0x11/0x20
[  548.623330]  ? __zone_watermark_ok+0x92/0x240
[  548.623336]  ? get_page_from_freelist+0x1c3/0x1d90
[  548.623347]  ? _raw_spin_lock_irqsave+0x2a/0x60
[  548.623353]  ? warn_alloc+0x250/0x250
[  548.623358]  ? save_stack+0x46/0xd0
[  548.623361]  ? kasan_kmalloc+0xad/0xe0
[  548.623366]  ? __isolate_free_page+0x2a0/0x2a0
[  548.623370]  ? mount_fs+0x60/0x1a0
[  548.623374]  ? vfs_kern_mount+0x6b/0x1a0
[  548.623378]  ? do_mount+0x34a/0x18c0
[  548.623383]  ? ksys_mount+0x83/0xd0
[  548.623387]  ? __x64_sys_mount+0x67/0x80
[  548.623391]  ? do_syscall_64+0x78/0x170
[  548.623396]  ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  548.623401]  __alloc_pages_nodemask+0x3c5/0x400
[  548.623407]  ? __alloc_pages_slowpath+0x1420/0x1420
[  548.623412]  ? __mutex_lock_slowpath+0x20/0x20
[  548.623417]  ? kvmalloc_node+0x31/0x80
[  548.623424]  alloc_pages_current+0x75/0x110
[  548.623436]  kmalloc_order+0x24/0x60
[  548.623442]  kmalloc_order_trace+0x24/0xb0
[  548.623448]  __kmalloc_track_caller+0x207/0x220
[  548.623455]  ? f2fs_build_node_manager+0x399/0xbb0
[  548.623460]  kmemdup+0x20/0x50
[  548.623465]  f2fs_build_node_manager+0x399/0xbb0
[  548.623470]  f2fs_fill_super+0x195e/0x2b40
[  548.623477]  ? f2fs_commit_super+0x1b0/0x1b0
[  548.623481]  ? set_blocksize+0x90/0x140
[  548.623486]  mount_bdev+0x1c5/0x210
[  548.623489]  ? f2fs_commit_super+0x1b0/0x1b0
[  548.623495]  f2fs_mount+0x15/0x20
[  548.623498]  mount_fs+0x60/0x1a0
[  548.623503]  ? alloc_vfsmnt+0x309/0x360
[  548.623508]  vfs_kern_mount+0x6b/0x1a0
[  548.623513]  do_mount+0x34a/0x18c0
[  548.623518]  ? lockref_put_or_lock+0xcf/0x160
[  548.623523]  ? copy_mount_string+0x20/0x20
[  548.623528]  ? memcg_kmem_put_cache+0x1b/0xa0
[  548.623533]  ? kasan_check_write+0x14/0x20
[  548.623537]  ? _copy_from_user+0x6a/0x90
[  548.623542]  ? memdup_user+0x42/0x60
[  548.623547]  ksys_mount+0x83/0xd0
[  548.623552]  __x64_sys_mount+0x67/0x80
[  548.623557]  do_syscall_64+0x78/0x170
[  548.623562]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  548.623566] RIP: 0033:0x7f76fc331b9a
[  548.623567] Code: 48 8b 0d 01 c3 2b 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 49 89 ca b8 a5 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d ce c2 2b 00 f7 d8 64 89 01 48
[  548.623632] RSP: 002b:00007ffd4f0a0e48 EFLAGS: 00000206 ORIG_RAX: 00000000000000a5
[  548.623636] RAX: ffffffffffffffda RBX: 000000000146c030 RCX: 00007f76fc331b9a
[  548.623639] RDX: 000000000146c210 RSI: 000000000146df30 RDI: 0000000001474ec0
[  548.623641] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000013
[  548.623643] R10: 00000000c0ed0000 R11: 0000000000000206 R12: 0000000001474ec0
[  548.623646] R13: 000000000146c210 R14: 0000000000000000 R15: 0000000000000003
[  548.623650] ---[ end trace 4ce02f25ff7d3df5 ]---
[  548.623656] F2FS-fs (loop0): Failed to initialize F2FS node manager
[  548.627936] F2FS-fs (loop0): Invalid log blocks per segment (8201)

[  548.627940] F2FS-fs (loop0): Can't find valid F2FS filesystem in 1th superblock
[  548.635835] F2FS-fs (loop0): Failed to initialize F2FS node manager

- Location
https://elixir.bootlin.com/linux/v4.18-rc1/source/fs/f2fs/segment.c#L3578

	sit_i->sit_bitmap = kmemdup(src_bitmap, bitmap_size, GFP_KERNEL);

Buffer overrun happens when doing memcpy. I suspect there is missing (inconsistent) checks on bitmap_size.

Reported by Wen Xu (wen.xu@gatech.edu) from SSLab, Gatech.

Reported-by: Wen Xu <wen.xu@gatech.edu>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-07-27 18:03:59 +09:00
Chao Yu
42bf546c1f f2fs: fix to do sanity check with secs_per_zone
As Wen Xu reported in below link:

https://bugzilla.kernel.org/show_bug.cgi?id=200183

- Overview
Divide zero in reset_curseg() when mounting a crafted f2fs image

- Reproduce

- Kernel message
[  588.281510] divide error: 0000 [#1] SMP KASAN PTI
[  588.282701] CPU: 0 PID: 1293 Comm: mount Not tainted 4.18.0-rc1+ #4
[  588.284000] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
[  588.286178] RIP: 0010:reset_curseg+0x94/0x1a0
[  588.298166] RSP: 0018:ffff8801e88d7940 EFLAGS: 00010246
[  588.299360] RAX: 0000000000000014 RBX: ffff8801e1d46d00 RCX: ffffffffb88bf60b
[  588.300809] RDX: 0000000000000000 RSI: dffffc0000000000 RDI: ffff8801e1d46d64
[  588.305272] R13: 0000000000000000 R14: 0000000000000014 R15: 0000000000000000
[  588.306822] FS:  00007fad85008840(0000) GS:ffff8801f6e00000(0000) knlGS:0000000000000000
[  588.308456] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  588.309623] CR2: 0000000001705078 CR3: 00000001f30f8000 CR4: 00000000000006f0
[  588.311085] Call Trace:
[  588.311637]  f2fs_build_segment_manager+0x103f/0x3410
[  588.316136]  ? f2fs_commit_super+0x1b0/0x1b0
[  588.317031]  ? set_blocksize+0x90/0x140
[  588.319473]  f2fs_mount+0x15/0x20
[  588.320166]  mount_fs+0x60/0x1a0
[  588.320847]  ? alloc_vfsmnt+0x309/0x360
[  588.321647]  vfs_kern_mount+0x6b/0x1a0
[  588.322432]  do_mount+0x34a/0x18c0
[  588.323175]  ? strndup_user+0x46/0x70
[  588.323937]  ? copy_mount_string+0x20/0x20
[  588.324793]  ? memcg_kmem_put_cache+0x1b/0xa0
[  588.325702]  ? kasan_check_write+0x14/0x20
[  588.326562]  ? _copy_from_user+0x6a/0x90
[  588.327375]  ? memdup_user+0x42/0x60
[  588.328118]  ksys_mount+0x83/0xd0
[  588.328808]  __x64_sys_mount+0x67/0x80
[  588.329607]  do_syscall_64+0x78/0x170
[  588.330400]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  588.331461] RIP: 0033:0x7fad848e8b9a
[  588.336022] RSP: 002b:00007ffd7c5b6be8 EFLAGS: 00000206 ORIG_RAX: 00000000000000a5
[  588.337547] RAX: ffffffffffffffda RBX: 00000000016f8030 RCX: 00007fad848e8b9a
[  588.338999] RDX: 00000000016f8210 RSI: 00000000016f9f30 RDI: 0000000001700ec0
[  588.340442] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000013
[  588.341887] R10: 00000000c0ed0000 R11: 0000000000000206 R12: 0000000001700ec0
[  588.343341] R13: 00000000016f8210 R14: 0000000000000000 R15: 0000000000000003
[  588.354891] ---[ end trace 4ce02f25ff7d3df5 ]---
[  588.355862] RIP: 0010:reset_curseg+0x94/0x1a0
[  588.360742] RSP: 0018:ffff8801e88d7940 EFLAGS: 00010246
[  588.361812] RAX: 0000000000000014 RBX: ffff8801e1d46d00 RCX: ffffffffb88bf60b
[  588.363485] RDX: 0000000000000000 RSI: dffffc0000000000 RDI: ffff8801e1d46d64
[  588.365213] RBP: ffff8801e88d7968 R08: ffffed003c32266f R09: ffffed003c32266f
[  588.366661] R10: 0000000000000001 R11: ffffed003c32266e R12: ffff8801f0337700
[  588.368110] R13: 0000000000000000 R14: 0000000000000014 R15: 0000000000000000
[  588.370057] FS:  00007fad85008840(0000) GS:ffff8801f6e00000(0000) knlGS:0000000000000000
[  588.372099] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  588.373291] CR2: 0000000001705078 CR3: 00000001f30f8000 CR4: 00000000000006f0

- Location
https://elixir.bootlin.com/linux/latest/source/fs/f2fs/segment.c#L2147
        curseg->zone = GET_ZONE_FROM_SEG(sbi, curseg->segno);

If secs_per_zone is corrupted due to fuzzing test, it will cause divide
zero operation when using GET_ZONE_FROM_SEG macro, so we should do more
sanity check with secs_per_zone during mount to avoid this issue.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-07-27 18:03:59 +09:00
Chao Yu
67fce70ba3 f2fs: disable f2fs_check_rb_tree_consistence
If there is millions of discard entries cached in rb tree, each
sanity check of it can cause very long latency as held cmd_lock
blocking other lock grabbers.

In other aspect, we have enabled the check very long time, as
we see, there is no such inconsistent condition caused by bugs.

But still we do not choose to kill it directly, instead, adding
an flag to disable the check now, if there is related code change,
we can reuse it to detect bugs.

Signed-off-by: Yunlei He <heyunlei@huawei.com>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-07-27 18:03:59 +09:00
Chao Yu
e1da7872f6 f2fs: introduce and spread verify_blkaddr
This patch introduces verify_blkaddr to check meta/data block address
with valid range to detect bug earlier.

In addition, once we encounter an invalid blkaddr, notice user to run
fsck to fix, and let the kernel panic.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-07-27 18:03:59 +09:00
Arnd Bergmann
24b81dfcb7 f2fs: use timespec64 for inode timestamps
The on-disk representation and the vfs both use 64-bit tv_sec values,
so let's change the last missing piece in the middle.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-07-27 18:03:59 +09:00
Chao Yu
6aead1617b f2fs: fix to wait on page writeback before updating page
In error path of f2fs_move_rehashed_dirents, inode page could be writeback
state, so we should wait on inode page writeback before updating it.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-07-27 18:03:59 +09:00
Jaegeuk Kim
e2e59414aa f2fs: assign REQ_RAHEAD to bio for ->readpages
As Jens reported, we'd better assign REQ_RAHEAD to bio by the fact that
->readpages is called only from read-ahead.

In Documentation/filesystems/vfs.txt,

readpages: called by the VM to read pages associated with the address_space
  	object. This is essentially just a vector version of
  	readpage.  Instead of just one page, several pages are
  	requested.
	readpages is only used for read-ahead, so read errors are
  	ignored.  If anything goes wrong, feel free to give up.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-07-27 18:03:59 +09:00
Yunlei He
2a63531a61 f2fs: fix a hungtask problem caused by congestion_wait
This patch fix hungtask problem which can be reproduced as follow:

Thread 0~3:
while true
do
        touch /xxx/test/file_xxx
done

Thread 4 write a new checkpoint every three seconds.

In the meantime, fio start 16 threads for randwrite.

With my debug info, cycles num will exceed 1000 in function
f2fs_sync_dirty_inodes, and most of cycle will be dropped
into congestion_wait() and sleep more than 20ms. Cycles num
reduced to 3 with this patch.

Signed-off-by: Yunlei He <heyunlei@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-07-27 18:03:59 +09:00
Dan Carpenter
2a96d8ad94 f2fs: Fix uninitialized return in f2fs_ioc_shutdown()
"ret" can be uninitialized on the success path when "in ==
F2FS_GOING_DOWN_FULLSYNC".

Fixes: 60b2b4ee2b ("f2fs: Fix deadlock in shutdown ioctl")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-07-27 18:03:59 +09:00
Jaegeuk Kim
5a6154920f f2fs: don't issue discard commands in online discard is on
Actually, we don't need to issue discard commands, if discard is on, as
mentioned in the comment.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-07-27 18:03:59 +09:00
Chao Yu
e2374015f2 f2fs: fix to propagate return value of scan_nat_page()
As Anatoly Trosinenko reported in bugzilla:

How to reproduce:
1. Compile the 73fcb1a370 version of the kernel using the config attached
2. Unpack and mount the attached filesystem image as F2FS
3. The kernel will BUG() on mount (BUGs are explicitly enabled in config)

[    2.233612] F2FS-fs (sda): Found nat_bits in checkpoint
[    2.248422] ------------[ cut here ]------------
[    2.248857] kernel BUG at fs/f2fs/node.c:1967!
[    2.249760] invalid opcode: 0000 [#1] SMP NOPTI
[    2.250219] Modules linked in:
[    2.251848] CPU: 0 PID: 944 Comm: mount Not tainted 4.17.0-rc5+ #1
[    2.252331] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
[    2.253305] RIP: 0010:build_free_nids+0x337/0x3f0
[    2.253672] RSP: 0018:ffffae7fc0857c50 EFLAGS: 00000246
[    2.254080] RAX: 00000000ffffffff RBX: 0000000000000123 RCX: 0000000000000001
[    2.254638] RDX: ffff9aa7063d5c00 RSI: 0000000000000122 RDI: ffff9aa705852e00
[    2.255190] RBP: ffff9aa705852e00 R08: 0000000000000001 R09: ffff9aa7059090c0
[    2.255719] R10: 0000000000000000 R11: 0000000000000000 R12: ffff9aa705852e00
[    2.256242] R13: ffff9aa7063ad000 R14: ffff9aa705919000 R15: 0000000000000123
[    2.256809] FS:  00000000023078c0(0000) GS:ffff9aa707800000(0000) knlGS:0000000000000000
[    2.258654] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    2.259153] CR2: 00000000005511ae CR3: 0000000005872000 CR4: 00000000000006f0
[    2.259801] Call Trace:
[    2.260583]  build_node_manager+0x5cd/0x600
[    2.260963]  f2fs_fill_super+0x66a/0x17c0
[    2.261300]  ? f2fs_commit_super+0xe0/0xe0
[    2.261622]  mount_bdev+0x16e/0x1a0
[    2.261899]  mount_fs+0x30/0x150
[    2.262398]  vfs_kern_mount.part.28+0x4f/0xf0
[    2.262743]  do_mount+0x5d0/0xc60
[    2.263010]  ? _copy_from_user+0x37/0x60
[    2.263313]  ? memdup_user+0x39/0x60
[    2.263692]  ksys_mount+0x7b/0xd0
[    2.263960]  __x64_sys_mount+0x1c/0x20
[    2.264268]  do_syscall_64+0x43/0xf0
[    2.264560]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[    2.265095] RIP: 0033:0x48d31a
[    2.265502] RSP: 002b:00007ffc6fe60a08 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5
[    2.266089] RAX: ffffffffffffffda RBX: 0000000000008000 RCX: 000000000048d31a
[    2.266607] RDX: 00007ffc6fe62fa5 RSI: 00007ffc6fe62f9d RDI: 00007ffc6fe62f94
[    2.267130] RBP: 00000000023078a0 R08: 0000000000000000 R09: 0000000000000000
[    2.267670] R10: 0000000000008000 R11: 0000000000000246 R12: 0000000000000000
[    2.268192] R13: 0000000000000000 R14: 00007ffc6fe60c78 R15: 0000000000000000
[    2.268767] Code: e8 5f c3 ff ff 83 c3 01 41 83 c7 01 81 fb c7 01 00 00 74 48 44 39 7d 04 76 42 48 63 c3 48 8d 04 c0 41 8b 44 06 05 83 f8 ff 75 c1 <0f> 0b 49 8b 45 50 48 8d b8 b0 00 00 00 e8 37 59 69 00 b9 01 00
[    2.270434] RIP: build_free_nids+0x337/0x3f0 RSP: ffffae7fc0857c50
[    2.271426] ---[ end trace ab20c06cd3c8fde4 ]---

During loading NAT entries, we will do sanity check, once the entry info
is corrupted, it will cause BUG_ON directly to protect user data from
being overwrited.

In this case, it will be better to just return failure on mount() instead
of panic, so that user can get hint from kmsg and try fsck for recovery
immediately rather than after an abnormal reboot.

https://bugzilla.kernel.org/show_bug.cgi?id=199769

Reported-by: Anatoly Trosinenko <anatoly.trosinenko@gmail.com>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-07-27 18:03:59 +09:00
Weichao Guo
54c55c4e4f f2fs: support in-memory inode checksum when checking consistency
Enable in-memory inode checksum to protect metadata blocks from
in-memory scribbles when checking consistency, which has no
performance requirements.

Signed-off-by: Weichao Guo <guoweichao@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-07-27 18:03:59 +09:00
Chao Yu
4e423832a6 f2fs: fix error path of fill_super
In fill_super, if root inode's attribute is incorrect, we need to
call f2fs_destroy_stats to release stats memory.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-07-27 18:03:59 +09:00
Chao Yu
4cac90d549 f2fs: relocate readdir_ra configure initialization
readdir_ra is sysfs configuration instead of mount option, so it should
not be initialized in default_options(), otherwise after remount, it can
be reset to be enabled which may not as user wish, so let's move it to
f2fs_tuning_parameters().

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-07-27 18:03:59 +09:00
Chao Yu
0aa7e0f8c0 f2fs: move s_res{u,g}id initialization to default_options()
Let default_options() initialize s_res{u,g}id with default value like
other options.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-07-27 18:03:59 +09:00
Chao Yu
76a45e3c45 f2fs: don't acquire orphan ino during recovery
During orphan inode recovery, checkpoint should never succeed due to
SBI_POR_DOING flag, so we don't need acquire orphan ino which only be
used by checkpoint.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-07-27 18:03:59 +09:00
Jaegeuk Kim
a1933c09ef f2fs: avoid potential deadlock in f2fs_sbi_store
[  155.018460] ======================================================
[  155.021431] WARNING: possible circular locking dependency detected
[  155.024339] 4.18.0-rc3+ #5 Tainted: G           OE
[  155.026879] ------------------------------------------------------
[  155.029783] umount/2901 is trying to acquire lock:
[  155.032187] 00000000c4282f1f (kn->count#130){++++}, at: kernfs_remove+0x1f/0x30
[  155.035439]
[  155.035439] but task is already holding lock:
[  155.038892] 0000000056e4307b (&type->s_umount_key#41){++++}, at: deactivate_super+0x33/0x50
[  155.042602]
[  155.042602] which lock already depends on the new lock.
[  155.042602]
[  155.047465]
[  155.047465] the existing dependency chain (in reverse order) is:
[  155.051354]
[  155.051354] -> #1 (&type->s_umount_key#41){++++}:
[  155.054768]        f2fs_sbi_store+0x61/0x460 [f2fs]
[  155.057083]        kernfs_fop_write+0x113/0x1a0
[  155.059277]        __vfs_write+0x36/0x180
[  155.061250]        vfs_write+0xbe/0x1b0
[  155.063179]        ksys_write+0x55/0xc0
[  155.065068]        do_syscall_64+0x60/0x1b0
[  155.067071]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  155.069529]
[  155.069529] -> #0 (kn->count#130){++++}:
[  155.072421]        __kernfs_remove+0x26f/0x2e0
[  155.074452]        kernfs_remove+0x1f/0x30
[  155.076342]        kobject_del.part.5+0xe/0x40
[  155.078354]        f2fs_put_super+0x12d/0x290 [f2fs]
[  155.080500]        generic_shutdown_super+0x6c/0x110
[  155.082655]        kill_block_super+0x21/0x50
[  155.084634]        kill_f2fs_super+0x9c/0xc0 [f2fs]
[  155.086726]        deactivate_locked_super+0x3f/0x70
[  155.088826]        cleanup_mnt+0x3b/0x70
[  155.090584]        task_work_run+0x93/0xc0
[  155.092367]        exit_to_usermode_loop+0xf0/0x100
[  155.094466]        do_syscall_64+0x162/0x1b0
[  155.096312]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  155.098603]
[  155.098603] other info that might help us debug this:
[  155.098603]
[  155.102418]  Possible unsafe locking scenario:
[  155.102418]
[  155.105134]        CPU0                    CPU1
[  155.107037]        ----                    ----
[  155.108910]   lock(&type->s_umount_key#41);
[  155.110674]                                lock(kn->count#130);
[  155.113010]                                lock(&type->s_umount_key#41);
[  155.115608]   lock(kn->count#130);

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-07-27 18:03:59 +09:00
Jaegeuk Kim
83a3bfdb5a f2fs: indicate shutdown f2fs to allow unmount successfully
Once we shutdown f2fs, we have to flush stale pages in order to unmount
the system. In order to make stable, we need to stop fault injection as well.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-07-27 18:03:56 +09:00
Jaegeuk Kim
af697c0f5c f2fs: keep meta pages in cp_error state
It turns out losing meta pages in shutdown period makes f2fs very unstable
so that I could see many unexpected error conditions.

Let's keep meta pages for fault injection and sudden power-off tests.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-07-27 18:03:39 +09:00
Kirill A. Shutemov
bfd40eaff5 mm: fix vma_is_anonymous() false-positives
vma_is_anonymous() relies on ->vm_ops being NULL to detect anonymous
VMA.  This is unreliable as ->mmap may not set ->vm_ops.

False-positive vma_is_anonymous() may lead to crashes:

	next ffff8801ce5e7040 prev ffff8801d20eca50 mm ffff88019c1e13c0
	prot 27 anon_vma ffff88019680cdd8 vm_ops 0000000000000000
	pgoff 0 file ffff8801b2ec2d00 private_data 0000000000000000
	flags: 0xff(read|write|exec|shared|mayread|maywrite|mayexec|mayshare)
	------------[ cut here ]------------
	kernel BUG at mm/memory.c:1422!
	invalid opcode: 0000 [#1] SMP KASAN
	CPU: 0 PID: 18486 Comm: syz-executor3 Not tainted 4.18.0-rc3+ #136
	Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google
	01/01/2011
	RIP: 0010:zap_pmd_range mm/memory.c:1421 [inline]
	RIP: 0010:zap_pud_range mm/memory.c:1466 [inline]
	RIP: 0010:zap_p4d_range mm/memory.c:1487 [inline]
	RIP: 0010:unmap_page_range+0x1c18/0x2220 mm/memory.c:1508
	Call Trace:
	 unmap_single_vma+0x1a0/0x310 mm/memory.c:1553
	 zap_page_range_single+0x3cc/0x580 mm/memory.c:1644
	 unmap_mapping_range_vma mm/memory.c:2792 [inline]
	 unmap_mapping_range_tree mm/memory.c:2813 [inline]
	 unmap_mapping_pages+0x3a7/0x5b0 mm/memory.c:2845
	 unmap_mapping_range+0x48/0x60 mm/memory.c:2880
	 truncate_pagecache+0x54/0x90 mm/truncate.c:800
	 truncate_setsize+0x70/0xb0 mm/truncate.c:826
	 simple_setattr+0xe9/0x110 fs/libfs.c:409
	 notify_change+0xf13/0x10f0 fs/attr.c:335
	 do_truncate+0x1ac/0x2b0 fs/open.c:63
	 do_sys_ftruncate+0x492/0x560 fs/open.c:205
	 __do_sys_ftruncate fs/open.c:215 [inline]
	 __se_sys_ftruncate fs/open.c:213 [inline]
	 __x64_sys_ftruncate+0x59/0x80 fs/open.c:213
	 do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
	 entry_SYSCALL_64_after_hwframe+0x49/0xbe

Reproducer:

	#include <stdio.h>
	#include <stddef.h>
	#include <stdint.h>
	#include <stdlib.h>
	#include <string.h>
	#include <sys/types.h>
	#include <sys/stat.h>
	#include <sys/ioctl.h>
	#include <sys/mman.h>
	#include <unistd.h>
	#include <fcntl.h>

	#define KCOV_INIT_TRACE			_IOR('c', 1, unsigned long)
	#define KCOV_ENABLE			_IO('c', 100)
	#define KCOV_DISABLE			_IO('c', 101)
	#define COVER_SIZE			(1024<<10)

	#define KCOV_TRACE_PC  0
	#define KCOV_TRACE_CMP 1

	int main(int argc, char **argv)
	{
		int fd;
		unsigned long *cover;

		system("mount -t debugfs none /sys/kernel/debug");
		fd = open("/sys/kernel/debug/kcov", O_RDWR);
		ioctl(fd, KCOV_INIT_TRACE, COVER_SIZE);
		cover = mmap(NULL, COVER_SIZE * sizeof(unsigned long),
				PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
		munmap(cover, COVER_SIZE * sizeof(unsigned long));
		cover = mmap(NULL, COVER_SIZE * sizeof(unsigned long),
				PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
		memset(cover, 0, COVER_SIZE * sizeof(unsigned long));
		ftruncate(fd, 3UL << 20);
		return 0;
	}

This can be fixed by assigning anonymous VMAs own vm_ops and not relying
on it being NULL.

If ->mmap() failed to set ->vm_ops, mmap_region() will set it to
dummy_vm_ops.  This way we will have non-NULL ->vm_ops for all VMAs.

Link: http://lkml.kernel.org/r/20180724121139.62570-4-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reported-by: syzbot+3f84280d52be9b7083cc@syzkaller.appspotmail.com
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-07-26 19:38:03 -07:00
Kirill A. Shutemov
2c4541e24c mm: use vma_init() to initialize VMAs on stack and data segments
Make sure to initialize all VMAs properly, not only those which come
from vm_area_cachep.

Link: http://lkml.kernel.org/r/20180724121139.62570-3-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-07-26 19:38:03 -07:00
Lance Shelton
a61246c961 Fix error code in nfs_lookup_verify_inode()
Return -ESTALE to force a lookup when the file has no more links

Signed-off-by: Lance Shelton <lance.shelton@hammerspace.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-07-26 16:25:25 -04:00
Trond Myklebust
3825827ebf NFS: More excessive attribute revalidation in nfs_execute_ok()
execute_ok() will only check the mode bits if the object is not a
directory, so we don't need to revalidate the attributes in that case.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-07-26 16:25:25 -04:00
Trond Myklebust
cf8340277f NFS: Fix excessive attribute revalidation in nfs_execute_ok()
When nfs_update_inode() sets NFS_INO_INVALID_ACCESS it is a sign that
we want to revalidate the access cache, not the inode attributes.
In fact we only want to revalidate here if we see that the mode bits
are invalid, so check for NFS_INO_INVALID_OTHER instead.

Reported-by: Olga Kornievskaia <aglo@umich.edu>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-07-26 16:25:25 -04:00
Trond Myklebust
7be7b3ca16 NFS: Ensure we immediately start writeback on rescheduled writes
If the writes are being rescheduled due to a pNFS error, then we really
want to immediately start a new flush. The O_DIRECT code already does
this, so we only need to worry about buffered writes.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-07-26 16:25:25 -04:00
Trond Myklebust
bd3d16a887 NFSv4.1: Fix a potential layoutget/layoutrecall deadlock
If the client is sending a layoutget, but the server issues a callback
to recall what it thinks may be an outstanding layout, then we may find
an uninitialised layout attached to the inode due to the layoutget.
In that case, it is appropriate to return NFS4ERR_NOMATCHING_LAYOUT
rather than NFS4ERR_DELAY, as the latter can end up deadlocking.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-07-26 16:25:25 -04:00
Trond Myklebust
af9b6d7570 pNFS: Parse the results of layoutget on open even if permissions checks fail
Even if the results of the permissions checks failed, we should parse
the results of the layout on open call so that we can return the
layout if required.
Note that we also want to ignore the sequence counter for whether or not
a layout recall occurred. If the recall pertained to our OPEN, then the
callback will know, and will attempt to wait for us to finih processing
anyway.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-07-26 16:25:25 -04:00
Trond Myklebust
b2b1ff3da6 NFS: Allow optimisation of lseek(fd, SEEK_CUR, 0) on directories
There should be no need to grab the inode lock if we're only reading
the file offset.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-07-26 16:25:25 -04:00
Trond Myklebust
411ae722d1 pNFS: Wait for stale layoutget calls to complete in pnfs_update_layout()
If the old layout was recalled, and we returned NFS4ERR_NOMATCHINGLAYOUT
then we need to wait for all outstanding layoutget calls to complete
before we can send a new one.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-07-26 16:25:25 -04:00
Trond Myklebust
056f9ad62e pNFS/flexfiles: Ensure we always return a layout if it has layoutstats
If a layout segment is carrying layoutstats or layout error information,
then we always want to return it rather than using a forgetful model.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-07-26 16:25:25 -04:00
Trond Myklebust
f0b429819b pNFS: Ignore non-recalled layouts in pnfs_layout_need_return()
If a layout has been recalled, then we should fire off a layoutreturn as
soon as all the layout segments that match the recall have been retired.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-07-26 16:25:25 -04:00
Trond Myklebust
00bcbe119f pNFS: Don't update the stateid when replying NFS4ERR_DELAY to a layout recall
RFC5661 doesn't state directly that the client should update the layout
stateid if it returns NFS4ERR_NOMATCHING_LAYOUT in response to a recall,
however it does state that this error will "cleanly indicate completion"
on par with returning the layout. For this reason, we assume that the
client should update the layout stateid. The Linux pNFS server definitely
does expect this behaviour.

However, if the client replies NFS4ERR_DELAY, then it is stating that
the recall was not processed, so it would be very wrong to update the
layout stateid.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-07-26 16:25:24 -04:00
Trond Myklebust
e0b7d420f7 pNFS: Don't discard layout segments that are marked for return
If there are layout segments that are marked for return, then we need
to ensure that pnfs_mark_matching_lsegs_return() does not just
silently discard them, but it should tell the caller that there is a
layoutreturn scheduled.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-07-26 16:25:24 -04:00
Bob Peterson
3f30f929bb gfs2: cleanup: call gfs2_rgrp_ondisk2lvb from gfs2_rgrp_out
Before this patch gfs2_rgrp_ondisk2lvb was called after every call
to gfs2_rgrp_out. This patch just calls it directly from within
gfs2_rgrp_out, and moves the function to be before it so we don't
need a function prototype.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com>
2018-07-26 14:49:43 -05:00
Martin Wilck
9362dd1109 blkdev: __blkdev_direct_IO_simple: fix leak in error case
Fixes: 72ecad22d9 ("block: support a full bio worth of IO for simplified bdev direct-io")
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Martin Wilck <mwilck@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-26 11:52:33 -06:00
Eric Sandeen
1c02d502c2 xfs: remove deprecated barrier/nobarrier mount
The barrier mount options have been no-ops and deprecated since

4cf4573 xfs: deprecate barrier/nobarrier mount option

i.e. kernel 4.10 / December 2016, with a stated deprecation schedule
after v4.15.  Should be fair game to remove them now.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-26 10:15:17 -07:00
Darrick J. Wong
44a8736bd2 xfs: clean up IRELE/iput callsites
Replace the IRELE macro with a proper function so that we can do proper
typechecking and so that we can stop open-coding iput in scrub, which
means that we'll be able to ftrace inode lifetimes going through scrub
correctly.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2018-07-26 10:15:16 -07:00
Darrick J. Wong
89c3e8cf3c xfs: kill IHOLD
Nobody uses this macro, get rid of it.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
2018-07-26 10:15:16 -07:00
Brian Foster
b277c37f43 xfs: bypass final dfops roll in trans commit path
Once xfs_defer_finish() has completed all deferred operations, it
checks the dirty state of the transaction and rolls it once more to
return a clean transaction for the caller. This primarily to cover
the case where repeated xfs_defer_finish() calls are made in a loop
and we need to make sure that the caller starts the next iteration
with a clean transaction. Otherwise we risk transaction reservation
overrun.

This final transaction roll is not required in the transaction
commit path, however, because the transaction is immediately
committed and freed after dfops completion. Refactor the final roll
into a separate helper such that we can avoid it in the transaction
commit path.  Lift the dfops reset as well so dfops remains valid
until after the last call to xfs_defer_trans_roll(). The reset is
also unnecessary in the transaction commit path because the
transaction is about to complete.

This eliminates unnecessary regrants of transactions where the
associated transaction roll can be replaced by a transaction commit.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Bill O'Donnell <billodo@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-26 10:15:16 -07:00
Brian Foster
9e28a242be xfs: drop unnecessary xfs_defer_finish() dfops parameter
Every caller of xfs_defer_finish() now passes the transaction and
its associated ->t_dfops. The xfs_defer_ops parameter is therefore
no longer necessary and can be removed.

Since most xfs_defer_finish() callers also have to consider
xfs_defer_cancel() on error, update the latter to also receive the
transaction for consistency. The log recovery code contains an
outlier case that cancels a dfops directly without an available
transaction. Retain an internal wrapper to support this outlier case
for the time being.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Bill O'Donnell <billodo@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-26 10:15:16 -07:00
Brian Foster
d5cca7eb24 xfs: remove unnecessary dfops init calls in xattr code
Each xfs_defer_init() call in the xattr code uses the internal dfops
reference. In addition, a successful xfs_defer_finish() always
returns with a reset xfs_defer_ops structure.

Given that along with the fact that every xfs_defer_init() call in
the xattr code is followed up by an xfs_defer_finish(), the former
calls are no longer necessary and can be removed.

Note that the xfs_defer_init() call in the remote value copy loop of
xfs_attr_rmtval_set() is not followed by a finish, but the dfops is
unused in this instance.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Bill O'Donnell <billodo@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-26 10:15:15 -07:00
Brian Foster
c8eac49ef7 xfs: remove all boilerplate defer init/finish code
At this point, the transaction subsystem completely manages deferred
items internally such that the common and boilerplate
xfs_trans_alloc() -> xfs_defer_init() -> xfs_defer_finish() ->
xfs_trans_commit() sequence can be replaced with a simple
transaction allocation and commit.

Remove all such boilerplate deferred ops code. In doing so, we
change each case over to use the dfops in the transaction and
specifically eliminate:

- The on-stack dfops and associated xfs_defer_init() call, as the
  internal dfops is initialized on transaction allocation.
- xfs_bmap_finish() calls that precede a final xfs_trans_commit() of
  a transaction.
- xfs_defer_cancel() calls in error handlers that precede a
  transaction cancel.

The only deferred ops calls that remain are those that are
non-deterministic with respect to the final commit of the associated
transaction or are open-coded due to special handling.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Bill O'Donnell <billodo@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-26 10:15:15 -07:00
Brian Foster
91ef75b657 xfs: use internal dfops during [b|c]ui recovery
bmap and refcount intent processing associates a dfops from the
caller with a local transaction to collect all deferred items for
post-processing. Use the internal dfops in both of these functions
and move the deferred items to the parent dfops before the
transaction commits.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Bill O'Donnell <billodo@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-26 10:15:15 -07:00
Brian Foster
9c6bb0cf7b xfs: use internal dfops in attr code
Remove the unnecessary on-stack dfops structure and use the internal
transaction dfops instead. The lower level xattr code already
appropriately accesses ->t_dfops throughout.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Bill O'Donnell <billodo@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-26 10:15:14 -07:00
Brian Foster
1e5ae1995a xfs: use internal dfops in cow blocks cancel
All callers either explicitly initialize a dfops or pass a
transaction with an internal dfops. Drop the hacky old dfops
replacement logic and use the one associated with the transaction.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Bill O'Donnell <billodo@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-26 10:15:14 -07:00
Brian Foster
e021a2e5fc xfs: support embedded dfops in transaction
The dfops structure used by multi-transaction operations is
typically stored on the stack and carried around by the associated
transaction. The lifecycle of dfops does not quite match that of the
transaction, but they are tightly related in that the former depends
on the latter.

The relationship of these objects is tight enough that we can avoid
the cumbersome boilerplate code required in most cases to manage
them separately by just embedding an xfs_defer_ops in the
transaction itself. This means that a transaction allocation returns
with an initialized dfops, a transaction commit finishes pending
deferred items before the tx commit, a transaction cancel cancels
the dfops before the transaction and a transaction dup operation
transfers the current dfops state to the new transaction.

The dup operation is slightly complicated by the fact that we can no
longer just copy a dfops pointer from the old transaction to the new
transaction. This is solved through a dfops move helper that
transfers the pending items and other dfops state across the
transactions. This also requires that transaction rolling code
always refer to the transaction for the current dfops reference.

Finally, to facilitate incremental conversion to the internal dfops
and continue to support the current external dfops mode of
operation, create the new ->t_dfops_internal field with a layer of
indirection. On allocation, ->t_dfops points to the internal dfops.
This state is overridden by callers who re-init a local dfops on the
transaction. Once ->t_dfops is overridden, the external dfops
reference is maintained as the transaction rolls.

This patch adds the fundamental ability to support an internal
dfops. All codepaths that perform deferred processing continue to
override the internal dfops until they are converted over in
subsequent patches.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Bill O'Donnell <billodo@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-26 10:15:14 -07:00
Brian Foster
44fd294681 xfs: pack holes in xfs_defer_ops and xfs_trans
Both structures have holes due to member alignment. Move dop_low to
the end of xfs_defer ops to sanitize the cache line alignment and
move t_flags to save 8 bytes in xfs_trans.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bill O'Donnell <billodo@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-26 10:15:13 -07:00
Brian Foster
509308b413 xfs: reset dfops to initial state after finish
xfs_defer_init() is currently used in two particular situations. The
first and most obvious case is raw initialization of an
xfs_defer_ops struct. The other case is partial reinit of
xfs_defer_ops on reuse due to iteration.

Most instances of the first case will be replaced by a single init
of a dfops embedded in the transaction. Init calls are still
technically required for the second case because the dfops may have
low space mode enabled or have joined items that need to be reset
before the dfops should be reused.

Since the current dfops usage expects either a final transaction
commit after xfs_defer_finish() or xfs_defer_init() if dfops is to
be reused, we can shift some of the init logic into
xfs_defer_finish() such that the latter returns with a reinitialized
dfops. This eliminates the second dependency noted above such that a
dfops is immediately ready for reuse after an xfs_defer_finish()
without the need to change any calling code.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bill O'Donnell <billodo@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-26 10:15:13 -07:00
Brian Foster
83200bfac6 xfs: remove unused deferred ops committed field
dop_committed is set when deferred item processing rolls the
transaction at least once, but is only ever accessed in tracepoints.
The transaction roll/commit events are already available via
independent tracepoints, so remove the otherwise unused field.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bill O'Donnell <billodo@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-26 10:15:13 -07:00
Brian Foster
03f4e4b26c xfs: make deferred processing safe for embedded dfops
xfs_defer_finish() has a couple quirks that are not safe with
respect to the upcoming internal dfops functionality. First,
xfs_defer_finish() attaches the passed in dfops structure to
->t_dfops and caches and restores the original value. Second, it
continues to use the initial dfops reference before and after the
transaction roll.

These behaviors assume that dop is an independent memory allocation
from the transaction itself, which may not always be true once
transactions begin to use an embedded dfops structure. In the latter
model, dfops processing creates a new xfs_defer_ops structure with
each transaction and the associated state is migrated across to the
new transaction.

Fix up xfs_defer_finish() to handle the possibility of the current
dfops changing after a transaction roll. Since ->t_dfops is used
unconditionally in this path, it is no longer necessary to
attach/restore ->t_dfops and pass it explicitly down to
xfs_defer_trans_roll(). Update dop in the latter function and the
caller to ensure that it always refers to the current dfops
structure.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bill O'Donnell <billodo@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-26 10:15:12 -07:00
Brian Foster
dcbd44f799 xfs: fix transaction leak on remote attr set/remove failure
The xattr remote value set/remove handlers both clear args.trans in
the error path without having cancelled the transaction. This leaks
the transaction, causes warnings around returning to userspace with
locks held and leads to system lockups or other general problems.

The higher level xfs_attr_[set|remove]() functions already detect
and cancel args.trans when set in the error path. Drop the NULL
assignments from the rmtval handlers and allow the callers to clean
up the transaction correctly.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bill O'Donnell <billodo@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-26 10:15:12 -07:00
Brian Foster
a61acc3c78 xfs: use ->t_dfops in log recovery intent processing
xlog_finish_defer_ops() processes the deferred operations collected
over the entire intent recovery sequence. We can't xfs_defer_init()
here because the dfops is already populated. Attach it manually and
eliminate the last caller of xfs_defer_finish() that doesn't pass
->t_dfops.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bill O'Donnell <billodo@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-26 10:15:12 -07:00
Brian Foster
02dff7bf81 xfs: pull up dfops from xfs_itruncate_extents()
xfs_itruncate_extents[_flags]() uses a local dfops with a
transaction provided by the caller. It uses hacky ->t_dfops
replacement logic to avoid stomping over an already populated
->t_dfops.

The latter never occurs for current callers and the logic itself is
not really appropriate. Clean this up by updating all callers to
initialize a dfops and to use that down in xfs_itruncate_extents().
This more closely resembles the upcoming logic where dfops will be
embedded within the transaction. We can also replace the
xfs_defer_init() in the xfs_itruncate_extents_flags() loop with an
assert. Both dfops and firstblock should be in a valid state
after xfs_defer_finish() and the inode joined to the dfops is fixed
throughout the loop.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bill O'Donnell <billodo@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-26 10:15:12 -07:00
Andrey Ryabinin
9635453572 fuse: reduce allocation size for splice_write
The 'bufs' array contains 'pipe->buffers' elements, but the
fuse_dev_splice_write() uses only 'pipe->nrbufs' elements.

So reduce the allocation size to 'pipe->nrbufs' elements.

Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-07-26 16:13:12 +02:00
Andrey Ryabinin
d6d931adce fuse: use kvmalloc to allocate array of pipe_buffer structs.
The amount of pipe->buffers is basically controlled by userspace by
fcntl(... F_SETPIPE_SZ ...) so it could be large. High order allocations
could be slow (if memory is heavily fragmented) or may fail if the order
is larger than PAGE_ALLOC_COSTLY_ORDER.

Since the 'bufs' doesn't need to be physically contiguous, use
the kvmalloc_array() to allocate memory. If high order
page isn't available, the kvamalloc*() will fallback to 0-order.

Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-07-26 16:13:12 +02:00
Arnd Bergmann
a64ba10f65 fuse: convert last timespec use to timespec64
All of fuse uses 64-bit timestamps with the exception of the
fuse_change_attributes(), so let's convert this one as well.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-07-26 16:13:12 +02:00
Souptick Joarder
46fb504a71 fs: fuse: Adding new return type vm_fault_t
Use new return type vm_fault_t for fault handler in struct
vm_operations_struct.  For now, this is just documenting that the function
returns a VM_FAULT value rather than an errno.  Once all instances are
converted, vm_fault_t will become a distinct type.

commit 1c8f422059 ("mm: change return type to vm_fault_t")

Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-07-26 16:13:12 +02:00
Miklos Szeredi
75f3ee4c28 fuse: simplify fuse_abort_conn()
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-07-26 16:13:12 +02:00
Kirill Tkhai
109728ccc5 fuse: Add missed unlock_page() to fuse_readpages_fill()
The above error path returns with page unlocked, so this place seems also
to behave the same.

Fixes: f8dbdf8182 ("fuse: rework fuse_readpages()")
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-07-26 16:13:12 +02:00
Andrey Ryabinin
a2477b0e67 fuse: Don't access pipe->buffers without pipe_lock()
fuse_dev_splice_write() reads pipe->buffers to determine the size of
'bufs' array before taking the pipe_lock(). This is not safe as
another thread might change the 'pipe->buffers' between the allocation
and taking the pipe_lock(). So we end up with too small 'bufs' array.

Move the bufs allocations inside pipe_lock()/pipe_unlock() to fix this.

Fixes: dd3bb14f44 ("fuse: support splice() writing to fuse device")
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: <stable@vger.kernel.org> # v2.6.35
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-07-26 16:13:11 +02:00
Miklos Szeredi
63576c13bd fuse: fix initial parallel dirops
If parallel dirops are enabled in FUSE_INIT reply, then first operation may
leave fi->mutex held.

Reported-by: syzbot <syzbot+3f7b29af1baa9d0a55be@syzkaller.appspotmail.com>
Fixes: 5c672ab3f0 ("fuse: serialize dirops by default")
Cc: <stable@vger.kernel.org> # v4.7
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-07-26 16:13:11 +02:00
Miklos Szeredi
e8f3bd773d fuse: Fix oops at process_init_reply()
syzbot is hitting NULL pointer dereference at process_init_reply().
This is because deactivate_locked_super() is called before response for
initial request is processed.

Fix this by aborting and waiting for all requests (including FUSE_INIT)
before resetting fc->sb.

Original patch by Tetsuo Handa <penguin-kernel@I-love.SKAURA.ne.jp>.

Reported-by: syzbot <syzbot+b62f08f4d5857755e3bc@syzkaller.appspotmail.com>
Fixes: e27c9d3877 ("fuse: fuse: add time_gran to INIT_OUT")
Cc: <stable@vger.kernel.org> # v3.19
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-07-26 16:13:11 +02:00
Miklos Szeredi
b8f95e5d13 fuse: umount should wait for all requests
fuse_abort_conn() does not guarantee that all async requests have actually
finished aborting (i.e. their ->end() function is called).  This could
actually result in still used inodes after umount.

Add a helper to wait until all requests are fully done.  This is done by
looking at the "num_waiting" counter.  When this counter drops to zero, we
can be sure that no more requests are outstanding.

Fixes: 0d8e84b043 ("fuse: simplify request abort")
Cc: <stable@vger.kernel.org> # v4.2
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-07-26 16:13:11 +02:00
Miklos Szeredi
45ff350bbd fuse: fix unlocked access to processing queue
fuse_dev_release() assumes that it's the only one referencing the
fpq->processing list, but that's not true, since fuse_abort_conn() can be
doing the same without any serialization between the two.

Fixes: c3696046be ("fuse: separate pqueue for clones")
Cc: <stable@vger.kernel.org> # v4.2
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-07-26 16:13:11 +02:00
Miklos Szeredi
87114373ea fuse: fix double request_end()
Refcounting of request is broken when fuse_abort_conn() is called and
request is on the fpq->io list:

 - ref is taken too late
 - then it is not dropped

Fixes: 0d8e84b043 ("fuse: simplify request abort")
Cc: <stable@vger.kernel.org> # v4.2
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-07-26 16:13:11 +02:00
Andreas Gruenbacher
776125785a gfs2: Special-case rindex for gfs2_grow
To speed up the common case of appending to a file,
gfs2_write_alloc_required presumes that writing beyond the end of a file
will always require additional blocks to be allocated.  This assumption
is incorrect for preallocates files, but there are no negative
consequences as long as *some* space is still left on the filesystem.

One special file that always has some space preallocated beyond the end
of the file is the rindex: when growing a filesystem, gfs2_grow adds one
or more new resource groups and appends records describing those
resource groups to the rindex; the preallocated space ensures that this
is always possible.

However, when a filesystem is completely full, gfs2_write_alloc_required
will indicate that an additional allocation is required, and appending
the next record to the rindex will fail even though space for that
record has already been preallocated.  To fix that, skip the incorrect
optimization in gfs2_write_alloc_required, but for the rindex only.
Other writes to preallocated space beyond the end of the file are still
allowed to fail on completely full filesystems.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Bob Peterson <rpeterso@redhat.com>
2018-07-25 22:56:14 +02:00
Linus Torvalds
5c61ef1b7c fscache fixes
-----BEGIN PGP SIGNATURE-----
 
 iQIVAwUAW1h/7Pu3V2unywtrAQJFOg/+NQswGJcTsJGeTL8tW+8nGtJVeTP6XaVh
 xPEjdqTBmimt6ciaNP1LxLLt9jQ50S1f83rWGZeFBQoNgWinoe3VtzSVdlQKhZcT
 jC5LtFVlTxw5rrL4uCsJywLLjD0NeH41ISbvCStcyYJExOZr+f4/VJXKNcwKjAvf
 kD1xDGnVZsZiGLWFjwBVaPJwFigquoLEU564InMnZbvMW95uZOPGfnwxAGmKQX2n
 BV3WxVizCc0MwlHMJYjs0cVMZNviuC+qg7YBJIoio3+Dq8FIn7ISn98LbhCpG7mi
 FoiRi+7xs7VCGm9yqtkXL+euHcSzjnJPnlYxpU8xGqAay0qKxoHecZj2iMEX327K
 E4mujQ40oqkMLhwy3GhT9cIpvbbQPu7+kS+k9x7UqVnzlhsKEeMp7TEFqSEebO9H
 kIuvfRBD3uZY0B/loLCB3Cc/B9OoWAUi6IGBRclwS9+RUuBnZY/jb7iQsEvcOv9u
 0EC0biSs1jizG1tLR0LmvjIyvS567t/DG/peLad1lOqPe6Up2mO4XIeyS9phwXAD
 ryupnKPr3tGRgvfJ4jLUZPC8/nrv5Fg9R0YhICEEBqhwKn1uTZgyc085d2EHOdQp
 fmfbXR/oz4TwjjwlgzrLMQbLB7GUpkgCFxsIPpEVeFH3RDbZNO1UbLovDMUuRaos
 lFWHzd4K4XA=
 =yuI9
 -----END PGP SIGNATURE-----

Merge tag 'fscache-fixes-20180725' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs

Pull fscache/cachefiles fixes from David Howells:

 - Allow cancelled operations to be queued so they can be cleaned up.

 - Fix a refcounting bug in the monitoring of reads on backend files
   whereby a race can occur between monitor objects being listed for
   work, the work processing being queued and the work processor running
   and destroying the monitor objects.

 - Fix a ref overput in object attachment, whereby a tentatively
   considered object is put in error handling without first being 'got'.

 - Fix a missing clear of the CACHEFILES_OBJECT_ACTIVE flag whereby an
   assertion occurs when we retry because it seems the object is now
   active.

 - Wait rather BUG'ing on an object collision in the depths of
   cachefiles as the active object should be being cleaned up - also
   depends on the one above.

* tag 'fscache-fixes-20180725' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
  cachefiles: Wait rather than BUG'ing on "Unexpected object collision"
  cachefiles: Fix missing clear of the CACHEFILES_OBJECT_ACTIVE flag
  fscache: Fix reference overput in fscache_attach_object() error handling
  cachefiles: Fix refcounting bug in backing-file read monitoring
  fscache: Allow cancelled operations to be enqueued
2018-07-25 10:55:24 -07:00
Kiran Kumar Modukuri
c2412ac45a cachefiles: Wait rather than BUG'ing on "Unexpected object collision"
If we meet a conflicting object that is marked FSCACHE_OBJECT_IS_LIVE in
the active object tree, we have been emitting a BUG after logging
information about it and the new object.

Instead, we should wait for the CACHEFILES_OBJECT_ACTIVE flag to be cleared
on the old object (or return an error).  The ACTIVE flag should be cleared
after it has been removed from the active object tree.  A timeout of 60s is
used in the wait, so we shouldn't be able to get stuck there.

Fixes: 9ae326a690 ("CacheFiles: A cache that backs onto a mounted filesystem")
Signed-off-by: Kiran Kumar Modukuri <kiran.modukuri@gmail.com>
Signed-off-by: David Howells <dhowells@redhat.com>
2018-07-25 14:49:00 +01:00
Kiran Kumar Modukuri
5ce83d4bb7 cachefiles: Fix missing clear of the CACHEFILES_OBJECT_ACTIVE flag
In cachefiles_mark_object_active(), the new object is marked active and
then we try to add it to the active object tree.  If a conflicting object
is already present, we want to wait for that to go away.  After the wait,
we go round again and try to re-mark the object as being active - but it's
already marked active from the first time we went through and a BUG is
issued.

Fix this by clearing the CACHEFILES_OBJECT_ACTIVE flag before we try again.

Analysis from Kiran Kumar Modukuri:

[Impact]
Oops during heavy NFS + FSCache + Cachefiles

CacheFiles: Error: Overlong wait for old active object to go away.

BUG: unable to handle kernel NULL pointer dereference at 0000000000000002

CacheFiles: Error: Object already active kernel BUG at
fs/cachefiles/namei.c:163!

[Cause]
In a heavily loaded system with big files being read and truncated, an
fscache object for a cookie is being dropped and a new object being
looked. The new object being looked for has to wait for the old object
to go away before the new object is moved to active state.

[Fix]
Clear the flag 'CACHEFILES_OBJECT_ACTIVE' for the new object when
retrying the object lookup.

[Testcase]
Have run ~100 hours of NFS stress tests and have not seen this bug recur.

[Regression Potential]
 - Limited to fscache/cachefiles.

Fixes: 9ae326a690 ("CacheFiles: A cache that backs onto a mounted filesystem")
Signed-off-by: Kiran Kumar Modukuri <kiran.modukuri@gmail.com>
Signed-off-by: David Howells <dhowells@redhat.com>
2018-07-25 14:49:00 +01:00
Kiran Kumar Modukuri
f29507ce66 fscache: Fix reference overput in fscache_attach_object() error handling
When a cookie is allocated that causes fscache_object structs to be
allocated, those objects are initialised with the cookie pointer, but
aren't blessed with a ref on that cookie unless the attachment is
successfully completed in fscache_attach_object().

If attachment fails because the parent object was dying or there was a
collision, fscache_attach_object() returns without incrementing the cookie
counter - but upon failure of this function, the object is released which
then puts the cookie, whether or not a ref was taken on the cookie.

Fix this by taking a ref on the cookie when it is assigned in
fscache_object_init(), even when we're creating a root object.


Analysis from Kiran Kumar:

This bug has been seen in 4.4.0-124-generic #148-Ubuntu kernel

BugLink: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1776277

fscache cookie ref count updated incorrectly during fscache object
allocation resulting in following Oops.

kernel BUG at /build/linux-Y09MKI/linux-4.4.0/fs/fscache/internal.h:321!
kernel BUG at /build/linux-Y09MKI/linux-4.4.0/fs/fscache/cookie.c:639!

[Cause]
Two threads are trying to do operate on a cookie and two objects.

(1) One thread tries to unmount the filesystem and in process goes over a
    huge list of objects marking them dead and deleting the objects.
    cookie->usage is also decremented in following path:

      nfs_fscache_release_super_cookie
       -> __fscache_relinquish_cookie
        ->__fscache_cookie_put
        ->BUG_ON(atomic_read(&cookie->usage) <= 0);

(2) A second thread tries to lookup an object for reading data in following
    path:

    fscache_alloc_object
    1) cachefiles_alloc_object
        -> fscache_object_init
           -> assign cookie, but usage not bumped.
    2) fscache_attach_object -> fails in cant_attach_object because the
         cookie's backing object or cookie's->parent object are going away
    3) fscache_put_object
        -> cachefiles_put_object
          ->fscache_object_destroy
            ->fscache_cookie_put
               ->BUG_ON(atomic_read(&cookie->usage) <= 0);

[NOTE from dhowells] It's unclear as to the circumstances in which (2) can
take place, given that thread (1) is in nfs_kill_super(), however a
conflicting NFS mount with slightly different parameters that creates a
different superblock would do it.  A backtrace from Kiran seems to show
that this is a possibility:

    kernel BUG at/build/linux-Y09MKI/linux-4.4.0/fs/fscache/cookie.c:639!
    ...
    RIP: __fscache_cookie_put+0x3a/0x40 [fscache]
    Call Trace:
     __fscache_relinquish_cookie+0x87/0x120 [fscache]
     nfs_fscache_release_super_cookie+0x2d/0xb0 [nfs]
     nfs_kill_super+0x29/0x40 [nfs]
     deactivate_locked_super+0x48/0x80
     deactivate_super+0x5c/0x60
     cleanup_mnt+0x3f/0x90
     __cleanup_mnt+0x12/0x20
     task_work_run+0x86/0xb0
     exit_to_usermode_loop+0xc2/0xd0
     syscall_return_slowpath+0x4e/0x60
     int_ret_from_sys_call+0x25/0x9f

[Fix] Bump up the cookie usage in fscache_object_init, when it is first
being assigned a cookie atomically such that the cookie is added and bumped
up if its refcount is not zero.  Remove the assignment in
fscache_attach_object().

[Testcase]
I have run ~100 hours of NFS stress tests and not seen this bug recur.

[Regression Potential]
 - Limited to fscache/cachefiles.

Fixes: ccc4fc3d11 ("FS-Cache: Implement the cookie management part of the netfs API")
Signed-off-by: Kiran Kumar Modukuri <kiran.modukuri@gmail.com>
Signed-off-by: David Howells <dhowells@redhat.com>
2018-07-25 14:49:00 +01:00
Kiran Kumar Modukuri
934140ab02 cachefiles: Fix refcounting bug in backing-file read monitoring
cachefiles_read_waiter() has the right to access a 'monitor' object by
virtue of being called under the waitqueue lock for one of the pages in its
purview.  However, it has no ref on that monitor object or on the
associated operation.

What it is allowed to do is to move the monitor object to the operation's
to_do list, but once it drops the work_lock, it's actually no longer
permitted to access that object.  However, it is trying to enqueue the
retrieval operation for processing - but it can only do this via a pointer
in the monitor object, something it shouldn't be doing.

If it doesn't enqueue the operation, the operation may not get processed.
If the order is flipped so that the enqueue is first, then it's possible
for the work processor to look at the to_do list before the monitor is
enqueued upon it.

Fix this by getting a ref on the operation so that we can trust that it
will still be there once we've added the monitor to the to_do list and
dropped the work_lock.  The op can then be enqueued after the lock is
dropped.

The bug can manifest in one of a couple of ways.  The first manifestation
looks like:

 FS-Cache:
 FS-Cache: Assertion failed
 FS-Cache: 6 == 5 is false
 ------------[ cut here ]------------
 kernel BUG at fs/fscache/operation.c:494!
 RIP: 0010:fscache_put_operation+0x1e3/0x1f0
 ...
 fscache_op_work_func+0x26/0x50
 process_one_work+0x131/0x290
 worker_thread+0x45/0x360
 kthread+0xf8/0x130
 ? create_worker+0x190/0x190
 ? kthread_cancel_work_sync+0x10/0x10
 ret_from_fork+0x1f/0x30

This is due to the operation being in the DEAD state (6) rather than
INITIALISED, COMPLETE or CANCELLED (5) because it's already passed through
fscache_put_operation().

The bug can also manifest like the following:

 kernel BUG at fs/fscache/operation.c:69!
 ...
    [exception RIP: fscache_enqueue_operation+246]
 ...
 #7 [ffff883fff083c10] fscache_enqueue_operation at ffffffffa0b793c6
 #8 [ffff883fff083c28] cachefiles_read_waiter at ffffffffa0b15a48
 #9 [ffff883fff083c48] __wake_up_common at ffffffff810af028

I'm not entirely certain as to which is line 69 in Lei's kernel, so I'm not
entirely clear which assertion failed.

Fixes: 9ae326a690 ("CacheFiles: A cache that backs onto a mounted filesystem")
Reported-by: Lei Xue <carmark.dlut@gmail.com>
Reported-by: Vegard Nossum <vegard.nossum@gmail.com>
Reported-by: Anthony DeRobertis <aderobertis@metrics.net>
Reported-by: NeilBrown <neilb@suse.com>
Reported-by: Daniel Axtens <dja@axtens.net>
Reported-by: Kiran Kumar Modukuri <kiran.modukuri@gmail.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Daniel Axtens <dja@axtens.net>
2018-07-25 14:49:00 +01:00
Kiran Kumar Modukuri
d0eb06afe7 fscache: Allow cancelled operations to be enqueued
Alter the state-check assertion in fscache_enqueue_operation() to allow
cancelled operations to be given processing time so they can be cleaned up.

Also fix a debugging statement that was requiring such operations to have
an object assigned.

Fixes: 9ae326a690 ("CacheFiles: A cache that backs onto a mounted filesystem")
Reported-by: Kiran Kumar Modukuri <kiran.modukuri@gmail.com>
Signed-off-by: David Howells <dhowells@redhat.com>
2018-07-25 14:31:20 +01:00
David S. Miller
19725496da Merge ra.kernel.org:/pub/scm/linux/kernel/git/davem/net 2018-07-24 19:21:58 -07:00
Bob Peterson
f6753df35c GFS2: rgrp free blocks used incorrectly
Before this patch, several functions in rgrp.c checked the value of
rgd->rd_free_clone. That does not take into account blocks that were
reserved by a multi-block reservation. This causes a problem when
space gets tight in the file system. For example, when function
gfs2_inplace_reserve checks to see if a rgrp has enough blocks to
satisfy the request, it can accept a rgrp that it should reject
because, although there are enough blocks to satisfy the request
_now_, those blocks may be reserved for another running process.

A second problem with this occurs when we've reserved the remaining
blocks in an rgrp: function rg_mblk_search() can reject an rgrp
improperly because it calculates:

   u32 free_blocks = rgd->rd_free_clone - rgd->rd_reserved;

But rd_reserved includes blocks that the current process just
reserved in its own call to inplace_reserve. For example, it can
reserve the last 128 blocks of an rgrp, then reject that same rgrp
because the above calculates out to free_blocks = 0;

Consequences include, but are not limited to, (1) leaving holes,
and thus increasing file system fragmentation, and (2) reporting
file system is full long before it actually is.

This patch introduces a new function, rgd_free, which returns the
number of clone-free blocks (blocks that are truly free as opposed
to blocks that are still being used because an unlinked file is
still open) minus the number of blocks reserved by processes, but
not counting the blocks we ourselves reserved (because obviously
we need to allocate them).

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2018-07-25 00:09:09 +02:00
Colin Ian King
d1b0cb933c gfs2: remove redundant variable 'moved'
Variable 'moved' s being assigned but is never used hence it is
redundant and can be removed.  This has been the case ever since commit
c752666c.

Cleans up clang warning:
warning: variable 'moved' set but not used [-Wunused-but-set-variable]

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2018-07-25 00:08:59 +02:00
Andreas Gruenbacher
f95cbb44ab gfs2: use iomap_readpage for blocksize == PAGE_SIZE
We only use iomap_readpage for pages that don't have buffer heads
attached yet: iomap_readpage would otherwise read pages from disk that
are marked buffer_uptodate() but not PageUptodate().  Those pages may
actually contain data more recent than what's on disk.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Bob Peterson <rpeterso@redhat.com>
2018-07-25 00:08:49 +02:00
Andreas Gruenbacher
1d45bb7f9d gfs2: Use iomap for stuffed direct I/O reads
Remove the fallback code from direct to buffered I/O for stuffed reads.

For stuffed writes, we must keep the fallback code: the deferred glock
we are holding under direct I/O doesn't allow to write to the inode or
change the file size.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Bob Peterson <rpeterso@redhat.com>
2018-07-25 00:08:40 +02:00
Andreas Gruenbacher
0ed91eca11 Merge branch 'iomap-4.19-merge' into linux-gfs2/for-next
Merge xfs branch 'iomap-4.19-merge' into linux-gfs2/for-next.  This
brings in readpage and direct I/O support for inline data.

The IOMAP_F_BUFFER_HEAD flag introduced in commit "iomap: add initial
support for writes without buffer heads" needs to be set for gfs2 as
well, so do that in the merge.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2018-07-25 00:08:20 +02:00
Andreas Gruenbacher
c25892827c gfs2: fallocate_chunk: Always initialize struct iomap
In fallocate_chunk, always initialize the iomap before calling
gfs2_iomap_get_alloc: future changes could otherwise cause things like
iomap.flags to leak across calls.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Bob Peterson <rpeterso@redhat.com>
2018-07-25 00:06:48 +02:00
Bart Van Assche
7393059506 fs/cifs: Simplify ib_post_(send|recv|srq_recv)() calls
Instead of declaring and passing a dummy 'bad_wr' pointer, pass NULL
as third argument to ib_post_(send|recv|srq_recv)().

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-07-24 16:06:36 -06:00
Bob Peterson
4a7727725d GFS2: Fix recovery issues for spectators
This patch fixes a couple problems dealing with spectators who
remain with gfs2 mounts after the last non-spectator node fails.

Before this patch, spectator mounts would try to acquire the dlm's
mounted lock EX as part of its normal recovery sequence.
The mounted lock is only used to determine whether the node is
the first mounter, the first node to mount the file system, for
the purposes of file system recovery and journal replay.

It's not necessary for spectators: they should never do journal
recovery. If they acquire the lock it will prevent another "real"
first-mounter from acquiring the lock in EX mode, which means it
also cannot do journal recovery because it doesn't think it's the
first node to mount the file system.

This patch checks if the mounter is a spectator, and if so, avoids
grabbing the mounted lock. This allows a secondary mounter who is
really the first non-spectator mounter, to do journal recovery:
since the spectator doesn't acquire the lock, it can grab it in
EX mode, and therefore consider itself to be the first mounter
both as a "real" first mount, and as a first-real-after-spectator.

Note that the control lock still needs to be taken in PR mode
in order to fetch the lvb value so it has the current status of
all journal's recovery. This is used as it is today by a first
mounter to replay the journals. For spectators, it's merely
used to fetch the status bits. All recovery is bypassed and the
node waits until recovery is completed by a non-spectator node.

I also improved the cryptic message given by control_mount when
a spectator is waiting for a non-spectator to perform recovery.

It also fixes a problem in gfs2_recover_set whereby spectators
were never queueing recovery work for their own journal.
They cannot do recovery themselves, but they still need to queue
the work so they can check the recovery bits and clear the
DFL_BLOCK_LOCKS bit once the recovery happens on another node.

When the work queue runs on a spectator, it bypasses most of the
work so it won't print a bunch of annoying messages. All it will
print is a bunch of messages that look like this until recovery
completes on the non-spectator node:

GFS2: fsid=mycluster:scratch.s: recover generation 3 jid 0
GFS2: fsid=mycluster:scratch.s: recover jid 0 result busy

These continue every 1.5 seconds until the recovery is done by
the non-spectator, at which time it says:

GFS2: fsid=mycluster:scratch.s: recover generation 4 done

Then it proceeds with its mount.

If the file system is mounted in spectator node and the last
remaining non-spectator is fenced, any IO to the file system is
blocked by dlm and the spectator waits until recovery is
performed by a non-spectator.

If a spectator tries to mount the file system before any
non-spectators, it blocks and repeatedly gives this kernel
message:

GFS2: fsid=mycluster:scratch: Recovery is required. Waiting for a non-spectator to mount.
GFS2: fsid=mycluster:scratch: Recovery is required. Waiting for a non-spectator to mount.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2018-07-25 00:06:24 +02:00
Christoph Hellwig
076ff2f0b8 exofs: use bio_clone_fast in _write_mirror
The mirroring code never changes the bio data or biovecs.  This means
we can reuse the biovec allocation easily instead of duplicating it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by Boaz Harrosh <ooo@electrozaur.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-24 14:43:20 -06:00
Eric Sandeen
d4a34e1655 xfs: properly handle free inodes in extent hint validators
When inodes are freed in xfs_ifree(), di_flags is cleared (so extent size
hints are removed) but the actual extent size fields are left intact.
This causes the extent hint validators to fail on freed inodes which once
had extent size hints.

This can be observed (for example) by running xfs/229 twice on a
non-crc xfs filesystem, or presumably on V5 with ikeep.

Fixes: 7d71a67 ("xfs: verify extent size hint is valid in inode verifier")
Fixes: 02a0fda ("xfs: verify COW extent size hint is valid in inode verifier")
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-24 11:34:52 -07:00
Andreas Gruenbacher
a3479c7fc0 Merge branch 'iomap-write' into linux-gfs2/for-next
Pull in the gfs2 iomap-write changes: Tweak the existing code to
properly support iomap write and eliminate an unnecessary special case
in gfs2_block_map.  Implement iomap write support for buffered and
direct I/O.  Simplify some of the existing code and eliminate code that
is no longer used:

  gfs2: Remove gfs2_write_{begin,end}
  gfs2: iomap direct I/O support
  gfs2: gfs2_extent_length cleanup
  gfs2: iomap buffered write support
  gfs2: Further iomap cleanups

This is based on the following changes on the xfs 'iomap-4.19-merge'
branch:

  iomap: add private pointer to struct iomap
  iomap: add a page_done callback
  iomap: generic inline data handling
  iomap: complete partial direct I/O writes synchronously
  iomap: mark newly allocated buffer heads as new
  fs: factor out a __generic_write_end helper

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2018-07-24 20:02:40 +02:00
Souptick Joarder
109dbb1e6f fs: gfs2: Adding new return type vm_fault_t
Use new return type vm_fault_t for gfs2_page_mkwrite
handler.

see commit 1c8f422059 ("mm: change return type to
vm_fault_t") for reference.

Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2018-07-24 20:02:11 +02:00
Chengguang Xu
910f3d58d0 gfs2: using posix_acl_xattr_size instead of posix_acl_to_xattr
It seems better to get size by calling posix_acl_xattr_size() instead of
calling posix_acl_to_xattr() with NULL buffer argument.

posix_acl_xattr_size() never returns 0, so remove the unnecessary check.

Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2018-07-24 20:02:11 +02:00
Bob Peterson
e79e0e1428 gfs2: Don't reject a supposedly full bitmap if we have blocks reserved
Before this patch, you could get into situations like this:

1. Process 1 searches for X free blocks, finds them, makes a reservation
2. Process 2 searches for free blocks in the same rgrp, but now the
   bitmap is full because process 1's reservation is skipped over.
   So it marks the bitmap as GBF_FULL.
3. Process 1 tries to allocate blocks from its own reservation, but
   since the GBF_FULL bit is set, it skips over the rgrp and searches
   elsewhere, thus not using its own reservation.

This patch adds an additional check to allow processes to use their
own reservations.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2018-07-24 20:02:11 +02:00
Dan Williams
c2a7d2a115 filesystem-dax: Introduce dax_lock_mapping_entry()
In preparation for implementing support for memory poison (media error)
handling via dax mappings, implement a lock_page() equivalent. Poison
error handling requires rmap and needs guarantees that the page->mapping
association is maintained / valid (inode not freed) for the duration of
the lookup.

In the device-dax case it is sufficient to simply hold a dev_pagemap
reference. In the filesystem-dax case we need to use the entry lock.

Export the entry lock via dax_lock_mapping_entry() that uses
rcu_read_lock() to protect against the inode being freed, and
revalidates the page->mapping association under xa_lock().

Cc: Christoph Hellwig <hch@lst.de>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
2018-07-23 10:38:06 -07:00
Darrick J. Wong
f467cad95f xfs: force summary counter recalc at next mount
Use the "bad summary count" mount flag from the previous patch to skip
writing the unmount record to force log recovery at the next mount,
which will recalculate the summary counters for us.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2018-07-23 09:08:01 -07:00
Darrick J. Wong
53235f2215 xfs: refactor unmount record write
Refactor the writing of the unmount record into a separate helper.  No
functionality changes.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2018-07-23 09:08:01 -07:00
Darrick J. Wong
2e9e6481e2 xfs: detect and fix bad summary counts at mount
Filippo Giunchedi complained that xfs doesn't even perform basic sanity
checks of the fs summary counters at mount time.  Therefore, recalculate
the summary counters from the AGFs after log recovery if the counts were
bad (or we had to recover the fs).  Enhance the recalculation routine to
fail the mount entirely if the new values are also obviously incorrect.

We use a mount state flag to record the "bad summary count" state so
that the (subsequent) online fsck patches can detect subtlely incorrect
counts and set the flag; clear it userspace asks for a repair; or force
a recalculation at the next mount if nobody fixes it by unmount time.

Reported-by: Filippo Giunchedi <fgiunchedi@wikimedia.org>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2018-07-23 09:08:01 -07:00
Darrick J. Wong
032d91f982 xfs: fix indentation and other whitespace problems in scrub/repair
Now that we've shortened everything, fix up all the indentation and
whitespace problems.  There are no functional changes.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2018-07-23 09:08:01 -07:00
Darrick J. Wong
1d8a748a8a xfs: shorten struct xfs_scrub_context to struct xfs_scrub
Shorten the name of the online fsck context structure.  Whitespace
damage will be fixed by a subsequent patch.  There are no functional
changes.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2018-07-23 09:08:00 -07:00
Darrick J. Wong
b5e2196e9c xfs: shorten xfs_repair_ prefix to xrep_
Shorten all the metadata repair xfs_repair_* symbols to xrep_.
Whitespace damage will be fixed by a subsequent patch.  There are no
functional changes.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2018-07-23 09:08:00 -07:00
Darrick J. Wong
c517b3aa02 xfs: shorten xfs_scrub_ prefix
Shorten all the metadata checking xfs_scrub_ prefixes to xchk_.  After
this, the only xfs_scrub* symbols are the ones that pertain to both
scrub and repair.  Whitespace damage will be fixed in a subsequent
patch.  There are no functional changes.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2018-07-23 09:08:00 -07:00
Darrick J. Wong
ef97ef26d2 xfs: clean up xfs_btree_del_cursor callers
Less trivial cleanups of the error argument to xfs_btree_del_cursor;
these require some minor code refactoring.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2018-07-23 09:08:00 -07:00
Darrick J. Wong
0b04b6b875 xfs: trivial xfs_btree_del_cursor cleanups
The error argument to xfs_btree_del_cursor already understands the
"nonzero for error" semantics, so remove pointless error testing in the
callers and pass it directly.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2018-07-23 09:08:00 -07:00
Darrick J. Wong
81b549aa62 xfs: return from _defer_finish with a clean transaction
The following assertion was seen on generic/051:

XFS: Assertion failed: tp->t_firstblock == NULLFSBLOCK, file: fs/xfs/libxfs5
------------[ cut here ]------------
kernel BUG at fs/xfs/xfs_message.c:102!
invalid opcode: 0000 [#1] SMP PTI
CPU: 2 PID: 20757 Comm: fsstress Not tainted 4.18.0-rc4+ #3969
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.1-1 04/01/4
RIP: 0010:assfail+0x23/0x30
Code: c3 66 0f 1f 44 00 00 48 89 f1 41 89 d0 48 c7 c6 88 e0 8c 82 48 89 fa
RSP: 0018:ffff88012dc43c08 EFLAGS: 00010202
RAX: 0000000000000000 RBX: ffff88012dc43ca0 RCX: 0000000000000000
RDX: 00000000ffffffc0 RSI: 000000000000000a RDI: ffffffff828480eb
RBP: ffff88012aa92758 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: f000000000000000 R12: 0000000000000000
R13: ffff88012dc43d48 R14: ffff88013092e7e8 R15: 0000000000000014
FS:  00007f8d689b8e80(0000) GS:ffff88013fd00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f8d689c7000 CR3: 000000012ba6a000 CR4: 00000000000006e0
Call Trace:
 xfs_defer_init+0xff/0x160
 xfs_reflink_remap_extent+0x31b/0xa00
 xfs_reflink_remap_blocks+0xec/0x4a0
 xfs_reflink_remap_range+0x3a1/0x650
 xfs_file_dedupe_range+0x39/0x50
 vfs_dedupe_file_range+0x218/0x260
 do_vfs_ioctl+0x262/0x6a0
 ? __se_sys_newfstat+0x3c/0x60
 ksys_ioctl+0x35/0x60
 __x64_sys_ioctl+0x11/0x20
 do_syscall_64+0x4b/0x190
 entry_SYSCALL_64_after_hwframe+0x49/0xbe

The root cause of the assertion failure is that xfs_defer_finish doesn't
roll the transaction after processing all the deferred items.  Therefore
it returns a dirty transaction to the caller, which leaves the caller at
risk of exceeding the transaction reservation if it logs more items.

Brian Foster's patchset to move the defer_ops firstblock into the
transaction requires t_firstblock == NULLFSBLOCK upon defer_ops
initialization, which is how this was noticed at all.

Reported-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2018-07-23 09:08:00 -07:00
Darrick J. Wong
65cfcc3897 xfs: check leaf attribute block freemap in verifier
Check the leaf attribute freemap when we're verifying the block.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2018-07-23 09:08:00 -07:00
Linus Torvalds
165ea0d1c2 Merge branch 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs fixes from Al Viro:
 "Fix several places that screw up cleanups after failures halfway
  through opening a file (one open-coding filp_clone_open() and getting
  it wrong, two misusing alloc_file()). That part is -stable fodder from
  the 'work.open' branch.

  And Christoph's regression fix for uapi breakage in aio series;
  include/uapi/linux/aio_abi.h shouldn't be pulling in the kernel
  definition of sigset_t, the reason for doing so in the first place had
  been bogus - there's no need to expose struct __aio_sigset in
  aio_abi.h at all"

* 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  aio: don't expose __aio_sigset in uapi
  ocxlflash_getfile(): fix double-iput() on alloc_file() failures
  cxl_getfile(): fix double-iput() on alloc_file() failures
  drm_mode_create_lease_ioctl(): fix open-coded filp_clone_open()
2018-07-22 12:04:51 -07:00
Andy Shevchenko
c4326563d9 efivars: Call guid_parse() against guid_t type of variable
uuid_le_to_bin() is deprecated API and take into consideration that variable,
to where we store parsed data, is type of guid_t we switch to guid_parse()
for sake of consistency.

While here, add error checking to it.

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Hans de Goede <hdegoede@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Lukas Wunner <lukas@wunner.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-efi@vger.kernel.org
Link: http://lkml.kernel.org/r/20180720014726.24031-10-ard.biesheuvel@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-07-22 14:13:44 +02:00
Linus Torvalds
55b636b419 for-4.18-rc5-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAltSwkIACgkQxWXV+ddt
 WDtMBQ//UXMjHaXvFmC0SM6NczuYQR51hLYtJFIKig93XK5goVpUBTNbxO7LX/Tn
 4zKoKyVhkW1884V6mRiC+G23QLbo0BQZA7DExfyJ3jylQdjZMBm+K+r19OtGQf5v
 CII7oUwni03KIiXIqiFAL5dLWebVQpG5EKJbh8GLZsmg6xNcyVaUqZ/fHXajbZiv
 ldEBtHBKIv7WWTJmylMBKMWnRz+jqU91fXPahoU6R5qivODrLt1o/PMuSjVNhaxe
 iDldHfdOaiQmLHB/1kOGyv492oW5mSSVNDE8LjEDZ61tDNlAcUyuKUWIRBxDEDtD
 6D7rlVQXJ/N7sJ6+UYmJKsRpHL+NOkyzSZ0QEU/sm1Xpm8gkhHuuofRPrVCtd3l1
 ZSbwvlrdyjigVEBfM3IbToQ/K6Rc1ZGId20OAs9PCQbb+mj9IxPIncZ7pI1c4hlh
 pPEjcYsp14JbCTjctFalcqTiFY5tHRQsx+GUFnDyOcdL7Mi+CoH+0Jy61Vgz9GQE
 7s934cfEC0ot/f66kAL/PZzxUfC7TePqaa+sDfS5BIkJ4M6lPMxS5De5R4Z0+Nzr
 DXgQAlgXmxfRjpOYMTH9D0EDdSeJaNmVHgk7hFbiYk/KX3oyd4NmgI9Cfao8rQJv
 2yd8wF2httfSJKD4b/Hv9r6Ho/Bw9PK59BvWOKYhSj6IGl32utw=
 =f7eB
 -----END PGP SIGNATURE-----

Merge tag 'for-4.18-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fix from David Sterba:
 "A fix of a corruption regarding fsync and clone, under some very
  specific conditions explained in the patch.

  The fix is marked for stable 3.16+ so I'd like to get it merged now
  given the impact"

* tag 'for-4.18-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  Btrfs: fix file data corruption after cloning a range and fsync
2018-07-21 16:42:03 -07:00
Linus Torvalds
490fc05386 mm: make vm_area_alloc() initialize core fields
Like vm_area_dup(), it initializes the anon_vma_chain head, and the
basic mm pointer.

The rest of the fields end up being different for different users,
although the plan is to also initialize the 'vm_ops' field to a dummy
entry.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-07-21 15:24:03 -07:00
Linus Torvalds
3928d4f5ee mm: use helper functions for allocating and freeing vm_area structs
The vm_area_struct is one of the most fundamental memory management
objects, but the management of it is entirely open-coded evertwhere,
ranging from allocation and freeing (using kmem_cache_[z]alloc and
kmem_cache_free) to initializing all the fields.

We want to unify this in order to end up having some unified
initialization of the vmas, and the first step to this is to at least
have basic allocation functions.

Right now those functions are literally just wrappers around the
kmem_cache_*() calls.  This is a purely mechanical conversion:

    # new vma:
    kmem_cache_zalloc(vm_area_cachep, GFP_KERNEL) -> vm_area_alloc()

    # copy old vma
    kmem_cache_alloc(vm_area_cachep, GFP_KERNEL) -> vm_area_dup(old)

    # free vma
    kmem_cache_free(vm_area_cachep, vma) -> vm_area_free(vma)

to the point where the old vma passed in to the vm_area_dup() function
isn't even used yet (because I've left all the old manual initialization
alone).

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-07-21 13:48:51 -07:00
OGAWA Hirofumi
35033ab988 fat: fix memory allocation failure handling of match_strdup()
In parse_options(), if match_strdup() failed, parse_options() leaves
opts->iocharset in unexpected state (i.e.  still pointing the freed
string).  And this can be the cause of double free.

To fix, this initialize opts->iocharset always when freeing.

Link: http://lkml.kernel.org/r/8736wp9dzc.fsf@mail.parknet.co.jp
Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Reported-by: syzbot+90b8e10515ae88228a92@syzkaller.appspotmail.com
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-07-21 12:50:46 -07:00
Eric W. Biederman
40b3b02535 signal: Pass pid type into do_send_sig_info
This passes the information we already have at the call sight into
do_send_sig_info.  Ultimately allowing for better handling of signals
sent to a group of processes during fork.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2018-07-21 12:57:35 -05:00
Eric W. Biederman
9c2db00778 signal: Pass pid type into send_sigio_to_task & send_sigurg_to_task
This information is already present and using it directly simplifies the logic
of the code.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2018-07-21 12:57:41 -05:00
Eric W. Biederman
019191342f signal: Use PIDTYPE_TGID to clearly store where file signals will be sent
When f_setown is called a pid and a pid type are stored.  Replace the use
of PIDTYPE_PID with PIDTYPE_TGID as PIDTYPE_TGID goes to the entire thread
group.  Replace the use of PIDTYPE_MAX with PIDTYPE_PID as PIDTYPE_PID now
is only for a thread.

Update the users of __f_setown to use PIDTYPE_TGID instead of
PIDTYPE_PID.

For now the code continues to capture task_pid (when task_tgid would
really be appropriate), and iterate on PIDTYPE_PID (even when type ==
PIDTYPE_TGID) out of an abundance of caution to preserve existing
behavior.

Oleg Nesterov suggested using the test to ensure we use PIDTYPE_PID
for tgid lookup also be used to avoid taking the tasklist lock.

Suggested-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2018-07-21 10:43:12 -05:00
Eric W. Biederman
6883f81aac pid: Implement PIDTYPE_TGID
Everywhere except in the pid array we distinguish between a tasks pid and
a tasks tgid (thread group id).  Even in the enumeration we want that
distinction sometimes so we have added __PIDTYPE_TGID.  With leader_pid
we almost have an implementation of PIDTYPE_TGID in struct signal_struct.

Add PIDTYPE_TGID as a first class member of the pid_type enumeration and
into the pids array.  Then remove the __PIDTYPE_TGID special case and the
leader_pid in signal_struct.

The net size increase is just an extra pointer added to struct pid and
an extra pair of pointers of an hlist_node added to task_struct.

The effect on code maintenance is the removal of a number of special
cases today and the potential to remove many more special cases as
PIDTYPE_TGID gets used to it's fullest.  The long term potential
is allowing zombie thread group leaders to exit, which will remove
a lot more special cases in the code.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2018-07-21 10:43:12 -05:00
Eric W. Biederman
2c4704756c pids: Move the pgrp and session pid pointers from task_struct to signal_struct
To access these fields the code always has to go to group leader so
going to signal struct is no loss and is actually a fundamental simplification.

This saves a little bit of memory by only allocating the pid pointer array
once instead of once for every thread, and even better this removes a
few potential races caused by the fact that group_leader can be changed
by de_thread, while signal_struct can not.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2018-07-21 10:43:12 -05:00
Eric W. Biederman
7a36094d61 pids: Compute task_tgid using signal->leader_pid
The cost is the the same and this removes the need
to worry about complications that come from de_thread
and group_leader changing.

__task_pid_nr_ns has been updated to take advantage of this change.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2018-07-21 10:43:12 -05:00
Dmitry Torokhov
5f81880d52 sysfs, kobject: allow creating kobject belonging to arbitrary users
Normally kobjects and their sysfs representation belong to global root,
however it is not necessarily the case for objects in separate namespaces.
For example, objects in separate network namespace logically belong to the
container's root and not global root.

This change lays groundwork for allowing network namespace objects
ownership to be transferred to container's root user by defining
get_ownership() callback in ktype structure and using it in sysfs code to
retrieve desired uid/gid when creating sysfs objects for given kobject.

Co-Developed-by: Tyler Hicks <tyhicks@canonical.com>
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Signed-off-by: Tyler Hicks <tyhicks@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-20 23:44:35 -07:00
Dmitry Torokhov
488dee96bb kernfs: allow creating kernfs objects with arbitrary uid/gid
This change allows creating kernfs files and directories with arbitrary
uid/gid instead of always using GLOBAL_ROOT_UID/GID by extending
kernfs_create_dir_ns() and kernfs_create_file_ns() with uid/gid arguments.
The "simple" kernfs_create_file() and kernfs_create_dir() are left alone
and always create objects belonging to the global root.

When creating symlinks ownership (uid/gid) is taken from the target kernfs
object.

Co-Developed-by: Tyler Hicks <tyhicks@canonical.com>
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Signed-off-by: Tyler Hicks <tyhicks@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-20 23:44:35 -07:00
Dan Williams
73449daf8f filesystem-dax: Set page->index
In support of enabling memory_failure() handling for filesystem-dax
mappings, set ->index to the pgoff of the page. The rmap implementation
requires ->index to bound the search through the vma interval tree. The
index is set and cleared at dax_associate_entry() and
dax_disassociate_entry() time respectively.

Cc: Christoph Hellwig <hch@lst.de>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
2018-07-20 11:21:15 -07:00
Vivek Goyal
989974c804 ovl: Enable metadata only feature
All the bits are in patches before this.  So it is time to enable the
metadata only copy up feature.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-07-20 09:56:17 +02:00
Vivek Goyal
935a074f48 ovl: Do not do metacopy only for ioctl modifying file attr
ovl_copy_up() by default will only do metadata only copy up (if enabled).
That means when ovl_real_ioctl() calls ovl_real_file(), it will still get
the lower file (as ovl_real_file() opens data file and not metacopy).  And
that means "chattr +i" will end up modifying lower inode.

There seem to be two ways to solve this.
A. Open metacopy file in ovl_real_ioctl() and do operations on that
B. Force full copy up when FS_IOC_SETFLAGS is called.

I am resorting to option B for now as it feels little safer option.  If
there are performance issues due to this, we can revisit it.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-07-20 09:56:17 +02:00
Vivek Goyal
997336f2c3 ovl: Do not do metadata only copy-up for truncate operation
truncate should copy up full file (and not do metacopy only), otherwise it
will be broken.  For example, use truncate to increase size of a file so
that any read beyong existing size will return null bytes.  If we don't
copy up full file, then we end up opening lower file and read from it only
reads upto the old size (and not new size after truncate).  Hence to avoid
such situations, copy up data as well when file size changes.

So far it was being done by d_real(O_WRONLY) call in truncate() path.  Now
that patch has been reverted.  So force full copy up in ovl_setattr() if
size of file is changing.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-07-20 09:56:17 +02:00
Vivek Goyal
d1e6f6a94d ovl: add helper to force data copy-up
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-07-20 09:56:16 +02:00
Vivek Goyal
0a2d0d3f2f ovl: Check redirect on index as well
Right now we seem to check redirect only if upperdentry is found.  But it
is possible that there is no upperdentry but later we found an index.

We need to check redirect on index as well and set it in
ovl_inode->redirect.  Otherwise link code can assume that dentry does not
have redirect and place a new one which breaks things.  In my testing
overlay/033 test started failing in xfstests.  Following are the details.

For example do following.

$ mkdir lower upper work merged

 - Make lower dir with 4 links.
  $ echo "foo" > lower/l0.txt
  $ ln  lower/l0.txt lower/l1.txt
  $ ln  lower/l0.txt lower/l2.txt
  $ ln  lower/l0.txt lower/l3.txt

 - Mount with index on and metacopy on.

  $ mount -t overlay -o lowerdir=lower,upperdir=upper,workdir=work,\
                        index=on,metacopy=on none merged

 - Link lower

  $ ln merged/l0.txt merged/l4.txt
    (This will metadata copy up of l0.txt and put an absolute redirect
     /l0.txt)

  $ echo 2 > /proc/sys/vm/drop/caches

  $ ls merged/l1.txt
  (Now l1.txt will be looked up.  There is no upper dentry but there is
   lower dentry and index will be found.  We don't check for redirect on
   index, hence ovl_inode->redirect will be NULL.)

 - Link Upper

  $ ln merged/l4.txt merged/l5.txt
  (Lookup of l4.txt will use inode from l1.txt lookup which is still in
   cache.  It has ovl_inode->redirect NULL, hence link will put a new
   redirect and replace /l0.txt with /l4.txt

 - Drop caches.
  echo 2 > /proc/sys/vm/drop_caches

 - List l1.txt and it returns -ESTALE

  $ ls merged/l0.txt

  (It returns stale because, we found a metacopy of l0.txt in upper and it
   has redirect l4.txt but there is no file named l4.txt in lower layer.
   So lower data copy is not found and -ESTALE is returned.)

So problem here is that we did not process redirect on index.  Check
redirect on index as well and then problem is fixed.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-07-20 09:56:16 +02:00
Vivek Goyal
4120fe64dc ovl: Set redirect on upper inode when it is linked
When we create a hardlink to a metacopy upper file, first the redirect on
that inode.  Path based lookup will not work with newly created link and
redirect will solve that issue.

Also use absolute redirect as two hardlinks could be in different
directores and relative redirect will not work.

I have not put any additional locking around setting redirects while
introducing redirects for non-dir files.  For now it feels like existing
locking is sufficient.  If that's not the case, we will have add more
locking.  Following is my rationale about why do I think current locking
seems ok.

Basic problem for non-dir files is that more than on dentry could be
pointing to same inode and in theory only relying on dentry based locks
(d->d_lock) did not seem sufficient.

We set redirect upon rename and upon link creation.  In both the paths for
non-dir file, VFS locks both source and target inodes (->i_rwsem).  That
means vfs rename and link operations on same source and target can't he
happening in parallel (Even if there are multiple dentries pointing to same
inode).  So that probably means that at a time on an inode, only one call
of ovl_set_redirect() could be working and we don't need additional locking
in ovl_set_redirect().

ovl_inode->redirect is initialized only when inode is created new.  That
means it should not race with any other path and setting
ovl_inode->redirect should be fine.

Reading of ovl_inode->redirect happens in ovl_get_redirect() path.  And
this called only in ovl_set_redirect().  And ovl_set_redirect() already
seemed to be protected using ->i_rwsem.  That means ovl_set_redirect() and
ovl_get_redirect() on source/target inode should not make progress in
parallel and is mutually exclusive.  Hence no additional locking required.

Now, only case where ovl_set_redirect() and ovl_get_redirect() could race
seems to be case of absolute redirects where ovl_get_redirect() has to
travel up the tree.  In that case we already take d->d_lock and that should
be sufficient as directories will not have multiple dentries pointing to
same inode.

So given VFS locking and current usage of redirect, current locking around
redirect seems to be ok for non-dir as well.  Once we have the logic to
remove redirect when metacopy file gets copied up, then we probably will
need additional locking.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-07-20 09:56:15 +02:00
Vivek Goyal
7bb083837d ovl: Set redirect on metacopy files upon rename
Set redirect on metacopy files upon rename.  This will help find data
dentry in lower dirs.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-07-20 09:56:15 +02:00
Vivek Goyal
60124877b9 ovl: Do not set dentry type ORIGIN for broken hardlinks
If a dentry has copy up origin, we set flag OVL_PATH_ORIGIN.  So far this
decision was easy that we had to check only for oe->numlower and if it is
non-zero, we knew there is copy up origin.  (For non-dir we installed
origin dentry in lowerstack[0]).

But we don't create ORGIN xattr for broken hardlinks (index=off).  And with
metacopy feature it is possible that we will install lowerstack[0] but
ORIGIN xattr is not there.  It is data dentry of upper metacopy dentry
which has been found using regular name based lookup or using REDIRECT.  So
with addition of this new case, just presence of oe->numlower is not
sufficient to guarantee that ORIGIN xattr is present.

So to differentiate between two cases, look at OVL_CONST_INO flag.  If this
flag is set and upperdentry is there, that means it can be marked as type
ORIGIN.  OVL_CONST_INO is not set if lower hardlink is broken or will be
broken over copy up.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-07-20 09:56:14 +02:00
Vivek Goyal
a00c2d59e9 ovl: Add an inode flag OVL_CONST_INO
Add an ovl_inode flag OVL_CONST_INO.  This flag signifies if inode number
will remain constant over copy up or not.  This flag does not get updated
over copy up and remains unmodifed after setting once.

Next patch in the series will make use of this flag.  It will basically
figure out if dentry is of type ORIGIN or not.  And this can be derived by
this flag.

ORIGIN = (upperdentry && ovl_test_flag(OVL_CONST_INO, inode)).

Suggested-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-07-20 09:56:14 +02:00
Vivek Goyal
0b17c28af1 ovl: Treat metacopy dentries as type OVL_PATH_MERGE
Right now OVL_PATH_MERGE is used only for merged directories.  But
conceptually, a metacopy dentry (backed by a lower data dentry) is a merged
entity as well.

So mark metacopy dentries as OVL_PATH_MERGE and ovl_rename() makes use of
this property later to set redirect on a metacopy file.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-07-20 09:56:13 +02:00
Vivek Goyal
b8a8824ca0 ovl: Check redirects for metacopy files
Right now we rely on path based lookup for data origin of metacopy upper.
This will work only if upper has not been renamed.  We solved this problem
already for merged directories using redirect.  Use same logic for metacopy
files.

This patch just goes on to check redirects for metacopy files.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-07-20 09:56:13 +02:00
Vivek Goyal
0618a816ed ovl: Move some dir related ovl_lookup_single() code in else block
Move some directory related code in else block.  This is pure code
reorganization and no functionality change.

Next patch enables redirect processing on metacopy files and needs this
change.  By keeping non-functional changes in a separate patch, next patch
looks much smaller and cleaner.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-07-20 09:56:12 +02:00
Vivek Goyal
2c3d73589a ovl: Do not expose metacopy only dentry from d_real()
Metacopy dentry/inode is internal to overlay and is never exposed outside
of it.  Exception is metacopy upper file used for fsync().  Modify d_real()
to look for dentries/inode which have data, but also allow matching upper
inode without data for the fsync case.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-07-20 09:56:12 +02:00
Vivek Goyal
8c444d2a97 ovl: Open file with data except for the case of fsync
ovl_open() should open file which contains data and not open metacopy
inode.  With the introduction of metacopy inodes, with current
implementaion we will end up opening metacopy inode as well.

But there can be certain circumstances like ovl_fsync() where we want to
allow opening a metacopy inode instead.

Hence, change ovl_open_realfile() and and add extra parameter which
specifies whether to allow opening metacopy inode or not.  If this
parameter is false, we look for data inode and open that.

This should allow covering both the cases.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-07-20 09:56:12 +02:00
Vivek Goyal
4823d49c26 ovl: Add helper ovl_inode_realdata()
Add an helper to retrieve real data inode associated with overlay inode.
This helper will ignore all metacopy inodes and will return only the real
inode which has data.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-07-20 09:56:11 +02:00
Vivek Goyal
2664bd0897 ovl: Store lower data inode in ovl_inode
Right now ovl_inode stores inode pointer for lower inode.  This helps with
quickly getting lower inode given overlay inode (ovl_inode_lower()).

Now with metadata only copy-up, we can have metacopy inode in middle layer
as well and inode containing data can be different from ->lower.  I need to
be able to open the real file in ovl_open_realfile() and for that I need to
quickly find the lower data inode.

Hence store lower data inode also in ovl_inode.  Also provide an helper
ovl_inode_lowerdata() to access this field.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-07-20 09:56:11 +02:00
Vivek Goyal
67d756c27a ovl: Fix ovl_getattr() to get number of blocks from lower
If an inode has been copied up metadata only, then we need to query the
number of blocks from lower and fill up the stat->st_blocks.

We need to be careful about races where we are doing stat on one cpu and
data copy up is taking place on other cpu.  We want to return
stat->st_blocks either from lower or stable upper and not something in
between.  Hence, ovl_has_upperdata() is called first to figure out whether
block reporting will take place from lower or upper.

We now support metacopy dentries in middle layer.  That means number of
blocks reporting needs to come from lowest data dentry and this could be
different from lower dentry.  Hence we end up making a separate
vfs_getxattr() call for metacopy dentries to get number of blocks.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-07-20 09:56:10 +02:00
Vivek Goyal
647d253fcd ovl: Add helper ovl_dentry_lowerdata() to get lower data dentry
Now we have the notion of data dentry and metacopy dentry.
ovl_dentry_lower() will return uppermost lower dentry, but it could be
either data or metacopy dentry.  Now we support metacopy dentries in lower
layers so it is possible that lowerstack[0] is metacopy dentry while
lowerstack[1] is actual data dentry.

So add an helper which returns lowest most dentry which is supposed to be
data dentry.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-07-20 09:56:10 +02:00
Vivek Goyal
4f93b426ab ovl: Copy up meta inode data from lowest data inode
So far lower could not be a meta inode.  So whenever it was time to copy up
data of a meta inode, we could copy it up from top most lower dentry.

But now lower itself can be a metacopy inode.  That means data copy up
needs to take place from a data inode in metacopy inode chain.  Find lower
data inode in the chain and use that for data copy up.

Introduced a helper called ovl_path_lowerdata() to find the lower data
inode chain.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-07-20 09:56:09 +02:00
Vivek Goyal
9d3dfea3d3 ovl: Modify ovl_lookup() and friends to lookup metacopy dentry
This patch modifies ovl_lookup() and friends to lookup metacopy dentries.
It also allows for presence of metacopy dentries in lower layer.

During lookup, check for presence of OVL_XATTR_METACOPY and if not present,
set OVL_UPPERDATA bit in flags.

We don't support metacopy feature with nfs_export.  So in nfs_export code,
we set OVL_UPPERDATA flag set unconditionally if upper inode exists.

Do not follow metacopy origin if we find a metacopy only inode and metacopy
feature is not enabled for that mount.  Like redirect, this can have
security implications where an attacker could hand craft upper and try to
gain access to file on lower which it should not have to begin with.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-07-20 09:56:09 +02:00
Vivek Goyal
027065b726 ovl: Use out_err instead of out_nomem
Right now we use goto out_nomem which assumes error code is -ENOMEM.  But
there are other errors returned like -ESTALE as well.  So instead of
out_nomem, use out_err which will do ERR_PTR(err).  That way one can put
error code in err and jump to out_err.

This just code reorganization and no change of functionality.

I am about to add more code and this organization helps laying more code
and error paths on top of it.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-07-20 09:56:08 +02:00
Vivek Goyal
0c28887493 ovl: A new xattr OVL_XATTR_METACOPY for file on upper
Now we will have the capability to have upper inodes which might be only
metadata copy up and data is still on lower inode.  So add a new xattr
OVL_XATTR_METACOPY to distinguish between two cases.

Presence of OVL_XATTR_METACOPY reflects that file has been copied up
metadata only and and data will be copied up later from lower origin.  So
this xattr is set when a metadata copy takes place and cleared when data
copy takes place.

We also use a bit in ovl_inode->flags to cache OVL_UPPERDATA which reflects
whether ovl inode has data or not (as opposed to metadata only copy up).

If a file is copied up metadata only and later when same file is opened for
WRITE, then data copy up takes place.  We copy up data, remove METACOPY
xattr and then set the UPPERDATA flag in ovl_inode->flags.  While all these
operations happen with oi->lock held, read side of oi->flags can be
lockless.  That is another thread on another cpu can check if UPPERDATA
flag is set or not.

So this gives us an ordering requirement w.r.t UPPERDATA flag.  That is, if
another cpu sees UPPERDATA flag set, then it should be guaranteed that
effects of data copy up and remove xattr operations are also visible.

For example.

	CPU1				CPU2
ovl_open()				acquire(oi->lock)
 ovl_open_maybe_copy_up()                ovl_copy_up_data()
  open_open_need_copy_up()		 vfs_removexattr()
   ovl_already_copied_up()
    ovl_dentry_needs_data_copy_up()	 ovl_set_flag(OVL_UPPERDATA)
     ovl_test_flag(OVL_UPPERDATA)       release(oi->lock)

Say CPU2 is copying up data and in the end sets UPPERDATA flag.  But if
CPU1 perceives the effects of setting UPPERDATA flag but not the effects of
preceding operations (ex. upper that is not fully copied up), it will be a
problem.

Hence this patch introduces smp_wmb() on setting UPPERDATA flag operation
and smp_rmb() on UPPERDATA flag test operation.

May be some other lock or barrier is already covering it. But I am not sure
what that is and is it obvious enough that we will not break it in future.

So hence trying to be safe here and introducing barriers explicitly for
UPPERDATA flag/bit.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-07-20 09:56:08 +02:00
Vivek Goyal
2002df8536 ovl: Add helper ovl_already_copied_up()
There are couple of places where we need to know if file is already copied
up (in lockless manner).  Right now its open coded and there are only two
conditions to check.  Soon this patch series will introduce another
condition to check and Amir wants to introduce one more.  So introduce a
helper instead to check this so that code is easier to read.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-07-20 09:56:08 +02:00
Vivek Goyal
44d5bf109a ovl: Copy up only metadata during copy up where it makes sense
If it makes sense to copy up only metadata during copy up, do it.  This is
done for regular files which are not opened for WRITE.

Right now ->metacopy is set to 0 always.  Last patch in the series will
remove the hard coded statement and enable metacopy feature.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-07-20 09:56:07 +02:00
Vivek Goyal
bd64e57586 ovl: During copy up, first copy up metadata and then data
Just a little re-ordering of code.  This helps with next patch where after
copying up metadata, we skip data copying step, if needed.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-07-20 09:56:07 +02:00
Vivek Goyal
d5791044d2 ovl: Provide a mount option metacopy=on/off for metadata copyup
By default metadata only copy up is disabled.  Provide a mount option so
that users can choose one way or other.

Also provide a kernel config and module option to enable/disable metacopy
feature.

metacopy feature requires redirect_dir=on when upper is present.
Otherwise, it requires redirect_dir=follow atleast.

As of now, metacopy does not work with nfs_export=on.  So if both
metacopy=on and nfs_export=on then nfs_export is disabled.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-07-20 09:56:06 +02:00
Vivek Goyal
d6eac03913 ovl: Move the copy up helpers to copy_up.c
Right now two copy up helpers are in inode.c.  Amir suggested it might be
better to move these to copy_up.c.

There will one more related function which will come in later patch.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-07-20 09:56:06 +02:00
Vivek Goyal
9cec54c83a ovl: Initialize ovl_inode->redirect in ovl_get_inode()
ovl_inode->redirect is an inode property and should be initialized in
ovl_get_inode() only when we are adding a new inode to cache.  If inode is
already in cache, it is already initialized and we should not be touching
ovl_inode->redirect field.

As of now this is not a problem as redirects are used only for directories
which don't share inode.  But soon I want to use redirects for regular
files also and there it can become an issue.

Hence, move ->redirect initialization in ovl_get_inode().

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-07-20 09:56:05 +02:00
Al Viro
f2df5da662 fold generic_readlink() into its only caller
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-07-19 17:35:51 -04:00
Maciej W. Rozycki
2f819db565
binfmt_elf: Respect error return from `regset->active'
The regset API documented in <linux/regset.h> defines -ENODEV as the
result of the `->active' handler to be used where the feature requested
is not available on the hardware found.  However code handling core file
note generation in `fill_thread_core_info' interpretes any non-zero
result from the `->active' handler as the regset requested being active.
Consequently processing continues (and hopefully gracefully fails later
on) rather than being abandoned right away for the regset requested.

Fix the problem then by making the code proceed only if a positive
result is returned from the `->active' handler.

Signed-off-by: Maciej W. Rozycki <macro@mips.com>
Signed-off-by: Paul Burton <paul.burton@mips.com>
Fixes: 4206d3aa19 ("elf core dump: notes user_regset")
Patchwork: https://patchwork.linux-mips.org/patch/19332/
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: James Hogan <jhogan@kernel.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-mips@linux-mips.org
Cc: linux-kernel@vger.kernel.org
2018-07-19 13:46:34 -07:00
Filipe Manana
bd3599a0e1 Btrfs: fix file data corruption after cloning a range and fsync
When we clone a range into a file we can end up dropping existing
extent maps (or trimming them) and replacing them with new ones if the
range to be cloned overlaps with a range in the destination inode.
When that happens we add the new extent maps to the list of modified
extents in the inode's extent map tree, so that a "fast" fsync (the flag
BTRFS_INODE_NEEDS_FULL_SYNC not set in the inode) will see the extent maps
and log corresponding extent items. However, at the end of range cloning
operation we do truncate all the pages in the affected range (in order to
ensure future reads will not get stale data). Sometimes this truncation
will release the corresponding extent maps besides the pages from the page
cache. If this happens, then a "fast" fsync operation will miss logging
some extent items, because it relies exclusively on the extent maps being
present in the inode's extent tree, leading to data loss/corruption if
the fsync ends up using the same transaction used by the clone operation
(that transaction was not committed in the meanwhile). An extent map is
released through the callback btrfs_invalidatepage(), which gets called by
truncate_inode_pages_range(), and it calls __btrfs_releasepage(). The
later ends up calling try_release_extent_mapping() which will release the
extent map if some conditions are met, like the file size being greater
than 16Mb, gfp flags allow blocking and the range not being locked (which
is the case during the clone operation) nor being the extent map flagged
as pinned (also the case for cloning).

The following example, turned into a test for fstests, reproduces the
issue:

  $ mkfs.btrfs -f /dev/sdb
  $ mount /dev/sdb /mnt

  $ xfs_io -f -c "pwrite -S 0x18 9000K 6908K" /mnt/foo
  $ xfs_io -f -c "pwrite -S 0x20 2572K 156K" /mnt/bar

  $ xfs_io -c "fsync" /mnt/bar
  # reflink destination offset corresponds to the size of file bar,
  # 2728Kb minus 4Kb.
  $ xfs_io -c ""reflink ${SCRATCH_MNT}/foo 0 2724K 15908K" /mnt/bar
  $ xfs_io -c "fsync" /mnt/bar

  $ md5sum /mnt/bar
  95a95813a8c2abc9aa75a6c2914a077e  /mnt/bar

  <power fail>

  $ mount /dev/sdb /mnt
  $ md5sum /mnt/bar
  207fd8d0b161be8a84b945f0df8d5f8d  /mnt/bar
  # digest should be 95a95813a8c2abc9aa75a6c2914a077e like before the
  # power failure

In the above example, the destination offset of the clone operation
corresponds to the size of the "bar" file minus 4Kb. So during the clone
operation, the extent map covering the range from 2572Kb to 2728Kb gets
trimmed so that it ends at offset 2724Kb, and a new extent map covering
the range from 2724Kb to 11724Kb is created. So at the end of the clone
operation when we ask to truncate the pages in the range from 2724Kb to
2724Kb + 15908Kb, the page invalidation callback ends up removing the new
extent map (through try_release_extent_mapping()) when the page at offset
2724Kb is passed to that callback.

Fix this by setting the bit BTRFS_INODE_NEEDS_FULL_SYNC whenever an extent
map is removed at try_release_extent_mapping(), forcing the next fsync to
search for modified extents in the fs/subvolume tree instead of relying on
the presence of extent maps in memory. This way we can continue doing a
"fast" fsync if the destination range of a clone operation does not
overlap with an existing range or if any of the criteria necessary to
remove an extent map at try_release_extent_mapping() is not met (file
size not bigger then 16Mb or gfp flags do not allow blocking).

CC: stable@vger.kernel.org # 3.16+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-07-19 15:36:31 +02:00
Linus Torvalds
04a1320651 for-4.18-rc5-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAltPMI0ACgkQxWXV+ddt
 WDvmXw//fyV+2hoARngzjd+4o32YHfxdf+Xv4XCnMsZVKOHKkqX8qrmMNyX0sd4w
 5NwUZpv/mZ4LHnm4M+EMGJWjXL/oXkLGrDzndninNC+u7GlFVieZ/aF5D96z6rOm
 p45wGETYvAbZI7XZ3dLebpIDqr+eXOhx3lpJTAKY5sfTIwzwJ+KC5vFYdt+Rz4cr
 cbjwHhRUsRfu1I0SSjUVFIC5frtegIzbDgjWNiLLO44ozbDAH3j1SufOgNLb5GFM
 n+eh0xIHDNLOrH3aVKO19zk9NigVBu96/FJnIz0+Jzs67hifksfZWVDV5vKetUxA
 M46aqtTrSVb/NJ/RHkQkyWiJjZqioXXx+KsZjdU63fyv4iu0+o2HV0uY/Pifm+X/
 fCS7xbQOhWJySQ+6mAjxXB9eo0RqO+RIGGIV9gJWZKt3S3DvAUmvd980jeHUtXRB
 VwMwmnvqvYaGWLWmaTRm1mjdmhCX2JdNN2RMmVN36tGfed0uopIFeax2rtWJ4153
 V+8eZWaLkvvT3iGu+XLUhEfv3UCUy7N1LDk8toe7Xp+qIMvWus3GIsKAUCmJJ3b+
 sGmbYSgn5v8TR65m5QO4/ZWmt4/bi/2Usd6Cq3vd0Op08kTWBTxjdelAVm+dlEYb
 sZLIMrxPg8ogEw8qX4GxROa8/1z9F/62RSmHfk4W7InY2AMJJAg=
 =Ga4m
 -----END PGP SIGNATURE-----

Merge tag 'for-4.18-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:
 "Three regression fixes. They're few-liners and fixing some corner
  cases missed in the origial patches"

* tag 'for-4.18-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: scrub: Don't use inode page cache in scrub_handle_errored_block()
  btrfs: fix use-after-free of cmp workspace pages
  btrfs: restore uuid_mutex in btrfs_open_devices
2018-07-18 11:13:25 -07:00
Michael Callahan
dbae2c5513 block: Define and use STAT_READ and STAT_WRITE
Add defines for STAT_READ and STAT_WRITE for indexing the partition
stat entries. This clarifies some fs/ code which has hardcoded 1 for
STAT_WRITE and will make it easier to extend the stats with additional
fields.

tj: Refreshed on top of v4.17.

Signed-off-by: Michael Callahan <michaelcallahan@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-18 08:44:18 -06:00
Tejun Heo
3f289dcb4b block: make bdev_ops->rw_page() take a REQ_OP instead of bool
c11f0c0b5b ("block/mm: make bdev_ops->rw_page() take a bool for
read/write") replaced @op with boolean @is_write, which limited the
amount of information going into ->rw_page() and more importantly
page_endio(), which removed the need to expose block internals to mm.

Unfortunately, we want to track discards separately and @is_write
isn't enough information.  This patch updates bdev_ops->rw_page() to
take REQ_OP instead but leaves page_endio() to take bool @is_write.
This allows the block part of operations to have enough information
while not leaking it to mm.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Mike Christie <mchristi@redhat.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-18 08:44:14 -06:00
Arnd Bergmann
5f7a01e222 jffs2: use unsigned 32-bit timstamps consistently
Most users of jffs2 are 32-bit systems that traditionally only support
timestamps using a 32-bit signed time_t, in the range from years 1902 to
2038. On 64-bit systems, jffs2 however interpreted the same timestamps
as unsigned values, reading back negative times (before 1970) as times
between 2038 and 2106.

Now that Linux supports 64-bit inode timestamps even on 32-bit systems,
let's use the second interpretation everywhere to allow jffs2 to be
used on 32-bit systems beyond 2038 without a fundamental change to the
inode format.

This has a slight risk of regressions, when existing files with timestamps
before 1970 are present in file system images and are now interpreted
as future time stamps. I considered moving the wraparound point a bit,
e.g. to 1960, in order to deal with timestamps that ended up on Dec 31,
1969 due to incorrect timezone handling. However, this would complicate
the implementation unnecessarily, so I went with the simplest possible
method of extending the timestamps.

Writing files with timestamps before 1970 or after 2106 now results
in those times being clamped in the file system.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Boris Brezillon <boris.brezillon@bootlin.com>
2018-07-18 16:44:01 +02:00
Arnd Bergmann
c4592b9c37 jffs2: use 64-bit intermediate timestamps
The VFS now uses timespec64 timestamps consistently, but jffs2 still
converts them to 32-bit numbers on the storage medium. As the helper
functions for the conversion (get_seconds() and timespec_to_timespec64())
are now deprecated, let's change them over to the more modern
replacements.

This keeps the traditional interpretation of those values, where
the on-disk 32-bit numbers are taken to be negative numbers, i.e.
dates before 1970, on 32-bit machines, but future numbers past 2038
on 64-bit machines.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Boris Brezillon <boris.brezillon@bootlin.com>
2018-07-18 16:43:58 +02:00
Miklos Szeredi
670c23248e ovl: obsolete "check_copy_up" module option
This was provided for debugging the ro/rw inconsistecy.  The inconsitency
is now gone so this option is obsolete.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-07-18 15:44:44 +02:00
Miklos Szeredi
fb16043b46 vfs: remove open_flags from d_real()
Opening regular files on overlayfs is now handled via ovl_open().  Remove
the now unused "open_flags" argument from d_op->d_real() and the d_real()
helper.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-07-18 15:44:44 +02:00
Miklos Szeredi
de2a4a501e Partially revert "locks: fix file locking on overlayfs"
This partially reverts commit c568d68341.

Overlayfs files will now automatically get the correct locks, no need to
hack overlay support in VFS.

It is a partial revert, because it leaves the locks_inode() calls in place
and defines locks_inode() to file_inode().  We could revert those as well,
but it would be unnecessary code churn and it makes sense to document that
we are getting the inode for locking purposes.

Don't revert MS_NOREMOTELOCK yet since that has been part of the userspace
API for some time (though not in a useful way).  Will try to remove
internal flags later when the dust around the new mount API settles.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Acked-by: Jeff Layton <jlayton@kernel.org>
2018-07-18 15:44:43 +02:00
Miklos Szeredi
8cf9ee5061 Revert "vfs: do get_write_access() on upper layer of overlayfs"
This reverts commit 4d0c5ba2ff.

We now get write access on both overlay and underlying layers so this patch
is no longer needed for correct operation.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-07-18 15:44:43 +02:00
Miklos Szeredi
4ab30319fd Revert "vfs: add flags to d_real()"
This reverts commit 495e642939.

No user of "flags" argument of d_real() remain.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-07-18 15:44:43 +02:00
Miklos Szeredi
c671854346 Revert "vfs: update ovl inode before relatime check"
This reverts commit 598e3c8f72.

Overlayfs no longer relies on the vfs correct atime handling.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-07-18 15:44:43 +02:00
Miklos Szeredi
88059de155 Revert "ovl: fix relatime for directories"
This reverts commit cd91304e71.

Overlayfs no longer relies on the vfs correct atime handling.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-07-18 15:44:43 +02:00
Miklos Szeredi
a6795a5859 vfs: fix freeze protection in mnt_want_write_file() for overlayfs
The underlying real file used by overlayfs still contains the overlay path.
This results in mnt_want_write_file() calls by the filesystem getting
freeze protection on the wrong inode (the overlayfs one instead of the real
one).

Fix by using file_inode(file)->i_sb instead of file->f_path.mnt->mnt_sb.

Reported-by: Amir Goldstein <amir73il@gmail.com> 
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2018-07-18 15:44:43 +02:00
Miklos Szeredi
6742cee043 Revert "ovl: don't allow writing ioctl on lower layer"
This reverts commit 7c6893e3c9.

Overlayfs no longer relies on the vfs for checking writability of files.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-07-18 15:44:43 +02:00
Miklos Szeredi
d561f21856 Revert "ovl: fix may_write_real() for overlayfs directories"
This reverts commit 954c736f86.

Overlayfs no longer relies on the vfs for checking writability of files.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-07-18 15:44:43 +02:00
Miklos Szeredi
a6518f73e6 vfs: don't open real
Let overlayfs do its thing when opening a file.

This enables stacking and fixes the corner case when a file is opened for
read, modified through a writable open, and data is read from the read-only
file.  After this patch the read-only open will not return stale data even
in this case.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-07-18 15:44:42 +02:00
Miklos Szeredi
8ede205541 ovl: add reflink/copyfile/dedup support
Since set of arguments are so similar, handle in a common helper.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-07-18 15:44:42 +02:00
Miklos Szeredi
f7c72396d0 ovl: add O_DIRECT support
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-07-18 15:44:42 +02:00
Miklos Szeredi
9e142c4102 ovl: add ovl_fiemap()
Implement stacked fiemap().

Need to split inode operations for regular file (which has fiemap) and
special file (which doesn't have fiemap).

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-07-18 15:44:42 +02:00
Miklos Szeredi
dab5ca8fd9 ovl: add lsattr/chattr support
Implement FS_IOC_GETFLAGS and FS_IOC_SETFLAGS.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-07-18 15:44:42 +02:00
Miklos Szeredi
aab8848cee ovl: add ovl_fallocate()
Implement stacked fallocate.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-07-18 15:44:42 +02:00
Miklos Szeredi
2f502839e8 ovl: add ovl_mmap()
Implement stacked mmap.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-07-18 15:44:42 +02:00
Miklos Szeredi
de30dfd629 ovl: add ovl_fsync()
Implement stacked fsync().

Don't sync if lower (noticed by Amir Goldstein).

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-07-18 15:44:42 +02:00
Miklos Szeredi
2a92e07edc ovl: add ovl_write_iter()
Implement stacked writes.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-07-18 15:44:41 +02:00
Miklos Szeredi
16914e6fc7 ovl: add ovl_read_iter()
Implement stacked reading.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-07-18 15:44:41 +02:00
Miklos Szeredi
2ef66b8a03 ovl: add helper to return real file
In the common case we can just use the real file cached in
file->private_data.  There are two exceptions:

1) File has been copied up since open: in this unlikely corner case just
use a throwaway real file for the operation.  If ever this becomes a
perfomance problem (very unlikely, since overlayfs has been doing most fine
without correctly handling this case at all), then we can deal with that by
updating the cached real file.

2) File's f_flags have changed since open: no need to reopen the cached
real file, we can just change the flags there as well.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-07-18 15:44:41 +02:00
Miklos Szeredi
d1d04ef857 ovl: stack file ops
Implement file operations on a regular overlay file.  The underlying file
is opened separately and cached in ->private_data.

It might be worth making an exception for such files when accounting in
nr_file to confirm to userspace expectations.  We are only adding a small
overhead (248bytes for the struct file) since the real inode and dentry are
pinned by overlayfs anyway.

This patch doesn't have any effect, since the vfs will use d_real() to find
the real underlying file to open.  The patch at the end of the series will
actually enable this functionality.

AV: make it use open_with_fake_path(), don't mess with override_creds

SzM: still need to mess with override_creds() until no fs uses
current_cred() in their open method.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-07-18 15:44:41 +02:00
Miklos Szeredi
e8c985bace ovl: deal with overlay files in ovl_d_real()
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-07-18 15:44:41 +02:00
Miklos Szeredi
46e5d0a390 ovl: copy up file size as well
Copy i_size of the underlying inode to the overlay inode in ovl_copyattr().

This is in preparation for stacking I/O operations on overlay files.

This patch shouldn't have any observable effect.

Remove stale comment from ovl_setattr() [spotted by Vivek Goyal].

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-07-18 15:44:41 +02:00
Miklos Szeredi
5812160eb5 Revert "Revert "ovl: get_write_access() in truncate""
This reverts commit 31c3a70695.

Re-add functionality dealing with i_writecount on truncate to overlayfs.
This patch shouldn't have any observable effects, since we just re-assert
the writecout that vfs_truncate() already got for us.

This is in preparation for moving overlay functionality out of the VFS.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-07-18 15:44:41 +02:00
Miklos Szeredi
4f3572954a ovl: copy up inode flags
On inode creation copy certain inode flags from the underlying real inode
to the overlay inode.

This is in preparation for moving overlay functionality out of the VFS.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-07-18 15:44:41 +02:00
Miklos Szeredi
d9854c87f0 ovl: copy up times
Copy up mtime and ctime to overlay inode after times in real object are
modified.  Be careful not to dirty cachelines when not necessary.

This is in preparation for moving overlay functionality out of the VFS.

This patch shouldn't have any observable effect.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-07-18 15:44:40 +02:00
Miklos Szeredi
f182536684 vfs: export vfs_dedupe_file_range_one() to modules
This is needed by the stacked dedupe implementation in overlayfs.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-07-18 15:44:40 +02:00
Miklos Szeredi
9df6702ad0 vfs: export vfs_ioctl() to modules
This is needed by the stacked ioctl implementation in overlayfs.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-07-18 15:44:40 +02:00
Miklos Szeredi
d3b1084dfd vfs: make open_with_fake_path() not contribute to nr_files
Stacking file operations in overlay will store an extra open file for each
overlay file opened.

The overhead is just that of "struct file" which is about 256bytes, because
overlay already pins an extra dentry and inode when the file is open, which
add up to a much larger overhead.

For fear of breaking working setups, don't start accounting the extra file.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-07-18 15:44:40 +02:00
Miklos Szeredi
51e6ce820b Merge branch 'dedupe-cleanup' into overlayfs-next
Following series for stacking overlay files depends on this mini series.
2018-07-18 15:39:29 +02:00
Miklos Szeredi
9951934d76 Merge branch 'for-ovl' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs into overlayfs-next
This gives us the open_with_fake_path() helper that is needed for stacked
open files in overlay and mmap in particular.
2018-07-18 10:46:05 +02:00
Christoph Hellwig
9ba546c019 aio: don't expose __aio_sigset in uapi
glibc uses a different defintion of sigset_t than the kernel does,
and the current version would pull in both.  To fix this just do not
expose the type at all - this somewhat mirrors pselect() where we
do not even have a type for the magic sigmask argument, but just
use pointer arithmetics.

Fixes: 7a074e96 ("aio: implement io_pgetevents")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reported-by: Adrian Reber <adrian@lisas.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-07-17 23:26:58 -04:00
Carlos Maiolino
5089eafffb libxfs: Fix a couple of sparse complaintis
No significant changes, just silence a couple of sparse errors.

Using cpu_to_be32(NULLAGINO), the NULLAGINO constant will be encoded in
BE as a constant, avoiding a BE -> CPU conversion every iteraction of
the loop, if be32_to_cpu(agi->agi_unlinked[i]) was used instead.

Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Bill O'Donnell <billodo@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-17 14:25:58 -07:00
Gustavo A. R. Silva
e4e542a683 xfs: use swap macro in xfs_dir2_leafn_rebalance
Make use of the swap macro and remove unnecessary variable *tmp*. This
makes the code easier to read and maintain. Also, slightly refactor some
code.

This code was detected with the help of Coccinelle.

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-17 14:25:57 -07:00
Gustavo A. R. Silva
897992b7e3 xfs_bmap_util: use swap macro
Make use of the swap macro and remove some unnecessary variables.
This makes the code easier to read and maintain. Also, reduces the
stack usage.

This code was detected with the help of Coccinelle.

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-17 14:25:57 -07:00
Gustavo A. R. Silva
1d5bebbafc xfs_attr_leaf: use swap macro in xfs_attr3_leaf_rebalance
Make use of the swap macro and remove some unnecessary variables.
This makes the code easier to read and maintain. Also, reduces the
stack usage.

This code was detected with the help of Coccinelle.

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-17 14:25:57 -07:00
Darrick J. Wong
fa248de98a xfs: don't assume a left rmap when allocating a new rmap
The original rmap code assumed that there would always be at least one
rmap in the rmapbt (the AG sb/agf/agi) and so errored out if it didn't
find one.  This assumption isn't true for the rmapbt repair function
(and it won't be true for realtime rmap either), so remove the check and
just deal with the situation.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
2018-07-17 14:25:57 -07:00
Amir Goldstein
6781069307 ovl: fix wrong use of impure dir cache in ovl_iterate()
Only upper dir can be impure, but if we are in the middle of
iterating a lower real dir, dir could be copied up and marked
impure. We only want the impure cache if we started iterating
a real upper dir to begin with.

Aditya Kali reported that the following reproducer hits the
WARN_ON(!cache->refcount) in ovl_get_cache():

 docker run --rm drupal:8.5.4-fpm-alpine \
    sh -c 'cd /var/www/html/vendor/symfony && \
           chown -R www-data:www-data . && ls -l .'

Reported-by: Aditya Kali <adityakali@google.com>
Tested-by: Aditya Kali <adityakali@google.com>
Fixes: 4edb83bb10 ('ovl: constant d_ino for non-merge dirs')
Cc: <stable@vger.kernel.org> # v4.14
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-07-17 16:04:34 +02:00
Mike Christie
cc57c07343 configfs: fix registered group removal
This patch fixes a bug where configfs_register_group had added
a group in a tree, and userspace has done a rmdir on a dir somewhere
above that group and we hit a kernel crash. The problem is configfs_rmdir
will detach everything under it and unlink groups on the default_groups
list. It will not unlink groups added with configfs_register_group so when
configfs_unregister_group is called to drop its references to the group/items
we crash when we try to access the freed dentrys.

The patch just adds a check for if a rmdir has been done above
us and if so just does the unlink part of unregistration.

Sorry if you are getting this multiple times. I thouhgt I sent
this to some of you and lkml, but I do not see it.

Signed-off-by: Mike Christie <mchristi@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-07-17 06:14:07 -07:00
Qu Wenruo
665d4953cd btrfs: scrub: Don't use inode page cache in scrub_handle_errored_block()
In commit ac0b4145d6 ("btrfs: scrub: Don't use inode pages for device
replace") we removed the branch of copy_nocow_pages() to avoid
corruption for compressed nodatasum extents.

However above commit only solves the problem in scrub_extent(), if
during scrub_pages() we failed to read some pages,
sctx->no_io_error_seen will be non-zero and we go to fixup function
scrub_handle_errored_block().

In scrub_handle_errored_block(), for sctx without csum (no matter if
we're doing replace or scrub) we go to scrub_fixup_nodatasum() routine,
which does the similar thing with copy_nocow_pages(), but does it
without the extra check in copy_nocow_pages() routine.

So for test cases like btrfs/100, where we emulate read errors during
replace/scrub, we could corrupt compressed extent data again.

This patch will fix it just by avoiding any "optimization" for
nodatasum, just falls back to the normal fixup routine by try read from
any good copy.

This also solves WARN_ON() or dead lock caused by lame backref iteration
in scrub_fixup_nodatasum() routine.

The deadlock or WARN_ON() won't be triggered before commit ac0b4145d6
("btrfs: scrub: Don't use inode pages for device replace") since
copy_nocow_pages() have better locking and extra check for data extent,
and it's already doing the fixup work by try to read data from any good
copy, so it won't go scrub_fixup_nodatasum() anyway.

This patch disables the faulty code and will be removed completely in a
followup patch.

Fixes: ac0b4145d6 ("btrfs: scrub: Don't use inode pages for device replace")
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-07-17 13:56:30 +02:00
Ingo Molnar
52b544bd38 Linux 4.18-rc5
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAltLpVUeHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGWisH/ikONMwV7OrSk36Y
 5rxzTFUoBk0Qffct88gtSNuRVCxaVb1ofCndvFJE6A6HfJkWpbBzH6eq90aakmJi
 f7uFcu4YmsQpeQaf9lpftWmY2vDf2fIadVTV0RnSMXks57wMax1cpBe7LJGpz13e
 f+g5XRVs1MdlZVtr6tG2SU3Y5AqVVVsYe/0DBPonEqeh9/JJbPFCuNkFOxxzAqPu
 VTnjyoOqG8qtZzjklNtR5rZn0Gv592tWX36eiWTQdThNmVFkGEAJwsHCQlY4OQYK
 61QN4UhOHiu8e1ZuGDNEDhNVRnKtaaYUPFeWL1wLRW73ul4P3ZkpvpS8QTMwcFJI
 JjzNOkI=
 =ckcO
 -----END PGP SIGNATURE-----

Merge tag 'v4.18-rc5' into locking/core, to pick up fixes

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-07-17 09:27:43 +02:00
Jaegeuk Kim
1cb50f87e1 f2fs: do checkpoint in kill_sb
When unmounting f2fs in force mode, we can get it stuck by io_schedule()
by some pending IOs in meta_inode.

io_schedule+0xd/0x30
wait_on_page_bit_common+0xc6/0x130
__filemap_fdatawait_range+0xbd/0x100
filemap_fdatawait_keep_errors+0x15/0x40
sync_inodes_sb+0x1cf/0x240
sync_filesystem+0x52/0x90
generic_shutdown_super+0x1d/0x110
kill_f2fs_super+0x28/0x80 [f2fs]
deactivate_locked_super+0x35/0x60
cleanup_mnt+0x36/0x70
task_work_run+0x79/0xa0
exit_to_usermode_loop+0x62/0x70
do_syscall_64+0xdb/0xf0
entry_SYSCALL_64_after_hwframe+0x44/0xa9
0xffffffffffffffff

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-07-15 12:14:04 +09:00
Jaegeuk Kim
8a56dd9685 f2fs: allow wrong configured dio to buffered write
This fixes to support dio having unaligned buffers as buffered writes.

xfs_io -f -d -c "pwrite 0 512" $testfile
 -> okay

xfs_io -f -d -c "pwrite 1 512" $testfile
 -> EINVAL

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-07-15 12:14:04 +09:00
Eric Biggers
fe10e398e8 reiserfs: fix buffer overflow with long warning messages
ReiserFS prepares log messages into a 1024-byte buffer with no bounds
checks.  Long messages, such as the "unknown mount option" warning when
userspace passes a crafted mount options string, overflow this buffer.
This causes KASAN to report a global-out-of-bounds write.

Fix it by truncating messages to the buffer size.

Link: http://lkml.kernel.org/r/20180707203621.30922-1-ebiggers3@gmail.com
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Reported-by: syzbot+b890b3335a4d8c608963@syzkaller.appspotmail.com
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-07-14 11:11:10 -07:00
Oscar Salvador
24962af7e1 fs, elf: make sure to page align bss in load_elf_library
The current code does not make sure to page align bss before calling
vm_brk(), and this can lead to a VM_BUG_ON() in __mm_populate() due to
the requested lenght not being correctly aligned.

Let us make sure to align it properly.

Kees: only applicable to CONFIG_USELIB kernels: 32-bit and configured
for libc5.

Link: http://lkml.kernel.org/r/20180705145539.9627-1-osalvador@techadventures.net
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Reported-by: syzbot+5dcb560fe12aa5091c06@syzkaller.appspotmail.com
Tested-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Nicolas Pitre <nicolas.pitre@linaro.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-07-14 11:11:10 -07:00
Tomas Bortoli
02f51d4593 autofs: fix slab out of bounds read in getname_kernel()
The autofs subsystem does not check that the "path" parameter is present
for all cases where it is required when it is passed in via the "param"
struct.

In particular it isn't checked for the AUTOFS_DEV_IOCTL_OPENMOUNT_CMD
ioctl command.

To solve it, modify validate_dev_ioctl(function to check that a path has
been provided for ioctl commands that require it.

Link: http://lkml.kernel.org/r/153060031527.26631.18306637892746301555.stgit@pluto.themaw.net
Signed-off-by: Tomas Bortoli <tomasbortoli@gmail.com>
Signed-off-by: Ian Kent <raven@themaw.net>
Reported-by: syzbot+60c837b428dc84e83a93@syzkaller.appspotmail.com
Cc: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-07-14 11:11:09 -07:00
Vlastimil Babka
e70cc2bd57 fs/proc/task_mmu.c: fix Locked field in /proc/pid/smaps*
Thomas reports:
 "While looking around in /proc on my v4.14.52 system I noticed that all
  processes got a lot of "Locked" memory in /proc/*/smaps. A lot more
  memory than a regular user can usually lock with mlock().

  Commit 493b0e9d94 (in v4.14-rc1) seems to have changed the behavior
  of "Locked".

  Before that commit the code was like this. Notice the VM_LOCKED check.

           (vma->vm_flags & VM_LOCKED) ?
                (unsigned long)(mss.pss >> (10 + PSS_SHIFT)) : 0);

  After that commit Locked is now the same as Pss:

	  (unsigned long)(mss->pss >> (10 + PSS_SHIFT)));

  This looks like a mistake."

Indeed, the commit has added mss->pss_locked with the correct value that
depends on VM_LOCKED, but forgot to actually use it.  Fix it.

Link: http://lkml.kernel.org/r/ebf6c7fb-fec3-6a26-544f-710ed193c154@suse.cz
Fixes: 493b0e9d94 ("mm: add /proc/pid/smaps_rollup")
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reported-by: Thomas Lindroth <thomas.lindroth@gmail.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Daniel Colascione <dancol@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-07-14 11:11:09 -07:00
Naohiro Aota
97b191702b btrfs: fix use-after-free of cmp workspace pages
btrfs_cmp_data_free() puts cmp's src_pages and dst_pages, but leaves
their page address intact. Now, if you hit "goto again" in
btrfs_extent_same_range() and hit some error in
btrfs_cmp_data_prepare(), you'll try to unlock/put already put pages.

This is simple fix to reset the address to avoid use-after-free.

Fixes: 67b07bd4be ("Btrfs: reuse cmp workspace in EXTENT_SAME ioctl")
Signed-off-by: Naohiro Aota <naota@elisp.net>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-07-13 17:31:35 +02:00
David Sterba
20c5bbc640 btrfs: restore uuid_mutex in btrfs_open_devices
Commit 542c5908ab ("btrfs: replace uuid_mutex by
device_list_mutex in btrfs_open_devices") switched to device_list_mutex
as we need that for the device list traversal, but we also need
uuid_mutex to protect access to fs_devices::opened to be consistent with
other users of that.

Fixes: 542c5908ab ("btrfs: replace uuid_mutex by device_list_mutex in btrfs_open_devices")
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-07-13 14:55:46 +02:00
Theodore Ts'o
8d5a803c6a ext4: check for allocation block validity with block group locked
With commit 044e6e3d74: "ext4: don't update checksum of new
initialized bitmaps" the buffer valid bit will get set without
actually setting up the checksum for the allocation bitmap, since the
checksum will get calculated once we actually allocate an inode or
block.

If we are doing this, then we need to (re-)check the verified bit
after we take the block group lock.  Otherwise, we could race with
another process reading and verifying the bitmap, which would then
complain about the checksum being invalid.

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1780137

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
2018-07-12 19:08:05 -04:00
Thomas Gleixner
c6bb11147e Merge branch 'fortglx/4.19/time' of https://git.linaro.org/people/john.stultz/linux into timers/core
Pull timekeeping updates from John Stultz:

  - Make the timekeeping update more precise when NTP frequency is set
    directly by updating the multiplier.

  - Adjust selftests
2018-07-12 22:19:58 +02:00
Al Viro
2abc77af89 new helper: open_with_fake_path()
open a file by given inode, faking ->f_path.  Use with shitloads
of caution - at the very least you'd damn better make sure that
some dentry alias of that inode is pinned down by the path in
question.  Again, this is no general-purpose interface and I hope
it will eventually go away.  Right now overlayfs wants something
like that, but nothing else should.

Any out-of-tree code with bright idea of using this one *will*
eventually get hurt, with zero notice and great delight on my part.
I refuse to use EXPORT_SYMBOL_GPL(), especially in situations when
it's really EXPORT_SYMBOL_DONT_USE_IT(), but don't take that export
as "you are welcome to use it".

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-07-12 11:18:42 -04:00
Al Viro
5f336e722c few more cleanups of link_path_walk() callers
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-07-12 10:04:31 -04:00
Al Viro
9b5858e99a allow link_path_walk() to take ERR_PTR()
There is a check for IS_ERR(name) immediately upstream of each call
of link_path_walk(name, nd), with positives treated as if link_path_walk()
failed with PTR_ERR(name).  Taking that check into link_path_walk() itself
simplifies things nicely.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-07-12 10:04:30 -04:00
Al Viro
edc2b1da77 make path_init() unconditionally paired with terminate_walk()
including the failure exits

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-07-12 10:04:30 -04:00
Al Viro
ee1904ba44 make alloc_file() static
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-07-12 10:04:29 -04:00
Al Viro
183266f26f new helper: alloc_file_clone()
alloc_file_clone(old_file, mode, ops): create a new struct file with
->f_path equal to that of old_file.  pipe converted.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-07-12 10:04:28 -04:00
Al Viro
152b6372c9 create_pipe_files(): switch the first allocation to alloc_file_pseudo()
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-07-12 10:04:27 -04:00
Al Viro
52c91f8b3b anon_inode_getfile(): switch to alloc_file_pseudo()
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-07-12 10:04:27 -04:00
Al Viro
e68375c850 hugetlb_file_setup(): switch to alloc_file_pseudo()
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-07-12 10:04:26 -04:00
Al Viro
d93aa9d82a new wrapper: alloc_file_pseudo()
takes inode, vfsmount, name, O_... flags and file_operations and
either returns a new struct file (in which case inode reference we
held is consumed) or returns ERR_PTR(), in which case no refcounts
are altered.

converted aio_private_file() and sock_alloc_file() to it

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-07-12 10:04:23 -04:00
Al Viro
00a07c1591 switch atomic_open() and lookup_open() to returning 0 in all success cases
caller can tell "opened" from "open it yourself" by looking at ->f_mode.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-07-12 10:04:22 -04:00
Al Viro
64e1ac4d46 ->atomic_open(): return 0 in all success cases
FMODE_OPENED can be used to distingusish "successful open" from the
"called finish_no_open(), do it yourself" cases.  Since finish_no_open()
has been adjusted, no changes in the instances were actually needed.
The caller has been adjusted.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-07-12 10:04:21 -04:00
Al Viro
3ec2eef116 get rid of 'opened' in path_openat() and the helpers downstream
unused now

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-07-12 10:04:21 -04:00
Al Viro
44907d7900 get rid of 'opened' argument of ->atomic_open() - part 3
now it can be done...

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-07-12 10:04:20 -04:00
Al Viro
b452a458ca getting rid of 'opened' argument of ->atomic_open() - part 2
__gfs2_lookup(), gfs2_create_inode(), nfs_finish_open() and fuse_create_open()
don't need 'opened' anymore.  Get rid of that argument in those.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-07-12 10:04:20 -04:00
Al Viro
be12af3ef5 getting rid of 'opened' argument of ->atomic_open() - part 1
'opened' argument of finish_open() is unused.  Kill it.

Signed-off-by Al Viro <viro@zeniv.linux.org.uk>
2018-07-12 10:04:19 -04:00
Al Viro
6035a27b25 IMA: don't propagate opened through the entire thing
just check ->f_mode in ima_appraise_measurement()

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-07-12 10:04:19 -04:00
Al Viro
73a09dd943 introduce FMODE_CREATED and switch to it
Parallel to FILE_CREATED, goes into ->f_mode instead of *opened.
NFS is a bit of a wart here - it doesn't have file at the point
where FILE_CREATED used to be set, so we need to propagate it
there (for now).  IMA is another one (here and everywhere)...

Note that this needs do_dentry_open() to leave old bits in ->f_mode
alone - we want it to preserve FMODE_CREATED if it had been already
set (no other bit can be there).

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-07-12 10:04:18 -04:00
Al Viro
aad888f828 switch all remaining checks for FILE_OPENED to FMODE_OPENED
... and don't bother with setting FILE_OPENED at all.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-07-12 10:04:18 -04:00
Al Viro
69527c554f now we can fold open_check_o_direct() into do_dentry_open()
These checks are better off in do_dentry_open(); the reason we couldn't
put them there used to be that callers couldn't tell what kind of cleanup
would do_dentry_open() failure call for.  Now that we have FMODE_OPENED,
cleanup is the same in all cases - it's simply fput().  So let's fold
that into do_dentry_open(), as Christoph's patch tried to.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-07-12 10:04:17 -04:00
Al Viro
7c1c01ec20 lift fput() on late failures into path_openat()
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-07-12 10:04:17 -04:00
Al Viro
4d27f3266f fold put_filp() into fput()
Just check FMODE_OPENED in __fput() and be done with that...

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-07-12 10:04:16 -04:00
Al Viro
f5d11409e6 introduce FMODE_OPENED
basically, "is that instance set up enough for regular fput(), or
do we want put_filp() for that one".

NOTE: the only alloc_file() caller that could be followed by put_filp()
is in arch/ia64/kernel/perfmon.c, which is (Kconfig-level) broken.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-07-12 10:04:16 -04:00
Al Viro
e3f20ae210 security_file_open(): lose cred argument
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-07-12 10:04:15 -04:00
Al Viro
ae2bb293a3 get rid of cred argument of vfs_open() and do_dentry_open()
always equal to ->f_cred

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-07-12 10:04:14 -04:00
Al Viro
ea73ea7279 pass ->f_flags value to alloc_empty_file()
... and have it set the f_flags-derived part of ->f_mode.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-07-12 10:04:13 -04:00
Al Viro
6de37b6dc0 pass creds to get_empty_filp(), make sure dentry_open() passes the right creds
... and rename get_empty_filp() to alloc_empty_file().

dentry_open() gets creds as argument, but the only thing that sees those is
security_file_open() - file->f_cred still ends up with current_cred().  For
almost all callers it's the same thing, but there are several broken cases.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-07-12 10:04:13 -04:00
Al Viro
c9c554f214 alloc_file(): switch to passing O_... flags instead of FMODE_... mode
... so that it could set both ->f_flags and ->f_mode, without callers
having to set ->f_flags manually.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-07-12 10:02:57 -04:00
Mark Rutland
9b54bf9d6a kernel: add kcompat_sys_{f,}statfs64()
Using this helper allows us to avoid the in-kernel calls to the
compat_sys_{f,}statfs64() sycalls, as are necessary for parameter
mangling in arm64's compat handling.

Following the example of ksys_* functions, kcompat_sys_* functions are
intended to be a drop-in replacement for their compat_sys_*
counterparts, with the same calling convention.

This is necessary to enable conversion of arm64's syscall handling to
use pt_regs wrappers.

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Dominik Brodowski <linux@dominikbrodowski.net>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Will Deacon <will.deacon@arm.com>
2018-07-12 14:49:48 +01:00
Carlos Maiolino
efe8032773 xfs: Initialize variables in xfs_alloc_get_rec before using them
Make sure we initialize *bno and *len, before jumping to out_bad_rec
label, and risk calling xfs_warn() with uninitialized variables.

Coverity: 100898
Coverity: 1437081
Coverity: 1437129
Coverity: 1437191
Coverity: 1437201
Coverity: 1437212
Coverity: 1437341
Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:36 -07:00
Eric Sandeen
a4722a643f xfs: remove unused iolock arg from xfs_break_dax_layouts
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:36 -07:00
Brian Foster
bb00b6f1e2 xfs: kill __xfs_buf_submit_common()
Now that there is only one caller, fold the common submission helper
into __xfs_buf_submit().

Suggested-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:35 -07:00
Brian Foster
6af88cda00 xfs: combine [a]sync buffer submission apis
The buffer I/O submission path consists of separate function calls
per type. The buffer I/O type is already controlled via buffer
state (XBF_ASYNC), however, so there is no real need for separate
submission functions.

Combine the buffer submission functions into a single function that
processes the buffer appropriately based on XBF_ASYNC. Retain an
internal helper with a conditional wait parameter to continue to
support batched !XBF_ASYNC submission/completion required by delwri
queues.

Suggested-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:35 -07:00
Brian Foster
e339dd8d8b xfs: use sync buffer I/O for sync delwri queue submission
If a delwri queue occurs of a buffer that sits on a delwri queue
wait list, the queue sets _XBF_DELWRI_Q without changing the state
of ->b_list. This occurs, for example, if another thread beats the
current delwri waiter thread to the buffer lock after I/O
completion. Once the waiter acquires the lock, it removes the buffer
from the wait list and leaves a buffer with _XBF_DELWRI_Q set but
not populated on a list. This results in a lost buffer submission
and in turn can result in assert failures due to _XBF_DELWRI_Q being
set on buffer reclaim or filesystem lockups if the buffer happens to
cover an item in the AIL.

This problem has been reproduced by repeated iterations of xfs/305
on high CPU count (28xcpu) systems with limited memory (~1GB). Dirty
dquot reclaim races with an xfsaild push of a separate dquot backed
by the same buffer such that the buffer sits on the reclaim wait
list at the time xfsaild attempts to queue it. Since the latter
dquot has been flush locked but the underlying buffer not submitted
for I/O, the dquot pins the AIL and causes the filesystem to
livelock.

This race is essentially made possible by the buffer lock cycle
involved with waiting on a synchronous delwri queue submission.
Close the race by using synchronous buffer I/O for respective delwri
queue submission. This means the buffer remains locked across the
I/O and so is inaccessible from other contexts while in the
intermediate wait list state. The sync buffer I/O wait mechanism is
factored into a helper such that sync delwri buffer submission and
serialization are batched operations.

Designed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:34 -07:00
Brian Foster
eaebb515f1 xfs: refactor buffer submission into a common helper
Sync and async buffer submission both do generally similar things
with a couple odd exceptions. Refactor the core buffer submission
code into a common helper to isolate buffer submission from
completion handling of synchronous buffer I/O.

This patch does not change behavior. It is a step towards support
for using synchronous buffer I/O via synchronous delwri queue
submission.

Designed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:34 -07:00
Brian Foster
5fdd97944e xfs: remove xfs_defer_init() firstblock param
All but one caller of xfs_defer_init() passes in the ->t_firstblock
of the associated transaction. The one outlier is
xlog_recover_process_intents(), which simply passes a dummy value
because a valid pointer is required. This firstblock variable can
simply be removed.

At this point we could remove the xfs_defer_init() firstblock
parameter and initialize ->t_firstblock directly. Even that is not
necessary, however, because ->t_firstblock is automatically
reinitialized in the new transaction on a transaction roll. Since
xfs_defer_init() should never occur more than once on a particular
transaction (since the corresponding finish will roll it), replace
the reinit from xfs_defer_init() with an assert that verifies the
transaction has a NULLFSBLOCK firstblock.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:33 -07:00
Brian Foster
9c3bf5da80 xfs: use ->t_firstblock in inode inactivate
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:32 -07:00
Brian Foster
f537538921 xfs: use ->t_firstblock in extent swap
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:32 -07:00
Brian Foster
381d592848 xfs: use ->t_firstblock in reflink cow block cancel
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:31 -07:00
Brian Foster
fb91f4b5d6 xfs: replace no-op firstblock init with ->t_firstblock
xfs_refcount_recover_cow_leftovers() has no need for a firstblock
variable and so passes an unrelated xfs_fsblock_t to
xfs_defer_init() to avoid declaring one. Replace this no-op
initialization with ->t_firstblock. This will be optimized away by
the removal of the xfs_defer_init() firstblock param.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:31 -07:00
Brian Foster
058529c5f5 xfs: use ->t_firstblock in dq alloc
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:30 -07:00
Brian Foster
64396ff2c2 xfs: remove xfs_alloc_arg firstblock field
The xfs_alloc_arg.firstblock field is used to control the starting
agno for an allocation. The structure already carries a pointer to
the transaction, which carries the current firstblock value.

Remove the field and access ->t_firstblock directly in the
allocation code.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:30 -07:00
Brian Foster
cf612de732 xfs: remove xfs_btree_cur private firstblock field
The bmbt cursor private structure has a firstblock field that is
used to maintain locking order on bmbt allocations. The field holds
an actual firstblock value (as opposed to a pointer), so it is
initialized on cursor creation, updated on allocation and then the
value is transferred back to the source before the cursor is
destroyed.

This value is always transferred from and back to the ->t_firstblock
field. Since xfs_btree_cur already carries a reference to the
transaction, we can remove this field from xfs_btree_cur and the
associated copying. The bmbt allocations will update the value in
the transaction directly.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:29 -07:00
Brian Foster
280253d213 xfs: remove bmap format helpers firstblock params
The bmap format helpers receive firstblock via ->t_firstblock. Drop
the param and access it directly.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:29 -07:00
Brian Foster
92f9da30f5 xfs: remove bmap extent add helper firstblock params
The add extent helpers all receive firstblock via ->t_firstblock.
Drop the parameter and access it directly.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:28 -07:00
Brian Foster
94c07b4dba xfs: remove xfs_bmalloca firstblock field
The xfs_bmalloca.firstblock field carries the firstblock value from
the transaction into the bmap infrastructure. It's initialized in
one place from ->t_firstblock, so drop the field and access
->t_firstblock directly throughout the bmap code.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:28 -07:00
Brian Foster
4b77a088d7 xfs: use ->t_firstblock in bmap extent split
Also remove the unnecessary xfs_bmap_split_extent_at() parameter.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:27 -07:00
Brian Foster
333f950c89 xfs: remove bmap insert/collapse firstblock param
The only callers pass ->t_firstblock.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:27 -07:00
Brian Foster
2af5284253 xfs: remove xfs_bunmapi() firstblock param
All callers pass ->t_firstblock from the current transaction.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:25 -07:00
Brian Foster
a7beabeae2 xfs: remove xfs_bmapi_write() firstblock param
All callers pass ->t_firstblock from the current transaction.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:25 -07:00
Brian Foster
d0a9d79572 xfs: use ->t_firstblock in insert/collapse range
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:24 -07:00
Brian Foster
580c4ff948 xfs: use ->t_firstblock in xfs_bmapi_remap()
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:24 -07:00
Brian Foster
372837978d xfs: use ->t_firstblock for all xfs_bunmapi() callers
Convert all xfs_bunmapi() callers to ->t_firstblock.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:23 -07:00
Brian Foster
650919f131 xfs: use ->t_firstblock for all xfs_bmapi_write() callers
Convert all xfs_bmapi_write() users to ->t_firstblock.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:23 -07:00
Brian Foster
766139032f xfs: use ->t_firstblock in xattr ops
Similar to the dirops code, the xattr code uses an on-stack
firstblock variable for the various operations. This code rolls the
underlying transaction in various places, however, which means we
cannot simply replace the local firstblock vars with ->t_firstblock.
Doing so (without further changes) would invalidate the memory
pointed to by xfs_da_args.firstblock as soon as the first
transaction rolls.

To avoid this problem, remove xfs_da_args.firstblock and replace all
such accesses with ->t_firstblock at the same time. This ensures
that accesses to the current firstblock always occur through the
current transaction rather than a potentially invalid xfs_da_args
pointer.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:22 -07:00
Brian Foster
825d75cd8c xfs: use ->t_firstblock in attrfork add
Note that this codepath is a user of struct xfs_da_args. Switch it
over to ->t_firstblock in preparation to remove
xfs_da_args.firstblock.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:21 -07:00
Brian Foster
381eee69f8 xfs: remove firstblock param from xfs dir ops
All callers of the xfs_dir_*() functions pass ->t_firstblock as the
firstblock parameter. Drop the parameter and access ->t_firstblock
directly.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:21 -07:00
Brian Foster
f16dea54b7 xfs: use ->t_firstblock in dir ops
Callers of the xfs_dir_*() functions currently pass an on-stack
firstblock variable. While the dirops infrastructure carries a
pointer to this variable, it never rolls the transaction and so it
is safe to use ->t_firstblock instead.

Fix up the various xfs_dir_*() callers to use ->t_firstblock. Also
remove the unnecessary parameter for xfs_cross_rename().

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:20 -07:00
Brian Foster
bba59c5e4b xfs: add firstblock field to xfs_trans
A firstblock var is typically allocated and initialized along with
xfs_defer_ops structures and passed around independent from the
associated transaction. To facilitate combining the two, add an
optional ->t_firstblock field to xfs_trans that can be used in place
of an on-stack variable.

The firstblock value follows the lifetime of the transaction, so
initialize it on allocation and when a transaction rolls.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:20 -07:00
Brian Foster
3ae2d89174 xfs: allow null firstblock in xfs_bmapi_write() when tp is null
xfs_bmapi_write() always expects a valid firstblock pointer. It
immediately dereferences the pointer to help determine how to
initialize the bma.minleft field. The remaining accesses are
related to modifying btree format forks, which is only relevant for
!COW fork callers.

The reflink code passes a NULL transaction to xfs_bmapi_write() in a
couple places that do COW fork unwritten conversion. The purpose of
the firstblock field is to track the first block allocation in the
current transaction, so technically firstblock should not be
required for these callers either.

Tweak xfs_bmapi_write() to initialize the bma correctly without
accessing the firstblock pointer if no transaction is provided in
the first place. Update the reflink callers to pass NULL instead of
otherwise unused firstblock references.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:19 -07:00
Brian Foster
bcd2c9f335 xfs: refactor dfops init to attach to transaction
Most callers of xfs_defer_init() immediately attach the dfops
structure to a transaction. Add a transaction parameter to eliminate
much of this boilerplate code. This also helps self-document the
fact that many codepaths now expect a dfops pointer implicitly via
xfs_trans->t_dfops.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:19 -07:00
Brian Foster
d5669ed581 xfs: use ->t_dfops in reflink cow recover path
Use ->t_dfops of the leftover COW reservation cleanup transaction.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:18 -07:00
Brian Foster
27356a063a xfs: use ->t_dfops in cancel cow blocks operation
Use ->t_dfops of the transaction from the caller. Reset it before we
return to avoid leaks of local stack memory.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:18 -07:00
Brian Foster
7a7943c7e0 xfs: use ->t_dfops for rmap extent swap operations
xfs_swap_extent_rmap() uses a local dfops instance with a
transaction from the caller. Since there is only one caller, pull
the dfops structure into the caller and attach it to the
transaction. This avoids the need to clear ->t_dfops to prevent
invalid stack memory access.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:17 -07:00
Brian Foster
ed7ef8e55c xfs: remove unused btree cursor bc_private.a.dfops field
The xfs_btree_cur.bc_private.a.dfops field is only ever initialized
by the refcountbt cursor init function. The only caller of that
function with a non-NULL dfops is from deferred completion context,
which already has attached to ->t_dfops.

In addition to that, the only actual reference of a.dfops is the
cursor duplication function, which means the field is effectively
unused.

Remove the dfops field from the bc_private.a union. Any future users
can acquire the dfops from the transaction. This patch does not
change behavior.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:17 -07:00
Brian Foster
42b394a925 xfs: remove xfs_btree_cur bmbt dfops field
All assignments of xfs_btree_cur.bc_private.b.dfops originate from
->t_dfops. Replace accesses of the former with the latter and remove
the unnecessary field. This patch does not change behavior.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:16 -07:00
Brian Foster
81ba8f3e94 xfs: remove dfops param from internal bmap extent helpers
All callers of the various bmap extent helpers now use ->t_dfops.
Remove the unnecessary dfops params and access ->t_dfops directly.
This patch does not change behavior.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:16 -07:00
Brian Foster
f4a9cf97fa xfs: use ->t_dfops for collapse/insert range operations
Use ->t_dfops for the collapse and insert range transactions. These
are the only callers of the respective bmap helpers, so replace the
unnecessary dfops parameters with direct accesses to ->t_dfops.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:15 -07:00
Brian Foster
3e3673e302 xfs: remove struct xfs_bmalloca dfops field
Now that bma.dfops is only assigned from ->t_dfops, replace all
accesses to the former with the latter and remove the unnecessary
field. This patch does not change behavior.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:14 -07:00
Brian Foster
ff3edf255d xfs: remove xfs_bmapi_remap() dfops param
All xfs_bmapi_remap() callers already use ->t_dfops. Note that
deferred completion context unconditionally sets ->t_dfops if it
hasn't already been set by the caller. Remove the unnecessary
parameter and access ->t_dfops directly.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:14 -07:00
Brian Foster
ccd9d91148 xfs: remove xfs_bunmapi() dfops param
Now that all xfs_bunmapi() callers use ->t_dfops, remove the
unnecessary parameter and access ->t_dfops directly. This patch does
not change behavior.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:13 -07:00
Brian Foster
4bcfa613a0 xfs: use ->t_dfops for all xfs_bunmapi() callers
Use ->t_dfops for all remaining xfs_bunmapi() callers. This prepares
the latter to no longer require a dfops parameter.

Note that xfs_itruncate_extents_flags() associates a local dfops
with a transaction provided from the caller. Since there are
multiple callers, set and reset ->t_dfops before the function
returns to avoid exposure of stack memory to the caller.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:13 -07:00
Brian Foster
6e702a5dcb xfs: remove xfs_bmapi_write() dfops param
Now that all callers use ->t_dfops, the xfs_bmapi_write() dfops
parameter is no longer necessary. Remove it and access ->t_dfops
directly. This patch does not change behavior.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:12 -07:00
Brian Foster
175d1a013e xfs: use ->t_dfops for all xfs_bmapi_write() callers
Attach ->t_dfops for all remaining callers of xfs_bmapi_write().
This prepares the latter to no longer require a separate dfops
parameter.

Note that xfs_symlink() already uses ->t_dfops. Fix up the local
references for consistency.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:12 -07:00
Brian Foster
2ba1372125 xfs: use ->t_dfops in dqalloc transaction
xfs_dquot_disk_alloc() receives a transaction from the caller and
passes a local dfops along to xfs_bmapi_write(). If we attach this
dfops to the transaction, we have to make sure to clear it before
returning to avoid invalid access of stack memory.

Since xfs_qm_dqread_alloc() is the only caller, pull dfops into the
caller and attach it to the transaction to eliminate this pattern
entirely.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:11 -07:00
Brian Foster
32a9b7c65c xfs: replace xfs_da_args->dfops accesses with ->t_dfops and remove
Now that xfs_da_args->dfops is always assigned from a ->t_dfops
pointer (or one that is immediately attached), replace all
downstream accesses of the former with the latter and remove the
field from struct xfs_da_args.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:11 -07:00
Brian Foster
d76e6ce8ed xfs: use ->t_dfops in extent split tx and remove param
Attach the local dfops to ->t_dfops of the extent split transaction.
Since this is the only caller of xfs_bmap_split_extent_at(), remove
the dfops parameter as well.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:10 -07:00
Brian Foster
0bd6207f83 xfs: remove dfops param in attr fork add path
Now that the attribute fork add tx carries dfops along with the
transaction, it is unnecessary to pass it down the stack. Remove the
dfops parameter and access ->t_dfops directly where necessary. This
patch does not change behavior.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:10 -07:00
Brian Foster
40d03ac6aa xfs: use ->t_dfops for attr set/remove operations
Attach the local dfops to the transaction allocated for xattr add
and remove operations. Add an earlier initialization in
xfs_attr_remove() to ensure the structure is valid if it remains
unused at transaction commit time.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:09 -07:00
Brian Foster
813d08cb6d xfs: use ->t_dfops for recovery of [b|c]ui log items
Log recovery passes down a central dfops structure to recovery
handlers for bui and cui log items. Each of these handlers allocates
and commits a transaction and defers any remaining operations to be
completed by the main recovery sequence.

Since dfops outlives the transaction in this context, set and clear
->t_dfops appropriately such that the *_finish_item() paths and
below (i.e., xfs_bmapi*()) can expect to find the dfops in the
transaction without it being committed with the dfops attached. This
is required because transaction commit expects that an associated
dfops is finished and in this context the dfops may be populated at
commit time.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:09 -07:00
Brian Foster
c9cfdb3811 xfs: remove dfops param from high level dirname calls
All callers of the directory create, rename and remove interfaces
already associate the dfops with the transaction. Drop the dfops
parameters in these calls in preparation for further cleanups in the
layers below. This patch does not change behavior.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:08 -07:00
Brian Foster
0e0417f3e5 xfs: remove dfops parameter from ifree call stack
The inode free callchain starting in xfs_inactive_ifree() already
associates its dfops with the transaction. It still passes the dfops
on the stack down through xfs_difree_inobt(), however.

Clean up the call stack and reference dfops directly from the
transaction. This patch does not change behavior.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:07 -07:00
Brian Foster
6aa6718439 xfs: rename xfs_trans ->t_agfl_dfops to ->t_dfops
The ->t_agfl_dfops field is currently used to defer agfl block frees
from associated transaction contexts. While all known problematic
contexts have already been updated to use ->t_agfl_dfops, the
broader goal is defer agfl frees from all callers that already use a
deferred operations structure. Further, the transaction field
facilitates a good amount of code clean up where the transaction and
dfops have historically been passed down through the stack
separately.

Rename the field to something more generic to prepare to use it as
such throughout XFS. This patch does not change behavior.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:07 -07:00
Brian Foster
8a74938649 xfs: cow unwritten conversion uses uninitialized dfops
A couple COW fork unwritten extent conversion helpers pass an
uninitialized dfops pointer to xfs_bmapi_write(). This does not
cause problems because conversion does not use a transaction or the
dfops structure for the COW fork.  Drop the uninitialized usage of
dfops in these codepaths and pass NULL along to xfs_bmapi_write()
instead.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:06 -07:00
Christoph Hellwig
98c1a7c0ec xfs: update my copyrights for the writeback and iomap code
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:06 -07:00
Christoph Hellwig
82cb14175e xfs: add support for sub-pagesize writeback without buffer_heads
Switch to using the iomap_page structure for checking sub-page uptodate
status and track sub-page I/O completion status, and remove large
quantities of boilerplate code working around buffer heads.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:05 -07:00
Christoph Hellwig
9dc55f1389 iomap: add support for sub-pagesize buffered I/O without buffer heads
After already supporting a simple implementation of buffered writes for
the blocksize == PAGE_SIZE case in the last commit this adds full support
even for smaller block sizes.   There are three bits of per-block
information in the buffer_head structure that really matter for the iomap
read and write path:

 - uptodate status (BH_uptodate)
 - marked as currently under read I/O (BH_Async_Read)
 - marked as currently under write I/O (BH_Async_Write)

Instead of having new per-block structures this now adds a per-page
structure called struct iomap_page to track this information in a slightly
different form:

 - a bitmap for the per-block uptodate status.  For worst case of a 64k
   page size system this bitmap needs to contain 128 bits.  For the
   typical 4k page size case it only needs 8 bits, although we still
   need a full unsigned long due to the way the atomic bitmap API works.
 - two atomic_t counters are used to track the outstanding read and write
   counts

There is quite a bit of boilerplate code as the buffered I/O path uses
various helper methods, but the actual code is very straight forward.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:05 -07:00
Christoph Hellwig
ac8ee54669 xfs: allow writeback on pages without buffer heads
Disable the IOMAP_F_BUFFER_HEAD flag on file systems with a block size
equal to the page size, and deal with pages without buffer heads in
writeback.  Thanks to the previous refactoring this is basically trivial
now.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:04 -07:00
Christoph Hellwig
8e1f065bea xfs: refactor the tail of xfs_writepage_map
Rejuggle how we deal with the different error vs non-error and have
ioends vs not have ioend cases to keep the fast path streamlined, and
the duplicate code at a minimum.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:04 -07:00
Christoph Hellwig
1b65d3dd2d xfs: remove xfs_start_page_writeback
This helper only has two callers, one of them with a constant error
argument.  Remove it to make pending changes to the code a little easier.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:03 -07:00
Christoph Hellwig
6d465e8953 xfs: move all writeback buffer_head manipulation into xfs_map_at_offset
This keeps it in a single place so it can be made otional more easily.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:03 -07:00
Christoph Hellwig
3faed66764 xfs: don't look at buffer heads in xfs_add_to_ioend
Calculate all information for the bio based on the passed in information
without requiring a buffer_head structure.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:02 -07:00
Christoph Hellwig
889c65b3f6 xfs: remove the imap_valid flag
Simplify the way we check for a valid imap - we know we have a valid
mapping after xfs_map_blocks returned successfully, and we know we can
call xfs_imap_valid on any imap, as it will always fail on a
zero-initialized map.

We can also remove the xfs_imap_valid function and fold it into
xfs_map_blocks now.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:02 -07:00
Christoph Hellwig
3345746ef3 xfs: simplify xfs_map_blocks by using xfs_iext_lookup_extent directly
xfs_bmapi_read adds zero value in xfs_map_blocks.  Replace it with a
direct call to the low-level extent lookup function.

Note that we now always pass a 0 length to the trace points as we ask
for an unspecified len.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:02 -07:00
Christoph Hellwig
060d4eaa0b xfs: remove xfs_reflink_find_cow_mapping
We only have one caller left, and open coding the simple extent list
lookup in it allows us to make the code both more understandable and
reuse calculations and variables already present.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:01 -07:00
Christoph Hellwig
c3a2f9fff1 xfs: remove the now unused XFS_BMAPI_IGSTATE flag
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:01 -07:00
Dave Chinner
e2f6ad4624 xfs: make xfs_writepage_map extent map centric
xfs_writepage_map() iterates over the bufferheads on a page to decide
what sort of IO to do and what actions to take.  However, when it comes
to reflink and deciding when it needs to execute a COW operation, we no
longer look at the bufferhead state but instead we ignore than and look
up internal state held in the COW fork extent list.

This means xfs_writepage_map() is somewhat confused. It does stuff, then
ignores it, then tries to handle the impedence mismatch by shovelling the
results inside the existing mapping code.  It works, but it's a bit of a
mess and it makes it hard to fix the cached map bug that the writepage
code currently has.

To unify the two different mechanisms, we first have to choose a direction.
That's already been set - we're de-emphasising bufferheads so they are no
longer a control structure as we need to do taht to allow for eventual
removal.  Hence we need to move away from looking at bufferhead state to
determine what operations we need to perform.

We can't completely get rid of bufferheads yet - they do contain some
state that is absolutely necessary, such as whether that part of the page
contains valid data or not (buffer_uptodate()).  Other state in the
bufferhead is redundant:

	BH_dirty - the page is dirty, so we can ignore this and just
		write it
	BH_delay - we have delalloc extent info in the DATA fork extent
		tree
	BH_unwritten - same as BH_delay
	BH_mapped - indicates we've already used it once for IO and it is
		mapped to a disk address. Needs to be ignored for COW
		blocks.

The BH_mapped flag is an interesting case - it's supposed to indicate that
it's already mapped to disk and so we can just use it "as is".  In theory,
we don't even have to do an extent lookup to find where to write it too,
but we have to do that anyway to determine we are actually writing over a
valid extent.  Hence it's not even serving the purpose of avoiding a an
extent lookup during writeback, and so we can pretty much ignore it.
Especially as we have to ignore it for COW operations...

Therefore, use the extent map as the source of information to tell us
what actions we need to take and what sort of IO we should perform.  The
first step is to have xfs_map_blocks() set the io type according to what
it looks up.  This means it can easily handle both normal overwrite and
COW cases.  The only thing we also need to add is the ability to return
hole mappings.

We need to return and cache hole mappings now for the case of multiple
blocks per page.  We no longer use the BH_mapped to indicate a block over
a hole, so we have to get that info from xfs_map_blocks().  We cache it so
that holes that span two pages don't need separate lookups.  This allows us
to avoid ever doing write IO over a hole, too.

Now that we have xfs_map_blocks() returning both a cached map and the type
of IO we need to perform, we can rewrite xfs_writepage_map() to drop all
the bufferhead control. It's also much simplified because it doesn't need
to explicitly handle COW operations.  Instead of iterating bufferheads, it
iterates blocks within the page and then looks up what per-block state is
required from the appropriate bufferhead.  It then validates the cached
map, and if it's not valid, we get a new map.  If we don't get a valid map
or it's over a hole, we skip the block.

At this point, we have to remap the bufferhead via xfs_map_at_offset().
As previously noted, we had to do this even if the buffer was already
mapped as the mapping would be stale for XFS_IO_DELALLOC, XFS_IO_UNWRITTEN
and XFS_IO_COW IO types.  With xfs_map_blocks() now controlling the type,
even XFS_IO_OVERWRITE types need remapping, as converted-but-not-yet-
written delalloc extents beyond EOF can be reported at XFS_IO_OVERWRITE.
Bufferheads that span such regions still need their BH_Delay flags cleared
and their block numbers calculated, so we now unconditionally map each
bufferhead before submission.

But wait! There's more - remember the old "treat unwritten extents as
holes on read" hack?  Yeah, that means we can have a dirty page with
unmapped, unwritten bufferheads that contain data!  What makes these so
special is that the unwritten "hole" bufferheads do not have a valid block
device pointer, so if we attempt to write them xfs_add_to_ioend() blows
up. So we make xfs_map_at_offset() do the "realtime or data device"
lookup from the inode and ignore what was or wasn't put into the
bufferhead when the buffer was instantiated.

The astute reader will have realised by now that this code treats
unwritten extents in multiple-blocks-per-page situations differently.
If we get any combination of unwritten blocks on a dirty page that contain
valid data in the page, we're going to convert them to real extents.  This
can actually be a win, because it means that pages with interleaving
unwritten and written blocks will get converted to a single written extent
with zeros replacing the interspersed unwritten blocks.  This is actually
good for reducing extent list and conversion overhead, and it means we
issue a contiguous IO instead of lots of little ones.  The downside is
that we use up a little extra IO bandwidth.  Neither of these seem like a
bad thing given that spinning disks are seek sensitive, and SSDs/pmem have
bandwidth to burn and the lower Io latency/CPU overhead of fewer, larger
IOs will result in better performance on them...

As a result of all this, the only state we actually care about from the
bufferhead is a single flag - BH_Uptodate. We still use the bufferhead to
pass some information to the bio via xfs_add_to_ioend(), but that is
trivial to separate and pass explicitly.  This means we really only need
1 bit of state per block per page from the buffered write path in the
writeback path.  Everything else we do with the bufferhead is purely to
make the buffered IO front end continue to work correctly. i.e we've
pretty much marginalised bufferheads in the writeback path completely.

Signed-off-By: Dave Chinner <dchinner@redhat.com>
[hch: forward port, refactor and split off bits into other commits]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:00 -07:00
Christoph Hellwig
6a4c950136 xfs: rename the offset variable in xfs_writepage_map
Calling it file_offset makes the usage more clear, especially with
a new poffset variable that will be added soon for the offset inside
the page.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:26:00 -07:00
Christoph Hellwig
5c665e5b5a xfs: remove xfs_map_cow
We can handle the existing cow mapping case as a special case directly
in xfs_writepage_map, and share code for allocating delalloc blocks
with regular I/O in xfs_map_blocks.  This means we need to always
call xfs_map_blocks for reflink inodes, but we can still skip most of
the work if it turns out that there is no COW mapping overlapping the
current block.

As a subtle detail we need to start caching holes in the wpc to deal
with the case of COW reservations between EOF.  But we'll need that
infrastructure later anyway, so this is no big deal.

Based on a patch from Dave Chinner.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:25:59 -07:00
Christoph Hellwig
fca8c80542 xfs: remove xfs_reflink_trim_irec_to_next_cow
We already have to check for overlapping COW extents everytime we
come back to a page in xfs_writepage_map / xfs_map_cow, so this
additional trim is not required.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:25:59 -07:00
Christoph Hellwig
a7b28f72ab xfs: don't use XFS_BMAPI_IGSTATE in xfs_map_blocks
We want to be able to use the extent state as a reliably indicator for
the type of I/O, and stop using the buffer head state.  For this we
need to stop using the XFS_BMAPI_IGSTATE so that we don't see merged
extents of different types.

Based on a patch from Dave Chinner.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:25:59 -07:00
Christoph Hellwig
c57371a16d xfs: don't clear imap_valid for a non-uptodate buffers
Finding a buffer that isn't uptodate doesn't invalidate the mapping for
any given block.  The last_sector check will already take care of starting
another ioend as soon as we find any non-update buffer, and if the current
mapping doesn't include the next uptodate buffer the xfs_imap_valid check
will take care of it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:25:58 -07:00
Christoph Hellwig
91cdfd1761 xfs: do not set the page uptodate in xfs_writepage_map
We already track the page uptodate status based on the buffer uptodate
status, which is updated whenever reading or zeroing blocks.

This code has been there since commit a ptool commit in 2002, which
claims to:

    "merge" the 2.4 fsx fix for block size < page size to 2.5.  This needed
    major changes to actually fit.

and isn't present in other writepage implementations.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:25:58 -07:00
Christoph Hellwig
d438017757 xfs: move locking into xfs_bmap_punch_delalloc_range
Both callers want the same looking, so do it only once.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:25:57 -07:00
Christoph Hellwig
0362572138 xfs: simplify xfs_aops_discard_page
Instead of looking at the buffer heads to see if a block is delalloc just
call xfs_bmap_punch_delalloc_range on the whole page - this will leave
any non-delalloc block intact and handle the iteration for us.  As a side
effect one more place stops caring about buffer heads and we can remove the
xfs_check_page_type function entirely.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:25:57 -07:00
Christoph Hellwig
8b2e77c163 xfs: use iomap for blocksize == PAGE_SIZE readpage and readpages
For file systems with a block size that equals the page size we never do
partial reads, so we can use the buffer_head-less iomap versions of
readpage and readpages without conflicting with the buffer_head structures
create later in write_begin.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-11 22:25:56 -07:00
Darrick J. Wong
c2efdfc100 Merge branch 'iomap-4.19-merge' into xfs-4.19-merge 2018-07-11 22:24:40 -07:00
Jaegeuk Kim
7f2ecdd837 f2fs: flush journal nat entries for nat_bits during unmount
Let's flush journal nat entries for speed up in the next run.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-07-11 19:54:51 -07:00
Al Viro
6b4e8085c0 make sure do_dentry_open() won't return positive as an error
An ->open() instances really, really should not be doing that.  There's
a lot of places e.g. around atomic_open() that could be confused by that,
so let's catch that early.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-07-10 23:29:03 -04:00
Al Viro
b10a4a9f76 create_pipe_files(): use fput() if allocation of the second file fails
... just use put_pipe_info() to get the pipe->files down to 1 and let
fput()-called pipe_release() do freeing.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-07-10 23:29:03 -04:00
Al Viro
b4e7a7a88b drm_mode_create_lease_ioctl(): fix open-coded filp_clone_open()
Failure of ->open() should *not* be followed by fput().  Fixed by
using filp_clone_open(), which gets the cleanups right.

Cc: stable@vger.kernel.org
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-07-10 23:29:03 -04:00
Al Viro
19f391eb05 turn filp_clone_open() into inline wrapper for dentry_open()
it's exactly the same thing as
	dentry_open(&file->f_path, file->f_flags, file->f_cred)

... and rename it to file_clone_open(), while we are at it.
'filp' naming convention is bogus; sure, it's "file pointer",
but we generally don't do that kind of Hungarian notation.
Some of the instances have too many callers to touch, but this
one has only two, so let's sanitize it while we can...

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-07-10 23:29:03 -04:00
Al Viro
e8cff84faa fold security_file_free() into file_free()
.. and the call of file_free() in case of security_file_alloc() failure
in get_empty_filp() should be simply file_free_rcu() - no point in
rcu-delays there, anyway.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-07-10 23:29:03 -04:00
Theodore Ts'o
362eca70b5 ext4: fix inline data updates with checksums enabled
The inline data code was updating the raw inode directly; this is
problematic since if metadata checksums are enabled,
ext4_mark_inode_dirty() must be called to update the inode's checksum.
In addition, the jbd2 layer requires that get_write_access() be called
before the metadata buffer is modified.  Fix both of these problems.

https://bugzilla.kernel.org/show_bug.cgi?id=200443

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2018-07-10 01:07:43 -04:00
Theodore Ts'o
2dca60d98e ext4: clear mmp sequence number when remounting read-only
Previously, when an MMP-protected file system is remounted read-only,
the kmmpd thread would exit the next time it woke up (a few seconds
later), without resetting the MMP sequence number back to
EXT4_MMP_SEQ_CLEAN.

Fix this by explicitly killing the MMP thread when the file system is
remounted read-only.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: Andreas Dilger <adilger@dilger.ca>
2018-07-08 19:36:02 -04:00
Theodore Ts'o
44de022c43 ext4: fix false negatives *and* false positives in ext4_check_descriptors()
Ext4_check_descriptors() was getting called before s_gdb_count was
initialized.  So for file systems w/o the meta_bg feature, allocation
bitmaps could overlap the block group descriptors and ext4 wouldn't
notice.

For file systems with the meta_bg feature enabled, there was a
fencepost error which would cause the ext4_check_descriptors() to
incorrectly believe that the block allocation bitmap overlaps with the
block group descriptor blocks, and it would reject the mount.

Fix both of these problems.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2018-07-08 19:35:02 -04:00
Linus Torvalds
70a2dc6abc Bug fixes for ext4; most of which relate to vulnerabilities where a
maliciously crafted file system image can result in a kernel OOPS or
 hang.  At least one fix addresses an inline data bug could be
 triggered by userspace without the need of a crafted file system
 (although it does require that the inline data feature be enabled).
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAltBmcYACgkQ8vlZVpUN
 gaPDJgf/cEa9QuiYTbNOmcOMorK9LEk5XO8qsiJdUVNQtLsHZfl0QowbkF9/F/W5
 andTJzNpFvXeLADMTTjpsDnQ90i8LKD11Kol3dPJcMhJhELtQsjxUBguxpQBP86R
 dvHuCl2/AaqX7rr6Co80yYSinRCquqkzJNhdM5/MLNGziSpkQL3dPSs93rmV+YbU
 8DkUwmhDhoiToLBTLaldrAsAzKvor3uyjNPJ3qhxeE2kXrnuI1V4XfstBGjhVKFB
 /5aYWexDZkL5qiCo+lZnqdITqUnPx3uAkUdBn0dj7V+nDow+/R/8nApvlvJu6usF
 OfMoKr098/pmPAjE5aZ8QpBNVtLFpg==
 =njzR
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 bugfixes from Ted Ts'o:
 "Bug fixes for ext4; most of which relate to vulnerabilities where a
  maliciously crafted file system image can result in a kernel OOPS or
  hang.

  At least one fix addresses an inline data bug could be triggered by
  userspace without the need of a crafted file system (although it does
  require that the inline data feature be enabled)"

* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: check superblock mapped prior to committing
  ext4: add more mount time checks of the superblock
  ext4: add more inode number paranoia checks
  ext4: avoid running out of journal credits when appending to an inline file
  jbd2: don't mark block as modified if the handle is out of credits
  ext4: never move the system.data xattr out of the inode body
  ext4: clear i_data in ext4_inode_info when removing inline data
  ext4: include the illegal physical block in the bad map ext4_error msg
  ext4: verify the depth of extent tree in ext4_find_extent()
  ext4: only look at the bg_flags field if it is valid
  ext4: make sure bitmaps and the inode table don't overlap with bg descriptors
  ext4: always check block group bounds in ext4_init_block_bitmap()
  ext4: always verify the magic number in xattr blocks
  ext4: add corruption check in ext4_xattr_set_entry()
  ext4: add warn_on_error mount option
2018-07-08 11:10:30 -07:00
Linus Torvalds
b2d44d145d five smb3/cifs fixes for stable (including for some leaks and memory overwrites) and also a few fixes for recent regressions in packet signing
-----BEGIN PGP SIGNATURE-----
 
 iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAltBFyMACgkQiiy9cAdy
 T1EwcAwAoflntkPJtDX1/Ch3fm4cwR/GHiOHJ3jXUUs5x1JVy2YMyIpojijcDB9q
 ifmc9ZEVdov5kJVJF4dz4HUhxDwPbZTgZdAwSaYUdbQepA0Nzu7k7ZaTfzWwzYTa
 AaGxNShfEWogBdhMjNPKHpIUfrnOEtosv6iLLN3iwkbypLH0f3z1Tye38+9wnDO/
 B0M64lf4gxMB7ZsjFoQIu9ZLZMlQgF9ISycPUUmahR6G9sTJaykfyTihTwOo8HUb
 zNA6hgW5lUxCpCc2eNwy2wFuLqwf3+t3JmWUgJoYqVCbscywtTScivZyNEO36/17
 4oFCExMuJ79TXBP9RyTFrYkNhsTTdAyfDOLWcsMVsAo+zHub1nqjm8ENlmGJ7ZAS
 ESdLY+E+59Hndb21Te1IVq7HZsmXKHU6UHxknXTaXFPlBIKeHbH7vtt5zUzq7lxW
 hDwPTmev+b7jOE/4+cR5WQItMxzZ+pW7Toc6f8gmN1IU2FJjEsTgNGy2n4Az5WyR
 pZAydSRd
 =x5ij
 -----END PGP SIGNATURE-----

Merge tag '4.18-rc3-smb3fixes' of git://git.samba.org/sfrench/cifs-2.6

Pull cifs fixes from Steve French:
 "Five smb3/cifs fixes for stable (including for some leaks and memory
  overwrites) and also a few fixes for recent regressions in packet
  signing.

  Additional testing at the recent SMB3 test event, and some good work
  by Paulo and others spotted the issues fixed here. In addition to my
  xfstest runs on these, Aurelien and Stefano did additional test runs
  to verify this set"

* tag '4.18-rc3-smb3fixes' of git://git.samba.org/sfrench/cifs-2.6:
  cifs: Fix stack out-of-bounds in smb{2,3}_create_lease_buf()
  cifs: Fix infinite loop when using hard mount option
  cifs: Fix slab-out-of-bounds in send_set_info() on SMB2 ACE setting
  cifs: Fix memory leak in smb2_set_ea()
  cifs: fix SMB1 breakage
  cifs: Fix validation of signed data in smb2
  cifs: Fix validation of signed data in smb3+
  cifs: Fix use after free of a mid_q_entry
2018-07-07 18:31:34 -07:00
Rajat Jain
c855cf2759 sysfs: Fix internal_create_group() for named group updates
There are a couple of problems with named group updates in the code
today:

* sysfs_update_group() will always fail for a named group, because
  internal_create_group() will try to create a new sysfs directory
  unconditionally, which will ofcourse fail with -EEXIST.

* We can leak the kernfs_node for grp->name if some one tries to:
  - rename a group (change grp->name), or
  - update a named group, to an unnamed group

It appears that the whole purpose of sysfs_update_group() was to
allow changing the permissions or visibility of attributes and not
the names. So make it clear in the comments, and allow it to update
an existing named group.

Signed-off-by: Rajat Jain <rajatja@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-07-07 17:54:46 +02:00
Guenter Roeck
166126c1e5 kernfs: Replace strncpy with memcpy
gcc 8.1.0 complains:

fs/kernfs/symlink.c:91:3: warning:
	'strncpy' output truncated before terminating nul copying
	as many bytes from a string as its length
fs/kernfs/symlink.c: In function 'kernfs_iop_get_link':
fs/kernfs/symlink.c:88:14: note: length computed here

Using strncpy() is indeed less than perfect since the length of data to
be copied has already been determined with strlen(). Replace strncpy()
with memcpy() to address the warning and optimize the code a little.

Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-07-07 09:57:10 +02:00
Miklos Szeredi
1b4f42a1e3 vfs: dedupe: extract helper for a single dedup
Extract vfs_dedupe_file_range_one() helper to deal with a single dedup
request.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2018-07-06 23:57:03 +02:00
Miklos Szeredi
87eb5eb242 vfs: dedupe: rationalize args
Clean up f_op->dedupe_file_range() interface.

1) Use loff_t for offsets and length instead of u64
2) Order the arguments the same way as {copy|clone}_file_range().

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-06 23:57:03 +02:00
Miklos Szeredi
5740c99e9d vfs: dedupe: return int
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-07-06 23:57:03 +02:00
Miklos Szeredi
92b66d2cdd vfs: limit size of dedupe
Suggested-by: Darrick J. Wong <darrick.wong@oracle.com> 
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-07-06 23:57:02 +02:00
Linus Torvalds
0fa3ecd878 Fix up non-directory creation in SGID directories
sgid directories have special semantics, making newly created files in
the directory belong to the group of the directory, and newly created
subdirectories will also become sgid.  This is historically used for
group-shared directories.

But group directories writable by non-group members should not imply
that such non-group members can magically join the group, so make sure
to clear the sgid bit on non-directories for non-members (but remember
that sgid without group execute means "mandatory locking", just to
confuse things even more).

Reported-by: Jann Horn <jannh@google.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-07-05 12:36:36 -07:00
Stefano Brivio
729c0c9dd5 cifs: Fix stack out-of-bounds in smb{2,3}_create_lease_buf()
smb{2,3}_create_lease_buf() store a lease key in the lease
context for later usage on a lease break.

In most paths, the key is currently sourced from data that
happens to be on the stack near local variables for oplock in
SMB2_open() callers, e.g. from open_shroot(), whereas
smb2_open_file() properly allocates space on its stack for it.

The address of those local variables holding the oplock is then
passed to create_lease_buf handlers via SMB2_open(), and 16
bytes near oplock are used. This causes a stack out-of-bounds
access as reported by KASAN on SMB2.1 and SMB3 mounts (first
out-of-bounds access is shown here):

[  111.528823] BUG: KASAN: stack-out-of-bounds in smb3_create_lease_buf+0x399/0x3b0 [cifs]
[  111.530815] Read of size 8 at addr ffff88010829f249 by task mount.cifs/985
[  111.532838] CPU: 3 PID: 985 Comm: mount.cifs Not tainted 4.18.0-rc3+ #91
[  111.534656] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
[  111.536838] Call Trace:
[  111.537528]  dump_stack+0xc2/0x16b
[  111.540890]  print_address_description+0x6a/0x270
[  111.542185]  kasan_report+0x258/0x380
[  111.544701]  smb3_create_lease_buf+0x399/0x3b0 [cifs]
[  111.546134]  SMB2_open+0x1ef8/0x4b70 [cifs]
[  111.575883]  open_shroot+0x339/0x550 [cifs]
[  111.591969]  smb3_qfs_tcon+0x32c/0x1e60 [cifs]
[  111.617405]  cifs_mount+0x4f3/0x2fc0 [cifs]
[  111.674332]  cifs_smb3_do_mount+0x263/0xf10 [cifs]
[  111.677915]  mount_fs+0x55/0x2b0
[  111.679504]  vfs_kern_mount.part.22+0xaa/0x430
[  111.684511]  do_mount+0xc40/0x2660
[  111.698301]  ksys_mount+0x80/0xd0
[  111.701541]  do_syscall_64+0x14e/0x4b0
[  111.711807]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  111.713665] RIP: 0033:0x7f372385b5fa
[  111.715311] Code: 48 8b 0d 99 78 2c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 49 89 ca b8 a5 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 66 78 2c 00 f7 d8 64 89 01 48
[  111.720330] RSP: 002b:00007ffff27049d8 EFLAGS: 00000206 ORIG_RAX: 00000000000000a5
[  111.722601] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f372385b5fa
[  111.724842] RDX: 000055c2ecdc73b2 RSI: 000055c2ecdc73f9 RDI: 00007ffff270580f
[  111.727083] RBP: 00007ffff2705804 R08: 000055c2ee976060 R09: 0000000000001000
[  111.729319] R10: 0000000000000000 R11: 0000000000000206 R12: 00007f3723f4d000
[  111.731615] R13: 000055c2ee976060 R14: 00007f3723f4f90f R15: 0000000000000000

[  111.735448] The buggy address belongs to the page:
[  111.737420] page:ffffea000420a7c0 count:0 mapcount:0 mapping:0000000000000000 index:0x0
[  111.739890] flags: 0x17ffffc0000000()
[  111.741750] raw: 0017ffffc0000000 0000000000000000 dead000000000200 0000000000000000
[  111.744216] raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
[  111.746679] page dumped because: kasan: bad access detected

[  111.750482] Memory state around the buggy address:
[  111.752562]  ffff88010829f100: 00 f2 f2 f2 f2 f2 f2 f2 00 00 00 00 00 00 00 00
[  111.754991]  ffff88010829f180: 00 00 f2 f2 00 00 00 00 00 00 00 00 00 00 00 00
[  111.757401] >ffff88010829f200: 00 00 00 00 00 f1 f1 f1 f1 01 f2 f2 f2 f2 f2 f2
[  111.759801]                                               ^
[  111.762034]  ffff88010829f280: f2 02 f2 f2 f2 f2 f2 f2 f2 00 00 00 00 00 00 00
[  111.764486]  ffff88010829f300: f2 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[  111.766913] ==================================================================

Lease keys are however already generated and stored in fid data
on open and create paths: pass them down to the lease context
creation handlers and use them.

Suggested-by: Aurélien Aptel <aaptel@suse.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Fixes: b8c32dbb0d ("CIFS: Request SMB2.1 leases")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2018-07-05 13:48:25 -05:00
Paulo Alcantara
7ffbe65578 cifs: Fix infinite loop when using hard mount option
For every request we send, whether it is SMB1 or SMB2+, we attempt to
reconnect tcon (cifs_reconnect_tcon or smb2_reconnect) before carrying
out the request.

So, while server->tcpStatus != CifsNeedReconnect, we wait for the
reconnection to succeed on wait_event_interruptible_timeout(). If it
returns, that means that either the condition was evaluated to true, or
timeout elapsed, or it was interrupted by a signal.

Since we're not handling the case where the process woke up due to a
received signal (-ERESTARTSYS), the next call to
wait_event_interruptible_timeout() will _always_ fail and we end up
looping forever inside either cifs_reconnect_tcon() or smb2_reconnect().

Here's an example of how to trigger that:

$ mount.cifs //foo/share /mnt/test -o
username=foo,password=foo,vers=1.0,hard

(break connection to server before executing bellow cmd)
$ stat -f /mnt/test & sleep 140
[1] 2511

$ ps -aux -q 2511
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root      2511  0.0  0.0  12892  1008 pts/0    S    12:24   0:00 stat -f
/mnt/test

$ kill -9 2511

(wait for a while; process is stuck in the kernel)
$ ps -aux -q 2511
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root      2511 83.2  0.0  12892  1008 pts/0    R    12:24  30:01 stat -f
/mnt/test

By using 'hard' mount point means that cifs.ko will keep retrying
indefinitely, however we must allow the process to be killed otherwise
it would hang the system.

Signed-off-by: Paulo Alcantara <palcantara@suse.de>
Cc: stable@vger.kernel.org
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2018-07-05 13:48:25 -05:00
Stefano Brivio
f46ecbd97f cifs: Fix slab-out-of-bounds in send_set_info() on SMB2 ACE setting
A "small" CIFS buffer is not big enough in general to hold a
setacl request for SMB2, and we end up overflowing the buffer in
send_set_info(). For instance:

 # mount.cifs //127.0.0.1/test /mnt/test -o username=test,password=test,nounix,cifsacl
 # touch /mnt/test/acltest
 # getcifsacl /mnt/test/acltest
 REVISION:0x1
 CONTROL:0x9004
 OWNER:S-1-5-21-2926364953-924364008-418108241-1000
 GROUP:S-1-22-2-1001
 ACL:S-1-5-21-2926364953-924364008-418108241-1000:ALLOWED/0x0/0x1e01ff
 ACL:S-1-22-2-1001:ALLOWED/0x0/R
 ACL:S-1-22-2-1001:ALLOWED/0x0/R
 ACL:S-1-5-21-2926364953-924364008-418108241-1000:ALLOWED/0x0/0x1e01ff
 ACL:S-1-1-0:ALLOWED/0x0/R
 # setcifsacl -a "ACL:S-1-22-2-1004:ALLOWED/0x0/R" /mnt/test/acltest

this setacl will cause the following KASAN splat:

[  330.777927] BUG: KASAN: slab-out-of-bounds in send_set_info+0x4dd/0xc20 [cifs]
[  330.779696] Write of size 696 at addr ffff88010d5e2860 by task setcifsacl/1012

[  330.781882] CPU: 1 PID: 1012 Comm: setcifsacl Not tainted 4.18.0-rc2+ #2
[  330.783140] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
[  330.784395] Call Trace:
[  330.784789]  dump_stack+0xc2/0x16b
[  330.786777]  print_address_description+0x6a/0x270
[  330.787520]  kasan_report+0x258/0x380
[  330.788845]  memcpy+0x34/0x50
[  330.789369]  send_set_info+0x4dd/0xc20 [cifs]
[  330.799511]  SMB2_set_acl+0x76/0xa0 [cifs]
[  330.801395]  set_smb2_acl+0x7ac/0xf30 [cifs]
[  330.830888]  cifs_xattr_set+0x963/0xe40 [cifs]
[  330.840367]  __vfs_setxattr+0x84/0xb0
[  330.842060]  __vfs_setxattr_noperm+0xe6/0x370
[  330.843848]  vfs_setxattr+0xc2/0xd0
[  330.845519]  setxattr+0x258/0x320
[  330.859211]  path_setxattr+0x15b/0x1b0
[  330.864392]  __x64_sys_setxattr+0xc0/0x160
[  330.866133]  do_syscall_64+0x14e/0x4b0
[  330.876631]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  330.878503] RIP: 0033:0x7ff2e507db0a
[  330.880151] Code: 48 8b 0d 89 93 2c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 49 89 ca b8 bc 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 56 93 2c 00 f7 d8 64 89 01 48
[  330.885358] RSP: 002b:00007ffdc4903c18 EFLAGS: 00000246 ORIG_RAX: 00000000000000bc
[  330.887733] RAX: ffffffffffffffda RBX: 000055d1170de140 RCX: 00007ff2e507db0a
[  330.890067] RDX: 000055d1170de7d0 RSI: 000055d115b39184 RDI: 00007ffdc4904818
[  330.892410] RBP: 0000000000000001 R08: 0000000000000000 R09: 000055d1170de7e4
[  330.894785] R10: 00000000000002b8 R11: 0000000000000246 R12: 0000000000000007
[  330.897148] R13: 000055d1170de0c0 R14: 0000000000000008 R15: 000055d1170de550

[  330.901057] Allocated by task 1012:
[  330.902888]  kasan_kmalloc+0xa0/0xd0
[  330.904714]  kmem_cache_alloc+0xc8/0x1d0
[  330.906615]  mempool_alloc+0x11e/0x380
[  330.908496]  cifs_small_buf_get+0x35/0x60 [cifs]
[  330.910510]  smb2_plain_req_init+0x4a/0xd60 [cifs]
[  330.912551]  send_set_info+0x198/0xc20 [cifs]
[  330.914535]  SMB2_set_acl+0x76/0xa0 [cifs]
[  330.916465]  set_smb2_acl+0x7ac/0xf30 [cifs]
[  330.918453]  cifs_xattr_set+0x963/0xe40 [cifs]
[  330.920426]  __vfs_setxattr+0x84/0xb0
[  330.922284]  __vfs_setxattr_noperm+0xe6/0x370
[  330.924213]  vfs_setxattr+0xc2/0xd0
[  330.926008]  setxattr+0x258/0x320
[  330.927762]  path_setxattr+0x15b/0x1b0
[  330.929592]  __x64_sys_setxattr+0xc0/0x160
[  330.931459]  do_syscall_64+0x14e/0x4b0
[  330.933314]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

[  330.936843] Freed by task 0:
[  330.938588] (stack is not available)

[  330.941886] The buggy address belongs to the object at ffff88010d5e2800
 which belongs to the cache cifs_small_rq of size 448
[  330.946362] The buggy address is located 96 bytes inside of
 448-byte region [ffff88010d5e2800, ffff88010d5e29c0)
[  330.950722] The buggy address belongs to the page:
[  330.952789] page:ffffea0004357880 count:1 mapcount:0 mapping:ffff880108fdca80 index:0x0 compound_mapcount: 0
[  330.955665] flags: 0x17ffffc0008100(slab|head)
[  330.957760] raw: 0017ffffc0008100 dead000000000100 dead000000000200 ffff880108fdca80
[  330.960356] raw: 0000000000000000 0000000080100010 00000001ffffffff 0000000000000000
[  330.963005] page dumped because: kasan: bad access detected

[  330.967039] Memory state around the buggy address:
[  330.969255]  ffff88010d5e2880: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[  330.971833]  ffff88010d5e2900: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[  330.974397] >ffff88010d5e2980: 00 00 00 00 00 00 00 00 fc fc fc fc fc fc fc fc
[  330.976956]                                            ^
[  330.979226]  ffff88010d5e2a00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[  330.981755]  ffff88010d5e2a80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[  330.984225] ==================================================================

Fix this by allocating a regular CIFS buffer in
smb2_plain_req_init() if the request command is SMB2_SET_INFO.

Reported-by: Jianhong Yin <jiyin@redhat.com>
Fixes: 366ed846df ("cifs: Use smb 2 - 3 and cifsacl mount options setacl function")
CC: Stable <stable@vger.kernel.org>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-and-tested-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2018-07-05 13:48:25 -05:00
Paulo Alcantara
6aa0c114ec cifs: Fix memory leak in smb2_set_ea()
This patch fixes a memory leak when doing a setxattr(2) in SMB2+.

Signed-off-by: Paulo Alcantara <palcantara@suse.de>
Cc: stable@vger.kernel.org
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
2018-07-05 13:48:24 -05:00
Ronnie Sahlberg
81f39f951b cifs: fix SMB1 breakage
SMB1 mounting broke in commit 35e2cc1ba7
("cifs: Use correct packet length in SMB2_TRANSFORM header")
Fix it and also rename smb2_rqst_len to smb_rqst_len
to make it less unobvious that the function is also called from
CIFS/SMB1

Good job by Paulo reviewing and cleaning up Ronnie's original patch.

Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Reviewed-by: Paulo Alcantara <palcantara@suse.de>
Signed-off-by: Steve French <stfrench@microsoft.com>
2018-07-05 13:48:24 -05:00
Paulo Alcantara
8de8c4608f cifs: Fix validation of signed data in smb2
Fixes: c713c8770f ("cifs: push rfc1002 generation down the stack")

We failed to validate signed data returned by the server because
__cifs_calc_signature() now expects to sign the actual data in iov but
we were also passing down the rfc1002 length.

Fix smb3_calc_signature() to calculate signature of rfc1002 length prior
to passing only the actual data iov[1-N] to __cifs_calc_signature(). In
addition, there are a few cases where no rfc1002 length is passed so we
make sure there's one (iov_len == 4).

Signed-off-by: Paulo Alcantara <palcantara@suse.de>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2018-07-05 13:48:24 -05:00
Paulo Alcantara
27c32b49c3 cifs: Fix validation of signed data in smb3+
Fixes: c713c8770f ("cifs: push rfc1002 generation down the stack")

We failed to validate signed data returned by the server because
__cifs_calc_signature() now expects to sign the actual data in iov but
we were also passing down the rfc1002 length.

Fix smb3_calc_signature() to calculate signature of rfc1002 length prior
to passing only the actual data iov[1-N] to __cifs_calc_signature(). In
addition, there are a few cases where no rfc1002 length is passed so we
make sure there's one (iov_len == 4).

Signed-off-by: Paulo Alcantara <palcantara@suse.de>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2018-07-05 13:48:24 -05:00
Lars Persson
696e420bb2 cifs: Fix use after free of a mid_q_entry
With protocol version 2.0 mounts we have seen crashes with corrupt mid
entries. Either the server->pending_mid_q list becomes corrupt with a
cyclic reference in one element or a mid object fetched by the
demultiplexer thread becomes overwritten during use.

Code review identified a race between the demultiplexer thread and the
request issuing thread. The demultiplexer thread seems to be written
with the assumption that it is the sole user of the mid object until
it calls the mid callback which either wakes the issuer task or
deletes the mid.

This assumption is not true because the issuer task can be woken up
earlier by a signal. If the demultiplexer thread has proceeded as far
as setting the mid_state to MID_RESPONSE_RECEIVED then the issuer
thread will happily end up calling cifs_delete_mid while the
demultiplexer thread still is using the mid object.

Inserting a delay in the cifs demultiplexer thread widens the race
window and makes reproduction of the race very easy:

		if (server->large_buf)
			buf = server->bigbuf;

+		usleep_range(500, 4000);

		server->lstrp = jiffies;

To resolve this I think the proper solution involves putting a
reference count on the mid object. This patch makes sure that the
demultiplexer thread holds a reference until it has finished
processing the transaction.

Cc: stable@vger.kernel.org
Signed-off-by: Lars Persson <larper@axis.com>
Acked-by: Paulo Alcantara <palcantara@suse.de>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2018-07-05 13:48:24 -05:00
Linus Torvalds
d02d21ea00 autofs: rename 'autofs' module back to 'autofs4'
It turns out that systemd has a bug: it wants to load the autofs module
early because of some initialization ordering with udev, and it doesn't
do that correctly.  Everywhere else it does the proper "look up module
name" that does the proper alias resolution, but in that early code, it
just uses a hardcoded "autofs4" for the module name.

The result of that is that as of commit a2225d931f ("autofs: remove
left-over autofs4 stubs"), you get

    systemd[1]: Failed to insert module 'autofs4': No such file or directory

in the system logs, and a lack of module loading.  All this despite the
fact that we had very clearly marked 'autofs4' as an alias for this
module.

What's so ridiculous about this is that literally everything else does
the module alias handling correctly, including really old versions of
systemd (that just used 'modprobe' to do this), and even all the other
systemd module loading code.

Only that special systemd early module load code is broken, hardcoding
the module names for not just 'autofs4', but also "ipv6", "unix",
"ip_tables" and "virtio_rng".  Very annoying.

Instead of creating an _additional_ separate compatibility 'autofs4'
module, just rely on the fact that everybody else gets this right, and
just call the module 'autofs4' for compatibility reasons, with 'autofs'
as the alias name.

That will allow the systemd people to fix their bugs, adding the proper
alias handling, and maybe even fix the name of the module to be just
"autofs" (so that they can _test_ the alias handling).  And eventually,
we can revert this silly compatibility hack.

See also

    https://github.com/systemd/systemd/issues/9501
    https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=902946

for the systemd bug reports upstream and in the Debian bug tracker
respectively.

Fixes: a2225d931f ("autofs: remove left-over autofs4 stubs")
Reported-by: Ben Hutchings <ben@decadent.org.uk>
Reported-by: Michael Biebl <biebl@debian.org>
Cc: Ian Kent <raven@themaw.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-07-05 11:35:04 -07:00
Andreas Gruenbacher
b7eba890a2 gfs2: Eliminate redundant ip->i_rgd
GFS2 remembers the last rgrp used for allocations in ip->i_rgd.
However, block allocations are made by way of a reservations structure,
ip->i_res, which keeps the last rgrp in ip->i_res.rs_rgd, and ip->i_res
is kept in sync with ip->i_res.rs_rgd, so it's redundant.  Get rid of
ip->i_rgd and just use ip->i_res.rs_rgd in its place.

Based on patches by Robert Peterson.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2018-07-05 17:47:16 +02:00
Andreas Gruenbacher
03f8c41c73 gfs2: Stop messing with ip->i_rgd in the rlist code
In the resource group list code, keep the last resource group added in
the last position in the array.  Check against that instead of messing
with ip->i_rgd.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2018-07-04 21:38:42 +01:00
Janosch Frank
1e2c043628 userfaultfd: hugetlbfs: fix userfaultfd_huge_must_wait() pte access
Use huge_ptep_get() to translate huge ptes to normal ptes so we can
check them with the huge_pte_* functions.  Otherwise some architectures
will check the wrong values and will not wait for userspace to bring in
the memory.

Link: http://lkml.kernel.org/r/20180626132421.78084-1-frankja@linux.ibm.com
Fixes: 369cd2121b ("userfaultfd: hugetlbfs: userfaultfd_huge_must_wait for hugepmd ranges")
Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-07-03 17:32:18 -07:00
Matthew Wilcox
3fae17468a fs: Fix attr.c kernel-doc
A couple of minor warnings.

Signed-off-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-07-03 16:44:45 -04:00
Andreas Gruenbacher
806a1477b1 iomap: add inline data support to iomap_readpage_actor
Just copy the inline data into the page using the existing helper.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-03 09:07:47 -07:00
Andreas Gruenbacher
ec181f6782 iomap: support direct I/O to inline data
Add support for reading from and writing to inline data to iomap_dio_rw.
This saves filesystems from having to implement fallback code for this
case.

The inline data is actually cached in the inode, so the I/O is only
direct in the sense that it doesn't go through the page cache.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-03 09:07:47 -07:00
Christoph Hellwig
09230435df iomap: refactor iomap_dio_actor
Split the function up into two helpers for the bio based I/O and hole
case, and a small helper to call the two.  This separates the code a
little better in preparation for supporting I/O to inline data.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-07-03 09:07:46 -07:00
Jon Derrick
a17712c8e4 ext4: check superblock mapped prior to committing
This patch attempts to close a hole leading to a BUG seen with hot
removals during writes [1].

A block device (NVME namespace in this test case) is formatted to EXT4
without partitions. It's mounted and write I/O is run to a file, then
the device is hot removed from the slot. The superblock attempts to be
written to the drive which is no longer present.

The typical chain of events leading to the BUG:
ext4_commit_super()
  __sync_dirty_buffer()
    submit_bh()
      submit_bh_wbc()
        BUG_ON(!buffer_mapped(bh));

This fix checks for the superblock's buffer head being mapped prior to
syncing.

[1] https://www.spinics.net/lists/linux-ext4/msg56527.html

Signed-off-by: Jon Derrick <jonathan.derrick@intel.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
2018-07-02 18:45:18 -04:00
Andreas Gruenbacher
025d0e7f73 gfs2: Remove gfs2_write_{begin,end}
Now that generic_file_write_iter is no longer used, there are no
remaining users of these address space operations.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Bob Peterson <rpeterso@redhat.com>
2018-07-02 16:27:38 +01:00
Andreas Gruenbacher
967bcc91b0 gfs2: iomap direct I/O support
The page unmapping previously done in gfs2_direct_IO is now done
generically in iomap_dio_rw.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Bob Peterson <rpeterso@redhat.com>
2018-07-02 16:27:32 +01:00
Andreas Gruenbacher
bcfe94139a gfs2: gfs2_extent_length cleanup
Now that gfs2_extent_length is no longer used for determining the size
of a hole and always with an upper size limit, the function can be
simplified.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Bob Peterson <rpeterso@redhat.com>
2018-07-02 16:27:24 +01:00
Andreas Gruenbacher
64bc06bb32 gfs2: iomap buffered write support
With the traditional page-based writes, blocks are allocated separately
for each page written to.  With iomap writes, we can allocate a lot more
blocks at once, with a fraction of the allocation overhead for each
page.

Split calculating the number of blocks that can be allocated at a given
position (gfs2_alloc_size) off from gfs2_iomap_alloc: that size
determines the number of blocks to allocate and reserve in the journal.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Bob Peterson <rpeterso@redhat.com>
2018-07-02 16:27:17 +01:00
Andreas Gruenbacher
d505a96a3b gfs2: Further iomap cleanups
In gfs2_iomap_alloc, set the type of newly allocated extents to
IOMAP_MAPPED so that iomap_to_bh will set the bh states correctly:
otherwise, the bhs would not be marked as mapped, confusing
__mpage_writepage.  This means that we need to check for the IOMAP_F_NEW
flag in fallocate_chunk now.

Further clean up gfs2_iomap_get and implement gfs2_stuffed_iomap here
directly.  For reads beyond the end of the file, return holes instead of
failing with -ENOENT so that we can get rid of that special case in
gfs2_block_map.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Bob Peterson <rpeterso@redhat.com>
2018-07-02 16:26:01 +01:00
Guenter Roeck
1823342a1f configfs: replace strncpy with memcpy
gcc 8.1.0 complains:

fs/configfs/symlink.c:67:3: warning:
	'strncpy' output truncated before terminating nul copying as many
	bytes from a string as its length
fs/configfs/symlink.c: In function 'configfs_get_link':
fs/configfs/symlink.c:63:13: note: length computed here

Using strncpy() is indeed less than perfect since the length of data to
be copied has already been determined with strlen(). Replace strncpy()
with memcpy() to address the warning and optimize the code a little.

Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-07-02 07:12:55 -06:00
Linus Torvalds
d3bc0e67f8 for-4.18-rc2-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAls4zz0ACgkQxWXV+ddt
 WDs0ZhAAplEAcN1BP986BS7GpjrG20vQtP9AnHlnSEJJJmsnykpspBylOcLRkKjF
 LKBfBPCKqIo7kn5ebKT1Kk7zJPkOOEfmxGW7hffVN/oa/oMtmgJbbHDUgl2TDgdu
 rky1O+Bj+S37s5rhiXAJ4oU9ekdpWIlN30GczfynjiPqGigKh/cINsEEhQIIAiJG
 PRDQfSIJeh67x1AP0KE8sJAYSsaeFxT+kHrT/NPs1NFDSzrQSa/QWPFVjGVVuI/Y
 w84Mo0EqdRV7tap7D3QyWyYea6zdP00PG8TyLl0Kba+LckFbzpNN5hP3SUxleBzL
 0ZBJi7/tOqnrMV3YaGm40dLfgD4B+CFt8zDyg2JvWUxxEzfQfYif7KIT2IV8fSqS
 QrVw2NrzQC7EZ4Zu98wCN7dyyOE8yhqbq805YdG3Nj+zT6DqRu01TBo4Yr/Ek8ux
 +ITAtQVbaOZmTIt/qh/Oxc5jRsurAno1FP3XRH+1hfSlS7xc3LfI1CUbX3jAKzXN
 edxdM4/h+d4nekvROnKBH4EheS6+ZVfgzYlYUW9c2rjcJ1RHhDElbh14+IoM6LKJ
 nJ+Cp+744F6W5jaG3oWElJrdhlY31mWUjiZaj2CHl16EcH3MToytrxKMX+OWo95W
 gChnKicrtpO6+9nbED3Tdhp7SkbDysun6jvEpSgdlm8+2H5Kwrw=
 =KWL7
 -----END PGP SIGNATURE-----

Merge tag 'for-4.18-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:
 "We have a few regression fixes for qgroup rescan status tracking and
  the vm_fault_t conversion that mixed up the error values"

* tag 'for-4.18-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  Btrfs: fix mount failure when qgroup rescan is in progress
  Btrfs: fix regression in btrfs_page_mkwrite() from vm_fault_t conversion
  btrfs: quota: Set rescan progress to (u64)-1 if we hit last leaf
2018-07-01 12:38:16 -07:00
Linus Torvalds
4a770e638f Merge branch 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs fix from Al Viro:
 "Followup to procfs-seq_file series this window"

This fixes a memory leak by making sure that proc seq files release any
private data on close.  The 'proc_seq_open' has to be properly paired
with 'proc_seq_release' that releases the extra private data.

* 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  proc: add proc_seq_release
2018-07-01 12:32:19 -07:00
Al Viro
e876c445df hpfs: fix an inode leak in lookup, switch to d_splice_alias()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-06-30 09:43:45 -04:00
Linus Torvalds
7886953859 A trivial dentry leak fix from Zheng.
-----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAls2PxETHGlkcnlvbW92
 QGdtYWlsLmNvbQAKCRBKf944AhHzi51OB/4yN29wPAiCFDeN87gTnkO2flNbATzH
 PCBMydy4nUSfUmnfaMZyMLo0xM6EXIxLte9kGt0VEQwJrw7cwK+IhkhhOQtbH2ER
 /oxul0nAfgEOp0kFzHY7iQzd1qppaxAYtfXLy/BdOsTbntHcHPvp619+JKiuJoXM
 XA2ABV/bdw21FMPq+kP4uyzQsPw/yTXAw47JlVe1PqsCE17Eyj3X8GaMXeTZo52z
 bbOzFJBxlOOHwJ2w6TM4uK6Pr/BgIF4QHDkcNfNzajtdzW9y9wAmVb2GrmkLnwMw
 rou6MgmbB76nUL82SWMMqSeqV1hHGW4iPqu8kpWekIbGOSZJ8a7FIbqo
 =qSQ6
 -----END PGP SIGNATURE-----

Merge tag 'ceph-for-4.18-rc3' of git://github.com/ceph/ceph-client

Pull ceph fix from Ilya Dryomov:
 "A trivial dentry leak fix from Zheng"

* tag 'ceph-for-4.18-rc3' of git://github.com/ceph/ceph-client:
  ceph: fix dentry leak in splice_dentry()
2018-06-29 12:19:47 -07:00
Al Viro
b0c6108ecf nfs_instantiate(): prevent multiple aliases for directory inode
Since NFS allows open-by-fhandle, we have to cope with the possibility
of mkdir vs. open-by-guessed-handle races.  A local filesystem could
decide what the inumber of the new object will be and insert a locked
inode with that inumber into icache _before_ the on-disk data structures
begin to look good and unlock it only once it has a dentry alias, so
that open-by-handle coming first would quietly fail and mkdir coming
first would have open-by-handle grab its dentry.

For NFS it's a non-starter - the icache key is server-supplied fhandle
and we do not get that until the object has been fully created on server.
We really have to deal with the possibility that open-by-handle gets
the in-core inode and attaches a dentry to it before mkdir does.

Solution: let nfs_mkdir() use d_splice_alias() to catch those.  We can
	* get an error.  Just return it to our caller.
	* get NULL - no preexisting dentry aliases, we'd just done what
d_add() would've done.  Success.
	* get a reference to preexisting alias.  In that case the alias
had been moved in place of nfs_mkdir() argument (and hashed there), while
nfs_mkdir() argument is left unhashed negative.  Which is just fine for
->mkdir() callers, all we need is to release the reference we'd got from
d_splice_alias() and report success.

Cc: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-06-28 15:28:48 -04:00
Linus Torvalds
a11e1d432b Revert changes to convert to ->poll_mask() and aio IOCB_CMD_POLL
The poll() changes were not well thought out, and completely
unexplained.  They also caused a huge performance regression, because
"->poll()" was no longer a trivial file operation that just called down
to the underlying file operations, but instead did at least two indirect
calls.

Indirect calls are sadly slow now with the Spectre mitigation, but the
performance problem could at least be largely mitigated by changing the
"->get_poll_head()" operation to just have a per-file-descriptor pointer
to the poll head instead.  That gets rid of one of the new indirections.

But that doesn't fix the new complexity that is completely unwarranted
for the regular case.  The (undocumented) reason for the poll() changes
was some alleged AIO poll race fixing, but we don't make the common case
slower and more complex for some uncommon special case, so this all
really needs way more explanations and most likely a fundamental
redesign.

[ This revert is a revert of about 30 different commits, not reverted
  individually because that would just be unnecessarily messy  - Linus ]

Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-06-28 10:40:47 -07:00
Carlos Maiolino
9991274fdd xfs: Initialize variables in xfs_alloc_get_rec before using them
Make sure we initialize *bno and *len, before jumping to out_bad_rec
label, and risk calling xfs_warn() with uninitialized variables.

Coverity: 100898
Coverity: 1437081
Coverity: 1437129
Coverity: 1437191
Coverity: 1437201
Coverity: 1437212
Coverity: 1437341
Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-06-28 06:56:23 -07:00
Filipe Manana
e4e7ede739 Btrfs: fix mount failure when qgroup rescan is in progress
If a power failure happens while the qgroup rescan kthread is running,
the next mount operation will always fail. This is because of a recent
regression that makes qgroup_rescan_init() incorrectly return -EINVAL
when we are mounting the filesystem (through btrfs_read_qgroup_config()).
This causes the -EINVAL error to be returned regardless of any qgroup
flags being set instead of returning the error only when neither of
the flags BTRFS_QGROUP_STATUS_FLAG_RESCAN nor BTRFS_QGROUP_STATUS_FLAG_ON
are set.

A test case for fstests follows up soon.

Fixes: 9593bf4967 ("btrfs: qgroup: show more meaningful qgroup_rescan_init error message")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-06-28 11:30:57 +02:00
Chris Mason
717beb96d9 Btrfs: fix regression in btrfs_page_mkwrite() from vm_fault_t conversion
The vm_fault_t conversion commit introduced a ret2 variable for tracking
the integer return values from internal btrfs functions.  It was
sometimes returning VM_FAULT_LOCKED for pages that were actually invalid
and had been removed from the radix.  Something like this:

    ret2 = btrfs_delalloc_reserve_space() // returns zero on success

    lock_page(page)
    if (page->mapping != inode->i_mapping)
	goto out_unlock;

...

out_unlock:
    if (!ret2) {
	    ...
	    return VM_FAULT_LOCKED;
    }

This ends up triggering this WARNING in btrfs_destroy_inode()
    WARN_ON(BTRFS_I(inode)->block_rsv.size);

xfstests generic/095 was able to reliably reproduce the errors.

Since out_unlock: is only used for errors, this fix moves it below the
if (!ret2) check we use to return VM_FAULT_LOCKED for success.

Fixes: a528a24150 (btrfs: change return type of btrfs_page_mkwrite to vm_fault_t)
Signed-off-by: Chris Mason <clm@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-06-28 11:30:50 +02:00
Qu Wenruo
6f7de19ed3 btrfs: quota: Set rescan progress to (u64)-1 if we hit last leaf
Commit ff3d27a048 ("btrfs: qgroup: Finish rescan when hit the last leaf
of extent tree") added a new exit for rescan finish.

However after finishing quota rescan, we set
fs_info->qgroup_rescan_progress to (u64)-1 before we exit through the
original exit path.
While we missed that assignment of (u64)-1 in the new exit path.

The end result is, the quota status item doesn't have the same value.
(-1 vs the last bytenr + 1)
Although it doesn't affect quota accounting, it's still better to keep
the original behavior.

Reported-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com>
Fixes: ff3d27a048 ("btrfs: qgroup: Finish rescan when hit the last leaf of extent tree")
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-06-28 11:30:48 +02:00
Chunyu Hu
877f919e19 proc: add proc_seq_release
kmemleak reported some memory leak on reading proc files. After adding
some debug lines, find that proc_seq_fops is using seq_release as
release handler, which won't handle the free of 'private' field of
seq_file, while in fact the open handler proc_seq_open could create
the private data with __seq_open_private when state_size is greater
than zero. So after reading files created with proc_create_seq_private,
such as /proc/timer_list and /proc/vmallocinfo, the private mem of a
seq_file is not freed. Fix it by adding the paired proc_seq_release
as the default release handler of proc_seq_ops instead of seq_release.

Fixes: 44414d82cf ("proc: introduce proc_create_seq_private")
Reviewed-by: Christoph Hellwig <hch@lst.de>
CC: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chunyu Hu <chuhu@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-06-27 20:44:38 -04:00
Linus Torvalds
f57494321c Changes since last update:
- More metadata validation strengthening to prevent crashes.
 - Fix extent offset overflow problem when insert_range on a 512b block fs
 - Fix some off-by-one errors in the realtime fsmap code
 - Fix some math errors in the default resblks calculation when free space
   is low
 - Fix a problem where stale page contents are exposed via mmap read
   after a zero_range at eof
 - Fix accounting problems with per-ag reservations causing statfs
   reports to vary incorrectly
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAlsv6mQACgkQ+H93GTRK
 tOv3ZA//W9QVf6aGv3fefg7ZkleRpKPhpvrdgy+LnnhGdkToKGsOIJGavFjBYfBh
 cFSOXrKamNiSZiJzzQ8NiFaZ8wrD2NAhOybatp7jAGKGb+a3VTtoz7v+4VwJ5Psw
 gPdZ+zADqsbbUg4Xx08TDqdQDHlIXz6kTvrM+MGBq5TynICAIUgS2o9fJT1HKf12
 xfbC2dh2cK/7TnN0MNCLzekrPoiLfuEPsMkrnWv/PBre0AhfcCJLbNhQi5+mCWcs
 e71vttXAc3kkSly6G4D9SSb+5Mn1Q1SkBsORn82fJdI59eHjFn1ujLvkfPM1w1Iu
 oDjJw8PCyrG7NFZiTh46MuJhvH/Tc6ZZvop/YJyJJApHOYYx6K64enpJrWcg+CKp
 Rcl5gQocvlIfMyzBvyozpHctlIuOjkQsSmyEXXx8PAsr7zU+FehXS5ENzCOtdg5U
 LivcsXmh0bvSU1PdzA3oiANcIPAfVX6+RoM4rQpDjT+IS7Ud9F2hELD0Yi/pmk4l
 oC91+e4oPJ4HsoCDmxePLl29Gqm4n1lY9i3l1PsZe8yWeYxs991R/hJuxKvRQLul
 sxkgKk5nvdG2PjSmAKx9XYRrHEf/3uK+TEtrQtrm06i1QEI881qqSEMHpaoRj0qo
 awESWimVodllaag+ugFLi+u+UAh8LA0jTAmijBqczLtspU1Cg0c=
 =Aw39
 -----END PGP SIGNATURE-----

Merge tag 'xfs-4.18-fixes-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull xfs fixes from Darrick Wong:
 "Here are some patches for 4.18 to fix regressions, accounting
  problems, overflow problems, and to strengthen metadata validation to
  prevent corruption.

  This series has been run through a full xfstests run over the weekend
  and through a quick xfstests run against this morning's master, with
  no major failures reported.

  Changes since last update:

   - more metadata validation strengthening to prevent crashes.

   - fix extent offset overflow problem when insert_range on a 512b
     block fs

   - fix some off-by-one errors in the realtime fsmap code

   - fix some math errors in the default resblks calculation when free
     space is low

   - fix a problem where stale page contents are exposed via mmap read
     after a zero_range at eof

   - fix accounting problems with per-ag reservations causing statfs
     reports to vary incorrectly"

* tag 'xfs-4.18-fixes-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  xfs: fix fdblocks accounting w/ RMAPBT per-AG reservation
  xfs: ensure post-EOF zeroing happens after zeroing part of a file
  xfs: fix off-by-one error in xfs_rtalloc_query_range
  xfs: fix uninitialized field in rtbitmap fsmap backend
  xfs: recheck reflink state after grabbing ILOCK_SHARED for a write
  xfs: don't allow insert-range to shift extents past the maximum offset
  xfs: don't trip over negative free space in xfs_reserve_blocks
  xfs: allow empty transactions while frozen
  xfs: xfs_iflush_abort() can be called twice on cluster writeback failure
  xfs: More robust inode extent count validation
  xfs: simplify xfs_bmap_punch_delalloc_range
2018-06-27 12:21:06 -07:00
Henry Wilson
4d97f7d53d inotify: Add flag IN_MASK_CREATE for inotify_add_watch()
The flag IN_MASK_CREATE is introduced as a flag for inotiy_add_watch()
which prevents inotify from modifying any existing watches when invoked.
If the pathname specified in the call has a watched inode associated
with it and IN_MASK_CREATE is specified, fail with an errno of EEXIST.

Use of IN_MASK_CREATE with IN_MASK_ADD is reserved for future use and
will return EINVAL.

RATIONALE

In the current implementation, there is no way to prevent
inotify_add_watch() from modifying existing watch descriptors. Even if
the caller keeps a record of all watch descriptors collected, this is
only sufficient to detect that an existing watch descriptor may have
been modified.

The assumption that a particular path will map to the same inode over
multiple calls to inotify_add_watch() cannot be made as files can be
renamed or deleted.  It is also not possible to assume that two distinct
paths do no map to the same inode, due to hard-links or a dereferenced
symbolic link. Further uses of inotify_add_watch() to revert the change
may cause other watch descriptors to be modified or created, merely
compunding the problem. There is currently no system call such as
inotify_modify_watch() to explicity modify a watch descriptor, which
would be able to revert unwanted changes. Thus the caller cannot
guarantee to be able to revert any changes to existing watch decriptors.

Additionally the caller cannot assume that the events that are
associated with a watch descriptor are within the set requested, as any
future calls to inotify_add_watch() may unintentionally modify a watch
descriptor's mask. Thus it cannot currently be guaranteed that a watch
descriptor will only generate events which have been requested. The
program must filter events which come through its watch descriptor to
within its expected range.

Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Henry Wilson <henry.wilson@acentic.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2018-06-27 19:21:25 +02:00
Arnd Bergmann
fe2c32545b ext2: use ktime_get_real_seconds for timestamps
get_seconds() is deprecated because of the y2038 overflow, so users
should migrate to 64-bit timestamps using ktime_get_real_seconds().
In ext2, the timestamps in the superblock and in the inode are all
limited to 32-bit, and this won't get fixed, so let's just stop
using the deprecated interface and keep truncating.

All users of ext2 should migrate to ext4 before 2038 to prevent this
from causing problems.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Jan Kara <jack@suse.cz>
2018-06-27 13:59:18 +02:00
Arnd Bergmann
c3b9cecd89 udf: convert inode stamps to timespec64
The VFS structures are finally converted to always use 64-bit timestamps,
and this file system can represent a long range of on-disk timestamps
already, so now let's fit in the missing bits for udf.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Jan Kara <jack@suse.cz>
2018-06-27 13:58:00 +02:00
Amir Goldstein
eaa2c6b0c9 fanotify: factor out helpers to add/remove mark
Factor out helpers fanotify_add_mark() and fanotify_remove_mark()
to reduce duplicated code.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2018-06-27 13:45:12 +02:00
Amir Goldstein
3ac70bfcde fsnotify: add helper to get mask from connector
Use a helper to get the mask from the object (i.e. i_fsnotify_mask)
to generalize code of add/remove inode/vfsmount mark.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2018-06-27 13:45:07 +02:00
Amir Goldstein
36f10f55ff fsnotify: let connector point to an abstract object
Make the code to attach/detach a connector to object more generic
by letting the fsnotify connector point to an abstract fsnotify_connp_t.
Code that needs to dereference an inode or mount object now uses the
helpers fsnotify_conn_{inode,mount}.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2018-06-27 13:45:05 +02:00
Amir Goldstein
b812a9f589 fsnotify: pass connp and object type to fsnotify_add_mark()
Instead of passing inode and vfsmount arguments to fsnotify_add_mark()
and its _locked variant, pass an abstract object pointer and the object
type.

The helpers fsnotify_obj_{inode,mount} are added to get the concrete
object pointer from abstract object pointer.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2018-06-27 13:45:03 +02:00
Amir Goldstein
9b6e543450 fsnotify: use typedef fsnotify_connp_t for brevity
The object marks manipulation functions fsnotify_destroy_marks()
fsnotify_find_mark() and their helpers take an argument of type
struct fsnotify_mark_connector __rcu ** to dereference the connector
pointer. use a typedef to describe this type for brevity.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2018-06-27 13:44:59 +02:00
Yan, Zheng
8b8f53af1e ceph: fix dentry leak in splice_dentry()
In any case, d_splice_alias() does not drop reference of original
dentry.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-06-26 18:42:44 +02:00
Linus Torvalds
84bfed40fc for-4.18-rc1-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAlsvtAIACgkQxWXV+ddt
 WDvPJQ//eEr+ACNxG5n42e63TaNpLOeoGAsUChwekP6DAIc2n/r78SHX+T/woDqd
 7/dc2eqlYF5Fmn6MPQ1ufL2xlfw0t3OnemK8T5+4sxZDdzeAH9V+kHaqoaLUPExL
 4r6lK5Ywwa2cWghC7WvQg3+bWLX18OEExG63SlEhLo3YJM5uUqVhGVPi6ARrbxNM
 GJvXcQsxjXLqukm4gYvHC6Zra9Awv65uiAU+VCm2y96j1fEJ0yjK/pC1RtoFGCqS
 48Jjuzfq/V3nxy0Wjr3DvpQVEQcKyGha6i/eazZISdRhGSjrYdvIwpUn7gZeoo2Q
 hT8VVergLbVYgIeaOwgwubQzNaG2C/ZTsEjPQrNrA7a/AGsh5C/ommYE9MSyS6L0
 PG0NLUNDXFmEj8WdI97So+1Sm2OCb04DPbPhHbbkhw5L/MPE1TaLN5aUWguj8laB
 NnyPRdVP9jCJAI0OhJY7nPDmsPKe2jogVVsRheTcob+V5G+vIgzDXlGfsW/88Seg
 dHubMaC0nz6u8Cj4dIviitiLXuustyz0JkVdljTLawWWJ/Hs7NlsSf3Q3nj2Kvia
 e8QMID0vLphQyL1hqC0n7M0g2ohq9NUGT1nhLTPSpFl0l8bIA9PQehOx9Q5vx8yp
 tJF+d0qiNfgvadA+KhyvQ8puQb2+zZQ8+Pqfjwd4ySD7SI5dcA4=
 =fCdm
 -----END PGP SIGNATURE-----

Merge tag 'for-4.18-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:
 "Two regression fixes and an incorrect error value propagation fix from
  'rename exchange'"

* tag 'for-4.18-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  Btrfs: fix return value on rename exchange failure
  btrfs: fix invalid-free in btrfs_extent_same
  Btrfs: fix physical offset reported by fiemap for inline extents
2018-06-26 08:41:54 -07:00
Darrick J. Wong
d8cb5e4237 xfs: fix fdblocks accounting w/ RMAPBT per-AG reservation
In __xfs_ag_resv_init we incorrectly calculate the amount by which to
decrease fdblocks when reserving blocks for the rmapbt.  Because rmapbt
allocations do not decrease fdblocks, we must decrease fdblocks by the
entire size of the requested reservation in order to achieve our goal of
always having enough free blocks to satisfy an rmapbt expansion.

This is in contrast to the refcountbt/finobt, which /do/ subtract from
fdblocks whenever they allocate a block.  For this allocation type we
preserve the existing behavior where we decrease fdblocks only by the
requested reservation minus the size of the existing tree.

This fixes the problem where the available block counts reported by
statfs change across a remount if there had been an rmapbt size change
since mount time.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
2018-06-24 12:00:12 -07:00
Darrick J. Wong
e53c4b5983 xfs: ensure post-EOF zeroing happens after zeroing part of a file
If a user asks us to zero_range part of a file, the end of the range is
EOF, and not aligned to a page boundary, invoke writeback of the EOF
page to ensure that the post-EOF part of the page is zeroed.  This
ensures that we don't expose stale memory contents via mmap, if in a
clumsy manner.

Found by running generic/127 when it runs zero_range and mapread at EOF
one after the other.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
2018-06-24 11:56:36 -07:00
Darrick J. Wong
a3a374bf18 xfs: fix off-by-one error in xfs_rtalloc_query_range
In commit 8ad560d256 ("xfs: strengthen rtalloc query range checks")
we strengthened the input parameter checks in the rtbitmap range query
function, but introduced an off-by-one error in the process.  The call
to xfs_rtfind_forw deals with the high key being rextents, but we clamp
the high key to rextents - 1.  This causes the returned results to stop
one block short of the end of the rtdev, which is incorrect.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2018-06-24 11:56:36 -07:00
Darrick J. Wong
232d0a24b0 xfs: fix uninitialized field in rtbitmap fsmap backend
Initialize the extent count field of the high key so that when we use
the high key to synthesize an 'unknown owner' record (i.e. used space
record) at the end of the queried range we have a field with which to
compute rm_blockcount.  This is not strictly necessary because the
synthesizer never uses the rm_blockcount field, but we can shut up the
static code analysis anyway.

Coverity-id: 1437358
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2018-06-24 11:56:36 -07:00
Darrick J. Wong
5bd88d1539 xfs: recheck reflink state after grabbing ILOCK_SHARED for a write
The reflink iflag could have changed since the earlier unlocked check,
so if we got ILOCK_SHARED for a write and but we're now a reflink inode
we have to switch to ILOCK_EXCL and relock.

This helps us avoid blowing lock assertions in things like generic/166:

XFS: Assertion failed: xfs_isilocked(ip, XFS_ILOCK_EXCL), file: fs/xfs/xfs_reflink.c, line: 383
WARNING: CPU: 1 PID: 24707 at fs/xfs/xfs_message.c:104 assfail+0x25/0x30 [xfs]
Modules linked in: deadline_iosched dm_snapshot dm_bufio ext4 mbcache jbd2 dm_flakey xfs libcrc32c dax_pmem device_dax nd_pmem sch_fq_codel af_packet [last unloaded: scsi_debug]
CPU: 1 PID: 24707 Comm: xfs_io Not tainted 4.18.0-rc1-djw #1
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.10.2-1ubuntu1 04/01/2014
RIP: 0010:assfail+0x25/0x30 [xfs]
Code: ff 0f 0b c3 90 66 66 66 66 90 48 89 f1 41 89 d0 48 c7 c6 e8 ef 1b a0 48 89 fa 31 ff e8 54 f9 ff ff 80 3d fd ba 0f 00 00 75 03 <0f> 0b c3 0f 0b 66 0f 1f 44 00 00 66 66 66 66 90 48 63 f6 49 89 f9
RSP: 0018:ffffc90006423ad8 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff880030b65e80 RCX: 0000000000000000
RDX: 00000000ffffffc0 RSI: 000000000000000a RDI: ffffffffa01b0447
RBP: ffffc90006423c10 R08: 0000000000000000 R09: 0000000000000000
R10: ffff88003d43fc30 R11: f000000000000000 R12: ffff880077cda000
R13: 0000000000000000 R14: ffffc90006423c30 R15: ffffc90006423bf9
FS:  00007feba8986800(0000) GS:ffff88003ec00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000000000138ab58 CR3: 000000003d40a000 CR4: 00000000000006a0
Call Trace:
 xfs_reflink_allocate_cow+0x24c/0x3d0 [xfs]
 xfs_file_iomap_begin+0x6d2/0xeb0 [xfs]
 ? iomap_to_fiemap+0x80/0x80
 iomap_apply+0x5e/0x130
 iomap_dio_rw+0x2e0/0x400
 ? iomap_to_fiemap+0x80/0x80
 ? xfs_file_dio_aio_write+0x133/0x4a0 [xfs]
 xfs_file_dio_aio_write+0x133/0x4a0 [xfs]
 xfs_file_write_iter+0x7b/0xb0 [xfs]
 __vfs_write+0x16f/0x1f0
 vfs_write+0xc8/0x1c0
 ksys_pwrite64+0x74/0x90
 do_syscall_64+0x56/0x180
 entry_SYSCALL_64_after_hwframe+0x49/0xbe

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2018-06-24 11:56:36 -07:00
Darrick J. Wong
f62cb48e43 xfs: don't allow insert-range to shift extents past the maximum offset
Zorro Lang reports that generic/485 blows an assert on a filesystem with
512 byte blocks.  The test tries to fallocate a post-eof extent at the
maximum file size and calls insert range to shift the extents right by
two blocks.  On a 512b block filesystem this causes startoff to overflow
the 54-bit startoff field, leading to the assert.

Therefore, always check the rightmost extent to see if it would overflow
prior to invoking the insert range machinery.

Reported-by: zlang@redhat.com
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=200137
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2018-06-24 11:56:36 -07:00
Darrick J. Wong
aafe12cee0 xfs: don't trip over negative free space in xfs_reserve_blocks
If we somehow end up with a filesystem that has fewer free blocks than
the blocks set aside to avoid ENOSPC deadlocks, it's possible that the
free space calculation in xfs_reserve_blocks will spit out a negative
number (because percpu_counter_sum returns s64).  We fail to notice
this negative number and set fdblks_delta to it.  Now we increment
fdblocks(!) and the unsigned type of m_resblks means that we end up
setting a ridiculously huge m_resblks reservation.

Avoid this comedy of errors by detecting the negative free space and
returning -ENOSPC.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2018-06-24 11:56:36 -07:00
Darrick J. Wong
10ee25268e xfs: allow empty transactions while frozen
In commit e89c041338 ("xfs: implement the GETFSMAP ioctl") we
created the ability to obtain empty transactions.  These transactions
have no log or block reservations and therefore can't modify anything.
Since they're also NO_WRITECOUNT they can run while the fs is frozen,
so we don't need to WARN_ON about that usage.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2018-06-24 11:56:35 -07:00
Deepa Dinamani
6ff8473507 time: Change types to new y2038 safe __kernel_itimerspec
timer_set/gettime and timerfd_set/get apis use struct itimerspec at the
user interface layer.  struct itimerspec is not y2038-safe.  Change these
interfaces to use y2038-safe struct __kernel_itimerspec instead.  This will
help define new syscalls when 32bit architectures select CONFIG_64BIT_TIME.

Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: arnd@arndb.de
Cc: viro@zeniv.linux.org.uk
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-api@vger.kernel.org
Cc: y2038@lists.linaro.org
Link: https://lkml.kernel.org/r/20180617051144.29756-4-deepa.kernel@gmail.com
2018-06-24 14:39:47 +02:00
Al Viro
50f3074011 hostfs_lookup: switch to d_splice_alias()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-06-23 20:27:29 -04:00
Al Viro
63a67a926e kill dentry_update_name_case()
the last user is gone

Spotted-by: Richard Weinberger <richard@nod.at>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-06-23 17:16:44 -04:00
Filipe Manana
c5b4a50b74 Btrfs: fix return value on rename exchange failure
If we failed during a rename exchange operation after starting/joining a
transaction, we would end up replacing the return value, stored in the
local 'ret' variable, with the return value from btrfs_end_transaction().
So this could end up returning 0 (success) to user space despite the
operation having failed and aborted the transaction, because if there are
multiple tasks having a reference on the transaction at the time
btrfs_end_transaction() is called by the rename exchange, that function
returns 0 (otherwise it returns -EIO and not the original error value).
So fix this by not overwriting the return value on error after getting
a transaction handle.

Fixes: cdd1fedf82 ("btrfs: add support for RENAME_EXCHANGE and RENAME_WHITEOUT")
CC: stable@vger.kernel.org # 4.9+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-06-22 12:59:08 +02:00
Linus Torvalds
894b8c000a \n
-----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAlssqQcACgkQnJ2qBz9k
 QNl2XwgAhtb2LvBhWKh3OkBVczICgc9eQho03KiHJkEcR1OCkgjOJhVBDrQmJbBa
 tAuMmUcnTpLJG8CI5TyR0G4hNbttE2LuTu+6RV67hXOjhiYAQ5P9wYLyUqfZf/01
 myrPEewr9qqx3h2htqufiLQIyO1M4FeM37VqdH7vZhQOb+B+FUw7JB9a/HbCpNh/
 7NUDf2GbLtLjK+2Xh0ttXvfWjgbLjC4wMmaPaa5+Nabn+URtvX9aHgQ/dYrOjQ16
 7oD5K5x3k3sOT6Ix7xRGLHgE6Xl6MlTtxpyt5ldb96RwWO/GAuMhSCBd14nS2hmx
 fEx5N2vbs3v8Ux+ZdMauzAKZI6Fz2g==
 =oFo+
 -----END PGP SIGNATURE-----

Merge tag 'for_v4.18-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs

Pull udf, quota, ext2 fixes from Jan Kara:
 "UDF:
   - fix an oops due to corrupted disk image
   - two small cleanups

  quota:
   - a fixfor lru handling
   - cleanup

  ext2:
   - a warning about a deprecated mount option"

* tag 'for_v4.18-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  udf: Drop unused arguments of udf_delete_aext()
  udf: Provide function for calculating dir entry length
  udf: Detect incorrect directory size
  ext2: add warning when specifying nocheck option
  quota: Cleanup list iteration in dqcache_shrink_scan()
  quota: reclaim least recently used dquots
2018-06-22 18:04:56 +09:00
Dave Chinner
e53946dbd3 xfs: xfs_iflush_abort() can be called twice on cluster writeback failure
When a corrupt inode is detected during xfs_iflush_cluster, we can
get a shutdown ASSERT failure like this:

XFS (pmem1): Metadata corruption detected at xfs_symlink_shortform_verify+0x5c/0xa0, inode 0x86627 data fork
XFS (pmem1): Unmount and run xfs_repair
XFS (pmem1): xfs_do_force_shutdown(0x8) called from line 3372 of file fs/xfs/xfs_inode.c.  Return address = ffffffff814f4116
XFS (pmem1): Corruption of in-memory data detected.  Shutting down filesystem
XFS (pmem1): xfs_do_force_shutdown(0x1) called from line 222 of file fs/xfs/libxfs/xfs_defer.c.  Return address = ffffffff814a8a88
XFS (pmem1): xfs_do_force_shutdown(0x1) called from line 222 of file fs/xfs/libxfs/xfs_defer.c.  Return address = ffffffff814a8ef9
XFS (pmem1): Please umount the filesystem and rectify the problem(s)
XFS: Assertion failed: xfs_isiflocked(ip), file: fs/xfs/xfs_inode.h, line: 258
.....
Call Trace:
 xfs_iflush_abort+0x10a/0x110
 xfs_iflush+0xf3/0x390
 xfs_inode_item_push+0x126/0x1e0
 xfsaild+0x2c5/0x890
 kthread+0x11c/0x140
 ret_from_fork+0x24/0x30

Essentially, xfs_iflush_abort() has been called twice on the
original inode that that was flushed. This happens because the
inode has been flushed to teh buffer successfully via
xfs_iflush_int(), and so when another inode is detected as corrupt
in xfs_iflush_cluster, the buffer is marked stale and EIO, and
iodone callbacks are run on it.

Running the iodone callbacks walks across the original inode and
calls xfs_iflush_abort() on it. When xfs_iflush_cluster() returns
to xfs_iflush(), it runs the error path for that function, and that
calls xfs_iflush_abort() on the inode a second time, leading to the
above assert failure as the inode is not flush locked anymore.

This bug has been there a long time.

The simple fix would be to just avoid calling xfs_iflush_abort() in
xfs_iflush() if we've got a failure from xfs_iflush_cluster().
However, xfs_iflush_cluster() has magic delwri buffer handling that
means it may or may not have run IO completion on the buffer, and
hence sometimes we have to call xfs_iflush_abort() from
xfs_iflush(), and sometimes we shouldn't.

After reading through all the error paths and the delwri buffer
code, it's clear that the error handling in xfs_iflush_cluster() is
unnecessary. If the buffer is delwri, it leaves it on the delwri
list so that when the delwri list is submitted it sees a shutdown
fliesystem in xfs_buf_submit() and that marks the buffer stale, EIO
and runs IO completion. i.e. exactly what xfs+iflush_cluster() does
when it's not a delwri buffer. Further, marking a buffer stale
clears the _XBF_DELWRI_Q flag on the buffer, which means when
submission of the buffer occurs, it just skips over it and releases
it.

IOWs, the error handling in xfs_iflush_cluster doesn't need to care
if the buffer is already on a the delwri queue or not - it just
needs to mark the buffer stale, EIO and run completions. That means
we can just use the easy fix for xfs_iflush() to avoid the double
abort.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-06-21 23:31:38 -07:00
Dave Chinner
23fcb3340d xfs: More robust inode extent count validation
When the inode is in extent format, it can't have more extents that
fit in the inode fork. We don't currenty check this, and so this
corruption goes unnoticed by the inode verifiers. This can lead to
crashes operating on invalid in-memory structures.

Attempts to access such a inode will now error out in the verifier
rather than allowing modification operations to proceed.

Reported-by: Wen Xu <wen.xu@gatech.edu>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
[darrick: fix a typedef, add some braces and breaks to shut up compiler warnings]
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-06-21 23:25:57 -07:00
Christoph Hellwig
e2ac836307 xfs: simplify xfs_bmap_punch_delalloc_range
Instead of using xfs_bmapi_read to find delalloc extents and then punch
them out using xfs_bunmapi, opencode the loop to iterate over the extents
and call xfs_bmap_del_extent_delay directly.  This both simplifies the
code and reduces the number of extent tree lookups required.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-06-21 23:24:38 -07:00
Linus Torvalds
27db64f65f NFS client bugfixes for Linux 4.18
Hightlights include:
 
 Bugfixes:
 - Fix an rcu deadlock in nfs_delegation_find_inode()
 - Fix NFSv4 deadlocks due to not freeing the session slot in layoutget
 - Don't send layoutreturn if the layout is already invalid
 - Prevent duplicate XID allocation
 - flexfiles: Don't tie up all the rpciod threads in resends
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJbK9avAAoJEA4mA3inWBJclq0P/1VCigyDlsbtdby3z2leV84k
 l0asrGjOQndljJ7I21awAgEo8KvXOd66cMv6YT3+UqEW18aNblH4/ngjyId6hPVb
 RZDX7tsG16ZHEqfe9f9irNZo90mdvuSC4ChJ/CbbPesaK9pblE1d76b/qUVr4FUX
 Gj7JPAC5ckoiZXPFfRWfc+o7JnGvs5wkEuDTy+ig6v7BRdL64hdPG3veRNpmLIAZ
 uS/NCyRpO+nFN/ukmvuoI2ZQ3qfHubHBD+rHxr1UKT/ad7dywLmL2UBaYQ0Tl3bq
 /iSQHutgJYj/80VaRTqdlLt/m4ebUZg+9BEZgM5MvqBWkXcpXND51zxExVJN4cGW
 BOytqjLz0gP1OGb8w+Oow58K8l4XyEgHe2CtZ6Yz8Vwof7nchkpv7RSX50hJFIcA
 YlikeDyDzfOmTT6ove5kF31WQSa3Bk6OMEei0of6hWU3UVHyEdr9az73pm/CLSHE
 /R7w0osU3B9tmQD4btQeJ2DxP+syQwhelOYodyVTwOlkmmGg7DSV7fehnGyH8t8f
 I4Yp8f0raiYGbwonYVE2+zDO140VRETEfTE4XQZnn41fZUfB74oIqk77JtgvGMk2
 /+XFNCYBGadHdSBdxyJmhSjhoAWrhgChEIz1G12SiHrNvqIRY/uHhdCX1Ut5vlPf
 5aqyn/yXm6rUH7aNh/Gd
 =tz0M
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-4.18-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS client bugfixes from Trond Myklebust:
 "Hightlights include:

   - fix an rcu deadlock in nfs_delegation_find_inode()

   - fix NFSv4 deadlocks due to not freeing the session slot in
     layoutget

   - don't send layoutreturn if the layout is already invalid

   - prevent duplicate XID allocation

   - flexfiles: Don't tie up all the rpciod threads in resends"

* tag 'nfs-for-4.18-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
  pNFS/flexfiles: Process writeback resends from nfsiod context as well
  pNFS/flexfiles: Don't tie up all the rpciod threads in resends
  sunrpc: Prevent duplicate XID allocation
  pNFS: Don't send layoutreturn if the layout is already invalid
  pNFS: Always free the session slot on error in nfs4_layoutget_handle_exception
  NFS: Fix an rcu deadlock in nfs_delegation_find_inode()
2018-06-22 06:21:34 +09:00
Lu Fengqi
22883ddc66 btrfs: fix invalid-free in btrfs_extent_same
If this condition ((BTRFS_I(src)->flags & BTRFS_INODE_NODATASUM) !=
		   (BTRFS_I(dst)->flags & BTRFS_INODE_NODATASUM))
is hit, we will go to free the uninitialized cmp.src_pages and
cmp.dst_pages.

Fixes: 67b07bd4be ("Btrfs: reuse cmp workspace in EXTENT_SAME ioctl")
Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-06-21 19:21:13 +02:00
Filipe Manana
f098631848 Btrfs: fix physical offset reported by fiemap for inline extents
Commit 9d311e11fc ("Btrfs: fiemap: pass correct bytenr when
fm_extent_count is zero") introduced a regression where we no longer
report 0 as the physical offset for inline extents (and other extents
with a special block_start value). This is because it always sets the
variable used to report the physical offset ("disko") as em->block_start
plus some offset, and em->block_start has the value 18446744073709551614
((u64) -2) for inline extents.

This made the btrfs test 004 (from fstests) often fail, for example, for
a file with an inline extent we have the following items in the subvolume
tree:

    item 101 key (418 INODE_ITEM 0) itemoff 11029 itemsize 160
           generation 25 transid 38 size 1525 nbytes 1525
           block group 0 mode 100666 links 1 uid 0 gid 0 rdev 0
           sequence 0 flags 0x2(none)
           atime 1529342058.461891730 (2018-06-18 18:14:18)
           ctime 1529342058.461891730 (2018-06-18 18:14:18)
           mtime 1529342058.461891730 (2018-06-18 18:14:18)
           otime 1529342055.869892885 (2018-06-18 18:14:15)
    item 102 key (418 INODE_REF 264) itemoff 11016 itemsize 13
           index 25 namelen 3 name: fc7
    item 103 key (418 EXTENT_DATA 0) itemoff 9470 itemsize 1546
           generation 38 type 0 (inline)
           inline extent data size 1525 ram_bytes 1525 compression 0 (none)

Then when test 004 invoked fiemap against the file it got a non-zero
physical offset:

 $ filefrag -v /mnt/p0/d4/d7/fc7
 Filesystem type is: 9123683e
 File size of /mnt/p0/d4/d7/fc7 is 1525 (1 block of 4096 bytes)
  ext:     logical_offset:        physical_offset: length:   expected: flags:
    0:        0..    4095: 18446744073709551614..      4093:   4096:             last,not_aligned,inline,eof
 /mnt/p0/d4/d7/fc7: 1 extent found

This resulted in the test failing like this:

btrfs/004 49s ... [failed, exit status 1]- output mismatch (see /home/fdmanana/git/hub/xfstests/results//btrfs/004.out.bad)
    --- tests/btrfs/004.out	2016-08-23 10:17:35.027012095 +0100
    +++ /home/fdmanana/git/hub/xfstests/results//btrfs/004.out.bad	2018-06-18 18:15:02.385872155 +0100
    @@ -1,3 +1,10 @@
     QA output created by 004
     *** test backref walking
    -*** done
    +./tests/btrfs/004: line 227: [: 7.55578637259143e+22: integer expression expected
    +ERROR: 7.55578637259143e+22 is not a valid numeric value.
    +unexpected output from
    +	/home/fdmanana/git/hub/btrfs-progs/btrfs inspect-internal logical-resolve -s 65536 -P 7.55578637259143e+22 /home/fdmanana/btrfs-tests/scratch_1
    ...
    (Run 'diff -u tests/btrfs/004.out /home/fdmanana/git/hub/xfstests/results//btrfs/004.out.bad'  to see the entire diff)
Ran: btrfs/004

The large number in scientific notation reported as an invalid numeric
value is the result from the filter passed to perl which multiplies the
physical offset by the block size reported by fiemap.

So fix this by ensuring the physical offset is always set to 0 when we
are processing an extent with a special block_start value.

Fixes: 9d311e11fc ("Btrfs: fiemap: pass correct bytenr when fm_extent_count is zero")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-06-21 19:21:13 +02:00
Arnd Bergmann
ee9c7f9ae3 gfs2: call ktime_get_coarse_real_ts64() directly
current_kernel_time64() is now just a deprecated wrapper around
ktime_get_coarse_real_ts64(), so let's just call that directly.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2018-06-21 07:40:23 -05:00
Andreas Gruenbacher
00251a16d7 gfs2: Minor clarification to __gfs2_punch_hole
Rename end_off to end_len to make the code less confusing.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2018-06-21 07:40:00 -05:00
Andreas Gruenbacher
9e1a9ecd13 gfs2: Don't withdraw under a spin lock
In two places, the gfs2_io_error_bh macro is called while holding the
sd_ail_lock spin lock.  This isn't allowed because gfs2_io_error_bh
withdraws the filesystem, which can sleep because it issues a uevent.
To fix that, add a gfs2_io_error_bh_wd macro that does withdraw the
filesystem and change gfs2_io_error_bh to not withdraw the filesystem.
In those places where the new gfs2_io_error_bh is used, withdraw the
filesystem after releasing sd_ail_lock.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Reviewed-by: Andrew Price <anprice@redhat.com>
2018-06-21 07:39:44 -05:00
Bob Peterson
f85c10e24a gfs2: eliminate rs_inum and reduce the size of gfs2 inodes
Before this patch, block reservations kept track of the inode
number. At one point, that was a valid thing to do. However, since
we made the reservation a part of the inode (rather than a pointer
to a separate allocated object) the reservation can determine the
inode number by using container_of. This saves us a little memory
in our inode.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Acked-by: Steven Whitehouse <swhiteho@redhat.com>
Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com>
2018-06-21 07:39:31 -05:00
Mark Rutland
bfc18e389c atomics/treewide: Rename __atomic_add_unless() => atomic_fetch_add_unless()
While __atomic_add_unless() was originally intended as a building-block
for atomic_add_unless(), it's now used in a number of places around the
kernel. It's the only common atomic operation named __atomic*(), rather
than atomic_*(), and for consistency it would be better named
atomic_fetch_add_unless().

This lack of consistency is slightly confusing, and gets in the way of
scripting atomics. Given that, let's clean things up and promote it to
an official part of the atomics API, in the form of
atomic_fetch_add_unless().

This patch converts definitions and invocations over to the new name,
including the instrumented version, using the following script:

  ----
  git grep -w __atomic_add_unless | while read line; do
  sed -i '{s/\<__atomic_add_unless\>/atomic_fetch_add_unless/}' "${line%%:*}";
  done
  git grep -w __arch_atomic_add_unless | while read line; do
  sed -i '{s/\<__arch_atomic_add_unless\>/arch_atomic_fetch_add_unless/}' "${line%%:*}";
  done
  ----

Note that we do not have atomic{64,_long}_fetch_add_unless(), which will
be introduced by later patches.

There should be no functional change as a result of this patch.

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Will Deacon <will.deacon@arm.com>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Palmer Dabbelt <palmer@sifive.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/lkml/20180621121321.4761-2-mark.rutland@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-06-21 14:22:32 +02:00
Christoph Hellwig
c03cea4214 iomap: add initial support for writes without buffer heads
For now just limited to blocksize == PAGE_SIZE, where we can simply read
in the full page in write begin, and just set the whole page dirty after
copying data into it.  This code is enabled by default and XFS will now
be feed pages without buffer heads in ->writepage and ->writepages.

If a file system sets the IOMAP_F_BUFFER_HEAD flag on the iomap the old
path will still be used, this both helps the transition in XFS and
prepares for the gfs2 migration to the iomap infrastructure.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-06-20 09:32:41 -07:00
Jan Kara
6c1e4d06a3 udf: Drop unused arguments of udf_delete_aext()
udf_delete_aext() uses its last two arguments only as local variables.
Drop them.

Signed-off-by: Jan Kara <jack@suse.cz>
2018-06-20 11:05:49 +02:00
Jan Kara
f2e8334711 udf: Provide function for calculating dir entry length
Provide function for calculating directory entry length and use to
reduce code duplication.

Signed-off-by: Jan Kara <jack@suse.cz>
2018-06-20 11:05:49 +02:00
Jan Kara
fa65653e57 udf: Detect incorrect directory size
Detect when a directory entry is (possibly partially) beyond directory
size and return EIO in that case since it means the filesystem is
corrupted. Otherwise directory operations can further corrupt the
directory and possibly also oops the kernel.

CC: Anatoly Trosinenko <anatoly.trosinenko@gmail.com>
CC: stable@vger.kernel.org
Reported-and-tested-by: Anatoly Trosinenko <anatoly.trosinenko@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2018-06-20 11:05:31 +02:00
Chengguang Xu
27e6ed54a3 ext2: add warning when specifying nocheck option
The option nocheck(nocheck/check=none) is useless but considering
backwards compatibility it's better to print warning for a while
before completely remove from the code.

This patch add proper warning message for option 'nocheck' and
remove unnecessary comment/function declaration which is used for
removed option 'check'.

Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2018-06-20 11:04:26 +02:00
Jan Kara
1822193b5d quota: Cleanup list iteration in dqcache_shrink_scan()
Use list_first_entry() and list_empty() instead of opencoded variants.

Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2018-06-20 11:04:26 +02:00
Greg Thelen
9560ba306d quota: reclaim least recently used dquots
The dquots in the free_dquots list are not reclaimed in LRU way.
put_dquot_last() puts entries to the tail and dqcache_shrink_scan()
frees from the tail. Free unreferenced dquots in LRU order because it
seems more reasonable than freeing most recently used.

Signed-off-by: Greg Thelen <gthelen@google.com>
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2018-06-20 11:04:26 +02:00
Linus Torvalds
f5b65348fd proc: fix missing final NUL in get_mm_cmdline() rewrite
The rewrite of the cmdline fetching missed the fact that we used to also
return the final terminating NUL character of the last argument.  I
hadn't noticed, and none of the tools I tested cared, but something
obviously must care, because Michal Kubecek noticed the change in
behavior.

Tweak the "find the end" logic to actually include the NUL character,
and once past the eend of argv, always start the strnlen() at the
expected (original) argument end.

This whole "allow people to rewrite their arguments in place" is a nasty
hack and requires that odd slop handling at the end of the argv array,
but it's our traditional model, so we continue to support it.

Repored-and-bisected-by: Michal Kubecek <mkubecek@suse.cz>
Reviewed-and-tested-by: Michal Kubecek <mkubecek@suse.cz>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-06-20 15:38:28 +09:00
Christoph Hellwig
72b4daa241 iomap: add an iomap-based readpage and readpages implementation
Simply use iomap_apply to iterate over the file and a submit a bio for
each non-uptodate but mapped region and zero everything else.  Note that
as-is this can not be used for file systems with a blocksize smaller than
the page size, but that support will be added later.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-06-19 15:10:57 -07:00
Christoph Hellwig
63899c6f88 iomap: add a page_done callback
This will be used by gfs2 to attach data to transactions for the journaled
data mode.  But the concept is generic enough that we might be able to
use it for other purposes like encryption/integrity post-processing in the
future.

Based on a patch from Andreas Gruenbacher.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-06-19 15:10:56 -07:00
Andreas Gruenbacher
19e0c58f65 iomap: generic inline data handling
Add generic inline data handling by adding a pointer to the inline data
region to struct iomap.  When handling a buffered IOMAP_INLINE write,
iomap_write_begin will copy the current inline data from the inline data
region into the page cache, and iomap_write_end will copy the changes in
the page cache back to the inline data region.

This doesn't cover inline data reads and direct I/O yet because so far,
we have no users.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
[hch: small cleanups to better fit in with other iomap work]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-06-19 15:10:56 -07:00
Andreas Gruenbacher
ebf00be37d iomap: complete partial direct I/O writes synchronously
According to xfstest generic/240, applications seem to expect direct I/O
writes to either complete as a whole or to fail; short direct I/O writes
are apparently not appreciated.  This means that when only part of an
asynchronous direct I/O write succeeds, we can either fail the entire
write, or we can wait for the partial write to complete and retry the
remaining write as buffered I/O.  The old __blockdev_direct_IO helper
has code for waiting for partial writes to complete; the new
iomap_dio_rw iomap helper does not.

The above mentioned fallback mode is needed for gfs2, which doesn't
allow block allocations under direct I/O to avoid taking cluster-wide
exclusive locks.  As a consequence, an asynchronous direct I/O write to
a file range that contains a hole will result in a short write.  In that
case, wait for the short write to complete to allow gfs2 to recover.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-06-19 15:10:55 -07:00
Andreas Gruenbacher
3d7b6b21f6 iomap: mark newly allocated buffer heads as new
In iomap_to_bh, not only mark buffer heads in IOMAP_UNWRITTEN maps as
new, but also buffer heads in IOMAP_MAPPED maps with the IOMAP_F_NEW
flag set.  This will be used by filesystems like gfs2, which allocate
blocks in iomap->begin.

Minor corrections to the comment for IOMAP_UNWRITTEN maps.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-06-19 15:10:55 -07:00
Christoph Hellwig
a6d639da63 fs: factor out a __generic_write_end helper
Bits of the buffer.c based write_end implementations that don't know
about buffer_heads and can be reused by other implementations.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-06-19 15:10:55 -07:00
Arnd Bergmann
bd646104ac jfs: use time64_t for otime
The file creation time in the inode uses time_t which is defined
differently on 32-bit and 64-bit architectures and deprecated. The
representation in the inode uses an unsigned 32-bit number, but this
gets wrapped around after year 2038 when assigned to a time_t.

This changes the type to time64_t, so we can support the full range of
timestamps between 1970 and 2106 on 32-bit systems like we do on 64-bit
systems already, and matching what we do for the atime/ctime/mtime stamps
since the introduction of 64-bit timestamps in VFS.

Note: the otime stamp is not actually used anywhere at the moment in
the kernel, it is just set when writing a file, so none of this really
makes a difference unless we implement setting the btime field in the
getattr() callback.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
2018-06-19 14:09:30 -05:00
Benjamin Coddington
8163496e78 nfsd: don't advertise a SCSI layout for an unsupported request_queue
Commit 30181faae3 ("nfsd: Check queue type before submitting a SCSI
request") did the work of ensuring that we don't send SCSI requests to a
request queue that won't support them, but that check is in the
GETDEVICEINFO path.  Let's not set the SCSI layout in fs_layout_type in the
first place, and then we'll have less clients sending GETDEVICEINFO for
non-SCSI request queues and less unnecessary WARN_ONs.

While we're in here, remove some outdated comments that refer to
"overwriting" layout seletion because commit 8a4c392688 ("nfsd: allow
nfsd to advertise multiple layout types") changed things to no longer
overwrite the layout type.

Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-06-19 10:12:31 -04:00
Trond Myklebust
7b0df92ac1 pNFS/flexfiles: Process writeback resends from nfsiod context as well
Although the writeback resends are more robust than the reads, since they
are not immediately rescheduled by the same thread, we are better off
processing them in the same place as the reads.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-06-19 09:25:27 -04:00
Trond Myklebust
42f86b44a4 pNFS/flexfiles: Don't tie up all the rpciod threads in resends
We do not want to have rpciod threads perform recursive calls into the
RPC layer since that can deadlock. In particular, having to wait for
a layoutget can be nasty... We want rather to defer scheduling those
retries until we're in the rpc_release() callback, since that is
called from the nfsiod workqueue.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-06-19 09:25:27 -04:00
Trond Myklebust
c8bf707353 pNFS: Don't send layoutreturn if the layout is already invalid
If the layout was invalidated due to a reboot, then don't try to send
a layoutreturn for it.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-06-19 08:52:27 -04:00
Trond Myklebust
2dbf8dffbf pNFS: Always free the session slot on error in nfs4_layoutget_handle_exception
Right now, we can call nfs_commit_inode() while holding the session slot,
which could lead to NFSv4 deadlocks. Ensure we only keep the slot if
the server returned a layout that we have to process.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-06-19 08:52:27 -04:00
Bart Van Assche
707c623529 configfs: use kvasprintf() instead of open-coding it
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-06-19 07:08:12 +02:00
Linus Torvalds
ba4dbdedd3 This fixes a too-small allocation in the xattr code.
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEIodevzQLVs53l6BhNqiEXrVAjGQFAlsnrccACgkQNqiEXrVA
 jGQ8rBAAjpdJS9Rzi92wn+M204/E578q/vvech5Xok4+sW2JNNSMvlOsF6siNsHd
 gia8Z0AZcoQdN5wuzjqIoY2fFBFZanZYQbmObxJ69YE0b8CBY4CQlgIB1P/WLW/J
 dpqBvlyXr/clukCehM2AoWTGbSN+rnqGrZ0ro9cYBN7MmL0fmGb3+3riGJAipS9G
 qyuhafQnenmMQUUruT+8kuWy72gxK9/rvHPlLO9O/2AZ3q8HMByTE+hKhGFazBvv
 Wlp2NomQisR4PenH4u1lXpb5QjCK1WdvUJ0bh1q8f7qyYB0x5EgS/aZr+T0FX6fS
 1Aio7T0SzBzV5tKtX1LHwyz936aIR6Mtgt7IGiQ453Va7JyFQCVBIkK4lojBZF2/
 pvlGi5/dK/9yJ3+uls88dnx/++wZzwdhB80fRRniOmoygCnAlgVurbyzYByKFS+C
 iA7M0Bha5nHOmAKb9vg4kfH+ini5b+45jXjgR6syCoa0VPXIlYXNkbSjJSS+S/gM
 4VNWkvkg0TgBRkGpigpCaIn7otMNaJ5xILSoKwup+Ocg9ITvF2cWn8zdlVrfz28h
 MEF1LmFAs2e1PHjm69KW326M3Bs8Wkrd8lGOc7vR0Dst7VlxwrddQWMJrx2j7Yau
 +r9E3kRM4PvGMsAX4DF1mcnEIGsPkrN6q9XRL9ufNT92d9ww6Vg=
 =fEwn
 -----END PGP SIGNATURE-----

Merge tag 'jfs-4.18' of git://github.com/kleikamp/linux-shaggy

Pull jfs fix from Dave Kleikamp:
 "This fixes a too-small allocation in the xattr code"

* tag 'jfs-4.18' of git://github.com/kleikamp/linux-shaggy:
  jfs: Fix inconsistency between memory allocation and ea_buf->max_size
2018-06-19 07:47:32 +09:00
Steve Wise
33023fb85a IB/core: add max_send_sge and max_recv_sge attributes
This patch replaces the ib_device_attr.max_sge with max_send_sge and
max_recv_sge. It allows ulps to take advantage of devices that have very
different send and recv sge depths.  For example cxgb4 has a max_recv_sge
of 4, yet a max_send_sge of 16.  Splitting out these attributes allows
much more efficient use of the SQ for cxgb4 with ulps that use the RDMA_RW
API. Consider a large RDMA WRITE that has 16 scattergather entries.
With max_sge of 4, the ulp would send 4 WRITE WRs, but with max_sge of
16, it can be done with 1 WRITE WR.

Acked-by: Sagi Grimberg <sagi@grimberg.me>
Acked-by: Christoph Hellwig <hch@lst.de>
Acked-by: Selvin Xavier <selvin.xavier@broadcom.com>
Acked-by: Shiraz Saleem <shiraz.saleem@intel.com>
Acked-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-06-18 13:17:28 -06:00
Linus Torvalds
9ffc59d572 Misc. SMB3 fixes, including particularly important ones for signing, some minor documentation and debug improvements and another posix smb3.11 fix
-----BEGIN PGP SIGNATURE-----
 
 iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAlsmhmUACgkQiiy9cAdy
 T1FgTQwAqwTrVg6cf8EDqvWqk4MJUmGt8dvG0f8ikwa/oou8FspdQNSLnRpM5id5
 xH/VbAULkCzNYqC8KO5d3SRAB/ZidJOBqzNPvpOdNiSvu+VfrC2kA9+NePXs41lf
 Ib+shOwGH2q2HKf8seLA2ivFZDBZLuAsKkYxQN4w4PPKEC8k3WZRYqEFw4OL1FDB
 v02+X3H5QDOFpVEIA69meWx8ezLnLtVI5PHWCj58/wbXOZpU3XO6klxl2XpccGQa
 MPBxOt3Ln1HhzEMPjcaB0TXhig8V0pOAhI6vsCasT6yN8orev5c1z5U2tz32Rgq0
 U6qrTG1mUSD1Jl45kl0CDj2clzkA60XbG2nFkczQ0twoDvwK41xbn4HL+DcrzDWY
 FzRbMZ2wNb7UPUTUxBFh4DrBPdH97deyUMIn1wEYwin2MNveIE5qtS8nxSgJc8zG
 3Tzed1SWGV/YEB794vMlFIRb2DZXWnzvjziKQy8aVHgmCcgWusl+75yKfvfJr8xx
 +D5LdNmu
 =KTNX
 -----END PGP SIGNATURE-----

Merge tag '4.18-rc1-more-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6

Pull cifs fixes from Steve French:
 "Misc SMB3 fixes, including particularly important ones for signing,
  some minor documentation and debug improvements and another posix
  smb3.11 fix"

* tag '4.18-rc1-more-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6:
  cifs: Fix invalid check in __cifs_calc_signature()
  cifs: Use correct packet length in SMB2_TRANSFORM header
  smb3: fix corrupt path in subdirs on smb311 with posix
  smb3: do not display empty interface list
  smb3: Fix mode on mkdir on smb311 mounts
  cifs: Fix kernel oops when traceSMB is enabled
  CIFS: dump every session iface info
  CIFS: parse and store info on iface queries
  CIFS: add iface info to struct cifs_ses
  CIFS: complete PDU definitions for interface queries
  CIFS: move default port definitions to cifsglob.h
  cifs: Fix encryption/signing
  cifs: update __smb_send_rqst() to take an array of requests
  cifs: remove smb2_send_recv()
  cifs: push rfc1002 generation down the stack
  smb3: increase initial number of credits requested to allow write
  cifs: minor documentation updates
  cifs: add lease tracking to the cached root fid
  smb3: note that smb3.11 posix extensions mount option is experimental
2018-06-18 14:28:19 +09:00
Theodore Ts'o
bfe0a5f47a ext4: add more mount time checks of the superblock
The kernel's ext4 mount-time checks were more permissive than
e2fsprogs's libext2fs checks when opening a file system.  The
superblock is considered too insane for debugfs or e2fsck to operate
on it, the kernel has no business trying to mount it.

This will make file system fuzzing tools work harder, but the failure
cases that they find will be more useful and be easier to evaluate.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
2018-06-17 18:11:20 -04:00
J. Bruce Fields
5b7b15aee6 nfsd: fix corrupted reply to badly ordered compound
We're encoding a single op in the reply but leaving the number of ops
zero, so the reply makes no sense.

Somewhat academic as this isn't a case any real client will hit, though
in theory perhaps that could change in a future protocol extension.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-06-17 10:43:07 -04:00
J. Bruce Fields
7a04cfda7d nfsd: clarify check_op_ordering
Document a couple things that confused me on a recent reading.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-06-17 10:43:02 -04:00
J. Bruce Fields
4a269efb6f nfsd: update obselete comment referencing the BKL
It's inode->i_lock that's now taken in setlease and break_lease, instead
of the big kernel lock.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-06-17 10:42:56 -04:00
J. Bruce Fields
ca0552f464 nfsd4: cleanup sessionid in nfsd4_destroy_session
The name of this variable doesn't fit the type.  And we only ever use
one field of it.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-06-17 10:42:51 -04:00
J. Bruce Fields
665d507276 nfsd4: less confusing nfsd4_compound_in_session
Make the function prototype match the name a little better.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-06-17 10:42:27 -04:00
J. Bruce Fields
a85857633b nfsd4: support change_attr_type attribute
The change attribute is what is used by clients to revalidate their
caches.  Our server may use i_version or ctime for that purpose.  Those
choices behave slightly differently, and it may be useful to the client
to know which we're using.  This attribute tells the client that.  The
Linux client doesn't yet use this attribute yet, though.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-06-17 10:41:31 -04:00
J. Bruce Fields
16945141c3 nfsd: fix NFSv4 time_delta attribute
Currently we return the worst-case value of 1 second in the time delta
attribute.  That's not terribly useful.  Instead, return a value
calculated from the time granularity supported by the filesystem and the
system clock.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-06-17 10:41:11 -04:00
J. Bruce Fields
d6ebf5088f nfsd4: return default lease period
I don't have a good rationale for the lease period, but 90 seconds seems
long, and as long as we're allowing the server to extend the grace
period up to double the lease period, let's half the default to 45.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-06-17 10:20:47 -04:00
J. Bruce Fields
03f318ca65 nfsd4: extend reclaim period for reclaiming clients
If the client is only renewing state a little sooner than once a lease
period, then it might not discover the server has restarted till close
to the end of the grace period, and might run out of time to do the
actual reclaim.

Extend the grace period by a second each time we notice there are
clients still trying to reclaim, up to a limit of another whole lease
period.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-06-17 10:20:47 -04:00
Theodore Ts'o
c37e9e0134 ext4: add more inode number paranoia checks
If there is a directory entry pointing to a system inode (such as a
journal inode), complain and declare the file system to be corrupted.

Also, if the superblock's first inode number field is too small,
refuse to mount the file system.

This addresses CVE-2018-10882.

https://bugzilla.kernel.org/show_bug.cgi?id=200069

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
2018-06-17 00:41:14 -04:00
Theodore Ts'o
8bc1379b82 ext4: avoid running out of journal credits when appending to an inline file
Use a separate journal transaction if it turns out that we need to
convert an inline file to use an data block.  Otherwise we could end
up failing due to not having journal credits.

This addresses CVE-2018-10883.

https://bugzilla.kernel.org/show_bug.cgi?id=200071

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
2018-06-16 23:41:59 -04:00
Theodore Ts'o
e09463f220 jbd2: don't mark block as modified if the handle is out of credits
Do not set the b_modified flag in block's journal head should not
until after we're sure that jbd2_journal_dirty_metadat() will not
abort with an error due to there not being enough space reserved in
the jbd2 handle.

Otherwise, future attempts to modify the buffer may lead a large
number of spurious errors and warnings.

This addresses CVE-2018-10883.

https://bugzilla.kernel.org/show_bug.cgi?id=200071

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
2018-06-16 20:21:45 -04:00
Linus Torvalds
5e7b9212a4 Solve a series of broken links for files under Documentation:
- can.rst: fix a footnote reference;
 - crypto_engine.rst: Fix two parsing warnings;
 - Fix a lot of broken references to Documentation/*;
 - Improves the scripts/documentation-file-ref-check script,
   in order to help detecting/fixing broken references,
   preventing false-positives.
 
 After this patch series, only 33 broken references to doc files are
 detected by scripts/documentation-file-ref-check.
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJbJC2aAAoJEAhfPr2O5OEVPmMP/2rN5m9LZ048oRWlg4hCwo73
 4FpWqDg18hbWCMHXYHIN1UACIMUkIUfgLhF7WE3D/XqRMuxHoiE5u7DUdak7+VNt
 wunpksKFJbgyfFMHRvykHcZV+jQFVbM7eFvXVPIvoSaAeGH6zx4imHTyeDn3x/nL
 gdtBqM4bvEhmBjotBTRR4PB8+oPrT/HIT5npHepx3UnFFFAzDQGEZ/I67/el2G5C
 pVmYdBXvr7iqrvUs6FilHLTEfe1quCI4UaKNfLHKrxXrTkiJQFOwugYuobZfNmxT
 GwjWzfpNy9HMlKJFYipcByALxel1Mnpqz5mIxFQaCTygBuEsORCWzW5MoKIsIUJ0
 KOoG76v0rUyMvLBRvaoao3CHYHdzxhQbtVV9DjyDuDksa2G5IoCAF1t6DyIOitRw
 9plMnGckk+FJ/MXJKYWXHszFS8NhI0SF2zHe3s1DmRTD8P6oxkxvxBFz6iqqADmL
 W6XHd8CcqJItaS9ctPen91TFuysN1HFpdzLLY+xwWmmKOcWC/jFjhTm8pj7xLQHM
 5yuuEcefsajf+Xk4w2fSQmRfXnuq+oOlPuWpwSvEy+59cHGI0ms18P1nHy/yt3II
 CJywwdx6fjwDon57RFKH7kkGd7px317zMqWdIv9gUj/qZAy9gcdLdvEQLhx9u0aV
 4F+hLKFDFEpf58xqRT1R
 =/ozx
 -----END PGP SIGNATURE-----

Merge tag 'docs-broken-links' of git://linuxtv.org/mchehab/experimental

Pull documentation fixes from Mauro Carvalho Chehab:
 "This solves a series of broken links for files under Documentation,
  and improves a script meant to detect such broken links (see
  scripts/documentation-file-ref-check).

  The changes on this series are:

   - can.rst: fix a footnote reference;

   - crypto_engine.rst: Fix two parsing warnings;

   - Fix a lot of broken references to Documentation/*;

   - improve the scripts/documentation-file-ref-check script, in order
     to help detecting/fixing broken references, preventing
     false-positives.

  After this patch series, only 33 broken references to doc files are
  detected by scripts/documentation-file-ref-check"

* tag 'docs-broken-links' of git://linuxtv.org/mchehab/experimental: (26 commits)
  fix a series of Documentation/ broken file name references
  Documentation: rstFlatTable.py: fix a broken reference
  ABI: sysfs-devices-system-cpu: remove a broken reference
  devicetree: fix a series of wrong file references
  devicetree: fix name of pinctrl-bindings.txt
  devicetree: fix some bindings file names
  MAINTAINERS: fix location of DT npcm files
  MAINTAINERS: fix location of some display DT bindings
  kernel-parameters.txt: fix pointers to sound parameters
  bindings: nvmem/zii: Fix location of nvmem.txt
  docs: Fix more broken references
  scripts/documentation-file-ref-check: check tools/*/Documentation
  scripts/documentation-file-ref-check: get rid of false-positives
  scripts/documentation-file-ref-check: hint: dash or underline
  scripts/documentation-file-ref-check: add a fix logic for DT
  scripts/documentation-file-ref-check: accept more wildcards at filenames
  scripts/documentation-file-ref-check: fix help message
  media: max2175: fix location of driver's companion documentation
  media: v4l: fix broken video4linux docs locations
  media: dvb: point to the location of the old README.dvb-usb file
  ...
2018-06-17 05:25:18 +09:00
Linus Torvalds
dbb2816fc7 \n
-----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAlsielwACgkQnJ2qBz9k
 QNlN0Af/Q82iP3EqrT3w+CT7w0gER2su+Df2riDpo0/XYRQLxuyW+kYtLsQwovvB
 Q7Tt+WTSO5OIqoJxwGMmd6VO5ICblhP+uHVC6+JlWy17DgccjwFBE/sUopxPqJaK
 9utwXZhqqOEoikNpDABcptNnWVILRl0yppkQrVV/pKkyZFp2F8vO4roUHFFYkJJt
 /uXJfLDQx6pBLTwqfQBFyiz0dCSsvCHUVnlw7Hu5JfE6xPtkMlk6F/M0Y0rvyEOg
 8KmH5jUX/BXKIijg+ycOzS3CCdvm0UhrtiH5YWy4qGaI8eczT31Epfl08Sk8pvkv
 n2rnxNnJP5sjPPNQhXvHJqy9qRCB6g==
 =bLjN
 -----END PGP SIGNATURE-----

Merge tag 'fsnotify_for_v4.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs

Pull fsnotify updates from Jan Kara:
 "fsnotify cleanups unifying handling of different watch types.

  This is the shortened fsnotify series from Amir with the last five
  patches pulled out. Amir has modified those patches to not change
  struct inode but obviously it's too late for those to go into this
  merge window"

* tag 'fsnotify_for_v4.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  fsnotify: add fsnotify_add_inode_mark() wrappers
  fanotify: generalize fanotify_should_send_event()
  fsnotify: generalize send_to_group()
  fsnotify: generalize iteration of marks by object type
  fsnotify: introduce marks iteration helpers
  fsnotify: remove redundant arguments to handle_event()
  fsnotify: use type id to identify connector object type
2018-06-17 05:06:18 +09:00
Theodore Ts'o
8cdb5240ec ext4: never move the system.data xattr out of the inode body
When expanding the extra isize space, we must never move the
system.data xattr out of the inode body.  For performance reasons, it
doesn't make any sense, and the inline data implementation assumes
that system.data xattr is never in the external xattr block.

This addresses CVE-2018-10880

https://bugzilla.kernel.org/show_bug.cgi?id=200005

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
2018-06-16 15:40:48 -04:00
Linus Torvalds
35773c9381 Merge branch 'afs-proc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull AFS updates from Al Viro:
 "Assorted AFS stuff - ended up in vfs.git since most of that consists
  of David's AFS-related followups to Christoph's procfs series"

* 'afs-proc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  afs: Optimise callback breaking by not repeating volume lookup
  afs: Display manually added cells in dynamic root mount
  afs: Enable IPv6 DNS lookups
  afs: Show all of a server's addresses in /proc/fs/afs/servers
  afs: Handle CONFIG_PROC_FS=n
  proc: Make inline name size calculation automatic
  afs: Implement network namespacing
  afs: Mark afs_net::ws_cell as __rcu and set using rcu functions
  afs: Fix a Sparse warning in xdr_decode_AFSFetchStatus()
  proc: Add a way to make network proc files writable
  afs: Rearrange fs/afs/proc.c to remove remaining predeclarations.
  afs: Rearrange fs/afs/proc.c to move the show routines up
  afs: Rearrange fs/afs/proc.c by moving fops and open functions down
  afs: Move /proc management functions to the end of the file
2018-06-16 16:32:04 +09:00
Linus Torvalds
29d6849d88 Merge branch 'work.compat' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull compat updates from Al Viro:
 "Some biarch patches - getting rid of assorted (mis)uses of
  compat_alloc_user_space().

  Not much in that area this cycle..."

* 'work.compat' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  orangefs: simplify compat ioctl handling
  signalfd: lift sigmask copyin and size checks to callers of do_signalfd4()
  vmsplice(): lift importing iovec into vmsplice(2) and compat counterpart
2018-06-16 16:21:50 +09:00
Linus Torvalds
a5b729ea18 Merge branch 'work.aio' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull aio fixes from Al Viro:
 "Assorted AIO followups and fixes"

* 'work.aio' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  eventpoll: switch to ->poll_mask
  aio: only return events requested in poll_mask() for IOCB_CMD_POLL
  eventfd: only return events requested in poll_mask()
  aio: mark __aio_sigset::sigmask const
2018-06-16 16:11:40 +09:00
Paulo Alcantara
83ffdeadb4 cifs: Fix invalid check in __cifs_calc_signature()
The following check would never evaluate to true:
  > if (i == 0 && iov[0].iov_len <= 4)

Because 'i' always starts at 1.

This patch fixes it and also move the header checks outside the for loop
- which makes more sense.

Signed-off-by: Paulo Alcantara <palcantara@suse.de>
Signed-off-by: Steve French <stfrench@microsoft.com>
2018-06-15 19:17:40 -05:00
Paulo Alcantara
35e2cc1ba7 cifs: Use correct packet length in SMB2_TRANSFORM header
In smb3_init_transform_rq(), 'orig_len' was only counting the request
length, but forgot to count any data pages in the request.

Writing or creating files with the 'seal' mount option was broken.

In addition, do some code refactoring by exporting smb2_rqst_len() to
calculate the appropriate packet size and avoid duplicating the same
calculation all over the code.

The start of the io vector is either the rfc1002 length (4 bytes) or a
SMB2 header which is always > 4. Use this fact to check and skip the
rfc1002 length if requested.

Signed-off-by: Paulo Alcantara <palcantara@suse.de>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2018-06-15 19:17:40 -05:00
Mauro Carvalho Chehab
44348e8ac1 fix a series of Documentation/ broken file name references
As files move around, their previous links break. Fix the
references for them.

Acked-by: Andy Shevchenko <andy.shevchenko@gmail.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Acked-by: Jonathan Corbet <corbet@lwn.net>
2018-06-15 18:10:01 -03:00
Mauro Carvalho Chehab
34962fb807 docs: Fix more broken references
As we move stuff around, some doc references are broken. Fix some of
them via this script:
	./scripts/documentation-file-ref-check --fix

Manually checked that produced results are valid.

Acked-by: Matthias Brugger <matthias.bgg@gmail.com>
Acked-by: Takashi Iwai <tiwai@suse.de>
Acked-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Acked-by: Guenter Roeck <linux@roeck-us.net>
Acked-by: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Acked-by: Jonathan Corbet <corbet@lwn.net>
2018-06-15 18:11:26 -03:00
Theodore Ts'o
6e8ab72a81 ext4: clear i_data in ext4_inode_info when removing inline data
When converting from an inode from storing the data in-line to a data
block, ext4_destroy_inline_data_nolock() was only clearing the on-disk
copy of the i_blocks[] array.  It was not clearing copy of the
i_blocks[] in ext4_inode_info, in i_data[], which is the copy actually
used by ext4_map_blocks().

This didn't matter much if we are using extents, since the extents
header would be invalid and thus the extents could would re-initialize
the extents tree.  But if we are using indirect blocks, the previous
contents of the i_blocks array will be treated as block numbers, with
potentially catastrophic results to the file system integrity and/or
user data.

This gets worse if the file system is using a 1k block size and
s_first_data is zero, but even without this, the file system can get
quite badly corrupted.

This addresses CVE-2018-10881.

https://bugzilla.kernel.org/show_bug.cgi?id=200015

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
2018-06-15 12:28:16 -04:00
Theodore Ts'o
bdbd6ce01a ext4: include the illegal physical block in the bad map ext4_error msg
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
2018-06-15 12:27:16 -04:00
David Howells
47ea0f2ebf afs: Optimise callback breaking by not repeating volume lookup
At the moment, afs_break_callbacks calls afs_break_one_callback() for each
separate FID it was given, and the latter looks up the volume individually
for each one.

However, this is inefficient if two or more FIDs have the same vid as we
could reuse the volume.  This is complicated by cell aliasing whereby we
may have multiple cells sharing a volume and can therefore have multiple
callback interests for any particular volume ID.

At the moment afs_break_one_callback() scans the entire list of volumes
we're getting from a server and breaks the appropriate callback in every
matching volume, regardless of cell.  This scan is done for every FID.

Optimise callback breaking by the following means:

 (1) Sort the FID list by vid so that all FIDs belonging to the same volume
     are clumped together.

     This is done through the use of an indirection table as we cannot do
     an insertion sort on the afs_callback_break array as we decode FIDs
     into it as we subsequently also have to decode callback info into it
     that corresponds by array index only.

     We also don't really want to bubblesort afterwards if we can avoid it.

 (2) Sort the server->cb_interests array by vid so that all the matching
     volumes are grouped together.  This permits the scan to stop after
     finding a record that has a higher vid.

 (3) When breaking FIDs, we try to keep server->cb_break_lock as long as
     possible, caching the start point in the array for that volume group
     as long as possible.

     It might make sense to add another layer in that list and have a
     refcounted volume ID anchor that has the matching interests attached
     to it rather than being in the list.  This would allow the lock to be
     dropped without losing the cursor.

Signed-off-by: David Howells <dhowells@redhat.com>
2018-06-15 15:27:09 +01:00
David Howells
0da0b7fd73 afs: Display manually added cells in dynamic root mount
Alter the dynroot mount so that cells created by manipulation of
/proc/fs/afs/cells and /proc/fs/afs/rootcell and by specification of a root
cell as a module parameter will cause directories for those cells to be
created in the dynamic root superblock for the network namespace[*].

To this end:

 (1) Only one dynamic root superblock is now created per network namespace
     and this is shared between all attempts to mount it.  This makes it
     easier to find the superblock to modify.

 (2) When a dynamic root superblock is created, the list of cells is walked
     and directories created for each cell already defined.

 (3) When a new cell is added, if a dynamic root superblock exists, a
     directory is created for it.

 (4) When a cell is destroyed, the directory is removed.

 (5) These directories are created by calling lookup_one_len() on the root
     dir which automatically creates them if they don't exist.

[*] Inasmuch as network namespaces are currently supported here.

Signed-off-by: David Howells <dhowells@redhat.com>
2018-06-15 15:27:09 +01:00
David Howells
c88d5a7fff afs: Enable IPv6 DNS lookups
Remove the restriction on DNS lookup upcalls that prevents ipv6 addresses
from being looked up.

Signed-off-by: David Howells <dhowells@redhat.com>
2018-06-15 15:27:09 +01:00
Steve French
d819d298c7 smb3: fix corrupt path in subdirs on smb311 with posix
Signed-off-by: Steve French <stfrench@microsoft.com>
2018-06-15 02:38:08 -05:00
Steve French
115d5d288d smb3: do not display empty interface list
If server does not support listing interfaces then do not
display empty "Server interfaces" line to avoid confusing users.

Signed-off-by: Steve French <stfrench@microsoft.com>
CC: Aurelien Aptel <aaptel@suse.com>
2018-06-15 02:38:08 -05:00
Steve French
bea851b8ba smb3: Fix mode on mkdir on smb311 mounts
mkdir was not passing the mode on smb3.11 mounts with posix extensions

Signed-off-by: Steve French <stfrench@microsoft.com>
2018-06-15 02:38:08 -05:00
Paulo Alcantara
662bf5bc0a cifs: Fix kernel oops when traceSMB is enabled
When traceSMB is enabled through 'echo 1 > /proc/fs/cifs/traceSMB', after a
mount, the following oops is triggered:

[   27.137943] BUG: unable to handle kernel paging request at
ffff8800f80c268b
[   27.143396] PGD 2c6b067 P4D 2c6b067 PUD 0
[   27.145386] Oops: 0000 [#1] SMP PTI
[   27.146186] CPU: 2 PID: 2655 Comm: mount.cifs Not tainted 4.17.0+ #39
[   27.147174] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS
1.0.0-prebuilt.qemu-project.org 04/01/2014
[   27.148969] RIP: 0010:hex_dump_to_buffer+0x413/0x4b0
[   27.149738] Code: 48 8b 44 24 08 31 db 45 31 d2 48 89 6c 24 18 44 89
6c 24 24 48 c7 c1 78 b5 23 82 4c 89 64 24 10 44 89 d5 41 89 dc 4c 8d 58
02 <44> 0f b7 00 4d 89 dd eb 1f 83 c5 01 41 01 c4 41 39 ef 0f 84 48 fe
[   27.152396] RSP: 0018:ffffc9000058f8c0 EFLAGS: 00010246
[   27.153129] RAX: ffff8800f80c268b RBX: 0000000000000000 RCX:
ffffffff8223b578
[   27.153867] RDX: 0000000000000000 RSI: ffffffff81a55496 RDI:
0000000000000008
[   27.154612] RBP: 0000000000000000 R08: 0000000000000020 R09:
0000000000000083
[   27.155355] R10: 0000000000000000 R11: ffff8800f80c268d R12:
0000000000000000
[   27.156101] R13: 0000000000000002 R14: ffffc9000058f94d R15:
0000000000000008
[   27.156838] FS:  00007f1693a6b740(0000) GS:ffff88007fd00000(0000)
knlGS:0000000000000000
[   27.158354] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   27.159093] CR2: ffff8800f80c268b CR3: 00000000798fa001 CR4:
0000000000360ee0
[   27.159892] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
[   27.160661] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
0000000000000400
[   27.161464] Call Trace:
[   27.162123]  print_hex_dump+0xd3/0x160
[   27.162814] journal-offline (2658) used greatest stack depth: 13144
bytes left
[   27.162824]  ? __release_sock+0x60/0xd0
[   27.165344]  ? tcp_sendmsg+0x31/0x40
[   27.166177]  dump_smb+0x39/0x40
[   27.166972]  ? vsnprintf+0x236/0x490
[   27.167807]  __smb_send_rqst.constprop.12+0x103/0x430
[   27.168554]  ? apic_timer_interrupt+0xa/0x20
[   27.169306]  smb_send_rqst+0x48/0xc0
[   27.169984]  cifs_send_recv+0xda/0x420
[   27.170639]  SMB2_negotiate+0x23d/0xfa0
[   27.171301]  ? vsnprintf+0x236/0x490
[   27.171961]  ? smb2_negotiate+0x19/0x30
[   27.172586]  smb2_negotiate+0x19/0x30
[   27.173257]  cifs_negotiate_protocol+0x70/0xd0
[   27.173935]  ? kstrdup+0x43/0x60
[   27.174551]  cifs_get_smb_ses+0x295/0xbe0
[   27.175260]  ? lock_timer_base+0x67/0x80
[   27.175936]  ? __internal_add_timer+0x1a/0x50
[   27.176575]  ? add_timer+0x10f/0x230
[   27.177267]  cifs_mount+0x101/0x1190
[   27.177940]  ? cifs_smb3_do_mount+0x144/0x5c0
[   27.178575]  cifs_smb3_do_mount+0x144/0x5c0
[   27.179270]  mount_fs+0x35/0x150
[   27.179930]  vfs_kern_mount.part.28+0x54/0xf0
[   27.180567]  do_mount+0x5ad/0xc40
[   27.181234]  ? kmem_cache_alloc_trace+0xed/0x1a0
[   27.181916]  ksys_mount+0x80/0xd0
[   27.182535]  __x64_sys_mount+0x21/0x30
[   27.183220]  do_syscall_64+0x4e/0x100
[   27.183882]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[   27.184535] RIP: 0033:0x7f169339055a
[   27.185192] Code: 48 8b 0d 41 d9 2b 00 f7 d8 64 89 01 48 83 c8 ff c3
66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 49 89 ca b8 a5 00 00 00 0f
05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 0e d9 2b 00 f7 d8 64 89 01 48
[   27.187268] RSP: 002b:00007fff7b44eb58 EFLAGS: 00000202 ORIG_RAX:
00000000000000a5
[   27.188515] RAX: ffffffffffffffda RBX: 00007f1693a7e70e RCX:
00007f169339055a
[   27.189244] RDX: 000055b9f97f64e5 RSI: 000055b9f97f652c RDI:
00007fff7b45074f
[   27.189974] RBP: 000055b9fb8c9260 R08: 000055b9fb8ca8f0 R09:
0000000000000000
[   27.190721] R10: 0000000000000000 R11: 0000000000000202 R12:
000055b9fb8ca8f0
[   27.191429] R13: 0000000000000000 R14: 00007f1693a7c000 R15:
00007f1693a7e91d
[   27.192167] Modules linked in:
[   27.192797] CR2: ffff8800f80c268b
[   27.193435] ---[ end trace 67404c618badf323 ]---

The problem was that dump_smb() had been called with an invalid pointer,
that is, in __smb_send_rqst(), iov[1] doesn't exist (n_vec == 1).

This patch fixes it by relying on the n_vec value to dump out the smb
packets.

Signed-off-by: Paulo Alcantara <palcantara@suse.de>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
2018-06-15 02:38:08 -05:00
Aurelien Aptel
bc0fe8b207 CIFS: dump every session iface info
Signed-off-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2018-06-15 02:38:08 -05:00
Aurelien Aptel
fe856be475 CIFS: parse and store info on iface queries
Signed-off-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2018-06-15 02:38:08 -05:00
Aurelien Aptel
b6f0dd5d75 CIFS: add iface info to struct cifs_ses
Signed-off-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2018-06-15 02:38:08 -05:00
Aurelien Aptel
bead042ccc CIFS: complete PDU definitions for interface queries
Signed-off-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2018-06-15 02:38:08 -05:00
Aurelien Aptel
e2292430c4 CIFS: move default port definitions to cifsglob.h
Signed-off-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2018-06-15 02:38:08 -05:00
Paulo Alcantara
cd2dca60be cifs: Fix encryption/signing
Since the rfc1002 generation was moved down to __smb_send_rqst(),
the transform header is now in rqst->rq_iov[0].

Correctly assign the transform header pointer in crypt_message().

Signed-off-by: Paulo Alcantara <palcantara@suse.de>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2018-06-15 02:38:08 -05:00
Ronnie Sahlberg
07cd952f3a cifs: update __smb_send_rqst() to take an array of requests
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
2018-06-15 02:38:08 -05:00
Ronnie Sahlberg
40eff45b5d cifs: remove smb2_send_recv()
Now that we have the plumbing to pass request without an rfc1002
header all the way down to the point we write to the socket we no
longer need the smb2_send_recv() function.

Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
2018-06-15 02:38:08 -05:00
Ronnie Sahlberg
c713c8770f cifs: push rfc1002 generation down the stack
Move the generation of the 4 byte length field down the stack and
generate it immediately before we start writing the data to the socket.

Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <smfrench@gmail.com>
2018-06-15 02:38:08 -05:00
Steve French
d409014e4f smb3: increase initial number of credits requested to allow write
Compared to other clients the Linux smb3 client ramps up
credits very slowly, taking more than 128 operations before a
maximum size write could be sent (since the number of credits
requested is only 2 per small operation, causing the credit
limit to grow very slowly).

This lack of credits initially would impact large i/o performance,
when large i/o is tried early before enough credits are built up.

Signed-off-by: Steve French <stfrench@gmail.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
2018-06-15 02:38:08 -05:00
Ronnie Sahlberg
a93864d939 cifs: add lease tracking to the cached root fid
Use a read lease for the cached root fid so that we can detect
when the content of the directory changes (via a break) at which time
we close the handle. On next access to the root the handle will be reopened
and cached again.

Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2018-06-15 02:38:07 -05:00
Steve French
2fbb56446f smb3: note that smb3.11 posix extensions mount option is experimental
Signed-off-by: Steve French <stfrench@microsoft.com>
2018-06-15 02:38:07 -05:00
David Howells
0aac4bce4b afs: Show all of a server's addresses in /proc/fs/afs/servers
Show all of a server's addresses in /proc/fs/afs/servers, placing the
second plus addresses on padded lines of their own.  The current address is
marked with a star.

Signed-off-by: David Howells <dhowells@redhat.com>
2018-06-15 00:52:59 -04:00
David Howells
b6cfbecafb afs: Handle CONFIG_PROC_FS=n
The AFS filesystem depends at the moment on /proc for configuration and
also presents information that way - however, this causes a compilation
failure if procfs is disabled.

Fix it so that the procfs bits aren't compiled in if procfs is disabled.

This means that you can't configure the AFS filesystem directly, but it is
still usable provided that an up-to-date keyutils is installed to look up
cells by SRV or AFSDB DNS records.

Reported-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: David Howells <dhowells@redhat.com>
2018-06-15 00:52:55 -04:00
David Howells
24074a35c5 proc: Make inline name size calculation automatic
Make calculation of the size of the inline name in struct proc_dir_entry
automatic, rather than having to manually encode the numbers and failing to
allow for lockdep.

Require a minimum inline name size of 33+1 to allow for names that look
like two hex numbers with a dash between.

Reported-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-06-15 00:48:57 -04:00
Al Viro
430ff79170 orangefs: simplify compat ioctl handling
no need to mess with copy_in_user(), etc...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-06-15 00:23:55 -04:00
Al Viro
5ed0127fc3 signalfd: lift sigmask copyin and size checks to callers of do_signalfd4()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-06-15 00:23:50 -04:00
Ben Noordhuis
11c5ad0ec4 eventpoll: switch to ->poll_mask
Signed-off-by: Ben Noordhuis <info@bnoordhuis.nl>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-06-14 20:09:28 -04:00
Christoph Hellwig
2739b807b0 aio: only return events requested in poll_mask() for IOCB_CMD_POLL
The ->poll_mask() operation has a mask of events that the caller
is interested in, but not all implementations might take it into
account.  Mask the return value to only the requested events,
similar to what the poll and epoll code does.

Reported-by: Avi Kivity <avi@scylladb.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-06-14 20:08:14 -04:00
Avi Kivity
4d572d9f46 eventfd: only return events requested in poll_mask()
The ->poll_mask() operation has a mask of events that the caller
is interested in, but we're returning all events regardless.

Change to return only the events the caller is interested in. This
fixes aio IO_CMD_POLL returning immediately when called with POLLIN
on an eventfd, since an eventfd is almost always ready for a write.

Signed-off-by: Avi Kivity <avi@scylladb.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-06-14 20:07:38 -04:00
Linus Torvalds
b5d903c2d6 Merge branch 'akpm' (patches from Andrew)
Merge more updates from Andrew Morton:

 - MM remainders

 - various misc things

 - kcov updates

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (27 commits)
  lib/test_printf.c: call wait_for_random_bytes() before plain %p tests
  hexagon: drop the unused variable zero_page_mask
  hexagon: fix printk format warning in setup.c
  mm: fix oom_kill event handling
  treewide: use PHYS_ADDR_MAX to avoid type casting ULLONG_MAX
  mm: use octal not symbolic permissions
  ipc: use new return type vm_fault_t
  sysvipc/sem: mitigate semnum index against spectre v1
  fault-injection: reorder config entries
  arm: port KCOV to arm
  sched/core / kcov: avoid kcov_area during task switch
  kcov: prefault the kcov_area
  kcov: ensure irq code sees a valid area
  kernel/relay.c: change return type to vm_fault_t
  exofs: avoid VLA in structures
  coredump: fix spam with zero VMA process
  fat: use fat_fs_error() instead of BUG_ON() in __fat_get_block()
  proc: skip branch in /proc/*/* lookup
  mremap: remove LATENCY_LIMIT from mremap to reduce the number of TLB shootdowns
  mm/memblock: add missing include <linux/bootmem.h>
  ...
2018-06-15 08:51:42 +09:00
Kees Cook
20fe935358 exofs: avoid VLA in structures
On the quest to remove all VLAs from the kernel[1] this adjusts several
cases where allocation is made after an array of structures that points
back into the allocation.  The allocations are changed to perform
explicit calculations instead of using a Variable Length Array in a
structure.

Additionally, this lets Clang compile this code now, since Clang does
not support VLAIS[2].

[1] https://lkml.kernel.org/r/CA+55aFzCG-zNmZwX4A2FQpadafLfEzK6CC=qPXydAacU1RqZWA@mail.gmail.com
[2] https://lkml.kernel.org/r/CA+55aFy6h1c3_rP_bXFedsTXzwW+9Q9MfJaW7GUmMBrAp-fJ9A@mail.gmail.com

[keescook@chromium.org: v2]
  Link: http://lkml.kernel.org/r/20180418163546.GA45794@beast
Link: http://lkml.kernel.org/r/20180327203904.GA1151@beast
Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Cc: Boaz Harrosh <ooo@electrozaur.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-06-15 07:55:24 +09:00
Alexey Dobriyan
86a2bb5ad8 coredump: fix spam with zero VMA process
Nobody ever tried to self destruct by unmapping whole address space at
once:

	munmap((void *)0, (1ULL << 47) - 4096);

Doing this produces 2 warnings for zero-length vmalloc allocations:

  a.out[1353]: segfault at 7f80bcc4b757 ip 00007f80bcc4b757 sp 00007fff683939b8 error 14
  a.out: vmalloc: allocation failure: 0 bytes, mode:0xcc0(GFP_KERNEL), nodemask=(null)
	...
  a.out: vmalloc: allocation failure: 0 bytes, mode:0xcc0(GFP_KERNEL), nodemask=(null)
	...

Fix is to switch to kvmalloc().

Steps to reproduce:

	// vsyscall=none
	#include <sys/mman.h>
	#include <sys/resource.h>
	int main(void)
	{
		setrlimit(RLIMIT_CORE, &(struct rlimit){RLIM_INFINITY, RLIM_INFINITY});
		munmap((void *)0, (1ULL << 47) - 4096);
		return 0;
	}

Link: http://lkml.kernel.org/r/20180410180353.GA2515@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-06-15 07:55:24 +09:00
OGAWA Hirofumi
c2574aaa5d fat: use fat_fs_error() instead of BUG_ON() in __fat_get_block()
If file size and FAT cluster chain is not matched (corrupted image), we
can hit BUG_ON(!phys) in __fat_get_block().

So, use fat_fs_error() instead.

[hirofumi@mail.parknet.co.jp: fix printk warning]
  Link: http://lkml.kernel.org/r/87po12aq5p.fsf@mail.parknet.co.jp
Link: http://lkml.kernel.org/r/874lilcu67.fsf@mail.parknet.co.jp
Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Reported-by: Anatoly Trosinenko <anatoly.trosinenko@gmail.com>
Tested-by: Anatoly Trosinenko <anatoly.trosinenko@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-06-15 07:55:24 +09:00
Alexey Dobriyan
26b95137d6 proc: skip branch in /proc/*/* lookup
Code is structured like this:

	for ( ... p < last; p++) {
		if (memcmp == 0)
			break;
	}
	if (p >= last)
		ERROR
	OK

gcc doesn't see that if if lookup succeeds than post loop branch will
never be taken and skip it.

[akpm@linux-foundation.org: proc_pident_instantiate() no longer takes an inode*]
Link: http://lkml.kernel.org/r/20180423213954.GD9043@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-06-15 07:55:24 +09:00
Linus Torvalds
7a932516f5 vfs/y2038: inode timestamps conversion to timespec64
This is a late set of changes from Deepa Dinamani doing an automated
 treewide conversion of the inode and iattr structures from 'timespec'
 to 'timespec64', to push the conversion from the VFS layer into the
 individual file systems.
 
 There were no conflicts between this and the contents of linux-next
 until just before the merge window, when we saw multiple problems:
 
 - A minor conflict with my own y2038 fixes, which I could address
   by adding another patch on top here.
 - One semantic conflict with late changes to the NFS tree. I addressed
   this by merging Deepa's original branch on top of the changes that
   now got merged into mainline and making sure the merge commit includes
   the necessary changes as produced by coccinelle.
 - A trivial conflict against the removal of staging/lustre.
 - Multiple conflicts against the VFS changes in the overlayfs tree.
   These are still part of linux-next, but apparently this is no longer
   intended for 4.18 [1], so I am ignoring that part.
 
 As Deepa writes:
 
   The series aims to switch vfs timestamps to use struct timespec64.
   Currently vfs uses struct timespec, which is not y2038 safe.
 
   The series involves the following:
   1. Add vfs helper functions for supporting struct timepec64 timestamps.
   2. Cast prints of vfs timestamps to avoid warnings after the switch.
   3. Simplify code using vfs timestamps so that the actual
      replacement becomes easy.
   4. Convert vfs timestamps to use struct timespec64 using a script.
      This is a flag day patch.
 
   Next steps:
   1. Convert APIs that can handle timespec64, instead of converting
      timestamps at the boundaries.
   2. Update internal data structures to avoid timestamp conversions.
 
 Thomas Gleixner adds:
 
   I think there is no point to drag that out for the next merge window.
   The whole thing needs to be done in one go for the core changes which
   means that you're going to play that catchup game forever. Let's get
   over with it towards the end of the merge window.
 
 [1] https://www.spinics.net/lists/linux-fsdevel/msg128294.html
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJbInZAAAoJEGCrR//JCVInReoQAIlVIIMt5ZX6wmaKbrjy9Itf
 MfgbFihQ/djLnuSPVQ3nztcxF0d66BKHZ9puVjz6+mIHqfDvJTRwZs9nU+sOF/T1
 g78fRkM1cxq6ZCkGYAbzyjyo5aC4PnSMP/NQLmwqvi0MXqqrbDoq5ZdP9DHJw39h
 L9lD8FM/P7T29Fgp9tq/pT5l9X8VU8+s5KQG1uhB5hii4VL6pD6JyLElDita7rg+
 Z7/V7jkxIGEUWF7vGaiR1QTFzEtpUA/exDf9cnsf51OGtK/LJfQ0oiZPPuq3oA/E
 LSbt8YQQObc+dvfnGxwgxEg1k5WP5ekj/Wdibv/+rQKgGyLOTz6Q4xK6r8F2ahxs
 nyZQBdXqHhJYyKr1H1reUH3mrSgQbE5U5R1i3My0xV2dSn+vtK5vgF21v2Ku3A1G
 wJratdtF/kVBzSEQUhsYTw14Un+xhBLRWzcq0cELonqxaKvRQK9r92KHLIWNE7/v
 c0TmhFbkZA+zR8HdsaL3iYf1+0W/eYy8PcvepyldKNeW2pVk3CyvdTfY2Z87G2XK
 tIkK+BUWbG3drEGG3hxZ3757Ln3a9qWyC5ruD3mBVkuug/wekbI8PykYJS7Mx4s/
 WNXl0dAL0Eeu1M8uEJejRAe1Q3eXoMWZbvCYZc+wAm92pATfHVcKwPOh8P7NHlfy
 A3HkjIBrKW5AgQDxfgvm
 =CZX2
 -----END PGP SIGNATURE-----

Merge tag 'vfs-timespec64' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/playground

Pull inode timestamps conversion to timespec64 from Arnd Bergmann:
 "This is a late set of changes from Deepa Dinamani doing an automated
  treewide conversion of the inode and iattr structures from 'timespec'
  to 'timespec64', to push the conversion from the VFS layer into the
  individual file systems.

  As Deepa writes:

   'The series aims to switch vfs timestamps to use struct timespec64.
    Currently vfs uses struct timespec, which is not y2038 safe.

    The series involves the following:
    1. Add vfs helper functions for supporting struct timepec64
       timestamps.
    2. Cast prints of vfs timestamps to avoid warnings after the switch.
    3. Simplify code using vfs timestamps so that the actual replacement
       becomes easy.
    4. Convert vfs timestamps to use struct timespec64 using a script.
       This is a flag day patch.

    Next steps:
    1. Convert APIs that can handle timespec64, instead of converting
       timestamps at the boundaries.
    2. Update internal data structures to avoid timestamp conversions'

  Thomas Gleixner adds:

   'I think there is no point to drag that out for the next merge
    window. The whole thing needs to be done in one go for the core
    changes which means that you're going to play that catchup game
    forever. Let's get over with it towards the end of the merge window'"

* tag 'vfs-timespec64' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/playground:
  pstore: Remove bogus format string definition
  vfs: change inode times to use struct timespec64
  pstore: Convert internal records to timespec64
  udf: Simplify calls to udf_disk_stamp_to_time
  fs: nfs: get rid of memcpys for inode times
  ceph: make inode time prints to be long long
  lustre: Use long long type to print inode time
  fs: add timespec64_truncate()
2018-06-15 07:31:07 +09:00
Linus Torvalds
dc594c39f7 The main piece is a set of libceph changes that revamps how OSD
requests are aborted, improving CephFS ENOSPC handling and making
 "umount -f" actually work (Zheng and myself).  The rest is mostly
 mount option handling cleanups from Chengguang and assorted fixes
 from Zheng, Luis and Dongsheng.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQEcBAABCAAGBQJbIkigAAoJEEp/3jgCEfOL3EUH/1s7Ib3FgFzG/SPPKISxZOGr
 ndZGg0rPT9mPIQ4rp6t0z/cDlMrluPmCK3sWrAPe//sZz9iZiuip+mCL0gUFXFNr
 1kL2xDKkJzGxtP3UlUvr5CC6bnxLdeBXJRBDLk/swtphuqArKndlbN/iLZnCZivT
 uJDk+vZTwNJ3UhQP4QdnOQLV60NYs+q4euTqbZF3+pDiRiONbxRfXC3adFsc8zL9
 zlie3CHPbrQHWMsfNvbfM3rBH1WhTwEssDm+IEFlKl19q9SKP2WPZfmBcE1pmZ58
 AhIMoNGdQha1FXS6N96kaPaqFgeysPnEPoyHDqLxsUMKqsvJlOEZsK1jujza4rE=
 =EfXm
 -----END PGP SIGNATURE-----

Merge tag 'ceph-for-4.18-rc1' of git://github.com/ceph/ceph-client

Pull ceph updates from Ilya Dryomov:
 "The main piece is a set of libceph changes that revamps how OSD
  requests are aborted, improving CephFS ENOSPC handling and making
  "umount -f" actually work (Zheng and myself).

  The rest is mostly mount option handling cleanups from Chengguang and
  assorted fixes from Zheng, Luis and Dongsheng.

* tag 'ceph-for-4.18-rc1' of git://github.com/ceph/ceph-client: (31 commits)
  rbd: flush rbd_dev->watch_dwork after watch is unregistered
  ceph: update description of some mount options
  ceph: show ino32 if the value is different with default
  ceph: strengthen rsize/wsize/readdir_max_bytes validation
  ceph: fix alignment of rasize
  ceph: fix use-after-free in ceph_statfs()
  ceph: prevent i_version from going back
  ceph: fix wrong check for the case of updating link count
  libceph: allocate the locator string with GFP_NOFAIL
  libceph: make abort_on_full a per-osdc setting
  libceph: don't abort reads in ceph_osdc_abort_on_full()
  libceph: avoid a use-after-free during map check
  libceph: don't warn if req->r_abort_on_full is set
  libceph: use for_each_request() in ceph_osdc_abort_on_full()
  libceph: defer __complete_request() to a workqueue
  libceph: move more code into __complete_request()
  libceph: no need to call flush_workqueue() before destruction
  ceph: flush pending works before shutdown super
  ceph: abort osd requests on force umount
  libceph: introduce ceph_osdc_abort_requests()
  ...
2018-06-15 07:24:58 +09:00
Linus Torvalds
e7655d2b25 for-4.18-part2-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAlsg598ACgkQxWXV+ddt
 WDtG1w//R9/nvAk5A5cbForTXNyxwXAQnz0t2/4Lh5igbfJloqoTZtr47Gvqsvy+
 DITU+3BPcyupBuUFLoeivPC+ruKOUmle27Vm62mdZRtyt96kiUiV/m1gbPhFy8lW
 DKHK8tvtIoZObo5oNGvRxiuJPjefiChYZMDHokB+MY5ZALSRaIW9opj2WsM+ZAt7
 g9CEjeitQcBY68CCpSEVBlQSz+BTZYEDFJGNCmGDxGhaBGZr/ganrkDJ75cG6U10
 LnOZ6LDHNxGMqUm4wnhfmpHtVcIJiBF+gOyTumBPtFSoLnBverl684xizglpoq6d
 fnUP8Y6XR9JA4OCZvo310yvX9nyqgb0H2h+APO0f7jRRcJo0QSKZ/qZR+XZCk3PU
 91HtBopcGs8gGQUkRdAE7TMCiIEzL1eNOXHvsiILCObq1i7iNCe7Dzx6M6Gfgep8
 a3IcoVmSw1DFpln2ZxTQw9viAib41iU46XHXz7W7rGPulF5QdXGo5ScORRG6HBLE
 nZsXdTkrCPMJyN2bIhU6YJOK9rb9TjD4lUtnvzT8t1CfUsxQT4AsJykaYr9BwF2D
 Z4rBruUAQ3OmvXJDfGG4T5YCAdPBN+xBcxCeyrDZSD0r6YPQGoF0dlvHmvP0J78D
 oGkD1bb/gjcvsPJxtTQ4QWEh0oqZiDfRt4qdgO46vhba0onzQFw=
 =X2zA
 -----END PGP SIGNATURE-----

Merge tag 'for-4.18-part2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:

 - error handling fixup for one of the new ioctls from 1st pull

 - fix for device-replace that incorrectly uses inode pages and can mess
   up compressed extents in some cases

 - fiemap fix for reporting incorrect number of extents

 - vm_fault_t type conversion

* tag 'for-4.18-part2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: scrub: Don't use inode pages for device replace
  btrfs: change return type of btrfs_page_mkwrite to vm_fault_t
  Btrfs: fiemap: pass correct bytenr when fm_extent_count is zero
  btrfs: Check error of btrfs_iget in btrfs_search_path_in_tree_user
2018-06-15 07:23:00 +09:00
Anna Schumaker
d5681f59ee NFS: Fix an rcu deadlock in nfs_delegation_find_inode()
I was able to reproduce this pretty regularily using xfstests
generic/013 on NFS v4.0.

Reported-by: Ross Zwisler <Ross.Zwisler@linux.intel.com>
Fixes: 6c34265502 (NFSv4: Return NFS4ERR_DELAY when a delegation recall fails due to igrab())
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-06-14 14:05:38 -04:00
Theodore Ts'o
bc890a6024 ext4: verify the depth of extent tree in ext4_find_extent()
If there is a corupted file system where the claimed depth of the
extent tree is -1, this can cause a massive buffer overrun leading to
sadness.

This addresses CVE-2018-10877.

https://bugzilla.kernel.org/show_bug.cgi?id=199417

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
2018-06-14 12:55:10 -04:00
Arnd Bergmann
e264abeaf9 pstore: Remove bogus format string definition
The pstore conversion to timespec64 introduces its own method of passing
seconds into sscanf() and sprintf() type functions to work around the
timespec64 definition on 64-bit systems that redefine it to 'timespec'.

That hack is now finally getting removed, but that means we get a (harmless)
warning once both patches are merged:

fs/pstore/ram.c: In function 'ramoops_read_kmsg_hdr':
fs/pstore/ram.c:39:29: error: format '%ld' expects argument of type 'long int *', but argument 3 has type 'time64_t *' {aka 'long long int *'} [-Werror=format=]
 #define RAMOOPS_KERNMSG_HDR "===="
                             ^~~~~~
fs/pstore/ram.c:167:21: note: in expansion of macro 'RAMOOPS_KERNMSG_HDR'

This removes the pstore specific workaround and uses the same method that
we have in place for all other functions that print a timespec64.

Related to this, I found that the kasprintf() output contains an incorrect
nanosecond values for any number starting with zeroes, and I adapt the
format string accordingly.

Link: https://lkml.org/lkml/2018/5/19/115
Link: https://lkml.org/lkml/2018/5/16/1080
Fixes: 0f0d83b99ef7 ("pstore: Convert internal records to timespec64")
Acked-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2018-06-14 14:57:24 +02:00
Arnd Bergmann
15eefe2a99 Merge branch 'vfs_timespec64' of https://github.com/deepa-hub/vfs into vfs-timespec64
Pull the timespec64 conversion from Deepa Dinamani:
 "The series aims to switch vfs timestamps to use
  struct timespec64. Currently vfs uses struct timespec,
  which is not y2038 safe.

  The flag patch applies cleanly. I've not seen the timestamps
  update logic change often. The series applies cleanly on 4.17-rc6
  and linux-next tip (top commit: next-20180517).

  I'm not sure how to merge this kind of a series with a flag patch.
  We are targeting 4.18 for this.
  Let me know if you have other suggestions.

  The series involves the following:
  1. Add vfs helper functions for supporting struct timepec64 timestamps.
  2. Cast prints of vfs timestamps to avoid warnings after the switch.
  3. Simplify code using vfs timestamps so that the actual
     replacement becomes easy.
  4. Convert vfs timestamps to use struct timespec64 using a script.
     This is a flag day patch.

  I've tried to keep the conversions with the script simple, to
  aid in the reviews. I've kept all the internal filesystem data
  structures and function signatures the same.

  Next steps:
  1. Convert APIs that can handle timespec64, instead of converting
     timestamps at the boundaries.
  2. Update internal data structures to avoid timestamp conversions."

I've pulled it into a branch based on top of the NFS changes that
are now in mainline, so I could resolve the non-obvious conflict
between the two while merging.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2018-06-14 14:54:00 +02:00
Konstantin Khorenko
1cf8e5de40 fs/lock: show locks taken by processes from another pidns
Currently if we face a lock taken by a process invisible in the current
pidns we skip the lock completely, but this

1) makes the output not that nice
    (root@vz7)/: cat /proc/${PID_A2}/fdinfo/3
    pos:    4
    flags:  02100002
    mnt_id: 257
    lock:   (root@vz7)/:

2) makes it more difficult to debug issues with leaked flocks
   if you get error on lock, but don't see any locks in /proc/$id/fdinfo/$file

Let's show information about such locks again as previously, but
show zero in the owner pid field.

After the patch:
===============
(root@vz7)/:cat /proc/${PID_A2}/fdinfo/3
pos:    4
flags:  02100002
mnt_id: 295
lock:   1: FLOCK  ADVISORY  WRITE 0 b6:f8a61:529946 0 EOF

Fixes: 9d5b86ac13 ("fs/locks: Remove fl_nspid and use fs-specific l_pid for remote locks")
Signed-off-by: Konstantin Khorenko <khorenko@virtuozzo.com>
Acked-by: Andrey Vagin <avagin@openvz.org>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
2018-06-14 07:37:50 -04:00
Konstantin Khorenko
826d7bc9f0 fs/lock: skip lock owner pid translation in case we are in init_pid_ns
If the flock owner process is dead and its pid has been already freed,
pid translation won't work, but we still want to show flock owner pid
number when expecting /proc/$PID/fdinfo/$FD in init pidns.

Reproducer:
process A	process A1	process A2
fork()--------->
exit()		open()
		flock()
		fork()--------->
		exit()		sleep()

Before the patch:
================
(root@vz7)/: cat /proc/${PID_A2}/fdinfo/3
pos:    4
flags:  02100002
mnt_id: 257
lock:   (root@vz7)/:

After the patch:
===============
(root@vz7)/:cat /proc/${PID_A2}/fdinfo/3
pos:    4
flags:  02100002
mnt_id: 295
lock:   1: FLOCK  ADVISORY  WRITE ${PID_A1} b6:f8a61:529946 0 EOF

Fixes: 9d5b86ac13 ("fs/locks: Remove fl_nspid and use fs-specific l_pid for remote locks")
Signed-off-by: Konstantin Khorenko <khorenko@virtuozzo.com>
Acked-by: Andrey Vagin <avagin@openvz.org>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
2018-06-14 07:36:40 -04:00
Theodore Ts'o
8844618d8a ext4: only look at the bg_flags field if it is valid
The bg_flags field in the block group descripts is only valid if the
uninit_bg or metadata_csum feature is enabled.  We were not
consistently looking at this field; fix this.

Also block group #0 must never have uninitialized allocation bitmaps,
or need to be zeroed, since that's where the root inode, and other
special inodes are set up.  Check for these conditions and mark the
file system as corrupted if they are detected.

This addresses CVE-2018-10876.

https://bugzilla.kernel.org/show_bug.cgi?id=199403

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
2018-06-14 00:58:00 -04:00
Theodore Ts'o
77260807d1 ext4: make sure bitmaps and the inode table don't overlap with bg descriptors
It's really bad when the allocation bitmaps and the inode table
overlap with the block group descriptors, since it causes random
corruption of the bg descriptors.  So we really want to head those off
at the pass.

https://bugzilla.kernel.org/show_bug.cgi?id=199865

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
2018-06-13 23:08:26 -04:00
Theodore Ts'o
819b23f1c5 ext4: always check block group bounds in ext4_init_block_bitmap()
Regardless of whether the flex_bg feature is set, we should always
check to make sure the bits we are setting in the block bitmap are
within the block group bounds.

https://bugzilla.kernel.org/show_bug.cgi?id=199865

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
2018-06-13 23:00:48 -04:00
Theodore Ts'o
513f86d738 ext4: always verify the magic number in xattr blocks
If there an inode points to a block which is also some other type of
metadata block (such as a block allocation bitmap), the
buffer_verified flag can be set when it was validated as that other
metadata block type; however, it would make a really terrible external
attribute block.  The reason why we use the verified flag is to avoid
constantly reverifying the block.  However, it doesn't take much
overhead to make sure the magic number of the xattr block is correct,
and this will avoid potential crashes.

This addresses CVE-2018-10879.

https://bugzilla.kernel.org/show_bug.cgi?id=200001

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Cc: stable@kernel.org
2018-06-13 00:51:28 -04:00
Theodore Ts'o
5369a762c8 ext4: add corruption check in ext4_xattr_set_entry()
In theory this should have been caught earlier when the xattr list was
verified, but in case it got missed, it's simple enough to add check
to make sure we don't overrun the xattr buffer.

This addresses CVE-2018-10879.

https://bugzilla.kernel.org/show_bug.cgi?id=200001

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Cc: stable@kernel.org
2018-06-13 00:23:11 -04:00
Linus Torvalds
f5b7769eb0 Revert "debugfs: inode: debugfs_create_dir uses mode permission from parent"
This reverts commit 95cde3c599.

The commit had good intentions, but it breaks kvm-tool and qemu-kvm.

With it in place, "lkvm run" just fails with

  Error: KVM_CREATE_VM ioctl
  Warning: Failed init: kvm__init

which isn't a wonderful error message, but bisection pinpointed the
problematic commit.

The problem is almost certainly due to the special kvm debugfs entries
created dynamically by kvm under /sys/kernel/debug/kvm/.  See
kvm_create_vm_debugfs()

Bisected-and-reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Wanpeng Li <kernellwp@gmail.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Thomas Richter <tmricht@linux.ibm.com>
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-06-12 20:52:16 -07:00
Theodore Ts'o
327eaf738f ext4: add warn_on_error mount option
This is very handy when debugging bugs handling maliciously corrupted
file systems.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2018-06-12 23:34:57 -04:00
Linus Torvalds
b08fc5277a - Error path bug fix for overflow tests (Dan)
- Additional struct_size() conversions (Matthew, Kees)
 - Explicitly reported overflow fixes (Silvio, Kees)
 - Add missing kvcalloc() function (Kees)
 - Treewide conversions of allocators to use either 2-factor argument
   variant when available, or array_size() and array3_size() as needed (Kees)
 -----BEGIN PGP SIGNATURE-----
 Comment: Kees Cook <kees@outflux.net>
 
 iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAlsgVtMWHGtlZXNjb29r
 QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJhsJEACLYe2EbwLFJz7emOT1KUGK5R1b
 oVxJog0893WyMqgk9XBlA2lvTBRBYzR3tzsadfYo87L3VOBzazUv0YZaweJb65sF
 bAvxW3nY06brhKKwTRed1PrMa1iG9R63WISnNAuZAq7+79mN6YgW4G6YSAEF9lW7
 oPJoPw93YxcI8JcG+dA8BC9w7pJFKooZH4gvLUSUNl5XKr8Ru5YnWcV8F+8M4vZI
 EJtXFmdlmxAledUPxTSCIojO8m/tNOjYTreBJt9K1DXKY6UcgAdhk75TRLEsp38P
 fPvMigYQpBDnYz2pi9ourTgvZLkffK1OBZ46PPt8BgUZVf70D6CBg10vK47KO6N2
 zreloxkMTrz5XohyjfNjYFRkyyuwV2sSVrRJqF4dpyJ4NJQRjvyywxIP4Myifwlb
 ONipCM1EjvQjaEUbdcqKgvlooMdhcyxfshqJWjHzXB6BL22uPzq5jHXXugz8/ol8
 tOSM2FuJ2sBLQso+szhisxtMd11PihzIZK9BfxEG3du+/hlI+2XgN7hnmlXuA2k3
 BUW6BSDhab41HNd6pp50bDJnL0uKPWyFC6hqSNZw+GOIb46jfFcQqnCB3VZGCwj3
 LH53Be1XlUrttc/NrtkvVhm4bdxtfsp4F7nsPFNDuHvYNkalAVoC3An0BzOibtkh
 AtfvEeaPHaOyD8/h2Q==
 =zUUp
 -----END PGP SIGNATURE-----

Merge tag 'overflow-v4.18-rc1-part2' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Pull more overflow updates from Kees Cook:
 "The rest of the overflow changes for v4.18-rc1.

  This includes the explicit overflow fixes from Silvio, further
  struct_size() conversions from Matthew, and a bug fix from Dan.

  But the bulk of it is the treewide conversions to use either the
  2-factor argument allocators (e.g. kmalloc(a * b, ...) into
  kmalloc_array(a, b, ...) or the array_size() macros (e.g. vmalloc(a *
  b) into vmalloc(array_size(a, b)).

  Coccinelle was fighting me on several fronts, so I've done a bunch of
  manual whitespace updates in the patches as well.

  Summary:

   - Error path bug fix for overflow tests (Dan)

   - Additional struct_size() conversions (Matthew, Kees)

   - Explicitly reported overflow fixes (Silvio, Kees)

   - Add missing kvcalloc() function (Kees)

   - Treewide conversions of allocators to use either 2-factor argument
     variant when available, or array_size() and array3_size() as needed
     (Kees)"

* tag 'overflow-v4.18-rc1-part2' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: (26 commits)
  treewide: Use array_size in f2fs_kvzalloc()
  treewide: Use array_size() in f2fs_kzalloc()
  treewide: Use array_size() in f2fs_kmalloc()
  treewide: Use array_size() in sock_kmalloc()
  treewide: Use array_size() in kvzalloc_node()
  treewide: Use array_size() in vzalloc_node()
  treewide: Use array_size() in vzalloc()
  treewide: Use array_size() in vmalloc()
  treewide: devm_kzalloc() -> devm_kcalloc()
  treewide: devm_kmalloc() -> devm_kmalloc_array()
  treewide: kvzalloc() -> kvcalloc()
  treewide: kvmalloc() -> kvmalloc_array()
  treewide: kzalloc_node() -> kcalloc_node()
  treewide: kzalloc() -> kcalloc()
  treewide: kmalloc() -> kmalloc_array()
  mm: Introduce kvcalloc()
  video: uvesafb: Fix integer overflow in allocation
  UBIFS: Fix potential integer overflow in allocation
  leds: Use struct_size() in allocation
  Convert intel uncore to struct_size
  ...
2018-06-12 18:28:00 -07:00
Kees Cook
9d2a789c1d treewide: Use array_size in f2fs_kvzalloc()
The f2fs_kvzalloc() function has no 2-factor argument form, so
multiplication factors need to be wrapped in array_size(). This patch
replaces cases of:

        f2fs_kvzalloc(handle, a * b, gfp)

with:
        f2fs_kvzalloc(handle, array_size(a, b), gfp)

as well as handling cases of:

        f2fs_kvzalloc(handle, a * b * c, gfp)

with:

        f2fs_kvzalloc(handle, array3_size(a, b, c), gfp)

This does, however, attempt to ignore constant size factors like:

        f2fs_kvzalloc(handle, 4 * 1024, gfp)

though any constants defined via macros get caught up in the conversion.

Any factors with a sizeof() of "unsigned char", "char", and "u8" were
dropped, since they're redundant.

The Coccinelle script used for this was:

// Fix redundant parens around sizeof().
@@
expression HANDLE;
type TYPE;
expression THING, E;
@@

(
  f2fs_kvzalloc(HANDLE,
-	(sizeof(TYPE)) * E
+	sizeof(TYPE) * E
  , ...)
|
  f2fs_kvzalloc(HANDLE,
-	(sizeof(THING)) * E
+	sizeof(THING) * E
  , ...)
)

// Drop single-byte sizes and redundant parens.
@@
expression HANDLE;
expression COUNT;
typedef u8;
typedef __u8;
@@

(
  f2fs_kvzalloc(HANDLE,
-	sizeof(u8) * (COUNT)
+	COUNT
  , ...)
|
  f2fs_kvzalloc(HANDLE,
-	sizeof(__u8) * (COUNT)
+	COUNT
  , ...)
|
  f2fs_kvzalloc(HANDLE,
-	sizeof(char) * (COUNT)
+	COUNT
  , ...)
|
  f2fs_kvzalloc(HANDLE,
-	sizeof(unsigned char) * (COUNT)
+	COUNT
  , ...)
|
  f2fs_kvzalloc(HANDLE,
-	sizeof(u8) * COUNT
+	COUNT
  , ...)
|
  f2fs_kvzalloc(HANDLE,
-	sizeof(__u8) * COUNT
+	COUNT
  , ...)
|
  f2fs_kvzalloc(HANDLE,
-	sizeof(char) * COUNT
+	COUNT
  , ...)
|
  f2fs_kvzalloc(HANDLE,
-	sizeof(unsigned char) * COUNT
+	COUNT
  , ...)
)

// 2-factor product with sizeof(type/expression) and identifier or constant.
@@
expression HANDLE;
type TYPE;
expression THING;
identifier COUNT_ID;
constant COUNT_CONST;
@@

(
  f2fs_kvzalloc(HANDLE,
-	sizeof(TYPE) * (COUNT_ID)
+	array_size(COUNT_ID, sizeof(TYPE))
  , ...)
|
  f2fs_kvzalloc(HANDLE,
-	sizeof(TYPE) * COUNT_ID
+	array_size(COUNT_ID, sizeof(TYPE))
  , ...)
|
  f2fs_kvzalloc(HANDLE,
-	sizeof(TYPE) * (COUNT_CONST)
+	array_size(COUNT_CONST, sizeof(TYPE))
  , ...)
|
  f2fs_kvzalloc(HANDLE,
-	sizeof(TYPE) * COUNT_CONST
+	array_size(COUNT_CONST, sizeof(TYPE))
  , ...)
|
  f2fs_kvzalloc(HANDLE,
-	sizeof(THING) * (COUNT_ID)
+	array_size(COUNT_ID, sizeof(THING))
  , ...)
|
  f2fs_kvzalloc(HANDLE,
-	sizeof(THING) * COUNT_ID
+	array_size(COUNT_ID, sizeof(THING))
  , ...)
|
  f2fs_kvzalloc(HANDLE,
-	sizeof(THING) * (COUNT_CONST)
+	array_size(COUNT_CONST, sizeof(THING))
  , ...)
|
  f2fs_kvzalloc(HANDLE,
-	sizeof(THING) * COUNT_CONST
+	array_size(COUNT_CONST, sizeof(THING))
  , ...)
)

// 2-factor product, only identifiers.
@@
expression HANDLE;
identifier SIZE, COUNT;
@@

  f2fs_kvzalloc(HANDLE,
-	SIZE * COUNT
+	array_size(COUNT, SIZE)
  , ...)

// 3-factor product with 1 sizeof(type) or sizeof(expression), with
// redundant parens removed.
@@
expression HANDLE;
expression THING;
identifier STRIDE, COUNT;
type TYPE;
@@

(
  f2fs_kvzalloc(HANDLE,
-	sizeof(TYPE) * (COUNT) * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  f2fs_kvzalloc(HANDLE,
-	sizeof(TYPE) * (COUNT) * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  f2fs_kvzalloc(HANDLE,
-	sizeof(TYPE) * COUNT * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  f2fs_kvzalloc(HANDLE,
-	sizeof(TYPE) * COUNT * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  f2fs_kvzalloc(HANDLE,
-	sizeof(THING) * (COUNT) * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
|
  f2fs_kvzalloc(HANDLE,
-	sizeof(THING) * (COUNT) * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
|
  f2fs_kvzalloc(HANDLE,
-	sizeof(THING) * COUNT * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
|
  f2fs_kvzalloc(HANDLE,
-	sizeof(THING) * COUNT * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
)

// 3-factor product with 2 sizeof(variable), with redundant parens removed.
@@
expression HANDLE;
expression THING1, THING2;
identifier COUNT;
type TYPE1, TYPE2;
@@

(
  f2fs_kvzalloc(HANDLE,
-	sizeof(TYPE1) * sizeof(TYPE2) * COUNT
+	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
  , ...)
|
  f2fs_kvzalloc(HANDLE,
-	sizeof(TYPE1) * sizeof(THING2) * (COUNT)
+	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
  , ...)
|
  f2fs_kvzalloc(HANDLE,
-	sizeof(THING1) * sizeof(THING2) * COUNT
+	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
  , ...)
|
  f2fs_kvzalloc(HANDLE,
-	sizeof(THING1) * sizeof(THING2) * (COUNT)
+	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
  , ...)
|
  f2fs_kvzalloc(HANDLE,
-	sizeof(TYPE1) * sizeof(THING2) * COUNT
+	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
  , ...)
|
  f2fs_kvzalloc(HANDLE,
-	sizeof(TYPE1) * sizeof(THING2) * (COUNT)
+	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
  , ...)
)

// 3-factor product, only identifiers, with redundant parens removed.
@@
expression HANDLE;
identifier STRIDE, SIZE, COUNT;
@@

(
  f2fs_kvzalloc(HANDLE,
-	(COUNT) * STRIDE * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  f2fs_kvzalloc(HANDLE,
-	COUNT * (STRIDE) * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  f2fs_kvzalloc(HANDLE,
-	COUNT * STRIDE * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  f2fs_kvzalloc(HANDLE,
-	(COUNT) * (STRIDE) * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  f2fs_kvzalloc(HANDLE,
-	COUNT * (STRIDE) * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  f2fs_kvzalloc(HANDLE,
-	(COUNT) * STRIDE * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  f2fs_kvzalloc(HANDLE,
-	(COUNT) * (STRIDE) * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  f2fs_kvzalloc(HANDLE,
-	COUNT * STRIDE * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
)

// Any remaining multi-factor products, first at least 3-factor products
// when they're not all constants...
@@
expression HANDLE;
expression E1, E2, E3;
constant C1, C2, C3;
@@

(
  f2fs_kvzalloc(HANDLE, C1 * C2 * C3, ...)
|
  f2fs_kvzalloc(HANDLE,
-	E1 * E2 * E3
+	array3_size(E1, E2, E3)
  , ...)
)

// And then all remaining 2 factors products when they're not all constants.
@@
expression HANDLE;
expression E1, E2;
constant C1, C2;
@@

(
  f2fs_kvzalloc(HANDLE, C1 * C2, ...)
|
  f2fs_kvzalloc(HANDLE,
-	E1 * E2
+	array_size(E1, E2)
  , ...)
)

Signed-off-by: Kees Cook <keescook@chromium.org>
2018-06-12 16:19:22 -07:00
Kees Cook
026f05079b treewide: Use array_size() in f2fs_kzalloc()
The f2fs_kzalloc() function has no 2-factor argument form, so
multiplication factors need to be wrapped in array_size(). This patch
replaces cases of:

        f2fs_kzalloc(handle, a * b, gfp)

with:
        f2fs_kzalloc(handle, array_size(a, b), gfp)

as well as handling cases of:

        f2fs_kzalloc(handle, a * b * c, gfp)

with:

        f2fs_kzalloc(handle, array3_size(a, b, c), gfp)

This does, however, attempt to ignore constant size factors like:

        f2fs_kzalloc(handle, 4 * 1024, gfp)

though any constants defined via macros get caught up in the conversion.

Any factors with a sizeof() of "unsigned char", "char", and "u8" were
dropped, since they're redundant.

The Coccinelle script used for this was:

// Fix redundant parens around sizeof().
@@
expression HANDLE;
type TYPE;
expression THING, E;
@@

(
  f2fs_kzalloc(HANDLE,
-	(sizeof(TYPE)) * E
+	sizeof(TYPE) * E
  , ...)
|
  f2fs_kzalloc(HANDLE,
-	(sizeof(THING)) * E
+	sizeof(THING) * E
  , ...)
)

// Drop single-byte sizes and redundant parens.
@@
expression HANDLE;
expression COUNT;
typedef u8;
typedef __u8;
@@

(
  f2fs_kzalloc(HANDLE,
-	sizeof(u8) * (COUNT)
+	COUNT
  , ...)
|
  f2fs_kzalloc(HANDLE,
-	sizeof(__u8) * (COUNT)
+	COUNT
  , ...)
|
  f2fs_kzalloc(HANDLE,
-	sizeof(char) * (COUNT)
+	COUNT
  , ...)
|
  f2fs_kzalloc(HANDLE,
-	sizeof(unsigned char) * (COUNT)
+	COUNT
  , ...)
|
  f2fs_kzalloc(HANDLE,
-	sizeof(u8) * COUNT
+	COUNT
  , ...)
|
  f2fs_kzalloc(HANDLE,
-	sizeof(__u8) * COUNT
+	COUNT
  , ...)
|
  f2fs_kzalloc(HANDLE,
-	sizeof(char) * COUNT
+	COUNT
  , ...)
|
  f2fs_kzalloc(HANDLE,
-	sizeof(unsigned char) * COUNT
+	COUNT
  , ...)
)

// 2-factor product with sizeof(type/expression) and identifier or constant.
@@
expression HANDLE;
type TYPE;
expression THING;
identifier COUNT_ID;
constant COUNT_CONST;
@@

(
  f2fs_kzalloc(HANDLE,
-	sizeof(TYPE) * (COUNT_ID)
+	array_size(COUNT_ID, sizeof(TYPE))
  , ...)
|
  f2fs_kzalloc(HANDLE,
-	sizeof(TYPE) * COUNT_ID
+	array_size(COUNT_ID, sizeof(TYPE))
  , ...)
|
  f2fs_kzalloc(HANDLE,
-	sizeof(TYPE) * (COUNT_CONST)
+	array_size(COUNT_CONST, sizeof(TYPE))
  , ...)
|
  f2fs_kzalloc(HANDLE,
-	sizeof(TYPE) * COUNT_CONST
+	array_size(COUNT_CONST, sizeof(TYPE))
  , ...)
|
  f2fs_kzalloc(HANDLE,
-	sizeof(THING) * (COUNT_ID)
+	array_size(COUNT_ID, sizeof(THING))
  , ...)
|
  f2fs_kzalloc(HANDLE,
-	sizeof(THING) * COUNT_ID
+	array_size(COUNT_ID, sizeof(THING))
  , ...)
|
  f2fs_kzalloc(HANDLE,
-	sizeof(THING) * (COUNT_CONST)
+	array_size(COUNT_CONST, sizeof(THING))
  , ...)
|
  f2fs_kzalloc(HANDLE,
-	sizeof(THING) * COUNT_CONST
+	array_size(COUNT_CONST, sizeof(THING))
  , ...)
)

// 2-factor product, only identifiers.
@@
expression HANDLE;
identifier SIZE, COUNT;
@@

  f2fs_kzalloc(HANDLE,
-	SIZE * COUNT
+	array_size(COUNT, SIZE)
  , ...)

// 3-factor product with 1 sizeof(type) or sizeof(expression), with
// redundant parens removed.
@@
expression HANDLE;
expression THING;
identifier STRIDE, COUNT;
type TYPE;
@@

(
  f2fs_kzalloc(HANDLE,
-	sizeof(TYPE) * (COUNT) * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  f2fs_kzalloc(HANDLE,
-	sizeof(TYPE) * (COUNT) * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  f2fs_kzalloc(HANDLE,
-	sizeof(TYPE) * COUNT * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  f2fs_kzalloc(HANDLE,
-	sizeof(TYPE) * COUNT * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  f2fs_kzalloc(HANDLE,
-	sizeof(THING) * (COUNT) * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
|
  f2fs_kzalloc(HANDLE,
-	sizeof(THING) * (COUNT) * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
|
  f2fs_kzalloc(HANDLE,
-	sizeof(THING) * COUNT * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
|
  f2fs_kzalloc(HANDLE,
-	sizeof(THING) * COUNT * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
)

// 3-factor product with 2 sizeof(variable), with redundant parens removed.
@@
expression HANDLE;
expression THING1, THING2;
identifier COUNT;
type TYPE1, TYPE2;
@@

(
  f2fs_kzalloc(HANDLE,
-	sizeof(TYPE1) * sizeof(TYPE2) * COUNT
+	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
  , ...)
|
  f2fs_kzalloc(HANDLE,
-	sizeof(TYPE1) * sizeof(THING2) * (COUNT)
+	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
  , ...)
|
  f2fs_kzalloc(HANDLE,
-	sizeof(THING1) * sizeof(THING2) * COUNT
+	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
  , ...)
|
  f2fs_kzalloc(HANDLE,
-	sizeof(THING1) * sizeof(THING2) * (COUNT)
+	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
  , ...)
|
  f2fs_kzalloc(HANDLE,
-	sizeof(TYPE1) * sizeof(THING2) * COUNT
+	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
  , ...)
|
  f2fs_kzalloc(HANDLE,
-	sizeof(TYPE1) * sizeof(THING2) * (COUNT)
+	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
  , ...)
)

// 3-factor product, only identifiers, with redundant parens removed.
@@
expression HANDLE;
identifier STRIDE, SIZE, COUNT;
@@

(
  f2fs_kzalloc(HANDLE,
-	(COUNT) * STRIDE * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  f2fs_kzalloc(HANDLE,
-	COUNT * (STRIDE) * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  f2fs_kzalloc(HANDLE,
-	COUNT * STRIDE * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  f2fs_kzalloc(HANDLE,
-	(COUNT) * (STRIDE) * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  f2fs_kzalloc(HANDLE,
-	COUNT * (STRIDE) * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  f2fs_kzalloc(HANDLE,
-	(COUNT) * STRIDE * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  f2fs_kzalloc(HANDLE,
-	(COUNT) * (STRIDE) * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  f2fs_kzalloc(HANDLE,
-	COUNT * STRIDE * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
)

// Any remaining multi-factor products, first at least 3-factor products
// when they're not all constants...
@@
expression HANDLE;
expression E1, E2, E3;
constant C1, C2, C3;
@@

(
  f2fs_kzalloc(HANDLE, C1 * C2 * C3, ...)
|
  f2fs_kzalloc(HANDLE,
-	E1 * E2 * E3
+	array3_size(E1, E2, E3)
  , ...)
)

// And then all remaining 2 factors products when they're not all constants.
@@
expression HANDLE;
expression E1, E2;
constant C1, C2;
@@

(
  f2fs_kzalloc(HANDLE, C1 * C2, ...)
|
  f2fs_kzalloc(HANDLE,
-	E1 * E2
+	array_size(E1, E2)
  , ...)
)

Signed-off-by: Kees Cook <keescook@chromium.org>
2018-06-12 16:19:22 -07:00
Kees Cook
c86065938a treewide: Use array_size() in f2fs_kmalloc()
The f2fs_kmalloc() function has no 2-factor argument form, so
multiplication factors need to be wrapped in array_size(). This patch
replaces cases of:

        f2fs_kmalloc(handle, a * b, gfp)

with:
        f2fs_kmalloc(handle, array_size(a, b), gfp)

as well as handling cases of:

        f2fs_kmalloc(handle, a * b * c, gfp)

with:

        f2fs_kmalloc(handle, array3_size(a, b, c), gfp)

This does, however, attempt to ignore constant size factors like:

        f2fs_kmalloc(handle, 4 * 1024, gfp)

though any constants defined via macros get caught up in the conversion.

Any factors with a sizeof() of "unsigned char", "char", and "u8" were
dropped, since they're redundant.

The Coccinelle script used for this was:

// Fix redundant parens around sizeof().
@@
expression HANDLE;
type TYPE;
expression THING, E;
@@

(
  f2fs_kmalloc(HANDLE,
-	(sizeof(TYPE)) * E
+	sizeof(TYPE) * E
  , ...)
|
  f2fs_kmalloc(HANDLE,
-	(sizeof(THING)) * E
+	sizeof(THING) * E
  , ...)
)

// Drop single-byte sizes and redundant parens.
@@
expression HANDLE;
expression COUNT;
typedef u8;
typedef __u8;
@@

(
  f2fs_kmalloc(HANDLE,
-	sizeof(u8) * (COUNT)
+	COUNT
  , ...)
|
  f2fs_kmalloc(HANDLE,
-	sizeof(__u8) * (COUNT)
+	COUNT
  , ...)
|
  f2fs_kmalloc(HANDLE,
-	sizeof(char) * (COUNT)
+	COUNT
  , ...)
|
  f2fs_kmalloc(HANDLE,
-	sizeof(unsigned char) * (COUNT)
+	COUNT
  , ...)
|
  f2fs_kmalloc(HANDLE,
-	sizeof(u8) * COUNT
+	COUNT
  , ...)
|
  f2fs_kmalloc(HANDLE,
-	sizeof(__u8) * COUNT
+	COUNT
  , ...)
|
  f2fs_kmalloc(HANDLE,
-	sizeof(char) * COUNT
+	COUNT
  , ...)
|
  f2fs_kmalloc(HANDLE,
-	sizeof(unsigned char) * COUNT
+	COUNT
  , ...)
)

// 2-factor product with sizeof(type/expression) and identifier or constant.
@@
expression HANDLE;
type TYPE;
expression THING;
identifier COUNT_ID;
constant COUNT_CONST;
@@

(
  f2fs_kmalloc(HANDLE,
-	sizeof(TYPE) * (COUNT_ID)
+	array_size(COUNT_ID, sizeof(TYPE))
  , ...)
|
  f2fs_kmalloc(HANDLE,
-	sizeof(TYPE) * COUNT_ID
+	array_size(COUNT_ID, sizeof(TYPE))
  , ...)
|
  f2fs_kmalloc(HANDLE,
-	sizeof(TYPE) * (COUNT_CONST)
+	array_size(COUNT_CONST, sizeof(TYPE))
  , ...)
|
  f2fs_kmalloc(HANDLE,
-	sizeof(TYPE) * COUNT_CONST
+	array_size(COUNT_CONST, sizeof(TYPE))
  , ...)
|
  f2fs_kmalloc(HANDLE,
-	sizeof(THING) * (COUNT_ID)
+	array_size(COUNT_ID, sizeof(THING))
  , ...)
|
  f2fs_kmalloc(HANDLE,
-	sizeof(THING) * COUNT_ID
+	array_size(COUNT_ID, sizeof(THING))
  , ...)
|
  f2fs_kmalloc(HANDLE,
-	sizeof(THING) * (COUNT_CONST)
+	array_size(COUNT_CONST, sizeof(THING))
  , ...)
|
  f2fs_kmalloc(HANDLE,
-	sizeof(THING) * COUNT_CONST
+	array_size(COUNT_CONST, sizeof(THING))
  , ...)
)

// 2-factor product, only identifiers.
@@
expression HANDLE;
identifier SIZE, COUNT;
@@

  f2fs_kmalloc(HANDLE,
-	SIZE * COUNT
+	array_size(COUNT, SIZE)
  , ...)

// 3-factor product with 1 sizeof(type) or sizeof(expression), with
// redundant parens removed.
@@
expression HANDLE;
expression THING;
identifier STRIDE, COUNT;
type TYPE;
@@

(
  f2fs_kmalloc(HANDLE,
-	sizeof(TYPE) * (COUNT) * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  f2fs_kmalloc(HANDLE,
-	sizeof(TYPE) * (COUNT) * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  f2fs_kmalloc(HANDLE,
-	sizeof(TYPE) * COUNT * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  f2fs_kmalloc(HANDLE,
-	sizeof(TYPE) * COUNT * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  f2fs_kmalloc(HANDLE,
-	sizeof(THING) * (COUNT) * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
|
  f2fs_kmalloc(HANDLE,
-	sizeof(THING) * (COUNT) * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
|
  f2fs_kmalloc(HANDLE,
-	sizeof(THING) * COUNT * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
|
  f2fs_kmalloc(HANDLE,
-	sizeof(THING) * COUNT * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
)

// 3-factor product with 2 sizeof(variable), with redundant parens removed.
@@
expression HANDLE;
expression THING1, THING2;
identifier COUNT;
type TYPE1, TYPE2;
@@

(
  f2fs_kmalloc(HANDLE,
-	sizeof(TYPE1) * sizeof(TYPE2) * COUNT
+	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
  , ...)
|
  f2fs_kmalloc(HANDLE,
-	sizeof(TYPE1) * sizeof(THING2) * (COUNT)
+	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
  , ...)
|
  f2fs_kmalloc(HANDLE,
-	sizeof(THING1) * sizeof(THING2) * COUNT
+	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
  , ...)
|
  f2fs_kmalloc(HANDLE,
-	sizeof(THING1) * sizeof(THING2) * (COUNT)
+	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
  , ...)
|
  f2fs_kmalloc(HANDLE,
-	sizeof(TYPE1) * sizeof(THING2) * COUNT
+	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
  , ...)
|
  f2fs_kmalloc(HANDLE,
-	sizeof(TYPE1) * sizeof(THING2) * (COUNT)
+	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
  , ...)
)

// 3-factor product, only identifiers, with redundant parens removed.
@@
expression HANDLE;
identifier STRIDE, SIZE, COUNT;
@@

(
  f2fs_kmalloc(HANDLE,
-	(COUNT) * STRIDE * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  f2fs_kmalloc(HANDLE,
-	COUNT * (STRIDE) * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  f2fs_kmalloc(HANDLE,
-	COUNT * STRIDE * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  f2fs_kmalloc(HANDLE,
-	(COUNT) * (STRIDE) * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  f2fs_kmalloc(HANDLE,
-	COUNT * (STRIDE) * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  f2fs_kmalloc(HANDLE,
-	(COUNT) * STRIDE * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  f2fs_kmalloc(HANDLE,
-	(COUNT) * (STRIDE) * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  f2fs_kmalloc(HANDLE,
-	COUNT * STRIDE * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
)

// Any remaining multi-factor products, first at least 3-factor products
// when they're not all constants...
@@
expression HANDLE;
expression E1, E2, E3;
constant C1, C2, C3;
@@

(
  f2fs_kmalloc(HANDLE, C1 * C2 * C3, ...)
|
  f2fs_kmalloc(HANDLE,
-	E1 * E2 * E3
+	array3_size(E1, E2, E3)
  , ...)
)

// And then all remaining 2 factors products when they're not all constants.
@@
expression HANDLE;
expression E1, E2;
constant C1, C2;
@@

(
  f2fs_kmalloc(HANDLE, C1 * C2, ...)
|
  f2fs_kmalloc(HANDLE,
-	E1 * E2
+	array_size(E1, E2)
  , ...)
)

Signed-off-by: Kees Cook <keescook@chromium.org>
2018-06-12 16:19:22 -07:00
Kees Cook
fad953ce0b treewide: Use array_size() in vzalloc()
The vzalloc() function has no 2-factor argument form, so multiplication
factors need to be wrapped in array_size(). This patch replaces cases of:

        vzalloc(a * b)

with:
        vzalloc(array_size(a, b))

as well as handling cases of:

        vzalloc(a * b * c)

with:

        vzalloc(array3_size(a, b, c))

This does, however, attempt to ignore constant size factors like:

        vzalloc(4 * 1024)

though any constants defined via macros get caught up in the conversion.

Any factors with a sizeof() of "unsigned char", "char", and "u8" were
dropped, since they're redundant.

The Coccinelle script used for this was:

// Fix redundant parens around sizeof().
@@
type TYPE;
expression THING, E;
@@

(
  vzalloc(
-	(sizeof(TYPE)) * E
+	sizeof(TYPE) * E
  , ...)
|
  vzalloc(
-	(sizeof(THING)) * E
+	sizeof(THING) * E
  , ...)
)

// Drop single-byte sizes and redundant parens.
@@
expression COUNT;
typedef u8;
typedef __u8;
@@

(
  vzalloc(
-	sizeof(u8) * (COUNT)
+	COUNT
  , ...)
|
  vzalloc(
-	sizeof(__u8) * (COUNT)
+	COUNT
  , ...)
|
  vzalloc(
-	sizeof(char) * (COUNT)
+	COUNT
  , ...)
|
  vzalloc(
-	sizeof(unsigned char) * (COUNT)
+	COUNT
  , ...)
|
  vzalloc(
-	sizeof(u8) * COUNT
+	COUNT
  , ...)
|
  vzalloc(
-	sizeof(__u8) * COUNT
+	COUNT
  , ...)
|
  vzalloc(
-	sizeof(char) * COUNT
+	COUNT
  , ...)
|
  vzalloc(
-	sizeof(unsigned char) * COUNT
+	COUNT
  , ...)
)

// 2-factor product with sizeof(type/expression) and identifier or constant.
@@
type TYPE;
expression THING;
identifier COUNT_ID;
constant COUNT_CONST;
@@

(
  vzalloc(
-	sizeof(TYPE) * (COUNT_ID)
+	array_size(COUNT_ID, sizeof(TYPE))
  , ...)
|
  vzalloc(
-	sizeof(TYPE) * COUNT_ID
+	array_size(COUNT_ID, sizeof(TYPE))
  , ...)
|
  vzalloc(
-	sizeof(TYPE) * (COUNT_CONST)
+	array_size(COUNT_CONST, sizeof(TYPE))
  , ...)
|
  vzalloc(
-	sizeof(TYPE) * COUNT_CONST
+	array_size(COUNT_CONST, sizeof(TYPE))
  , ...)
|
  vzalloc(
-	sizeof(THING) * (COUNT_ID)
+	array_size(COUNT_ID, sizeof(THING))
  , ...)
|
  vzalloc(
-	sizeof(THING) * COUNT_ID
+	array_size(COUNT_ID, sizeof(THING))
  , ...)
|
  vzalloc(
-	sizeof(THING) * (COUNT_CONST)
+	array_size(COUNT_CONST, sizeof(THING))
  , ...)
|
  vzalloc(
-	sizeof(THING) * COUNT_CONST
+	array_size(COUNT_CONST, sizeof(THING))
  , ...)
)

// 2-factor product, only identifiers.
@@
identifier SIZE, COUNT;
@@

  vzalloc(
-	SIZE * COUNT
+	array_size(COUNT, SIZE)
  , ...)

// 3-factor product with 1 sizeof(type) or sizeof(expression), with
// redundant parens removed.
@@
expression THING;
identifier STRIDE, COUNT;
type TYPE;
@@

(
  vzalloc(
-	sizeof(TYPE) * (COUNT) * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  vzalloc(
-	sizeof(TYPE) * (COUNT) * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  vzalloc(
-	sizeof(TYPE) * COUNT * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  vzalloc(
-	sizeof(TYPE) * COUNT * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  vzalloc(
-	sizeof(THING) * (COUNT) * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
|
  vzalloc(
-	sizeof(THING) * (COUNT) * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
|
  vzalloc(
-	sizeof(THING) * COUNT * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
|
  vzalloc(
-	sizeof(THING) * COUNT * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
)

// 3-factor product with 2 sizeof(variable), with redundant parens removed.
@@
expression THING1, THING2;
identifier COUNT;
type TYPE1, TYPE2;
@@

(
  vzalloc(
-	sizeof(TYPE1) * sizeof(TYPE2) * COUNT
+	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
  , ...)
|
  vzalloc(
-	sizeof(TYPE1) * sizeof(THING2) * (COUNT)
+	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
  , ...)
|
  vzalloc(
-	sizeof(THING1) * sizeof(THING2) * COUNT
+	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
  , ...)
|
  vzalloc(
-	sizeof(THING1) * sizeof(THING2) * (COUNT)
+	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
  , ...)
|
  vzalloc(
-	sizeof(TYPE1) * sizeof(THING2) * COUNT
+	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
  , ...)
|
  vzalloc(
-	sizeof(TYPE1) * sizeof(THING2) * (COUNT)
+	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
  , ...)
)

// 3-factor product, only identifiers, with redundant parens removed.
@@
identifier STRIDE, SIZE, COUNT;
@@

(
  vzalloc(
-	(COUNT) * STRIDE * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  vzalloc(
-	COUNT * (STRIDE) * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  vzalloc(
-	COUNT * STRIDE * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  vzalloc(
-	(COUNT) * (STRIDE) * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  vzalloc(
-	COUNT * (STRIDE) * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  vzalloc(
-	(COUNT) * STRIDE * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  vzalloc(
-	(COUNT) * (STRIDE) * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  vzalloc(
-	COUNT * STRIDE * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
)

// Any remaining multi-factor products, first at least 3-factor products
// when they're not all constants...
@@
expression E1, E2, E3;
constant C1, C2, C3;
@@

(
  vzalloc(C1 * C2 * C3, ...)
|
  vzalloc(
-	E1 * E2 * E3
+	array3_size(E1, E2, E3)
  , ...)
)

// And then all remaining 2 factors products when they're not all constants.
@@
expression E1, E2;
constant C1, C2;
@@

(
  vzalloc(C1 * C2, ...)
|
  vzalloc(
-	E1 * E2
+	array_size(E1, E2)
  , ...)
)

Signed-off-by: Kees Cook <keescook@chromium.org>
2018-06-12 16:19:22 -07:00
Kees Cook
42bc47b353 treewide: Use array_size() in vmalloc()
The vmalloc() function has no 2-factor argument form, so multiplication
factors need to be wrapped in array_size(). This patch replaces cases of:

        vmalloc(a * b)

with:
        vmalloc(array_size(a, b))

as well as handling cases of:

        vmalloc(a * b * c)

with:

        vmalloc(array3_size(a, b, c))

This does, however, attempt to ignore constant size factors like:

        vmalloc(4 * 1024)

though any constants defined via macros get caught up in the conversion.

Any factors with a sizeof() of "unsigned char", "char", and "u8" were
dropped, since they're redundant.

The Coccinelle script used for this was:

// Fix redundant parens around sizeof().
@@
type TYPE;
expression THING, E;
@@

(
  vmalloc(
-	(sizeof(TYPE)) * E
+	sizeof(TYPE) * E
  , ...)
|
  vmalloc(
-	(sizeof(THING)) * E
+	sizeof(THING) * E
  , ...)
)

// Drop single-byte sizes and redundant parens.
@@
expression COUNT;
typedef u8;
typedef __u8;
@@

(
  vmalloc(
-	sizeof(u8) * (COUNT)
+	COUNT
  , ...)
|
  vmalloc(
-	sizeof(__u8) * (COUNT)
+	COUNT
  , ...)
|
  vmalloc(
-	sizeof(char) * (COUNT)
+	COUNT
  , ...)
|
  vmalloc(
-	sizeof(unsigned char) * (COUNT)
+	COUNT
  , ...)
|
  vmalloc(
-	sizeof(u8) * COUNT
+	COUNT
  , ...)
|
  vmalloc(
-	sizeof(__u8) * COUNT
+	COUNT
  , ...)
|
  vmalloc(
-	sizeof(char) * COUNT
+	COUNT
  , ...)
|
  vmalloc(
-	sizeof(unsigned char) * COUNT
+	COUNT
  , ...)
)

// 2-factor product with sizeof(type/expression) and identifier or constant.
@@
type TYPE;
expression THING;
identifier COUNT_ID;
constant COUNT_CONST;
@@

(
  vmalloc(
-	sizeof(TYPE) * (COUNT_ID)
+	array_size(COUNT_ID, sizeof(TYPE))
  , ...)
|
  vmalloc(
-	sizeof(TYPE) * COUNT_ID
+	array_size(COUNT_ID, sizeof(TYPE))
  , ...)
|
  vmalloc(
-	sizeof(TYPE) * (COUNT_CONST)
+	array_size(COUNT_CONST, sizeof(TYPE))
  , ...)
|
  vmalloc(
-	sizeof(TYPE) * COUNT_CONST
+	array_size(COUNT_CONST, sizeof(TYPE))
  , ...)
|
  vmalloc(
-	sizeof(THING) * (COUNT_ID)
+	array_size(COUNT_ID, sizeof(THING))
  , ...)
|
  vmalloc(
-	sizeof(THING) * COUNT_ID
+	array_size(COUNT_ID, sizeof(THING))
  , ...)
|
  vmalloc(
-	sizeof(THING) * (COUNT_CONST)
+	array_size(COUNT_CONST, sizeof(THING))
  , ...)
|
  vmalloc(
-	sizeof(THING) * COUNT_CONST
+	array_size(COUNT_CONST, sizeof(THING))
  , ...)
)

// 2-factor product, only identifiers.
@@
identifier SIZE, COUNT;
@@

  vmalloc(
-	SIZE * COUNT
+	array_size(COUNT, SIZE)
  , ...)

// 3-factor product with 1 sizeof(type) or sizeof(expression), with
// redundant parens removed.
@@
expression THING;
identifier STRIDE, COUNT;
type TYPE;
@@

(
  vmalloc(
-	sizeof(TYPE) * (COUNT) * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  vmalloc(
-	sizeof(TYPE) * (COUNT) * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  vmalloc(
-	sizeof(TYPE) * COUNT * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  vmalloc(
-	sizeof(TYPE) * COUNT * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  vmalloc(
-	sizeof(THING) * (COUNT) * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
|
  vmalloc(
-	sizeof(THING) * (COUNT) * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
|
  vmalloc(
-	sizeof(THING) * COUNT * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
|
  vmalloc(
-	sizeof(THING) * COUNT * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
)

// 3-factor product with 2 sizeof(variable), with redundant parens removed.
@@
expression THING1, THING2;
identifier COUNT;
type TYPE1, TYPE2;
@@

(
  vmalloc(
-	sizeof(TYPE1) * sizeof(TYPE2) * COUNT
+	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
  , ...)
|
  vmalloc(
-	sizeof(TYPE1) * sizeof(THING2) * (COUNT)
+	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
  , ...)
|
  vmalloc(
-	sizeof(THING1) * sizeof(THING2) * COUNT
+	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
  , ...)
|
  vmalloc(
-	sizeof(THING1) * sizeof(THING2) * (COUNT)
+	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
  , ...)
|
  vmalloc(
-	sizeof(TYPE1) * sizeof(THING2) * COUNT
+	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
  , ...)
|
  vmalloc(
-	sizeof(TYPE1) * sizeof(THING2) * (COUNT)
+	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
  , ...)
)

// 3-factor product, only identifiers, with redundant parens removed.
@@
identifier STRIDE, SIZE, COUNT;
@@

(
  vmalloc(
-	(COUNT) * STRIDE * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  vmalloc(
-	COUNT * (STRIDE) * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  vmalloc(
-	COUNT * STRIDE * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  vmalloc(
-	(COUNT) * (STRIDE) * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  vmalloc(
-	COUNT * (STRIDE) * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  vmalloc(
-	(COUNT) * STRIDE * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  vmalloc(
-	(COUNT) * (STRIDE) * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  vmalloc(
-	COUNT * STRIDE * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
)

// Any remaining multi-factor products, first at least 3-factor products
// when they're not all constants...
@@
expression E1, E2, E3;
constant C1, C2, C3;
@@

(
  vmalloc(C1 * C2 * C3, ...)
|
  vmalloc(
-	E1 * E2 * E3
+	array3_size(E1, E2, E3)
  , ...)
)

// And then all remaining 2 factors products when they're not all constants.
@@
expression E1, E2;
constant C1, C2;
@@

(
  vmalloc(C1 * C2, ...)
|
  vmalloc(
-	E1 * E2
+	array_size(E1, E2)
  , ...)
)

Signed-off-by: Kees Cook <keescook@chromium.org>
2018-06-12 16:19:22 -07:00
Kees Cook
344476e16a treewide: kvmalloc() -> kvmalloc_array()
The kvmalloc() function has a 2-factor argument form, kvmalloc_array(). This
patch replaces cases of:

        kvmalloc(a * b, gfp)

with:
        kvmalloc_array(a * b, gfp)

as well as handling cases of:

        kvmalloc(a * b * c, gfp)

with:

        kvmalloc(array3_size(a, b, c), gfp)

as it's slightly less ugly than:

        kvmalloc_array(array_size(a, b), c, gfp)

This does, however, attempt to ignore constant size factors like:

        kvmalloc(4 * 1024, gfp)

though any constants defined via macros get caught up in the conversion.

Any factors with a sizeof() of "unsigned char", "char", and "u8" were
dropped, since they're redundant.

The Coccinelle script used for this was:

// Fix redundant parens around sizeof().
@@
type TYPE;
expression THING, E;
@@

(
  kvmalloc(
-	(sizeof(TYPE)) * E
+	sizeof(TYPE) * E
  , ...)
|
  kvmalloc(
-	(sizeof(THING)) * E
+	sizeof(THING) * E
  , ...)
)

// Drop single-byte sizes and redundant parens.
@@
expression COUNT;
typedef u8;
typedef __u8;
@@

(
  kvmalloc(
-	sizeof(u8) * (COUNT)
+	COUNT
  , ...)
|
  kvmalloc(
-	sizeof(__u8) * (COUNT)
+	COUNT
  , ...)
|
  kvmalloc(
-	sizeof(char) * (COUNT)
+	COUNT
  , ...)
|
  kvmalloc(
-	sizeof(unsigned char) * (COUNT)
+	COUNT
  , ...)
|
  kvmalloc(
-	sizeof(u8) * COUNT
+	COUNT
  , ...)
|
  kvmalloc(
-	sizeof(__u8) * COUNT
+	COUNT
  , ...)
|
  kvmalloc(
-	sizeof(char) * COUNT
+	COUNT
  , ...)
|
  kvmalloc(
-	sizeof(unsigned char) * COUNT
+	COUNT
  , ...)
)

// 2-factor product with sizeof(type/expression) and identifier or constant.
@@
type TYPE;
expression THING;
identifier COUNT_ID;
constant COUNT_CONST;
@@

(
- kvmalloc
+ kvmalloc_array
  (
-	sizeof(TYPE) * (COUNT_ID)
+	COUNT_ID, sizeof(TYPE)
  , ...)
|
- kvmalloc
+ kvmalloc_array
  (
-	sizeof(TYPE) * COUNT_ID
+	COUNT_ID, sizeof(TYPE)
  , ...)
|
- kvmalloc
+ kvmalloc_array
  (
-	sizeof(TYPE) * (COUNT_CONST)
+	COUNT_CONST, sizeof(TYPE)
  , ...)
|
- kvmalloc
+ kvmalloc_array
  (
-	sizeof(TYPE) * COUNT_CONST
+	COUNT_CONST, sizeof(TYPE)
  , ...)
|
- kvmalloc
+ kvmalloc_array
  (
-	sizeof(THING) * (COUNT_ID)
+	COUNT_ID, sizeof(THING)
  , ...)
|
- kvmalloc
+ kvmalloc_array
  (
-	sizeof(THING) * COUNT_ID
+	COUNT_ID, sizeof(THING)
  , ...)
|
- kvmalloc
+ kvmalloc_array
  (
-	sizeof(THING) * (COUNT_CONST)
+	COUNT_CONST, sizeof(THING)
  , ...)
|
- kvmalloc
+ kvmalloc_array
  (
-	sizeof(THING) * COUNT_CONST
+	COUNT_CONST, sizeof(THING)
  , ...)
)

// 2-factor product, only identifiers.
@@
identifier SIZE, COUNT;
@@

- kvmalloc
+ kvmalloc_array
  (
-	SIZE * COUNT
+	COUNT, SIZE
  , ...)

// 3-factor product with 1 sizeof(type) or sizeof(expression), with
// redundant parens removed.
@@
expression THING;
identifier STRIDE, COUNT;
type TYPE;
@@

(
  kvmalloc(
-	sizeof(TYPE) * (COUNT) * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  kvmalloc(
-	sizeof(TYPE) * (COUNT) * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  kvmalloc(
-	sizeof(TYPE) * COUNT * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  kvmalloc(
-	sizeof(TYPE) * COUNT * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  kvmalloc(
-	sizeof(THING) * (COUNT) * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
|
  kvmalloc(
-	sizeof(THING) * (COUNT) * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
|
  kvmalloc(
-	sizeof(THING) * COUNT * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
|
  kvmalloc(
-	sizeof(THING) * COUNT * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
)

// 3-factor product with 2 sizeof(variable), with redundant parens removed.
@@
expression THING1, THING2;
identifier COUNT;
type TYPE1, TYPE2;
@@

(
  kvmalloc(
-	sizeof(TYPE1) * sizeof(TYPE2) * COUNT
+	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
  , ...)
|
  kvmalloc(
-	sizeof(TYPE1) * sizeof(THING2) * (COUNT)
+	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
  , ...)
|
  kvmalloc(
-	sizeof(THING1) * sizeof(THING2) * COUNT
+	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
  , ...)
|
  kvmalloc(
-	sizeof(THING1) * sizeof(THING2) * (COUNT)
+	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
  , ...)
|
  kvmalloc(
-	sizeof(TYPE1) * sizeof(THING2) * COUNT
+	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
  , ...)
|
  kvmalloc(
-	sizeof(TYPE1) * sizeof(THING2) * (COUNT)
+	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
  , ...)
)

// 3-factor product, only identifiers, with redundant parens removed.
@@
identifier STRIDE, SIZE, COUNT;
@@

(
  kvmalloc(
-	(COUNT) * STRIDE * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kvmalloc(
-	COUNT * (STRIDE) * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kvmalloc(
-	COUNT * STRIDE * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kvmalloc(
-	(COUNT) * (STRIDE) * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kvmalloc(
-	COUNT * (STRIDE) * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kvmalloc(
-	(COUNT) * STRIDE * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kvmalloc(
-	(COUNT) * (STRIDE) * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kvmalloc(
-	COUNT * STRIDE * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
)

// Any remaining multi-factor products, first at least 3-factor products,
// when they're not all constants...
@@
expression E1, E2, E3;
constant C1, C2, C3;
@@

(
  kvmalloc(C1 * C2 * C3, ...)
|
  kvmalloc(
-	(E1) * E2 * E3
+	array3_size(E1, E2, E3)
  , ...)
|
  kvmalloc(
-	(E1) * (E2) * E3
+	array3_size(E1, E2, E3)
  , ...)
|
  kvmalloc(
-	(E1) * (E2) * (E3)
+	array3_size(E1, E2, E3)
  , ...)
|
  kvmalloc(
-	E1 * E2 * E3
+	array3_size(E1, E2, E3)
  , ...)
)

// And then all remaining 2 factors products when they're not all constants,
// keeping sizeof() as the second factor argument.
@@
expression THING, E1, E2;
type TYPE;
constant C1, C2, C3;
@@

(
  kvmalloc(sizeof(THING) * C2, ...)
|
  kvmalloc(sizeof(TYPE) * C2, ...)
|
  kvmalloc(C1 * C2 * C3, ...)
|
  kvmalloc(C1 * C2, ...)
|
- kvmalloc
+ kvmalloc_array
  (
-	sizeof(TYPE) * (E2)
+	E2, sizeof(TYPE)
  , ...)
|
- kvmalloc
+ kvmalloc_array
  (
-	sizeof(TYPE) * E2
+	E2, sizeof(TYPE)
  , ...)
|
- kvmalloc
+ kvmalloc_array
  (
-	sizeof(THING) * (E2)
+	E2, sizeof(THING)
  , ...)
|
- kvmalloc
+ kvmalloc_array
  (
-	sizeof(THING) * E2
+	E2, sizeof(THING)
  , ...)
|
- kvmalloc
+ kvmalloc_array
  (
-	(E1) * E2
+	E1, E2
  , ...)
|
- kvmalloc
+ kvmalloc_array
  (
-	(E1) * (E2)
+	E1, E2
  , ...)
|
- kvmalloc
+ kvmalloc_array
  (
-	E1 * E2
+	E1, E2
  , ...)
)

Signed-off-by: Kees Cook <keescook@chromium.org>
2018-06-12 16:19:22 -07:00
Kees Cook
6396bb2215 treewide: kzalloc() -> kcalloc()
The kzalloc() function has a 2-factor argument form, kcalloc(). This
patch replaces cases of:

        kzalloc(a * b, gfp)

with:
        kcalloc(a * b, gfp)

as well as handling cases of:

        kzalloc(a * b * c, gfp)

with:

        kzalloc(array3_size(a, b, c), gfp)

as it's slightly less ugly than:

        kzalloc_array(array_size(a, b), c, gfp)

This does, however, attempt to ignore constant size factors like:

        kzalloc(4 * 1024, gfp)

though any constants defined via macros get caught up in the conversion.

Any factors with a sizeof() of "unsigned char", "char", and "u8" were
dropped, since they're redundant.

The Coccinelle script used for this was:

// Fix redundant parens around sizeof().
@@
type TYPE;
expression THING, E;
@@

(
  kzalloc(
-	(sizeof(TYPE)) * E
+	sizeof(TYPE) * E
  , ...)
|
  kzalloc(
-	(sizeof(THING)) * E
+	sizeof(THING) * E
  , ...)
)

// Drop single-byte sizes and redundant parens.
@@
expression COUNT;
typedef u8;
typedef __u8;
@@

(
  kzalloc(
-	sizeof(u8) * (COUNT)
+	COUNT
  , ...)
|
  kzalloc(
-	sizeof(__u8) * (COUNT)
+	COUNT
  , ...)
|
  kzalloc(
-	sizeof(char) * (COUNT)
+	COUNT
  , ...)
|
  kzalloc(
-	sizeof(unsigned char) * (COUNT)
+	COUNT
  , ...)
|
  kzalloc(
-	sizeof(u8) * COUNT
+	COUNT
  , ...)
|
  kzalloc(
-	sizeof(__u8) * COUNT
+	COUNT
  , ...)
|
  kzalloc(
-	sizeof(char) * COUNT
+	COUNT
  , ...)
|
  kzalloc(
-	sizeof(unsigned char) * COUNT
+	COUNT
  , ...)
)

// 2-factor product with sizeof(type/expression) and identifier or constant.
@@
type TYPE;
expression THING;
identifier COUNT_ID;
constant COUNT_CONST;
@@

(
- kzalloc
+ kcalloc
  (
-	sizeof(TYPE) * (COUNT_ID)
+	COUNT_ID, sizeof(TYPE)
  , ...)
|
- kzalloc
+ kcalloc
  (
-	sizeof(TYPE) * COUNT_ID
+	COUNT_ID, sizeof(TYPE)
  , ...)
|
- kzalloc
+ kcalloc
  (
-	sizeof(TYPE) * (COUNT_CONST)
+	COUNT_CONST, sizeof(TYPE)
  , ...)
|
- kzalloc
+ kcalloc
  (
-	sizeof(TYPE) * COUNT_CONST
+	COUNT_CONST, sizeof(TYPE)
  , ...)
|
- kzalloc
+ kcalloc
  (
-	sizeof(THING) * (COUNT_ID)
+	COUNT_ID, sizeof(THING)
  , ...)
|
- kzalloc
+ kcalloc
  (
-	sizeof(THING) * COUNT_ID
+	COUNT_ID, sizeof(THING)
  , ...)
|
- kzalloc
+ kcalloc
  (
-	sizeof(THING) * (COUNT_CONST)
+	COUNT_CONST, sizeof(THING)
  , ...)
|
- kzalloc
+ kcalloc
  (
-	sizeof(THING) * COUNT_CONST
+	COUNT_CONST, sizeof(THING)
  , ...)
)

// 2-factor product, only identifiers.
@@
identifier SIZE, COUNT;
@@

- kzalloc
+ kcalloc
  (
-	SIZE * COUNT
+	COUNT, SIZE
  , ...)

// 3-factor product with 1 sizeof(type) or sizeof(expression), with
// redundant parens removed.
@@
expression THING;
identifier STRIDE, COUNT;
type TYPE;
@@

(
  kzalloc(
-	sizeof(TYPE) * (COUNT) * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  kzalloc(
-	sizeof(TYPE) * (COUNT) * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  kzalloc(
-	sizeof(TYPE) * COUNT * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  kzalloc(
-	sizeof(TYPE) * COUNT * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  kzalloc(
-	sizeof(THING) * (COUNT) * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
|
  kzalloc(
-	sizeof(THING) * (COUNT) * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
|
  kzalloc(
-	sizeof(THING) * COUNT * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
|
  kzalloc(
-	sizeof(THING) * COUNT * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
)

// 3-factor product with 2 sizeof(variable), with redundant parens removed.
@@
expression THING1, THING2;
identifier COUNT;
type TYPE1, TYPE2;
@@

(
  kzalloc(
-	sizeof(TYPE1) * sizeof(TYPE2) * COUNT
+	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
  , ...)
|
  kzalloc(
-	sizeof(TYPE1) * sizeof(THING2) * (COUNT)
+	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
  , ...)
|
  kzalloc(
-	sizeof(THING1) * sizeof(THING2) * COUNT
+	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
  , ...)
|
  kzalloc(
-	sizeof(THING1) * sizeof(THING2) * (COUNT)
+	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
  , ...)
|
  kzalloc(
-	sizeof(TYPE1) * sizeof(THING2) * COUNT
+	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
  , ...)
|
  kzalloc(
-	sizeof(TYPE1) * sizeof(THING2) * (COUNT)
+	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
  , ...)
)

// 3-factor product, only identifiers, with redundant parens removed.
@@
identifier STRIDE, SIZE, COUNT;
@@

(
  kzalloc(
-	(COUNT) * STRIDE * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kzalloc(
-	COUNT * (STRIDE) * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kzalloc(
-	COUNT * STRIDE * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kzalloc(
-	(COUNT) * (STRIDE) * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kzalloc(
-	COUNT * (STRIDE) * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kzalloc(
-	(COUNT) * STRIDE * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kzalloc(
-	(COUNT) * (STRIDE) * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kzalloc(
-	COUNT * STRIDE * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
)

// Any remaining multi-factor products, first at least 3-factor products,
// when they're not all constants...
@@
expression E1, E2, E3;
constant C1, C2, C3;
@@

(
  kzalloc(C1 * C2 * C3, ...)
|
  kzalloc(
-	(E1) * E2 * E3
+	array3_size(E1, E2, E3)
  , ...)
|
  kzalloc(
-	(E1) * (E2) * E3
+	array3_size(E1, E2, E3)
  , ...)
|
  kzalloc(
-	(E1) * (E2) * (E3)
+	array3_size(E1, E2, E3)
  , ...)
|
  kzalloc(
-	E1 * E2 * E3
+	array3_size(E1, E2, E3)
  , ...)
)

// And then all remaining 2 factors products when they're not all constants,
// keeping sizeof() as the second factor argument.
@@
expression THING, E1, E2;
type TYPE;
constant C1, C2, C3;
@@

(
  kzalloc(sizeof(THING) * C2, ...)
|
  kzalloc(sizeof(TYPE) * C2, ...)
|
  kzalloc(C1 * C2 * C3, ...)
|
  kzalloc(C1 * C2, ...)
|
- kzalloc
+ kcalloc
  (
-	sizeof(TYPE) * (E2)
+	E2, sizeof(TYPE)
  , ...)
|
- kzalloc
+ kcalloc
  (
-	sizeof(TYPE) * E2
+	E2, sizeof(TYPE)
  , ...)
|
- kzalloc
+ kcalloc
  (
-	sizeof(THING) * (E2)
+	E2, sizeof(THING)
  , ...)
|
- kzalloc
+ kcalloc
  (
-	sizeof(THING) * E2
+	E2, sizeof(THING)
  , ...)
|
- kzalloc
+ kcalloc
  (
-	(E1) * E2
+	E1, E2
  , ...)
|
- kzalloc
+ kcalloc
  (
-	(E1) * (E2)
+	E1, E2
  , ...)
|
- kzalloc
+ kcalloc
  (
-	E1 * E2
+	E1, E2
  , ...)
)

Signed-off-by: Kees Cook <keescook@chromium.org>
2018-06-12 16:19:22 -07:00
Kees Cook
6da2ec5605 treewide: kmalloc() -> kmalloc_array()
The kmalloc() function has a 2-factor argument form, kmalloc_array(). This
patch replaces cases of:

        kmalloc(a * b, gfp)

with:
        kmalloc_array(a * b, gfp)

as well as handling cases of:

        kmalloc(a * b * c, gfp)

with:

        kmalloc(array3_size(a, b, c), gfp)

as it's slightly less ugly than:

        kmalloc_array(array_size(a, b), c, gfp)

This does, however, attempt to ignore constant size factors like:

        kmalloc(4 * 1024, gfp)

though any constants defined via macros get caught up in the conversion.

Any factors with a sizeof() of "unsigned char", "char", and "u8" were
dropped, since they're redundant.

The tools/ directory was manually excluded, since it has its own
implementation of kmalloc().

The Coccinelle script used for this was:

// Fix redundant parens around sizeof().
@@
type TYPE;
expression THING, E;
@@

(
  kmalloc(
-	(sizeof(TYPE)) * E
+	sizeof(TYPE) * E
  , ...)
|
  kmalloc(
-	(sizeof(THING)) * E
+	sizeof(THING) * E
  , ...)
)

// Drop single-byte sizes and redundant parens.
@@
expression COUNT;
typedef u8;
typedef __u8;
@@

(
  kmalloc(
-	sizeof(u8) * (COUNT)
+	COUNT
  , ...)
|
  kmalloc(
-	sizeof(__u8) * (COUNT)
+	COUNT
  , ...)
|
  kmalloc(
-	sizeof(char) * (COUNT)
+	COUNT
  , ...)
|
  kmalloc(
-	sizeof(unsigned char) * (COUNT)
+	COUNT
  , ...)
|
  kmalloc(
-	sizeof(u8) * COUNT
+	COUNT
  , ...)
|
  kmalloc(
-	sizeof(__u8) * COUNT
+	COUNT
  , ...)
|
  kmalloc(
-	sizeof(char) * COUNT
+	COUNT
  , ...)
|
  kmalloc(
-	sizeof(unsigned char) * COUNT
+	COUNT
  , ...)
)

// 2-factor product with sizeof(type/expression) and identifier or constant.
@@
type TYPE;
expression THING;
identifier COUNT_ID;
constant COUNT_CONST;
@@

(
- kmalloc
+ kmalloc_array
  (
-	sizeof(TYPE) * (COUNT_ID)
+	COUNT_ID, sizeof(TYPE)
  , ...)
|
- kmalloc
+ kmalloc_array
  (
-	sizeof(TYPE) * COUNT_ID
+	COUNT_ID, sizeof(TYPE)
  , ...)
|
- kmalloc
+ kmalloc_array
  (
-	sizeof(TYPE) * (COUNT_CONST)
+	COUNT_CONST, sizeof(TYPE)
  , ...)
|
- kmalloc
+ kmalloc_array
  (
-	sizeof(TYPE) * COUNT_CONST
+	COUNT_CONST, sizeof(TYPE)
  , ...)
|
- kmalloc
+ kmalloc_array
  (
-	sizeof(THING) * (COUNT_ID)
+	COUNT_ID, sizeof(THING)
  , ...)
|
- kmalloc
+ kmalloc_array
  (
-	sizeof(THING) * COUNT_ID
+	COUNT_ID, sizeof(THING)
  , ...)
|
- kmalloc
+ kmalloc_array
  (
-	sizeof(THING) * (COUNT_CONST)
+	COUNT_CONST, sizeof(THING)
  , ...)
|
- kmalloc
+ kmalloc_array
  (
-	sizeof(THING) * COUNT_CONST
+	COUNT_CONST, sizeof(THING)
  , ...)
)

// 2-factor product, only identifiers.
@@
identifier SIZE, COUNT;
@@

- kmalloc
+ kmalloc_array
  (
-	SIZE * COUNT
+	COUNT, SIZE
  , ...)

// 3-factor product with 1 sizeof(type) or sizeof(expression), with
// redundant parens removed.
@@
expression THING;
identifier STRIDE, COUNT;
type TYPE;
@@

(
  kmalloc(
-	sizeof(TYPE) * (COUNT) * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  kmalloc(
-	sizeof(TYPE) * (COUNT) * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  kmalloc(
-	sizeof(TYPE) * COUNT * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  kmalloc(
-	sizeof(TYPE) * COUNT * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(TYPE))
  , ...)
|
  kmalloc(
-	sizeof(THING) * (COUNT) * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
|
  kmalloc(
-	sizeof(THING) * (COUNT) * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
|
  kmalloc(
-	sizeof(THING) * COUNT * (STRIDE)
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
|
  kmalloc(
-	sizeof(THING) * COUNT * STRIDE
+	array3_size(COUNT, STRIDE, sizeof(THING))
  , ...)
)

// 3-factor product with 2 sizeof(variable), with redundant parens removed.
@@
expression THING1, THING2;
identifier COUNT;
type TYPE1, TYPE2;
@@

(
  kmalloc(
-	sizeof(TYPE1) * sizeof(TYPE2) * COUNT
+	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
  , ...)
|
  kmalloc(
-	sizeof(TYPE1) * sizeof(THING2) * (COUNT)
+	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
  , ...)
|
  kmalloc(
-	sizeof(THING1) * sizeof(THING2) * COUNT
+	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
  , ...)
|
  kmalloc(
-	sizeof(THING1) * sizeof(THING2) * (COUNT)
+	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
  , ...)
|
  kmalloc(
-	sizeof(TYPE1) * sizeof(THING2) * COUNT
+	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
  , ...)
|
  kmalloc(
-	sizeof(TYPE1) * sizeof(THING2) * (COUNT)
+	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
  , ...)
)

// 3-factor product, only identifiers, with redundant parens removed.
@@
identifier STRIDE, SIZE, COUNT;
@@

(
  kmalloc(
-	(COUNT) * STRIDE * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kmalloc(
-	COUNT * (STRIDE) * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kmalloc(
-	COUNT * STRIDE * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kmalloc(
-	(COUNT) * (STRIDE) * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kmalloc(
-	COUNT * (STRIDE) * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kmalloc(
-	(COUNT) * STRIDE * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kmalloc(
-	(COUNT) * (STRIDE) * (SIZE)
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
|
  kmalloc(
-	COUNT * STRIDE * SIZE
+	array3_size(COUNT, STRIDE, SIZE)
  , ...)
)

// Any remaining multi-factor products, first at least 3-factor products,
// when they're not all constants...
@@
expression E1, E2, E3;
constant C1, C2, C3;
@@

(
  kmalloc(C1 * C2 * C3, ...)
|
  kmalloc(
-	(E1) * E2 * E3
+	array3_size(E1, E2, E3)
  , ...)
|
  kmalloc(
-	(E1) * (E2) * E3
+	array3_size(E1, E2, E3)
  , ...)
|
  kmalloc(
-	(E1) * (E2) * (E3)
+	array3_size(E1, E2, E3)
  , ...)
|
  kmalloc(
-	E1 * E2 * E3
+	array3_size(E1, E2, E3)
  , ...)
)

// And then all remaining 2 factors products when they're not all constants,
// keeping sizeof() as the second factor argument.
@@
expression THING, E1, E2;
type TYPE;
constant C1, C2, C3;
@@

(
  kmalloc(sizeof(THING) * C2, ...)
|
  kmalloc(sizeof(TYPE) * C2, ...)
|
  kmalloc(C1 * C2 * C3, ...)
|
  kmalloc(C1 * C2, ...)
|
- kmalloc
+ kmalloc_array
  (
-	sizeof(TYPE) * (E2)
+	E2, sizeof(TYPE)
  , ...)
|
- kmalloc
+ kmalloc_array
  (
-	sizeof(TYPE) * E2
+	E2, sizeof(TYPE)
  , ...)
|
- kmalloc
+ kmalloc_array
  (
-	sizeof(THING) * (E2)
+	E2, sizeof(THING)
  , ...)
|
- kmalloc
+ kmalloc_array
  (
-	sizeof(THING) * E2
+	E2, sizeof(THING)
  , ...)
|
- kmalloc
+ kmalloc_array
  (
-	(E1) * E2
+	E1, E2
  , ...)
|
- kmalloc
+ kmalloc_array
  (
-	(E1) * (E2)
+	E1, E2
  , ...)
|
- kmalloc
+ kmalloc_array
  (
-	E1 * E2
+	E1, E2
  , ...)
)

Signed-off-by: Kees Cook <keescook@chromium.org>
2018-06-12 16:19:22 -07:00
Silvio Cesare
353748a359 UBIFS: Fix potential integer overflow in allocation
There is potential for the size and len fields in ubifs_data_node to be
too large causing either a negative value for the length fields or an
integer overflow leading to an incorrect memory allocation. Likewise,
when the len field is small, an integer underflow may occur.

Signed-off-by: Silvio Cesare <silvio.cesare@gmail.com>
Fixes: 1e51764a3c ("UBIFS: add new flash file system")
Cc: stable@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
2018-06-12 16:19:22 -07:00
Matthew Wilcox
a3ac973076 Convert jffs2 acl to struct_size
Need to tell the compiler that the acl entries follow the acl header.

Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
2018-06-12 16:19:22 -07:00
Linus Torvalds
a205f0c974 Changes since last update:
- Strengthen metadata checking to avoid ASSERTing on bad disk contents
 - Validate btree records that are being retrieved for clients
 - Strengthen root inode verification
 - Convert license blurbs to SPDX tags
 - Enable changing DAX flag on directories
 - Fix some writeback deadlocks in reflink
 - Refactor out some old xfs helpers
 - Move type verifiers to a separate file
 - Fix some fuzzer crashes
 - Various other bug fixes
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAlsfUegACgkQ+H93GTRK
 tOv0sw/9HW63zhuzGs9uihmyGvtqNeeUBfKc+3ovJ80wnhYOa0n8yYCBORS1EaMS
 YPk74IbD0yAak0H9ePdOpont43gGVTDox6/K8+6rPFnWtX30Z2/ckb6BWE4UfoW8
 QeojpB2+aS6fqfO1wcSb3i//XRu4h90ORQY0xNkHYcN4GWwIDwPCyBf+AT9HH1E+
 GFHtB3QWANZg6LRT7X0GVgz5r68lzyxX1WisJ4uAm0NwKR5zVb9NWFCcOszQ45Ky
 +YFw4kfgithbIHlwTpo3LrvQk7+cBhlSpWuASZOYjugxcQ2d85B/+9mF/QDnLOey
 ddbO6WK+wo0KZImpFvOOQZY07cO7vtWwkWHraz0PkUdaEab5rcnooLoJg9UTMZa4
 WT8wM8CrX1kkFvJQCuAMV9jblovjETeYhHfG8ak8Z/lWc3WEnEBUFQiO9ZVQdiAv
 B02xMmpOkfi0fqRCg6li9u3CJtN+2vxPiNEME3lz5zdY5aE2aXSmCspvP3aPVZMt
 y1fZ90u5NONz6Q9WrIh0plEru4oynhwVuqRrnVRDPCT4X64IZXuf/fBmYqrfZGmJ
 K45P/LQDvfcHj3xBLhfkKv5OpXtyYgDtLSBNqYAYrcGS4sW7Z4Ts8ohqcOhF1OqR
 g3mFp75aO4Ekw6hFbg9CRX13G4mu80BmnRKDVwFjThkl6d0Xyxw=
 =SD3u
 -----END PGP SIGNATURE-----

Merge tag 'xfs-4.18-merge-10' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull more xfs updates from Darrick Wong:
 "Here's the second round of patches for XFS for 4.18. Most of the
  commits are small cleanups, bug fixes, and continued strengthening of
  metadata verifiers; the bulk of the diff is the conversion of the
  fs/xfs/ tree to use SPDX tags.

  This series has been run through a full xfstests run over the weekend
  and through a quick xfstests run against this morning's master, with
  no major failures reported.

  Summary:

   - Strengthen metadata checking to avoid ASSERTing on bad disk
     contents

   - Validate btree records that are being retrieved for clients

   - Strengthen root inode verification

   - Convert license blurbs to SPDX tags

   - Enable changing DAX flag on directories

   - Fix some writeback deadlocks in reflink

   - Refactor out some old xfs helpers

   - Move type verifiers to a separate file

   - Fix some fuzzer crashes

   - Various other bug fixes"

* tag 'xfs-4.18-merge-10' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (31 commits)
  xfs: update incore per-AG inode count
  xfs: replace do_mod with native operations
  xfs: don't call xfs_da_shrink_inode with NULL bp
  xfs: clean up MIN/MAX
  xfs: move various type verifiers to common file
  xfs: xfs_reflink_convert_cow() memory allocation deadlock
  xfs: setup VFS i_rwsem lockdep state correctly
  xfs: fix string handling in label get/set functions
  xfs: convert to SPDX license tags
  xfs: validate btree records on retrieval
  xfs: push corruption -> ESTALE conversion to xfs_nfs_get_inode()
  xfs: verify root inode more thoroughly
  xfs: verify COW extent size hint is valid in inode verifier
  xfs: verify extent size hint is valid in inode verifier
  xfs: catch bad stripe alignment configurations
  iomap: fsync swap files before iterating mappings
  xfs: use xfs_trans_getsb in xfs_sync_sb_buf
  xfs: don't assert on corrupted unlinked inode list
  xfs: explicitly pass buffer size to xfs_corruption_error
  xfs: don't assert when on-disk btree pointers are garbage
  ...
2018-06-12 15:49:00 -07:00
Geert Uytterhoeven
ea8781e5e7 autofs: Fix typo s/thenew new/the new/ in AUTOFS4_FS description
Fixes: a2225d931f ("autofs: remove left-over autofs4 stubs")
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-06-12 12:31:19 -07:00
Linus Torvalds
0725d4e1b8 NFS client updates for Linux 4.18
Highlights include:
 
 Stable fixes:
 - Fix a 1-byte stack overflow in nfs_idmap_read_and_verify_message
 - Fix a hang due to incorrect error returns in rpcrdma_convert_iovs()
 - Revert an incorrect change to the NFSv4.1 callback channel
 - Fix a bug in the NFSv4.1 sequence error handling
 
 Features and optimisations:
 - Support for piggybacking a LAYOUTGET operation to the OPEN compound
 - RDMA performance enhancements to deal with transport congestion
 - Add proper SPDX tags for NetApp-contributed RDMA source
 - Do not request delegated file attributes (size+change) from the server
 - Optimise away a GETATTR in the lookup revalidate code when doing NFSv4 OPEN
 - Optimise away unnecessary lookups for rename targets
 - Misc performance improvements when freeing NFSv4 delegations
 
 Bugfixes and cleanups:
 - Try to fail quickly if proto=rdma
 - Clean up RDMA receive trace points
 - Fix sillyrename to return the delegation when appropriate
 - Misc attribute revalidation fixes
 - Immediately clear the pNFS layout on a file when the server returns ESTALE
 - Return NFS4ERR_DELAY when delegation/layout recalls fail due to igrab()
 - Fix the client behaviour on NFS4ERR_SEQ_FALSE_RETRY
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJbH8gIAAoJEA4mA3inWBJcpzYQAJYY3ykt9oLQgm/2b/D/weDe
 6890M9W5nIeuZq5soWSpYsZTxqIFbGV4laG/eCTW1gUN1TitSZsoOp7kqhRHXOjq
 Rv3ZvjlZsP2qv2SnzsEmhJsynfyB46d19smSTJhgQ8dnXhaZv04Wsd4krLHx0z6p
 uUUis5Q1m+vL7HsFPp3iUareO/DFKeSkw2cQ2V5ksTIEiAzX7GC+Ex/KKWf82nrJ
 hm7+Nq7rLf1QHJkQvsc3fYCMR4gIzEwUu6F8RyxCoAVgD6O90Hx6NbxnINaHDD4N
 U0nRP5LwCyN9hbPWvwcH7Sn4ePDTos2yj2tFO5NP9btTLDVLFSGYZ2a74d9PRdAf
 9jn6f6juSDwI7T6NXvkHzzkJG6Or9ABAUZo+yX5JoD6lmgOcPUJpLRy6fu7UxAuN
 a5OZ7d9edYpOi0Kys8sDSIlLlxZtFkvybOMVuI3dSHsI+c0g39w8oarpqT2wXWMs
 /ZtFz0FCreHhKkNtz7Z49z1UQHDv/XYM0WkcO+eaeK58RLIEE0pZHoMvPKP63lkI
 nbbgHvBRAu38Jtvvu65Hpb/VpBcqNGM5hjN1cfW/BOqAPKW23s4vWVj+/1silfW/
 uw0MkNrDC9endoALp/YMCcMwPvEw9Awt9y4KjMgfVgSnKwXd0HaSZ2zE6aJU3Wry
 Fy2Tv0e0OH3z9Bi/LNuJ
 =YWSl
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-4.18-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS client updates from Trond Myklebust:
 "Highlights include:

  Stable fixes:

   - Fix a 1-byte stack overflow in nfs_idmap_read_and_verify_message

   - Fix a hang due to incorrect error returns in rpcrdma_convert_iovs()

   - Revert an incorrect change to the NFSv4.1 callback channel

   - Fix a bug in the NFSv4.1 sequence error handling

  Features and optimisations:

   - Support for piggybacking a LAYOUTGET operation to the OPEN compound

   - RDMA performance enhancements to deal with transport congestion

   - Add proper SPDX tags for NetApp-contributed RDMA source

   - Do not request delegated file attributes (size+change) from the
     server

   - Optimise away a GETATTR in the lookup revalidate code when doing
     NFSv4 OPEN

   - Optimise away unnecessary lookups for rename targets

   - Misc performance improvements when freeing NFSv4 delegations

  Bugfixes and cleanups:

   - Try to fail quickly if proto=rdma

   - Clean up RDMA receive trace points

   - Fix sillyrename to return the delegation when appropriate

   - Misc attribute revalidation fixes

   - Immediately clear the pNFS layout on a file when the server returns
     ESTALE

   - Return NFS4ERR_DELAY when delegation/layout recalls fail due to
     igrab()

   - Fix the client behaviour on NFS4ERR_SEQ_FALSE_RETRY"

* tag 'nfs-for-4.18-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (80 commits)
  skip LAYOUTRETURN if layout is invalid
  NFSv4.1: Fix the client behaviour on NFS4ERR_SEQ_FALSE_RETRY
  NFSv4: Fix a typo in nfs41_sequence_process
  NFSv4: Revert commit 5f83d86cf5 ("NFSv4.x: Fix wraparound issues..")
  NFSv4: Return NFS4ERR_DELAY when a layout recall fails due to igrab()
  NFSv4: Return NFS4ERR_DELAY when a delegation recall fails due to igrab()
  NFSv4.0: Remove transport protocol name from non-UCS client ID
  NFSv4.0: Remove cl_ipaddr from non-UCS client ID
  NFSv4: Fix a compiler warning when CONFIG_NFS_V4_1 is undefined
  NFS: Filter cache invalidation when holding a delegation
  NFS: Ignore NFS_INO_REVAL_FORCED in nfs_check_inode_attributes()
  NFS: Improve caching while holding a delegation
  NFS: Fix attribute revalidation
  NFS: fix up nfs_setattr_update_inode
  NFSv4: Ensure the inode is clean when we set a delegation
  NFSv4: Ignore NFS_INO_REVAL_FORCED in nfs4_proc_access
  NFSv4: Don't ask for delegated attributes when adding a hard link
  NFSv4: Don't ask for delegated attributes when revalidating the inode
  NFS: Pass the inode down to the getattr() callback
  NFSv4: Don't request size+change attribute if they are delegated to us
  ...
2018-06-12 10:09:03 -07:00
Linus Torvalds
89e255678f A relatively quiet cycle for nfsd. The largest piece is an RDMA update
from Chuck Lever with new trace points, miscellaneous cleanups, and
 streamlining of the send and receive paths.  Other than that, some
 miscellaneous bugfixes.
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJbHtKUAAoJECebzXlCjuG+dfgP/2Z9PiJXlxKC2iISgkfMGmBd
 MmWZYekYMtCe5raoiI720W5cGL7uBLoKnc+r57+n7bEGxV9OFwtspmKGn17P/zrY
 YcBIdN7gjpqn8wrflLR4D09bGpnmaZG26jIt/v0TS+N1aFKO3gNXb0ZVSjUadlI0
 UsKRbYxr8qucIENVtXhfA0eRivddadsKopAEwflUrxf+8oEaYszPFUfNXcGDpdHK
 +6D2lFjr/Fn+z97Rbz/G3fMfldpYhUOpH28DOiCuKEpgamK3dYjx1WoGUANxcj3o
 RsbHGZnMR6842Nj5aHus0k6Ao9bgqt6lx+jKlkvWYK+G2EfMfV9Z1gAipPY+IMbd
 Zk5A4pnFpI1UG3sUlcnpaxAM/pHBs7heYGqj0hyocG8rB4V7SDZxp21Lv1fjTH/A
 XHAkdiT4iSgI11J8YbmDBR1S7bAnfNm7GT24DsAkZLzh2f5Miq5m/ZMxDxQLAFCJ
 3YKo2aNVjKvA/aOKDe5RMLZUhnmuhb8aMIDuQY2Ir1EK4S+7EYOiYAvqlbJrM3Ro
 aLmb9BUzRRWmRydMKOeGkWiMj49lHRW6oJxvb33PDZEEqW/AlvmYEyMGfjhXzPDE
 OZkvbdYrni4n5YboplxNnJyL0NJ6l5YAikV94SBWBknrnNv1psSZbDKoIgp2ghhQ
 rdP842qSmDiZiXVlTr3e
 =PuEk
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-4.18' of git://linux-nfs.org/~bfields/linux

Pull nfsd updates from Bruce Fields:
 "A relatively quiet cycle for nfsd.

  The largest piece is an RDMA update from Chuck Lever with new trace
  points, miscellaneous cleanups, and streamlining of the send and
  receive paths.

  Other than that, some miscellaneous bugfixes"

* tag 'nfsd-4.18' of git://linux-nfs.org/~bfields/linux: (26 commits)
  nfsd: fix error handling in nfs4_set_delegation()
  nfsd: fix potential use-after-free in nfsd4_decode_getdeviceinfo
  Fix 16-byte memory leak in gssp_accept_sec_context_upcall
  svcrdma: Fix incorrect return value/type in svc_rdma_post_recvs
  svcrdma: Remove unused svc_rdma_op_ctxt
  svcrdma: Persistently allocate and DMA-map Send buffers
  svcrdma: Simplify svc_rdma_send()
  svcrdma: Remove post_send_wr
  svcrdma: Don't overrun the SGE array in svc_rdma_send_ctxt
  svcrdma: Introduce svc_rdma_send_ctxt
  svcrdma: Clean up Send SGE accounting
  svcrdma: Refactor svc_rdma_dma_map_buf
  svcrdma: Allocate recv_ctxt's on CPU handling Receives
  svcrdma: Persistently allocate and DMA-map Receive buffers
  svcrdma: Preserve Receive buffer until svc_rdma_sendto
  svcrdma: Simplify svc_rdma_recv_ctxt_put
  svcrdma: Remove sc_rq_depth
  svcrdma: Introduce svc_rdma_recv_ctxt
  svcrdma: Trace key RDMA API events
  svcrdma: Trace key RPC/RDMA protocol events
  ...
2018-06-12 09:49:33 -07:00
Olga Kornievskaia
93b7f7ad20 skip LAYOUTRETURN if layout is invalid
Currently, when IO to DS fails, client returns the layout and
retries against the MDS. However, then on umounting (inode eviction)
it returns the layout again.

This is because pnfs_return_layout() was changed in
commit d78471d32b ("pnfs/blocklayout: set PNFS_LAYOUTRETURN_ON_ERROR")
to always set NFS_LAYOUT_RETURN_REQUESTED so even if we returned
the layout, it will be returned again. Instead, let's also check
if we have already marked the layout invalid.

Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-06-12 08:48:04 -04:00
Darrick J. Wong
89e9b5c091 xfs: update incore per-AG inode count
For whatever reason we never actually update pagi_count (the in-core
perag inode count) when we allocate or free inode chunks.  Online scrub
is going to use it, so we need to fix the accounting.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2018-06-11 21:53:52 -07:00
Linus Torvalds
d54d35c501 f2fs-for-4.18-rc1
In this round, we've mainly focused on discard, aka unmap, control along with
 fstrim for Android-specific usage model. In addition, we've fixed writepage flow
 which returned EAGAIN previously resulting in EIO of fsync(2) due to mapping's
 error state. In order to avoid old MM bug [1], we decided not to use __GFP_ZERO
 for the mapping for node and meta page caches. As always, we've cleaned up many
 places for future fsverity and symbol conflicts.
 
 Enhancement:
  - do discard/fstrim in lower priority considering fs utilization
  - split large discard commands into smaller ones for better responsiveness
  - add more sanity checks to address syzbot reports
  - add a mount option, fsync_mode=nobarrier, which can reduce # of cache flushes
  - clean up symbol namespace with modified function names
  - be strict on block allocation and IO control in corner cases
 
 Bug fix:
  - don't use __GFP_ZERO for mappings
  - fix error reports in writepage to avoid fsync() failure
  - avoid selinux denial on CAP_RESOURCE on resgid/resuid
  - fix some subtle race conditions in GC/atomic writes/shutdown
  - fix overflow bugs in sanity_check_raw_super
  - fix missing bits on get_flags
 
 Clean-up:
  - prepare the generic flow for future fsverity integration
  - fix some broken coding standard
 
 [1] https://lkml.org/lkml/2018/4/8/661
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE00UqedjCtOrGVvQiQBSofoJIUNIFAlsepb8ACgkQQBSofoJI
 UNJdSw/+IhrYJFkJEN/pV4M5xSjYirl/P2WJ4AGi6HcpjEGmaDiBi2whod1Jw2NE
 1auSMiby7K91VAmPvxMmmLhOdC8XgJ8jwY1nEaZMfmMXohlaD3FDY5bzYf5rJDF4
 J184P6xUZ2IKlFVA4prwNQgYi3awPthVu1lxbFPp8GUHDbmr5ZXEysxPDzz2O0Em
 oE7WmklmyCHJPhmg/EcVXfF/Ekf3zMOVR+EI2otcDjnWIQioVetIK8CKi0MM4bkG
 X8Z318ANjGTd42woupXIzsiTrMRONY7zzkUvE+S6tfUjKZoIdofDM5OIXMdOxpxL
 DZ53WrwfeB74igD8jDZgqD6OaonIfDfCuKrwUASFAC2Ou4h3apj3ckUzoHtAhEUL
 z5yTSKTrtfuoSufhBp+nKKs3ijDgms76arw8x/pPdN6D6xDwIJtBPxC2sObPaj35
 damv4GyM4+sbhGO/Gbie2q6za55IvYFZc7JNCC2D2K5tnBmUaa7/XdvxcyigniGk
 AZgkaddHePkAZpa5AYYirZR8bd7IFds0+m6VcybG0/pYb0qPEcI6U4mujBSCIwVy
 kXuD7su3jNjj6hWnCl5PSQo8yBWS5H8c6/o+5XHozzYA91dsLAmD8entuCreg6Hp
 NaIFio0qKULweLK86f66qQTsRPMpYRAtqPS0Ew0+3llKMcrlRp4=
 =JrW7
 -----END PGP SIGNATURE-----

Merge tag 'f2fs-for-4.18' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs

Pull f2fs updates from Jaegeuk Kim:
 "In this round, we've mainly focused on discard, aka unmap, control
  along with fstrim for Android-specific usage model. In addition, we've
  fixed writepage flow which returned EAGAIN previously resulting in EIO
  of fsync(2) due to mapping's error state. In order to avoid old MM bug
  [1], we decided not to use __GFP_ZERO for the mapping for node and
  meta page caches. As always, we've cleaned up many places for future
  fsverity and symbol conflicts.

  Enhancements:
   - do discard/fstrim in lower priority considering fs utilization
   - split large discard commands into smaller ones for better responsiveness
   - add more sanity checks to address syzbot reports
   - add a mount option, fsync_mode=nobarrier, which can reduce # of cache flushes
   - clean up symbol namespace with modified function names
   - be strict on block allocation and IO control in corner cases

  Bug fixes:
   - don't use __GFP_ZERO for mappings
   - fix error reports in writepage to avoid fsync() failure
   - avoid selinux denial on CAP_RESOURCE on resgid/resuid
   - fix some subtle race conditions in GC/atomic writes/shutdown
   - fix overflow bugs in sanity_check_raw_super
   - fix missing bits on get_flags

  Clean-ups:
   - prepare the generic flow for future fsverity integration
   - fix some broken coding standard"

[1] https://lkml.org/lkml/2018/4/8/661

* tag 'f2fs-for-4.18' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (79 commits)
  f2fs: fix to clear FI_VOLATILE_FILE correctly
  f2fs: let sync node IO interrupt async one
  f2fs: don't change wbc->sync_mode
  f2fs: fix to update mtime correctly
  fs: f2fs: insert space around that ':' and ', '
  fs: f2fs: add missing blank lines after declarations
  fs: f2fs: changed variable type of offset "unsigned" to "loff_t"
  f2fs: clean up symbol namespace
  f2fs: make set_de_type() static
  f2fs: make __f2fs_write_data_pages() static
  f2fs: fix to avoid accessing cross the boundary
  f2fs: fix to let caller retry allocating block address
  disable loading f2fs module on PAGE_SIZE > 4KB
  f2fs: fix error path of move_data_page
  f2fs: don't drop dentry pages after fs shutdown
  f2fs: fix to avoid race during access gc_thread pointer
  f2fs: clean up with clear_radix_tree_dirty_tag
  f2fs: fix to don't trigger writeback during recovery
  f2fs: clear discard_wake earlier
  f2fs: let discard thread wait a little longer if dev is busy
  ...
2018-06-11 10:16:13 -07:00
Linus Torvalds
a2225d931f autofs: remove left-over autofs4 stubs
There's no need to retain the fs/autofs4 directory for backward
compatibility.

Adding an AUTOFS4_FS fragment to the autofs Kconfig and a module alias
for autofs4 is sufficient for almost all cases. Not keeping fs/autofs4
remnants will prevent "insmod <path>/autofs4/autofs4.ko" from working
but this shouldn't be used in automation scripts rather than
modprobe(8).

There were some comments about things to look out for with the module
rename in the fs/autofs4/Kconfig that is removed by this patch, see the
commit patch if you are interested.

One potential problem with this change is that when the
fs/autofs/Kconfig fragment for AUTOFS4_FS is removed any AUTOFS4_FS
entries will be removed from the kernel config, resulting in no autofs
file system being built if there is no AUTOFS_FS entry also.

This would have also happened if the fs/autofs4 remnants had remained
and is most likely to be a problem with automated builds.

Please check your build configurations before the removal which will
occur after the next couple of kernel releases.

Acked-by: Ian Kent <raven@themaw.net>
[ With edits and commit message from Ian Kent ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-06-11 08:22:34 -07:00
Qu Wenruo
ac0b4145d6 btrfs: scrub: Don't use inode pages for device replace
[BUG]
Btrfs can create compressed extent without checksum (even though it
shouldn't), and if we then try to replace device containing such extent,
the result device will contain all the uncompressed data instead of the
compressed one.

Test case already submitted to fstests:
https://patchwork.kernel.org/patch/10442353/

[CAUSE]
When handling compressed extent without checksum, device replace will
goe into copy_nocow_pages() function.

In that function, btrfs will get all inodes referring to this data
extents and then use find_or_create_page() to get pages direct from that
inode.

The problem here is, pages directly from inode are always uncompressed.
And for compressed data extent, they mismatch with on-disk data.
Thus this leads to corrupted compressed data extent written to replace
device.

[FIX]
In this attempt, we could just remove the "optimization" branch, and let
unified scrub_pages() to handle it.

Although scrub_pages() won't bother reusing page cache, it will be a
little slower, but it does the correct csum checking and won't cause
such data corruption caused by "optimization".

Note about the fix: this is the minimal fix that can be backported to
older stable trees without conflicts. The whole callchain from
copy_nocow_pages() can be deleted, and will be in followup patches.

Fixes: ff023aac31 ("Btrfs: add code to scrub to copy read data to another disk")
CC: stable@vger.kernel.org # 4.4+
Reported-by: James Harvey <jamespharvey20@gmail.com>
Reviewed-by: James Harvey <jamespharvey20@gmail.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
[ remove code removal, add note why ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-06-11 15:59:14 +02:00
Al Viro
87a3002af9 vmsplice(): lift importing iovec into vmsplice(2) and compat counterpart
... getting rid of transformations in the latter - just use
compat_import_iovec().

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-06-11 02:14:04 -04:00
Linus Torvalds
ab0b2e5932 This pull request contains updates for both UBI and UBIFS:
- The UBI on-disk format header file is now dual licensed
 - New way to detect Fastmap problems during runtime
 - Bugfix for Fastmap
 - Minor updates for UBIFS (spelling, comments, vm_fault_t, ...)
 -----BEGIN PGP SIGNATURE-----
 
 iQJKBAABCAA0FiEEdgfidid8lnn52cLTZvlZhesYu8EFAlsdi6AWHHJpY2hhcmRA
 c2lnbWEtc3Rhci5hdAAKCRBm+VmF6xi7wVimEADam9dNKs7hVAlSR5G8fgoFMiPI
 PgP/cQ719PAYh8Zub0w8fIIvtSCaVZmPX/K40w4KAgP/jdQ/v4QD3ZYdHO3vn3/M
 6WI/cEB/Cn5lJp78tt14cGlqG3+xnlwyc9EEeiMCPDWmoSf66J7fwuL3xRq4dDet
 CBISYrF8ZMe6BJohT9kAYeqYoYLzALUmZoOx1DGhX7b2SEaLphCyW0A0rkxoHbYG
 sxAfUiR4/bdaFfqLAmEMgoovIHGYmOdnMJ3Hk+uUVrPEfBzxDQuGfpevDJ1u99hS
 D4qdvY3+VAcTpTK0XJUBgPaVpkYYMS5nfHtpObbyOT/kBmgCloCAmtvVcznxUwJ7
 R+8UBZH/SrxUw+R9JCP1hFgb/PlEAE0+Vsc2jgvqWOQ8xnOqojG5FdqAueGanX+l
 /on2/HNoZp5C8x15+cB9SiQwfdoZrH7vr9uzi1lqaWSV+ZMDpvG/QFMX1QX1gLl9
 TFeQJBojE41y+hjTnnUQZdy1oPaisffU44ymKlf1FR2SdzW6d7wIRBd0jnB+7Pjz
 KocSAA2kqDUcCjUtLFSSVfE7kUL9wwJQUfUxeQFViTGDvAnLcmy1a6mqQ1K5YWq9
 e5mpMQRWzoq0XrPyaVI5EEhvaqWalJwIa6GA7ZksOvS4W1pqFLLyFOKkHejUPou8
 lxcw0dDsyZdmnDRuMA==
 =b0lt
 -----END PGP SIGNATURE-----

Merge tag 'upstream-4.18-rc1' of git://git.infradead.org/linux-ubifs

Pull UBI and UBIFS updates from Richard Weinberger:

 - the UBI on-disk format header file is now dual licensed

 - new way to detect Fastmap problems during runtime

 - bugfix for Fastmap

 - minor updates for UBIFS (spelling, comments, vm_fault_t, ...)

* tag 'upstream-4.18-rc1' of git://git.infradead.org/linux-ubifs:
  mtd: ubi: Update ubi-media.h to dual license
  ubi: fastmap: Detect EBA mismatches on-the-fly
  ubi: fastmap: Check each mapping only once
  ubi: fastmap: Correctly handle interrupted erasures in EBA
  ubi: fastmap: Cancel work upon detach
  ubifs: lpt: Fix wrong pnode number range in comment
  ubifs: gc: Fix typo
  ubifs: log: Some spelling fixes
  ubifs: Spelling fix someting -> something
  ubifs: journal: Remove wrong comment
  ubifs: remove set but never used variable
  ubifs, xattr: remove misguided quota flags
  fs: ubifs: Adding new return type vm_fault_t
2018-06-10 15:52:09 -07:00
Linus Torvalds
0c14e43a42 one smb3 (ACL related) fix for stable, one SMB3 security enhancement (when mounting -t smb3 forbid less secure dialects), and some RDMA and compounding fixes
-----BEGIN PGP SIGNATURE-----
 
 iQHHBAABCgAxFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAlsdWXsTHHNtZnJlbmNo
 QGdtYWlsLmNvbQAKCRCKLL1wB3JPUXiyC/wJdnwM+R2JGtj1ygZqPfS6/xWbCI0F
 uXfn9a09nrMVbM2CQCDChiqoU2yvg9cKqc2d/UhssuwJYdfNNYSVe5MDkYYCADRn
 WSvCQ4+ZDQEl65pcjVKcOlhwl73cIVQmb6yaxMA/scZx+ikfe67EgJEEJZHJPhFO
 WJnMA5IUMAQFw5KxBdKtuVL0ixHBYvtrtQ9HAyPq0GTmCsTyb0GcbeMYaqRwpCMS
 R4NXyTr+s68UUi+iM9r+I6nYqml5L2EBFjrSwRbu7e6PIRF5fEMnKzj0WSmbgsSG
 KY2b/YWSskN5p5dkU+cLETXfjOjg8ugROJzz4LYkS/dUn+AJpg7IOwUNfr/a0yO5
 dHMwpa9SltPOslcHZgnBUsfqLHZSP0y/QODVhhc80uapxmllA4CHOc+lgG5i0VlB
 mTS4SQNRPw2NINoGCr+C65ghcJJyf6ivP6J3PoCjobo76yte8TiyauxjuDIJgY8T
 h2Wc/fWgN9lc9IUcpvedUjxoacG+rMSopQc=
 =yibA
 -----END PGP SIGNATURE-----

Merge tag '4.18-fixes-smb3' of git://git.samba.org/sfrench/cifs-2.6

Pull cifs fixes from Steve French:

 - one smb3 (ACL related) fix for stable

 - one SMB3 security enhancement (when mounting -t smb3 forbid less
   secure dialects)

 - some RDMA and compounding fixes

* tag '4.18-fixes-smb3' of git://git.samba.org/sfrench/cifs-2.6:
  cifs: fix a buffer leak in smb2_query_symlink
  smb3: do not allow insecure cifs mounts when using smb3
  CIFS: Fix NULL ptr deref
  CIFS: fix encryption in SMB3.1.1
  CIFS: Pass page offset for encrypting
  CIFS: Pass page offset for calculating signature
  CIFS: SMBD: Support page offset in memory registration
  CIFS: SMBD: Support page offset in RDMA recv
  CIFS: SMBD: Support page offset in RDMA send
  CIFS: When sending data on socket, pass the correct page offset
  CIFS: Introduce helper function to get page offset and length in smb_rqst
  CIFS: Calculate the correct request length based on page offset and tail size
  cifs: For SMB2 security informaion query, check for minimum sized security descriptor instead of sizeof FileAllInformation class
  CIFS: Fix signing for SMB2/3
2018-06-10 10:53:31 -07:00
Linus Torvalds
d82991a868 Merge branch 'core-rseq-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull restartable sequence support from Thomas Gleixner:
 "The restartable sequences syscall (finally):

  After a lot of back and forth discussion and massive delays caused by
  the speculative distraction of maintainers, the core set of
  restartable sequences has finally reached a consensus.

  It comes with the basic non disputed core implementation along with
  support for arm, powerpc and x86 and a full set of selftests

  It was exposed to linux-next earlier this week, so it does not fully
  comply with the merge window requirements, but there is really no
  point to drag it out for yet another cycle"

* 'core-rseq-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  rseq/selftests: Provide Makefile, scripts, gitignore
  rseq/selftests: Provide parametrized tests
  rseq/selftests: Provide basic percpu ops test
  rseq/selftests: Provide basic test
  rseq/selftests: Provide rseq library
  selftests/lib.mk: Introduce OVERRIDE_TARGETS
  powerpc: Wire up restartable sequences system call
  powerpc: Add syscall detection for restartable sequences
  powerpc: Add support for restartable sequences
  x86: Wire up restartable sequence system call
  x86: Add support for restartable sequences
  arm: Wire up restartable sequences system call
  arm: Add syscall detection for restartable sequences
  arm: Add restartable sequences support
  rseq: Introduce restartable sequences system call
  uapi/headers: Provide types_32_64.h
2018-06-10 10:17:09 -07:00
Trond Myklebust
f9312a5410 NFSv4.1: Fix the client behaviour on NFS4ERR_SEQ_FALSE_RETRY
If the server returns NFS4ERR_SEQ_FALSE_RETRY or NFS4ERR_RETRY_UNCACHED_REP,
then it thinks we're trying to replay an existing request. If so, then
let's just bump the sequence ID and retry the operation.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-06-09 19:10:31 -04:00
Linus Torvalds
3ca24ce9ff Merge branch 'proc-cmdline'
Merge proc_cmdline simplifications.

This re-writes the get_mm_cmdline() logic to be rather simpler than it
used to be, and makes the semantics for "cmdline goes past the end of
the original area" more natural.

You _can_ use prctl(PR_SET_MM) to just point your command line somewhere
else entirely, but the traditional model is to just edit things in place
and that still needs to continue to work.  At least this way the code
makes some sense.

* proc-cmdline:
  fs/proc: simplify and clarify get_mm_cmdline() function
  fs/proc: re-factor proc_pid_cmdline_read() a bit
2018-06-09 15:31:35 -07:00
Mikulas Patocka
f72328d27f hpfs: Use EUCLEAN for filesystem errors
Use the error code EUCLEAN for filesystem errors because other
filesystems use this code too.

[ And remove unused EMEMERROR  - Linus ]

Signed-off-by: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-06-09 14:34:48 -07:00
Linus Torvalds
eafdca4d70 Staging/IIO patches for 4.18-rc1
Here is the big staging and IIO driver update for 4.18-rc1.
 
 It was delayed as I wanted to make sure the final driver deletions did
 not cause any major merge issues, and all now looks good.
 
 There are a lot of patches here, just over 1000.  The diffstat summary
 shows the major changes here:
 	1007 files changed, 16828 insertions(+), 227770 deletions(-)
 Because of this, we might be close to shrinking the overall kernel
 source code size for two releases in a row.
 
 There was loads of work in this release cycle, primarily:
 	- tons of ks7010 driver cleanups
 	- lots of mt7621 driver fixes and cleanups
 	- most driver cleanups
 	- wilc1000 fixes and cleanups
 	- lots and lots of IIO driver cleanups and new additions
 	- debugfs cleanups for all staging drivers
 	- lots of other staging driver cleanups and fixes, the shortlog
 	  has the full details.
 
 but the big user-visable things here are the removal of 3 chunks of
 code:
 	- ncpfs and ipx were removed on schedule, no one has cared about
 	  this code since it moved to staging last year, and if it needs
 	  to come back, it can be reverted.
 	- lustre file system is removed.  I've ranted at the lustre
 	  developers about once a year for the past 5 years, with no
 	  real forward progress at all to clean things up and get the
 	  code into the "real" part of the kernel.  Given that the
 	  lustre developers continue to work on an external tree and try
 	  to port those changes to the in-kernel tree every once in a
 	  while, this whole thing really really is not working out at
 	  all.  So I'm deleting it so that the developers can spend the
 	  time working in their out-of-tree location and get things
 	  cleaned up properly to get merged into the tree correctly at a
 	  later date.
 
 Because of these file removals, you will have merge issues on some of
 these files (2 in the ipx code, 1 in the ncpfs code, and 1 in the
 atomisp driver).  Just delete those files, it's a simple merge :)
 
 All of this has been in linux-next for a while with no reported
 problems.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 
 iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCWxvjGQ8cZ3JlZ0Brcm9h
 aC5jb20ACgkQMUfUDdst+ymoEwCbBYnyUl3cwCszIJ3L3/zvUWpmqIgAn1DDsAim
 dM4lmKg6HX/JBSV4GAN0
 =zdta
 -----END PGP SIGNATURE-----

Merge tag 'staging-4.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging

Pull staging/IIO updates from Greg KH:
 "Here is the big staging and IIO driver update for 4.18-rc1.

  It was delayed as I wanted to make sure the final driver deletions did
  not cause any major merge issues, and all now looks good.

  There are a lot of patches here, just over 1000. The diffstat summary
  shows the major changes here:

	1007 files changed, 16828 insertions(+), 227770 deletions(-)

  Because of this, we might be close to shrinking the overall kernel
  source code size for two releases in a row.

  There was loads of work in this release cycle, primarily:

   - tons of ks7010 driver cleanups

   - lots of mt7621 driver fixes and cleanups

   - most driver cleanups

   - wilc1000 fixes and cleanups

   - lots and lots of IIO driver cleanups and new additions

   - debugfs cleanups for all staging drivers

   - lots of other staging driver cleanups and fixes, the shortlog has
     the full details.

  but the big user-visable things here are the removal of 3 chunks of
  code:

   - ncpfs and ipx were removed on schedule, no one has cared about this
     code since it moved to staging last year, and if it needs to come
     back, it can be reverted.

   - lustre file system is removed.

     I've ranted at the lustre developers about once a year for the past
     5 years, with no real forward progress at all to clean things up
     and get the code into the "real" part of the kernel.

     Given that the lustre developers continue to work on an external
     tree and try to port those changes to the in-kernel tree every once
     in a while, this whole thing really really is not working out at
     all. So I'm deleting it so that the developers can spend the time
     working in their out-of-tree location and get things cleaned up
     properly to get merged into the tree correctly at a later date.

  Because of these file removals, you will have merge issues on some of
  these files (2 in the ipx code, 1 in the ncpfs code, and 1 in the
  atomisp driver). Just delete those files, it's a simple merge :)

  All of this has been in linux-next for a while with no reported
  problems"

* tag 'staging-4.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging: (1011 commits)
  staging: ipx: delete it from the tree
  ncpfs: remove uapi .h files
  ncpfs: remove Documentation
  ncpfs: remove compat functionality
  staging: ncpfs: delete it
  staging: lustre: delete the filesystem from the tree.
  staging: vc04_services: no need to save the log debufs dentries
  staging: vc04_services: vchiq_debugfs_log_entry can be a void *
  staging: vc04_services: remove struct vchiq_debugfs_info
  staging: vc04_services: move client dbg directory into static variable
  staging: vc04_services: remove odd vchiq_debugfs_top() wrapper
  staging: vc04_services: no need to check debugfs return values
  staging: mt7621-gpio: reorder includes alphabetically
  staging: mt7621-gpio: change gc_map to don't use pointers
  staging: mt7621-gpio: use GPIOF_DIR_OUT and GPIOF_DIR_IN macros instead of custom values
  staging: mt7621-gpio: change 'to_mediatek_gpio' to make just a one line return
  staging: mt7621-gpio: dt-bindings: update documentation for #interrupt-cells property
  staging: mt7621-gpio: update #interrupt-cells for the gpio node
  staging: mt7621-gpio: dt-bindings: complete documentation for the gpio
  staging: mt7621-dts: add missing properties to gpio node
  ...
2018-06-09 10:32:39 -07:00
Trond Myklebust
995891006c NFSv4: Fix a typo in nfs41_sequence_process
We want to compare the slot_id to the highest slot number advertised by the
server.

Fixes: 3be0f80b5f ("NFSv4.1: Fix up replays of interrupted requests")
Cc: stable@vger.kernel.org # 4.15+
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-06-09 12:59:55 -04:00
Trond Myklebust
fc40724fc6 NFSv4: Revert commit 5f83d86cf5 ("NFSv4.x: Fix wraparound issues..")
The correct behaviour for NFSv4 sequence IDs is to wrap around
to the value 0 after 0xffffffff.
See https://tools.ietf.org/html/rfc5661#section-2.10.6.1

Fixes: 5f83d86cf5 ("NFSv4.x: Fix wraparound issues when validing...")
Cc: stable@vger.kernel.org # 4.6+
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-06-09 12:56:30 -04:00
Linus Torvalds
7d3bf613e9 libnvdimm for 4.18
* DAX broke a fundamental assumption of truncate of file mapped pages.
   The truncate path assumed that it is safe to disconnect a pinned page
   from a file and let the filesystem reclaim the physical block. With DAX
   the page is equivalent to the filesystem block. Introduce
   dax_layout_busy_page() to enable filesystems to wait for pinned DAX
   pages to be released. Without this wait a filesystem could allocate
   blocks under active device-DMA to a new file.
 
 * DAX arranges for the block layer to be bypassed and uses
   dax_direct_access() + copy_to_iter() to satisfy read(2) calls.
   However, the memcpy_mcsafe() facility is available through the pmem
   block driver. In order to safely handle media errors, via the DAX
   block-layer bypass, introduce copy_to_iter_mcsafe().
 
 * Fix cache management policy relative to the ACPI NFIT Platform
   Capabilities Structure to properly elide cache flushes when they are not
   necessary. The table indicates whether CPU caches are power-fail
   protected. Clarify that a deep flush is always performed on
   REQ_{FUA,PREFLUSH} requests.
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJbGxI7AAoJEB7SkWpmfYgCDjsP/2Lcibu9Kf4tKIzuInsle6iE
 6qP29qlkpHVTpDKbhvIxTYTYL9sMU0DNUrpPCJR/EYdeyztLWDFC5EAT1wF240vf
 maV37s/uP331jSC/2VJnKWzBs2ztQxmKLEIQCxh6aT0qs9cbaOvJgB/WlVu+qtsl
 aGJFLmb6vdQacp31noU5plKrMgMA1pADyF5qx9I9K2HwowHE7T368ZEFS/3S//c3
 LXmpx/Nfq52sGu/qbRbu6B1CTJhIGhmarObyQnvBYoKntK1Ov4e8DS95wD3EhNDe
 FuRkOCUKhjl6cFy7QVWh1ct1bFm84ny+b4/AtbpOmv9l/+0mveJ7e+5mu8HQTifT
 wYiEe2xzXJ+OG/xntv8SvlZKMpjP3BqI0jYsTutsjT4oHrciiXdXM186cyS+BiGp
 KtFmWyncQJgfiTq6+Hj5XpP9BapNS+OYdYgUagw9ZwzdzptuGFYUMSVOBrYrn6c/
 fwqtxjubykJoW0P3pkIoT91arFSea7nxOKnGwft06imQ7TwR4ARsI308feQ9itJq
 2P2e7/20nYMsw2aRaUDDA70Yu+Lagn1m8WL87IybUGeUDLb1BAkjphAlWa6COJ+u
 PhvAD2tvyM9m0c7O5Mytvz7iWKG6SVgatoAyOPkaeplQK8khZ+wEpuK58sO6C1w8
 4GBvt9ri9i/Ww/A+ppWs
 =4bfw
 -----END PGP SIGNATURE-----

Merge tag 'libnvdimm-for-4.18' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm

Pull libnvdimm updates from Dan Williams:
 "This adds a user for the new 'bytes-remaining' updates to
  memcpy_mcsafe() that you already received through Ingo via the
  x86-dax- for-linus pull.

  Not included here, but still targeting this cycle, is support for
  handling memory media errors (poison) consumed via userspace dax
  mappings.

  Summary:

   - DAX broke a fundamental assumption of truncate of file mapped
     pages. The truncate path assumed that it is safe to disconnect a
     pinned page from a file and let the filesystem reclaim the physical
     block. With DAX the page is equivalent to the filesystem block.
     Introduce dax_layout_busy_page() to enable filesystems to wait for
     pinned DAX pages to be released. Without this wait a filesystem
     could allocate blocks under active device-DMA to a new file.

   - DAX arranges for the block layer to be bypassed and uses
     dax_direct_access() + copy_to_iter() to satisfy read(2) calls.
     However, the memcpy_mcsafe() facility is available through the pmem
     block driver. In order to safely handle media errors, via the DAX
     block-layer bypass, introduce copy_to_iter_mcsafe().

   - Fix cache management policy relative to the ACPI NFIT Platform
     Capabilities Structure to properly elide cache flushes when they
     are not necessary. The table indicates whether CPU caches are
     power-fail protected. Clarify that a deep flush is always performed
     on REQ_{FUA,PREFLUSH} requests"

* tag 'libnvdimm-for-4.18' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (21 commits)
  dax: Use dax_write_cache* helpers
  libnvdimm, pmem: Do not flush power-fail protected CPU caches
  libnvdimm, pmem: Unconditionally deep flush on *sync
  libnvdimm, pmem: Complete REQ_FLUSH => REQ_PREFLUSH
  acpi, nfit: Remove ecc_unit_size
  dax: dax_insert_mapping_entry always succeeds
  libnvdimm, e820: Register all pmem resources
  libnvdimm: Debug probe times
  linvdimm, pmem: Preserve read-only setting for pmem devices
  x86, nfit_test: Add unit test for memcpy_mcsafe()
  pmem: Switch to copy_to_iter_mcsafe()
  dax: Report bytes remaining in dax_iomap_actor()
  dax: Introduce a ->copy_to_iter dax operation
  uio, lib: Fix CONFIG_ARCH_HAS_UACCESS_MCSAFE compilation
  xfs, dax: introduce xfs_break_dax_layouts()
  xfs: prepare xfs_break_layouts() for another layout type
  xfs: prepare xfs_break_layouts() to be called with XFS_MMAPLOCK_EXCL
  mm, fs, dax: handle layout changes to pinned dax mappings
  mm: fix __gup_device_huge vs unmap
  mm: introduce MEMORY_DEVICE_FS_DAX and CONFIG_DEV_PAGEMAP_OPS
  ...
2018-06-08 17:21:52 -07:00
Dan Williams
930218affe Merge branch 'for-4.18/mcsafe' into libnvdimm-for-next 2018-06-08 15:16:44 -07:00
Dan Williams
b56845794e Merge branch 'for-4.18/dax' into libnvdimm-for-next 2018-06-08 15:16:40 -07:00
Andrew Elble
692ad280bf nfsd: fix error handling in nfs4_set_delegation()
I noticed a memory corruption crash in nfsd in
4.17-rc1. This patch corrects the issue.

Fix to return error if the delegation couldn't be hashed or there was
a recall in progress. Use the existing error path instead of
destroy_delegation() for readability.

Signed-off-by: Andrew Elble <aweits@rit.edu>
Fixes: 353601e7d3 ("nfsd: create a separate lease for each delegation")
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-06-08 16:42:29 -04:00
Scott Mayhew
3171822fdc nfsd: fix potential use-after-free in nfsd4_decode_getdeviceinfo
When running a fuzz tester against a KASAN-enabled kernel, the following
splat periodically occurs.

The problem occurs when the test sends a GETDEVICEINFO request with a
malformed xdr array (size but no data) for gdia_notify_types and the
array size is > 0x3fffffff, which results in an overflow in the value of
nbytes which is passed to read_buf().

If the array size is 0x40000000, 0x80000000, or 0xc0000000, then after
the overflow occurs, the value of nbytes 0, and when that happens the
pointer returned by read_buf() points to the end of the xdr data (i.e.
argp->end) when really it should be returning NULL.

Fix this by returning NFS4ERR_BAD_XDR if the array size is > 1000 (this
value is arbitrary, but it's the same threshold used by
nfsd4_decode_bitmap()... in could really be any value >= 1 since it's
expected to get at most a single bitmap in gdia_notify_types).

[  119.256854] ==================================================================
[  119.257611] BUG: KASAN: use-after-free in nfsd4_decode_getdeviceinfo+0x5a4/0x5b0 [nfsd]
[  119.258422] Read of size 4 at addr ffff880113ada000 by task nfsd/538

[  119.259146] CPU: 0 PID: 538 Comm: nfsd Not tainted 4.17.0+ #1
[  119.259662] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.3-1.fc25 04/01/2014
[  119.261202] Call Trace:
[  119.262265]  dump_stack+0x71/0xab
[  119.263371]  print_address_description+0x6a/0x270
[  119.264609]  kasan_report+0x258/0x380
[  119.265854]  ? nfsd4_decode_getdeviceinfo+0x5a4/0x5b0 [nfsd]
[  119.267291]  nfsd4_decode_getdeviceinfo+0x5a4/0x5b0 [nfsd]
[  119.268549]  ? nfs4svc_decode_compoundargs+0xa5b/0x13c0 [nfsd]
[  119.269873]  ? nfsd4_decode_sequence+0x490/0x490 [nfsd]
[  119.271095]  nfs4svc_decode_compoundargs+0xa5b/0x13c0 [nfsd]
[  119.272393]  ? nfsd4_release_compoundargs+0x1b0/0x1b0 [nfsd]
[  119.273658]  nfsd_dispatch+0x183/0x850 [nfsd]
[  119.274918]  svc_process+0x161c/0x31a0 [sunrpc]
[  119.276172]  ? svc_printk+0x190/0x190 [sunrpc]
[  119.277386]  ? svc_xprt_release+0x451/0x680 [sunrpc]
[  119.278622]  nfsd+0x2b9/0x430 [nfsd]
[  119.279771]  ? nfsd_destroy+0x1c0/0x1c0 [nfsd]
[  119.281157]  kthread+0x2db/0x390
[  119.282347]  ? kthread_create_worker_on_cpu+0xc0/0xc0
[  119.283756]  ret_from_fork+0x35/0x40

[  119.286041] Allocated by task 436:
[  119.287525]  kasan_kmalloc+0xa0/0xd0
[  119.288685]  kmem_cache_alloc+0xe9/0x1f0
[  119.289900]  get_empty_filp+0x7b/0x410
[  119.291037]  path_openat+0xca/0x4220
[  119.292242]  do_filp_open+0x182/0x280
[  119.293411]  do_sys_open+0x216/0x360
[  119.294555]  do_syscall_64+0xa0/0x2f0
[  119.295721]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

[  119.298068] Freed by task 436:
[  119.299271]  __kasan_slab_free+0x130/0x180
[  119.300557]  kmem_cache_free+0x78/0x210
[  119.301823]  rcu_process_callbacks+0x35b/0xbd0
[  119.303162]  __do_softirq+0x192/0x5ea

[  119.305443] The buggy address belongs to the object at ffff880113ada000
                which belongs to the cache filp of size 256
[  119.308556] The buggy address is located 0 bytes inside of
                256-byte region [ffff880113ada000, ffff880113ada100)
[  119.311376] The buggy address belongs to the page:
[  119.312728] page:ffffea00044eb680 count:1 mapcount:0 mapping:0000000000000000 index:0xffff880113ada780
[  119.314428] flags: 0x17ffe000000100(slab)
[  119.315740] raw: 0017ffe000000100 0000000000000000 ffff880113ada780 00000001000c0001
[  119.317379] raw: ffffea0004553c60 ffffea00045c11e0 ffff88011b167e00 0000000000000000
[  119.319050] page dumped because: kasan: bad access detected

[  119.321652] Memory state around the buggy address:
[  119.322993]  ffff880113ad9f00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[  119.324515]  ffff880113ad9f80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[  119.326087] >ffff880113ada000: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  119.327547]                    ^
[  119.328730]  ffff880113ada080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  119.330218]  ffff880113ada100: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb
[  119.331740] ==================================================================

Signed-off-by: Scott Mayhew <smayhew@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-06-08 16:38:59 -04:00
Trond Myklebust
ce5624f7e6 NFSv4: Return NFS4ERR_DELAY when a layout recall fails due to igrab()
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-06-08 16:36:10 -04:00
Trond Myklebust
6c34265502 NFSv4: Return NFS4ERR_DELAY when a delegation recall fails due to igrab()
If the attempt to recall the delegation fails because the inode is
in the process of being evicted from cache, then use NFS4ERR_DELAY
to ask the server to retry later.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-06-08 16:36:10 -04:00
Dave Chinner
0703a8e1c1 xfs: replace do_mod with native operations
do_mod() is a hold-over from when we have different sizes for file
offsets and and other internal values for 40 bit XFS filesystems.
Hence depending on build flags variables passed to do_mod() could
change size. We no longer support those small format filesystems and
hence everything is of fixed size theses days, even on 32 bit
platforms.

As such, we can convert all the do_mod() callers to platform
optimised modulus operations as defined by linux/math64.h.
Individual conversions depend on the types of variables being used.

Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-06-08 10:07:52 -07:00
Eric Sandeen
bb3d48dcf8 xfs: don't call xfs_da_shrink_inode with NULL bp
xfs_attr3_leaf_create may have errored out before instantiating a buffer,
for example if the blkno is out of range.  In that case there is no work
to do to remove it, and in fact xfs_da_shrink_inode will lead to an oops
if we try.

This also seems to fix a flaw where the original error from
xfs_attr3_leaf_create gets overwritten in the cleanup case, and it
removes a pointless assignment to bp which isn't used after this.

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=199969
Reported-by: Xu, Wen <wen.xu@gatech.edu>
Tested-by: Xu, Wen <wen.xu@gatech.edu>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-06-08 10:07:52 -07:00
Dave Chinner
9bb54cb56a xfs: clean up MIN/MAX
Get rid of the MIN/MAX macros and just use the native min/max macros
directly in the XFS code.

Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-06-08 10:07:52 -07:00
Dave Chinner
86210fbeba xfs: move various type verifiers to common file
New verification functions like xfs_verify_fsbno() and
xfs_verify_agino() are spread across multiple files and different
header files. They really don't fit cleanly into the places they've
been put, and have wider scope than the current header includes.

Move the type verifiers to a new file in libxfs (xfs-types.c) and
the prototypes to xfs_types.h where they will be visible to all the
code that uses the types.

Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-06-08 10:07:51 -07:00
Dave Chinner
4a2d01b076 xfs: xfs_reflink_convert_cow() memory allocation deadlock
xfs_reflink_convert_cow() manipulates the incore extent list
in GFP_KERNEL context in the IO submission path whilst holding
locked pages under writeback. This is a memory reclaim deadlock
vector. This code is not in a transaction, so any memory allocations
it makes aren't protected via the memalloc_nofs_save() context that
transactions carry.

Hence we need to run this call under memalloc_nofs_save() context to
prevent potential memory allocations from being run as GFP_KERNEL
and deadlocking.

Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-06-08 10:07:51 -07:00
Dave Chinner
ef215e394e xfs: setup VFS i_rwsem lockdep state correctly
When lockdep is enabled, it changes the type of the inode i_rwsem
semaphore before unlocking a newly instantiated inode. THere is the
possibility that there is already a waiter on that inode lock by the
time we unlock the new inode, so having lockdep re-initialise the
lock is a vector for trouble.

Avoid this whole situation by setting up the i_rwsem lockdep class
at the same time we set up the XFS inode i_ilock classes and so the
VFS doesn't have to change the lock class itself when it is
potentially unsafe.

This change is necessary because the equivalent fixes to the VFS code
made in commit 1e2e547a93 ("do d_instantiate/unlock_new_inode
combinations safely") are not relevant to XFS as it has it's own
internal inode cache lookup and instantiation routines.

Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-06-08 10:07:51 -07:00
Linus Torvalds
4a189982e2 Merge branch 'work.aio' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull aio iopriority support from Al Viro:
 "The rest of aio stuff for this cycle - Adam's aio ioprio series"

* 'work.aio' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  fs: aio ioprio use ioprio_check_cap ret val
  fs: aio ioprio add explicit block layer dependence
  fs: iomap dio set bio prio from kiocb prio
  fs: blkdev set bio prio from kiocb prio
  fs: Add aio iopriority support
  fs: Convert kiocb rw_hint from enum to u16
  block: add ioprio_check_cap function
2018-06-08 10:00:20 -07:00
Linus Torvalds
4189b863ba Merge branch 'work.lookup' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull proc_fill_cache regression fix from Al Viro:
 "Regression fix for proc_fill_cache() braino introduced when switching
  instantiate() callback to d_splice_alias()"

* 'work.lookup' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  fix proc_fill_cache() in case of d_alloc_parallel() failure
2018-06-08 09:56:38 -07:00
Al Viro
d85b399b64 fix proc_fill_cache() in case of d_alloc_parallel() failure
If d_alloc_parallel() returns ERR_PTR(...), we don't want to dput()
that.  Small reorganization allows to have all error-in-lookup
cases rejoin the main codepath after dput(child), avoiding the
entire problem.

Spotted-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Fixes: 0168b9e38c "procfs: switch instantiate_t to d_splice_alias()"
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-06-08 01:17:11 -04:00
Ronnie Sahlberg
9d874c3655 cifs: fix a buffer leak in smb2_query_symlink
This leak was introduced in 91cb74f514 and caused us
to leak one small buffer for every symlink query.

Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2018-06-07 23:39:41 -05:00
Dan Carpenter
016e92da03 autofs: small cleanup in autofs_getpath()
We don't set "*name" so it's slightly nicer to just pass "name" instead
of "&name".

Link: http://lkml.kernel.org/r/20180531064736.lnisb55eajwjynvk@kili.mountain
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Ian Kent <raven@themaw.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-06-07 17:34:40 -07:00
Ian Kent
6471e93863 autofs: clean up includes
Remove includes that aren't needed from autofs (and fs/compat_ioctl.c).

Link: http://lkml.kernel.org/r/152635085258.5968.9743527195522188148.stgit@pluto.themaw.net
Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-06-07 17:34:40 -07:00
Ian Kent
8240b716e2 autofs: comment on selinux changes needed for module autoload
Due to the autofs4 module using a file system type name of autofs
different from the module containing directory name autoload did not
function properly.  To work around this kernel configurations have often
elected to build the module into the kernel.

This can result in selinux policies that prohibit autoloading of the
autofs module which need to be changed.

Add a comment about this to "possible changes" section of the autofs4
module help.

Link: http://lkml.kernel.org/r/152686474171.6155.1239659539983577463.stgit@pluto.themaw.net
Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-06-07 17:34:39 -07:00
Ian Kent
2a3ae0a121 autofs: create autofs Kconfig and Makefile
Create Makefile and Kconfig for autofs module.

[raven@themaw.net: make autofs4 Kconfig depend on AUTOFS_FS]
  Link: http://lkml.kernel.org/r/152687649097.8263.7046086367407522029.stgit@pluto.themaw.net
Link: http://lkml.kernel.org/r/152626705591.28589.356365986974038383.stgit@pluto.themaw.net
Signed-off-by: Ian Kent <raven@themaw.net>
Tested-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-06-07 17:34:39 -07:00
Ian Kent
8547190490 autofs: delete fs/autofs4 source files
Delete the now unused autofs4 module files.

Link: http://lkml.kernel.org/r/152626707391.28589.3553309771262313504.stgit@pluto.themaw.net
Signed-off-by: Ian Kent <raven@themaw.net>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-06-07 17:34:39 -07:00
Ian Kent
f7e095f5d1 autofs: update fs/autofs4/Makefile
Update Makefile to build from source in fs/autofs instead of fs/autofs4.

Link: http://lkml.kernel.org/r/152626706824.28589.1915028175544560855.stgit@pluto.themaw.net
Signed-off-by: Ian Kent <raven@themaw.net>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-06-07 17:34:39 -07:00
Ian Kent
6ed3874604 autofs: update fs/autofs4/Kconfig
Update Kconfig and add a depricated warning.

[raven@themaw.net: make autofs4 Kconfig depend on AUTOFS_FS]
  Link: http://lkml.kernel.org/r/152687649097.8263.7046086367407522029.stgit@pluto.themaw.net
Link: http://lkml.kernel.org/r/152626706133.28589.11994171621899212952.stgit@pluto.themaw.net
Signed-off-by: Ian Kent <raven@themaw.net>
Tested-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-06-07 17:34:39 -07:00
Ian Kent
ebc921ca9b autofs: copy autofs4 to autofs
Copy source files from the autofs4 directory to the autofs directory.

Link: http://lkml.kernel.org/r/152626705013.28589.931913083997578251.stgit@pluto.themaw.net
Signed-off-by: Ian Kent <raven@themaw.net>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-06-07 17:34:39 -07:00
Ian Kent
47206e012a autofs4: use autofs instead of autofs4 everywhere
Update naming within autofs source to be consistent by changing
occurrences of autofs4 to autofs.

Link: http://lkml.kernel.org/r/152626703688.28589.8315406711135226803.stgit@pluto.themaw.net
Signed-off-by: Ian Kent <raven@themaw.net>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-06-07 17:34:39 -07:00
Ian Kent
ef8b42f78e autofs4: merge auto_fs.h and auto_fs4.h
The autofs module has long since been removed so there's no need to have
two separate include files for autofs.

Link: http://lkml.kernel.org/r/152626703024.28589.9571964661718767929.stgit@pluto.themaw.net
Signed-off-by: Ian Kent <raven@themaw.net>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-06-07 17:34:39 -07:00
Thadeu Lima de Souza Cascardo
5cc41e0995 fs/binfmt_misc.c: do not allow offset overflow
WHen registering a new binfmt_misc handler, it is possible to overflow
the offset to get a negative value, which might crash the system, or
possibly leak kernel data.

Here is a crash log when 2500000000 was used as an offset:

  BUG: unable to handle kernel paging request at ffff989cfd6edca0
  IP: load_misc_binary+0x22b/0x470 [binfmt_misc]
  PGD 1ef3e067 P4D 1ef3e067 PUD 0
  Oops: 0000 [#1] SMP NOPTI
  Modules linked in: binfmt_misc kvm_intel ppdev kvm irqbypass joydev input_leds serio_raw mac_hid parport_pc qemu_fw_cfg parpy
  CPU: 0 PID: 2499 Comm: bash Not tainted 4.15.0-22-generic #24-Ubuntu
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.1-1 04/01/2014
  RIP: 0010:load_misc_binary+0x22b/0x470 [binfmt_misc]
  Call Trace:
    search_binary_handler+0x97/0x1d0
    do_execveat_common.isra.34+0x667/0x810
    SyS_execve+0x31/0x40
    do_syscall_64+0x73/0x130
    entry_SYSCALL_64_after_hwframe+0x3d/0xa2

Use kstrtoint instead of simple_strtoul.  It will work as the code
already set the delimiter byte to '\0' and we only do it when the field
is not empty.

Tested with offsets -1, 2500000000, UINT_MAX and INT_MAX.  Also tested
with examples documented at Documentation/admin-guide/binfmt-misc.rst
and other registrations from packages on Ubuntu.

Link: http://lkml.kernel.org/r/20180529135648.14254-1-cascardo@canonical.com
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-06-07 17:34:39 -07:00
Alexey Dobriyan
5d008fb414 proc: use "unsigned int" for /proc/*/stack
struct stack_trace::nr_entries is defined as "unsigned int" (YAY!) so
the iterator should be unsigned as well.

It saves 1 byte of code or something like that.

Link: http://lkml.kernel.org/r/20180423215248.GG9043@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-06-07 17:34:38 -07:00
Alexey Dobriyan
197850a1e0 proc: use "unsigned int" for sigqueue length
It's defined as atomic_t and really long signal queues are unheard of.

Link: http://lkml.kernel.org/r/20180423215119.GF9043@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-06-07 17:34:38 -07:00
Alexey Dobriyan
a4ef389565 proc: use "unsigned int" in proc_fill_cache()
All those lengths are unsigned as they should be.

Link: http://lkml.kernel.org/r/20180423213751.GC9043@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-06-07 17:34:38 -07:00
Alexey Dobriyan
941169298a proc: smaller RCU section in ->getattr()
struct kstat is thread local.

Link: http://lkml.kernel.org/r/20180423213626.GB9043@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-06-07 17:34:38 -07:00
Alexey Dobriyan
3cb4e162e4 proc: deduplicate /proc/*/cmdline implementation
Code can be sonsolidated if a dummy region of 0 length is used in normal
case of \0-separated command line:

1) [arg_start, arg_end) + [dummy len=0]
2) [arg_start, arg_end) + [env_start, env_end)

Link: http://lkml.kernel.org/r/20180221193335.GB28678@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Shevchenko <andy.shevchenko@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-06-07 17:34:38 -07:00
Alexey Dobriyan
6a6cbe75db proc: simpler iterations for /proc/*/cmdline
"rv" variable is used both as a counter of bytes transferred and an
error value holder but it can be reduced solely to error values if
original start of userspace buffer is stashed and used at the very end.

[akpm@linux-foundation.org: simplify cleanup code]
Link: http://lkml.kernel.org/r/20180221193009.GA28678@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Shevchenko <andy.shevchenko@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-06-07 17:34:38 -07:00
Alexey Dobriyan
6a6b9c4c11 proc: somewhat simpler code for /proc/*/cmdline
"final" variable is OK but we can get away with less lines.

Link: http://lkml.kernel.org/r/20180221192751.GC28548@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Shevchenko <andy.shevchenko@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-06-07 17:34:38 -07:00
Alexey Dobriyan
b42262af5e proc: more "unsigned int" in /proc/*/cmdline
access_remote_vm() doesn't return negative errors, it returns number of
bytes read/written (0 if error occurs).  This allows to delete some
comparisons which never trigger.

Reuse "nr_read" variable while I'm at it.

Link: http://lkml.kernel.org/r/20180221192605.GB28548@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Shevchenko <andy.shevchenko@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-06-07 17:34:38 -07:00
Mike Rapoport
df2cc96e77 userfaultfd: prevent non-cooperative events vs mcopy_atomic races
If a process monitored with userfaultfd changes it's memory mappings or
forks() at the same time as uffd monitor fills the process memory with
UFFDIO_COPY, the actual creation of page table entries and copying of
the data in mcopy_atomic may happen either before of after the memory
mapping modifications and there is no way for the uffd monitor to
maintain consistent view of the process memory layout.

For instance, let's consider fork() running in parallel with
userfaultfd_copy():

process        		         |	uffd monitor
---------------------------------+------------------------------
fork()        		         | userfaultfd_copy()
...        		         | ...
    dup_mmap()        	         |     down_read(mmap_sem)
    down_write(mmap_sem)         |     /* create PTEs, copy data */
        dup_uffd()               |     up_read(mmap_sem)
        copy_page_range()        |
        up_write(mmap_sem)       |
        dup_uffd_complete()      |
            /* notify monitor */ |

If the userfaultfd_copy() takes the mmap_sem first, the new page(s) will
be present by the time copy_page_range() is called and they will appear
in the child's memory mappings.  However, if the fork() is the first to
take the mmap_sem, the new pages won't be mapped in the child's address
space.

If the pages are not present and child tries to access them, the monitor
will get page fault notification and everything is fine.  However, if
the pages *are present*, the child can access them without uffd
noticing.  And if we copy them into child it'll see the wrong data.
Since we are talking about background copy, we'd need to decide whether
the pages should be copied or not regardless #PF notifications.

Since userfaultfd monitor has no way to determine what was the order,
let's disallow userfaultfd_copy in parallel with the non-cooperative
events.  In such case we return -EAGAIN and the uffd monitor can
understand that userfaultfd_copy() clashed with a non-cooperative event
and take an appropriate action.

Link: http://lkml.kernel.org/r/1527061324-19949-1-git-send-email-rppt@linux.vnet.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Acked-by: Pavel Emelyanov <xemul@virtuozzo.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Andrei Vagin <avagin@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-06-07 17:34:38 -07:00
Matthew Wilcox
1d40a5ea01 mm: mark pages in use for page tables
Define a new PageTable bit in the page_type and use it to mark pages in
use as page tables.  This can be helpful when debugging crashdumps or
analysing memory fragmentation.  Add a KPF flag to report these pages to
userspace and update page-types.c to interpret that flag.

Note that only pages currently accounted as NR_PAGETABLES are tracked as
PageTable; this does not include pgd/p4d/pud/pmd pages.  Those will be the
subject of a later patch.

Link: http://lkml.kernel.org/r/20180518194519.3820-4-willy@infradead.org
Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-06-07 17:34:37 -07:00
Huang Ying
ab6ecf247a mm: /proc/pid/pagemap: hide swap entries from unprivileged users
In commit ab676b7d6f ("pagemap: do not leak physical addresses to
non-privileged userspace"), the /proc/PID/pagemap is restricted to be
readable only by CAP_SYS_ADMIN to address some security issue.

In commit 1c90308e7a ("pagemap: hide physical addresses from
non-privileged users"), the restriction is relieved to make
/proc/PID/pagemap readable, but hide the physical addresses for
non-privileged users.

But the swap entries are readable for non-privileged users too.  This
has some security issues.  For example, for page under migrating, the
swap entry has physical address information.  So, in this patch, the
swap entries are hided for non-privileged users too.

Link: http://lkml.kernel.org/r/20180508012745.7238-1-ying.huang@intel.com
Fixes: 1c90308e7a ("pagemap: hide physical addresses from non-privileged users")
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Suggested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reviewed-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Andrei Vagin <avagin@openvz.org>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Daniel Colascione <dancol@google.com>
Cc: Zi Yan <zi.yan@cs.rutgers.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-06-07 17:34:36 -07:00
Mike Kravetz
5d752600a8 mm: restructure memfd code
With the addition of memfd hugetlbfs support, we now have the situation
where memfd depends on TMPFS -or- HUGETLBFS.  Previously, memfd was only
supported on tmpfs, so it made sense that the code resided in shmem.c.
In the current code, memfd is only functional if TMPFS is defined.  If
HUGETLFS is defined and TMPFS is not defined, then memfd functionality
will not be available for hugetlbfs.  This does not cause BUGs, just a
lack of potentially desired functionality.

Code is restructured in the following way:
- include/linux/memfd.h is a new file containing memfd specific
  definitions previously contained in shmem_fs.h.
- mm/memfd.c is a new file containing memfd specific code previously
  contained in shmem.c.
- memfd specific code is removed from shmem_fs.h and shmem.c.
- A new config option MEMFD_CREATE is added that is defined if TMPFS
  or HUGETLBFS is defined.

No functional changes are made to the code: restructuring only.

Link: http://lkml.kernel.org/r/20180415182119.4517-4-mike.kravetz@oracle.com
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: David Herrmann <dh.herrmann@gmail.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Marc-Andr Lureau <marcandre.lureau@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-06-07 17:34:35 -07:00
Yang Shi
88aa7cc688 mm: introduce arg_lock to protect arg_start|end and env_start|end in mm_struct
mmap_sem is on the hot path of kernel, and it very contended, but it is
abused too.  It is used to protect arg_start|end and evn_start|end when
reading /proc/$PID/cmdline and /proc/$PID/environ, but it doesn't make
sense since those proc files just expect to read 4 values atomically and
not related to VM, they could be set to arbitrary values by C/R.

And, the mmap_sem contention may cause unexpected issue like below:

INFO: task ps:14018 blocked for more than 120 seconds.
       Tainted: G            E 4.9.79-009.ali3000.alios7.x86_64 #1
 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
message.
 ps              D    0 14018      1 0x00000004
 Call Trace:
   schedule+0x36/0x80
   rwsem_down_read_failed+0xf0/0x150
   call_rwsem_down_read_failed+0x18/0x30
   down_read+0x20/0x40
   proc_pid_cmdline_read+0xd9/0x4e0
   __vfs_read+0x37/0x150
   vfs_read+0x96/0x130
   SyS_read+0x55/0xc0
   entry_SYSCALL_64_fastpath+0x1a/0xc5

Both Alexey Dobriyan and Michal Hocko suggested to use dedicated lock
for them to mitigate the abuse of mmap_sem.

So, introduce a new spinlock in mm_struct to protect the concurrent
access to arg_start|end, env_start|end and others, as well as replace
write map_sem to read to protect the race condition between prctl and
sys_brk which might break check_data_rlimit(), and makes prctl more
friendly to other VM operations.

This patch just eliminates the abuse of mmap_sem, but it can't resolve
the above hung task warning completely since the later
access_remote_vm() call needs acquire mmap_sem.  The mmap_sem
scalability issue will be solved in the future.

[yang.shi@linux.alibaba.com: add comment about mmap_sem and arg_lock]
  Link: http://lkml.kernel.org/r/1524077799-80690-1-git-send-email-yang.shi@linux.alibaba.com
Link: http://lkml.kernel.org/r/1523730291-109696-1-git-send-email-yang.shi@linux.alibaba.com
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: Cyrill Gorcunov <gorcunov@openvz.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mateusz Guzik <mguzik@redhat.com>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-06-07 17:34:34 -07:00
Chengguang Xu
478ae0ca08 fs/9p: detect invalid options as much as possible
Currently when detecting invalid options in option parsing, some
options(e.g.  msize) just set errno and allow to continuously validate
other options so that it can detect invalid options as much as possible
and give proper error messages together.

This patch applies same rule to option 'cache' and 'access' when
detecting -EINVAL.

Link: http://lkml.kernel.org/r/1525340676-34072-2-git-send-email-cgxu519@gmx.com
Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Eric Van Hensbergen <ericvh@gmail.com>
Cc: Ron Minnich <rminnich@sandia.gov>
Cc: Latchesar Ionkov <lucho@ionkov.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-06-07 17:34:34 -07:00
Souptick Joarder
c6137fe36d fs: ocfs2: use new return type vm_fault_t
Use new return type vm_fault_t for fault handler.  For now, this is just
documenting that the function returns a VM_FAULT value rather than an
errno.  Once all instances are converted, vm_fault_t will become a
distinct type.

Ref-> commit 1c8f422059 ("mm: change return type to vm_fault_t")

vmf_error() is the newly introduce inline function in 4.18.

Fix one checkpatch.pl warning by replacing BUG_ON() with WARN_ON()

[akpm@linux-foundation.org: undo BUG_ON->WARN_ON change]
Link: http://lkml.kernel.org/r/20180523153258.GA28451@jordon-HP-15-Notebook-PC
Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <jiangqi903@gmail.com>
Cc: Changwei Ge <ge.changwei@h3c.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-06-07 17:34:34 -07:00
Salvatore Mesoraca
64202a21a4 ocfs2: drop a VLA in ocfs2_orphan_del()
Avoid a VLA by using a real constant expression instead of a variable.
The compiler should be able to optimize the original code and avoid
using an actual VLA.  Anyway this change is useful because it will avoid
a false positive with -Wvla, it might also help the compiler generating
better code.

Link: http://lkml.kernel.org/r/1520970710-19732-1-git-send-email-s.mesoraca16@gmail.com
Signed-off-by: Salvatore Mesoraca <s.mesoraca16@gmail.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <jiangqi903@gmail.com>
Cc: Changwei Ge <ge.changwei@h3c.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-06-07 17:34:34 -07:00
Guozhonghua
f3797d8ae5 ocfs2: correct the comments position of struct ocfs2_dir_block_trailer
Correct the comments position of the structure ocfs2_dir_block_trailer.

Link: http://lkml.kernel.org/r/71604351584F6A4EBAE558C676F37CA401071C5FDE@H3CMLB12-EX.srv.huawei-3com.com
Signed-off-by: guozhonghua <guozhonghua@h3c.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <jiangqi903@gmail.com>
Cc: Changwei Ge <ge.changwei@h3c.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-06-07 17:34:34 -07:00
Zhen Lei
731a40fab1 ocfs2: eliminate a misreported warning
The warning is invalid because the parameter chunksize passed from
ocfs2_info_freefrag_scan_chain-->ocfs2_info_update_ffg is guaranteed to
be positive.  So __ilog2_u32 cannot return -1.

  fs/ocfs2/ioctl.c: In function 'ocfs2_info_update_ffg':
  fs/ocfs2/ioctl.c:411:17: warning: array subscript is below array bounds [-Warray-bounds]
    hist->fc_chunks[index]++;
                   ^
  fs/ocfs2/ioctl.c:411:17: warning: array subscript is below array bounds [-Warray-bounds]

Link: http://lkml.kernel.org/r/1524655799-12112-1-git-send-email-thunder.leizhen@huawei.com
Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <jiangqi903@gmail.com>
Cc: Changwei Ge <ge.changwei@h3c.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-06-07 17:34:33 -07:00
Larry Chen
133b81f28e ocfs2: ocfs2_inode_lock_tracker does not distinguish lock level
ocfs2_inode_lock_tracker as a variant of ocfs2_inode_lock, is used to
prevent deadlock due to recursive lock acquisition.

But this function does not distinguish whether the requested level is EX
or PR.

If a RP lock has been attained, this function will immediately return
success afterwards even an EX lock is requested.

But actually the return value does not mean that the process got a EX
lock, because ocfs2_inode_lock has not been called.

When taking lock levels into account, we face some different situations:

1. no lock is held
   In this case, just lock the inode and return 0

2. We are holding a lock
   For this situation, things diverges into several cases

   wanted     holding	     what to do
   ex		ex	    see 2.1 below
   ex		pr	    see 2.2 below
   pr		ex	    see 2.1 below
   pr		pr	    see 2.1 below

   2.1 lock level that is been held is compatible
   with the wanted level, so no lock action will be tacken.

   2.2 Otherwise, an upgrade is needed, but it is forbidden.

Reason why upgrade within a process is forbidden is that lock upgrade
may cause dead lock.  The following illustrate how it happens.

        process 1                             process 2
ocfs2_inode_lock_tracker(ex=0)
                               <======   ocfs2_inode_lock_tracker(ex=1)

ocfs2_inode_lock_tracker(ex=1)

For the status quo of ocfs2, without this patch, neither a bug nor
end-user impact will be caused because the wrong logic is avoided.

But I'm afraid this generic interface, may be called by other developers
in future and used in this situation.

  a process
ocfs2_inode_lock_tracker(ex=0)
ocfs2_inode_lock_tracker(ex=1)

Link: http://lkml.kernel.org/r/20180510053230.17217-1-lchen@suse.com
Signed-off-by: Larry Chen <lchen@suse.com>
Reviewed-by: Gang He <ghe@suse.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <jiangqi903@gmail.com>
Cc: Changwei Ge <ge.changwei@h3c.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-06-07 17:34:33 -07:00
Jia Guo
5bc55d654b ocfs2: clean up redundant function declarations
ocfs2_extend_allocation() has been deleted, clean up its declaration.
Also change the static function name from __ocfs2_extend_allocation() to
ocfs2_extend_allocation() to be consistent with the corresponding trace
events as well as comments for ocfs2_lock_allocators().

Link: http://lkml.kernel.org/r/09cf7125-6f12-e53e-20f5-e606b2c16b48@huawei.com
Fixes: 964f14a0d3 ("ocfs2: clean up some dead code")
Signed-off-by: Jia Guo <guojia12@huawei.com>
Acked-by: Joseph Qi <jiangqi903@gmail.com>
Reviewed-by: Jun Piao <piaojun@huawei.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <ge.changwei@h3c.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-06-07 17:34:33 -07:00
Souptick Joarder
ab77dab462 fs/dax.c: use new return type vm_fault_t
Use new return type vm_fault_t for fault handler.  For now, this is just
documenting that the function returns a VM_FAULT value rather than an
errno.  Once all instances are converted, vm_fault_t will become a
distinct type.

commit 1c8f422059 ("mm: change return type to vm_fault_t")

There was an existing bug inside dax_load_hole() if vm_insert_mixed had
failed to allocate a page table, we'd return VM_FAULT_NOPAGE instead of
VM_FAULT_OOM.  With new vmf_insert_mixed() this issue is addressed.

vm_insert_mixed_mkwrite has inefficiency when it returns an error value,
driver has to convert it to vm_fault_t type.  With new
vmf_insert_mixed_mkwrite() this limitation will be addressed.

Link: http://lkml.kernel.org/r/20180510181121.GA15239@jordon-HP-15-Notebook-PC
Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-06-07 17:34:33 -07:00
Linus Torvalds
c90fca951e powerpc updates for 4.18
Notable changes:
 
  - Support for split PMD page table lock on 64-bit Book3S (Power8/9).
 
  - Add support for HAVE_RELIABLE_STACKTRACE, so we properly support live
    patching again.
 
  - Add support for patching barrier_nospec in copy_from_user() and syscall entry.
 
  - A couple of fixes for our data breakpoints on Book3S.
 
  - A series from Nick optimising TLB/mm handling with the Radix MMU.
 
  - Numerous small cleanups to squash sparse/gcc warnings from Mathieu Malaterre.
 
  - Several series optimising various parts of the 32-bit code from Christophe Leroy.
 
  - Removal of support for two old machines, "SBC834xE" and "C2K" ("GEFanuc,C2K"),
    which is why the diffstat has so many deletions.
 
 And many other small improvements & fixes.
 
 There's a few out-of-area changes. Some minor ftrace changes OK'ed by Steve, and
 a fix to our powernv cpuidle driver. Then there's a series touching mm, x86 and
 fs/proc/task_mmu.c, which cleans up some details around pkey support. It was
 ack'ed/reviewed by Ingo & Dave and has been in next for several weeks.
 
 Thanks to:
   Akshay Adiga, Alastair D'Silva, Alexey Kardashevskiy, Al Viro, Andrew
   Donnellan, Aneesh Kumar K.V, Anju T Sudhakar, Arnd Bergmann, Balbir Singh,
   Cédric Le Goater, Christophe Leroy, Christophe Lombard, Colin Ian King, Dave
   Hansen, Fabio Estevam, Finn Thain, Frederic Barrat, Gautham R. Shenoy, Haren
   Myneni, Hari Bathini, Ingo Molnar, Jonathan Neuschäfer, Josh Poimboeuf,
   Kamalesh Babulal, Madhavan Srinivasan, Mahesh Salgaonkar, Mark Greer, Mathieu
   Malaterre, Matthew Wilcox, Michael Neuling, Michal Suchanek, Naveen N. Rao,
   Nicholas Piggin, Nicolai Stange, Olof Johansson, Paul Gortmaker, Paul
   Mackerras, Peter Rosin, Pridhiviraj Paidipeddi, Ram Pai, Rashmica Gupta, Ravi
   Bangoria, Russell Currey, Sam Bobroff, Samuel Mendoza-Jonas, Segher
   Boessenkool, Shilpasri G Bhat, Simon Guo, Souptick Joarder, Stewart Smith,
   Thiago Jung Bauermann, Torsten Duwe, Vaibhav Jain, Wei Yongjun, Wolfram Sang,
   Yisheng Xie, YueHaibing.
 -----BEGIN PGP SIGNATURE-----
 
 iQIwBAABCAAaBQJbGQKBExxtcGVAZWxsZXJtYW4uaWQuYXUACgkQUevqPMjhpYBq
 TRAAioK7rz5xYMkxaM3Ng3ybobEeNAwQqOolz98xvmnB9SfDWNuc99vf8cGu0/fQ
 zc8AKZ5RcnwipOjyGlxW9oa1ZhVq0xtYnQPiYLEKMdLQmh5D+C7+KpvAd1UElweg
 ub40/xDySWfMujfuMSF9JDCWPIXyojt4Xg5nJKIVRrAm/3YMe/+i5Am7NWHuMCEb
 aQmZtlYW5Mz81XY0968hjpUO6eKFRmsaM7yFAhGTXx6+oLRpGj1PZB4AwdRIKS2L
 Ak7q/VgxtE4W+s3a0GK2s+eXIhGKeFuX9AVnx3nti+8/K1OqrqhDcLMUC/9JpCpv
 EvOtO7dxPnZujHjdu4Eai/xNoo4h6zRy7bWqve9LoBM40CP5jljKzu1lwqqb5yO0
 jC7/aXhgiSIxxcRJLjoI/TYpZPu40MifrkydmczykdPyPCnMIWEJDcj4KsRL/9Y8
 9SSbJzRNC/SgQNTbUYPZFFi6G0QaMmlcbCb628k8QT+Gn3Xkdf/ZtxzqEyoF4Irq
 46kFBsiSSK4Bu0rVlcUtJQLgdqytWULO6NKEYnD67laxYcgQd8pGFQ8SjZhRZLgU
 q5LA3HIWhoAI4M0wZhOnKXO6JfiQ1UbO8gUJLsWsfF0Fk5KAcdm+4kb4jbI1H4Qk
 Vol9WNRZwEllyaiqScZN9RuVVuH0GPOZeEH1dtWK+uWi0lM=
 =ZlBf
 -----END PGP SIGNATURE-----

Merge tag 'powerpc-4.18-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux

Pull powerpc updates from Michael Ellerman:
 "Notable changes:

   - Support for split PMD page table lock on 64-bit Book3S (Power8/9).

   - Add support for HAVE_RELIABLE_STACKTRACE, so we properly support
     live patching again.

   - Add support for patching barrier_nospec in copy_from_user() and
     syscall entry.

   - A couple of fixes for our data breakpoints on Book3S.

   - A series from Nick optimising TLB/mm handling with the Radix MMU.

   - Numerous small cleanups to squash sparse/gcc warnings from Mathieu
     Malaterre.

   - Several series optimising various parts of the 32-bit code from
     Christophe Leroy.

   - Removal of support for two old machines, "SBC834xE" and "C2K"
     ("GEFanuc,C2K"), which is why the diffstat has so many deletions.

  And many other small improvements & fixes.

  There's a few out-of-area changes. Some minor ftrace changes OK'ed by
  Steve, and a fix to our powernv cpuidle driver. Then there's a series
  touching mm, x86 and fs/proc/task_mmu.c, which cleans up some details
  around pkey support. It was ack'ed/reviewed by Ingo & Dave and has
  been in next for several weeks.

  Thanks to: Akshay Adiga, Alastair D'Silva, Alexey Kardashevskiy, Al
  Viro, Andrew Donnellan, Aneesh Kumar K.V, Anju T Sudhakar, Arnd
  Bergmann, Balbir Singh, Cédric Le Goater, Christophe Leroy, Christophe
  Lombard, Colin Ian King, Dave Hansen, Fabio Estevam, Finn Thain,
  Frederic Barrat, Gautham R. Shenoy, Haren Myneni, Hari Bathini, Ingo
  Molnar, Jonathan Neuschäfer, Josh Poimboeuf, Kamalesh Babulal,
  Madhavan Srinivasan, Mahesh Salgaonkar, Mark Greer, Mathieu Malaterre,
  Matthew Wilcox, Michael Neuling, Michal Suchanek, Naveen N. Rao,
  Nicholas Piggin, Nicolai Stange, Olof Johansson, Paul Gortmaker, Paul
  Mackerras, Peter Rosin, Pridhiviraj Paidipeddi, Ram Pai, Rashmica
  Gupta, Ravi Bangoria, Russell Currey, Sam Bobroff, Samuel
  Mendoza-Jonas, Segher Boessenkool, Shilpasri G Bhat, Simon Guo,
  Souptick Joarder, Stewart Smith, Thiago Jung Bauermann, Torsten Duwe,
  Vaibhav Jain, Wei Yongjun, Wolfram Sang, Yisheng Xie, YueHaibing"

* tag 'powerpc-4.18-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (251 commits)
  powerpc/64s/radix: Fix missing ptesync in flush_cache_vmap
  cpuidle: powernv: Fix promotion from snooze if next state disabled
  powerpc: fix build failure by disabling attribute-alias warning in pci_32
  ocxl: Fix missing unlock on error in afu_ioctl_enable_p9_wait()
  powerpc-opal: fix spelling mistake "Uniterrupted" -> "Uninterrupted"
  powerpc: fix spelling mistake: "Usupported" -> "Unsupported"
  powerpc/pkeys: Detach execute_only key on !PROT_EXEC
  powerpc/powernv: copy/paste - Mask SO bit in CR
  powerpc: Remove core support for Marvell mv64x60 hostbridges
  powerpc/boot: Remove core support for Marvell mv64x60 hostbridges
  powerpc/boot: Remove support for Marvell mv64x60 i2c controller
  powerpc/boot: Remove support for Marvell MPSC serial controller
  powerpc/embedded6xx: Remove C2K board support
  powerpc/lib: optimise PPC32 memcmp
  powerpc/lib: optimise 32 bits __clear_user()
  powerpc/time: inline arch_vtime_task_switch()
  powerpc/Makefile: set -mcpu=860 flag for the 8xx
  powerpc: Implement csum_ipv6_magic in assembly
  powerpc/32: Optimise __csum_partial()
  powerpc/lib: Adjust .balign inside string functions for PPC32
  ...
2018-06-07 10:23:33 -07:00
Linus Torvalds
d987f62cce \n
-----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAlsZTnUACgkQnJ2qBz9k
 QNmSlwf6AwQFhSrrPTOwL9OIUzPnjvFC/Yk2CRmQzC+6jkFoS+cRPsCwDeGOP9KK
 9Col8Dr3M72PLKbyA3Z46YzXsRCB+pifgQKDjpMNKZJxuHSd8JmvAgEuLs8a2119
 4ZM0nNwhFJ2fnrpUwBDKoGB7m+Xb+xxnEY4hQ+jFPrDBuXmcjOz41al4EcZeb1YE
 Eu6zpmB4jIqkOY+LxMtHRSE7GlDAP6g9ERYodjsL/+Vg424wHLRQb/IVaXngalGS
 es1cMOoIqfT0rsODweXP4aGOY+z+Am+Htspqg7mrWk9zu/lP/57vS9Kwy/VsOQfp
 zRoDzuMnxumXVjTjHnd1Y9NIJw+gYg==
 =wZmz
 -----END PGP SIGNATURE-----

Merge tag 'udf_for_v4.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs

Pull udf updates from Jan Kara:
 "UDF support for UTF-16 characters in file names"

* tag 'udf_for_v4.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  udf: Add support for decoding UTF-16 characters
  udf: Add support for encoding UTF-16 characters
  udf: Push sb argument to udf_name_[to|from]_CS0()
  udf: Convert ident strings to proper charset
  udf: Use UTF-32 <-> UTF-8 conversion functions from NLS
  udf: Always require NLS support
2018-06-07 09:36:29 -07:00
Linus Torvalds
091a0f2785 orangefs: fixes and cleanups
+ fix some sparse warnings
  + cleanup some code formatting
  + fix up some attribute/meta-data related code
 -----BEGIN PGP SIGNATURE-----
 
 iQIbBAABAgAGBQJbGUIEAAoJEM9EDqnrzg2+3ygP9RPTSd2Mh/vyzokxmBacT/iW
 CrJXdrJkG+Yb/adGuT2UAPIMg73aGLqAPMN8dHhfhzotdd1BgrhMK+zZECUClwOf
 dxQCA+RFn1YofXMojqWYAFtJ6ptJgMDJxje62hS+QCWzDz0IM1PYr2qCQPs1IRmb
 CWVuP1igwt82VIVLey8YEuI5yuOnFOSO+7rKi7TPsY991oPxxz2brmrMuiW2Dg0W
 6MCP19XUBpxdDJWtL2GWXAwgkP9JYkhokwkAloqWXoOpDUf0EniomuRMBLOtISoO
 tJfgnYIN+7sRTSbqsNtKQrm7f8/B01t+tGuzJYDGlUls02V3DkS9EjypvP+VZ6e9
 bsN3oQHD6xFx/7egxUVBRFD6Hw83j58DH1N+79whVrU3ospkj3z0QX1g6zqgY2ew
 mgileXBAJQD+SUr0F42QqeFVexjky9dKtDZLQe42jxLZXpPly2t737h4ZatdW3ro
 yKJLUzuHbLGQzgGJOSQsSfmwUkSeTilgCV6WFyJWinlxUWQEkp0F+1u5B4RX87BQ
 SHUz5O3pWDpz+kLtauzsVlXDbBQfZHimIbnKWHZoQ5dER/LQfWBfvOz+CT/5uBBg
 qhdPZ4wKLD4qvzByuZJKyYmPXpUoOWh/i5QcvJ8B8rNxCx8XEP/VoikjUqkCUajb
 4HFQwnZtnQOLfnufTTw=
 =vwvd
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-4.18-ofs' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux

Pull orangefs updates from Mike Marshall:
 "Fixes and cleanups:

   - fix some sparse warnings

   - cleanup some code formatting

   - fix up some attribute/meta-data related code"

* tag 'for-linus-4.18-ofs' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux:
  orangefs: use sparse annotations for holding locks across function calls.
  orangefs: make debug_help_fops static
  orangefs: remove unused function orangefs_get_bufmap_init
  orangefs: specify user pointers when using dev_map_desc and bufmap
  orangefs: formatting cleanups
  orangefs: set i_size on new symlink
  orangefs: report attributes_mask and attributes for statx
  orangefs: make struct orangefs_file_vm_ops static
  orangefs: revamp block sizes
2018-06-07 09:23:12 -07:00
Linus Torvalds
70f2ae1f00 overlayfs fixes for 4.18
This contains a fix for the vfs_mkdir() issue discovered by Al, as well as
 other fixes and cleanups.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQSQHSd0lITzzeNWNm3h3BK/laaZPAUCWxatHQAKCRDh3BK/laaZ
 POg5AP95a/uUOrTJeTsENJwTmyAwHed9a6y4abKtvNErxUm4awD9FmhyYXodzJNq
 9/mheT4kV2XkR/KkxI5sizfT1uPuvgA=
 =+ljQ
 -----END PGP SIGNATURE-----

Merge tag 'ovl-fixes-4.18' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs

Pull overlayfs fixes from Miklos Szeredi:
 "This contains a fix for the vfs_mkdir() issue discovered by Al, as
  well as other fixes and cleanups"

* tag 'ovl-fixes-4.18' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
  ovl: use inode_insert5() to hash a newly created inode
  ovl: Pass argument to ovl_get_inode() in a structure
  vfs: factor out inode_insert5()
  ovl: clean up copy-up error paths
  ovl: return EIO on internal error
  ovl: make ovl_create_real() cope with vfs_mkdir() safely
  ovl: create helper ovl_create_temp()
  ovl: return dentry from ovl_create_real()
  ovl: struct cattr cleanups
  ovl: strip debug argument from ovl_do_ helpers
  ovl: remove WARN_ON() real inode attributes mismatch
  ovl: Kconfig documentation fixes
  ovl: update documentation for unionmount-testsuite
2018-06-07 08:53:50 -07:00
Linus Torvalds
da315f6e03 fuse update for 4.18
The most interesting part of this update is user namespace support, mostly
 done by Eric Biederman.  This enables safe unprivileged fuse mounts within
 a user namespace.
 
 There are also a couple of fixes for bugs found by syzbot and miscellaneous
 fixes and cleanups.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQSQHSd0lITzzeNWNm3h3BK/laaZPAUCWxanJAAKCRDh3BK/laaZ
 PJjDAP4r6f4kL/5DZxK7JSnSue8BHESGD1LCMVgL57e9WmZukgD/cOtaO85ie3lh
 DWuhX5xGZVMMX4frIGLfBn8ogSS+egw=
 =3luD
 -----END PGP SIGNATURE-----

Merge tag 'fuse-update-4.18' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse

Pull fuse updates from Miklos Szeredi:
 "The most interesting part of this update is user namespace support,
  mostly done by Eric Biederman. This enables safe unprivileged fuse
  mounts within a user namespace.

  There are also a couple of fixes for bugs found by syzbot and
  miscellaneous fixes and cleanups"

* tag 'fuse-update-4.18' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
  fuse: don't keep dead fuse_conn at fuse_fill_super().
  fuse: fix control dir setup and teardown
  fuse: fix congested state leak on aborted connections
  fuse: Allow fully unprivileged mounts
  fuse: Ensure posix acls are translated outside of init_user_ns
  fuse: add writeback documentation
  fuse: honor AT_STATX_FORCE_SYNC
  fuse: honor AT_STATX_DONT_SYNC
  fuse: Restrict allow_other to the superblock's namespace or a descendant
  fuse: Support fuse filesystems outside of init_user_ns
  fuse: Fail all requests with invalid uids or gids
  fuse: Remove the buggy retranslation of pids in fuse_dev_do_read
  fuse: return -ECONNABORTED on /dev/fuse read after abort
  fuse: atomic_o_trunc should truncate pagecache
2018-06-07 08:50:57 -07:00
Souptick Joarder
a528a24150 btrfs: change return type of btrfs_page_mkwrite to vm_fault_t
Use the new return type vm_fault_t for fault handler. For now, this is
just documenting that the function returns a VM_FAULT value rather than
an errno. Once all instances are converted, vm_fault_t will become a
distinct type.

Reference commit 1c8f422059 ("mm: change return type to vm_fault_t")

vmf_error() is the newly introduced inline function in 4.17-rc6.

Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-06-07 17:27:45 +02:00
Sascha Hauer
e1db654d8e ubifs: lpt: Fix wrong pnode number range in comment
The comment above pnode_lookup claims the range for the pnode number is
from 0 to main_lebs - 1. This is wrong because every pnode has
informations about UBIFS_LPT_FANOUT LEBs, thus the corrent range is
0 to to (main_lebs - 1) / UBIFS_LPT_FANOUT.

Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de>
Signed-off-by: Richard Weinberger <richard@nod.at>
2018-06-07 15:53:15 +02:00
Sascha Hauer
28e5dfd842 ubifs: gc: Fix typo
"point of view" makes more sense than "point of few". Fix this.

Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de>
Signed-off-by: Richard Weinberger <richard@nod.at>
2018-06-07 15:53:15 +02:00
Sascha Hauer
71d561f026 ubifs: log: Some spelling fixes
- add missing article
- remove misplaced 'it'
- s/tress/trees

Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de>
Signed-off-by: Richard Weinberger <richard@nod.at>
2018-06-07 15:53:15 +02:00
Sascha Hauer
c7e593b3bd ubifs: Spelling fix someting -> something
Replace "someting" with "something"

Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de>
Signed-off-by: Richard Weinberger <richard@nod.at>
2018-06-07 15:53:14 +02:00
Sascha Hauer
671b9b75f6 ubifs: journal: Remove wrong comment
In the description of reserve_space() it is claimed that write_node()
and write_head() unlock the journal head. This is not true and has never
been true. All callers of write_node() and write_head() call
release_head() themselves. Remove the wrong comment.

Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de>
Signed-off-by: Richard Weinberger <richard@nod.at>
2018-06-07 15:53:14 +02:00
Sascha Hauer
c971dad849 ubifs: remove set but never used variable
replay_sqnum is set but never used. Remove it.

Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de>
Signed-off-by: Richard Weinberger <richard@nod.at>
2018-06-07 15:53:14 +02:00
Wang Shilong
422edacec0 ubifs, xattr: remove misguided quota flags
Originally, Yang Dongsheng added quota support
for ubifs, but it turned out upstream won't accept it.

Since ubifs don't touch any quota code, S_NOQUOTA flag
is misguided here, and currently it is mainly used to
avoid recursion for system quota files.

Let's make things clearly and remove unnecessary and
misguied quota flags here.

Reported-by: Rock Lee <rockdotlee@gmail.com>
Signed-off-by: Wang Shilong <wshilong@ddn.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
2018-06-07 15:53:14 +02:00
Souptick Joarder
31c49eac78 fs: ubifs: Adding new return type vm_fault_t
Use new return type vm_fault_t for page_mkwrite handler.

Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
2018-06-07 15:53:13 +02:00
Steve French
c7c137b931 smb3: do not allow insecure cifs mounts when using smb3
if mounting as smb3 do not allow cifs (vers=1.0) or insecure vers=2.0
mounts.

For example:
root@smf-Thinkpad-P51:~/cifs-2.6# mount -t smb3 //127.0.0.1/scratch /mnt -o username=testuser,password=Testpass1
root@smf-Thinkpad-P51:~/cifs-2.6# umount /mnt
root@smf-Thinkpad-P51:~/cifs-2.6# mount -t smb3 //127.0.0.1/scratch /mnt -o username=testuser,password=Testpass1,vers=1.0
mount: /mnt: wrong fs type, bad option, bad superblock on //127.0.0.1/scratch ...
root@smf-Thinkpad-P51:~/cifs-2.6# dmesg | grep smb3
[ 4302.200122] CIFS VFS: vers=1.0 (cifs) not permitted when mounting with smb3
root@smf-Thinkpad-P51:~/cifs-2.6# mount -t smb3 //127.0.0.1/scratch /mnt -o username=testuser,password=Testpass1,vers=3.11

Signed-off-by: Steve French <stfrench@microsoft.com>
Acked-by: Pavel Shilovsky <pshilov@microsoft.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Reviewed-by: Sachin Prabhu <sprabhu@redhat.com>
2018-06-07 08:36:39 -05:00
Aurelien Aptel
8ddecf5fd7 CIFS: Fix NULL ptr deref
cifs->master_tlink is NULL against Win Server 2016 (which is
strange.. not sure why) and is dereferenced in cifs_sb_master_tcon().

move master_tlink getter to cifsglob.h so it can be used from
smb2misc.c

Signed-off-by: Aurelien Aptel <aaptel@suse.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Acked-by: Pavel Shilovsky <pshilov@microsoft.com>
2018-06-07 08:31:31 -05:00
Robbie Ko
9d311e11fc Btrfs: fiemap: pass correct bytenr when fm_extent_count is zero
[BUG]
fm_mapped_extents is not correct when fm_extent_count is 0
Like:
   # mount /dev/vdb5 /mnt/btrfs
   # dd if=/dev/zero bs=16K count=4 oflag=dsync of=/mnt/btrfs/file
   # xfs_io -c "fiemap -v" /mnt/btrfs/file
   /mnt/btrfs/file:
   EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
     0: [0..127]:        25088..25215       128   0x1

When user space wants to get the number of file extents,
set fm_extent_count to 0 to run fiemap and then read fm_mapped_extents.

In the above example, fiemap will return with fm_mapped_extents set to 4,
but it should be 1 since there's only one entry in the output.

[REASON]
The problem seems to be that disko is only set if
fieinfo->fi_extents_max is set. And this member is initialized, in the
generic ioctl_fiemap function, to the value of used-passed
fm_extent_count. So when the user passes 0 then fi_extent_max is also
set to zero and this causes btrfs to not initialize disko at all.
Eventually this leads emit_fiemap_extent being called with a bogus
'phys' argument preventing proper fiemap entries merging.

[FIX]
Move the disko initialization earlier in extent_fiemap making it
independent of user-passed arguments, allowing emit_fiemap_extent to
properly handle consecutive extent entries.

Signed-off-by: Robbie Ko <robbieko@synology.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-06-07 14:26:29 +02:00
Linus Torvalds
1c8c5a9d38 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
Pull networking updates from David Miller:

 1) Add Maglev hashing scheduler to IPVS, from Inju Song.

 2) Lots of new TC subsystem tests from Roman Mashak.

 3) Add TCP zero copy receive and fix delayed acks and autotuning with
    SO_RCVLOWAT, from Eric Dumazet.

 4) Add XDP_REDIRECT support to mlx5 driver, from Jesper Dangaard
    Brouer.

 5) Add ttl inherit support to vxlan, from Hangbin Liu.

 6) Properly separate ipv6 routes into their logically independant
    components. fib6_info for the routing table, and fib6_nh for sets of
    nexthops, which thus can be shared. From David Ahern.

 7) Add bpf_xdp_adjust_tail helper, which can be used to generate ICMP
    messages from XDP programs. From Nikita V. Shirokov.

 8) Lots of long overdue cleanups to the r8169 driver, from Heiner
    Kallweit.

 9) Add BTF ("BPF Type Format"), from Martin KaFai Lau.

10) Add traffic condition monitoring to iwlwifi, from Luca Coelho.

11) Plumb extack down into fib_rules, from Roopa Prabhu.

12) Add Flower classifier offload support to igb, from Vinicius Costa
    Gomes.

13) Add UDP GSO support, from Willem de Bruijn.

14) Add documentation for eBPF helpers, from Quentin Monnet.

15) Add TLS tx offload to mlx5, from Ilya Lesokhin.

16) Allow applications to be given the number of bytes available to read
    on a socket via a control message returned from recvmsg(), from
    Soheil Hassas Yeganeh.

17) Add x86_32 eBPF JIT compiler, from Wang YanQing.

18) Add AF_XDP sockets, with zerocopy support infrastructure as well.
    From Björn Töpel.

19) Remove indirect load support from all of the BPF JITs and handle
    these operations in the verifier by translating them into native BPF
    instead. From Daniel Borkmann.

20) Add GRO support to ipv6 gre tunnels, from Eran Ben Elisha.

21) Allow XDP programs to do lookups in the main kernel routing tables
    for forwarding. From David Ahern.

22) Allow drivers to store hardware state into an ELF section of kernel
    dump vmcore files, and use it in cxgb4. From Rahul Lakkireddy.

23) Various RACK and loss detection improvements in TCP, from Yuchung
    Cheng.

24) Add TCP SACK compression, from Eric Dumazet.

25) Add User Mode Helper support and basic bpfilter infrastructure, from
    Alexei Starovoitov.

26) Support ports and protocol values in RTM_GETROUTE, from Roopa
    Prabhu.

27) Support bulking in ->ndo_xdp_xmit() API, from Jesper Dangaard
    Brouer.

28) Add lots of forwarding selftests, from Petr Machata.

29) Add generic network device failover driver, from Sridhar Samudrala.

* ra.kernel.org:/pub/scm/linux/kernel/git/davem/net-next: (1959 commits)
  strparser: Add __strp_unpause and use it in ktls.
  rxrpc: Fix terminal retransmission connection ID to include the channel
  net: hns3: Optimize PF CMDQ interrupt switching process
  net: hns3: Fix for VF mailbox receiving unknown message
  net: hns3: Fix for VF mailbox cannot receiving PF response
  bnx2x: use the right constant
  Revert "net: sched: cls: Fix offloading when ingress dev is vxlan"
  net: dsa: b53: Fix for brcm tag issue in Cygnus SoC
  enic: fix UDP rss bits
  netdev-FAQ: clarify DaveM's position for stable backports
  rtnetlink: validate attributes in do_setlink()
  mlxsw: Add extack messages for port_{un, }split failures
  netdevsim: Add extack error message for devlink reload
  devlink: Add extack to reload and port_{un, }split operations
  net: metrics: add proper netlink validation
  ipmr: fix error path when ipmr_new_table fails
  ip6mr: only set ip6mr_table from setsockopt when ip6mr_new_table succeeds
  net: hns3: remove unused hclgevf_cfg_func_mta_filter
  netfilter: provide udp*_lib_lookup for nf_tproxy
  qed*: Utilize FW 8.37.2.0
  ...
2018-06-06 18:39:49 -07:00
Linus Torvalds
2857676045 - Introduce arithmetic overflow test helper functions (Rasmus)
- Use overflow helpers in 2-factor allocators (Kees, Rasmus)
 - Introduce overflow test module (Rasmus, Kees)
 - Introduce saturating size helper functions (Matthew, Kees)
 - Treewide use of struct_size() for allocators (Kees)
 -----BEGIN PGP SIGNATURE-----
 Comment: Kees Cook <kees@outflux.net>
 
 iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAlsYJ1gWHGtlZXNjb29r
 QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJlCTEACwdEeriAd2VwxknnsstojGD/3g
 8TTFA19vSu4Gxa6WiDkjGoSmIlfhXTlZo1Nlmencv16ytSvIVDNLUIB3uDxUIv1J
 2+dyHML9JpXYHHR7zLXXnGFJL0wazqjbsD3NYQgXqmun7EVVYnOsAlBZ7h/Lwiej
 jzEJd8DaHT3TA586uD3uggiFvQU0yVyvkDCDONIytmQx+BdtGdg9TYCzkBJaXuDZ
 YIthyKDvxIw5nh/UaG3L+SKo73tUr371uAWgAfqoaGQQCWe+mxnWL4HkCKsjFzZL
 u9ouxxF/n6pij3E8n6rb0i2fCzlsTDdDF+aqV1rQ4I4hVXCFPpHUZgjDPvBWbj7A
 m6AfRHVNnOgI8HGKqBGOfViV+2kCHlYeQh3pPW33dWzy/4d/uq9NIHKxE63LH+S4
 bY3oO2ela8oxRyvEgXLjqmRYGW1LB/ZU7FS6Rkx2gRzo4k8Rv+8K/KzUHfFVRX61
 jEbiPLzko0xL9D53kcEn0c+BhofK5jgeSWxItdmfuKjLTW4jWhLRlU+bcUXb6kSS
 S3G6aF+L+foSUwoq63AS8QxCuabuhreJSB+BmcGUyjthCbK/0WjXYC6W/IJiRfBa
 3ZTxBC/2vP3uq/AGRNh5YZoxHL8mSxDfn62F+2cqlJTTKR/O+KyDb1cusyvk3H04
 KCDVLYPxwQQqK1Mqig==
 =/3L8
 -----END PGP SIGNATURE-----

Merge tag 'overflow-v4.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Pull overflow updates from Kees Cook:
 "This adds the new overflow checking helpers and adds them to the
  2-factor argument allocators. And this adds the saturating size
  helpers and does a treewide replacement for the struct_size() usage.
  Additionally this adds the overflow testing modules to make sure
  everything works.

  I'm still working on the treewide replacements for allocators with
  "simple" multiplied arguments:

     *alloc(a * b, ...) -> *alloc_array(a, b, ...)

  and

     *zalloc(a * b, ...) -> *calloc(a, b, ...)

  as well as the more complex cases, but that's separable from this
  portion of the series. I expect to have the rest sent before -rc1
  closes; there are a lot of messy cases to clean up.

  Summary:

   - Introduce arithmetic overflow test helper functions (Rasmus)

   - Use overflow helpers in 2-factor allocators (Kees, Rasmus)

   - Introduce overflow test module (Rasmus, Kees)

   - Introduce saturating size helper functions (Matthew, Kees)

   - Treewide use of struct_size() for allocators (Kees)"

* tag 'overflow-v4.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  treewide: Use struct_size() for devm_kmalloc() and friends
  treewide: Use struct_size() for vmalloc()-family
  treewide: Use struct_size() for kmalloc()-family
  device: Use overflow helpers for devm_kmalloc()
  mm: Use overflow helpers in kvmalloc()
  mm: Use overflow helpers in kmalloc_array*()
  test_overflow: Add memory allocation overflow tests
  overflow.h: Add allocation size calculation helpers
  test_overflow: Report test failures
  test_overflow: macrofy some more, do more tests for free
  lib: add runtime test of check_*_overflow functions
  compiler.h: enable builtin overflow checkers and add fallback code
2018-06-06 17:27:14 -07:00
Aurelien Aptel
83210ba6f8 CIFS: fix encryption in SMB3.1.1
The smb2 hdr is now in iov 1

Signed-off-by: Aurelien Aptel <aaptel@suse.com>
Reviewed-by: Paulo Alcantara <palcantara@suse.de>
Acked-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2018-06-06 16:50:31 -05:00
Arnd Bergmann
4bb8b65a04 xfs: fix string handling in label get/set functions
[sandeen: fix subject, avoid copy-out of uninit data in getlabel]

gcc-8 reports two warnings for the newly added getlabel/setlabel code:

fs/xfs/xfs_ioctl.c: In function 'xfs_ioc_getlabel':
fs/xfs/xfs_ioctl.c:1822:38: error: argument to 'sizeof' in 'strncpy' call is the same expression as the source; did you mean to use the size of the destination? [-Werror=sizeof-pointer-memaccess]
  strncpy(label, sbp->sb_fname, sizeof(sbp->sb_fname));
                                      ^
In function 'strncpy',
    inlined from 'xfs_ioc_setlabel' at /git/arm-soc/fs/xfs/xfs_ioctl.c:1863:2,
    inlined from 'xfs_file_ioctl' at /git/arm-soc/fs/xfs/xfs_ioctl.c:1918:10:
include/linux/string.h:254:9: error: '__builtin_strncpy' output may be truncated copying 12 bytes from a string of length 12 [-Werror=stringop-truncation]
  return __builtin_strncpy(p, q, size);

In both cases, part of the problem is that one of the strncpy()
arguments is a fixed-length character array with zero-padding rather
than a zero-terminated string. In the first one case, we also get an
odd warning about sizeof-pointer-memaccess, which doesn't seem right
(the sizeof is for an array that happens to be the same as the second
strncpy argument).

To work around the bogus warning, I use a plain 'XFSLABEL_MAX' for
the strncpy() length when copying the label in getlabel. For setlabel(),
using memcpy() with the correct length that is already known avoids
the second warning and is slightly simpler.

In a related issue, it appears that we accidentally skip the trailing
\0 when copying a 12-character label back to user space in getlabel().
Using the correct sizeof() argument here copies the extra character.

Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85602
Fixes: f7664b3197 ("xfs: implement online get/set fs label")
Cc: Eric Sandeen <sandeen@redhat.com>
Cc: Martin Sebor <msebor@gmail.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-06-06 14:17:53 -07:00
Dave Chinner
0b61f8a407 xfs: convert to SPDX license tags
Remove the verbose license text from XFS files and replace them
with SPDX tags. This does not change the license of any of the code,
merely refers to the common, up-to-date license files in LICENSES/

This change was mostly scripted. fs/xfs/Makefile and
fs/xfs/libxfs/xfs_fs.h were modified by hand, the rest were detected
and modified by the following command:

for f in `git grep -l "GNU General" fs/xfs/` ; do
	echo $f
	cat $f | awk -f hdr.awk > $f.new
	mv -f $f.new $f
done

And the hdr.awk script that did the modification (including
detecting the difference between GPL-2.0 and GPL-2.0+ licenses)
is as follows:

$ cat hdr.awk
BEGIN {
	hdr = 1.0
	tag = "GPL-2.0"
	str = ""
}

/^ \* This program is free software/ {
	hdr = 2.0;
	next
}

/any later version./ {
	tag = "GPL-2.0+"
	next
}

/^ \*\// {
	if (hdr > 0.0) {
		print "// SPDX-License-Identifier: " tag
		print str
		print $0
		str=""
		hdr = 0.0
		next
	}
	print $0
	next
}

/^ \* / {
	if (hdr > 1.0)
		next
	if (hdr > 0.0) {
		if (str != "")
			str = str "\n"
		str = str $0
		next
	}
	print $0
	next
}

/^ \*/ {
	if (hdr > 0.0)
		next
	print $0
	next
}

// {
	if (hdr > 0.0) {
		if (str != "")
			str = str "\n"
		str = str $0
		next
	}
	print $0
}

END { }
$

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-06-06 14:17:53 -07:00
Kees Cook
acafe7e302 treewide: Use struct_size() for kmalloc()-family
One of the more common cases of allocation size calculations is finding
the size of a structure that has a zero-sized array at the end, along
with memory for some number of elements for that array. For example:

struct foo {
    int stuff;
    void *entry[];
};

instance = kmalloc(sizeof(struct foo) + sizeof(void *) * count, GFP_KERNEL);

Instead of leaving these open-coded and prone to type mistakes, we can
now use the new struct_size() helper:

instance = kmalloc(struct_size(instance, entry, count), GFP_KERNEL);

This patch makes the changes for kmalloc()-family (and kvmalloc()-family)
uses. It was done via automatic conversion with manual review for the
"CHECKME" non-standard cases noted below, using the following Coccinelle
script:

// pkey_cache = kmalloc(sizeof *pkey_cache + tprops->pkey_tbl_len *
//                      sizeof *pkey_cache->table, GFP_KERNEL);
@@
identifier alloc =~ "kmalloc|kzalloc|kvmalloc|kvzalloc";
expression GFP;
identifier VAR, ELEMENT;
expression COUNT;
@@

- alloc(sizeof(*VAR) + COUNT * sizeof(*VAR->ELEMENT), GFP)
+ alloc(struct_size(VAR, ELEMENT, COUNT), GFP)

// mr = kzalloc(sizeof(*mr) + m * sizeof(mr->map[0]), GFP_KERNEL);
@@
identifier alloc =~ "kmalloc|kzalloc|kvmalloc|kvzalloc";
expression GFP;
identifier VAR, ELEMENT;
expression COUNT;
@@

- alloc(sizeof(*VAR) + COUNT * sizeof(VAR->ELEMENT[0]), GFP)
+ alloc(struct_size(VAR, ELEMENT, COUNT), GFP)

// Same pattern, but can't trivially locate the trailing element name,
// or variable name.
@@
identifier alloc =~ "kmalloc|kzalloc|kvmalloc|kvzalloc";
expression GFP;
expression SOMETHING, COUNT, ELEMENT;
@@

- alloc(sizeof(SOMETHING) + COUNT * sizeof(ELEMENT), GFP)
+ alloc(CHECKME_struct_size(&SOMETHING, ELEMENT, COUNT), GFP)

Signed-off-by: Kees Cook <keescook@chromium.org>
2018-06-06 11:15:43 -07:00
Chuck Lever
025bb9f872 NFSv4.0: Remove transport protocol name from non-UCS client ID
Commit 69dd716c5f ("NFSv4: Add socket proto argument to
setclientid") (2007) added the transport protocol name to the client
ID string, but the patch description doesn't explain why this was
necessary.

At that time, the only transport protocol name that would have been
used is "tcp" (for both IPv4 and IPv6), resulting in no additional
distinctiveness of the client ID string.

Since there is one client instance, the server should recognize it's
state whether the client is connecting via TCP or RDMA. Same client,
same lease.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-06-06 11:45:52 -04:00
Chuck Lever
848a4eb2e3 NFSv4.0: Remove cl_ipaddr from non-UCS client ID
It is possible for two distinct clients to have the same cl_ipaddr:

 - if the client admin disables callback with clientaddr=0.0.0.0 on
   more than one client

 - if two clients behind separate NATs use the same private subnet
   number

 - if the client admin specifies the same address via clientaddr=
   mount option (pointing the server at the same NAT box, for
   example)

Because of the way the Linux NFSv4.0 client constructs its client
ID string by default, such clients could interfere with each others'
lease state when mounting the same server:

	scnprintf(str, len, "Linux NFSv4.0 %s/%s %s",
		clp->cl_ipaddr,
		rpc_peeraddr2str(clp->cl_rpcclient, RPC_DISPLAY_ADDR),
		rpc_peeraddr2str(clp->cl_rpcclient, RPC_DISPLAY_PROTO));

cl_ipaddr is set to the value of the clientaddr= mount option. Two
clients whose addresses are 192.168.3.77 that mount the same server
(whose public IP address is, say, 3.4.5.6) would both generate the
same client ID string when sending a SETCLIENTID:

  Linux NFSv4.0 192.168.3.77/3.4.5.6 tcp

and thus the server would not be able to distinguish the clients'
leases. If both clients are using AUTH_SYS when sending SETCLIENTID
then the server could possibly permit the two clients to interfere
with or purge each others' leases.

To better ensure that Linux's NFSv4.0 client ID strings are distinct
in these cases, remove cl_ipaddr from the client ID string and
replace it with something more likely to be unique. Note that the
replacement looks a lot like the uniform client ID string.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-06-06 11:45:44 -04:00
Dave Chinner
9e6c08d4a8 xfs: validate btree records on retrieval
So we don't check the validity of records as we walk the btree. When
there are corrupt records in the free space btree (e.g. zero
startblock/length or beyond EOAG) we just blindly use it and things
go bad from there. That leads to assert failures on debug kernels
like this:

XFS: Assertion failed: fs_is_ok, file: fs/xfs/libxfs/xfs_alloc.c, line: 450
....
Call Trace:
 xfs_alloc_fixup_trees+0x368/0x5c0
 xfs_alloc_ag_vextent_near+0x79a/0xe20
 xfs_alloc_ag_vextent+0x1d3/0x330
 xfs_alloc_vextent+0x5e9/0x870

Or crashes like this:

XFS (loop0): xfs_buf_find: daddr 0x7fb28 out of range, EOFS 0x8000
.....
BUG: unable to handle kernel NULL pointer dereference at 00000000000000c8
....
Call Trace:
 xfs_bmap_add_extent_hole_real+0x67d/0x930
 xfs_bmapi_write+0x934/0xc90
 xfs_da_grow_inode_int+0x27e/0x2f0
 xfs_dir2_grow_inode+0x55/0x130
 xfs_dir2_sf_to_block+0x94/0x5d0
 xfs_dir2_sf_addname+0xd0/0x590
 xfs_dir_createname+0x168/0x1a0
 xfs_rename+0x658/0x9b0

By checking that free space records pulled from the trees are
within the valid range, we catch many of these corruptions before
they can do damage.

This is a generic btree record checking deficiency. We need to
validate the records we fetch from all the different btrees before
we use them to catch corruptions like this.

This patch results in a corrupt record emitting an error message and
returning -EFSCORRUPTED, and the higher layers catch that and abort:

 XFS (loop0): Size Freespace BTree record corruption in AG 0 detected!
 XFS (loop0): start block 0x0 block count 0x0
 XFS (loop0): Internal error xfs_trans_cancel at line 1012 of file fs/xfs/xfs_trans.c.  Caller xfs_create+0x42a/0x670
 .....
 Call Trace:
  dump_stack+0x85/0xcb
  xfs_trans_cancel+0x19f/0x1c0
  xfs_create+0x42a/0x670
  xfs_generic_create+0x1f6/0x2c0
  vfs_create+0xf9/0x180
  do_mknodat+0x1f9/0x210
  do_syscall_64+0x5a/0x180
  entry_SYSCALL_64_after_hwframe+0x49/0xbe
.....
 XFS (loop0): xfs_do_force_shutdown(0x8) called from line 1013 of file fs/xfs/xfs_trans.c.  Return address = ffffffff81500868
 XFS (loop0): Corruption of in-memory data detected.  Shutting down filesystem

Signed-off-by: Dave Chinner <dchinner@redhat.com>

Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-06-06 08:12:00 -07:00
Dave Chinner
29cad0b3ed xfs: push corruption -> ESTALE conversion to xfs_nfs_get_inode()
In xfs_imap_to_bp(), we convert a -EFSCORRUPTED error to -EINVAL if
we are doing an untrusted lookup. This is done because we need
failed filehandle lookups to report -ESTALE to the caller, and it
does this by converting -EINVAL and -ENOENT errors to -ESTALE.

The squashing of EFSCORRUPTED in imap_to_bp makes it impossible for
for xfs_iget(UNTRUSTED) callers to determine the difference between
"inode does not exist" and "corruption detected during lookup". We
realy need that distinction in places calling xfS_iget(UNTRUSTED),
so move the filehandle error case handling all the way out to
xfs_nfs_get_inode() where it is needed.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-06-06 08:10:26 -07:00
Dave Chinner
541b5acc85 xfs: verify root inode more thoroughly
When looking up the root inode at mount time, we don't actually do
any verification to check that the inode is allocated and accounted
for correctly in the INOBT. Make the checks on the root inode more
robust by making it an untrusted lookup. This forces the inode
lookup to use the inode btree to verify the inode is allocated
and mapped correctly to disk. This will also have the effect of
catching a significant number of AGI/INOBT related corruptions in
AG 0 at mount time.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-06-06 08:10:26 -07:00
Dave Chinner
02a0fda875 xfs: verify COW extent size hint is valid in inode verifier
There are rules for vald extent size hints. We enforce them when
applications set them, but fuzzers violate those rules and that
screws us over. Validate COW extent size hint rules in the inode
verifier to catch this.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-06-06 08:10:26 -07:00
Dave Chinner
7d71a671a2 xfs: verify extent size hint is valid in inode verifier
There are rules for vald extent size hints. We enforce them when
applications set them, but fuzzers violate those rules and that
screws us over.

This results in alignment assertion failures when setting up
allocations such as this in direct IO:

XFS: Assertion failed: ap->length, file: fs/xfs/libxfs/xfs_bmap.c, line: 3432
....
Call Trace:
 xfs_bmap_btalloc+0x415/0x910
 xfs_bmapi_write+0x71c/0x12e0
 xfs_iomap_write_direct+0x2a9/0x420
 xfs_file_iomap_begin+0x4dc/0xa70
 iomap_apply+0x43/0x100
 iomap_file_buffered_write+0x62/0x90
 xfs_file_buffered_aio_write+0xba/0x300
 __vfs_write+0xd5/0x150
 vfs_write+0xb6/0x180
 ksys_write+0x45/0xa0
 do_syscall_64+0x5a/0x180
 entry_SYSCALL_64_after_hwframe+0x49/0xbe

And from xfs_db:

core.extsize = 10380288

Which is not an integer multiple of the block size, and so violates
Rule #7 for setting extent size hints. Validate extent size hint
rules in the inode verifier to catch this.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-06-06 08:10:26 -07:00
Dave Chinner
fa4ca9c557 xfs: catch bad stripe alignment configurations
When stripe alignments are invalid, data alignment algorithms in the
allocator may not work correctly. Ensure we catch superblocks with
invalid stripe alignment setups at mount time. These data alignment
mismatches are now detected at mount time like this:

XFS (loop0): SB stripe unit sanity check failed
XFS (loop0): Metadata corruption detected at xfs_sb_read_verify+0xab/0x110, xfs_sb block 0xffffffffffffffff
XFS (loop0): Unmount and run xfs_repair
XFS (loop0): First 128 bytes of corrupted metadata buffer:
0000000091c2de02: 58 46 53 42 00 00 10 00 00 00 00 00 00 00 10 00  XFSB............
0000000023bff869: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
00000000cdd8c893: 17 32 37 15 ff ca 46 3d 9a 17 d3 33 04 b5 f1 a2  .27...F=...3....
000000009fd2844f: 00 00 00 00 00 00 00 04 00 00 00 00 00 00 06 d0  ................
0000000088e9b0bb: 00 00 00 00 00 00 06 d1 00 00 00 00 00 00 06 d2  ................
00000000ff233a20: 00 00 00 01 00 00 10 00 00 00 00 01 00 00 00 00  ................
000000009db0ac8b: 00 00 03 60 e1 34 02 00 08 00 00 02 00 00 00 00  ...`.4..........
00000000f7022460: 00 00 00 00 00 00 00 00 0c 09 0b 01 0c 00 00 19  ................
XFS (loop0): SB validate failed with error -117.

And the mount fails.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-06-06 08:10:26 -07:00
Mathieu Desnoyers
d7822b1e24 rseq: Introduce restartable sequences system call
Expose a new system call allowing each thread to register one userspace
memory area to be used as an ABI between kernel and user-space for two
purposes: user-space restartable sequences and quick access to read the
current CPU number value from user-space.

* Restartable sequences (per-cpu atomics)

Restartables sequences allow user-space to perform update operations on
per-cpu data without requiring heavy-weight atomic operations.

The restartable critical sections (percpu atomics) work has been started
by Paul Turner and Andrew Hunter. It lets the kernel handle restart of
critical sections. [1] [2] The re-implementation proposed here brings a
few simplifications to the ABI which facilitates porting to other
architectures and speeds up the user-space fast path.

Here are benchmarks of various rseq use-cases.

Test hardware:

arm32: ARMv7 Processor rev 4 (v7l) "Cubietruck", 2-core
x86-64: Intel E5-2630 v3@2.40GHz, 16-core, hyperthreading

The following benchmarks were all performed on a single thread.

* Per-CPU statistic counter increment

                getcpu+atomic (ns/op)    rseq (ns/op)    speedup
arm32:                344.0                 31.4          11.0
x86-64:                15.3                  2.0           7.7

* LTTng-UST: write event 32-bit header, 32-bit payload into tracer
             per-cpu buffer

                getcpu+atomic (ns/op)    rseq (ns/op)    speedup
arm32:               2502.0                 2250.0         1.1
x86-64:               117.4                   98.0         1.2

* liburcu percpu: lock-unlock pair, dereference, read/compare word

                getcpu+atomic (ns/op)    rseq (ns/op)    speedup
arm32:                751.0                 128.5          5.8
x86-64:                53.4                  28.6          1.9

* jemalloc memory allocator adapted to use rseq

Using rseq with per-cpu memory pools in jemalloc at Facebook (based on
rseq 2016 implementation):

The production workload response-time has 1-2% gain avg. latency, and
the P99 overall latency drops by 2-3%.

* Reading the current CPU number

Speeding up reading the current CPU number on which the caller thread is
running is done by keeping the current CPU number up do date within the
cpu_id field of the memory area registered by the thread. This is done
by making scheduler preemption set the TIF_NOTIFY_RESUME flag on the
current thread. Upon return to user-space, a notify-resume handler
updates the current CPU value within the registered user-space memory
area. User-space can then read the current CPU number directly from
memory.

Keeping the current cpu id in a memory area shared between kernel and
user-space is an improvement over current mechanisms available to read
the current CPU number, which has the following benefits over
alternative approaches:

- 35x speedup on ARM vs system call through glibc
- 20x speedup on x86 compared to calling glibc, which calls vdso
  executing a "lsl" instruction,
- 14x speedup on x86 compared to inlined "lsl" instruction,
- Unlike vdso approaches, this cpu_id value can be read from an inline
  assembly, which makes it a useful building block for restartable
  sequences.
- The approach of reading the cpu id through memory mapping shared
  between kernel and user-space is portable (e.g. ARM), which is not the
  case for the lsl-based x86 vdso.

On x86, yet another possible approach would be to use the gs segment
selector to point to user-space per-cpu data. This approach performs
similarly to the cpu id cache, but it has two disadvantages: it is
not portable, and it is incompatible with existing applications already
using the gs segment selector for other purposes.

Benchmarking various approaches for reading the current CPU number:

ARMv7 Processor rev 4 (v7l)
Machine model: Cubietruck
- Baseline (empty loop):                                    8.4 ns
- Read CPU from rseq cpu_id:                               16.7 ns
- Read CPU from rseq cpu_id (lazy register):               19.8 ns
- glibc 2.19-0ubuntu6.6 getcpu:                           301.8 ns
- getcpu system call:                                     234.9 ns

x86-64 Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz:
- Baseline (empty loop):                                    0.8 ns
- Read CPU from rseq cpu_id:                                0.8 ns
- Read CPU from rseq cpu_id (lazy register):                0.8 ns
- Read using gs segment selector:                           0.8 ns
- "lsl" inline assembly:                                   13.0 ns
- glibc 2.19-0ubuntu6 getcpu:                              16.6 ns
- getcpu system call:                                      53.9 ns

- Speed (benchmark taken on v8 of patchset)

Running 10 runs of hackbench -l 100000 seems to indicate, contrary to
expectations, that enabling CONFIG_RSEQ slightly accelerates the
scheduler:

Configuration: 2 sockets * 8-core Intel(R) Xeon(R) CPU E5-2630 v3 @
2.40GHz (directly on hardware, hyperthreading disabled in BIOS, energy
saving disabled in BIOS, turboboost disabled in BIOS, cpuidle.off=1
kernel parameter), with a Linux v4.6 defconfig+localyesconfig,
restartable sequences series applied.

* CONFIG_RSEQ=n

avg.:      41.37 s
std.dev.:   0.36 s

* CONFIG_RSEQ=y

avg.:      40.46 s
std.dev.:   0.33 s

- Size

On x86-64, between CONFIG_RSEQ=n/y, the text size increase of vmlinux is
567 bytes, and the data size increase of vmlinux is 5696 bytes.

[1] https://lwn.net/Articles/650333/
[2] http://www.linuxplumbersconf.org/2013/ocw/system/presentations/1695/original/LPC%20-%20PerCpu%20Atomics.pdf

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Watson <davejwatson@fb.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: Chris Lameter <cl@linux.com>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Andrew Hunter <ahh@google.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Maurer <bmaurer@fb.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: linux-api@vger.kernel.org
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20151027235635.16059.11630.stgit@pjt-glaptop.roam.corp.google.com
Link: http://lkml.kernel.org/r/20150624222609.6116.86035.stgit@kitami.mtv.corp.google.com
Link: https://lkml.kernel.org/r/20180602124408.8430-3-mathieu.desnoyers@efficios.com
2018-06-06 11:58:31 +02:00
Linus Torvalds
af6c5d5e01 Merge branch 'for-4.18' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
Pull workqueue updates from Tejun Heo:

 - make kworkers report the workqueue it is executing or has executed
   most recently in /proc/PID/comm (so they show up in ps/top)

 - CONFIG_SMP shuffle to move stuff which isn't necessary for UP builds
   inside CONFIG_SMP.

* 'for-4.18' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
  workqueue: move function definitions within CONFIG_SMP block
  workqueue: Make sure struct worker is accessible for wq_worker_comm()
  workqueue: Show the latest workqueue name in /proc/PID/{comm,stat,status}
  proc: Consolidate task->comm formatting into proc_task_name()
  workqueue: Set worker->desc to workqueue name by default
  workqueue: Make worker_attach/detach_pool() update worker->pool
  workqueue: Replace pool->attach_mutex with global wq_pool_attach_mutex
2018-06-05 17:31:33 -07:00
Deepa Dinamani
95582b0083 vfs: change inode times to use struct timespec64
struct timespec is not y2038 safe. Transition vfs to use
y2038 safe struct timespec64 instead.

The change was made with the help of the following cocinelle
script. This catches about 80% of the changes.
All the header file and logic changes are included in the
first 5 rules. The rest are trivial substitutions.
I avoid changing any of the function signatures or any other
filesystem specific data structures to keep the patch simple
for review.

The script can be a little shorter by combining different cases.
But, this version was sufficient for my usecase.

virtual patch

@ depends on patch @
identifier now;
@@
- struct timespec
+ struct timespec64
  current_time ( ... )
  {
- struct timespec now = current_kernel_time();
+ struct timespec64 now = current_kernel_time64();
  ...
- return timespec_trunc(
+ return timespec64_trunc(
  ... );
  }

@ depends on patch @
identifier xtime;
@@
 struct \( iattr \| inode \| kstat \) {
 ...
-       struct timespec xtime;
+       struct timespec64 xtime;
 ...
 }

@ depends on patch @
identifier t;
@@
 struct inode_operations {
 ...
int (*update_time) (...,
-       struct timespec t,
+       struct timespec64 t,
...);
 ...
 }

@ depends on patch @
identifier t;
identifier fn_update_time =~ "update_time$";
@@
 fn_update_time (...,
- struct timespec *t,
+ struct timespec64 *t,
 ...) { ... }

@ depends on patch @
identifier t;
@@
lease_get_mtime( ... ,
- struct timespec *t
+ struct timespec64 *t
  ) { ... }

@te depends on patch forall@
identifier ts;
local idexpression struct inode *inode_node;
identifier i_xtime =~ "^i_[acm]time$";
identifier ia_xtime =~ "^ia_[acm]time$";
identifier fn_update_time =~ "update_time$";
identifier fn;
expression e, E3;
local idexpression struct inode *node1;
local idexpression struct inode *node2;
local idexpression struct iattr *attr1;
local idexpression struct iattr *attr2;
local idexpression struct iattr attr;
identifier i_xtime1 =~ "^i_[acm]time$";
identifier i_xtime2 =~ "^i_[acm]time$";
identifier ia_xtime1 =~ "^ia_[acm]time$";
identifier ia_xtime2 =~ "^ia_[acm]time$";
@@
(
(
- struct timespec ts;
+ struct timespec64 ts;
|
- struct timespec ts = current_time(inode_node);
+ struct timespec64 ts = current_time(inode_node);
)

<+... when != ts
(
- timespec_equal(&inode_node->i_xtime, &ts)
+ timespec64_equal(&inode_node->i_xtime, &ts)
|
- timespec_equal(&ts, &inode_node->i_xtime)
+ timespec64_equal(&ts, &inode_node->i_xtime)
|
- timespec_compare(&inode_node->i_xtime, &ts)
+ timespec64_compare(&inode_node->i_xtime, &ts)
|
- timespec_compare(&ts, &inode_node->i_xtime)
+ timespec64_compare(&ts, &inode_node->i_xtime)
|
ts = current_time(e)
|
fn_update_time(..., &ts,...)
|
inode_node->i_xtime = ts
|
node1->i_xtime = ts
|
ts = inode_node->i_xtime
|
<+... attr1->ia_xtime ...+> = ts
|
ts = attr1->ia_xtime
|
ts.tv_sec
|
ts.tv_nsec
|
btrfs_set_stack_timespec_sec(..., ts.tv_sec)
|
btrfs_set_stack_timespec_nsec(..., ts.tv_nsec)
|
- ts = timespec64_to_timespec(
+ ts =
...
-)
|
- ts = ktime_to_timespec(
+ ts = ktime_to_timespec64(
...)
|
- ts = E3
+ ts = timespec_to_timespec64(E3)
|
- ktime_get_real_ts(&ts)
+ ktime_get_real_ts64(&ts)
|
fn(...,
- ts
+ timespec64_to_timespec(ts)
,...)
)
...+>
(
<... when != ts
- return ts;
+ return timespec64_to_timespec(ts);
...>
)
|
- timespec_equal(&node1->i_xtime1, &node2->i_xtime2)
+ timespec64_equal(&node1->i_xtime2, &node2->i_xtime2)
|
- timespec_equal(&node1->i_xtime1, &attr2->ia_xtime2)
+ timespec64_equal(&node1->i_xtime2, &attr2->ia_xtime2)
|
- timespec_compare(&node1->i_xtime1, &node2->i_xtime2)
+ timespec64_compare(&node1->i_xtime1, &node2->i_xtime2)
|
node1->i_xtime1 =
- timespec_trunc(attr1->ia_xtime1,
+ timespec64_trunc(attr1->ia_xtime1,
...)
|
- attr1->ia_xtime1 = timespec_trunc(attr2->ia_xtime2,
+ attr1->ia_xtime1 =  timespec64_trunc(attr2->ia_xtime2,
...)
|
- ktime_get_real_ts(&attr1->ia_xtime1)
+ ktime_get_real_ts64(&attr1->ia_xtime1)
|
- ktime_get_real_ts(&attr.ia_xtime1)
+ ktime_get_real_ts64(&attr.ia_xtime1)
)

@ depends on patch @
struct inode *node;
struct iattr *attr;
identifier fn;
identifier i_xtime =~ "^i_[acm]time$";
identifier ia_xtime =~ "^ia_[acm]time$";
expression e;
@@
(
- fn(node->i_xtime);
+ fn(timespec64_to_timespec(node->i_xtime));
|
 fn(...,
- node->i_xtime);
+ timespec64_to_timespec(node->i_xtime));
|
- e = fn(attr->ia_xtime);
+ e = fn(timespec64_to_timespec(attr->ia_xtime));
)

@ depends on patch forall @
struct inode *node;
struct iattr *attr;
identifier i_xtime =~ "^i_[acm]time$";
identifier ia_xtime =~ "^ia_[acm]time$";
identifier fn;
@@
{
+ struct timespec ts;
<+...
(
+ ts = timespec64_to_timespec(node->i_xtime);
fn (...,
- &node->i_xtime,
+ &ts,
...);
|
+ ts = timespec64_to_timespec(attr->ia_xtime);
fn (...,
- &attr->ia_xtime,
+ &ts,
...);
)
...+>
}

@ depends on patch forall @
struct inode *node;
struct iattr *attr;
struct kstat *stat;
identifier ia_xtime =~ "^ia_[acm]time$";
identifier i_xtime =~ "^i_[acm]time$";
identifier xtime =~ "^[acm]time$";
identifier fn, ret;
@@
{
+ struct timespec ts;
<+...
(
+ ts = timespec64_to_timespec(node->i_xtime);
ret = fn (...,
- &node->i_xtime,
+ &ts,
...);
|
+ ts = timespec64_to_timespec(node->i_xtime);
ret = fn (...,
- &node->i_xtime);
+ &ts);
|
+ ts = timespec64_to_timespec(attr->ia_xtime);
ret = fn (...,
- &attr->ia_xtime,
+ &ts,
...);
|
+ ts = timespec64_to_timespec(attr->ia_xtime);
ret = fn (...,
- &attr->ia_xtime);
+ &ts);
|
+ ts = timespec64_to_timespec(stat->xtime);
ret = fn (...,
- &stat->xtime);
+ &ts);
)
...+>
}

@ depends on patch @
struct inode *node;
struct inode *node2;
identifier i_xtime1 =~ "^i_[acm]time$";
identifier i_xtime2 =~ "^i_[acm]time$";
identifier i_xtime3 =~ "^i_[acm]time$";
struct iattr *attrp;
struct iattr *attrp2;
struct iattr attr ;
identifier ia_xtime1 =~ "^ia_[acm]time$";
identifier ia_xtime2 =~ "^ia_[acm]time$";
struct kstat *stat;
struct kstat stat1;
struct timespec64 ts;
identifier xtime =~ "^[acmb]time$";
expression e;
@@
(
( node->i_xtime2 \| attrp->ia_xtime2 \| attr.ia_xtime2 \) = node->i_xtime1  ;
|
 node->i_xtime2 = \( node2->i_xtime1 \| timespec64_trunc(...) \);
|
 node->i_xtime2 = node->i_xtime1 = node->i_xtime3 = \(ts \| current_time(...) \);
|
 node->i_xtime1 = node->i_xtime3 = \(ts \| current_time(...) \);
|
 stat->xtime = node2->i_xtime1;
|
 stat1.xtime = node2->i_xtime1;
|
( node->i_xtime2 \| attrp->ia_xtime2 \) = attrp->ia_xtime1  ;
|
( attrp->ia_xtime1 \| attr.ia_xtime1 \) = attrp2->ia_xtime2;
|
- e = node->i_xtime1;
+ e = timespec64_to_timespec( node->i_xtime1 );
|
- e = attrp->ia_xtime1;
+ e = timespec64_to_timespec( attrp->ia_xtime1 );
|
node->i_xtime1 = current_time(...);
|
 node->i_xtime2 = node->i_xtime1 = node->i_xtime3 =
- e;
+ timespec_to_timespec64(e);
|
 node->i_xtime1 = node->i_xtime3 =
- e;
+ timespec_to_timespec64(e);
|
- node->i_xtime1 = e;
+ node->i_xtime1 = timespec_to_timespec64(e);
)

Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Cc: <anton@tuxera.com>
Cc: <balbi@kernel.org>
Cc: <bfields@fieldses.org>
Cc: <darrick.wong@oracle.com>
Cc: <dhowells@redhat.com>
Cc: <dsterba@suse.com>
Cc: <dwmw2@infradead.org>
Cc: <hch@lst.de>
Cc: <hirofumi@mail.parknet.co.jp>
Cc: <hubcap@omnibond.com>
Cc: <jack@suse.com>
Cc: <jaegeuk@kernel.org>
Cc: <jaharkes@cs.cmu.edu>
Cc: <jslaby@suse.com>
Cc: <keescook@chromium.org>
Cc: <mark@fasheh.com>
Cc: <miklos@szeredi.hu>
Cc: <nico@linaro.org>
Cc: <reiserfs-devel@vger.kernel.org>
Cc: <richard@nod.at>
Cc: <sage@redhat.com>
Cc: <sfrench@samba.org>
Cc: <swhiteho@redhat.com>
Cc: <tj@kernel.org>
Cc: <trond.myklebust@primarydata.com>
Cc: <tytso@mit.edu>
Cc: <viro@zeniv.linux.org.uk>
2018-06-05 16:57:31 -07:00
Kees Cook
7aaa822ed0 pstore: Convert internal records to timespec64
This prepares pstore for converting the VFS layer to timespec64.

Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
2018-06-05 16:57:31 -07:00
Linus Torvalds
ec064d3c6b Driver core changes for 4.18-rc1
Here is the driver core patchset for 4.18-rc1.
 
 The large chunk of these are firmware core documentation and api
 updates.  Nothing major there, just better descriptions for others to be
 able to understand the firmware code better.  There's also a user for a
 new firmware api call.
 
 Other than that, there are some minor updates for debugfs, kernfs, and
 the driver core itself.
 
 All of these have been in linux-next for a while with no reported
 issues.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 
 iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCWxbZAg8cZ3JlZ0Brcm9h
 aC5jb20ACgkQMUfUDdst+ykbiACgu/2qqou1iV4GxOkvrj5wfsHD+lYAoNPNNDHu
 Qf3CCEKbxogF6YowDiAH
 =Dsqq
 -----END PGP SIGNATURE-----

Merge tag 'driver-core-4.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core

Pull driver core updates from Greg KH:
 "Here is the driver core patchset for 4.18-rc1.

  The large chunk of these are firmware core documentation and api
  updates. Nothing major there, just better descriptions for others to
  be able to understand the firmware code better. There's also a user
  for a new firmware api call.

  Other than that, there are some minor updates for debugfs, kernfs, and
  the driver core itself.

  All of these have been in linux-next for a while with no reported
  issues"

* tag 'driver-core-4.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (23 commits)
  driver core: hold dev's parent lock when needed
  driver-core: return EINVAL error instead of BUG_ON()
  driver core: add __printf verification to device_create_groups_vargs
  mm: memory_hotplug: use put_device() if device_register fail
  base: core: fix typo 'can by' to 'can be'
  debugfs: inode: debugfs_create_dir uses mode permission from parent
  debugfs: Re-use kstrtobool_from_user()
  Documentation: clarify firmware_class provenance and why we can't rename the module
  Documentation: remove stale firmware API reference
  Documentation: fix few typos and clarifications for the firmware loader
  ath10k: re-enable the firmware fallback mechanism for testmode
  ath10k: use firmware_request_nowarn() to load firmware
  firmware: add firmware_request_nowarn() - load firmware without warnings
  firmware_loader: make firmware_fallback_sysfs() print more useful
  firmware_loader: move kconfig FW_LOADER entries to its own file
  firmware_loader: replace ---help--- with help
  firmware_loader: enhance Kconfig documentation over FW_LOADER
  firmware_loader: document firmware_sysfs_fallback()
  firmware: rename fw_sysfs_fallback to firmware_fallback_sysfs()
  firmware: use () to terminate kernel-doc function names
  ...
2018-06-05 16:29:19 -07:00
Steve French
d5f07fb3ef CIFS: Pass page offset for encrypting
Encryption function needs to read data starting page offset from input
buffer.

This doesn't affect decryption path since it allocates its own page
buffers.

Signed-off-by: Long Li <longli@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2018-06-05 17:46:24 -05:00
Long Li
4c0d2a5a64 CIFS: Pass page offset for calculating signature
When calculating signature for the packet, it needs to read into the
correct page offset for the data.

Signed-off-by: Long Li <longli@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2018-06-05 17:44:30 -05:00
Long Li
7cf20bce77 CIFS: SMBD: Support page offset in memory registration
Change code to pass the correct page offset during memory registration for
RDMA read/write.

Signed-off-by: Long Li <longli@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2018-06-05 17:43:59 -05:00
Long Li
6509f50cd1 CIFS: SMBD: Support page offset in RDMA recv
RDMA recv function needs to place data to the correct place starting at
page offset.

Signed-off-by: Long Li <longli@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2018-06-05 17:43:27 -05:00
Long Li
b6903bcf0a CIFS: SMBD: Support page offset in RDMA send
The RDMA send function needs to look at offset in the request pages, and
send data starting from there.

Signed-off-by: Long Li <longli@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2018-06-05 17:42:03 -05:00
Long Li
e8157b2729 CIFS: When sending data on socket, pass the correct page offset
It's possible that the offset is non-zero in the page to send, change the
function to pass this offset to socket.

Signed-off-by: Long Li <longli@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2018-06-05 17:41:27 -05:00
Long Li
7b7f2bdf82 CIFS: Introduce helper function to get page offset and length in smb_rqst
Introduce a function rqst_page_get_length to return the page offset and
length for a given page in smb_rqst. This function is to be used by
following patches.

Signed-off-by: Long Li <longli@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2018-06-05 17:41:00 -05:00
Long Li
c06a0f2dff CIFS: Calculate the correct request length based on page offset and tail size
It's possible that the page offset is non-zero in the pages in a request,
change the function to calculate the correct data buffer length.

Signed-off-by: Long Li <longli@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2018-06-05 17:39:46 -05:00
Linus Torvalds
fd59ccc530 Add bunch of cleanups, and add support for the Speck128/256
algorithms.  Yes, Speck is contrversial, but the intention is to use
 them only for the lowest end Android devices, where the alternative
 *really* is no encryption at all for data stored at rest.
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAlsW/RAACgkQ8vlZVpUN
 gaO1pwf/WOusoXBK5sUuiC8d9I5s+OlPhTKhrh+BcL7/xhOkyh2xDv2FEwsjhwUf
 qo26AMf7DsWKWgJ6wDQ1z+PIuPSNeQy5dCKbz2hbfNjET3vdk2NuvPWnIbFrmIek
 LB6Ii9jKlPJRO4T3nMrE9JzJZLsoX5OKRSgYTT3EviuW/wSXaFyi7onFnyKXBnF/
 e689tE50P42PgTEDKs4RDw43PwEGbcl5Vtj+Lnoh6VGX/pYvL/9ZbEYlKrgqSOU4
 DmckR8D8UU/Gy6G5bvMsVuJpLEU7vBxupOOHI/nJFwR6tuYi0Q1j7C/zH8BvWp5e
 o8P5GpOWk7Gm346FaUlkAZ+25bCU+A==
 =EBeE
 -----END PGP SIGNATURE-----

Merge tag 'fscrypt_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/fscrypt

Pull fscrypt updates from Ted Ts'o:
 "Add bunch of cleanups, and add support for the Speck128/256
  algorithms.

  Yes, Speck is contrversial, but the intention is to use them only for
  the lowest end Android devices, where the alternative *really* is no
  encryption at all for data stored at rest"

* tag 'fscrypt_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/fscrypt:
  fscrypt: log the crypto algorithm implementations
  fscrypt: add Speck128/256 support
  fscrypt: only derive the needed portion of the key
  fscrypt: separate key lookup from key derivation
  fscrypt: use a common logging function
  fscrypt: remove internal key size constants
  fscrypt: remove unnecessary check for non-logon key type
  fscrypt: make fscrypt_operations.max_namelen an integer
  fscrypt: drop empty name check from fname_decrypt()
  fscrypt: drop max_namelen check from fname_decrypt()
  fscrypt: don't special-case EOPNOTSUPP from fscrypt_get_encryption_info()
  fscrypt: don't clear flags on crypto transform
  fscrypt: remove stale comment from fscrypt_d_revalidate()
  fscrypt: remove error messages for skcipher_request_alloc() failure
  fscrypt: remove unnecessary NULL check when allocating skcipher
  fscrypt: clean up after fscrypt_prepare_lookup() conversions
  fs, fscrypt: only define ->s_cop when FS_ENCRYPTION is enabled
  fscrypt: use unbound workqueue for decryption
2018-06-05 15:15:32 -07:00
Linus Torvalds
6567af78ac Changes for 4.18:
- Strengthen inode number and structure validation when allocating inodes.
 - Reduce pointless buffer allocations during cache miss
 - Use FUA for pure data O_DSYNC directio writes
 - Various iomap refactorings
 - Strengthen quota metadata verification to avoid unfixable broken quota
 - Make AGFL block freeing a deferred operation to avoid blowing out
   transaction reservations when running complex operations
 - Get rid of the log item descriptors to reduce log overhead
 - Fix various reflink bugs where inodes were double-joined to
   transactions
 - Don't issue discards when trimming unwritten extents
 - Refactor incore dquot initialization and retrieval interfaces
 - Fix some locking problmes in the quota scrub code
 - Strengthen btree structure checks in scrub code
 - Rewrite swapfile activation to use iomap and support unwritten extents
 - Make scrub exit to userspace sooner when corruptions or
   cross-referencing problems are found
 - Make scrub invoke the data fork scrubber directly on metadata inodes
 - Don't do background reclamation of post-eof and cow blocks when the fs
   is suspended
 - Fix secondary superblock buffer lifespan hinting
 - Refactor growfs to use table-dispatched functions instead of long
   stringy functions
 - Move growfs code to libxfs
 - Implement online fs label getting and setting
 - Introduce online filesystem repair (in a very limited capacity)
 - Fix unit conversion problems in the realtime freemap iteration
   functions
 - Various refactorings and cleanups in preparation to remove buffer
   heads in a future release
 - Reimplement the old bmap call with iomap
 - Remove direct buffer head accesses from seek hole/data
 - Various bug fixes
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAlsR9dEACgkQ+H93GTRK
 tOv0dw//cBwRgY4jhC6b9oMk2DNRWUiTt1F2yoqr28661GPo124iXAMLIwJe1DiV
 W/qpN3HUz7P46xKOVY+MXaj0JIDFxJ8c5tHAQMH/TkDc49S+mkcGyaoPJ39hnc6u
 yikG+Hq4m0YWhHaeUhKTe8pnhXBaziz5A2NtKtwh6lPOIW+Wds51T77DJnViqADq
 tZzmAq8fS9/ELpxe0Th/2D7iTWCr2c3FLsW2KgbbNvQ4e34zVE1ix1eBtEzQE+Mm
 GUjdQhYVS1oCzqZfCxJkzR4R/1TAFyS0FXOW7PHo8FAX/kas9aQbRlnHSAQ/08EE
 8Z2p3GsFip7dgmd6O6nAmFAStW6GRvgyycJ7Y+Y0IsJj6aDp9OxhRExyF+uocJR9
 b9ChOH6PMEtRB/RRlBg66pbS61abvNGutzl61ZQZGBHEvL3VqDcd68IomdD5bNSB
 pXo6mOJIcKuXsghZszsHAV9uuMe4zQAMbLy7QH6V8LyWeSAG9hTXOT9EA4MWktEJ
 SCQFf7RRPgU5pEAgOS8LgKrawqnBaqFcFvkvWsQhyiltTFz29cwxH7tjSXYMAOFE
 W+RMp8kbkPnGOaJJeKxT+/RGRB534URk0jIEKtRb679xkEF3HE58exXEVrnojJq6
 0m712+EYuZSYhFBwrvEnQjNHr0x2r/A/iBJZ6HhyV0aO1RWm4n4=
 =11pr
 -----END PGP SIGNATURE-----

Merge tag 'xfs-4.18-merge-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull xfs updates from Darrick Wong:
 "New features this cycle include the ability to relabel mounted
  filesystems, support for fallocated swapfiles, and using FUA for pure
  data O_DSYNC directio writes. With this cycle we begin to integrate
  online filesystem repair and refactor the growfs code in preparation
  for eventual subvolume support, though the road ahead for both
  features is quite long.

  There are also numerous refactorings of the iomap code to remove
  unnecessary log overhead, to disentangle some of the quota code, and
  to prepare for buffer head removal in a future upstream kernel.

  Metadata validation continues to improve, both in the hot path
  veifiers and the online filesystem check code. I anticipate sending a
  second pull request in a few days with more metadata validation
  improvements.

  This series has been run through a full xfstests run over the weekend
  and through a quick xfstests run against this morning's master, with
  no major failures reported.

  Summary:

   - Strengthen inode number and structure validation when allocating
     inodes.

   - Reduce pointless buffer allocations during cache miss

   - Use FUA for pure data O_DSYNC directio writes

   - Various iomap refactorings

   - Strengthen quota metadata verification to avoid unfixable broken
     quota

   - Make AGFL block freeing a deferred operation to avoid blowing out
     transaction reservations when running complex operations

   - Get rid of the log item descriptors to reduce log overhead

   - Fix various reflink bugs where inodes were double-joined to
     transactions

   - Don't issue discards when trimming unwritten extents

   - Refactor incore dquot initialization and retrieval interfaces

   - Fix some locking problmes in the quota scrub code

   - Strengthen btree structure checks in scrub code

   - Rewrite swapfile activation to use iomap and support unwritten
     extents

   - Make scrub exit to userspace sooner when corruptions or
     cross-referencing problems are found

   - Make scrub invoke the data fork scrubber directly on metadata
     inodes

   - Don't do background reclamation of post-eof and cow blocks when the
     fs is suspended

   - Fix secondary superblock buffer lifespan hinting

   - Refactor growfs to use table-dispatched functions instead of long
     stringy functions

   - Move growfs code to libxfs

   - Implement online fs label getting and setting

   - Introduce online filesystem repair (in a very limited capacity)

   - Fix unit conversion problems in the realtime freemap iteration
     functions

   - Various refactorings and cleanups in preparation to remove buffer
     heads in a future release

   - Reimplement the old bmap call with iomap

   - Remove direct buffer head accesses from seek hole/data

   - Various bug fixes"

* tag 'xfs-4.18-merge-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (121 commits)
  fs: use ->is_partially_uptodate in page_cache_seek_hole_data
  fs: remove the buffer_unwritten check in page_seek_hole_data
  fs: move page_cache_seek_hole_data to iomap.c
  xfs: use iomap_bmap
  iomap: add an iomap-based bmap implementation
  iomap: add a iomap_sector helper
  iomap: use __bio_add_page in iomap_dio_zero
  iomap: move IOMAP_F_BOUNDARY to gfs2
  iomap: fix the comment describing IOMAP_NOWAIT
  iomap: inline data should be an iomap type, not a flag
  mm: split ->readpages calls to avoid non-contiguous pages lists
  mm: return an unsigned int from __do_page_cache_readahead
  mm: give the 'ret' variable a better name __do_page_cache_readahead
  block: add a lower-level bio_add_page interface
  xfs: fix error handling in xfs_refcount_insert()
  xfs: fix xfs_rtalloc_rec units
  xfs: strengthen rtalloc query range checks
  xfs: xfs_rtbuf_get should check the bmapi_read results
  xfs: xfs_rtword_t should be unsigned, not signed
  dax: change bdev_dax_supported() to support boolean returns
  ...
2018-06-05 13:24:20 -07:00
Linus Torvalds
1434763ca5 A lot of cleanups and bug fixes, especially dealing with corrupted
file systems.
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAlsWmXcACgkQ8vlZVpUN
 gaMGmAf+JGK4XysAvlJuj9tJfFPHHwgXSBBe/GAgyjhW9XhtNRHprUM2SpvwpIdI
 Isl5O8Ec+FywBJB0I4AGy6yds6DE6jn38FFRFEhVmkp4EoROJiIr8+a7spfVuC3m
 cWrHBgc7FwK4qYlyuGtH2+6NYva+KNFr+wwbvvUusvldyZAWMzflfrcdHM6D+/JE
 +Sd5I7aniqnP5fICq0b4xrP2zWO4XJEKMbZO2dJ9yRtMmUnbaSj6G+bTGDRyfrNk
 L3wJhqIu93o7zjDaEC0UfXSLAXzoDGWHeq7fBssaJiXj/hNtAvAGPaRMbgFR9a3h
 uHmhvf84iyJuM+8GgG25UqeGwCuWiA==
 =b0VQ
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 updates from Ted Ts'o:
 "A lot of cleanups and bug fixes, especially dealing with corrupted
  file systems"

* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (23 commits)
  ext4: fix fencepost error in check for inode count overflow during resize
  ext4: correctly handle a zero-length xattr with a non-zero e_value_offs
  ext4: bubble errors from ext4_find_inline_data_nolock() up to ext4_iget()
  ext4: do not allow external inodes for inline data
  ext4: report delalloc reserve as non-free in statfs for project quota
  ext4: remove NULL check before calling kmem_cache_destroy()
  jbd2: remove NULL check before calling kmem_cache_destroy()
  jbd2: remove bunch of empty lines with jbd2 debug
  ext4: handle errors on ext4_commit_super
  ext4: do not update s_last_mounted of a frozen fs
  ext4: factor out helper ext4_sample_last_mounted()
  vfs: add the sb_start_intwrite_trylock() helper
  ext4: update mtime in ext4_punch_hole even if no blocks are released
  ext4: add verifier check for symlink with append/immutable flags
  fs: ext4: add new return type vm_fault_t
  ext4: fix hole length detection in ext4_ind_map_blocks()
  ext4: mark block bitmap corrupted when found
  ext4: mark inode bitmap corrupted when found
  ext4: add new ext4_mark_group_bitmap_corrupted() helper
  ext4: fix wrong return value in ext4_read_inode_bitmap()
  ...
2018-06-05 12:49:17 -07:00
Greg Kroah-Hartman
f0ac2abcc0 ncpfs: remove compat functionality
Now that ncpfs is gone from the tree, no need to have the compatibility
thunking layer around, it will not actually go anywhere :)

So delete that logic from fs/compat.c, it is no longer needed.

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-06-05 19:23:26 +02:00
Darrick J. Wong
117a148ffe iomap: fsync swap files before iterating mappings
Swap files require that all the file mapping metadata be stable on disk.
It is insufficient to flush dirty pages in the page cache because that
won't necessarily result in filesystems pushing all their metadata out
to disk.  Therefore, call fsync from iomap_swapfile_activate.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
2018-06-05 09:53:05 -07:00
Shankara Pailoor
92d3413419 jfs: Fix inconsistency between memory allocation and ea_buf->max_size
The code is assuming the buffer is max_size length, but we weren't
allocating enough space for it.

Signed-off-by: Shankara Pailoor <shankarapailoor@gmail.com>
Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
2018-06-05 10:36:46 -05:00
Trond Myklebust
977294c719 NFSv4: Fix a compiler warning when CONFIG_NFS_V4_1 is undefined
Fix a compiler warning:
fs/nfs/nfs4proc.c:910:13: warning: 'nfs4_layoutget_release' defined but not used [-Wunused-function]
 static void nfs4_layoutget_release(void *calldata)
             ^~~~~~~~~~~~~~~~~~~~~~

Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-06-05 10:38:39 -04:00
Misono Tomohiro
3ca57bd620 btrfs: Check error of btrfs_iget in btrfs_search_path_in_tree_user
The patch introducing the ioctl was not the latest version at the time
of merging to the mainline and needs a fixup from this patch.

Fixes: ba637a252d30 ("btrfs: Check error of btrfs_iget() in btrfs_search_path_in_tree_user")
Signed-off-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-06-05 16:11:18 +02:00
Eric Sandeen
89c2e71123 xfs: use xfs_trans_getsb in xfs_sync_sb_buf
Use xfs_trans_getsb rather than reaching right in for
mp->m_sb_bp; I think this is more correct, and it facilitates
building this libxfs code in userspace as well.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-06-04 18:25:05 -07:00
Darrick J. Wong
d2e7366542 xfs: don't assert on corrupted unlinked inode list
Use the per-ag inode number verifiers to detect corrupt lists and error
out, instead of using ASSERTs.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2018-06-04 18:25:05 -07:00
Darrick J. Wong
2551a53053 xfs: explicitly pass buffer size to xfs_corruption_error
Explicitly pass the buffer length to xfs_corruption_error() instead of
assuming XFS_CORRUPTION_DUMP_LEN so that we avoid dumping off the end
of the buffer.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2018-06-04 18:25:05 -07:00
Darrick J. Wong
85ae01098c xfs: don't assert when on-disk btree pointers are garbage
Don't ASSERT when we encounter bad on-disk btree pointers in the debug
check functions.  Log the error to leave breadcrumbs and let the upper
layers deal with it.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2018-06-04 18:25:05 -07:00
Darrick J. Wong
e63a1008ee xfs: strengthen btree pointer checks before use
Instead of ASSERTing on null btree pointers in xfs_btree_ptr_to_daddr,
use the new block number verifiers to ensure that the btree pointer
doesn't point to any sensitive areas (AG headers, past-EOFS) and return
-EFSCORRUPTED if this is the case.  Remove the ASSERT because on-disk
corruptions shouldn't trigger ASSERTs.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2018-06-04 18:25:05 -07:00
Darrick J. Wong
4cbae4b816 xfs: introduce xfs_btree_debug_check_ptr
Make xfs_btree_check_ptr a non-debug function and introduce a new _debug
version that only runs when #ifdef DEBUG.   This will enable us to reuse
the checking logic with other parts of the btree code.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2018-06-04 18:25:05 -07:00
Darrick J. Wong
e4f45eff86 xfs: check directory bestfree information in the verifier
Create a variant of xfs_dir2_data_freefind that is suitable for use in a
verifier.  Because _freefind is called by the verifier, we simply
duplicate the _freefind function, convert the ASSERTs to return
__this_address, and modify the verifier to call our new function.  Once
we've made it impossible for directory blocks with bad bestfree data to
make it into the filesystem we can remove the DEBUG code from the
regular _freefind function.

Underlying argument: corruption of on-disk metadata should return
-EFSCORRUPTED instead of blowing ASSERTs.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2018-06-04 18:25:04 -07:00
Shirish Pargaonkar
ee25c6dd7b cifs: For SMB2 security informaion query, check for minimum sized security descriptor instead of sizeof FileAllInformation class
Validate_buf () function checks for an expected minimum sized response
passed to query_info() function.
For security information, the size of a security descriptor can be
smaller (one subauthority, no ACEs) than the size of the structure
that defines FileInfoClass of FileAllInformation.

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=199725
Cc: <stable@vger.kernel.org>
Signed-off-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
Reviewed-by: Noah Morrison <noah.morrison@rubrik.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2018-06-04 19:19:24 -05:00
Aurelien Aptel
57f933ce9f CIFS: Fix signing for SMB2/3
It seems Ronnie's preamble removal broke signing.

the signing functions are called when:

A) we send a request (to sign it)
B) when we recv a response (to check the signature).

On code path A, the smb2 header is in iov[1] but on code path B, the
smb2 header is in iov[0] (and there's only one vector).

So we have different iov indexes for the smb2 header but the signing
function always use index 1. Fix this by checking the nb of io vectors
in the signing function as a hint.

Signed-off-by: Aurelien Aptel <aaptel@suse.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2018-06-04 19:17:59 -05:00
Linus Torvalds
93e95fa574 Merge branch 'siginfo-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull siginfo updates from Eric Biederman:
 "This set of changes close the known issues with setting si_code to an
  invalid value, and with not fully initializing struct siginfo. There
  remains work to do on nds32, arc, unicore32, powerpc, arm, arm64, ia64
  and x86 to get the code that generates siginfo into a simpler and more
  maintainable state. Most of that work involves refactoring the signal
  handling code and thus careful code review.

  Also not included is the work to shrink the in kernel version of
  struct siginfo. That depends on getting the number of places that
  directly manipulate struct siginfo under control, as it requires the
  introduction of struct kernel_siginfo for the in kernel things.

  Overall this set of changes looks like it is making good progress, and
  with a little luck I will be wrapping up the siginfo work next
  development cycle"

* 'siginfo-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (46 commits)
  signal/sh: Stop gcc warning about an impossible case in do_divide_error
  signal/mips: Report FPE_FLTUNK for undiagnosed floating point exceptions
  signal/um: More carefully relay signals in relay_signal.
  signal: Extend siginfo_layout with SIL_FAULT_{MCEERR|BNDERR|PKUERR}
  signal: Remove unncessary #ifdef SEGV_PKUERR in 32bit compat code
  signal/signalfd: Add support for SIGSYS
  signal/signalfd: Remove __put_user from signalfd_copyinfo
  signal/xtensa: Use force_sig_fault where appropriate
  signal/xtensa: Consistenly use SIGBUS in do_unaligned_user
  signal/um: Use force_sig_fault where appropriate
  signal/sparc: Use force_sig_fault where appropriate
  signal/sparc: Use send_sig_fault where appropriate
  signal/sh: Use force_sig_fault where appropriate
  signal/s390: Use force_sig_fault where appropriate
  signal/riscv: Replace do_trap_siginfo with force_sig_fault
  signal/riscv: Use force_sig_fault where appropriate
  signal/parisc: Use force_sig_fault where appropriate
  signal/parisc: Use force_sig_mceerr where appropriate
  signal/openrisc: Use force_sig_fault where appropriate
  signal/nios2: Use force_sig_fault where appropriate
  ...
2018-06-04 15:23:48 -07:00
Linus Torvalds
d8aed8415b Merge branch 'userns-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull userns updates from Eric Biederman:
 "This is the last couple of vfs bits to enable root in a user namespace
  to mount and manipulate a filesystem with backing store (AKA not a
  virtual filesystem like proc, but a filesystem where the unprivileged
  user controls the content). The target filesystem for this work is
  fuse, and Miklos should be sending you the pull request for the fuse
  bits this merge window.

  The two key patches are "evm: Don't update hmacs in user ns mounts"
  and "vfs: Don't allow changing the link count of an inode with an
  invalid uid or gid". Those close small gaps in the vfs that would be a
  problem if an unprivileged fuse filesystem is mounted.

  The rest of the changes are things that are now safe to allow a root
  user in a user namespace to do with a filesystem they have mounted.
  The most interesting development is that remount is now safe"

* 'userns-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
  fs: Allow CAP_SYS_ADMIN in s_user_ns to freeze and thaw filesystems
  capabilities: Allow privileged user in s_user_ns to set security.* xattrs
  fs: Allow superblock owner to access do_remount_sb()
  fs: Allow superblock owner to replace invalid owners of inodes
  vfs: Allow userns root to call mknod on owned filesystems.
  vfs: Don't allow changing the link count of an inode with an invalid uid or gid
  evm: Don't update hmacs in user ns mounts
2018-06-04 15:21:19 -07:00
Darrick J. Wong
924cade4df xfs: don't return garbage buffers in xfs_da3_node_read
If we're reading a node in a dir/attr btree and the buffer comes off the
disk with a magic number we don't recognize, don't ASSERT and don't set
a garbage buffer type (0 also triggers ASSERTs).  Instead, report the
corruption, release the buffer, and return -EFSCORRUPTED because that's
what the dabtree is -- corrupt.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2018-06-04 14:45:30 -07:00
Darrick J. Wong
1f5c071d19 xfs: don't ASSERT on short form btree root pointer of zero
Don't ASSERT if the short form btree root pointer is zero.  Now that we
use xfs_verify_agbno to check all short form btree pointers, we'll let
that log the error and pass it to the upper layers.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2018-06-04 14:45:30 -07:00
Darrick J. Wong
eeee0d6a9b xfs: btree lookup shouldn't ASSERT on empty btree nodes
If a btree lookup encounters an empty btree node or an empty btree leaf
on a multi-level btree, that's evidence of a corrupt on-disk btree.
Therefore, we should return -EFSCORRUPTED to the upper levels, not an
ASSERT failure.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2018-06-04 14:45:30 -07:00
Darrick J. Wong
a37f7b127e xfs: xfs_alloc_get_rec should return EFSCORRUPTED for obvious bnobt corruption
Return -EFSCORRUPTED when the bnobt/cntbt return obviously corrupt
values, rather than letting them bounce around in the internal code.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2018-06-04 14:45:30 -07:00
Darrick J. Wong
b3986010ce xfs: remove redundant ASSERT on insufficient bestfree length in _leaf_addname
In xfs_dir2_leaf_addname we ASSERT if the length of the unused space
described by bestfree[0] is less the amount of space we wish to consume.
Immediately after it is a call to xfs_dir2_data_use_free where the
offset parameter is offset of the unused space and the length parameter
is the amount of space we wish to consume.  Both values (and the unused
space pointer) are passed into xfs_dir2_data_check_free, which also
validates that the region of unused space is big enough to cover the
space we wish to consume.  This is effectively the same check that the
ASSERT covers, and since a check failure results in a corruption message
being logged we can remove the ASSERT.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2018-06-04 14:45:29 -07:00
Darrick J. Wong
17ba2cc7b5 xfs: don't assert when reporting on-disk corruption while loading btree
Don't bother ASSERTing when we're already going to log and return the
corruption status.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2018-06-04 14:45:29 -07:00
Darrick J. Wong
aaacdd257f xfs: don't forbid setting dax flag on directories if device doesn't dax
On a directory, the DAX flag is merely a hint that files created in the
directory should have the DAX flag set at creation time.  We don't care
if the underlying device supports DAX or not because directory metadata
are always cached in DRAM.  We don't care if new files get the flag even
if the device doesn't support DAX because we always check for DAX
support before setting the VFS flag (S_DAX).

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2018-06-04 14:45:29 -07:00
Linus Torvalds
325520142b some smb3 fixes for stable, as well as addition of ftrace hooks for cifs.ko, and improvements in compounding and smbdirect (RDMA)
-----BEGIN PGP SIGNATURE-----
 
 iQHHBAABCgAxFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAlsUs3QTHHNtZnJlbmNo
 QGdtYWlsLmNvbQAKCRCKLL1wB3JPUWFIDACpB/nxQT0l+SMfXe+7W5Txy6x1hasi
 fFtcQd03dSzTwGN6ANPmsszktu5LVU1htifwSDIMj9x6MLE+VRCFhuWEszs6N/WD
 kQO3DQmuyB3w+s8rkN2V+mjamDuf0jTiq2uLsTvicUsyEmJtJgKJdcmB+bDxXTcE
 AQbh5qjMtQIl4IyI6cqxIRm/6laeVyGurJKYGp2G8UcyLg6ddv/vp5mXDIdoCnrv
 /gvzi/A7a7UwTEZolFa2vNT/KTLofG8fstdtfqlSa9qTWZwjhoFPKVd3eQPJV3Fj
 SjTpwbnM4q0MSY7XijPMkSpnBniaF3e5bf3Cp6QmtQeU+FmfkoYOy7EUlZJ1ar15
 xNjLEl/EEfy1C3hIwsdoVB8iLFPhFlAMONUWGjs2cyiTRMZx2OP4VgbyN8pcZUcJ
 mGNbBx2SQDJo8iR579dRh2gl6Gy/i5DNOClyIPfEpzGTURrzoV0C9EAnoweyK1qh
 wxvdA7SfPRi2rw9a4Kg+PJxWnd8oNRyhzUw=
 =EXH7
 -----END PGP SIGNATURE-----

Merge tag '4.18-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6

Pull cifs updates from Steve French:

 - smb3 fixes for stable

 - addition of ftrace hooks for cifs.ko

 - improvements in compounding and smbdirect (rdma)

* tag '4.18-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6: (38 commits)
  CIFS: Add support for direct pages in wdata
  CIFS: Use offset when reading pages
  CIFS: Add support for direct pages in rdata
  cifs: update multiplex loop to handle compounded responses
  cifs: remove header_preamble_size where it is always 0
  cifs: remove struct smb2_hdr
  CIFS: 511c54a2f6 adds a check for session expiry, status STATUS_NETWORK_SESSION_EXPIRED, however the server can also respond with STATUS_USER_SESSION_DELETED in cases where the session has been idle for some time and the server reaps the session to recover resources.
  cifs: change smb2_get_data_area_len to take a smb2_sync_hdr as argument
  cifs: update smb2_calc_size to use smb2_sync_hdr instead of smb2_hdr
  cifs: remove struct smb2_oplock_break_rsp
  cifs: remove rfc1002 header from all SMB2 response structures
  smb3: on reconnect set PreviousSessionId field
  smb3: Add posix create context for smb3.11 posix mounts
  smb3: add tracepoints for smb2/smb3 open
  cifs: add debug output to show nocase mount option
  smb3: add define for id for posix create context and corresponding struct
  cifs: update smb2_check_message to handle PDUs without a 4 byte length header
  smb3: allow "posix" mount option to enable new SMB311 protocol extensions
  smb3: add support for posix negotiate context
  cifs: allow disabling less secure legacy dialects
  ...
2018-06-04 14:42:46 -07:00
Linus Torvalds
1e43938bfb We've got 9 more patches for this merge window.
1. Andreas Gruenbacher contributed a patch to remove sd_jheightsize to
    greatly simplify some code.
 2. Andreas fixed some comments.
 3. Andreas fixed a glock recursion bug when allocation errors occur.
 4. Andreas improved the hole_size function so it returns the entire hole
    rather than figuring it out piecemeal.
 5. Andreas cleaned up gfs2_stuffed_write_end to remove a lot of redundancy.
 6. Andreas clarified code with regard to the way ordered writes are processed.
 7. Andreas did a bunch of improvements and cleanups of the iomap code to
    pave the way for iomap writes, which is a future patch set.
 8. I fixed a bug where block reservations can run off the end of a bitmap.
 9. I added Andreas to the MAINTAINERS file.
 -----BEGIN PGP SIGNATURE-----
 
 iQEcBAABAgAGBQJbFXdOAAoJENeLYdPf93o7hcMH/RmU2Pd3QTGxB2IOtnO/NUIg
 K+aOXIa6bVjGyNgx9C/to0tzXQ+E/MQIrhpaDd4Tlas4gsdTeOYtqZCYa7tfJFCz
 bNDcH+ix9f6hiFz5rMv7tCT9K2K3I3gnmjTgXak9OMQXsUqpB/QFs4ZSESdgChHl
 K+06R6tLfBMNGysWW3OIwA7Fb5i7tkWbxbyidI/GHVxZurRAIosvic7EOJT6oypr
 /N9PrTmHg+L8Kp/c7Cpge5hebfC+funXbcPk0vmpYAz9Aue8uerUEEzrI1QyPenQ
 +41xGG55R/qxkmYa998JvETAeMGeQjOYUOmNT7/QnZOn21fK1Q0OJWvt+teYjmE=
 =/I7+
 -----END PGP SIGNATURE-----

Merge tag 'gfs2-4.18.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2

Pull gfs2 updates from Bob Peterson:
 "We've got nine more patches for this merge window.

   - remove sd_jheightsize to greatly simplify some code (Andreas
     Gruenbacher)

   - fix some comments (Andreas)

   - fix a glock recursion bug when allocation errors occur (Andreas)

   - improve the hole_size function so it returns the entire hole rather
     than figuring it out piecemeal (Andreas)

   - clean up gfs2_stuffed_write_end to remove a lot of redundancy
     (Andreas)

   - clarify code with regard to the way ordered writes are processed
     (Andreas)

   - a bunch of improvements and cleanups of the iomap code to pave the
     way for iomap writes, which is a future patch set (Andreas)

   - fix a bug where block reservations can run off the end of a bitmap
     (Bob Peterson)

   - add Andreas to the MAINTAINERS file (Bob Peterson)"

* tag 'gfs2-4.18.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
  MAINTAINERS: Add Andreas Gruenbacher as a maintainer for gfs2
  gfs2: Iomap cleanups and improvements
  gfs2: Remove ordered write mode handling from gfs2_trans_add_data
  gfs2: gfs2_stuffed_write_end cleanup
  gfs2: hole_size improvement
  GFS2: gfs2_free_extlen can return an extent that is too long
  GFS2: Fix allocation error bug with recursive rgrp glocking
  gfs2: Update find_metapath comment
  gfs2: Remove sdp->sd_jheightsize
2018-06-04 14:36:38 -07:00
Linus Torvalds
8a4631144b dlm for 4.18
These three commits fix and clean up the flags dlm was
 using on its SCTP sockets.  The result improves the
 performance and fixes some bad connection delays.
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJbFXmBAAoJEDgbc8f8gGmqDYsP/12AlXOdUJmhyK3/GwYGmayx
 +mnUkQ/PacfsjXIFt1HHCM9LcSCk9gTES8suoFHkP5wfMMqKz6VoVG6k2sXn2jMS
 Ig+xCCFsTIR1n8ZB7ZAhQFkYgOGsKujDgLQ/7dYBos331zdZSD8CqTYTcti5twdE
 C4uERBR2RLDHGTzgAVEWSE/wc3QK8Yw8O9Yh1tebBFRyMU/tQl53HlviMozhJGlZ
 vg6Mm8eVE9OSUBRp4LzG34xYj7YD05j5cJBkQWVjTlg+giNCZnlLWQOv0rbiy2+M
 IXf/TmYKqbtdc4ATMrJStIdJSjErZeuDtWvKEtpaW/w2TjiwOTS56KG/W4kC4Wjs
 5MuHVFI3jRmgrL1iCwIotqu+BkE1dlHP81DbLKRy53lvit+zEWoh3G1V9rUSGDcd
 tEs36BwJERpmiTw2roOzajdLyi6Z2Jlv/GJCXUyQ2YXBV8WvTKeQbb+VSc6rCz1E
 WBZ/vOIbA8JibXVuMHw6iPwy99M5lk/6q8Sa1CS9GiYmGfM3fDesKpYGnKIu77AH
 4pnm6OgPihU+dZratHIJMddNAyorkcaYuOq0TkLiTJZas/MpGKuKHncEARUfcHno
 e1RSz1yPuN0OEJoVJBiWfe8nSAV/6vXvEPlWTMfwADfpXTYxTphqRW3VvC4N1Cpi
 gNiAKXvdIjRpGjInPeI3
 =jiYf
 -----END PGP SIGNATURE-----

Merge tag 'dlm-4.18' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm

Pull dlm updates from David Teigland:
 "These three commits fix and clean up the flags dlm was using on its
  SCTP sockets. This improves performance and fixes some bad connection
  delays"

* tag 'dlm-4.18' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm:
  dlm: remove O_NONBLOCK flag in sctp_connect_to_sock
  dlm: make sctp_connect_to_sock() return in specified time
  dlm: fix a clerical error when set SCTP_NODELAY
2018-06-04 14:34:06 -07:00
Chao Yu
dfa742803f f2fs: fix to clear FI_VOLATILE_FILE correctly
Thread A			Thread B
- f2fs_release_file
 - clear_inode_flag(FI_VOLATILE_FILE)
				- wb_writeback
				 - writeback_sb_inodes
				  - __writeback_single_inode
				   - do_writepages
				    - f2fs_write_data_pages
				     - __write_data_page
				     all volatile file's pages
				     are writebacked to storage
 - set_inode_flag(FI_DROP_CACHE)
 - filemap_fdatawrite

There is a hole that mm can flush all dirty pages of volatile file as
inode is not tagged with both FI_VOLATILE_FILE and FI_DROP_CACHE flags,
we should never writeback the page #0 and also it's unneeded to writeback
other pages.

This patch adjusts to relocate clear_inode_flag(FI_VOLATILE_FILE), so that
FI_VOLATILE_FILE flag can be remained before all dirty pages were dropped
to avoid issue.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-06-04 14:33:27 -07:00
Chao Yu
c29fd0c0e2 f2fs: let sync node IO interrupt async one
Although mixed sync/async IOs can have continuous LBA, as they have
different IO priority, block IO scheduler will add them into different
queues and commit them separately, result in splited IOs which causes
wrose performance.

This patch gives high priority to synchronous IO of nodes, means that
once synchronous flow starts, it can interrupt asynchronous writeback
flow of system flusher, so more big IOs can be expected.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-06-04 14:33:20 -07:00
Chao Yu
aae764ece6 f2fs: don't change wbc->sync_mode
We should never falsify wbc->sync_mode passed from mm, otherwise
mm can trigger writeback with wrong IO priority.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-06-04 14:32:44 -07:00
Chao Yu
a1f72ac2c0 f2fs: fix to update mtime correctly
If we change system time to the past, get_mtime() will return a
overflowed time, and SIT_I(sbi)->max_mtime will be udpated
incorrectly, this patch fixes the two issues.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-06-04 14:31:11 -07:00
Linus Torvalds
704996566f for-4.18-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAlsVV6wACgkQxWXV+ddt
 WDsaOhAAnrG3o/Bkn4loifYjiWQoArswwVJ2m5MvG14oPcn/c6wCUlE6mHdEsT8l
 WgltSRSd/3hJ1BcuWjhcR8RSLBRSaPcw4FAHikAdOWMrLXSKfA+quT/upXD00iDS
 +bnHoknc5vvk+Ow0KyGODAeMYfJm8gGB43BeVDr/mgrVRrwYY9r14s+6LX897NOT
 qObdxBjtedua0uIYZX5673siFaSbvJNQyGn6iKO9Ww+7bkWjgKO8qY8/3/gEaJn1
 q1303Oo65VOXQQKZed+NjPNakpqRmABaN79qTwWMnUI4qYBKi651a7sr5HljDpmc
 szY3uF/E7AOfJSDeTBdglFR1eQgv20ceeIx7FN0n6iUbuIVH9P85mBF6f76Hivpq
 fA39uPYxWwWq8RxmC09yqoTkuNS0zAQhov4n8XO0q/81Gfek0RpR+LMK15W02Ur8
 +5Pazkgz+xRYRPhpExe75pbM5N31KFJdf+4wpXeZoXKGTeiiCl/SZV3NzyUtemGg
 eN7ltHD/Y9iNpx7sgHn+2veXwn6psDad8ORxIlRyKDeot2pNZ4lnekPbqMC/IwAz
 hF3ZpcEo8v2Q2kqwCqw5evq4NSOvfG0cEuIqAKir1AgTn2nPH99qRSR1He9kluap
 GBPTPXNgBrhhdsh5kltI6CJcGE46bHsLG4LI5WvRs+wOtyaG5Ww=
 =jIvf
 -----END PGP SIGNATURE-----

Merge tag 'for-4.18-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs updates from David Sterba:
 "User visible features:

   - added support for the ioctl FS_IOC_FSGETXATTR, per-inode flags,
     successor of GET/SETFLAGS; now supports only existing flags:
     append, immutable, noatime, nodump, sync

   - 3 new unprivileged ioctls to allow users to enumerate subvolumes

   - dedupe syscall implementation does not restrict the range to 16MiB,
     though it still splits the whole range to 16MiB chunks

   - on user demand, rmdir() is able to delete an empty subvolume,
     export the capability in sysfs

   - fix inode number types in tracepoints, other cleanups

   - send: improved speed when dealing with a large removed directory,
     measurements show decrease from 2000 minutes to 2 minutes on a
     directory with 2 million entries

   - pre-commit check of superblock to detect a mysterious in-memory
     corruption

   - log message updates

  Other changes:

   - orphan inode cleanup improved, does no keep long-standing
     reservations that could lead up to early ENOSPC in some cases

   - slight improvement of handling snapshotted NOCOW files by avoiding
     some unnecessary tree searches

   - avoid OOM when dealing with many unmergeable small extents at flush
     time

   - speedup conversion of free space tree representations from/to
     bitmap/tree

   - code refactoring, deletion, cleanups:
      + delayed refs
      + delayed iput
      + redundant argument removals
      + memory barrier cleanups
      + remove a redundant mutex supposedly excluding several ioctls to
        run in parallel

   - new tracepoints for blockgroup manipulation

   - more sanity checks of compressed headers"

* tag 'for-4.18-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (183 commits)
  btrfs: Add unprivileged version of ino_lookup ioctl
  btrfs: Add unprivileged ioctl which returns subvolume's ROOT_REF
  btrfs: Add unprivileged ioctl which returns subvolume information
  Btrfs: clean up error handling in btrfs_truncate()
  btrfs: Factor out write portion of btrfs_get_blocks_direct
  btrfs: Factor out read portion of btrfs_get_blocks_direct
  btrfs: return ENOMEM if path allocation fails in btrfs_cross_ref_exist
  btrfs: raid56: Remove VLA usage
  btrfs: return error value if create_io_em failed in cow_file_range
  btrfs: drop useless member qgroup_reserved of btrfs_pending_snapshot
  btrfs: drop unused parameter qgroup_reserved
  btrfs: balance dirty metadata pages in btrfs_finish_ordered_io
  btrfs: lift some btrfs_cross_ref_exist checks in nocow path
  btrfs: Remove fs_info argument from btrfs_uuid_tree_rem
  btrfs: Remove fs_info argument from btrfs_uuid_tree_add
  Btrfs: remove unused check of skip_locking
  Btrfs: remove always true check in unlock_up
  Btrfs: grab write lock directly if write_lock_level is the max level
  Btrfs: move get root out of btrfs_search_slot to a helper
  Btrfs: use more straightforward extent_buffer_uptodate check
  ...
2018-06-04 14:29:13 -07:00
Linus Torvalds
e3a44fd7e6 affs-for-4.18-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAlsVSBYACgkQxWXV+ddt
 WDswpQ//aeSKpmGtrhEl0kh2zt+0yyg9aBO74eN8xd27ADYcF4W3vDH91YNgz/HT
 y0Lu+vO0Y1ztKUpf7ho1fZq2By5wNvq/7XCBOYIFquXxSm1jW4TxdIzblEGvaz1N
 vBfb3wwlgvL6cVmjZjlQSW5bDwdX8yeSVPRfu/aWH2SAi/vjSmcNBTOl8cX3rXNE
 ae2bILFFOJGTzQbMPe8ORD5PuL0n3DCs+3EI7FVay5RAekM9+nglnk/mVX7xZAON
 Y2a2QEUNI2MOlwA2s7wfjWIJqp7XZzCzkop62p8omYX5jA5CtLZtBKGrlwrBu1tw
 nZ1XlTQC7R95HBc3Y7XHE2kiXjQ/SPEcQp5fXUzjfZPVWuLTrq0r9OgFxcyCjAGx
 yaS8V6WMdcloFY1NxG3SIe+4OrxCrnbd9BpUmsr9nTQd2elaV5W1I3a5Wqtvx2eF
 B5d8I/ErmuIGRvyIBJVj5bmXqbaN+F2n+KSNvGQS4GNAV4jV21mnlj04Cjava5ma
 6yzwZu1xLCVSPyjFYrxDVV//qq4jfs7Ou2Eo6W7Bwk5K6fOm6TQzwH3yUA6gMUip
 SdC0tNy2j5fTS6EETsxYMNGY74mNgmroWJZ/7n5nrDBkV3LZFLfZFevTE+wenzip
 IhyTL9tpWxrchPHsA3GdRsxSXKDYArqFL9D09W3NDtpjjkxvKs0=
 =slnj
 -----END PGP SIGNATURE-----

Merge tag 'affs-for-4.18-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull affs fix from David Sterba:
 "A potential memory leak fix for AFFS"

* tag 'affs-for-4.18-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  affs: fix potential memory leak when parsing option 'prefix'
2018-06-04 14:27:09 -07:00
Linus Torvalds
408afb8d78 Merge branch 'work.aio-1' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull aio updates from Al Viro:
 "Majority of AIO stuff this cycle. aio-fsync and aio-poll, mostly.

  The only thing I'm holding back for a day or so is Adam's aio ioprio -
  his last-minute fixup is trivial (missing stub in !CONFIG_BLOCK case),
  but let it sit in -next for decency sake..."

* 'work.aio-1' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (46 commits)
  aio: sanitize the limit checking in io_submit(2)
  aio: fold do_io_submit() into callers
  aio: shift copyin of iocb into io_submit_one()
  aio_read_events_ring(): make a bit more readable
  aio: all callers of aio_{read,write,fsync,poll} treat 0 and -EIOCBQUEUED the same way
  aio: take list removal to (some) callers of aio_complete()
  aio: add missing break for the IOCB_CMD_FDSYNC case
  random: convert to ->poll_mask
  timerfd: convert to ->poll_mask
  eventfd: switch to ->poll_mask
  pipe: convert to ->poll_mask
  crypto: af_alg: convert to ->poll_mask
  net/rxrpc: convert to ->poll_mask
  net/iucv: convert to ->poll_mask
  net/phonet: convert to ->poll_mask
  net/nfc: convert to ->poll_mask
  net/caif: convert to ->poll_mask
  net/bluetooth: convert to ->poll_mask
  net/sctp: convert to ->poll_mask
  net/tipc: convert to ->poll_mask
  ...
2018-06-04 13:57:43 -07:00
Linus Torvalds
b058efc1ac Merge branch 'work.lookup' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull dcache lookup cleanups from Al Viro:
 "Cleaning ->lookup() instances up - mostly d_splice_alias() conversions"

* 'work.lookup' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (29 commits)
  switch the rest of procfs lookups to d_splice_alias()
  procfs: switch instantiate_t to d_splice_alias()
  don't bother with tid_fd_revalidate() in lookups
  proc_lookupfd_common(): don't bother with instantiate unless the file is open
  procfs: get rid of ancient BS in pid_revalidate() uses
  cifs_lookup(): switch to d_splice_alias()
  cifs_lookup(): cifs_get_inode_...() never returns 0 with *inode left NULL
  9p: unify paths in v9fs_vfs_lookup()
  ncp_lookup(): use d_splice_alias()
  hfsplus: switch to d_splice_alias()
  hfs: don't allow mounting over .../rsrc
  hfs: use d_splice_alias()
  omfs_lookup(): report IO errors, use d_splice_alias()
  orangefs_lookup: simplify
  openpromfs: switch to d_splice_alias()
  xfs_vn_lookup: simplify a bit
  adfs_lookup: do not fail with ENOENT on negatives, use d_splice_alias()
  adfs_lookup_byname: .. *is* taken care of in fs/namei.c
  romfs_lookup: switch to d_splice_alias()
  qnx6_lookup: switch to d_splice_alias()
  ...
2018-06-04 13:46:22 -07:00
David Howells
1a025028d4 rxrpc: Fix handling of call quietly cancelled out on server
Sometimes an in-progress call will stop responding on the fileserver when
the fileserver quietly cancels the call with an internally marked abort
(RX_CALL_DEAD), without sending an ABORT to the client.

This causes the client's call to eventually expire from lack of incoming
packets directed its way, which currently leads to it being cancelled
locally with ETIME.  Note that it's not currently clear as to why this
happens as it's really hard to reproduce.

The rotation policy implement by kAFS, however, doesn't differentiate
between ETIME meaning we didn't get any response from the server and ETIME
meaning the call got cancelled mid-flow.  The latter leads to an oops when
fetching data as the rotation partially resets the afs_read descriptor,
which can result in a cleared page pointer being dereferenced because that
page has already been filled.

Handle this by the following means:

 (1) Set a flag on a call when we receive a packet for it.

 (2) Store the highest packet serial number so far received for a call
     (bearing in mind this may wrap).

 (3) If, when the "not received anything recently" timeout expires on a
     call, we've received at least one packet for a call and the connection
     as a whole has received packets more recently than that call, then
     cancel the call locally with ECONNRESET rather than ETIME.

     This indicates that the call was definitely in progress on the server.

 (4) In kAFS, if the rotation algorithm sees ECONNRESET rather than ETIME,
     don't try the next server, but rather abort the call.

     This avoids the oops as we don't try to reuse the afs_read struct.
     Rather, as-yet ungotten pages will be reread at a later data.

Also:

 (5) Add an rxrpc tracepoint to log detection of the call being reset.

Without this, I occasionally see an oops like the following:

    general protection fault: 0000 [#1] SMP PTI
    ...
    RIP: 0010:_copy_to_iter+0x204/0x310
    RSP: 0018:ffff8800cae0f828 EFLAGS: 00010206
    RAX: 0000000000000560 RBX: 0000000000000560 RCX: 0000000000000560
    RDX: ffff8800cae0f968 RSI: ffff8800d58b3312 RDI: 0005080000000000
    RBP: ffff8800cae0f968 R08: 0000000000000560 R09: ffff8800ca00f400
    R10: ffff8800c36f28d4 R11: 00000000000008c4 R12: ffff8800cae0f958
    R13: 0000000000000560 R14: ffff8800d58b3312 R15: 0000000000000560
    FS:  00007fdaef108080(0000) GS:ffff8800ca680000(0000) knlGS:0000000000000000
    CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00007fb28a8fa000 CR3: 00000000d2a76002 CR4: 00000000001606e0
    Call Trace:
     skb_copy_datagram_iter+0x14e/0x289
     rxrpc_recvmsg_data.isra.0+0x6f3/0xf68
     ? trace_buffer_unlock_commit_regs+0x4f/0x89
     rxrpc_kernel_recv_data+0x149/0x421
     afs_extract_data+0x1e0/0x798
     ? afs_wait_for_call_to_complete+0xc9/0x52e
     afs_deliver_fs_fetch_data+0x33a/0x5ab
     afs_deliver_to_call+0x1ee/0x5e0
     ? afs_wait_for_call_to_complete+0xc9/0x52e
     afs_wait_for_call_to_complete+0x12b/0x52e
     ? wake_up_q+0x54/0x54
     afs_make_call+0x287/0x462
     ? afs_fs_fetch_data+0x3e6/0x3ed
     ? rcu_read_lock_sched_held+0x5d/0x63
     afs_fs_fetch_data+0x3e6/0x3ed
     afs_fetch_data+0xbb/0x14a
     afs_readpages+0x317/0x40d
     __do_page_cache_readahead+0x203/0x2ba
     ? ondemand_readahead+0x3a7/0x3c1
     ondemand_readahead+0x3a7/0x3c1
     generic_file_buffered_read+0x18b/0x62f
     __vfs_read+0xdb/0xfe
     vfs_read+0xb2/0x137
     ksys_read+0x50/0x8c
     do_syscall_64+0x7d/0x1a0
     entry_SYSCALL_64_after_hwframe+0x49/0xbe

Note the weird value in RDI which is a result of trying to kmap() a NULL
page pointer.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-04 16:06:26 -04:00
Linus Torvalds
9214407d12 fasync fix for v4.18
-----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJbFTxwAAoJEAAOaEEZVoIVfOQQAIMmcJjx8c+86TS5bVpi9yWN
 z6C2QEW1vMN2Z65dvjquSbF81yY54oCgiV3/6iGOtcOOI3rH/7f+CKb+4xRskpFv
 PSVXKXkXKOGRnnhcz2X8F+FBbd4xLxzyB+Ff9HCWVv2L76d7Uu8wUcbSpOdC7GOd
 pXHQkA+WNPEnHpr4To09ZwS14IzVCAhFii7MvOU6IXxzbP473/aPpFjrwa+mbkkz
 J7U4bijaLL5eGJGE0ElHOaeF/iAhOBEFGrkT3lR78ZA+spnqVXlv2U9E29zg9H98
 JyFM7vMe9MtR///Ve6BeLNKrDnwCmBo9ScMR8eWUWhrOInGX9yU0UR9XkD5TzA3U
 hQHW2ckjloykp+HuvbUOF/Aut8GczbyaDsAHZ6/fAxgNGnFeJ7yjR5IPKFkKekPt
 H+ls9j4C9yZpTLXioNqJz6d2IOS5MKfjPFC0c2ItO11jzNdbmjO7ciIq4EecSR51
 FKeoUHyjhfak94Dh7x3IQEtnm1o5CbbSNvQ8g6PJYXFBi88jcmQkLmZlULDV3vyg
 9febfSYpJhamJ+GUn8av+HziJG/1o1sOLrIVi2qGRIBJiUO5tAYY6UVvo1xKuzQ3
 BMSRVnqVbs749lUV0QrYoFVdMvVxL6l0+8p4Xgx7q5BR4gG73e4yaBnXLeStMIYo
 oSFy/EH3MYR87m5Yk45E
 =rqcZ
 -----END PGP SIGNATURE-----

Merge tag 'locks-v4.18-1' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux

Pull fasync fix from Jeff Layton:
 "Just a single fix for a deadlock in the fasync handling code that
  Kirill observed while testing.

  The fix is to change the fa_lock to be rwlock_t, and use a read lock
  in kill_fasync_rcu"

* tag 'locks-v4.18-1' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux:
  fasync: Fix deadlock between task-context and interrupt-context kill_fasync()
2018-06-04 13:05:02 -07:00
Linus Torvalds
eeee3149aa There's been a fair amount of work in the docs tree this time around,
including:
 
  - Extensive RST conversions and organizational work in the
    memory-management docs thanks to Mike Rapoport.
 
  - An update of Documentation/features from Andrea Parri and a script to
    keep it updated.
 
  - Various LICENSES updates from Thomas, along with a script to check SPDX
    tags.
 
  - Work to fix dangling references to documentation files; this involved a
    fair number of one-liner comment changes outside of Documentation/
 
 ...and the usual list of documentation improvements, typo fixes, etc.
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJbFTkKAAoJEI3ONVYwIuV6t24P/0K9qltHkLwsBo2fbGu/emem
 mb1QrZCFZGebKVrCIvET3YcT0q0xPW+ZldwMQYEUeCcu/vD3cGHGXlDbVJCa1fFD
 2OS10W/sEObPnREtlHO/zAzpapKP9DO1/f6NhO55iBJLGOCgoLL5xvSqgsI8MTGd
 vcJDXLitkh4CJEcfNLkQt8dEZzq9Tb6wdSFIvZBBXRNon2ItVN92D5xoQ0wtB+qt
 KmcGYofajK9bjtZpnC4iNg3i+zdwkd80bGTEN9f0hJTRZK5emCILk8fip8CMhRuB
 iwmcqb2RnMLydNLyK9RSs6OS5z3G4fYu9llRtLlZBAupcjRVpalWaBGxLOVO6jBG
 mvkqdKPMtxV4c7NvwKwFQL9dcjtxsxO4RDRYVWN82dS1L6WKKk8UvTuJUBLH0YA5
 af7ZKn7mJVhJ1cxPblaEBOBM3oQuk57LLkjmcpMOXyJ/IOkTIuV1Ezht+XzFyFQv
 VWSyekiKo+8D6WHACPTaWiizjW15e8CyP+WIhKzJyn7VQQrZwhsOS+R//ITsuvQ0
 vRdZ20lwUeBhR+mnXd5NfIo2w7G+OiqiREVAgxjgRrS0PnkzWG7lzzcSVU8HTfT4
 S7VXqval2a9Xg+N8aU2JUe49W858J8hKvIa98hBxGoZa84wxOGtEo7pIKhnMwMSe
 Uhkh/1/bQMxsK3fBEF74
 =I6FG
 -----END PGP SIGNATURE-----

Merge tag 'docs-4.18' of git://git.lwn.net/linux

Pull documentation updates from Jonathan Corbet:
 "There's been a fair amount of work in the docs tree this time around,
  including:

   - Extensive RST conversions and organizational work in the
     memory-management docs thanks to Mike Rapoport.

   - An update of Documentation/features from Andrea Parri and a script
     to keep it updated.

   - Various LICENSES updates from Thomas, along with a script to check
     SPDX tags.

   - Work to fix dangling references to documentation files; this
     involved a fair number of one-liner comment changes outside of
     Documentation/

  ... and the usual list of documentation improvements, typo fixes, etc"

* tag 'docs-4.18' of git://git.lwn.net/linux: (103 commits)
  Documentation: document hung_task_panic kernel parameter
  docs/admin-guide/mm: add high level concepts overview
  docs/vm: move ksm and transhuge from "user" to "internals" section.
  docs: Use the kerneldoc comments for memalloc_no*()
  doc: document scope NOFS, NOIO APIs
  docs: update kernel versions and dates in tables
  docs/vm: transhuge: split userspace bits to admin-guide/mm/transhuge
  docs/vm: transhuge: minor updates
  docs/vm: transhuge: change sections order
  Documentation: arm: clean up Marvell Berlin family info
  Documentation: gpio: driver: Fix a typo and some odd grammar
  docs: ranoops.rst: fix location of ramoops.txt
  scripts/documentation-file-ref-check: rewrite it in perl with auto-fix mode
  docs: uio-howto.rst: use a code block to solve a warning
  mm, THP, doc: Add document for thp_swpout/thp_swpout_fallback
  w1: w1_io.c: fix a kernel-doc warning
  Documentation/process/posting: wrap text at 80 cols
  docs: admin-guide: add cgroup-v2 documentation
  Revert "Documentation/features/vm: Remove arch support status file for 'pte_special'"
  Documentation: refcount-vs-atomic: Update reference to LKMM doc.
  ...
2018-06-04 12:34:27 -07:00
Trond Myklebust
3f0b3cf46e NFS: Filter cache invalidation when holding a delegation
If the client holds a delegation, then ensure we filter out attempts
to invalidate the size, owner, group owner, or mode unless we made the
change, in which case, check that NFS_INO_REVAL_FORCED is set by the
caller.
Always filter out attempts to invalidate the change attribute and
size, since we are authoritative for those.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-06-04 15:03:58 -04:00
Trond Myklebust
4ebe83af20 NFS: Ignore NFS_INO_REVAL_FORCED in nfs_check_inode_attributes()
If we hold a delegation, we should not need to call
nfs_check_inode_attributes() since we already know which attributes
are valid, and which ones may still need revalidation. The state
of the NFS_INO_REVAL_FORCED flag is therefore irrelevant.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-06-04 15:03:58 -04:00
Trond Myklebust
c80d17c55d NFS: Improve caching while holding a delegation
Make sure that the client completely ignores change attribute and size
changes on the server when it holds a delegation.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-06-04 15:03:58 -04:00
Trond Myklebust
0b467264d0 NFS: Fix attribute revalidation
Don't mark attributes as invalid just because they have changed. Instead,
for the purposes of adjusting the attribute cache timeout, keep a
separate variable that tracks whether or not a change occurred.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-06-04 15:03:58 -04:00
Trond Myklebust
6a97d02dfe NFS: fix up nfs_setattr_update_inode
Always try to set the attributes, even if we don't have a valid struct
nfs_fattr.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-06-04 15:03:58 -04:00
Trond Myklebust
97c2c17af9 NFSv4: Ensure the inode is clean when we set a delegation
If there are attributes that are still invalid when we set a delegation,
then we need to set the NFS_INO_REVAL_FORCED flag.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-06-04 15:03:58 -04:00
Trond Myklebust
7c6726546c NFSv4: Ignore NFS_INO_REVAL_FORCED in nfs4_proc_access
If we hold a delegation, we don't need to care about whether or not
the inode attributes are up to date. We know we can cache the results
of this call regardless.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-06-04 15:03:20 -04:00
Chengguang Xu
3619aa8b74 ceph: show ino32 if the value is different with default
In current ceph_show_options(), there is no item for showing 'ino32',
so add showing mount option 'ino32' if the value is different with
default.

Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-06-04 20:46:02 +02:00
Chengguang Xu
8db0c7596f ceph: strengthen rsize/wsize/readdir_max_bytes validation
The check (intval < PAGE_SIZE) will involve type cast, so even when
specifying negative value to rsize/wsize/readdir_max_bytes, it will
pass the validation check successfully.

Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-06-04 20:46:01 +02:00
Chengguang Xu
c36ed50de2 ceph: fix alignment of rasize
On currently logic:
when I specify rasize=0~1 then it will be 4096.
when I specify rasize=2~4097 then it will be 8192.

Make it the same as rsize & wsize.

Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-06-04 20:46:01 +02:00
Luis Henriques
73fb0949cf ceph: fix use-after-free in ceph_statfs()
KASAN found an UAF in ceph_statfs.  This was a one-off bug but looking at
the code it looks like the monmap access needs to be protected as it can
be modified while we're accessing it.  Fix this by protecting the access
with the monc->mutex.

  BUG: KASAN: use-after-free in ceph_statfs+0x21d/0x2c0
  Read of size 8 at addr ffff88006844f2e0 by task trinity-c5/304

  CPU: 0 PID: 304 Comm: trinity-c5 Not tainted 4.17.0-rc6+ #172
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/01/2014
  Call Trace:
   dump_stack+0xa5/0x11b
   ? show_regs_print_info+0x5/0x5
   ? kmsg_dump_rewind+0x118/0x118
   ? ceph_statfs+0x21d/0x2c0
   print_address_description+0x73/0x2b0
   ? ceph_statfs+0x21d/0x2c0
   kasan_report+0x243/0x360
   ceph_statfs+0x21d/0x2c0
   ? ceph_umount_begin+0x80/0x80
   ? kmem_cache_alloc+0xdf/0x1a0
   statfs_by_dentry+0x79/0xb0
   vfs_statfs+0x28/0x110
   user_statfs+0x8c/0xe0
   ? vfs_statfs+0x110/0x110
   ? __fdget_raw+0x10/0x10
   __se_sys_statfs+0x5d/0xa0
   ? user_statfs+0xe0/0xe0
   ? mutex_unlock+0x1d/0x40
   ? __x64_sys_statfs+0x20/0x30
   do_syscall_64+0xee/0x290
   ? syscall_return_slowpath+0x1c0/0x1c0
   ? page_fault+0x1e/0x30
   ? syscall_return_slowpath+0x13c/0x1c0
   ? prepare_exit_to_usermode+0xdb/0x140
   ? syscall_trace_enter+0x330/0x330
   ? __put_user_4+0x1c/0x30
   entry_SYSCALL_64_after_hwframe+0x44/0xa9

  Allocated by task 130:
   __kmalloc+0x124/0x210
   ceph_monmap_decode+0x1c1/0x400
   dispatch+0x113/0xd20
   ceph_con_workfn+0xa7e/0x44e0
   process_one_work+0x5f0/0xa30
   worker_thread+0x184/0xa70
   kthread+0x1a0/0x1c0
   ret_from_fork+0x35/0x40

  Freed by task 130:
   kfree+0xb8/0x210
   dispatch+0x15a/0xd20
   ceph_con_workfn+0xa7e/0x44e0
   process_one_work+0x5f0/0xa30
   worker_thread+0x184/0xa70
   kthread+0x1a0/0x1c0
   ret_from_fork+0x35/0x40

Signed-off-by: Luis Henriques <lhenriques@suse.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-06-04 20:46:01 +02:00
Yan, Zheng
aae1a442f8 ceph: prevent i_version from going back
inode info from non-auth can be stale.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-06-04 20:46:01 +02:00
Yan, Zheng
fa466743a9 ceph: fix wrong check for the case of updating link count
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-06-04 20:46:01 +02:00
Ilya Dryomov
c843d13cae libceph: make abort_on_full a per-osdc setting
The intent behind making it a per-request setting was that it would be
set for writes, but not for reads.  As it is, the flag is set for all
fs/ceph requests except for pool perm check stat request (technically
a read).

ceph_osdc_abort_on_full() skips reads since the previous commit and
I don't see a use case for marking individual requests.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Acked-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
2018-06-04 20:46:00 +02:00
Yan, Zheng
a57d9064e4 ceph: flush pending works before shutdown super
Pending works hold inode references, which cause "Busy inodes after
unmount" warning.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-06-04 20:45:57 +02:00
Yan, Zheng
12b69d5f6f ceph: abort osd requests on force umount
This avoid force umount waiting on page writeback:

  io_schedule+0xd/0x30
  wait_on_page_bit_common+0xc6/0x130
  __filemap_fdatawait_range+0xbd/0x100
  filemap_fdatawait_keep_errors+0x15/0x40
  sync_inodes_sb+0x1cf/0x240
  sync_filesystem+0x52/0x90
  generic_shutdown_super+0x1d/0x110
  ceph_kill_sb+0x28/0x80 [ceph]
  deactivate_locked_super+0x35/0x60
  cleanup_mnt+0x36/0x70
  task_work_run+0x79/0xa0
  exit_to_usermode_loop+0x62/0x70
  do_syscall_64+0xdb/0xf0
  entry_SYSCALL_64_after_hwframe+0x44/0xa9
  0xffffffffffffffff

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-06-04 20:45:57 +02:00
Luis Henriques
8c6286f1c6 ceph: fix st_nlink stat for directories
Currently, calling stat on a cephfs directory returns 1 for st_nlink.
This behaviour has recently changed in the fuse client, as some
applications seem to expect this value to be either 0 (if it's
unlinked) or 2 + number of subdirectories.  This behaviour was changed
in the fuse client with commit 67c7e4619188 ("client: use common
interp of st_nlink for dirs").

This patch modifies the kernel client to have a similar behaviour.

Link: https://tracker.ceph.com/issues/23873
Signed-off-by: Luis Henriques <lhenriques@suse.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-06-04 20:45:56 +02:00
Yan, Zheng
597817ddbb ceph: support file lock on directory
Link: http://tracker.ceph.com/issues/24028
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-06-04 20:45:56 +02:00
Ilya Dryomov
6dd4940ba5 ceph: show wsize only if non-default
This is how it was before commit 95cca2b44e ("ceph: limit osd write
size") went in.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-06-04 20:45:56 +02:00
Yan, Zheng
4985d6f9e5 ceph: handle the new nfiles/nsubdirs fields in cap message
Without these new fields, stale st_size is returned in following
case.

1. MDS modifies a directory
2. MDS issues CEPH_CAP_ANY_SHARED to client
3. The client satifies stat(2) by its cached metadata. set st_size
   to "i_files + i_subdirs".

Link: http://tracker.ceph.com/issues/23855
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-06-04 20:45:56 +02:00
Yan, Zheng
a1c6b83581 ceph: define argument structure for handle_cap_grant
The data structure includes the versioned feilds of cap message.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-06-04 20:45:56 +02:00
Yan, Zheng
2af54a72b5 ceph: update i_files/i_subdirs only when Fs cap is issued
In MDS, file/subdir counts of a directory inode are protected by
filelock. In request reply without Fs cap, nfiles/nsubdirs can be
stale.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-06-04 20:45:55 +02:00
Yan, Zheng
49a9f4f671 ceph: always get rstat from auth mds
rstat is not tracked by capability. client can't know if rstat from
non-auth mds is uptodate or not.

Link: http://tracker.ceph.com/issues/23538
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-06-04 20:45:55 +02:00
Yan, Zheng
4e9906e798 ceph: use bit flags to define vxattr attributes
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Acked-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-06-04 20:45:55 +02:00
Adam Manzanares
9a6d9a62e0 fs: aio ioprio use ioprio_check_cap ret val
Previously the value was ignored.

Signed-off-by: Adam Manzanares <adam.manzanares@wdc.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-06-04 14:20:39 -04:00
Linus Torvalds
f956d08a56 Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull misc vfs updates from Al Viro:
 "Misc bits and pieces not fitting into anything more specific"

* 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  vfs: delete unnecessary assignment in vfs_listxattr
  Documentation: filesystems: update filesystem locking documentation
  vfs: namei: use path_equal() in follow_dotdot()
  fs.h: fix outdated comment about file flags
  __inode_security_revalidate() never gets NULL opt_dentry
  make xattr_getsecurity() static
  vfat: simplify checks in vfat_lookup()
  get rid of dead code in d_find_alias()
  it's SB_BORN, not MS_BORN...
  msdos_rmdir(): kill BS comment
  remove rpc_rmdir()
  fs: avoid fdput() after failed fdget() in vfs_dedupe_file_range()
2018-06-04 10:14:28 -07:00
Linus Torvalds
cf626b0da7 Merge branch 'hch.procfs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull procfs updates from Al Viro:
 "Christoph's proc_create_... cleanups series"

* 'hch.procfs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (44 commits)
  xfs, proc: hide unused xfs procfs helpers
  isdn/gigaset: add back gigaset_procinfo assignment
  proc: update SIZEOF_PDE_INLINE_NAME for the new pde fields
  tty: replace ->proc_fops with ->proc_show
  ide: replace ->proc_fops with ->proc_show
  ide: remove ide_driver_proc_write
  isdn: replace ->proc_fops with ->proc_show
  atm: switch to proc_create_seq_private
  atm: simplify procfs code
  bluetooth: switch to proc_create_seq_data
  netfilter/x_tables: switch to proc_create_seq_private
  netfilter/xt_hashlimit: switch to proc_create_{seq,single}_data
  neigh: switch to proc_create_seq_data
  hostap: switch to proc_create_{seq,single}_data
  bonding: switch to proc_create_seq_data
  rtc/proc: switch to proc_create_single_data
  drbd: switch to proc_create_single
  resource: switch to proc_create_seq_data
  staging/rtl8192u: simplify procfs code
  jfs: simplify procfs code
  ...
2018-06-04 10:00:01 -07:00
Linus Torvalds
9c50eafc32 Merge branch 'work.rmdir' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull rmdir update from Al Viro:
 "More shrink_dcache_parent()-related stuff - killing the main source of
  potentially contended calls of that on large subtrees"

* 'work.rmdir' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  rmdir(),rename(): do shrink_dcache_parent() only on success
2018-06-04 09:53:33 -07:00
Trond Myklebust
2f28dc385a NFSv4: Don't ask for delegated attributes when adding a hard link
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-06-04 12:07:07 -04:00
Trond Myklebust
771734f291 NFSv4: Don't ask for delegated attributes when revalidating the inode
Again, when revalidating the inode, we don't need to ask for attributes
for which we are authoritative.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-06-04 12:07:07 -04:00
Trond Myklebust
a841b54dbd NFS: Pass the inode down to the getattr() callback
Allow the getattr() callback to check things like whether or not we hold
a delegation so that it can adjust the attributes that it is asking for.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-06-04 12:07:07 -04:00
Trond Myklebust
30846df06f NFSv4: Don't request size+change attribute if they are delegated to us
When we hold a delegation, we should not need to request attributes such
as the file size or the change attribute. For some servers, avoiding
asking for these unneeded attributes can improve the overall system
performance.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-06-04 12:07:07 -04:00
Linus Torvalds
06c86e66d6 Merge branch 'work.dcache' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull dcache updates from Al Viro:
 "This is the first part of dealing with livelocks etc around
  shrink_dcache_parent()."

* 'work.dcache' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  restore cond_resched() in shrink_dcache_parent()
  dput(): turn into explicit while() loop
  dcache: move cond_resched() into the end of __dentry_kill()
  d_walk(): kill 'finish' callback
  d_invalidate(): unhash immediately
2018-06-04 08:57:36 -07:00
Linus Torvalds
f459c34538 for-4.18/block-20180603
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABCAAGBQJbFIrHAAoJEPfTWPspceCm2+kQAKo7o7HL30aRxJYu+gYafkuW
 PV47zr3e4vhMDEzDaMsh1+V7I7bm3uS+NZu6cFbcV+N9KXFpeb4V4Hvvm5cs+OC3
 WCOBi4eC1h4qnDQ3ZyySrCMN+KHYJ16pZqddEjqw+fhVudx8i+F+jz3Y4ZMDDc3q
 pArKZvjKh2wEuYXUMFTjaXY46IgPt+er94OwvrhyHk+4AcA+Q/oqSfSdDahUC8jb
 BVR3FV4I3NOHUaru0RbrUko13sVZSboWPCIFrlTDz8xXcJOnVHzdVS1WLFDXLHnB
 O8q9cADCfa4K08kz68RxykcJiNxNvz5ChDaG0KloCFO+q1tzYRoXLsfaxyuUDg57
 Zd93OFZC6hAzXdhclDFIuPET9OQIjDzwphodfKKmDsm3wtyOtydpA0o7JUEongp0
 O1gQsEfYOXmQsXlo8Ot+Z7Ne/HvtGZ91JahUa/59edxQbcKaMrktoyQsQ/d1nOEL
 4kXID18wPcFHWRQHYXyVuw6kbpRtQnh/U2m1eenSZ7tVQHwoe6mF3cfSf5MMseak
 k8nAnmsfEvOL4Ar9ftg61GOrImaQlidxOC2A8fmY5r0Sq/ZldvIFIZizsdTTCcni
 8SOTxcQowyqPf5NvMNQ8cKqqCJap3ppj4m7anZNhbypDIF2TmOWsEcXcMDn4y9on
 fax14DPLo59gBRiPCn5f
 =nga/
 -----END PGP SIGNATURE-----

Merge tag 'for-4.18/block-20180603' of git://git.kernel.dk/linux-block

Pull block updates from Jens Axboe:

 - clean up how we pass around gfp_t and
   blk_mq_req_flags_t (Christoph)

 - prepare us to defer scheduler attach (Christoph)

 - clean up drivers handling of bounce buffers (Christoph)

 - fix timeout handling corner cases (Christoph/Bart/Keith)

 - bcache fixes (Coly)

 - prep work for bcachefs and some block layer optimizations (Kent).

 - convert users of bio_sets to using embedded structs (Kent).

 - fixes for the BFQ io scheduler (Paolo/Davide/Filippo)

 - lightnvm fixes and improvements (Matias, with contributions from Hans
   and Javier)

 - adding discard throttling to blk-wbt (me)

 - sbitmap blk-mq-tag handling (me/Omar/Ming).

 - remove the sparc jsflash block driver, acked by DaveM.

 - Kyber scheduler improvement from Jianchao, making it more friendly
   wrt merging.

 - conversion of symbolic proc permissions to octal, from Joe Perches.
   Previously the block parts were a mix of both.

 - nbd fixes (Josef and Kevin Vigor)

 - unify how we handle the various kinds of timestamps that the block
   core and utility code uses (Omar)

 - three NVMe pull requests from Keith and Christoph, bringing AEN to
   feature completeness, file backed namespaces, cq/sq lock split, and
   various fixes

 - various little fixes and improvements all over the map

* tag 'for-4.18/block-20180603' of git://git.kernel.dk/linux-block: (196 commits)
  blk-mq: update nr_requests when switching to 'none' scheduler
  block: don't use blocking queue entered for recursive bio submits
  dm-crypt: fix warning in shutdown path
  lightnvm: pblk: take bitmap alloc. out of critical section
  lightnvm: pblk: kick writer on new flush points
  lightnvm: pblk: only try to recover lines with written smeta
  lightnvm: pblk: remove unnecessary bio_get/put
  lightnvm: pblk: add possibility to set write buffer size manually
  lightnvm: fix partial read error path
  lightnvm: proper error handling for pblk_bio_add_pages
  lightnvm: pblk: fix smeta write error path
  lightnvm: pblk: garbage collect lines with failed writes
  lightnvm: pblk: rework write error recovery path
  lightnvm: pblk: remove dead function
  lightnvm: pass flag on graceful teardown to targets
  lightnvm: pblk: check for chunk size before allocating it
  lightnvm: pblk: remove unnecessary argument
  lightnvm: pblk: remove unnecessary indirection
  lightnvm: pblk: return NVM_ error on failed submission
  lightnvm: pblk: warn in case of corrupted write buffer
  ...
2018-06-04 07:58:06 -07:00
Andreas Gruenbacher
628e366df1 gfs2: Iomap cleanups and improvements
Clean up gfs2_iomap_alloc and gfs2_iomap_get.  Document how
gfs2_iomap_alloc works: it now needs to be called separately after
gfs2_iomap_get where necessary; this will be used later by iomap write.
Move gfs2_iomap_ops into bmap.c.

Introduce a new gfs2_iomap_get_alloc helper and use it in
fallocate_chunk: gfs2_iomap_begin will become unsuitable for fallocate
with proper iomap write support.

In gfs2_block_map and fallocate_chunk, zero-initialize struct iomap.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2018-06-04 07:56:51 -05:00
Andreas Gruenbacher
845802b112 gfs2: Remove ordered write mode handling from gfs2_trans_add_data
In journaled data mode, we need to add each buffer head to the current
transaction.  In ordered write mode, we only need to add the inode to
the ordered inode list.  So far, both cases are handled in
gfs2_trans_add_data.  This makes the code look misleading and is
inefficient for small block sizes as well.  Handle both cases separately
instead.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2018-06-04 07:50:16 -05:00
Andreas Gruenbacher
d6382a3505 gfs2: gfs2_stuffed_write_end cleanup
First, change the sanity check in gfs2_stuffed_write_end to check for
the actual write size instead of the requested write size.

Second, use the existing teardown code in gfs2_write_end instead of
duplicating it in gfs2_stuffed_write_end.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2018-06-04 07:45:53 -05:00
Andreas Gruenbacher
7841b9f084 gfs2: hole_size improvement
Reimplement function hole_size based on a generic function for walking
the metadata tree and rename hole_size to gfs2_hole_size.  While
previously, multiple invocations of hole_size were sometimes needed to
walk across the entire hole, the new implementation always returns the
entire hole at once (provided that the caller is interested in the total
size).

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2018-06-04 07:39:23 -05:00
Bob Peterson
dc8fbb03dc GFS2: gfs2_free_extlen can return an extent that is too long
Function gfs2_free_extlen calculates the length of an extent of
free blocks that may be reserved. The end pointer was calculated as
end = start + bh->b_size but b_size is incorrect because the
bitmap usually stops prior to the end of the buffer data on
the last bitmap.

What this means is that when you do a write, you can reserve a
chunk of blocks that runs off the end of the last bitmap. For
example, I've got a file system where there is only one bitmap
for each rgrp, so ri_length==1. I saw cases in which iozone
tried to do a big write, grabbed a large block reservation,
chose rgrp 5464152, which has ri_data0 5464153 and ri_data 8188.
So 5464153 + 8188 = 5472341 which is the end of the rgrp.

When it grabbed a reservation it got back: 5470936, length 7229.
But 5470936 + 7229 = 5478165. So the reservation starts inside
the rgrp but runs 5824 blocks past the end of the bitmap.

This patch fixes the calculation so it won't exceed the last
bitmap. It also adds a BUG_ON to guard against overflows in the
future.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2018-06-04 07:33:42 -05:00
Andreas Gruenbacher
7b5747f43f GFS2: Fix allocation error bug with recursive rgrp glocking
Before this patch function gfs2_write_begin, upon discovering an
error, called gfs2_trim_blocks while the rgrp glock was still held.
That's because gfs2_inplace_release is not called until later.
This patch reorganizes the logic a bit so gfs2_inplace_release
is called to release the lock prior to the call to gfs2_trim_blocks,
thus preventing the glock recursion.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2018-06-04 07:33:17 -05:00
Andreas Gruenbacher
07e23d68f6 gfs2: Update find_metapath comment
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2018-06-04 07:32:44 -05:00
Dave Chinner
9f96cc958e xfs: verify AGI unlinked list contains valid blocks
The heads of tha AGI unlinked list are only scanned on debug
kernels when the verifier runs. Change that to always scan the heads
and validate that the inode numbers are valid.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-06-03 16:12:16 -07:00
Linus Torvalds
325e14f97e Merge branch 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs fixes from Al Viro.

 - fix io_destroy()/aio_complete() race

 - the vfs_open() change to get rid of open_check_o_direct() boilerplate
   was nice, but buggy. Al has a patch avoiding a revert, but that's
   definitely not a last-day fodder, so for now revert it is...

* 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  Revert "fs: fold open_check_o_direct into do_dentry_open"
  fix io_destroy()/aio_complete() race
2018-06-03 11:01:28 -07:00
Al Viro
af04fadcaa Revert "fs: fold open_check_o_direct into do_dentry_open"
This reverts commit cab64df194.

Having vfs_open() in some cases drop the reference to
struct file combined with

	error = vfs_open(path, f, cred);
	if (error) {
		put_filp(f);
		return ERR_PTR(error);
	}
	return f;

is flat-out wrong.  It used to be

		error = vfs_open(path, f, cred);
		if (!error) {
			/* from now on we need fput() to dispose of f */
			error = open_check_o_direct(f);
			if (error) {
				fput(f);
				f = ERR_PTR(error);
			}
		} else {
			put_filp(f);
			f = ERR_PTR(error);
		}

and sure, having that open_check_o_direct() boilerplate gotten rid of is
nice, but not that way...

Worse, another call chain (via finish_open()) is FUBAR now wrt
FILE_OPENED handling - in that case we get error returned, with file
already hit by fput() *AND* FILE_OPENED not set.  Guess what happens in
path_openat(), when it hits

	if (!(opened & FILE_OPENED)) {
		BUG_ON(!error);
		put_filp(file);
	}

The root cause of all that crap is that the callers of do_dentry_open()
have no way to tell which way did it fail; while that could be fixed up
(by passing something like int *opened to do_dentry_open() and have it
marked if we'd called ->open()), it's probably much too late in the
cycle to do so right now.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-06-03 10:58:23 -07:00
David S. Miller
9c54aeb03a Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Filling in the padding slot in the bpf structure as a bug fix in 'ne'
overlapped with actually using that padding area for something in
'net-next'.

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-03 09:31:58 -04:00
Matthew Wilcox
cc4a90ac81 dax: dax_insert_mapping_entry always succeeds
It does not return an error, so we don't need to check the return value
for IS_ERR().  Indeed, it is a bug to do so; with a sufficiently large
PFN, a legitimate DAX entry may be mistaken for an error return.

Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2018-06-02 19:39:37 -07:00
Long Li
8e7360f67e CIFS: Add support for direct pages in wdata
Add a function to allocate wdata without allocating pages for data
transfer. This gives the caller an option to pass a number of pages that
point to the data buffer to write to.

wdata is reponsible for free those pages after it's done.

Signed-off-by: Long Li <longli@microsoft.com>
Signed-off-by: Steve French <smfrench@gmail.com>
2018-06-02 18:36:26 -05:00
Long Li
1dbe3466b4 CIFS: Use offset when reading pages
With offset defined in rdata, transport functions need to look at this
offset when reading data into the correct places in pages.

Signed-off-by: Long Li <longli@microsoft.com>
Signed-off-by: Steve French <smfrench@gmail.com>
2018-06-02 18:36:26 -05:00
Long Li
f9f5aca115 CIFS: Add support for direct pages in rdata
Add a function to allocate rdata without allocating pages for data
transfer. This gives the caller an option to pass a number of pages
that point to the data buffer.

rdata is still reponsible for free those pages after it's done.

Signed-off-by: Long Li <longli@microsoft.com>
Signed-off-by: Steve French <smfrench@gmail.com>
2018-06-02 18:36:26 -05:00
Ronnie Sahlberg
8ce79ec359 cifs: update multiplex loop to handle compounded responses
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
2018-06-02 18:36:26 -05:00
Al Viro
de52cf922a AFS fixes
-----BEGIN PGP SIGNATURE-----
 
 iQIVAwUAWvmaZvu3V2unywtrAQKZoA/9HzO6QsB7h7hWY6tTuoL0gD8T8S4hC7l3
 UYFtTgq0rFHJYiET4SWoy0Sfs8rY1iFPtaIeFVQG804SrnXu5/Q1tsv+1lRhZIuo
 /upAtZ3xEcqvAqU8pgcksKl/KUdmm7ZHUbhAFCasu+1eczGF5Q55UAUgonFrnEMi
 9N0WviRUkRAlTre7cvCMRI05c+HJV+PCYrJPjStAkJeuS1CuTEAT/d58NumquMAt
 6ENkpR4OhRUJZDhYH7XIRLm7hsYjr9v3VIeCiLpYqUZGuvhaj3jzPi0e9zD5PDzZ
 lyyodQVegBs88V2rXrjjZHohNQRiuSzI+42pMXrdaDu5jBFFqYLEeaBoperJY7nl
 W6l6HSb/I8VValM7iwkyzNWeQ6KhdUhYvA5ljYaJufZvqxp4di9xT4mAxRqbHSX+
 H5I/n+R27FEOFAqnWInaksj5IO80HGThrGhdz9O/4pa8xITz7W2ZKg5YMLEoF9yp
 /QUxsn3lz4VD4tjPrqampJ+IwbpQB+XDiJhM4boI47kC2IxEc9L2QiYWlFl/okZ4
 CGuXsluQFPleR3Mo8xq1WaQzmT40iYQ+aBOPq1/OhDisexZJ55Cjha1GHk/8aHDu
 GL5UiL7AfWEwY20mJiCObg8u2nnkwg/0YPR3awDBlCMDBeYhxbSFOLrKiQxUjWM9
 Pp6PUhTtSjU=
 =1ow3
 -----END PGP SIGNATURE-----

Merge tag 'afs-fixes-20180514' into afs-proc

backmerge AFS fixes that went into mainline and deal with
the conflict in fs/afs/fsclient.c

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-06-02 18:09:27 -04:00
Linus Torvalds
7fdf3e8616 Merge candidates for 4.17-rc
- bnxt netdev changes merged this cycle caused the bnxt RDMA driver to crash under
   certain situations
 - Arnd found (several, unfortunately) kconfig problems with the patches adding
   INFINIBAND_ADDR_TRANS. Reverting this last part, will fix it more fully
   outside -rc.
 - Subtle change in error code for a uapi function caused breakage in userspace.
   This was bug was subtly introduced cycle
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABCgAGBQJbEXg1AAoJEDht9xV+IJsax2YQAJjXnfQ9+QsbtT8sqFzGFLo3
 r5aqqwF6RkGDFsovVDRH/9S62JjeqJRQA+4ykooajD+6mXU06Sf3p4tcoco0Dqhn
 66b/lkdPFzXSytlne7AnUnA3xkKG4u5jYGReSryIQjXu29iwt8scgiiqt8nX9Gzi
 eC9U2UQn5ZF65yRo4V/UGuHjdnUXiPYfg2Ff5YqLUxdL0XE42ftpuiR3Xuzgfj8c
 /rdqDwnvdViQwPeTSNTJoZzeV+49WKp9BP+lzsCeIXvzuzY1aOd94z0i7fq342hB
 jXpz6PtRTSBkZ4xBuvtopnoz0HQXMv7kQFMobkyjaP3qXcKz6Dx9d3QvWkLq8qdQ
 D4MPYjVCONJAJvXppxuNSzyz0lg5EaSICWeZhkr5P68Ja0fptiXAumVCTr/kEifV
 Yz8y0xsEVMIxBy3C9HOCe4khzWKqd9Uoo4/VDM+GdhKHIpUCnnKTte9vIZcYg1JL
 1doCithudNX7KF4K79JJqADwtFhXoPaHt3XF9YRhgouDN9gC+/hB5ZZvD4jjcqhF
 tLfRh1+LeK6QuVTE//ON/OS5xGgEzcZVLUJSUs/NdB+5YAXREz/2+IoK/0+rFiP0
 wlHlgUk9ZQx3j7La5iiPrRdK+h6u0LUYd02XuMevnnswGqv+DWrxDnFy/icovPCt
 AxHZ0KxdLJeo2euOd755
 =Ppe+
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma

Pull rdma fixes from Jason Gunthorpe:
 "Just three small last minute regressions that were found in the last
  week. The Broadcom fix is a bit big for rc7, but since it is fixing
  driver crash regressions that were merged via netdev into rc1, I am
  sending it.

   - bnxt netdev changes merged this cycle caused the bnxt RDMA driver
     to crash under certain situations

   - Arnd found (several, unfortunately) kconfig problems with the
     patches adding INFINIBAND_ADDR_TRANS. Reverting this last part,
     will fix it more fully outside -rc.

   - Subtle change in error code for a uapi function caused breakage in
     userspace. This was bug was subtly introduced cycle"

* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma:
  IB/core: Fix error code for invalid GID entry
  IB: Revert "remove redundant INFINIBAND kconfig dependencies"
  RDMA/bnxt_re: Fix broken RoCE driver due to recent L2 driver changes
2018-06-02 09:55:44 -07:00
Christoph Hellwig
afd9d6a1df fs: use ->is_partially_uptodate in page_cache_seek_hole_data
This way the implementation doesn't depend on buffer_head internals.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-06-01 18:37:33 -07:00
Christoph Hellwig
bd56b3e144 fs: remove the buffer_unwritten check in page_seek_hole_data
We only call into this function through the iomap iterators, so we already
know the buffer is unwritten.  In addition to that we always require the
uptodate flag that is ORed with the result anyway.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-06-01 18:37:33 -07:00
Christoph Hellwig
8a78cb1f1b fs: move page_cache_seek_hole_data to iomap.c
This function is only used by the iomap code, depends on being called
from it, and will soon stop poking into buffer head internals.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-06-01 18:37:33 -07:00
Christoph Hellwig
b84e772299 xfs: use iomap_bmap
Switch to the iomap based bmap implementation to get rid of one of the
last users of xfs_get_blocks.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-06-01 18:37:33 -07:00
Christoph Hellwig
89eb1906a9 iomap: add an iomap-based bmap implementation
This adds a simple iomap-based implementation of the legacy ->bmap
interface.  Note that we can't easily add checks for rt or reflink
files, so these will have to remain in the callers.  This interface
just needs to die..

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-06-01 18:37:33 -07:00
Christoph Hellwig
57fc505d11 iomap: add a iomap_sector helper
Factor the repeated calculation of the on-disk sector for a given logical
block into a littler helper.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-06-01 18:37:33 -07:00
Christoph Hellwig
6533b4e40d iomap: use __bio_add_page in iomap_dio_zero
We don't need any merging logic, and this also replaces a BUG_ON with a
WARN_ON_ONCE inside __bio_add_page for the impossible overflow condition.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-06-01 18:37:33 -07:00
Christoph Hellwig
7ee66c03e4 iomap: move IOMAP_F_BOUNDARY to gfs2
Just define a range of fs specific flags and use that in gfs2 instead of
exposing this internal flag globally.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-06-01 18:37:32 -07:00
Christoph Hellwig
19319b5321 iomap: inline data should be an iomap type, not a flag
Inline data is fundamentally different from our normal mapped case in that
it doesn't even have a block address.  So instead of having a flag for it
it should be an entirely separate iomap range type.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-06-01 18:37:32 -07:00
Mike Marshall
b1116bc03c orangefs: use sparse annotations for holding locks across function calls.
Sparse complained and Al Viro knew what to do...

Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2018-06-01 14:51:49 -04:00
Mike Marshall
3cf796afed orangefs: make debug_help_fops static
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2018-06-01 14:51:45 -04:00
Mike Marshall
8a6080f574 orangefs: remove unused function orangefs_get_bufmap_init
get_bufmap_init is used in the out-of-tree module, but was left in
the upstream version as an oversight. Tip-of-the-hat to sparse and Al Viro.

Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2018-06-01 14:51:40 -04:00
Mike Marshall
817e9b4d9e orangefs: specify user pointers when using dev_map_desc and bufmap
Sparse lead me to the dev_map_desc one and Al Viro lead me to the bufmap
one.

Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2018-06-01 14:51:36 -04:00
Mike Marshall
95f5f88f89 orangefs: formatting cleanups
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2018-06-01 14:51:30 -04:00
Martin Brandenburg
f6a4b4c9d0 orangefs: set i_size on new symlink
As long as a symlink inode remains in-core, the destination (and
therefore size) will not be re-fetched from the server, as it cannot
change.  The original implementation of the attribute cache assumed that
setting the expiry time in the past was sufficient to cause a re-fetch
of all attributes on the next getattr.  That does not work in this case.

The bug manifested itself as follows.  When the command sequence

touch foo; ln -s foo bar; ls -l bar

is run, the output was

lrwxrwxrwx. 1 fedora fedora 4906 Apr 24 19:10 bar -> foo

However, after a re-mount, ls -l bar produces

lrwxrwxrwx. 1 fedora fedora    3 Apr 24 19:10 bar -> foo

After this commit, even before a re-mount, the output is

lrwxrwxrwx. 1 fedora fedora    3 Apr 24 19:10 bar -> foo

Reported-by: Becky Ligon <ligon@clemson.edu>
Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Fixes: 71680c18c8 ("orangefs: Cache getattr results.")
Cc: stable@vger.kernel.org
Cc: hubcap@omnibond.com
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2018-06-01 14:49:46 -04:00
Martin Brandenburg
7f54910fa8 orangefs: report attributes_mask and attributes for statx
OrangeFS formerly failed to set attributes_mask with the result that
software could not see immutable and append flags present in the
filesystem.

Reported-by: Becky Ligon <ligon@clemson.edu>
Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Fixes: 68a24a6cc4 ("orangefs: implement statx")
Cc: stable@vger.kernel.org
Cc: hubcap@omnibond.com
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2018-06-01 14:49:36 -04:00
Colin Ian King
ec62e95ae3 orangefs: make struct orangefs_file_vm_ops static
The struct orangefs_file_vm_ops is local to the source and does not need
to be in global scope, so make it static.

Cleans up sparse warning:
fs/orangefs/file.c:547:35: warning: symbol 'orangefs_file_vm_ops' was not
declared. Should it be static?

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2018-06-01 14:49:00 -04:00
Martin Brandenburg
9f8fd53cd0 orangefs: revamp block sizes
Now the superblock block size is PAGE_SIZE.  The inode block size is
PAGE_SIZE for directories and symlinks, but is the server-reported
block size for regular files.

The block size in the OrangeFS private inode is now deleted.  Stat
now reports PAGE_SIZE for directories and symlinks and the
server-reported block size for regular files.

The user-space visible change is that the block size for directores
and symlinks and the superblock is now PAGE_SIZE rather than the size of
the client-core shared memory buffers, which was typically four
megabytes.

Reported-by: Becky Ligon <ligon@clemson.edu>
Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Cc: hubcap@omnibond.com
Cc: walt@omnibond.com
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2018-06-01 14:48:31 -04:00
Dave Chinner
16858f7c21 xfs: fix error handling in xfs_refcount_insert()
generic/475 fired an assert failure just after the filesystem was
shut down:

XFS: Assertion failed: fs_is_ok, file: fs/xfs/libxfs/xfs_refcount.c, line: 182
.....
Call Trace:
 xfs_refcount_insert+0x151/0x190
 xfs_refcount_adjust_extents.constprop.11+0x9c/0x470
 xfs_refcount_adjust.constprop.10+0xb0/0x270
 xfs_refcount_finish_one+0x25a/0x420
 xfs_trans_log_finish_refcount_update+0x2a/0x40
 xfs_refcount_update_finish_item+0x35/0xa0
 xfs_defer_finish+0x15e/0x4d0
 xfs_reflink_remap_extent+0x1bc/0x610
 xfs_reflink_remap_blocks+0x6e/0x280
 xfs_reflink_remap_range+0x311/0x530
 vfs_clone_file_range+0x119/0x200
 ....

If xfs_btree_insert() returns an error, the corruption check fires
instead of passing the error back the caller. The corruption check
should be after we've checked for an error, not before, thereby
avoiding assert failures if the filesystem shuts down during a
refcount btree record insert.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-06-01 09:00:16 -07:00
Darrick J. Wong
a0e5c435ba xfs: fix xfs_rtalloc_rec units
All the realtime allocation functions deal with space on the rtdev in
units of realtime extents.  However, struct xfs_rtalloc_rec confusingly
uses the word 'block' in the name, even though they're really extents.

Fix the naming problem and fix all the unit handling problems in the two
existing users.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Bill O'Donnell <billodo@redhat.com>
2018-06-01 09:00:16 -07:00
Darrick J. Wong
8ad560d256 xfs: strengthen rtalloc query range checks
Strengthen the rtalloc range query checks to make sure that the keys do
not run off the end of the realtime device inappropriately.  Note that
the query range functions require units of rt extents, not blocks,
despite the type name.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Bill O'Donnell <billodo@redhat.com>
2018-06-01 09:00:16 -07:00
Darrick J. Wong
a03f1641c7 xfs: xfs_rtbuf_get should check the bmapi_read results
The xfs_rtbuf_get function should check the block mapping it gets back
from bmapi_read.  If there are no mappings or the mapping isn't a real
extent, we should return -EFSCORRUPTED rather than trying to read a
garbage value.  We also require realtime bitmap blocks to be real,
written allocations.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Bill O'Donnell <billodo@redhat.com>
2018-06-01 09:00:16 -07:00
Darrick J. Wong
2483113f3d xfs: xfs_rtword_t should be unsigned, not signed
xfs_rtword_t is used for bit manipulations in the realtime bitmap file.
Since we're performing bit shifts with this type, we don't want sign
extension and we don't want to be left shifting negative quantities
because that's undefined behavior.

This also shuts up these UBSAN warnings:
UBSAN: Undefined behaviour in fs/xfs/libxfs/xfs_rtbitmap.c:833:48
signed integer overflow:
-2147483648 - 1 cannot be represented in type 'int'

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Bill O'Donnell <billodo@redhat.com>
2018-06-01 09:00:16 -07:00
Ronnie Sahlberg
1fc6ad2f10 cifs: remove header_preamble_size where it is always 0
Since header_preamble_size is 0 for SMB2+ we can remove it in those
code paths that are only invoked from SMB2.

Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2018-06-01 09:14:30 -05:00
Ronnie Sahlberg
49f466bdbd cifs: remove struct smb2_hdr
struct smb2_hdr is now just a wrapper for smb2_sync_hdr.
We can thus get rid of smb2_hdr completely and access the sync header directly.

Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2018-06-01 09:14:30 -05:00
Mark Syms
d81243c697 CIFS: 511c54a2f6 adds a check for session expiry, status STATUS_NETWORK_SESSION_EXPIRED, however the server can also respond with STATUS_USER_SESSION_DELETED in cases where the session has been idle for some time and the server reaps the session to recover resources.
Handle this additional status in the same way as SESSION_EXPIRED.

Signed-off-by: Mark Syms <mark.syms@citrix.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
CC: Stable <stable@vger.kernel.org>
2018-06-01 09:14:13 -05:00
Ronnie Sahlberg
e4dc31fe9a cifs: change smb2_get_data_area_len to take a smb2_sync_hdr as argument
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2018-05-31 21:30:51 -05:00
Ronnie Sahlberg
84f0cbfba8 cifs: update smb2_calc_size to use smb2_sync_hdr instead of smb2_hdr
smb2_hdr is just a wrapper around smb2_sync_hdr at this stage
and smb2_hdr is going away.

Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2018-05-31 21:30:51 -05:00
Ronnie Sahlberg
0d5a288d25 cifs: remove struct smb2_oplock_break_rsp
The two structures smb2_oplock_breaq_req/rsp are now basically identical.
Replace this with a single definition of a smb2_oplock_break structure.

Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2018-05-31 21:30:51 -05:00
Ronnie Sahlberg
977b617040 cifs: remove rfc1002 header from all SMB2 response structures
Separate out all the 4 byte rfc1002 headers so that they are no longer
part of the SMB2 header structures to prepare for future work to add
compounding support.

Update the smb3 transform header processing that we no longer have
a rfc1002 header at the start of this structure.

Update smb2_readv_callback to accommodate that the first iovector in the
response is no the smb2 header and no longer a rfc1002 header.

Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2018-05-31 21:30:50 -05:00
Steve French
b2adf22fdf smb3: on reconnect set PreviousSessionId field
The server detects reconnect by the (non-zero) value in PreviousSessionId
of SMB2/SMB3 SessionSetup request, but this behavior regressed due
to commit 166cea4dc3
("SMB2: Separate RawNTLMSSP authentication from SMB2_sess_setup")

CC: Stable <stable@vger.kernel.org>
CC: Sachin Prabhu <sprabhu@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
2018-05-31 21:23:07 -05:00
Steve French
ce558b0e17 smb3: Add posix create context for smb3.11 posix mounts
Signed-off-by: Steve French <smfrench@gmail.com>
2018-05-31 21:23:07 -05:00
Linus Torvalds
0512e01345 Changes since last update:
- Clear out i_mapping error state when we're reinitializing inodes.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAlsPY0wACgkQ+H93GTRK
 tOuuDQ/+IvBngJL9I5py7GF0EXlMuge0nAulEj4d1ZT4tNCPp0Ouu4Jy+za+RapQ
 w604fvI6VPPbtidjLpUkR+ZzVeIAanaUkHY+MXl7DEYnyKC+VO7rPZUQiXe4kCeE
 ExNpL063vj4FND3xO/tXz2cs6Wjk8RuCLPWprLVKPpZ79w+BQwYFlpKGschMhR7w
 EQM+7TIJHff1C2nwETbX5ZcM6yxo6PVUwxEsF7+pubVulMoJZ57m5OnS7RXZY7L7
 33S3du85A/Unby+mlYQTsmWf+1FOfIIf6+r1i13gRorGSZongPSenQdO6h4uKzXc
 3OHXTl783ip2cFhhbbTnDlmly66Q1wcDwUDd88YvP94Wv9K+lWASKJGqDwpT/T3/
 gkmg9pTXezPytTZb+F86nFN91b4NWSdskwN4/Du2ydnSEQVmzwdYLyc1oQn9IWal
 HITBlVApLF33rHgmPJXRT64uKPqsPttu3DR5337waTPKf8po+Xk7CaATIIHx8gTD
 Jj8UfH7b9u7tjk5yXnx7qVCquwsG1E8N3Xi5eqn2dsTVSqia3vjyBoI7esPX5DBO
 ZvbBuU5MMmGr0p7DCEcFbe/otToqdoc0quebuUodKbhUS70+RGDoqwfR+R7Gbprq
 M6+Tfm7S6DIKOsfgde5HBEEAjQtsrNMNdBsBemtL1v3fzI6SyJQ=
 =UtGb
 -----END PGP SIGNATURE-----

Merge tag 'xfs-4.17-fixes-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull xfs fix from Darrick Wong:
 "Clear out i_mapping error state when we're reinitializing inodes.

  This last minute fix prevents writeback error state from persisting
  past the end of the in-core inode lifecycle and causing EIO errors to
  be reported to userspace when no error has occurred.

  This fix for the behavioral regression has been soaking in for-next
  for a while, but various fs developers persuaded me to try to get it
  upstream for 4.17 because the patch that broke things was introduced
  in 4.17-rc4"

* tag 'xfs-4.17-fixes-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  fs: clear writeback errors in inode_init_always
2018-05-31 16:23:07 -05:00
Trond Myklebust
ae55e59da0 pnfs: Don't release the sequence slot until we've processed layoutget on open
If the server recalls the layout that was just handed out, we risk hitting
a race as described in RFC5661 Section 2.10.6.3 unless we ensure that we
release the sequence slot after processing the LAYOUTGET operation that
was sent as part of the OPEN compound.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-05-31 15:03:12 -04:00
Trond Myklebust
32f1c28f3d pnfs: Don't call commit on failed layoutget-on-open
If the layoutget on open call failed, we can't really commit the inode,
so don't bother calling it.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-05-31 15:03:12 -04:00
Trond Myklebust
64294b08f9 pNFS: Don't send LAYOUTGET on OPEN for read, if we already have cached data
If we're only opening the file for reading, and the file is empty and/or
we already have cached data, then heuristically optimise away the
LAYOUTGET.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-05-31 15:03:12 -04:00
Trond Myklebust
8dc96566c0 NFSv4/pnfs: Don't switch off layoutget-on-open for transient errors
Ensure that we only switch off the LAYOUTGET operation in the OPEN
compound when the server is truly broken, and/or it is complaining
that the compound is too large.
Currently, we end up turning off the functionality permanently,
even for transient errors such as EACCES or ENOSPC.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-05-31 15:03:11 -04:00
Trond Myklebust
d49e0d5b99 NFSv4/pnfs: Ensure pnfs_parse_lgopen() won't try to parse uninitialised data
We need to ensure that pnfs_parse_lgopen() doesn't try to parse a
struct nfs4_layoutget_res that was not filled by a successful call
to decode_layoutget(). This can happen if we performed a cached open,
or if either the OP_ACCESS or OP_GETATTR operations preceding the
OP_LAYOUTGET in the compound returned an error.

By initialising the 'status' field to NFS4ERR_DELAY, we ensure that
pnfs_parse_lgopen() won't try to interpret the structure.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-05-31 15:03:11 -04:00
Fred Isaman
30ae2412e9 pnfs: Fix manipulation of NFS_LAYOUT_FIRST_LAYOUTGET
The flag was not always being cleared after LAYOUTGET on OPEN.

Signed-off-by: Fred Isaman <fred.isaman@gmail.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-05-31 15:03:11 -04:00
Fred Isaman
c49b5209f9 pnfs: Add barrier to prevent lgopen using LAYOUTGET during recall
Since the LAYOUTGET on OPEN can be sent without prior inode information,
existing methods to prevent LAYOUTGET from being sent while processing
CB_LAYOUTRECALL don't work.  Track if a recall occurred while LAYOUTGET
was being sent, and if so ignore the results.

Signed-off-by: Fred Isaman <fred.isaman@gmail.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-05-31 15:03:11 -04:00
Fred Isaman
6e01260cee pnfs: Stop attempting LAYOUTGET on OPEN on failure
Signed-off-by: Fred Isaman <fred.isaman@gmail.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-05-31 15:03:11 -04:00
Fred Isaman
78746a384c pnfs: Add LAYOUTGET to OPEN of an existing file
Signed-off-by: Fred Isaman <fred.isaman@gmail.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-05-31 15:03:11 -04:00
Trond Myklebust
29a8bfe52d pNFS: Refactor nfs4_layoutget_release()
Move the actual freeing of the struct nfs4_layoutget into fs/nfs/pnfs.c
where it can be reused by the layoutget on open code.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-05-31 15:03:11 -04:00
Fred Isaman
2409a976a2 pnfs: Add LAYOUTGET to OPEN of a new file
This triggers when have no pre-existing inode to attach to.
The preexisting case is saved for later.

Signed-off-by: Fred Isaman <fred.isaman@gmail.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-05-31 15:03:11 -04:00
Fred Isaman
5e36e2a941 pnfs: Change pnfs_alloc_init_layoutget_args call signature
Don't send in a layout, instead use the (possibly NULL) inode.

This is needed for LAYOUTGET attached to an OPEN where the inode is not
yet set.

Signed-off-by: Fred Isaman <fred.isaman@gmail.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-05-31 15:03:11 -04:00
Fred Isaman
1b146fcff7 pnfs: Move nfs4_opendata into nfs4_fs.h
It will be needed now by the pnfs code.

Signed-off-by: Fred Isaman <fred.isaman@gmail.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-05-31 15:03:11 -04:00
Fred Isaman
56f487f8c8 pnfs: Add conditional encode/decode of LAYOUTGET within OPEN compound
Signed-off-by: Fred Isaman <fred.isaman@gmail.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-05-31 15:03:11 -04:00
Fred Isaman
dacb452db8 pnfs: move allocations out of nfs4_proc_layoutget
They work better in the new alloc_init function.

Signed-off-by: Fred Isaman <fred.isaman@gmail.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-05-31 15:03:11 -04:00
Fred Isaman
587f03deb6 pnfs: refactor send_layoutget
Pull out the alloc/init part for eventual reuse by OPEN.

Signed-off-by: Fred Isaman <fred.isaman@gmail.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-05-31 15:03:11 -04:00
Fred Isaman
f86c3ac502 pnfs: Add layout driver flag PNFS_LAYOUTGET_ON_OPEN
Driver can set flag to allow LAYOUTGET to be sent with OPEN.

Signed-off-by: Fred Isaman <fred.isaman@gmail.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-05-31 15:03:11 -04:00
Fred Isaman
3b65a30df9 NFS4: move ctx into nfs4_run_open_task
Preparing to add conditional LAYOUTGET to OPEN rpc, the LAYOUTGET
will need the ctx info.

Signed-off-by: Fred Isaman <fred.isaman@gmail.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-05-31 15:03:11 -04:00
Fred Isaman
808ba32abe pnfs: Store return value of decode_layoutget for later processing
This will be needed to seperate return value of OPEN and LAYOUTGET
when they are combined into a single RPC.

Signed-off-by: Fred Isaman <fred.isaman@gmail.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-05-31 15:03:10 -04:00
Fred Isaman
34ec9aac7d pnfs: Remove redundant assignment from nfs4_proc_layoutget().
nfs_init_sequence() will clear this for us.

Signed-off-by: Fred Isaman <fred.isaman@gmail.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-05-31 15:03:10 -04:00
Benjamin Coddington
a3cf9bca2a NFSv4: Don't add a new lock on an interrupted wait for LOCK
If the wait for a LOCK operation is interrupted, and then the file is
closed, the locks cleanup code will assume that no new locks will be added
to the inode after it has completed.  We already have a mechanism to detect
if there was signal, so let's use that to avoid recreating the local lock
once the RPC completes.  Also skip re-sending the LOCK operation for the
various error cases if we were signaled.

Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
[Trond: Fix inverted test of locks_lock_inode_wait()]
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-05-31 15:02:16 -04:00
Trond Myklebust
cf61eb2686 NFSv4: Always clear the pNFS layout when handling ESTALE
If we get an ESTALE error in response to an RPC call operating on the
file on the MDS, we should immediately cancel the layout for that file.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-05-31 15:02:16 -04:00
Dave Wysochanski
d68894800e NFSv4: Fix possible 1-byte stack overflow in nfs_idmap_read_and_verify_message
In nfs_idmap_read_and_verify_message there is an incorrect sprintf '%d'
that converts the __u32 'im_id' from struct idmap_msg to 'id_str', which
is a stack char array variable of length NFS_UINT_MAXLEN == 11.
If a uid or gid value is > 2147483647 = 0x7fffffff, the conversion
overflows into a negative value, for example:
crash> p (unsigned) (0x80000000)
$1 = 2147483648
crash> p (signed) (0x80000000)
$2 = -2147483648
The '-' sign is written to the buffer and this causes a 1 byte overflow
when the NULL byte is written, which corrupts kernel stack memory.  If
CONFIG_CC_STACKPROTECTOR_STRONG is set we see a stack-protector panic:

[11558053.616565] Kernel panic - not syncing: stack-protector: Kernel stack is corrupted in: ffffffffa05b8a8c
[11558053.639063] CPU: 6 PID: 9423 Comm: rpc.idmapd Tainted: G        W      ------------ T 3.10.0-514.el7.x86_64 #1
[11558053.641990] Hardware name: Red Hat OpenStack Compute, BIOS 1.10.2-3.el7_4.1 04/01/2014
[11558053.644462]  ffffffff818c7bc0 00000000b1f3aec1 ffff880de0f9bd48 ffffffff81685eac
[11558053.646430]  ffff880de0f9bdc8 ffffffff8167f2b3 ffffffff00000010 ffff880de0f9bdd8
[11558053.648313]  ffff880de0f9bd78 00000000b1f3aec1 ffffffff811dcb03 ffffffffa05b8a8c
[11558053.650107] Call Trace:
[11558053.651347]  [<ffffffff81685eac>] dump_stack+0x19/0x1b
[11558053.653013]  [<ffffffff8167f2b3>] panic+0xe3/0x1f2
[11558053.666240]  [<ffffffff811dcb03>] ? kfree+0x103/0x140
[11558053.682589]  [<ffffffffa05b8a8c>] ? idmap_pipe_downcall+0x1cc/0x1e0 [nfsv4]
[11558053.689710]  [<ffffffff810855db>] __stack_chk_fail+0x1b/0x30
[11558053.691619]  [<ffffffffa05b8a8c>] idmap_pipe_downcall+0x1cc/0x1e0 [nfsv4]
[11558053.693867]  [<ffffffffa00209d6>] rpc_pipe_write+0x56/0x70 [sunrpc]
[11558053.695763]  [<ffffffff811fe12d>] vfs_write+0xbd/0x1e0
[11558053.702236]  [<ffffffff810acccc>] ? task_work_run+0xac/0xe0
[11558053.704215]  [<ffffffff811fec4f>] SyS_write+0x7f/0xe0
[11558053.709674]  [<ffffffff816964c9>] system_call_fastpath+0x16/0x1b

Fix this by calling the internally defined nfs_map_numeric_to_string()
function which properly uses '%u' to convert this __u32.  For consistency,
also replace the one other place where snprintf is called.

Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Reported-by: Stephen Johnston <sjohnsto@redhat.com>
Fixes: cf4ab538f1 ("NFSv4: Fix the string length returned by the idmapper")
Cc: stable@vger.kernel.org # v3.4+
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-05-31 15:02:16 -04:00
Trond Myklebust
d554168f87 NFS: Fix up nfs_post_op_update_inode() to force ctime updates
We do not want to ignore ctime updates that originate from functions
such as link().

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-05-31 15:02:16 -04:00
Trond Myklebust
472f761e11 NFS: Ensure we revalidate the inode correctly after setacl
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-05-31 15:02:16 -04:00
Trond Myklebust
59a707b0d4 NFS: Ensure we revalidate the inode correctly after remove or rename
We may need to revalidate the change attribute, ctime and the nlinks count.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-05-31 15:02:16 -04:00
Trond Myklebust
821a868a23 NFS: Set the force revalidate flag if the inode is not completely initialised
Ensure that a delegation doesn't cause us to skip initialising the inode
if it was incomplete when we exited nfs_fhget()

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-05-31 15:02:16 -04:00
Trond Myklebust
3cb3fd6da4 NFS: Fix up sillyrename()
Ensure that we register the fact that the inode ctime has changed.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-05-31 15:02:16 -04:00
Trond Myklebust
ed7e9ad090 NFSv4: Fix sillyrename to return the delegation when appropriate
Ensure that we pass down the inode of the file being deleted so
that we can return any delegation being held.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-05-31 15:02:16 -04:00
Trond Myklebust
991eedb137 NFSv4: Only pass the delegation to setattr if we're sending a truncate
Even then it isn't really necessary. The reason why we may not want to
pass in a stateid in other cases is that we cannot use the delegation
credential.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-05-31 15:02:16 -04:00
Anna Schumaker
2f261020b6 NFS: Merge nfs41_free_stateid() with _nfs41_free_stateid()
Having these exist as two functions doesn't seem to add anything useful,
and I think merging them together makes this easier to follow.

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-05-31 15:02:16 -04:00
Anna Schumaker
fba83f3411 NFS: Pass "privileged" value to nfs4_init_sequence()
We currently have a separate function just to set this, but I think it
makes more sense to set it at the same time as the other values in
nfs4_init_sequence()

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-05-31 15:02:16 -04:00
Anna Schumaker
e9ae1ee2b2 NFS: Move call to nfs4_state_protect() to nfs4_commit_setup()
Rather than doing this in the generic NFS client code.  Let's put this
with the other v4 stuff so it's all in one place.

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-05-31 15:02:16 -04:00
Anna Schumaker
fb91fb0ee7 NFS: Move call to nfs4_state_protect_write() to nfs4_write_setup()
This doesn't really need to be in the generic NFS client code, and I
think it makes more sense to keep the v4 code in one place.

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-05-31 15:02:16 -04:00
NeilBrown
e04bbf6b1b NFS: Avoid quadratic search when freeing delegations.
There are three places that walk all delegation for an nfs_client and
restart whenever they find something interesting - potentially
resulting in a quadratic search:  If there are 10,000 uninteresting
delegations followed by 10,000 interesting one, then the code
skips over 100,000,000 delegations, which can take a noticeable amount
of time.

Of these nfs_delegation_reap_unclaimed() and
nfs_reap_expired_delegations() are only called during unusual events:
a server reboots or reports expired delegations, probably due to a
network partition.  Optimizing these is not particularly important.

The third, nfs_client_return_marked_delegations(), is called
periodically via nfs_expire_unreferenced_delegations().  It could
cause periodic problems on a busy server.

New delegations are added to the end of the list, so if there are
10,000 open files with delegations, and 10,000 more recently opened files
that received delegations but are now closed, then
nfs_client_return_marked_delegations() can take seconds to skip over
the 10,000 open files 10,000 times.  That is a waste of time.

The avoid this waste a place-holder (an inode) is kept when locks are
dropped, so that the place can usually be found again after taking
rcu_readlock().  This place holder ensure that we find the right
starting point in the list of nfs_servers, and makes is probable that
we find the right starting point in the list of delegations.
We might need to occasionally restart at the head of that list.

It might be possible that the place_holder inode could lose its
delegation separately, and then get a new one using the same (freed
and then reallocated) 'struct nfs_delegation'.  Were this to happen,
the new delegation would be at the end of the list and we would miss
returning some other delegations.  This would have the effect of
unnecessarily delaying the return of some unused delegations until the
next time this function is called - typically 90 seconds later.  As
this is not a correctness issue and is vanishingly unlikely to happen,
it does not seem worth addressing.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-05-31 15:02:14 -04:00
NeilBrown
3ca951b618 NFS: use cond_resched() when restarting walk of delegation list.
In three places we walk the list of delegations for an nfs_client
until an interesting one is found, then we act of that delegation
and restart the walk.

New delegations are added to the end of a list and the interesting
delegations are usually old, so in many case we won't repeat
a long walk over and over again, but it is possible - particularly if
the first server in the list has a large number of uninteresting
delegations.

In each cache the work done on interesting delegations will often
complete without sleeping, so this could loop many times without
giving up the CPU.

So add a cond_resched() at an appropriate point to avoid hogging the
CPU for too long.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-05-31 14:59:19 -04:00
NeilBrown
f389349142 NFS: slight optimization for walking list for delegations
There are 3 places where we walk the list of delegations
for an nfs_client.
In each case there are two nested loops, one for nfs_servers
and one for nfs_delegations.

When we find an interesting delegation we try to get an active
reference to the server.  If that fails, it is pointless to
continue to look at the other delegation for the server as
we will never be able to get an active reference.
So instead of continuing in the inner loop, break out
and continue in the outer loop.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-05-31 14:59:19 -04:00
youngjun yoo
1061fd484b fs: f2fs: insert space around that ':' and ', '
clean up checkpatch error:
ERROR: space required after that ':'
ERROR: space required after that ','

Signed-off-by: youngjun yoo <youngjun.willow@gmail.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-05-31 11:31:54 -07:00
youngjun yoo
f11e98bdbe fs: f2fs: add missing blank lines after declarations
clean up checkpatch warning:
WARNING: Missing a blank line after declarations

Signed-off-by: youngjun yoo <youngjun.willow@gmail.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-05-31 11:31:54 -07:00
youngjun yoo
193bea1dd8 fs: f2fs: changed variable type of offset "unsigned" to "loff_t"
clean up checkpatch warning:
WARNING: Prefer 'unsigned int' to bare use of 'unsigned'

Signed-off-by: youngjun yoo <youngjun.willow@gmail.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-05-31 11:31:53 -07:00
Chao Yu
4d57b86dd8 f2fs: clean up symbol namespace
As Ted reported:

"Hi, I was looking at f2fs's sources recently, and I noticed that there
is a very large number of non-static symbols which don't have a f2fs
prefix.  There's well over a hundred (see attached below).

As one example, in fs/f2fs/dir.c there is:

unsigned char get_de_type(struct f2fs_dir_entry *de)

This function is clearly only useful for f2fs, but it has a generic
name.  This means that if any other file system tries to have the same
symbol name, there will be a symbol conflict and the kernel would not
successfully build.  It also means that when someone is looking f2fs
sources, it's not at all obvious whether a function such as
read_data_page(), invalidate_blocks(), is a generic kernel function
found in the fs, mm, or block layers, or a f2fs specific function.

You might want to fix this at some point.  Hopefully Kent's bcachefs
isn't similarly using genericly named functions, since that might
cause conflicts with f2fs's functions --- but just as this would be a
problem that we would rightly insist that Kent fix, this is something
that we should have rightly insisted that f2fs should have fixed
before it was integrated into the mainline kernel.

acquire_orphan_inode
add_ino_entry
add_orphan_inode
allocate_data_block
allocate_new_segments
alloc_nid
alloc_nid_done
alloc_nid_failed
available_free_memory
...."

This patch adds "f2fs_" prefix for all non-static symbols in order to:
a) avoid conflict with other kernel generic symbols;
b) to indicate the function is f2fs specific one instead of generic
one;

Reported-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-05-31 11:31:53 -07:00
Chao Yu
2e79d951ff f2fs: make set_de_type() static
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-05-31 11:31:53 -07:00
Chao Yu
fc99fe27b9 f2fs: make __f2fs_write_data_pages() static
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-05-31 11:31:53 -07:00
Chao Yu
9fd62605bb f2fs: fix to avoid accessing cross the boundary
Configure io_bits with 2 and enable LFS mode, generic/017 reports below dmesg:

BUG: unable to handle kernel NULL pointer dereference at 00000039
*pdpt = 000000002fcb2001 *pde = 0000000000000000
Oops: 0000 [#1] PREEMPT SMP
Modules linked in: crc32_generic zram f2fs(O) bnep rfcomm bluetooth ecdh_generic snd_intel8x0 snd_ac97_codec ac97_bus snd_pcm snd_seq_midi snd_seq_midi_event snd_rawmidi pcbc snd_seq joydev aesni_intel aes_i586 snd_seq_device snd_timer crypto_simd cryptd snd soundcore i2c_piix4 serio_raw mac_hid video parport_pc ppdev lp parport hid_generic usbhid psmouse hid e1000
CPU: 2 PID: 20779 Comm: xfs_io Tainted: G           O      4.17.0-rc2 #38
Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
EIP: is_checkpointed_data+0x84/0xd0 [f2fs]
EFLAGS: 00010207 CPU: 2
EAX: 00000000 EBX: f5cd7000 ECX: fffffe32 EDX: 00000039
ESI: 000001cd EDI: ec95fb6c EBP: e264bd80 ESP: e264bd6c
 DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
CR0: 80050033 CR2: 00000039 CR3: 2fe55660 CR4: 000406f0
Call Trace:
 __exchange_data_block+0xb3f/0x1000 [f2fs]
 f2fs_fallocate+0xab9/0x16b0 [f2fs]
 vfs_fallocate+0x17c/0x2d0
 ksys_fallocate+0x42/0x70
 sys_fallocate+0x31/0x40
 do_fast_syscall_32+0xaa/0x22c
 entry_SYSENTER_32+0x4c/0x7b
EIP: 0xb7f98c51
EFLAGS: 00000293 CPU: 2
EAX: ffffffda EBX: 00000003 ECX: 00000008 EDX: 01001000
ESI: 00000000 EDI: 00001000 EBP: 00000000 ESP: bfc0357c
 DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b
Code: 00 00 d3 e8 8b 4d ec 2b 02 8b 55 f0 6b c0 1c 03 41 70 29 d6 8b 93 d0 06 00 00 8b 40 0c 83 ea 01 21 d6 89 f2 89 f1 c1 ea 03 f7 d1 <0f> be 14 10 83 e1 07 b8 01 00 00 00 d3 e0 85 c2 89 f8 0f 95 c3
EIP: is_checkpointed_data+0x84/0xd0 [f2fs] SS:ESP: 0068:e264bd6c
CR2: 0000000000000039
---[ end trace 9a4d4087cce6080a ]---

This is because in recovery flow of __exchange_data_block, we didn't pass olen to
__roll_back_blkaddrs, instead we passed len, which indicates wrong array size, result
in copying random block address into dnode page.

Later, once that random block address was accessed by is_checkpointed_data, it can
cause NULL pointer dereference.

Signed-off-by: Chao Yu <yuchao0@huawei.com>

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-05-31 11:31:53 -07:00
Chao Yu
fe16efe6a7 f2fs: fix to let caller retry allocating block address
Configure io_bits with 2 and enable LFS mode, generic/013 reports below dmesg:

BUG: unable to handle kernel NULL pointer dereference at 00000104
*pdpt = 0000000029b7b001 *pde = 0000000000000000
Oops: 0002 [#1] PREEMPT SMP
Modules linked in: crc32_generic zram f2fs(O) rfcomm bnep bluetooth ecdh_generic snd_intel8x0 snd_ac97_codec ac97_bus snd_pcm snd_seq_midi snd_seq_midi_event snd_rawmidi snd_seq pcbc joydev snd_seq_device aesni_intel snd_timer aes_i586 snd crypto_simd cryptd soundcore i2c_piix4 serio_raw mac_hid video parport_pc ppdev lp parport hid_generic psmouse usbhid hid e1000
CPU: 0 PID: 11161 Comm: fsstress Tainted: G           O      4.17.0-rc2 #38
Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
EIP: f2fs_submit_page_write+0x28d/0x550 [f2fs]
EFLAGS: 00010206 CPU: 0
EAX: e863dcd8 EBX: 00000000 ECX: 00000100 EDX: 00000200
ESI: e863dcf4 EDI: f6f82768 EBP: e863dbb0 ESP: e863db74
 DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
CR0: 80050033 CR2: 00000104 CR3: 29a62020 CR4: 000406f0
Call Trace:
 do_write_page+0x6f/0xc0 [f2fs]
 write_data_page+0x4a/0xd0 [f2fs]
 do_write_data_page+0x327/0x630 [f2fs]
 __write_data_page+0x34b/0x820 [f2fs]
 __f2fs_write_data_pages+0x42d/0x8c0 [f2fs]
 f2fs_write_data_pages+0x27/0x30 [f2fs]
 do_writepages+0x1a/0x70
 __filemap_fdatawrite_range+0x94/0xd0
 filemap_write_and_wait_range+0x3d/0xa0
 __generic_file_write_iter+0x11a/0x1f0
 f2fs_file_write_iter+0xdd/0x3b0 [f2fs]
 __vfs_write+0xd2/0x150
 vfs_write+0x9b/0x190
 ksys_write+0x45/0x90
 sys_write+0x16/0x20
 do_fast_syscall_32+0xaa/0x22c
 entry_SYSENTER_32+0x4c/0x7b
EIP: 0xb7fc8c51
EFLAGS: 00000246 CPU: 0
EAX: ffffffda EBX: 00000003 ECX: 09cde000 EDX: 00001000
ESI: 00000003 EDI: 00001000 EBP: 00000000 ESP: bfbded38
 DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b
Code: e8 f9 77 34 c9 8b 45 e0 8b 80 b8 00 00 00 39 45 d8 0f 84 bb 02 00 00 8b 45 e0 8b 80 b8 00 00 00 8d 50 d8 8b 08 89 55 f0 8b 50 04 <89> 51 04 89 0a c7 00 00 01 00 00 c7 40 04 00 02 00 00 8b 45 dc
EIP: f2fs_submit_page_write+0x28d/0x550 [f2fs] SS:ESP: 0068:e863db74
CR2: 0000000000000104
---[ end trace 4cac79c0d1305ee6 ]---

allocate_data_block will submit all sequential pending IOs sorted by a
FIFO list, If we failed to submit other user's IO due to unaligned write,
we will retry to allocate new block address for current IO, then it will
initialize fio.list again, if fio was in the list before, it can break
FIFO list, result in above panic.

Thread A			Thread B
- do_write_page
 - allocate_data_block
  - list_add_tail
  : fioA cached in FIFO list.
				- do_write_page
				 - allocate_data_block
				  - list_add_tail
				  : fioB cached in FIFO list.
				 - f2fs_submit_page_write
				 : fail to submit IO
				 - allocate_data_block
				  - INIT_LIST_HEAD
 - f2fs_submit_page_write
  - list_del  <-- NULL pointer dereference

This patch adds fio.retry parameter to indicate failure status for each
IO, and avoid bailing out if there is still pending IO in FIFO list for
fixing.

Signed-off-by: Chao Yu <yuchao0@huawei.com>

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-05-31 11:31:53 -07:00
Anatoly Pugachev
4071e67cff disable loading f2fs module on PAGE_SIZE > 4KB
The following patch disables loading of f2fs module on architectures
which have PAGE_SIZE > 4096 , since it is impossible to mount f2fs on
such architectures , log messages are:

mount: /mnt: wrong fs type, bad option, bad superblock on
/dev/vdiskb1, missing codepage or helper program, or other error.
/dev/vdiskb1: F2FS filesystem,
UUID=1d8b9ca4-2389-4910-af3b-10998969f09c, volume name ""

May 15 18:03:13 ttip kernel: F2FS-fs (vdiskb1): Invalid
page_cache_size (8192), supports only 4KB
May 15 18:03:13 ttip kernel: F2FS-fs (vdiskb1): Can't find valid F2FS
filesystem in 1th superblock
May 15 18:03:13 ttip kernel: F2FS-fs (vdiskb1): Invalid
page_cache_size (8192), supports only 4KB
May 15 18:03:13 ttip kernel: F2FS-fs (vdiskb1): Can't find valid F2FS
filesystem in 2th superblock
May 15 18:03:13 ttip kernel: F2FS-fs (vdiskb1): Invalid
page_cache_size (8192), supports only 4KB

which was introduced by git commit 5c9b469295

tested on git kernel 4.17.0-rc6-00309-gec30dcf7f425

with patch applied:

modprobe: ERROR: could not insert 'f2fs': Invalid argument
May 28 01:40:28 v215 kernel: F2FS not supported on PAGE_SIZE(8192) != 4096

Signed-off-by: Anatoly Pugachev <matorola@gmail.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-05-31 11:31:52 -07:00
Chao Yu
14a28559f4 f2fs: fix error path of move_data_page
This patch fixes error path of move_data_page:
- clear cold data flag if it fails to write page.
- redirty page for non-ENOMEM case.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-05-31 11:31:52 -07:00
Chao Yu
1174abfd83 f2fs: don't drop dentry pages after fs shutdown
As description in commit "f2fs: don't drop any page on f2fs_cp_error()
case":

"We still provide readdir() after shtudown, so we should keep pages to
avoid additional IOs."

In order to provider lastest directory structure, let's keep dentry
pages in cache after fs shutdown.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-05-31 11:31:52 -07:00
Chao Yu
250dbf5151 f2fs: fix to avoid race during access gc_thread pointer
Thread A			Thread B
- f2fs_remount
 - stop_gc_thread
				- f2fs_sbi_store
   sbi->gc_thread = NULL;
				  access sbi->gc_thread->gc_*

Previously, we allocate memory for sbi->gc_thread based on background
gc thread mount option, the memory can be released if we turn off
that mount option, but still there are several places access gc_thread
pointer without considering race condition, result in NULL point
dereference.

In order to fix this issue, use sb->s_umount to exclude those operations.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-05-31 11:31:52 -07:00
Chao Yu
aec2f729fc f2fs: clean up with clear_radix_tree_dirty_tag
Introduce clear_radix_tree_dirty_tag to include common codes for cleanup.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-05-31 11:31:52 -07:00
Chao Yu
64c74a7ab5 f2fs: fix to don't trigger writeback during recovery
- f2fs_fill_super
 - recover_fsync_data
  - recover_data
   - del_fsync_inode
    - iput
     - iput_final
      - write_inode_now
       - f2fs_write_inode
        - f2fs_balance_fs
         - f2fs_balance_fs_bg
          - sync_dirty_inodes

With data_flush mount option, during recovery, in order to avoid entering
above writeback flow, let's detect recovery status and do skip in
f2fs_balance_fs_bg.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Yunlei He <heyunlei@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-05-31 11:31:52 -07:00
Sheng Yong
35a9a766a1 f2fs: clear discard_wake earlier
If SBI_NEED_FSCK is set, discard_wake will never be cleared. As a
result, the condition of wait_event_interruptible_timeout() is always
true, which gets discard thread run too frequently.

Signed-off-by: Sheng Yong <shengyong1@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-05-31 11:31:52 -07:00
Yunlei He
f9d1dced75 f2fs: let discard thread wait a little longer if dev is busy
This patch modify discard thread wait policy as below:
	issued       io_interrupted     wait time(ms)
1.        8                 0               50
2.      (0,8)               1               50
3.        0                 1              500 (dev is busy)
4.        0                 0            60000 (no candidates)

Signed-off-by: Yunlei He <heyunlei@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-05-31 11:31:51 -07:00
Chao Yu
2ef79ecb5e f2fs: avoid stucking GC due to atomic write
f2fs doesn't allow abuse on atomic write class interface, so except
limiting in-mem pages' total memory usage capacity, we need to limit
atomic-write usage as well when filesystem is seriously fragmented,
otherwise we may run into infinite loop during foreground GC because
target blocks in victim segment are belong to atomic opened file for
long time.

Now, we will detect failure due to atomic write in foreground GC, if
the count exceeds threshold, we will drop all atomic written data in
cache, by this, I expect it can keep our system running safely to
prevent Dos attack.

In addition, his patch adds to show GC skip information in debugfs,
now it just shows count of skipped caused by atomic write.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-05-31 11:31:51 -07:00
Jaegeuk Kim
5b0e95398e f2fs: introduce sbi->gc_mode to determine the policy
This is to avoid sbi->gc_thread pointer access.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-05-31 11:31:51 -07:00
Chao Yu
107a805de8 f2fs: keep migration IO order in LFS mode
For non-migration IO, we will keep order of data/node blocks' submitting
as allocation sequence by sorting IOs in per log io_list list, but for
migration IO, it could be out-of-order.

In LFS mode, we should keep all IOs including migration IO be ordered,
so that this patch fixes to add an additional lock to keep submitting
order.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Yunlong Song <yunlong.song@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-05-31 11:31:51 -07:00
Chao Yu
e5e5732d81 f2fs: fix to wait page writeback during revoking atomic write
After revoking atomic write, related LBA can be reused by others, so we
need to wait page writeback before reusing the LBA, in order to avoid
interference between old atomic written in-flight IO and new IO.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-05-31 11:31:51 -07:00
Sahitya Tummala
60b2b4ee2b f2fs: Fix deadlock in shutdown ioctl
f2fs_ioc_shutdown() ioctl gets stuck in the below path
when issued with F2FS_GOING_DOWN_FULLSYNC option.

__switch_to+0x90/0xc4
percpu_down_write+0x8c/0xc0
freeze_super+0xec/0x1e4
freeze_bdev+0xc4/0xcc
f2fs_ioctl+0xc0c/0x1ce0
f2fs_compat_ioctl+0x98/0x1f0

Signed-off-by: Sahitya Tummala <stummala@codeaurora.org>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-05-31 11:31:51 -07:00
Chao Yu
f8de433122 f2fs: detect synchronous writeback more earlier
This patch changes to detect synchronous writeback more earlier before,
in order to avoid unnecessary page writeback before exiting asynchronous
writeback.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-05-31 11:31:50 -07:00
Chao Yu
7b525dd013 f2fs: clean up with is_valid_blkaddr()
- rename is_valid_blkaddr() to is_valid_meta_blkaddr() for readability.
- introduce is_valid_blkaddr() for cleanup.

No logic change in this patch.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-05-31 11:31:50 -07:00
Chao Yu
5ad25442b6 f2fs: fix to initialize min_mtime with ULLONG_MAX
Since sit_i.min_mtime's type is unsigned long long, so we should
initialize it with max value of the type ULLONG_MAX instead of
LLONG_MAX.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-05-31 11:31:50 -07:00
Chao Yu
e7a4feb0ab f2fs: fix to let checkpoint guarantee atomic page persistence
1. thread A: commit_inmem_pages submit data into block layer, but
haven't waited it writeback.
2. thread A: commit_inmem_pages update related node.
3. thread B: do checkpoint, flush all nodes to disk.
4. SPOR

Then, atomic file becomes corrupted since nodes is flushed before data.

This patch fixes to treat atomic page as checkpoint guaranteed one,
then in checkpoint, we can make sure all atomic page can be writebacked
with metadata of atomic file.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-05-31 11:31:50 -07:00
Chao Yu
1c41e6808e f2fs: fix to initialize i_current_depth according to inode type
i_current_depth is used only for directory inode, but its space is
shared with i_gc_failures field used for regular inode, in order to
avoid affecting i_gc_failures' value, this patch fixes to initialize
the union's fields according to inode type.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-05-31 11:31:50 -07:00
Chao Yu
299254d85d Revert "f2fs: add ovp valid_blocks check for bg gc victim to fg_gc"
For extreme case:
10 section, op = 10%, no_fggc_threshold = 90%
All section usage: 85% 85% 85% 85% 90% 90% 95% 95% 95% 95%

During foreground GC, if we skip select dirty section whose usage
is larger than no_fggc_threshold, we can only recycle 80% invalid
space from four 85% usage sections and two 90% usage sections,
result in encountering out-of-space issue.

This reverts commit e93b986525 to
fix this issue, besides, we keep the logic that we scan all dirty
section when searching a victim, so that GC can select victim with
least valid blocks.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-05-31 11:31:50 -07:00
Jaegeuk Kim
868de6135f f2fs: don't drop any page on f2fs_cp_error() case
We still provide readdir() after shtudown, so we should keep pages to avoid
additional IOs.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-05-31 11:31:49 -07:00
Colin Ian King
4580038e2d f2fs: fix spelling mistake: "extenstion" -> "extension"
Trivial fix to spelling mistake in extension list text

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-05-31 11:31:49 -07:00
Jaegeuk Kim
0cfe75c5b0 f2fs: enhance sanity_check_raw_super() to avoid potential overflows
In order to avoid the below overflow issue, we should have checked the
boundaries in superblock before reaching out to allocation. As Linus suggested,
the right place should be sanity_check_raw_super().

Dr Silvio Cesare of InfoSect reported:

There are integer overflows with using the cp_payload superblock field in the
f2fs filesystem potentially leading to memory corruption.

include/linux/f2fs_fs.h

struct f2fs_super_block {
...
        __le32 cp_payload;

fs/f2fs/f2fs.h

typedef u32 block_t;    /*
                         * should not change u32, since it is the on-disk block
                         * address format, __le32.
                         */
...

static inline block_t __cp_payload(struct f2fs_sb_info *sbi)
{
        return le32_to_cpu(F2FS_RAW_SUPER(sbi)->cp_payload);
}

fs/f2fs/checkpoint.c

        block_t start_blk, orphan_blocks, i, j;
...
        start_blk = __start_cp_addr(sbi) + 1 + __cp_payload(sbi);
        orphan_blocks = __start_sum_addr(sbi) - 1 - __cp_payload(sbi);

+++ integer overflows

...
        unsigned int cp_blks = 1 + __cp_payload(sbi);
...
        sbi->ckpt = kzalloc(cp_blks * blk_size, GFP_KERNEL);

+++ integer overflow leading to incorrect heap allocation.

        int cp_payload_blks = __cp_payload(sbi);
...
        ckpt->cp_pack_start_sum = cpu_to_le32(1 + cp_payload_blks +
                        orphan_blocks);

+++ sign bug and integer overflow

...
        for (i = 1; i < 1 + cp_payload_blks; i++)

+++ integer overflow

...

      sbi->max_orphans = (sbi->blocks_per_seg - F2FS_CP_PACKS -
                        NR_CURSEG_TYPE - __cp_payload(sbi)) *
                                F2FS_ORPHANS_PER_BLOCK;

+++ integer overflow

Reported-by: Greg KH <greg@kroah.com>
Reported-by: Silvio Cesare <silvio.cesare@gmail.com>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-05-31 11:31:49 -07:00
Chao Yu
b4c3ca8ba9 f2fs: treat volatile file's data as hot one
Volatile file's data will be updated oftenly, so it'd better to place
its data into hot data segment.

In addition, for atomic file, we change to check FI_ATOMIC_FILE instead
of FI_HOT_DATA to make code readability better.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-05-31 11:31:49 -07:00
Chao Yu
af8ff65bb8 f2fs: introduce release_discard_addr() for cleanup
Introduce release_discard_addr() to include common codes for cleanup.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
[Fengguang Wu: declare static function, reported by kbuild test robot]
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-05-31 11:31:49 -07:00
Chao Yu
a9af3fdcc4 f2fs: fix potential overflow
In build_sit_entries(), if valid_blocks in SIT block is smaller than
valid_blocks in journal, for below calculation:

sbi->discard_blks += old_valid_blocks - se->valid_blocks;

There will be two times potential overflow:
- old_valid_blocks - se->valid_blocks will overflow, and be a very
large number.
- sbi->discard_blks += result will overflow again, comes out a correct
result accidently.

Anyway, it should be fixed.

Fixes: d600af236d ("f2fs: avoid unneeded loop in build_sit_entries")
Fixes: 1f43e2ad7b ("f2fs: introduce CP_TRIMMED_FLAG to avoid unneeded discard")
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-05-31 11:31:49 -07:00
Chao Yu
b2532c6940 f2fs: rename dio_rwsem to i_gc_rwsem
RW semphore dio_rwsem in struct f2fs_inode_info is introduced to avoid
race between dio and data gc, but now, it is more wildly used to avoid
foreground operation vs data gc. So rename it to i_gc_rwsem to improve
its readability.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-05-31 11:31:49 -07:00
Yunlei He
b82f6e347b f2fs: move mnt_want_write_file after range check
This patch move mnt_want_write_file after range check,
it's needless to check arguments with it.

Signed-off-by: Yunlei He <heyunlei@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-05-31 11:31:48 -07:00
Yunlei He
cba41be08c f2fs: fix missing clear FI_NO_PREALLOC in some error case
This patch fix missing clear FI_NO_PREALLOC in some error case

Signed-off-by: Yunlei He <heyunlei@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-05-31 11:31:48 -07:00
Jaegeuk Kim
ade990f95e f2fs: enforce fsync_mode=strict for renamed directory
This is to give a option for user to be able to recover B/foo in the below
case.

mkdir A
sync()
rename(A, B)
creat (B/foo)
fsync (B/foo)
---crash---

Sugessted-by: Velayudhan Pillai <vijay@cs.utexas.edu>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-05-31 11:31:48 -07:00
Jaegeuk Kim
8a29c1260e f2fs: sanity check for total valid node blocks
This patch enhances sanity check for SIT entries.

syzbot hit the following crash on upstream commit
83beed7b2b (Fri Apr 20 17:56:32 2018 +0000)
Merge branch 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/evalenti/linux-soc-thermal
syzbot dashboard link: https://syzkaller.appspot.com/bug?extid=bf9253040425feb155ad

syzkaller reproducer: https://syzkaller.appspot.com/x/repro.syz?id=5692130282438656
Raw console output: https://syzkaller.appspot.com/x/log.txt?id=5095924598571008
Kernel config: https://syzkaller.appspot.com/x/.config?id=1808800213120130118
compiler: gcc (GCC) 8.0.1 20180413 (experimental)

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+bf9253040425feb155ad@syzkaller.appspotmail.com
It will help syzbot understand when the bug is fixed. See footer for details.
If you forward the report, please keep this part and the footer.

F2FS-fs (loop0): invalid crc value
F2FS-fs (loop0): Try to recover 1th superblock, ret: 0
F2FS-fs (loop0): Mounted with checkpoint version = d
F2FS-fs (loop0): Bitmap was wrongly cleared, blk:9740
------------[ cut here ]------------
kernel BUG at fs/f2fs/segment.c:1884!
invalid opcode: 0000 [#1] SMP KASAN
Dumping ftrace buffer:
   (ftrace buffer empty)
Modules linked in:
CPU: 1 PID: 4508 Comm: syz-executor0 Not tainted 4.17.0-rc1+ #10
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:update_sit_entry+0x1215/0x1590 fs/f2fs/segment.c:1882
RSP: 0018:ffff8801af526708 EFLAGS: 00010282
RAX: ffffed0035ea4cc0 RBX: ffff8801ad454f90 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffffffff82eeb87e RDI: ffffed0035ea4cb6
RBP: ffff8801af526760 R08: ffff8801ad4a2480 R09: ffffed003b5e4f90
R10: ffffed003b5e4f90 R11: ffff8801daf27c87 R12: ffff8801adb8d380
R13: 0000000000000001 R14: 0000000000000008 R15: 00000000ffffffff
FS:  00000000014af940(0000) GS:ffff8801daf00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f06bc223000 CR3: 00000001adb02000 CR4: 00000000001406e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 allocate_data_block+0x66f/0x2050 fs/f2fs/segment.c:2663
 do_write_page+0x105/0x1b0 fs/f2fs/segment.c:2727
 write_node_page+0x129/0x350 fs/f2fs/segment.c:2770
 __write_node_page+0x7da/0x1370 fs/f2fs/node.c:1398
 sync_node_pages+0x18cf/0x1eb0 fs/f2fs/node.c:1652
 block_operations+0x429/0xa60 fs/f2fs/checkpoint.c:1088
 write_checkpoint+0x3ba/0x5380 fs/f2fs/checkpoint.c:1405
 f2fs_sync_fs+0x2fb/0x6a0 fs/f2fs/super.c:1077
 __sync_filesystem fs/sync.c:39 [inline]
 sync_filesystem+0x265/0x310 fs/sync.c:67
 generic_shutdown_super+0xd7/0x520 fs/super.c:429
 kill_block_super+0xa4/0x100 fs/super.c:1191
 kill_f2fs_super+0x9f/0xd0 fs/f2fs/super.c:3030
 deactivate_locked_super+0x97/0x100 fs/super.c:316
 deactivate_super+0x188/0x1b0 fs/super.c:347
 cleanup_mnt+0xbf/0x160 fs/namespace.c:1174
 __cleanup_mnt+0x16/0x20 fs/namespace.c:1181
 task_work_run+0x1e4/0x290 kernel/task_work.c:113
 tracehook_notify_resume include/linux/tracehook.h:191 [inline]
 exit_to_usermode_loop+0x2bd/0x310 arch/x86/entry/common.c:166
 prepare_exit_to_usermode arch/x86/entry/common.c:196 [inline]
 syscall_return_slowpath arch/x86/entry/common.c:265 [inline]
 do_syscall_64+0x6ac/0x800 arch/x86/entry/common.c:290
 entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x457d97
RSP: 002b:00007ffd46f9c8e8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000457d97
RDX: 00000000014b09a3 RSI: 0000000000000002 RDI: 00007ffd46f9da50
RBP: 00007ffd46f9da50 R08: 0000000000000000 R09: 0000000000000009
R10: 0000000000000005 R11: 0000000000000246 R12: 00000000014b0940
R13: 0000000000000000 R14: 0000000000000002 R15: 000000000000658e
RIP: update_sit_entry+0x1215/0x1590 fs/f2fs/segment.c:1882 RSP: ffff8801af526708
---[ end trace f498328bb02610a2 ]---

Reported-and-tested-by: syzbot+bf9253040425feb155ad@syzkaller.appspotmail.com
Reported-and-tested-by: syzbot+7d6d31d3bc702f566ce3@syzkaller.appspotmail.com
Reported-and-tested-by: syzbot+0a725420475916460f12@syzkaller.appspotmail.com
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-05-31 11:31:48 -07:00
Jaegeuk Kim
b2ca374f33 f2fs: sanity check on sit entry
syzbot hit the following crash on upstream commit
87ef12027b (Wed Apr 18 19:48:17 2018 +0000)
Merge tag 'ceph-for-4.17-rc2' of git://github.com/ceph/ceph-client
syzbot dashboard link: https://syzkaller.appspot.com/bug?extid=83699adeb2d13579c31e

C reproducer: https://syzkaller.appspot.com/x/repro.c?id=5805208181407744
syzkaller reproducer: https://syzkaller.appspot.com/x/repro.syz?id=6005073343676416
Raw console output: https://syzkaller.appspot.com/x/log.txt?id=6555047731134464
Kernel config: https://syzkaller.appspot.com/x/.config?id=1808800213120130118
compiler: gcc (GCC) 8.0.1 20180413 (experimental)

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+83699adeb2d13579c31e@syzkaller.appspotmail.com
It will help syzbot understand when the bug is fixed. See footer for details.
If you forward the report, please keep this part and the footer.

F2FS-fs (loop0): Magic Mismatch, valid(0xf2f52010) - read(0x0)
F2FS-fs (loop0): Can't find valid F2FS filesystem in 1th superblock
F2FS-fs (loop0): invalid crc value
BUG: unable to handle kernel paging request at ffffed006b2a50c0
PGD 21ffee067 P4D 21ffee067 PUD 21fbeb067 PMD 0
Oops: 0000 [#1] SMP KASAN
Dumping ftrace buffer:
   (ftrace buffer empty)
Modules linked in:
CPU: 0 PID: 4514 Comm: syzkaller989480 Not tainted 4.17.0-rc1+ #8
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:build_sit_entries fs/f2fs/segment.c:3653 [inline]
RIP: 0010:build_segment_manager+0x7ef7/0xbf70 fs/f2fs/segment.c:3852
RSP: 0018:ffff8801b102e5b0 EFLAGS: 00010a06
RAX: 1ffff1006b2a50c0 RBX: 0000000000000004 RCX: 0000000000000001
RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff8801ac74243e
RBP: ffff8801b102f410 R08: ffff8801acbd46c0 R09: fffffbfff14d9af8
R10: fffffbfff14d9af8 R11: ffff8801acbd46c0 R12: ffff8801ac742a80
R13: ffff8801d9519100 R14: dffffc0000000000 R15: ffff880359528600
FS:  0000000001e04880(0000) GS:ffff8801dae00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffffed006b2a50c0 CR3: 00000001ac6ac000 CR4: 00000000001406f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 f2fs_fill_super+0x4095/0x7bf0 fs/f2fs/super.c:2803
 mount_bdev+0x30c/0x3e0 fs/super.c:1165
 f2fs_mount+0x34/0x40 fs/f2fs/super.c:3020
 mount_fs+0xae/0x328 fs/super.c:1268
 vfs_kern_mount.part.34+0xd4/0x4d0 fs/namespace.c:1037
 vfs_kern_mount fs/namespace.c:1027 [inline]
 do_new_mount fs/namespace.c:2517 [inline]
 do_mount+0x564/0x3070 fs/namespace.c:2847
 ksys_mount+0x12d/0x140 fs/namespace.c:3063
 __do_sys_mount fs/namespace.c:3077 [inline]
 __se_sys_mount fs/namespace.c:3074 [inline]
 __x64_sys_mount+0xbe/0x150 fs/namespace.c:3074
 do_syscall_64+0x1b1/0x800 arch/x86/entry/common.c:287
 entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x443d6a
RSP: 002b:00007ffd312813c8 EFLAGS: 00000297 ORIG_RAX: 00000000000000a5
RAX: ffffffffffffffda RBX: 0000000020000c00 RCX: 0000000000443d6a
RDX: 0000000020000000 RSI: 0000000020000100 RDI: 00007ffd312813d0
RBP: 0000000000000003 R08: 0000000020016a00 R09: 000000000000000a
R10: 0000000000000000 R11: 0000000000000297 R12: 0000000000000004
R13: 0000000000402c60 R14: 0000000000000000 R15: 0000000000000000
RIP: build_sit_entries fs/f2fs/segment.c:3653 [inline] RSP: ffff8801b102e5b0
RIP: build_segment_manager+0x7ef7/0xbf70 fs/f2fs/segment.c:3852 RSP: ffff8801b102e5b0
CR2: ffffed006b2a50c0
---[ end trace a2034989e196ff17 ]---

Reported-and-tested-by: syzbot+83699adeb2d13579c31e@syzkaller.appspotmail.com
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-05-31 11:31:48 -07:00
Jaegeuk Kim
5d64600d4f f2fs: avoid bug_on on corrupted inode
syzbot has tested the proposed patch but the reproducer still triggered crash:
kernel BUG at fs/f2fs/inode.c:LINE!

F2FS-fs (loop1): invalid crc value
F2FS-fs (loop5): Magic Mismatch, valid(0xf2f52010) - read(0x0)
F2FS-fs (loop5): Can't find valid F2FS filesystem in 1th superblock
F2FS-fs (loop5): invalid crc value
------------[ cut here ]------------
kernel BUG at fs/f2fs/inode.c:238!
invalid opcode: 0000 [#1] SMP KASAN
Dumping ftrace buffer:
   (ftrace buffer empty)
Modules linked in:
CPU: 1 PID: 4886 Comm: syz-executor1 Not tainted 4.17.0-rc1+ #1
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:do_read_inode fs/f2fs/inode.c:238 [inline]
RIP: 0010:f2fs_iget+0x3307/0x3ca0 fs/f2fs/inode.c:313
RSP: 0018:ffff8801c44a70e8 EFLAGS: 00010293
RAX: ffff8801ce208040 RBX: ffff8801b3621080 RCX: ffffffff82eace18
F2FS-fs (loop2): Magic Mismatch, valid(0xf2f52010) - read(0x0)
RDX: 0000000000000000 RSI: ffffffff82eaf047 RDI: 0000000000000007
RBP: ffff8801c44a7410 R08: ffff8801ce208040 R09: ffffed0039ee4176
R10: ffffed0039ee4176 R11: ffff8801cf720bb7 R12: ffff8801c0efa000
R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000000000
FS:  00007f753aa9d700(0000) GS:ffff8801daf00000(0000) knlGS:0000000000000000
------------[ cut here ]------------
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
kernel BUG at fs/f2fs/inode.c:238!
CR2: 0000000001b03018 CR3: 00000001c8b74000 CR4: 00000000001406e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 f2fs_fill_super+0x4377/0x7bf0 fs/f2fs/super.c:2842
 mount_bdev+0x30c/0x3e0 fs/super.c:1165
 f2fs_mount+0x34/0x40 fs/f2fs/super.c:3020
 mount_fs+0xae/0x328 fs/super.c:1268
 vfs_kern_mount.part.34+0xd4/0x4d0 fs/namespace.c:1037
 vfs_kern_mount fs/namespace.c:1027 [inline]
 do_new_mount fs/namespace.c:2517 [inline]
 do_mount+0x564/0x3070 fs/namespace.c:2847
 ksys_mount+0x12d/0x140 fs/namespace.c:3063
 __do_sys_mount fs/namespace.c:3077 [inline]
 __se_sys_mount fs/namespace.c:3074 [inline]
 __x64_sys_mount+0xbe/0x150 fs/namespace.c:3074
 do_syscall_64+0x1b1/0x800 arch/x86/entry/common.c:287
 entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x457daa
RSP: 002b:00007f753aa9cba8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5
RAX: ffffffffffffffda RBX: 0000000020000000 RCX: 0000000000457daa
RDX: 0000000020000000 RSI: 0000000020000100 RDI: 00007f753aa9cbf0
RBP: 0000000000000064 R08: 0000000020016a00 R09: 0000000020000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000003
R13: 0000000000000064 R14: 00000000006fcb80 R15: 0000000000000000
RIP: do_read_inode fs/f2fs/inode.c:238 [inline] RSP: ffff8801c44a70e8
RIP: f2fs_iget+0x3307/0x3ca0 fs/f2fs/inode.c:313 RSP: ffff8801c44a70e8
invalid opcode: 0000 [#2] SMP KASAN
---[ end trace 1cbcbec2156680bc ]---

Reported-and-tested-by: syzbot+41a1b341571f0952badb@syzkaller.appspotmail.com
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-05-31 11:31:48 -07:00
Jaegeuk Kim
a4f843bd00 f2fs: give message and set need_fsck given broken node id
syzbot hit the following crash on upstream commit
83beed7b2b (Fri Apr 20 17:56:32 2018 +0000)
Merge branch 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/evalenti/linux-soc-thermal
syzbot dashboard link: https://syzkaller.appspot.com/bug?extid=d154ec99402c6f628887

C reproducer: https://syzkaller.appspot.com/x/repro.c?id=5414336294027264
syzkaller reproducer: https://syzkaller.appspot.com/x/repro.syz?id=5471683234234368
Raw console output: https://syzkaller.appspot.com/x/log.txt?id=5436660795834368
Kernel config: https://syzkaller.appspot.com/x/.config?id=1808800213120130118
compiler: gcc (GCC) 8.0.1 20180413 (experimental)

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+d154ec99402c6f628887@syzkaller.appspotmail.com
It will help syzbot understand when the bug is fixed. See footer for details.
If you forward the report, please keep this part and the footer.

F2FS-fs (loop0): Magic Mismatch, valid(0xf2f52010) - read(0x0)
F2FS-fs (loop0): Can't find valid F2FS filesystem in 1th superblock
F2FS-fs (loop0): invalid crc value
------------[ cut here ]------------
kernel BUG at fs/f2fs/node.c:1185!
invalid opcode: 0000 [#1] SMP KASAN
Dumping ftrace buffer:
   (ftrace buffer empty)
Modules linked in:
CPU: 1 PID: 4549 Comm: syzkaller704305 Not tainted 4.17.0-rc1+ #10
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:__get_node_page+0xb68/0x16e0 fs/f2fs/node.c:1185
RSP: 0018:ffff8801d960e820 EFLAGS: 00010293
RAX: ffff8801d88205c0 RBX: 0000000000000003 RCX: ffffffff82f6cc06
RDX: 0000000000000000 RSI: ffffffff82f6d5e8 RDI: 0000000000000004
RBP: ffff8801d960ec30 R08: ffff8801d88205c0 R09: ffffed003b5e46c2
R10: 0000000000000003 R11: 0000000000000003 R12: ffff8801a86e00c0
R13: 0000000000000001 R14: ffff8801a86e0530 R15: ffff8801d9745240
FS:  000000000072c880(0000) GS:ffff8801daf00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f3d403209b8 CR3: 00000001d8f3f000 CR4: 00000000001406e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 get_node_page fs/f2fs/node.c:1237 [inline]
 truncate_xattr_node+0x152/0x2e0 fs/f2fs/node.c:1014
 remove_inode_page+0x200/0xaf0 fs/f2fs/node.c:1039
 f2fs_evict_inode+0xe86/0x1710 fs/f2fs/inode.c:547
 evict+0x4a6/0x960 fs/inode.c:557
 iput_final fs/inode.c:1519 [inline]
 iput+0x62d/0xa80 fs/inode.c:1545
 f2fs_fill_super+0x5f4e/0x7bf0 fs/f2fs/super.c:2849
 mount_bdev+0x30c/0x3e0 fs/super.c:1164
 f2fs_mount+0x34/0x40 fs/f2fs/super.c:3020
 mount_fs+0xae/0x328 fs/super.c:1267
 vfs_kern_mount.part.34+0xd4/0x4d0 fs/namespace.c:1037
 vfs_kern_mount fs/namespace.c:1027 [inline]
 do_new_mount fs/namespace.c:2518 [inline]
 do_mount+0x564/0x3070 fs/namespace.c:2848
 ksys_mount+0x12d/0x140 fs/namespace.c:3064
 __do_sys_mount fs/namespace.c:3078 [inline]
 __se_sys_mount fs/namespace.c:3075 [inline]
 __x64_sys_mount+0xbe/0x150 fs/namespace.c:3075
 do_syscall_64+0x1b1/0x800 arch/x86/entry/common.c:287
 entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x443dea
RSP: 002b:00007ffcc7882368 EFLAGS: 00000297 ORIG_RAX: 00000000000000a5
RAX: ffffffffffffffda RBX: 0000000020000c00 RCX: 0000000000443dea
RDX: 0000000020000000 RSI: 0000000020000100 RDI: 00007ffcc7882370
RBP: 0000000000000003 R08: 0000000020016a00 R09: 000000000000000a
R10: 0000000000000000 R11: 0000000000000297 R12: 0000000000000004
R13: 0000000000402ce0 R14: 0000000000000000 R15: 0000000000000000
RIP: __get_node_page+0xb68/0x16e0 fs/f2fs/node.c:1185 RSP: ffff8801d960e820
---[ end trace 4edbeb71f002bb76 ]---

Reported-and-tested-by: syzbot+d154ec99402c6f628887@syzkaller.appspotmail.com
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-05-31 11:31:48 -07:00
Chao Yu
cf52b27a39 f2fs: clean up commit_inmem_pages()
This patch moves error handling from commit_inmem_pages() into
__commit_inmem_page() for cleanup, no logic change.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-05-31 11:31:47 -07:00
Sheng Yong
eff15c2aec f2fs: do not check F2FS_INLINE_DOTS in recover
Only dir may have F2FS_INLINE_DOTS flag, so there is no need to check
the flag in recover flow.

Signed-off-by: Sheng Yong <shengyong1@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-05-31 11:31:47 -07:00
Sheng Yong
a515d12f1b f2fs: remove duplicated dquot_initialize and fix error handling
This patch removes duplicated dquot_initialize in recover_orphan_inode(),
and fix the error handling if dquot_initialize fails.

Signed-off-by: Sheng Yong <shengyong1@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-05-31 11:31:47 -07:00
Chao Yu
c22aecd759 f2fs: fix to detect failure of dquot_initialize
dquot_initialize() can fail due to any exception inside quota subsystem,
f2fs needs to be aware of it, and return correct return value to caller.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-05-31 11:31:47 -07:00
Yunlei He
d618477473 f2fs: stop issue discard if something wrong with f2fs
v4->v5: move data corruption check to __submit_discard_cmd, in order to
control discard io submitted more accurately, besides, increase async
thread wait time if data corruption detected.

This patch stop async thread and umount process to issue discard
if something wrong with f2fs, which is similar to fstrim.

Signed-off-by: Yunlei He <heyunlei@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-05-31 11:31:47 -07:00
Chao Yu
b169c3c560 f2fs: fix return value in f2fs_ioc_commit_atomic_write
In f2fs_ioc_commit_atomic_write, if file is volatile, return -EINVAL to
indicate that commit failure.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-05-31 11:31:46 -07:00
Yunlei He
054afda999 f2fs: allocate hot_data for atomic write more strictly
If a file not set type as hot, has dirty pages more than
threshold 64 before starting atomic write, may be lose hot
flag.

v1->v2: move set FI_ATOMIC_FILE flag behind flush dirty pages too,
in case of dirty pages before starting atomic use atomic mode to
write back.

Signed-off-by: Yunlei He <heyunlei@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-05-31 11:31:46 -07:00
Sheng Yong
d0891e84e1 f2fs: check if inmem_pages list is empty correctly
`cur' will never be NULL, we should check inmem_pages list instead.

Signed-off-by: Sheng Yong <shengyong1@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-05-31 11:31:46 -07:00
Chao Yu
27319ba404 f2fs: fix race in between GC and atomic open
Thread					GC thread
- f2fs_ioc_start_atomic_write
 - get_dirty_pages
 - filemap_write_and_wait_range
					- f2fs_gc
					 - do_garbage_collect
					  - gc_data_segment
					   - move_data_page
					    - f2fs_is_atomic_file
					    - set_page_dirty
 - set_inode_flag(, FI_ATOMIC_FILE)

Dirty data page can still be generated by GC in race condition as
above call stack.

This patch adds fi->dio_rwsem[WRITE] in f2fs_ioc_start_atomic_write
to avoid such race.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-05-31 11:31:46 -07:00
Souptick Joarder
ea4d479bb3 fs: f2fs: Adding new return type vm_fault_t
Use new return type vm_fault_t for page_mkwrite
and fault handler.

Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com>
Acked-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-05-31 11:31:46 -07:00
Zhikang Zhang
d6964949e4 f2fs: change le32 to le16 of f2fs_inode->i_extra_size
In the structure of f2fs_inode, i_extra_size's type is __le16,
so we should keep type consistent when using it.

Fixes: 704956ecf5 ("f2fs: support inode checksum")
Signed-off-by: Zhikang Zhang <zhangzhikang1@huawei.com>
Signed-off-by: Yunlei He <heyunlei@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-05-31 11:31:46 -07:00
Zhikang Zhang
56b07e7e65 f2fs: check cur_valid_map_mir & raw_sit block count when flush sit entries
We should check valid_map_mir and block count to ensure
the flushed raw_sit is correct.

Signed-off-by: Zhikang Zhang <zhangzhikang1@huawei.com>
Signed-off-by: Yunlei He <heyunlei@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-05-31 11:31:46 -07:00
Chao Yu
3d165dc3ae f2fs: correct return value of f2fs_trim_fs
Correct return value in two cases:
- return EINVAL if end boundary is out-of-range.
- return EIO if fs needs off-line check.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-05-31 11:31:45 -07:00
Chao Yu
764811581d f2fs: fix to show missing bits in FS_IOC_GETFLAGS
This patch fixes to show missing encrypt/inline_data flag in
FS_IOC_GETFLAGS like ext4 does.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-05-31 11:31:45 -07:00
Chao Yu
c807a7cb54 f2fs: remove unneeded F2FS_PROJINHERIT_FL
Now F2FS_FL_USER_VISIBLE and F2FS_FL_USER_MODIFIABLE has included
F2FS_PROJINHERIT_FL, so remove unneeded F2FS_PROJINHERIT_FL when
using visible/modifiable flag macro.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-05-31 11:31:45 -07:00
Chao Yu
81114baa83 f2fs: don't use GFP_ZERO for page caches
Related to https://lkml.org/lkml/2018/4/8/661

Sometimes, we need to write meta data to new allocated block address,
then we will allocate a zeroed page in inner inode's address space, and
fill partial data in it, and leave other place with zero value which means
some fields are initial status.

There are two inner inodes (meta inode and node inode) setting __GFP_ZERO,
I have just checked them, for both of them, we can avoid using __GFP_ZERO,
and do initialization by ourselves to avoid unneeded/redundant zeroing
from mm.

Cc: <stable@vger.kernel.org>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-05-31 11:31:45 -07:00
Yunlei He
241b493d8f f2fs: issue all big range discards in umount process
This patch modify max_requests to UINT_MAX, to issue
all big range discards in umount.

Signed-off-by: Yunlei He <heyunlei@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-05-31 11:31:45 -07:00
Chao Yu
47cca3d62a f2fs: remove redundant block plug
For buffered IO, we don't need to use block plug to cache bio,
for direct IO, generic f2fs_direct_IO has already added block
plug, so let's remove redundant one in .write_iter.

As Yunlei described in his patch:

-f2fs_file_write_iter
  -blk_start_plug
    -__generic_file_write_iter
	...
	  -do_blockdev_direct_IO
	    -blk_start_plug
		...
	    -blk_finish_plug
	...
  -blk_finish_plug

which may conduct performance decrease in our platform

Signed-off-by: Yunlei He <heyunlei@huawei.com>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-05-31 11:31:45 -07:00
Yunlong Song
8045249829 f2fs: remove unmatched zero_user_segment when convert inline dentry
Since the layout of regular dentry block is different from inline dentry
block, zero_user_segment starting from MAX_INLINE_DATA(dir) is not
correct for regular dentry block, besides, bitmap is already copied and
used, so there is no necessary to zero page at all, so just remove the
zero_user_segment is OK.

Signed-off-by: Yunlong Song <yunlong.song@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-05-31 11:31:44 -07:00
Chao Yu
59c844088b f2fs: introduce private inode status mapping
Previously, we use generic FS_*_FL defined by vfs to indicate inode status
for each bit of i_flags, so f2fs's flag status definition is tied to vfs'
one, it will be hard for f2fs to reuse bits f2fs never used to indicate
new status..

In order to solve this issue, we introduce private inode status mapping,
Note, for these bits have already been persisted into disk, we should
never change their definition, for other ones, we can remap them for
later new coming status.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-05-31 11:31:44 -07:00
Jaegeuk Kim
e555da9f31 f2fs: run fstrim asynchronously if runtime discard is on
We don't need to wait for whole bunch of discard candidates in fstrim, since
runtime discard will issue them in idle time.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-05-31 10:20:48 -07:00
Dave Jiang
80660f2025 dax: change bdev_dax_supported() to support boolean returns
The function return values are confusing with the way the function is
named. We expect a true or false return value but it actually returns
0/-errno.  This makes the code very confusing. Changing the return values
to return a bool where if DAX is supported then return true and no DAX
support returns false.

Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-05-31 08:58:34 -07:00
Darrick J. Wong
ba23cba9b3 fs: allow per-device dax status checking for filesystems
Change bdev_dax_supported so it takes a bdev parameter.  This enables
multi-device filesystems like xfs to check that a dax device can work for
the particular filesystem.  Once that's in place, actually fix all the
parts of XFS where we need to be able to distinguish between datadev and
rtdev.

This patch fixes the problem where we screw up the dax support checking
in xfs if the datadev and rtdev have different dax capabilities.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
[rez: Re-added __bdev_dax_supported() for !CONFIG_FS_DAX cases]
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
2018-05-31 08:58:33 -07:00
Adam Manzanares
087e566916 fs: iomap dio set bio prio from kiocb prio
Now that kiocb has an ioprio field copy this over to the bio when it is
created from the kiocb during direct IO.

Signed-off-by: Adam Manzanares <adam.manzanares@wdc.com>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-05-31 10:50:56 -04:00
Adam Manzanares
074111ca5f fs: blkdev set bio prio from kiocb prio
Now that kiocb has an ioprio field copy this over to the bio when it is
created from the kiocb.

Signed-off-by: Adam Manzanares <adam.manzanares@wdc.com>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-05-31 10:50:55 -04:00
Adam Manzanares
d9a08a9e61 fs: Add aio iopriority support
This is the per-I/O equivalent of the ioprio_set system call.

When IOCB_FLAG_IOPRIO is set on the iocb aio_flags field, then we set the
newly added kiocb ki_ioprio field to the value in the iocb aio_reqprio field.

This patch depends on block: add ioprio_check_cap function.

Signed-off-by: Adam Manzanares <adam.manzanares@wdc.com>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-05-31 10:50:55 -04:00
Adam Manzanares
fc28724d67 fs: Convert kiocb rw_hint from enum to u16
In order to avoid kiocb bloat for per command iopriority support, rw_hint
is converted from enum to a u16. Added a guard around ki_hint assignment.

Signed-off-by: Adam Manzanares <adam.manzanares@wdc.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-05-31 10:50:54 -04:00
Tetsuo Handa
543b8f8662 fuse: don't keep dead fuse_conn at fuse_fill_super().
syzbot is reporting use-after-free at fuse_kill_sb_blk() [1].
Since sb->s_fs_info field is not cleared after fc was released by
fuse_conn_put() when initialization failed, fuse_kill_sb_blk() finds
already released fc and tries to hold the lock. Fix this by clearing
sb->s_fs_info field after calling fuse_conn_put().

[1] https://syzkaller.appspot.com/bug?id=a07a680ed0a9290585ca424546860464dd9658db

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reported-by: syzbot <syzbot+ec3986119086fe4eec97@syzkaller.appspotmail.com>
Fixes: 3b463ae0c6 ("fuse: invalidation reverse calls")
Cc: John Muir <john@jmuir.com>
Cc: Csaba Henk <csaba@gluster.com>
Cc: Anand Avati <avati@redhat.com>
Cc: <stable@vger.kernel.org> # v2.6.31
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-05-31 12:26:11 +02:00
Miklos Szeredi
6becdb601b fuse: fix control dir setup and teardown
syzbot is reporting NULL pointer dereference at fuse_ctl_remove_conn() [1].
Since fc->ctl_ndents is incremented by fuse_ctl_add_conn() when new_inode()
failed, fuse_ctl_remove_conn() reaches an inode-less dentry and tries to
clear d_inode(dentry)->i_private field.

Fix by only adding the dentry to the array after being fully set up.

When tearing down the control directory, do d_invalidate() on it to get rid
of any mounts that might have been added.

[1] https://syzkaller.appspot.com/bug?id=f396d863067238959c91c0b7cfc10b163638cac6
Reported-by: syzbot <syzbot+32c236387d66c4516827@syzkaller.appspotmail.com>
Fixes: bafa96541b ("[PATCH] fuse: add control filesystem")
Cc: <stable@vger.kernel.org> # v2.6.18
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-05-31 12:26:10 +02:00
Tejun Heo
8a301eb16d fuse: fix congested state leak on aborted connections
If a connection gets aborted while congested, FUSE can leave
nr_wb_congested[] stuck until reboot causing wait_iff_congested() to
wait spuriously which can lead to severe performance degradation.

The leak is caused by gating congestion state clearing with
fc->connected test in request_end().  This was added way back in 2009
by 26c3679101 ("fuse: destroy bdi on umount").  While the commit
description doesn't explain why the test was added, it most likely was
to avoid dereferencing bdi after it got destroyed.

Since then, bdi lifetime rules have changed many times and now we're
always guaranteed to have access to the bdi while the superblock is
alive (fc->sb).

Drop fc->connected conditional to avoid leaking congestion states.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Joshua Miller <joshmiller@fb.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: stable@vger.kernel.org # v2.6.29+
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-05-31 12:26:10 +02:00
Eric W. Biederman
4ad769f3c3 fuse: Allow fully unprivileged mounts
Now that the fuse and the vfs work is complete.  Allow the fuse filesystem
to be mounted by the root user in a user namespace.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-05-31 12:26:10 +02:00
Eric W. Biederman
e45b2546e2 fuse: Ensure posix acls are translated outside of init_user_ns
Ensure the translation happens by failing to read or write
posix acls when the filesystem has not indicated it supports
posix acls.

This ensures that modern cached posix acl support is available
and used when dealing with posix acls.  This is important
because only that path has the code to convernt the uids and
gids in posix acls into the user namespace of a fuse filesystem.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-05-31 12:26:10 +02:00
Tomohiro Misono
23d0b79dfa btrfs: Add unprivileged version of ino_lookup ioctl
Add unprivileged version of ino_lookup ioctl BTRFS_IOC_INO_LOOKUP_USER
to allow normal users to call "btrfs subvolume list/show" etc. in
combination with BTRFS_IOC_GET_SUBVOL_INFO/BTRFS_IOC_GET_SUBVOL_ROOTREF.

This can be used like BTRFS_IOC_INO_LOOKUP but the argument is
different. This is  because it always searches the fs/file tree
correspoinding to the fd with which this ioctl is called and also
returns the name of bottom subvolume.

The main differences from original ino_lookup ioctl are:

  1. Read + Exec permission will be checked using inode_permission()
     during path construction. -EACCES will be returned in case
     of failure.
  2. Path construction will be stopped at the inode number which
     corresponds to the fd with which this ioctl is called. If
     constructed path does not exist under fd's inode, -EACCES
     will be returned.
  3. The name of bottom subvolume is also searched and filled.

Note that the maximum length of path is shorter 256 (BTRFS_VOL_NAME_MAX+1)
bytes than ino_lookup ioctl because of space of subvolume's name.

Reviewed-by: Gu Jinxiang <gujx@cn.fujitsu.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Tested-by: Gu Jinxiang <gujx@cn.fujitsu.com>
Signed-off-by: Tomohiro Misono <misono.tomohiro@jp.fujitsu.com>
[ style fixes ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-31 11:35:24 +02:00
Tomohiro Misono
42e4b520c8 btrfs: Add unprivileged ioctl which returns subvolume's ROOT_REF
Add unprivileged ioctl BTRFS_IOC_GET_SUBVOL_ROOTREF which returns
ROOT_REF information of the subvolume containing this inode except the
subvolume name (this is because to prevent potential name leak). The
subvolume name will be gained by user version of ino_lookup ioctl
(BTRFS_IOC_INO_LOOKUP_USER) which also performs permission check.

The min id of root ref's subvolume to be searched is specified by
@min_id in struct btrfs_ioctl_get_subvol_rootref_args. After the search
ends, @min_id is set to the last searched root ref's subvolid + 1. Also,
if there are more root refs than BTRFS_MAX_ROOTREF_BUFFER_NUM,
-EOVERFLOW is returned. Therefore the caller can just call this ioctl
again without changing the argument to continue search.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Gu Jinxiang <gujx@cn.fujitsu.com>
Tested-by: Gu Jinxiang <gujx@cn.fujitsu.com>
Signed-off-by: Tomohiro Misono <misono.tomohiro@jp.fujitsu.com>
[ style fixes and struct item renames ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-31 11:35:23 +02:00
Tomohiro Misono
b64ec075bd btrfs: Add unprivileged ioctl which returns subvolume information
Add new unprivileged ioctl BTRFS_IOC_GET_SUBVOL_INFO which returns
the information of subvolume containing this inode.
(i.e. returns the information in ROOT_ITEM and ROOT_BACKREF.)

Reviewed-by: Gu Jinxiang <gujx@cn.fujitsu.com>
Tested-by: Gu Jinxiang <gujx@cn.fujitsu.com>
Signed-off-by: Tomohiro Misono <misono.tomohiro@jp.fujitsu.com>
[ minor style fixes, update struct comments ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-31 11:35:23 +02:00
Amir Goldstein
01b39dcc95 ovl: use inode_insert5() to hash a newly created inode
Currently, there is a small window where ovl_obtain_alias() can
race with ovl_instantiate() and create two different overlay inodes
with the same underlying real non-dir non-hardlink inode.

The race requires an adversary to guess the file handle of the
yet to be created upper inode and decode the guessed file handle
after ovl_creat_real(), but before ovl_instantiate().
This race does not affect overlay directory inodes, because those
are decoded via ovl_lookup_real() and not with ovl_obtain_alias().

This patch fixes the race, by using inode_insert5() to add a newly
created inode to cache.

If the newly created inode apears to already exist in cache (hashed
by the same real upper inode), we instantiate the dentry with the old
inode and drop the new inode, instead of silently not hashing the new
inode.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-05-31 11:06:12 +02:00
Vivek Goyal
ac6a52eb65 ovl: Pass argument to ovl_get_inode() in a structure
ovl_get_inode() right now has 5 parameters. Soon this patch series will
add 2 more and suddenly argument list starts looking too long.

Hence pass arguments to ovl_get_inode() in a structure and it looks
little cleaner.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-05-31 11:06:12 +02:00
Miklos Szeredi
80ea09a002 vfs: factor out inode_insert5()
Split out common helper for race free insertion of an already allocated
inode into the cache.  Use this from iget5_locked() and
insert_inode_locked4().  Make iget5_locked() use new_inode()/iput() instead
of alloc_inode()/destroy_inode() directly.

Also export to modules for use by filesystems which want to preallocate an
inode before file/directory creation.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-05-31 11:06:11 +02:00
Miklos Szeredi
b148cba403 ovl: clean up copy-up error paths
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-05-31 11:06:11 +02:00
Miklos Szeredi
dd8ac699ed ovl: return EIO on internal error
EIO better represents an internal error than ENOENT.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-05-31 11:06:11 +02:00
Al Viro
f73cc77c3a ovl: make ovl_create_real() cope with vfs_mkdir() safely
vfs_mkdir() may succeed and leave the dentry passed to it unhashed and
negative.  ovl_create_real() is the last caller breaking when that
happens.

[amir: split re-factoring of ovl_create_temp() to prep patch
       add comment about unhashed dir after mkdir
       add pr_warn() if mkdir succeeds and lookup fails]

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-05-31 11:06:11 +02:00
Amir Goldstein
137ec526a2 ovl: create helper ovl_create_temp()
Also used ovl_create_temp() in ovl_create_index() instead of calling
ovl_do_mkdir() directly, so now all callers of ovl_do_mkdir() are routed
through ovl_create_real(), which paves the way for Al's fix for non-hashed
result from vfs_mkdir().

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-05-31 11:06:11 +02:00
Miklos Szeredi
95a1c8153a ovl: return dentry from ovl_create_real()
Al Viro suggested to simplify callers of ovl_create_real() by
returning the created dentry (or ERR_PTR) from ovl_create_real().

Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-05-31 11:06:11 +02:00
Amir Goldstein
471ec5dcf4 ovl: struct cattr cleanups
* Rename to ovl_cattr

* Fold ovl_create_real() hardlink argument into struct ovl_cattr

* Create macro OVL_CATTR() to initialize struct ovl_cattr from mode

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-05-31 11:06:10 +02:00
Amir Goldstein
6cf00764b0 ovl: strip debug argument from ovl_do_ helpers
It did not prove to be useful.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-05-31 11:06:10 +02:00
Amir Goldstein
a8b9e0ceed ovl: remove WARN_ON() real inode attributes mismatch
Overlayfs should cope with online changes to underlying layer
without crashing the kernel, which is what xfstest overlay/019
checks.

This test may sometimes trigger WARN_ON() in ovl_create_or_link()
when linking an overlay inode that has been changed on underlying
layer.

Remove those WARN_ON() to prevent the stress test from failing.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-05-31 11:06:10 +02:00
Miklos Szeredi
4280f74a57 ovl: Kconfig documentation fixes
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-05-31 11:06:10 +02:00
Darrick J. Wong
829bc787c1 fs: clear writeback errors in inode_init_always
In inode_init_always(), we clear the inode mapping flags, which clears
any retained error (AS_EIO, AS_ENOSPC) bits.  Unfortunately, we do not
also clear wb_err, which means that old mapping errors can leak through
to new inodes.

This is crucial for the XFS inode allocation path because we recycle old
in-core inodes and we do not want error state from an old file to leak
into the new file.  This bug was discovered by running generic/036 and
generic/047 in a loop and noticing that the EIOs generated by the
collision of direct and buffered writes in generic/036 would survive the
remount between 036 and 047, and get reported to the fsyncs (on
different files!) in generic/047.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2018-05-30 19:43:53 -07:00
Steve French
28d59363ae smb3: add tracepoints for smb2/smb3 open
add two tracepoints for open completion. One for error one for completion (open_done).
Sample output below

            TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
               | |       |   ||||       |         |
            bash-15348 [007] .... 42441.027492: smb3_enter: 	cifs_lookup: xid=45
            bash-15348 [007] .... 42441.028214: smb3_cmd_err: 	sid=0x6173e4ce tid=0xa05150e6 cmd=5 mid=105 status=0xc0000034 rc=-2
            bash-15348 [007] .... 42441.028219: smb3_open_err: xid=45 sid=0x6173e4ce tid=0xa05150e6 cr_opts=0x0 des_access=0x80 rc=-2
            bash-15348 [007] .... 42441.028225: smb3_exit_done: 	cifs_lookup: xid=45
          fop777-24560 [002] .... 42442.627617: smb3_enter: 	cifs_revalidate_dentry_attr: xid=46
          fop777-24560 [003] .... 42442.628301: smb3_cmd_err: 	sid=0x6173e4ce tid=0xa05150e6 cmd=5 mid=106 status=0xc0000034 rc=-2
          fop777-24560 [003] .... 42442.628319: smb3_open_err: xid=46 sid=0x6173e4ce tid=0xa05150e6 cr_opts=0x0 des_access=0x80 rc=-2
          fop777-24560 [003] .... 42442.628335: smb3_enter: 	cifs_atomic_open: xid=47
          fop777-24560 [003] .... 42442.629587: smb3_cmd_done: 	sid=0x6173e4ce tid=0xa05150e6 cmd=5 mid=107
          fop777-24560 [003] .... 42442.629592: smb3_open_done: xid=47 sid=0x6173e4ce tid=0xa05150e6 fid=0xb8a0984d cr_opts=0x40 des_access=0x40000080

Signed-off-by: Steve French <smfrench@gmail.com>
2018-05-30 21:42:34 -05:00
Steve French
5c5a41be89 cifs: add debug output to show nocase mount option
For smb1 nocase can be specified on mount.  Allow displaying it
in debug data.

Signed-off-by: Steve French <smfrench@gmail.com>
2018-05-30 17:59:49 -05:00
Steve French
fe048402e8 smb3: add define for id for posix create context and corresponding struct
Signed-off-by: Steve French <smfrench@gmail.com>
2018-05-30 17:59:46 -05:00
Ronnie Sahlberg
98170fb535 cifs: update smb2_check_message to handle PDUs without a 4 byte length header
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2018-05-30 17:24:14 -05:00
Kent Overstreet
e292d7bc63 xfs: convert to bioset_init()/mempool_init()
Convert XFS to embedded bio sets.

Acked-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-30 15:33:32 -06:00
Kent Overstreet
8ac9f7c1fd btrfs: convert to bioset_init()/mempool_init()
Convert btrfs to embedded bio sets.

Acked-by: Chris Mason <clm@fb.com>
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-30 15:33:32 -06:00
Kent Overstreet
52190f8abe fs: convert block_dev.c to bioset_init()
Convert block DIO code to embedded bio sets.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-30 15:33:32 -06:00
Steve French
b326614ea2 smb3: allow "posix" mount option to enable new SMB311 protocol extensions
If "posix" (or synonym "unix" for backward compatibility) specified on mount,
and server advertises support for SMB3.11 POSIX negotiate context, then
enable the new posix extensions on the tcon.  This can be viewed by
looking for "posix" in the mount options displayed by /proc/mounts
for that mount (ie if posix extensions allowed by server and the
experimental POSIX extensions also requested on the mount by specifying
"posix" at mount time).

Also add check to warn user if conflicting unix/nounix or posix/noposix specified
on mount.

Signed-off-by: Steve French <smfrench@gmail.com>
2018-05-30 16:06:18 -05:00
Steve French
fcef0db6d6 smb3: add support for posix negotiate context
Unlike CIFS where UNIX/POSIX extensions had been negotiatable,
SMB3 did not have POSIX extensions yet.  Add the new SMB3.11
POSIX negotiate context to ask the server whether it can
support POSIX (and thus whether we can send the new POSIX open
context).

Signed-off-by: Steve French <smfrench@gmail.com>
2018-05-30 16:06:18 -05:00
Steve French
f92a720ee9 cifs: allow disabling less secure legacy dialects
To improve security it may be helpful to have additional ways to restrict the
ability to override the default dialects (SMB2.1, SMB3 and SMB3.02) on mount
with old dialects (CIFS/SMB1 and SMB2) since vers=1.0 (CIFS/SMB1) and vers=2.0
are weaker and less secure.

Add a module parameter "disable_legacy_dialects"
(/sys/module/cifs/parameters/disable_legacy_dialects) which can be set to
1 (or equivalently Y) to forbid use of vers=1.0 or vers=2.0 on mount.

Also cleans up a few build warnings about globals for various module parms.

Signed-off-by: Steve French <smfrench@gmail.com>
2018-05-30 16:06:18 -05:00
Steve French
11911b956f cifs: make minor clarifications to module params for cifs.ko
Note which ones of the module params are cifs dialect only
(N/A for default dialect now that has moved to SMB2.1 or later)

Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
2018-05-30 16:06:18 -05:00
Ronnie Sahlberg
6539e7f372 cifs: show the "w" bit for writeable /proc/fs/cifs/* files
RHBZ: 1539612

Lets show the "w" bit for those files have a .write interface to set/enable/...
the feature.

Reported-by: Xiaoli Feng <xifeng@redhat.com>
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2018-05-30 16:06:18 -05:00
Steve French
49218b4f57 smb3: add module alias for smb3 to cifs.ko
We really don't want to be encouraging people to use the old
(less secure) cifs dialect (SMB1) and it can be confusing for them
with SMB3 (or later) being recommended but the module name is cifs.

Add a module alias for "smb3" to cifs.ko to make this less confusing.

Signed-off-by: Steve French <smfrench@gmail.com>
2018-05-30 16:06:18 -05:00
Ronnie Sahlberg
25ad1cbd02 cifs: return error on invalid value written to cifsFYI
RHBZ: 1539617

Check that, if it is not a boolean, the value the user tries
to write to /proc/fs/cifs/cifsFYI is valid and return an error
if not.

Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reported-by: Xiaoli Feng <xifeng@redhat.com>
2018-05-30 16:06:18 -05:00
Ronnie Sahlberg
57c55cd7c7 cifs: invalidate cache when we truncate a file
RHBZ: 1566345

When truncating a file we always do this synchronously to the server.
Thus we need to make sure that the cached inode metadata is
marked as stale so that on next getattr we will refresh the metadata.
In this particular bug we want to ensure that both ctime and mtime
are updated and become visible to the application after a truncate.

Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reported-by: Xiaoli Feng <xifeng@redhat.com>
2018-05-30 16:06:18 -05:00
Steve French
e0386e449a smb3: print tree id in debugdata in proc to be able to help logging
When loooking at the logs for the new trace-cmd tracepoints for cifs,
it would help to know which tid is for which share (UNC name) so
update /proc/fs/cifs/DebugData to display the tid.
Also display Maximal Access which was missing as well.

Now the entry for typical entry for a tcon (in proc/fs/cifs/) looks
like:

1) \\localhost\test Mounts: 1 DevInfo: 0x20 Attributes: 0x1006f
	PathComponentMax: 255 Status: 1 type: DISK
	Share Capabilities: None Aligned, Partition Aligned,	Share Flags: 0x0
	tid: 0xe0632a55	Optimal sector size: 0x200	Maximal Access: 0x1f01ff

Signed-off-by: Steve French <smfrench@gmail.com>
2018-05-30 16:06:18 -05:00
Steve French
d683bcd3e5 smb3: add additional ftrace entry points for entry/exit to cifs.ko
Signed-off-by: Steve French <smfrench@gmail.com>
2018-05-30 16:06:18 -05:00
Steve French
cfe8909164 smb3: fix various xid leaks
Fix a few cases where we were not freeing the xid which led to
active requests being non-zero at unmount time.

Signed-off-by: Steve French <smfrench@gmail.com>
CC: Stable <stable@vger.kernel.org>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
2018-05-30 16:06:18 -05:00
Long Li
57a929a66f CIFS: Introduce offset for the 1st page in data transfer structures
When direct I/O is used, the data buffer may not always align to page
boundaries. Introduce a page offset in transport data structures to
describe the location of the buffer within the page.

Also change the function to pass the page offset when sending data to
transport.

Signed-off-by: Long Li <longli@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2018-05-30 16:06:12 -05:00
Omar Sandoval
ad7e1a740d Btrfs: clean up error handling in btrfs_truncate()
btrfs_truncate() uses two variables for error handling, ret and err (if
this sounds familiar, it's because btrfs_truncate_inode_items() did
something similar). This is error prone, as was made evident by "Btrfs:
fix error handling in btrfs_truncate()". We only have err because we
don't want to mask an error if we call btrfs_update_inode() and
btrfs_end_transaction(), so let's make that its own scoped return
variable and use ret everywhere else.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-30 21:27:32 +02:00
Nikolay Borisov
c5794e5178 btrfs: Factor out write portion of btrfs_get_blocks_direct
Now that the read side is extracted into its own function, do the same
to the write side. This leaves btrfs_get_blocks_direct_write with the
sole purpose of handling common locking required. Also flip the
condition in btrfs_get_blocks_direct_write so that the write case
comes first and we check for if (Create) rather than if (!create). This
is purely subjective but I believe makes reading a bit more "linear".
No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-30 19:01:44 +02:00
Nikolay Borisov
1c8d0175df btrfs: Factor out read portion of btrfs_get_blocks_direct
Currently this function handles both the READ and WRITE dio cases. This
is facilitated by a bunch of 'if' statements, a goto short-circuit
statement and a very perverse aliasing of "!created"(READ) case
by setting lockstart = lockend and checking for lockstart < lockend for
detecting the write. Let's simplify this mess by extracting the
READ-only code into a separate __btrfs_get_block_direct_read function.
This is only the first step, the next one will be to factor out the
write side as well. The end goal will be to have the common locking/
unlocking code in btrfs_get_blocks_direct and then it will call either
the read|write subvariants. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-30 19:01:43 +02:00
Chao Yu
cba608493d f2fs: turn down IO priority of discard from background
In order to avoid interfering normal r/w IO, let's turn down IO
priority of discard issued from background.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-05-30 08:58:59 -07:00
Chao Yu
377224c471 f2fs: don't split checkpoint in fstrim
Now, we issue discard asynchronously in separated thread instead of in
checkpoint, after that, we won't encounter long latency in checkpoint
due to huge number of synchronous discard command handling, so, we don't
need to split checkpoint to do trim in batch, merge it and obsolete
related sysfs entry.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-05-30 08:58:59 -07:00
Jaegeuk Kim
8bb4f2535c f2fs: issue discard commands proactively in high fs utilization
In the high utilization like over 80%, we don't expect huge # of large discard
commands, but do many small pending discards which affects FTL GCs a lot.
Let's issue them in that case.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-05-30 08:58:59 -07:00
Su Yue
9132c4ff6f btrfs: return ENOMEM if path allocation fails in btrfs_cross_ref_exist
The error code does not match the reason of failure and may confuse the
callers.

Signed-off-by: Su Yue <suy.fnst@cn.fujitsu.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-30 17:33:58 +02:00
Kees Cook
1389053e1b btrfs: raid56: Remove VLA usage
In the quest to remove all stack VLA usage from the kernel[1], this
allocates the working buffers during regular init, instead of using stack
space. This refactors the allocation code a bit to make it easier
to review.

[1] https://lkml.kernel.org/r/CA+55aFzCG-zNmZwX4A2FQpadafLfEzK6CC=qPXydAacU1RqZWA@mail.gmail.com

Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-30 17:15:43 +02:00
Darrick J. Wong
d25522f10c xfs: repair superblocks
If one of the backup superblocks is found to differ seriously from
superblock 0, write out a fresh copy from the in-core sb.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2018-05-30 08:03:15 -07:00
Darrick J. Wong
7e85bc6c87 xfs: add helpers to attach quotas to inodes
Add a helper routine to attach quota information to inodes that are
about to undergo repair.  If that fails, we need to schedule a
quotacheck for the next mount but allow the corrupted metadata repair to
continue.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2018-05-30 08:03:15 -07:00
Darrick J. Wong
04a2b7b254 xfs: recover AG btree roots from rmap data
Add a helper function to help us recover btree roots from the rmap data.
Callers pass in a list of rmap owner codes, buffer ops, and magic
numbers.  We iterate the rmap records looking for owner matches, and
then read the matching blocks to see if the magic number & uuid match.
If so, we then read-verify the block, and if that passes then we retain
a pointer to the block with the highest level, assuming that by the end
of the call we will have found the root.  This will be used to reset the
AGF/AGI btree root fields during their rebuild procedures.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2018-05-30 08:03:14 -07:00
Darrick J. Wong
12c6510e2f xfs: add helpers to dispose of old btree blocks after a repair
Now that we've plumbed in the ability to construct a list of dead btree
blocks following a repair, add more helpers to dispose of them.  This is
done by examining the rmapbt -- if the btree was the only owner we can
free the block, otherwise it's crosslinked and we can only remove the
rmapbt record.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2018-05-30 08:03:14 -07:00
Darrick J. Wong
64a39d876e xfs: add helpers to collect and sift btree block pointers during repair
Add some helpers to assemble a list of fs block extents.  Generally,
repair functions will iterate the rmapbt to make a list (1) of all
extents owned by the nominal owner of the metadata structure; then they
will iterate all other structures with the same rmap owner to make a
list (2) of active blocks; and finally we have a subtraction function to
subtract all the blocks in (2) from (1), with the result that (1) is now
a list of blocks that were owned by the old btree and must be disposed.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2018-05-30 08:03:14 -07:00
Darrick J. Wong
73d6b42aa4 xfs: add helpers to allocate and initialize fresh btree roots
Add a pair of helper functions to allocate and initialize fresh btree
roots.  The repair functions will use these as part of recreating
corrupted metadata.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
2018-05-30 08:03:14 -07:00
Darrick J. Wong
0a9633fa2f xfs: add helpers to deal with transaction allocation and rolling
For repairs, we need to reserve at least as many blocks as we think
we're going to need to rebuild the data structure, and we're going to
need some helpers to roll transactions while maintaining locks on the AG
headers so that other threads cannot wander into the middle of a repair.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
2018-05-30 08:03:14 -07:00
Darrick J. Wong
51863d7dd7 xfs: grab the per-ag structure whenever relevant
Grab and hold the per-AG data across a scrub run whenever relevant.
This helps us avoid repeated trips through rcu and the radix tree
in the repair code.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2018-05-30 08:03:14 -07:00
Su Yue
090a127afa btrfs: return error value if create_io_em failed in cow_file_range
In cow_file_range(), create_io_em() may fail, but its return value is
not recorded.  Then return value may be 0 even it failed which is a
wrong behavior.

Let cow_file_range() return PTR_ERR(em) if create_io_em() failed.

Fixes: 6f9994dbab ("Btrfs: create a helper to create em for IO")
CC: stable@vger.kernel.org # 4.11+
Signed-off-by: Su Yue <suy.fnst@cn.fujitsu.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-30 16:51:08 +02:00
Gu JinXiang
6b0cb1f901 btrfs: drop useless member qgroup_reserved of btrfs_pending_snapshot
Since there is no more use of qgroup_reserved member in struct
btrfs_pending_snapshot, remove it.

Signed-off-by: Gu JinXiang <gujx@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-30 16:46:54 +02:00
Gu JinXiang
c4c129db5d btrfs: drop unused parameter qgroup_reserved
Since commit 7775c8184e ("btrfs: remove unused parameter from
btrfs_subvolume_release_metadata") parameter qgroup_reserved is not used
by caller of function btrfs_subvolume_reserve_metadata.  So remove it.

Signed-off-by: Gu JinXiang <gujx@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-30 16:46:53 +02:00
Ethan Lien
e73e81b6d0 btrfs: balance dirty metadata pages in btrfs_finish_ordered_io
[Problem description and how we fix it]
We should balance dirty metadata pages at the end of
btrfs_finish_ordered_io, since a small, unmergeable random write can
potentially produce dirty metadata which is multiple times larger than
the data itself. For example, a small, unmergeable 4KiB write may
produce:

    16KiB dirty leaf (and possibly 16KiB dirty node) in subvolume tree
    16KiB dirty leaf (and possibly 16KiB dirty node) in checksum tree
    16KiB dirty leaf (and possibly 16KiB dirty node) in extent tree

Although we do call balance dirty pages in write side, but in the
buffered write path, most metadata are dirtied only after we reach the
dirty background limit (which by far only counts dirty data pages) and
wakeup the flusher thread. If there are many small, unmergeable random
writes spread in a large btree, we'll find a burst of dirty pages
exceeds the dirty_bytes limit after we wakeup the flusher thread - which
is not what we expect. In our machine, it caused out-of-memory problem
since a page cannot be dropped if it is marked dirty.

Someone may worry about we may sleep in btrfs_btree_balance_dirty_nodelay,
but since we do btrfs_finish_ordered_io in a separate worker, it will not
stop the flusher consuming dirty pages. Also, we use different worker for
metadata writeback endio, sleep in btrfs_finish_ordered_io help us throttle
the size of dirty metadata pages.

[Reproduce steps]
To reproduce the problem, we need to do 4KiB write randomly spread in a
large btree. In our 2GiB RAM machine:

1) Create 4 subvolumes.
2) Run fio on each subvolume:

   [global]
   direct=0
   rw=randwrite
   ioengine=libaio
   bs=4k
   iodepth=16
   numjobs=1
   group_reporting
   size=128G
   runtime=1800
   norandommap
   time_based
   randrepeat=0

3) Take snapshot on each subvolume and repeat fio on existing files.
4) Repeat step (3) until we get large btrees.
   In our case, by observing btrfs_root_item->bytes_used, we have 2GiB of
   metadata in each subvolume tree and 12GiB of metadata in extent tree.
5) Stop all fio, take snapshot again, and wait until all delayed work is
   completed.
6) Start all fio. Few seconds later we hit OOM when the flusher starts
   to work.

It can be reproduced even when using nocow write.

Signed-off-by: Ethan Lien <ethanlien@synology.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add comment ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-30 16:46:53 +02:00
Ethan Lien
78d4295b1e btrfs: lift some btrfs_cross_ref_exist checks in nocow path
In nocow path, we check if the extent is snapshotted in
btrfs_cross_ref_exist(). We can do the similar check earlier and avoid
unnecessary search into extent tree.

A fio test on a Intel D-1531, 16GB RAM, SSD RAID-5 machine as follows:

[global]
group_reporting
time_based
thread=1
ioengine=libaio
bs=4k
iodepth=32
size=64G
runtime=180
numjobs=8
rw=randwrite

[file1]
filename=/mnt/nocow/testfile

IOPS result:   unpatched     patched

1 fio round:     46670        46958
snapshot
2 fio round:     51826        54498
3 fio round:     59767        61289

After snapshot, the first fio get about 5% performance gain. As we
continually write to the same file, all writes will resume to nocow mode
and eventually we have no performance gain.

Signed-off-by: Ethan Lien <ethanlien@synology.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update comments ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-30 16:46:53 +02:00
Lu Fengqi
d19577912d btrfs: Remove fs_info argument from btrfs_uuid_tree_rem
This function always takes a transaction handle which contains a
reference to the fs_info. Use that and remove the extra argument.

Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
[ rename the function ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-30 16:46:53 +02:00
Lu Fengqi
cdb345a877 btrfs: Remove fs_info argument from btrfs_uuid_tree_add
This function always takes a transaction handle which contains a
reference to the fs_info. Use that and remove the extra argument.

Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-30 16:46:52 +02:00
Liu Bo
f9ddfd0592 Btrfs: remove unused check of skip_locking
The check is superfluous since all callers who set search_for_commit
also have skip_locking set.

ASSERT() is put in place to ensure skip_locking is set by new callers.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-30 16:46:52 +02:00
Liu Bo
d80bb3f905 Btrfs: remove always true check in unlock_up
As unlock_up() is written as

for () {
   if (!path->locks[i])
       break;
   ...
   if (... && path->locks[i]) {
   }
}

Apparently, @path->locks[i] is always true at this 'if'.

Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-30 16:46:51 +02:00
Liu Bo
662c653bfd Btrfs: grab write lock directly if write_lock_level is the max level
Typically, when acquiring root node's lock, btrfs tries its best to get
read lock and trade for write lock if @write_lock_level implies to do so.

In case of (cow && (p->keep_locks || p->lowest_level)), write_lock_level
is set to BTRFS_MAX_LEVEL, which means we need to acquire root node's
write lock directly.

In this particular case, the dance of acquiring read lock and then trading
for write lock can be saved.

Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-30 16:46:51 +02:00
Liu Bo
1fc28d8e2e Btrfs: move get root out of btrfs_search_slot to a helper
It's good to have a helper instead of having all get-root details
open-coded.  The new helper locks (if necessary) and sets root node of
the path.

Also invert the checks to make the code flow easier to read.  There is
no functional change in this commit.

Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-30 16:46:44 +02:00
Liu Bo
e6a1d6fd27 Btrfs: use more straightforward extent_buffer_uptodate check
If parent_transid "0" is passed to btrfs_buffer_uptodate(),
btrfs_buffer_uptodate() is equivalent to extent_buffer_uptodate(), but
extent_buffer_uptodate() is preferred since we don't have to look into
verify_parent_transid().

Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-30 16:46:44 +02:00
Liu Bo
ca19b4a699 Btrfs: remove superfluous free_extent_buffer in read_block_for_search
read_block_for_search() can be simplified as:

tmp = find_extent_buffer();
if (tmp)
   return;

...

free_extent_buffer();
read_tree_block();

Apparently, @tmp must be NULL at this point, free_extent_buffer() is not
needed.

Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-30 16:46:44 +02:00
Lu Fengqi
4ca6168327 btrfs: drop unused space_info parameter from create_space_info
Since commit dc2d3005d2 ("btrfs: remove dead create_space_info
calls"), there is only one caller btrfs_init_space_info. However, it
doesn't need create_space_info to return space_info at all.

Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-30 16:46:43 +02:00
Liu Bo
ff76a864cc Btrfs: add parent_transid parameter to veirfy_level_key
As verify_level_key() is checked after verify_parent_transid(), i.e.

if (verify_parent_transid())
   ret = -EIO;
else if (verify_level_key())
   ret = -EUCLEAN;

if parent_transid is 0, verify_parent_transid() skips verifying
parent_transid and considers eb as valid, and if verify_level_key()
reports something wrong, we're not going to know if it's caused by
corrupted metadata or non-checkecd eb (e.g. stale eb).

The stale eb can be from an outdated raid1 mirror after a degraded
mount, see eg "btrfs: fix reading stale metadata blocks after degraded
raid1 mounts" (02a3307aa9) for more details.

@parent_transid is able to tell whether the eb's generation has been
verified by the caller.

Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-30 16:46:43 +02:00
Qu Wenruo
9593bf4967 btrfs: qgroup: show more meaningful qgroup_rescan_init error message
Error message from qgroup_rescan_init() mostly looks like:

  BTRFS info (device nvme0n1p1): qgroup_rescan_init failed with -115

Which is far from meaningful, and sometimes confusing as for above
-EINPROGRESS it's mostly (despite the init race) harmless, but sometimes
it can also indicate problem if the return value is -EINVAL.

Change it to some more meaningful messages like:

  BTRFS info (device nvme0n1p1): qgroup rescan is already in progress

And

  BTRFS err(device nvme0n1p1): qgroup rescan init failed, qgroup is not enabled

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
[ update the messages and level ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-30 16:46:43 +02:00
Omar Sandoval
fd4e994bd1 Btrfs: fix memory and mount leak in btrfs_ioctl_rm_dev_v2()
If we have invalid flags set, when we error out we must drop our writer
counter and free the buffer we allocated for the arguments. This bug is
trivially reproduced with the following program on 4.7+:

	#include <fcntl.h>
	#include <stdint.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <unistd.h>
	#include <sys/ioctl.h>
	#include <sys/stat.h>
	#include <sys/types.h>
	#include <linux/btrfs.h>
	#include <linux/btrfs_tree.h>

	int main(int argc, char **argv)
	{
		struct btrfs_ioctl_vol_args_v2 vol_args = {
			.flags = UINT64_MAX,
		};
		int ret;
		int fd;

		if (argc != 2) {
			fprintf(stderr, "usage: %s PATH\n", argv[0]);
			return EXIT_FAILURE;
		}

		fd = open(argv[1], O_WRONLY);
		if (fd == -1) {
			perror("open");
			return EXIT_FAILURE;
		}

		ret = ioctl(fd, BTRFS_IOC_RM_DEV_V2, &vol_args);
		if (ret == -1)
			perror("ioctl");

		close(fd);
		return EXIT_SUCCESS;
	}

When unmounting the filesystem, we'll hit the
WARN_ON(mnt_get_writers(mnt)) in cleanup_mnt() and also may prevent the
filesystem to be remounted read-only as the writer count will stay
lifted.

Fixes: 6b526ed70c ("btrfs: introduce device delete by devid")
CC: stable@vger.kernel.org # 4.9+
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Su Yue <suy.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-30 16:46:43 +02:00
Qu Wenruo
de885e3ee2 btrfs: lzo: Harden inline lzo compressed extent decompression
For inlined extent, we only have one segment, thus less things to check.
And further more, inlined extent always has the csum in its leaf header,
it's less probable to have corrupted data.

Anyway, still check header and segment header.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-30 16:46:43 +02:00
Qu Wenruo
314bfa473b btrfs: lzo: Add header length check to avoid potential out-of-bounds access
James Harvey reported that some corrupted compressed extent data can
lead to various kernel memory corruption.

Such corrupted extent data belongs to inode with NODATASUM flags, thus
data csum won't help us detecting such bug.

If lucky enough, KASAN could catch it like:

BUG: KASAN: slab-out-of-bounds in lzo_decompress_bio+0x384/0x7a0 [btrfs]
Write of size 4096 at addr ffff8800606cb0f8 by task kworker/u16:0/2338

CPU: 3 PID: 2338 Comm: kworker/u16:0 Tainted: G           O      4.17.0-rc5-custom+ #50
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
Workqueue: btrfs-endio btrfs_endio_helper [btrfs]
Call Trace:
 dump_stack+0xc2/0x16b
 print_address_description+0x6a/0x270
 kasan_report+0x260/0x380
 memcpy+0x34/0x50
 lzo_decompress_bio+0x384/0x7a0 [btrfs]
 end_compressed_bio_read+0x99f/0x10b0 [btrfs]
 bio_endio+0x32e/0x640
 normal_work_helper+0x15a/0xea0 [btrfs]
 process_one_work+0x7e3/0x1470
 worker_thread+0x1b0/0x1170
 kthread+0x2db/0x390
 ret_from_fork+0x22/0x40
...

The offending compressed data has the following info:

Header:			length 32768		(looks completely valid)
Segment 0 Header:	length 3472882419	(obviously out of bounds)

Then when handling segment 0, since it's over the current page, we need
the copy the compressed data to temporary buffer in workspace, then such
large size would trigger out-of-bounds memory access, screwing up the
whole kernel.

Fix it by adding extra checks on header and segment headers to ensure we
won't access out-of-bounds, and even checks the decompressed data won't
be out-of-bounds.

Reported-by: James Harvey <jamespharvey20@gmail.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ updated comments ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-30 16:46:38 +02:00
Al Viro
1da92779e2 aio: sanitize the limit checking in io_submit(2)
as it is, the logics in native io_submit(2) is "if asked for
more than LONG_MAX/sizeof(pointer) iocbs to submit, don't
bother with more than LONG_MAX/sizeof(pointer)" (i.e.
512M requests on 32bit and 1E requests on 64bit) while
compat io_submit(2) goes with "stop after the first
PAGE_SIZE/sizeof(pointer) iocbs", i.e. 1K or so.  Which is
	* inconsistent
	* *way* too much in native case
	* possibly too little in compat one
and
	* wrong anyway, since the natural point where we
ought to stop bothering is ctx->nr_events

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-05-29 23:20:17 -04:00
Al Viro
67ba049f94 aio: fold do_io_submit() into callers
get rid of insane "copy array of 32bit pointers into an array of
native ones" glue.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-05-29 23:19:29 -04:00
Al Viro
95af8496ac aio: shift copyin of iocb into io_submit_one()
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-05-29 23:18:31 -04:00
Al Viro
d2988bd412 aio_read_events_ring(): make a bit more readable
The logics for 'avail' is
	* not past the tail of cyclic buffer
	* no more than asked
	* not past the end of buffer
	* not past the end of a page

Unobfuscate the last part.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-05-29 23:18:17 -04:00
Al Viro
9061d14a8a aio: all callers of aio_{read,write,fsync,poll} treat 0 and -EIOCBQUEUED the same way
... so just make them return 0 when caller does not need to destroy iocb

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-05-29 23:17:40 -04:00
Al Viro
3c96c7f4ca aio: take list removal to (some) callers of aio_complete()
We really want iocb out of io_cancel(2) reach before we start tearing
it down.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-05-29 23:16:43 -04:00
Linus Torvalds
91fc957a61 Merge tag 'afs-fixes-20180529' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
Pull AFS fixes from David Howells:

 - fix a BUG triggerable from faccessat()

 - fix the mounting of backup volumes

* tag 'afs-fixes-20180529' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
  afs: Fix mounting of backup volumes
  afs: Fix directory permissions check
2018-05-29 15:30:16 -05:00
Jaegeuk Kim
d6290814b0 f2fs: add fsync_mode=nobarrier for non-atomic files
For non-atomic files, this patch adds an option to give nobarrier which
doesn't issue flush commands to the device.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-05-29 12:04:08 -07:00
Jaegeuk Kim
9a997188ff f2fs: let fstrim issue discard commands in lower priority
The fstrim gathers huge number of large discard commands, and tries to issue
without IO awareness, which results in long user-perceive IO latencies on
READ, WRITE, and FLUSH in UFS. We've observed some of commands take several
seconds due to long discard latency.

This patch limits the maximum size to 2MB per candidate, and check IO congestion
when issuing them to disk.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-05-29 12:04:08 -07:00
Souptick Joarder
05edd888d1 fs: xfs: Change return type to vm_fault_t
Use new return type vm_fault_t for fault handlers.

Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-05-29 10:46:03 -07:00
Darrick J. Wong
2e050e648a xfs: fix inobt magic number check
In commit a6a781a58b ("xfs: have buffer verifier functions
report failing address") the bad magic number return was ported
incorrectly.

Fixes: a6a781a58b
Reported-by: syzbot+08ab33be0178b76851c8@syzkaller.appspotmail.com
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
2018-05-29 10:46:03 -07:00
Darrick J. Wong
aee9a4a555 fs: clear writeback errors in inode_init_always
In inode_init_always(), we clear the inode mapping flags, which clears
any retained error (AS_EIO, AS_ENOSPC) bits.  Unfortunately, we do not
also clear wb_err, which means that old mapping errors can leak through
to new inodes.

This is crucial for the XFS inode allocation path because we recycle old
in-core inodes and we do not want error state from an old file to leak
into the new file.  This bug was discovered by running generic/036 and
generic/047 in a loop and noticing that the EIOs generated by the
collision of direct and buffered writes in generic/036 would survive the
remount between 036 and 047, and get reported to the fsyncs (on
different files!) in generic/047.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2018-05-29 10:46:03 -07:00
Chengguang Xu
eb91537575 vfs: delete unnecessary assignment in vfs_listxattr
It seems the first error assignment in if branch is redundant.

Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-05-29 13:22:41 -04:00
Qu Wenruo
2a1f7c0cbd btrfs: lzo: document the compressed data format
Although it's not that complex, but such comment could still save
several minutes for newer reader/reviewer instead of inferring that from
the code.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ minor wording updates ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-29 18:13:00 +02:00
Qu Wenruo
d5c1d68fde btrfs: compression: Add linux/sizes.h for compression.h
Since compression.h is using the SZ_* macros, and if some file includes
only compression.h without linux/sizes.h, it will cause compile error.

One example is lzo.c, if it uses BTRFS_MAX_COMPRESSED.  Fix it by adding
linux/sizes.h in compression.h

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-29 18:13:00 +02:00
Omar Sandoval
b5c40d598f Btrfs: fix clone vs chattr NODATASUM race
In btrfs_clone_files(), we must check the NODATASUM flag while the
inodes are locked. Otherwise, it's possible that btrfs_ioctl_setflags()
will change the flags after we check and we can end up with a party
checksummed file.

The race window is only a few instructions in size, between the if and
the locks which is:

3834         if (S_ISDIR(src->i_mode) || S_ISDIR(inode->i_mode))
3835                 return -EISDIR;

where the setflags must be run and toggle the NODATASUM flag (provided
the file size is 0).  The clone will block on the inode lock, segflags
takes the inode lock, changes flags, releases log and clone continues.

Not impossible but still needs a lot of bad luck to hit unintentionally.

Fixes: 0e7b824c4e ("Btrfs: don't make a file partly checksummed through file clone")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-29 18:12:59 +02:00
Gu Jinxiang
b89311efe6 btrfs: propagate failures of __exclude_logged_extent to upper caller
Function btrfs_exclude_logged_extents may call __exclude_logged_extent
which may fail.
Propagate the failures of __exclude_logged_extent to upper caller.

Signed-off-by: Gu Jinxiang <gujx@cn.fujitsu.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-29 18:12:58 +02:00
Nikolay Borisov
d4b20733d2 btrfs: Streamline shared ref check in alloc_reserved_tree_block
Instead of setting "parent" to ref->parent only when dealing with
a shared ref and subsequently performing another check to see
if (parent > 0), check the "node->type" directly and act accordingly.
This makes the code more streamline. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-29 18:12:57 +02:00
Nikolay Borisov
21ebfbe7e0 btrfs: Pass btrfs_delayed_extent_op to alloc_reserved_tree_block
Instead of taking only specific member of this structure, which results
in 2 extra arguments, just take the delayed_extent_op struct and
reference the arguments inside the functions. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-29 18:12:57 +02:00
Nikolay Borisov
4e6bd4e0aa btrfs: Simplify alloc_reserved_tree_block interface
This function currently takes 7 parameters, most of which are proxies
for values from btrfs_delayed_ref_node struct which is not passed. This
patch simplifies the interface of the function by simply passing said
delayed ref node struct to the function. This enables us to:

1. Move locals variables and init code related to them from
   run_delayed_tree_ref which should only be used inside
   alloc_reserved_tree_block, such as skinny_metadata and the btrfs_key,
   representing the extent being inserted. This removes the need for the
   "ins" argument. Instead, it's replaced by a local var with a more
   verbose name - extent_key.

2. Now that we have a reference to the node in alloc_reserved_tree_block
   the delayed_tree_ref struct can be referenced inside the function and
   this enable removing the "ref->level", "parent" and "ref_root"
   arguments.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-29 18:12:53 +02:00
Nikolay Borisov
9dcdbe0144 btrfs: Remove fs_info argument from alloc_reserved_tree_block
This function already takes a transaction handle which contains a
reference to the fs_info. So use this and remove the extra argument.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-29 18:12:53 +02:00
David Sterba
315b76b462 btrfs: tests: drop newline from test_msg strings
Now that test_err strings do not need the newline, remove them also from
the test_msg.

Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-29 18:12:52 +02:00