Commit Graph

61822 Commits

Author SHA1 Message Date
Eric Biggers
80f2388afa f2fs: fix race conditions in ->d_compare() and ->d_hash()
Since ->d_compare() and ->d_hash() can be called in RCU-walk mode,
->d_parent and ->d_inode can be concurrently modified, and in
particular, ->d_inode may be changed to NULL.  For f2fs_d_hash() this
resulted in a reproducible NULL dereference if a lookup is done in a
directory being deleted, e.g. with:

	int main()
	{
		if (fork()) {
			for (;;) {
				mkdir("subdir", 0700);
				rmdir("subdir");
			}
		} else {
			for (;;)
				access("subdir/file", 0);
		}
	}

... or by running the 't_encrypted_d_revalidate' program from xfstests.
Both repros work in any directory on a filesystem with the encoding
feature, even if the directory doesn't actually have the casefold flag.

I couldn't reproduce a crash in f2fs_d_compare(), but it appears that a
similar crash is possible there.

Fix these bugs by reading ->d_parent and ->d_inode using READ_ONCE() and
falling back to the case sensitive behavior if the inode is NULL.

Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Fixes: 2c2eb7a300 ("f2fs: Support case-insensitive file name lookups")
Cc: <stable@vger.kernel.org> # v5.4+
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-01-24 10:04:09 -08:00
Eric Biggers
5515eae647 f2fs: fix dcache lookup of !casefolded directories
Do the name comparison for non-casefolded directories correctly.

This is analogous to ext4's commit 66883da1ee ("ext4: fix dcache
lookup of !casefolded directories").

Fixes: 2c2eb7a300 ("f2fs: Support case-insensitive file name lookups")
Cc: <stable@vger.kernel.org> # v5.4+
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-01-24 09:53:02 -08:00
Hridya Valsaraju
fc7100ea2a f2fs: Add f2fs stats to sysfs
Currently f2fs stats are only available from /d/f2fs/status. This patch
adds some of the f2fs stats to sysfs so that they are accessible even
when debugfs is not mounted.

The following sysfs nodes are added:
-/sys/fs/f2fs/<disk>/free_segments
-/sys/fs/f2fs/<disk>/cp_foreground_calls
-/sys/fs/f2fs/<disk>/cp_background_calls
-/sys/fs/f2fs/<disk>/gc_foreground_calls
-/sys/fs/f2fs/<disk>/gc_background_calls
-/sys/fs/f2fs/<disk>/moved_blocks_foreground
-/sys/fs/f2fs/<disk>/moved_blocks_background
-/sys/fs/f2fs/<disk>/avg_vblocks

Signed-off-by: Hridya Valsaraju <hridya@google.com>
[Jaegeuk Kim: allow STAT_FS without DEBUG_FS]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-01-23 09:24:25 -08:00
Chao Yu
fb24fea75c f2fs: change to use rwsem for gc_mutex
Mutex lock won't serialize callers, in order to avoid starving of unlucky
caller, let's use rwsem lock instead.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-01-17 16:48:44 -08:00
Jaegeuk Kim
d7b0a23d81 f2fs: update f2fs document regarding to fsync_mode
This patch adds missing fsync_mode entry in f2fs document.

Fixes: 04485987f0 ("f2fs: introduce async IPU policy")
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-01-17 16:48:44 -08:00
Jaegeuk Kim
0e7f41974e f2fs: add a way to turn off ipu bio cache
Setting 0x40 in /sys/fs/f2fs/dev/ipu_policy gives a way to turn off
bio cache, which is useufl to check whether block layer using hardware
encryption engine merges IOs correctly.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-01-17 16:48:43 -08:00
Chengguang Xu
bf2cbd3c57 f2fs: code cleanup for f2fs_statfs_project()
Calling min_not_zero() to simplify complicated prjquota
limit comparison in f2fs_statfs_project().

Signed-off-by: Chengguang Xu <cgxu519@mykernel.net>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-01-17 16:48:43 -08:00
Chengguang Xu
acdf217217 f2fs: fix miscounted block limit in f2fs_statfs_project()
statfs calculates Total/Used/Avail disk space in block unit,
so we should translate soft/hard prjquota limit to block unit
as well.

Below testing result shows the block/inode numbers of
Total/Used/Avail from df command are all correct afer
applying this patch.

[root@localhost quota-tools]\# ./repquota -P /dev/sdb1
*** Report for project quotas on device /dev/sdb1
Block grace time: 7days; Inode grace time: 7days
              Block limits                File limits
Project   used soft    hard  grace  used  soft  hard  grace
-----------------------------------------------------------
\#0   --   4       0       0         1     0     0
\#101 --   0       0       0         2     0     0
\#102 --   0   10240       0         2    10     0
\#103 --   0       0   20480         2     0    20
\#104 --   0   10240   20480         2    10    20
\#105 --   0   20480   10240         2    20    10

[root@localhost sdb1]\# lsattr -p t{1,2,3,4,5}
  101 ----------------N-- t1/a1
  102 ----------------N-- t2/a2
  103 ----------------N-- t3/a3
  104 ----------------N-- t4/a4
  105 ----------------N-- t5/a5

[root@localhost sdb1]\# df -hi t{1,2,3,4,5}
Filesystem     Inodes IUsed IFree IUse% Mounted on
/dev/sdb1        2.4M    21  2.4M    1% /mnt/sdb1
/dev/sdb1          10     2     8   20% /mnt/sdb1
/dev/sdb1          20     2    18   10% /mnt/sdb1
/dev/sdb1          10     2     8   20% /mnt/sdb1
/dev/sdb1          10     2     8   20% /mnt/sdb1

[root@localhost sdb1]\# df -h t{1,2,3,4,5}
Filesystem      Size  Used Avail Use% Mounted on
/dev/sdb1        10G  489M  9.6G   5% /mnt/sdb1
/dev/sdb1        10M     0   10M   0% /mnt/sdb1
/dev/sdb1        20M     0   20M   0% /mnt/sdb1
/dev/sdb1        10M     0   10M   0% /mnt/sdb1
/dev/sdb1        10M     0   10M   0% /mnt/sdb1

Fixes: 909110c060 ("f2fs: choose hardlimit when softlimit is larger than hardlimit in f2fs_statfs_project()")
Signed-off-by: Chengguang Xu <cgxu519@mykernel.net>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-01-17 16:48:43 -08:00
Eric Biggers
644c8c92ad f2fs: fix deadlock allocating bio_post_read_ctx from mempool
Without any form of coordination, any case where multiple allocations
from the same mempool are needed at a time to make forward progress can
deadlock under memory pressure.

This is the case for struct bio_post_read_ctx, as one can be allocated
to decrypt a Merkle tree page during fsverity_verify_bio(), which itself
is running from a post-read callback for a data bio which has its own
struct bio_post_read_ctx.

Fix this by freeing first bio_post_read_ctx before calling
fsverity_verify_bio().  This works because verity (if enabled) is always
the last post-read step.

This deadlock can be reproduced by trying to read from an encrypted
verity file after reducing NUM_PREALLOC_POST_READ_CTXS to 1 and patching
mempool_alloc() to pretend that pool->alloc() always fails.

Note that since NUM_PREALLOC_POST_READ_CTXS is actually 128, to actually
hit this bug in practice would require reading from lots of encrypted
verity files at the same time.  But it's theoretically possible, as N
available objects doesn't guarantee forward progress when > N/2 threads
each need 2 objects at a time.

Fixes: 95ae251fe8 ("f2fs: add fs-verity support")
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-01-17 16:48:43 -08:00
Eric Biggers
e8ce5749d7 f2fs: remove unneeded check for error allocating bio_post_read_ctx
Since allocating an object from a mempool never fails when
__GFP_DIRECT_RECLAIM (which is included in GFP_NOFS) is set, the check
for failure to allocate a bio_post_read_ctx is unnecessary.  Remove it.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-01-17 16:48:42 -08:00
Jaegeuk Kim
b06af2aff2 f2fs: convert inline_dir early before starting rename
If we hit an error during rename, we'll get two dentries in different
directories.

Chao adds to check the room in inline_dir which can avoid needless
inversion. This should be done by inode_lock(&old_dir).

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-01-17 16:48:42 -08:00
Chao Yu
fe396ad8e7 f2fs: fix memleak of kobject
If kobject_init_and_add() failed, caller needs to invoke kobject_put()
to release kobject explicitly.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-01-17 16:48:42 -08:00
Chao Yu
3e5e479a39 f2fs: fix to add swap extent correctly
As Youling reported in mailing list:

https://www.linuxquestions.org/questions/linux-newbie-8/the-file-system-f2fs-is-broken-4175666043/

https://www.linux.org/threads/the-file-system-f2fs-is-broken.26490/

There is a test case can corrupt f2fs image:
- dd if=/dev/zero of=/swapfile bs=1M count=4096
- chmod 600 /swapfile
- mkswap /swapfile
- swapon --discard /swapfile

The root cause is f2fs_swap_activate() intends to return zero value
to setup_swap_extents() to enable SWP_FS mode (swap file goes through
fs), in this flow, setup_swap_extents() setups swap extent with wrong
block address range, result in discard_swap() erasing incorrect address.

Because f2fs_swap_activate() has pinned swapfile, its data block
address will not change, it's safe to let swap to handle IO through
raw device, so we can get rid of SWAP_FS mode and initial swap extents
inside f2fs_swap_activate(), by this way, later discard_swap() can trim
in right address range.

Fixes: 4969c06a0d ("f2fs: support swap file w/ DIO")
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-01-17 16:48:42 -08:00
Jaegeuk Kim
4eea93e3ff f2fs: run fsck when getting bad inode during GC
This is to avoid inifinite GC when trying to disable checkpoint.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-01-17 16:48:42 -08:00
Chao Yu
4c8ff7095b f2fs: support data compression
This patch tries to support compression in f2fs.

- New term named cluster is defined as basic unit of compression, file can
be divided into multiple clusters logically. One cluster includes 4 << n
(n >= 0) logical pages, compression size is also cluster size, each of
cluster can be compressed or not.

- In cluster metadata layout, one special flag is used to indicate cluster
is compressed one or normal one, for compressed cluster, following metadata
maps cluster to [1, 4 << n - 1] physical blocks, in where f2fs stores
data including compress header and compressed data.

- In order to eliminate write amplification during overwrite, F2FS only
support compression on write-once file, data can be compressed only when
all logical blocks in file are valid and cluster compress ratio is lower
than specified threshold.

- To enable compression on regular inode, there are three ways:
* chattr +c file
* chattr +c dir; touch dir/file
* mount w/ -o compress_extension=ext; touch file.ext

Compress metadata layout:
                             [Dnode Structure]
             +-----------------------------------------------+
             | cluster 1 | cluster 2 | ......... | cluster N |
             +-----------------------------------------------+
             .           .                       .           .
       .                       .                .                      .
  .         Compressed Cluster       .        .        Normal Cluster            .
+----------+---------+---------+---------+  +---------+---------+---------+---------+
|compr flag| block 1 | block 2 | block 3 |  | block 1 | block 2 | block 3 | block 4 |
+----------+---------+---------+---------+  +---------+---------+---------+---------+
           .                             .
         .                                           .
       .                                                           .
      +-------------+-------------+----------+----------------------------+
      | data length | data chksum | reserved |      compressed data       |
      +-------------+-------------+----------+----------------------------+

Changelog:

20190326:
- fix error handling of read_end_io().
- remove unneeded comments in f2fs_encrypt_one_page().

20190327:
- fix wrong use of f2fs_cluster_is_full() in f2fs_mpage_readpages().
- don't jump into loop directly to avoid uninitialized variables.
- add TODO tag in error path of f2fs_write_cache_pages().

20190328:
- fix wrong merge condition in f2fs_read_multi_pages().
- check compressed file in f2fs_post_read_required().

20190401
- allow overwrite on non-compressed cluster.
- check cluster meta before writing compressed data.

20190402
- don't preallocate blocks for compressed file.

- add lz4 compress algorithm
- process multiple post read works in one workqueue
  Now f2fs supports processing post read work in multiple workqueue,
  it shows low performance due to schedule overhead of multiple
  workqueue executing orderly.

20190921
- compress: support buffered overwrite
C: compress cluster flag
V: valid block address
N: NEW_ADDR

One cluster contain 4 blocks

 before overwrite   after overwrite

- VVVV		->	CVNN
- CVNN		->	VVVV

- CVNN		->	CVNN
- CVNN		->	CVVV

- CVVV		->	CVNN
- CVVV		->	CVVV

20191029
- add kconfig F2FS_FS_COMPRESSION to isolate compression related
codes, add kconfig F2FS_FS_{LZO,LZ4} to cover backend algorithm.
note that: will remove lzo backend if Jaegeuk agreed that too.
- update codes according to Eric's comments.

20191101
- apply fixes from Jaegeuk

20191113
- apply fixes from Jaegeuk
- split workqueue for fsverity

20191216
- apply fixes from Jaegeuk

20200117
- fix to avoid NULL pointer dereference

[Jaegeuk Kim]
- add tracepoint for f2fs_{,de}compress_pages()
- fix many bugs and add some compression stats
- fix overwrite/mmap bugs
- address 32bit build error, reported by Geert.
- bug fixes when handling errors and i_compressed_blocks

Reported-by: <noreply@ellerman.id.au>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-01-17 16:48:07 -08:00
Jaegeuk Kim
820d366736 f2fs: free sysfs kobject
Detected kmemleak.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-01-15 13:43:49 -08:00
Jaegeuk Kim
2c4e0c528e f2fs: declare nested quota_sem and remove unnecessary sems
1.
f2fs_quota_sync
 -> down_read(&sbi->quota_sem)
 -> dquot_writeback_dquots
  -> f2fs_dquot_commit
   -> down_read(&sbi->quota_sem)

2.
f2fs_quota_sync
 -> down_read(&sbi->quota_sem)
  -> f2fs_write_data_pages
   -> f2fs_write_single_data_page
    -> down_write(&F2FS_I(inode)->i_sem)

f2fs_mkdir
 -> f2fs_do_add_link
   -> down_write(&F2FS_I(inode)->i_sem)
   -> f2fs_init_inode_metadata
    -> f2fs_new_node_page
     -> dquot_alloc_inode
      -> f2fs_dquot_mark_dquot_dirty
       -> down_read(&sbi->quota_sem)

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-01-15 13:43:49 -08:00
Jaegeuk Kim
762e4db545 f2fs: don't put new_page twice in f2fs_rename
In f2fs_rename(), new_page is gone after f2fs_set_link(), but it tries
to put again when whiteout is failed and jumped to put_out_dir.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-01-15 13:43:49 -08:00
Jaegeuk Kim
5b1dbb082f f2fs: set I_LINKABLE early to avoid wrong access by vfs
This patch moves setting I_LINKABLE early in rename2(whiteout) to avoid the
below warning.

[ 3189.163385] WARNING: CPU: 3 PID: 59523 at fs/inode.c:358 inc_nlink+0x32/0x40
[ 3189.246979] Call Trace:
[ 3189.248707]  f2fs_init_inode_metadata+0x2d6/0x440 [f2fs]
[ 3189.251399]  f2fs_add_inline_entry+0x162/0x8c0 [f2fs]
[ 3189.254010]  f2fs_add_dentry+0x69/0xe0 [f2fs]
[ 3189.256353]  f2fs_do_add_link+0xc5/0x100 [f2fs]
[ 3189.258774]  f2fs_rename2+0xabf/0x1010 [f2fs]
[ 3189.261079]  vfs_rename+0x3f8/0xaa0
[ 3189.263056]  ? tomoyo_path_rename+0x44/0x60
[ 3189.265283]  ? do_renameat2+0x49b/0x550
[ 3189.267324]  do_renameat2+0x49b/0x550
[ 3189.269316]  __x64_sys_renameat2+0x20/0x30
[ 3189.271441]  do_syscall_64+0x5a/0x230
[ 3189.273410]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 3189.275848] RIP: 0033:0x7f270b4d9a49

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-01-15 13:43:48 -08:00
Eric Biggers
542989b674 f2fs: don't keep META_MAPPING pages used for moving verity file blocks
META_MAPPING is used to move blocks for both encrypted and verity files.
So the META_MAPPING invalidation condition in do_checkpoint() should
consider verity too, not just encrypt.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-01-15 13:43:48 -08:00
Chao Yu
f543805fcd f2fs: introduce private bioset
In low memory scenario, we can allocate multiple bios without
submitting any of them.

- f2fs_write_checkpoint()
 - block_operations()
  - f2fs_sync_node_pages()
   step 1) flush cold nodes, allocate new bio from mempool
   - bio_alloc()
    - mempool_alloc()
   step 2) flush hot nodes, allocate a bio from mempool
   - bio_alloc()
    - mempool_alloc()
   step 3) flush warm nodes, be stuck in below call path
   - bio_alloc()
    - mempool_alloc()
     - loop to wait mempool element release, as we only
       reserved memory for two bio allocation, however above
       allocated two bios may never be submitted.

So we need avoid using default bioset, in this patch we introduce a
private bioset, in where we enlarg mempool element count to total
number of log header, so that we can make sure we have enough
backuped memory pool in scenario of allocating/holding multiple
bios.

Signed-off-by: Gao Xiang <gaoxiang25@huawei.com>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-01-15 13:43:48 -08:00
Sahitya Tummala
0e6d01643c f2fs: cleanup duplicate stats for atomic files
Remove duplicate sbi->aw_cnt stats counter that tracks
the number of atomic files currently opened (it also shows
incorrect value sometimes). Use more relit lable sbi->atomic_files
to show in the stats.

Signed-off-by: Sahitya Tummala <stummala@codeaurora.org>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-01-15 13:43:48 -08:00
Shin'ichiro Kawasaki
d508c94e45 f2fs: Check write pointer consistency of non-open zones
To catch f2fs bugs in write pointer handling code for zoned block
devices, check write pointers of non-open zones that current segments do
not point to. Do this check at mount time, after the fsync data recovery
and current segments' write pointer consistency fix. Or when fsync data
recovery is disabled by mount option, do the check when there is no fsync
data.

Check two items comparing write pointers with valid block maps in SIT.
The first item is check for zones with no valid blocks. When there is no
valid blocks in a zone, the write pointer should be at the start of the
zone. If not, next write operation to the zone will cause unaligned write
error. If write pointer is not at the zone start, reset the write pointer
to place at the zone start.

The second item is check between the write pointer position and the last
valid block in the zone. It is unexpected that the last valid block
position is beyond the write pointer. In such a case, report as a bug.
Fix is not required for such zone, because the zone is not selected for
next write operation until the zone get discarded.

Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-01-15 13:43:48 -08:00
Shin'ichiro Kawasaki
c426d99127 f2fs: Check write pointer consistency of open zones
On sudden f2fs shutdown, write pointers of zoned block devices can go
further but f2fs meta data keeps current segments at positions before the
write operations. After remounting the f2fs, this inconsistency causes
write operations not at write pointers and "Unaligned write command"
error is reported.

To avoid the error, compare current segments with write pointers of open
zones the current segments point to, during mount operation. If the write
pointer position is not aligned with the current segment position, assign
a new zone to the current segment. Also check the newly assigned zone has
write pointer at zone start. If not, reset write pointer of the zone.

Perform the consistency check during fsync recovery. Not to lose the
fsync data, do the check after fsync data gets restored and before
checkpoint commit which flushes data at current segment positions. Not to
cause conflict with kworker's dirfy data/node flush, do the fix within
SBI_POR_DOING protection.

Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-01-15 13:42:14 -08:00
Jaegeuk Kim
dd973007bf f2fs: set GFP_NOFS when moving inline dentries
Otherwise, it can cause circular locking dependency reported by mm.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2019-12-12 13:24:34 -08:00
Jaegeuk Kim
4f4460c08a f2fs: should avoid recursive filesystem ops
We need to use GFP_NOFS, since we did f2fs_lock_op().

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2019-12-12 13:24:34 -08:00
Jaegeuk Kim
3f188c23d7 f2fs: keep quota data on write_begin failure
This patch avoids some unnecessary locks for quota files when write_begin
fails.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2019-12-12 13:24:34 -08:00
Jaegeuk Kim
bdf0329924 f2fs: call f2fs_balance_fs outside of locked page
Otherwise, we can hit deadlock by waiting for the locked page in
move_data_block in GC.

 Thread A                     Thread B
 - do_page_mkwrite
  - f2fs_vm_page_mkwrite
   - lock_page
                              - f2fs_balance_fs
                                  - mutex_lock(gc_mutex)
                               - f2fs_gc
                                - do_garbage_collect
                                 - ra_data_block
                                  - grab_cache_page
   - f2fs_balance_fs
    - mutex_lock(gc_mutex)

Fixes: 39a8695824 ("f2fs: refactor ->page_mkwrite() flow")
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2019-12-10 16:03:55 -08:00
Jaegeuk Kim
47501f87c6 f2fs: preallocate DIO blocks when forcing buffered_io
The previous preallocation and DIO decision like below.

                         allow_outplace_dio              !allow_outplace_dio
f2fs_force_buffered_io   (*) No_Prealloc / Buffered_IO   Prealloc / Buffered_IO
!f2fs_force_buffered_io  No_Prealloc / DIO               Prealloc / DIO

But, Javier reported Case (*) where zoned device bypassed preallocation but
fell back to buffered writes in f2fs_direct_IO(), resulting in stale data
being read.

In order to fix the issue, actually we need to preallocate blocks whenever
we fall back to buffered IO like this. No change is made in the other cases.

                         allow_outplace_dio              !allow_outplace_dio
f2fs_force_buffered_io   (*) Prealloc / Buffered_IO      Prealloc / Buffered_IO
!f2fs_force_buffered_io  No_Prealloc / DIO               Prealloc / DIO

Reported-and-tested-by: Javier Gonzalez <javier@javigon.com>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Reviewed-by: Javier González <javier@javigon.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2019-12-09 15:57:45 -08:00
David Sterba
78f926f72e btrfs: add Kconfig dependency for BLAKE2B
Because the BLAKE2B code went through a different tree, it was not
available at the time the btrfs part was merged. Now that the Kconfig
symbol exists, add it to the list.

Signed-off-by: David Sterba <dsterba@suse.com>
2019-12-09 17:56:06 +01:00
Linus Torvalds
a78f7cdddb 9 cifs/smb3 fixes: two timestamp fixes, one oops fix (during oplock break) for stable, two fixes found in multichannel testing, two fixes for file create when using modeforsid mount parm
-----BEGIN PGP SIGNATURE-----
 
 iQGzBAABCAAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAl3sPUUACgkQiiy9cAdy
 T1GumAwAhh0Fk2uEV01REMgA6MgQ2hrdGE5HariSTzGifCk8cxMnq1H1u9yxtic8
 uvEJQaUmTLWrN2C+xqD2JqPmJyrPOtnL0PLCLQk2/RsPCsDgYnmdKoAehInPh17g
 J8MoKPp1/1wYhbOl7CeF0xo2rEchoh/PcPCXpt8qj+M+kBgQkI64UQ/6iY/mV9Zl
 n7WJJFDyz3D1+SaJPaVxMpNxZcMpFbGqVJYTWP4v3pL2E8wEhyWjAryLCJAFFGf7
 Y2FwOSFuifMN/qC9t83W5KkRT9I/zRQ2g5qK1tC24LiTjQ3cqkCy1SSqpKQyvKwz
 P/oRX0HsuIbr1KFzN55kg831m/V7/1B/5bf9AivfhjsAoSyp2yyVQgPeV+nQkO0r
 iQdNatohC9HlwXmrypS+GhLXnj8xLnCR4+Aj7hGSuiVLHnCOfnGjQxI40BFWaBli
 1RG9agkploMYvcjcgSgDGVFFWTeHgSQKI1DQTL2Nx4py1zj7Rv/kEgwkZ3zdEf9h
 PPl37hBM
 =gey9
 -----END PGP SIGNATURE-----

Merge tag '5.5-rc-smb3-fixes-part2' of git://git.samba.org/sfrench/cifs-2.6

Pull cifs fixes from Steve French:
 "Nine cifs/smb3 fixes:

   - one fix for stable (oops during oplock break)

   - two timestamp fixes including important one for updating mtime at
     close to avoid stale metadata caching issue on dirty files (also
     improves perf by using SMB2_CLOSE_FLAG_POSTQUERY_ATTRIB over the
     wire)

   - two fixes for "modefromsid" mount option for file create (now
     allows mode bits to be set more atomically and accurately on create
     by adding "sd_context" on create when modefromsid specified on
     mount)

   - two fixes for multichannel found in testing this week against
     different servers

   - two small cleanup patches"

* tag '5.5-rc-smb3-fixes-part2' of git://git.samba.org/sfrench/cifs-2.6:
  smb3: improve check for when we send the security descriptor context on create
  smb3: fix mode passed in on create for modetosid mount option
  cifs: fix possible uninitialized access and race on iface_list
  cifs: Fix lookup of SMB connections on multichannel
  smb3: query attributes on file close
  smb3: remove unused flag passed into close functions
  cifs: remove redundant assignment to pointer pneg_ctxt
  fs: cifs: Fix atime update check vs mtime
  CIFS: Fix NULL-pointer dereference in smb2_push_mandatory_locks
2019-12-08 12:12:18 -08:00
Linus Torvalds
5bf9a06a5f Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull misc vfs cleanups from Al Viro:
 "No common topic, just three cleanups".

* 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  make __d_alloc() static
  fs/namespace: add __user to open_tree and move_mount syscalls
  fs/fnctl: fix missing __user in fcntl_rw_hint()
2019-12-08 11:08:28 -08:00
Linus Torvalds
95207d554b Fixes for 5.5-rc1:
- Fix a UAF when reporting writeback errors
 - Fix a race condition when handling page uptodate on a blocksize <
   pagesize file that is also fragmented
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAl3pJVwACgkQ+H93GTRK
 tOue8RAAklPs+kV4l1JQudPkd/lXIPCqyRKBEjNDQRfj27v5s7h/3fMqMsV6/RIs
 zTkrRc/N5Uu6SMnvMIHvTx1nAk6ZnTiN3gUHenoUQgQuxZ4jtGhIvHmyiOGgI27j
 Xd2g+aE8wdhZDnD2b+4AsnJmtvffX0g3DNDZ4KJn1d/Ejs2qQIHgeaoyj56Fz8vh
 4dqQxSg5RvL25k3qqHSVogsfxSRJxvYJefcARUL25a58UzE6DLJqeb3v41RVLc4l
 lDOX/o4lavqo1lmptWjsTn4bqqyeRfNWp2M2EGE4a411aEaNTkFB7xzQ9Y1v04kK
 VHq6XK3vieAv6VOuqZ3ySx6ihQcb9dQWXiMkJ9HDVDv4rmaLlrHECM19A5toZN/6
 0Xy6xLDqbimckyWDZLBUnkfcoCsI+uWTdNIBxkjEYFdDuL33ovUlKu+Cn63vYzoQ
 aCWnNA4NdKOBYps9aKCQV35IU1ODmOYWqkxkpSaVYKApi8Q6+2HUXOf+fXFBiV91
 zO0/nWq8RU7fGQxk8YJtMO2E0lGMaMAt03vgp+pMkd0aIo0NWFLV0q3tko5VlWNv
 v8BGkphgrOOqfsCFh8wua5x8bIXOBBElOSUjU0wBrAHb0ZXkQ3zi0nUt+sj7RFUX
 a+ojaRrXvuyZZArOvQbrHEC0HWchVmvfiyXRAPbxbLi1BiLfPW8=
 =+LzI
 -----END PGP SIGNATURE-----

Merge tag 'iomap-5.5-merge-14' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull iomap fixes from Darrick Wong:
 "Fix a race condition and a use-after-free error:

   - Fix a UAF when reporting writeback errors

   - Fix a race condition when handling page uptodate on fragmented file
     with blocksize < pagesize"

* tag 'iomap-5.5-merge-14' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  iomap: stop using ioend after it's been freed in iomap_finish_ioend()
  iomap: fix sub-page uptodate handling
2019-12-07 17:07:18 -08:00
Linus Torvalds
50caca9d7f Fixes for 5.5-rc1:
- Fix a crash in the log setup code when log mounting fails
 - Fix a hang when allocating space on the realtime device
 - Fix a block leak when freeing space on the realtime device
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAl3n5OkACgkQ+H93GTRK
 tOtUvw/+Lxnwx1AaYT7hSMS4p1buJnAV7aM6Ofm+F+GkRISWlxSEYLIjrj3NFxeI
 Fv4jnPWMGHtBzW2c4OXv9zhW7vQeuU0Z72sgxA+Wqccf53sfR5Qum+Tya+Bo8X0X
 LPeDEuA+k4UUUHvSLqscPFPZYbXgwp3dJ2i7JeuH0vx3BhFeTjV1nlOeo1z3QBzN
 LLVjuBg7Zms+TPorrOgS67LAWfzqCAiQDauvyODB5EW+UElK6pvpOklFzc7TUPr2
 PIvmjEE8UNN2y9QuEEKJ26t43ejAzG016yGMxyW74i5wU33R7PHkdgG1pWwWRZzr
 yvNClg2CMtLu+Bcx8Lc9X23mW0DqPkUYDohWCe/Tytvz5kaKCEq3WlqwaG5EdMOg
 gSJlivpoOTduQ26V4fPvD1/fjGpJLWyfHPs9p43wie+K/NuUcTIvr4BGGQszjS/n
 5Zr630g6Tq5VrBMl0f1P2NuEbeQEvmbWNTW2TIvvHZTgMd8mZdvX28IXj0dAhBb/
 2U5o1NF8F6VeRGYoZFJI70RIfiLYzOQEmsA5hAyJfUQQ18u8zDJuPV4isO4/XjQf
 d32E36cvP3CKVQYn7hAiMD5O8jOFckYL287qd4uDYutmleHEcfzc0H4pTW+66IHp
 IuzkPCgvOkd4h4qnhtHSDoSlKd7kc1Ai3hBhCdA9zd6IUAVsZ6Y=
 =Ps4B
 -----END PGP SIGNATURE-----

Merge tag 'xfs-5.5-merge-17' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull xfs fixes from Darrick Wong:
 "Fix a couple of resource management errors and a hang:

   - fix a crash in the log setup code when log mounting fails

   - fix a hang when allocating space on the realtime device

   - fix a block leak when freeing space on the realtime device"

* tag 'xfs-5.5-merge-17' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  xfs: fix mount failure crash on invalid iclog memory access
  xfs: don't check for AG deadlock for realtime files in bunmapi
  xfs: fix realtime file data space leak
2019-12-07 17:05:33 -08:00
Linus Torvalds
316933cf74 orangefs: posix open permission checking...
Orangefs has no open, and orangefs checks file permissions
 on each file access. Posix requires that file permissions
 be checked on open and nowhere else. Orangefs-through-the-kernel
 needs to seem posix compliant.
 
 The VFS opens files, even if the filesystem provides no
 method. We can see if a file was successfully opened for
 read and or for write by looking at file->f_mode.
 
 When writes are flowing from the page cache, file is no
 longer available. We can trust the VFS to have checked
 file->f_mode before writing to the page cache.
 
 The mode of a file might change between when it is opened
 and IO commences, or it might be created with an arbitrary mode.
 
 We'll make sure we don't hit EACCES during the IO stage by
 using UID 0.
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJd6qoUAAoJEM9EDqnrzg2+SHQQAI8osluG1xDle0Ur0y/XQrWR
 z/1+mA/ZJpkaC2KPJ3F/B93ZR7TSSb6xB/u/EoxfqVQDoVpodP3PzcvSosRsePOk
 OYo67xit7YRcg2nQF5kEjR+wYbW/T1j55oQzrWxLvYr+FhlDZLJyn0xuSaGvvuKQ
 kWqNwPQpIZwNR1ZJ6Yjif86kR4sWF5htoy976x5ScvoeOb08dNHQn2je5oXH/eKH
 zwWBVYTeZTAIVCs9YV2UM4gi5/0pysjSL58jP7+ckLj79ozBoyhc9cRB4ez0cFyc
 4+4dW9zZ1GAfvmbsFzvCfKb2Syz4JkStGJQGST+cgH9ldp70R8AdRjzYfZGXa2af
 9I/jRgrVBsU/jo++a1npMy2j44+2GvhoValzKePwiCGTOB/f80XsmB9p9qci8JCv
 ucVzJwbhjxPKphUpnW8Gg7F2gWr2ULhv+wKRmAb3tF+bIFPjn7KjyzFfUAS3FY1s
 iwgci0Mw9NLLlvX511N0wiUGo6V9A9r7XsZQjScmm/3ybUhMyJAYoe81OO60Xwnv
 2s+V0Tv9ah4b+EF0J0qtQ7GzsoKDBu+ZWqGieiOXDWTVixY2gV6CetnR7veeSeQh
 s9OeqY8qaSYiV9KtBNZp56IS4PuADDgxnRB1pXTUUPgapuElEtvYC1BUovidMMmh
 kLQEpYdSGrkLRah4hKsg
 =9AOz
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-5.5-ofs1' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux

Pull orangefs update from Mike Marshall:
 "orangefs: posix open permission checking...

  Orangefs has no open, and orangefs checks file permissions on each
  file access. Posix requires that file permissions be checked on open
  and nowhere else. Orangefs-through-the-kernel needs to seem posix
  compliant.

  The VFS opens files, even if the filesystem provides no method. We can
  see if a file was successfully opened for read and or for write by
  looking at file->f_mode.

  When writes are flowing from the page cache, file is no longer
  available. We can trust the VFS to have checked file->f_mode before
  writing to the page cache.

  The mode of a file might change between when it is opened and IO
  commences, or it might be created with an arbitrary mode.

  We'll make sure we don't hit EACCES during the IO stage by using
  UID 0"

[ This is "posixish", but not a great solution in the long run, since a
  proper secure network server shouldn't really trust the client like this.
  But proper and secure POSIX behavior requires an open method and a
  resulting cookie for IO of some kind, or similar.    - Linus ]

* tag 'for-linus-5.5-ofs1' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux:
  orangefs: posix open permission checking...
2019-12-07 16:59:25 -08:00
Linus Torvalds
911d137ab0 This is a relatively quiet cycle for nfsd, mainly various bugfixes.
Possibly most interesting is Trond's fixes for some callback races that
 were due to my incomplete understanding of rpc client shutdown.
 Unfortunately at the last minute I've started noticing a new
 intermittent failure to send callbacks.  As the logic seems basically
 correct, I'm leaving Trond's patches in for now, and hope to find a fix
 in the next week so I don't have to revert those patches.
 -----BEGIN PGP SIGNATURE-----
 
 iQJJBAABCAAzFiEEYtFWavXG9hZotryuJ5vNeUKO4b4FAl3r3AAVHGJmaWVsZHNA
 ZmllbGRzZXMub3JnAAoJECebzXlCjuG+rjkP/3L6DZs0Uv0BYbGq5Gmit0uoPSQk
 8BT7oQhbagCh+ULRYWCnK6cz82wejR4Gzq4PLyl5x5Vcc5x+bLoPI9YgiRZlIbZu
 ZvSg93E6SITLfq5xRlDC0MlIVZkI+HoIfyYgv1aYiWvQ3834bcx4DxVm9h7cNpT3
 x37anEFi1lv3n9fct3obOrs3AvCS76XyA6VVhcSLJ77amKQ+O7LI0crqUc6cuX2i
 CkTwTSDwyCrzkx3dZ2xDPDTbLecxw+Ce4adaby5v3GEQo3TOCmEWX92D3dvzfMmv
 ICU07FsVOILnIT/fmC91b1+JWVRLjUUBw5EPmDduwSP/yw4YnIEODFEP/wAUAmMJ
 vJ9hi9c1rThQ9n8h08RIwA2snhnpXRxKCWhpIRY6WM8DhHL9Y9AuVPYTKxhQOjPK
 l3wbOGcMW63NrTOPHHN7hTB0vDLgPKIXYVIrMvZTd/P7CghDDEbhT1gDvx/IL3Uq
 WrHKbJtK7rbx9i2bh5f6fH0DRrv7lxbD0ffunRRa3twPAe6zsG9WPjsbZZraZzEg
 O7/o3wZu2N7MpL5bXPfzB+5ylOTxvNWew07NJjA4BIOfwin3bw/71YfB0Vnoairv
 PhmbN2Dj4/t82ld0JU5GJWojpUfH4ARXM2Li9WO99wzx+KrxScsqGPnRMFe9dC7b
 Q7ltP1p0gUbkJ88Z
 =b2zA
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-5.5' of git://linux-nfs.org/~bfields/linux

Pull nfsd updates from Bruce Fields:
 "This is a relatively quiet cycle for nfsd, mainly various bugfixes.

  Possibly most interesting is Trond's fixes for some callback races
  that were due to my incomplete understanding of rpc client shutdown.
  Unfortunately at the last minute I've started noticing a new
  intermittent failure to send callbacks. As the logic seems basically
  correct, I'm leaving Trond's patches in for now, and hope to find a
  fix in the next week so I don't have to revert those patches"

* tag 'nfsd-5.5' of git://linux-nfs.org/~bfields/linux: (24 commits)
  nfsd: depend on CRYPTO_MD5 for legacy client tracking
  NFSD fixing possible null pointer derefering in copy offload
  nfsd: check for EBUSY from vfs_rmdir/vfs_unink.
  nfsd: Ensure CLONE persists data and metadata changes to the target file
  SUNRPC: Fix backchannel latency metrics
  nfsd: restore NFSv3 ACL support
  nfsd: v4 support requires CRYPTO_SHA256
  nfsd: Fix cld_net->cn_tfm initialization
  lockd: remove __KERNEL__ ifdefs
  sunrpc: remove __KERNEL__ ifdefs
  race in exportfs_decode_fh()
  nfsd: Drop LIST_HEAD where the variable it declares is never used.
  nfsd: document callback_wq serialization of callback code
  nfsd: mark cb path down on unknown errors
  nfsd: Fix races between nfsd4_cb_release() and nfsd4_shutdown_callback()
  nfsd: minor 4.1 callback cleanup
  SUNRPC: Fix svcauth_gss_proxy_init()
  SUNRPC: Trace gssproxy upcall results
  sunrpc: fix crash when cache_head become valid before update
  nfsd: remove private bin2hex implementation
  ...
2019-12-07 16:56:00 -08:00
Linus Torvalds
fb9bf40cf0 NFS client updates for Linux 5.5
Highlights include:
 
 Features:
 - NFSv4.2 now supports cross device offloaded copy (i.e. offloaded copy
   of a file from one source server to a different target server).
 - New RDMA tracepoints for debugging congestion control and Local Invalidate
   WRs.
 
 Bugfixes and cleanups
 - Drop the NFSv4.1 session slot if nfs4_delegreturn_prepare waits for
   layoutreturn
 - Handle bad/dead sessions correctly in nfs41_sequence_process()
 - Various bugfixes to the delegation return operation.
 - Various bugfixes pertaining to delegations that have been revoked.
 - Cleanups to the NFS timespec code to avoid unnecessary conversions
   between timespec and timespec64.
 - Fix unstable RDMA connections after a reconnect
 - Close race between waking an RDMA sender and posting a receive
 - Wake pending RDMA tasks if connection fails
 - Fix MR list corruption, and clean up MR usage
 - Fix another RPCSEC_GSS issue with MIC buffer space
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEESQctxSBg8JpV8KqEZwvnipYKAPIFAl3qp8QACgkQZwvnipYK
 APIZfBAAuFhLUA2Ua9OQOPDJDkQ1IFDBfYGG48aqVu3GIXS9LkEvTavLm/P9ocm+
 ijGsUv2iw4x9H4S7OGuzLQm5zmTNsQAlPXD+3+xQS7cjPjh5HCyIAEgpov+JEGae
 CeZoSvhtdBd0xB71t2zAKEdHkqc47Jxz3Db0FX22zTTnDvdhArfggisZUt4Xq5Qb
 cPcs8R1E5yBZqJFHKObOUP4itVYsXte/VFhtWpjRFqzaZ/t7xNpPVOBH8cli7aI9
 E6DqdbIjUreyn62FVWYIeGhwvsKdxv+Slc5ZOEbD45jUryovyCAZxhqDmcAg/0q0
 uykplL0cv8MeiZ68wmlxdir/n36hWduiGqa0UKMg2+BAbdudGKJ7xPhkGYP2uZqo
 zoZGjd+Hl8AunMBUaT7YAxWOzuIXeMP338szTL6sSBPxT75WmmNJAh3J4b22G7Bl
 eGrcJcckDBnvfRCia40l8g9NLHmVKqS9qNKxSWMlMlBmwd1HE0oEE1ddCx9bGHKe
 srf0S14RPQBRF6r+Nv0cx5S+CiptDtGiILR+cn5ZDra5YYCPX5kkJ6VEqw/m4yNE
 AKjjj5gim+jWYdBOTMU3u5KNNqFx37xnOCdC+5DvhMNWRHf2O/I5JSKtuKaZht+5
 PEuwcYfQvaZGp3fCEh38zzOX2qWUhRMbXUqSv5F0DbuWK7OAABQ=
 =VZFk
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-5.5-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS client updates from Trond Myklebust:
 "Highlights include:

  Features:

   - NFSv4.2 now supports cross device offloaded copy (i.e. offloaded
     copy of a file from one source server to a different target
     server).

   - New RDMA tracepoints for debugging congestion control and Local
     Invalidate WRs.

  Bugfixes and cleanups

   - Drop the NFSv4.1 session slot if nfs4_delegreturn_prepare waits for
     layoutreturn

   - Handle bad/dead sessions correctly in nfs41_sequence_process()

   - Various bugfixes to the delegation return operation.

   - Various bugfixes pertaining to delegations that have been revoked.

   - Cleanups to the NFS timespec code to avoid unnecessary conversions
     between timespec and timespec64.

   - Fix unstable RDMA connections after a reconnect

   - Close race between waking an RDMA sender and posting a receive

   - Wake pending RDMA tasks if connection fails

   - Fix MR list corruption, and clean up MR usage

   - Fix another RPCSEC_GSS issue with MIC buffer space"

* tag 'nfs-for-5.5-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (79 commits)
  SUNRPC: Capture completion of all RPC tasks
  SUNRPC: Fix another issue with MIC buffer space
  NFS4: Trace lock reclaims
  NFS4: Trace state recovery operation
  NFSv4.2 fix memory leak in nfs42_ssc_open
  NFSv4.2 fix kfree in __nfs42_copy_file_range
  NFS: remove duplicated include from nfs4file.c
  NFSv4: Make _nfs42_proc_copy_notify() static
  NFS: Fallocate should use the nfs4_fattr_bitmap
  NFS: Return -ETXTBSY when attempting to write to a swapfile
  fs: nfs: sysfs: Remove NULL check before kfree
  NFS: remove unneeded semicolon
  NFSv4: add declaration of current_stateid
  NFSv4.x: Drop the slot if nfs4_delegreturn_prepare waits for layoutreturn
  NFSv4.x: Handle bad/dead sessions correctly in nfs41_sequence_process()
  nfsv4: Move NFSPROC4_CLNT_COPY_NOTIFY to end of list
  SUNRPC: Avoid RPC delays when exiting suspend
  NFS: Add a tracepoint in nfs_fh_to_dentry()
  NFSv4: Don't retry the GETATTR on old stateid in nfs4_delegreturn_done()
  NFSv4: Handle NFS4ERR_OLD_STATEID in delegreturn
  ...
2019-12-07 16:50:55 -08:00
Steve French
231e2a0ba5 smb3: improve check for when we send the security descriptor context on create
We had cases in the previous patch where we were sending the security
descriptor context on SMB3 open (file create) in cases when we hadn't
mounted with with "modefromsid" mount option.

Add check for that mount flag before calling ad_sd_context in
open init.

Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
2019-12-07 17:38:22 -06:00
Linus Torvalds
85190d15f4 pipe: don't use 'pipe_wait() for basic pipe IO
pipe_wait() may be simple, but since it relies on the pipe lock, it
means that we have to do the wakeup while holding the lock.  That's
unfortunate, because the very first thing the waked entity will want to
do is to get the pipe lock for itself.

So get rid of the pipe_wait() usage by simply releasing the pipe lock,
doing the wakeup (if required) and then using wait_event_interruptible()
to wait on the right condition instead.

wait_event_interruptible() handles races on its own by comparing the
wakeup condition before and after adding itself to the wait queue, so
you can use an optimistic unlocked condition for it.

Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-12-07 13:53:09 -08:00
Linus Torvalds
a28c8b9db8 pipe: remove 'waiting_writers' merging logic
This code is ancient, and goes back to when we only had a single page
for the pipe buffers.  The exact history is hidden in the mists of time
(ie "before git", and in fact predates the BK repository too).

At that long-ago point in time, it actually helped to try to merge big
back-and-forth pipe reads and writes, and not limit pipe reads to the
single pipe buffer in length just because that was all we had at a time.

However, since then we've expanded the pipe buffers to multiple pages,
and this logic really doesn't seem to make sense.  And a lot of it is
somewhat questionable (ie "hmm, the user asked for a non-blocking read,
but we see that there's a writer pending, so let's wait anyway to get
the extra data that the writer will have").

But more importantly, it makes the "go to sleep" logic much less
obvious, and considering the wakeup issues we've had, I want to make for
less of those kinds of things.

Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-12-07 13:21:01 -08:00
Linus Torvalds
f467a6a664 pipe: fix and clarify pipe read wakeup logic
This is the read side version of the previous commit: it simplifies the
logic to only wake up waiting writers when necessary, and makes sure to
use a synchronous wakeup.  This time not so much for GNU make jobserver
reasons (that pipe never fills up), but simply to get the writer going
quickly again.

A bit less verbose commentary this time, if only because I assume that
the write side commentary isn't going to be ignored if you touch this
code.

Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-12-07 12:54:26 -08:00
Linus Torvalds
1b6b26ae70 pipe: fix and clarify pipe write wakeup logic
The pipe rework ends up having been extra painful, partly becaused of
actual bugs with ordering and caching of the pipe state, but also
because of subtle performance issues.

In particular, the pipe rework caused the kernel build to inexplicably
slow down.

The reason turns out to be that the GNU make jobserver (which limits the
parallelism of the build) uses a pipe to implement a "token" system: a
parallel submake will read a character from the pipe to get the job
token before starting a new job, and will write a character back to the
pipe when it is done.  The overall job limit is thus easily controlled
by just writing the appropriate number of initial token characters into
the pipe.

But to work well, that really means that the old behavior of write
wakeups being synchronous (WF_SYNC) is very important - when the pipe
writer wakes up a reader, we want the reader to actually get scheduled
immediately.  Otherwise you lose the parallelism of the build.

The pipe rework lost that synchronous wakeup on write, and we had
clearly all forgotten the reasons and rules for it.

This rewrites the pipe write wakeup logic to do the required Wsync
wakeups, but also clarifies the logic and avoids extraneous wakeups.

It also ends up addign a number of comments about what oit does and why,
so that we hopefully don't end up forgetting about this next time we
change this code.

Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-12-07 12:14:28 -08:00
Linus Torvalds
ad910e36da pipe: fix poll/select race introduced by the pipe rework
The kernel wait queues have a basic rule to them: you add yourself to
the wait-queue first, and then you check the things that you're going to
wait on.  That avoids the races with the event you're waiting for.

The same goes for poll/select logic: the "poll_wait()" goes first, and
then you check the things you're polling for.

Of course, if you use locking, the ordering doesn't matter since the
lock will serialize with anything that changes the state you're looking
at. That's not the case here, though.

So move the poll_wait() first in pipe_poll(), before you start looking
at the pipe state.

Fixes: 8cefc107ca ("pipe: Use head and tail pointers for the ring, not cursor and length")
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-12-07 10:41:17 -08:00
Patrick Steinhardt
38a2204f52 nfsd: depend on CRYPTO_MD5 for legacy client tracking
The legacy client tracking infrastructure of nfsd makes use of MD5 to
derive a client's recovery directory name. As the nfsd module doesn't
declare any dependency on CRYPTO_MD5, though, it may fail to allocate
the hash if the kernel was compiled without it. As a result, generation
of client recovery directories will fail with the following error:

    NFSD: unable to generate recoverydir name

The explicit dependency on CRYPTO_MD5 was removed as redundant back in
6aaa67b5f3 (NFSD: Remove redundant "select" clauses in fs/Kconfig
2008-02-11) as it was already implicitly selected via RPCSEC_GSS_KRB5.
This broke when RPCSEC_GSS_KRB5 was made optional for NFSv4 in commit
df486a2590 (NFS: Fix the selection of security flavours in Kconfig) at
a later point.

Fix the issue by adding back an explicit dependency on CRYPTO_MD5.

Fixes: df486a2590 (NFS: Fix the selection of security flavours in Kconfig)
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2019-12-07 11:28:52 -05:00
Olga Kornievskaia
18f428d4e2 NFSD fixing possible null pointer derefering in copy offload
Static checker revealed possible error path leading to possible
NULL pointer dereferencing.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Fixes: e0639dc580: ("NFSD introduce async copy feature")
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2019-12-07 11:28:46 -05:00
David Howells
76f6777c9c pipe: Fix iteration end check in fuse_dev_splice_write()
Fix the iteration end check in fuse_dev_splice_write().  The iterator
position can only be compared with == or != since wrappage may be involved.

Fixes: 8cefc107ca ("pipe: Use head and tail pointers for the ring, not cursor and length")
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-12-06 13:57:04 -08:00
Linus Torvalds
ec057595cb pipe: fix incorrect caching of pipe state over pipe_wait()
Similarly to commit 8f868d68d3 ("pipe: Fix missing mask update after
pipe_wait()") this fixes a case where the pipe rewrite ended up caching
the pipe state incorrectly over a pipe lock drop event.

It wasn't quite as obvious, because you needed to splice data from a
pipe to a file, which is a fairly unusual operation, but it's completely
wrong.

Make sure we load the pipe head/tail/size information only after we've
waited for there to be data in the pipe.

While in that file, also make one of the splice helper functions use the
canonical arghument order for pipe_empty().  That's syntactic - pipe
emptiness is just that head and tail are equal, and thus mixing up head
and tail doesn't really matter.  It's still wrong, though.

Reported-by: David Sterba <dsterba@suse.cz>
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-12-06 12:40:35 -08:00
Steve French
fdef665ba4 smb3: fix mode passed in on create for modetosid mount option
When using the special SID to store the mode bits in an ACE (See
http://technet.microsoft.com/en-us/library/hh509017(v=ws.10).aspx)
which is enabled with mount parm "modefromsid" we were not
passing in the mode via SMB3 create (although chmod was enabled).
SMB3 create allows a security descriptor context to be passed
in (which is more atomic and thus preferable to setting the mode
bits after create via a setinfo).

This patch enables setting the mode bits on create when using
modefromsid mount option.  In addition it fixes an endian
error in the definition of the Control field flags in the SMB3
security descriptor. It also makes the ACE type of the special
SID better match the documentation (and behavior of servers
which use this to store mode bits in SMB3 ACLs).

Signed-off-by: Steve French <stfrench@microsoft.com>
Acked-by: Ronnie Sahlberg <lsahlber@redhat.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
2019-12-06 14:15:52 -06:00
Linus Torvalds
9feb1af97e for-linus-20191205
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl3puL4QHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpjOJD/4vV2ENSYcwZ7Qezgn9tx5NEdADN6jC0DXe
 tJDnA0sDLfLTlVM9I1U2/wD6Wu31cvV2dHKmxO+WWWRP2M0orI00UA3I6BssVptY
 42e/YJd13IM+iVrhfhm+hmcAHvQUmWHTPZg/7F7OZPkEtNcbCjn6s+HTyzu9gqtt
 /kbJVDJ75TR9dEWCPY/A4jUcJYLw2leoHI4u2eqYRsTFNJHUQoaFUHDyNHyj7UNI
 kIUi2UixJv+a4pkFvyLPKhJcLr6DBC2TeUnfP2Re5oX1Z/XkDR7iX+L5dUwRHHbT
 4M7aJJ+mHDm8Z4IwwBqsSDRU5wUpzuplwBQBY/EQilUZAyOALVzUQABlRa0zxH8t
 0D4U6LDDOopYeK0by/FGdD88S6Z0LKm0HJETbdPaGd9fqSlhBn19iGeBKYk0cCIu
 Uecm9DKF+QLfKhTbN6h8nPyR3bfkqyUptQJ1UnheWo9f6L32bfp19sdI9LdtRUnS
 k661QApajaYQu/641u9CDZiI+DvI+57Wm9I4eCF3GXqOYWPAxSgWP7rJVgS+mfA3
 wo99/r11SruaA1Wqf+bOoN/wjsQCiUPFa8po8sh05ER4MH5Brb5EFlg04ZlQZkR9
 jOdiL8ISM9t86eyKwtR4PanE7/pPEYUjEWm4ntCC6ViTwCvgV3d6s2We6QPw1j2l
 nQ9p/c2bhg==
 =7ZFw
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-20191205' of git://git.kernel.dk/linux-block

Pull more block and io_uring updates from Jens Axboe:
 "I wasn't expecting this to be so big, and if I was, I would have used
  separate branches for this. Going forward I'll be doing separate
  branches for the current tree, just like for the next kernel version
  tree. In any case, this contains:

   - Series from Christoph that fixes an inherent race condition with
     zoned devices and revalidation.

   - null_blk zone size fix (Damien)

   - Fix for a regression in this merge window that caused busy spins by
     sending empty disk uevents (Eric)

   - Fix for a regression in this merge window for bfq stats (Hou)

   - Fix for io_uring creds allocation failure handling (me)

   - io_uring -ERESTARTSYS send/recvmsg fix (me)

   - Series that fixes the need for applications to retain state across
     async request punts for io_uring. This one is a bit larger than I
     would have hoped, but I think it's important we get this fixed for
     5.5.

   - connect(2) improvement for io_uring, handling EINPROGRESS instead
     of having applications needing to poll for it (me)

   - Have io_uring use a hash for poll requests instead of an rbtree.
     This turned out to work much better in practice, so I think we
     should make the switch now. For some workloads, even with a fair
     amount of cancellations, the insertion sort is just too expensive.
     (me)

   - Various little io_uring fixes (me, Jackie, Pavel, LimingWu)

   - Fix for brd unaligned IO, and a warning for the future (Ming)

   - Fix for a bio integrity data leak (Justin)

   - bvec_iter_advance() improvement (Pavel)

   - Xen blkback page unmap fix (SeongJae)

  The major items in here are all well tested, and on the liburing side
  we continue to add regression and feature test cases. We're up to 50
  topic cases now, each with anywhere from 1 to more than 10 cases in
  each"

* tag 'for-linus-20191205' of git://git.kernel.dk/linux-block: (33 commits)
  block: fix memleak of bio integrity data
  io_uring: fix a typo in a comment
  bfq-iosched: Ensure bio->bi_blkg is valid before using it
  io_uring: hook all linked requests via link_list
  io_uring: fix error handling in io_queue_link_head
  io_uring: use hash table for poll command lookups
  io-wq: clear node->next on list deletion
  io_uring: ensure deferred timeouts copy necessary data
  io_uring: allow IO_SQE_* flags on IORING_OP_TIMEOUT
  null_blk: remove unused variable warning on !CONFIG_BLK_DEV_ZONED
  brd: warn on un-aligned buffer
  brd: remove max_hw_sectors queue limit
  xen/blkback: Avoid unmapping unmapped grant pages
  io_uring: handle connect -EINPROGRESS like -EAGAIN
  block: set the zone size in blk_revalidate_disk_zones atomically
  block: don't handle bio based drivers in blk_revalidate_disk_zones
  block: allocate the zone bitmaps lazily
  block: replace seq_zones_bitmap with conv_zones_bitmap
  block: simplify blkdev_nr_zones
  block: remove the empty line at the end of blk-zoned.c
  ...
2019-12-06 10:08:59 -08:00
Linus Torvalds
0aecba6173 Merge branch 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs d_inode/d_flags memory ordering fixes from Al Viro:
 "Fallout from tree-wide audit for ->d_inode/->d_flags barriers use.
  Basically, the problem is that negative pinned dentries require
  careful treatment - unless ->d_lock is locked or parent is held at
  least shared, another thread can make them positive right under us.

  Most of the uses turned out to be safe - the main surprises as far as
  filesystems are concerned were

   - race in dget_parent() fastpath, that might end up with the caller
     observing the returned dentry _negative_, due to insufficient
     barriers. It is positive in memory, but we could end up seeing the
     wrong value of ->d_inode in CPU cache. Fixed.

   - manual checks that result of lookup_one_len_unlocked() is positive
     (and rejection of negatives). Again, insufficient barriers (we
     might end up with inconsistent observed values of ->d_inode and
     ->d_flags). Fixed by switching to a new primitive that does the
     checks itself and returns ERR_PTR(-ENOENT) instead of a negative
     dentry. That way we get rid of boilerplate converting negatives
     into ERR_PTR(-ENOENT) in the callers and have a single place to
     deal with the barrier-related mess - inside fs/namei.c rather than
     in every caller out there.

  The guts of pathname resolution *do* need to be careful - the race
  found by Ritesh is real, as well as several similar races.
  Fortunately, it turns out that we can take care of that with fairly
  local changes in there.

  The tree-wide audit had not been fun, and I hate the idea of repeating
  it. I think the right approach would be to annotate the places where
  we are _not_ guaranteed ->d_inode/->d_flags stability and have sparse
  catch regressions. But I'm still not sure what would be the least
  invasive way of doing that and it's clearly the next cycle fodder"

* 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  fs/namei.c: fix missing barriers when checking positivity
  fix dget_parent() fastpath race
  new helper: lookup_positive_unlocked()
  fs/namei.c: pull positivity check into follow_managed()
2019-12-06 09:06:58 -08:00