Commit Graph

61919 Commits

Author SHA1 Message Date
Eric Biggers
9b02e4987a ext4: clean up len and offset checks in ext4_fallocate()
- Fix some comments.

- Consistently access i_size directly rather than using i_size_read(),
  since in all relevant cases we're under inode_lock().

- Simplify the alignment checks by using the IS_ALIGNED() macro.

- In ext4_insert_range(), do the check against s_maxbytes in a way
  that is safe against signed overflow.  (This doesn't currently matter
  for ext4 due to ext4's limited max file size, but this is something
  other filesystems have gotten wrong.  We might as well do it safely.)
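
As an illustrative sketch only (not the exact lines from the patch), an
overflow-safe way to validate such a range, assuming non-negative loff_t
values for offset, len and maxbytes, looks like:

	/* Alignment checks via the IS_ALIGNED() macro: */
	if (!IS_ALIGNED(offset, blocksize) || !IS_ALIGNED(len, blocksize))
		return -EINVAL;

	/* Rearranged so the addition "offset + len" can never overflow: */
	if (len > maxbytes || offset > maxbytes - len)
		return -EFBIG;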

Signed-off-by: Eric Biggers <ebiggers@google.com>
Link: https://lore.kernel.org/r/20191231180444.46586-3-ebiggers@kernel.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: Jan Kara <jack@suse.cz>
2020-01-17 16:24:54 -05:00
Eric Biggers
dd6683e6ef ext4: remove ext4_{ind,ext}_calc_metadata_amount()
Remove the ext4_ind_calc_metadata_amount() and
ext4_ext_calc_metadata_amount() functions, which have been unused since
commit 71d4f7d032 ("ext4: remove metadata reservation checks").

Also remove the i_da_metadata_calc_last_lblock and
i_da_metadata_calc_len fields from struct ext4_inode_info, as these were
only used by these removed functions.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Link: https://lore.kernel.org/r/20191231180444.46586-2-ebiggers@kernel.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: Jan Kara <jack@suse.cz>
2020-01-17 16:24:54 -05:00
Eric Biggers
fd5fe25356 ext4: remove unneeded check for error allocating bio_post_read_ctx
Since allocating an object from a mempool never fails when
__GFP_DIRECT_RECLAIM (which is included in GFP_NOFS) is set, the check
for failure to allocate a bio_post_read_ctx is unnecessary.  Remove it.

Also remove the redundant assignment to ->bi_private.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Link: https://lore.kernel.org/r/20191231181256.47770-1-ebiggers@kernel.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-01-17 16:24:54 -05:00
Eric Biggers
68e45330e3 ext4: fix deadlock allocating bio_post_read_ctx from mempool
Without any form of coordination, any case where multiple allocations
from the same mempool are needed at a time to make forward progress can
deadlock under memory pressure.

This is the case for struct bio_post_read_ctx, as one can be allocated
to decrypt a Merkle tree page during fsverity_verify_bio(), which itself
is running from a post-read callback for a data bio which has its own
struct bio_post_read_ctx.

Fix this by freeing the first bio_post_read_ctx before calling
fsverity_verify_bio().  This works because verity (if enabled) is always
the last post-read step.

This deadlock can be reproduced by trying to read from an encrypted
verity file after reducing NUM_PREALLOC_POST_READ_CTXS to 1 and patching
mempool_alloc() to pretend that pool->alloc() always fails.

Note that since NUM_PREALLOC_POST_READ_CTXS is actually 128, hitting this
bug in practice would require reading from lots of encrypted
verity files at the same time.  But it's theoretically possible, as N
available objects isn't enough to guarantee forward progress when > N/2
threads each need 2 objects at a time.
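
A hypothetical sketch of the fix pattern (simplified names, not the exact
ext4 code): the data bio's context goes back to the mempool before the
verity step, which may itself need a context for Merkle tree pages:

	mempool_free(ctx, bio_post_read_ctx_pool); /* free before verity */
	bio->bi_private = NULL;
	fsverity_verify_bio(bio);  /* may allocate a new ctx internally */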

Fixes: 22cfe4b48c ("ext4: add fs-verity read support")
Signed-off-by: Eric Biggers <ebiggers@google.com>
Link: https://lore.kernel.org/r/20191231181222.47684-1-ebiggers@kernel.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-01-17 16:24:54 -05:00
Eric Biggers
547c556f4d ext4: fix deadlock allocating crypto bounce page from mempool
ext4_writepages() on an encrypted file has to encrypt the data, but it
can't modify the pagecache pages in-place, so it encrypts the data into
bounce pages and writes those instead.  All bounce pages are allocated
from a mempool using GFP_NOFS.

This is not correct use of a mempool, and it can deadlock.  This is
because GFP_NOFS includes __GFP_DIRECT_RECLAIM, which enables the "never
fail" mode for mempool_alloc() where a failed allocation will fall back
to waiting for one of the preallocated elements in the pool.

But since this mode is used for all of a bio's pages and not just the
first, it can deadlock waiting for pages already in the bio to be freed.

This deadlock can be reproduced by patching mempool_alloc() to pretend
that pool->alloc() always fails (so that it always falls back to the
preallocations), and then creating an encrypted file of size > 128 KiB.

Fix it by only using GFP_NOFS for the first page in the bio.  For
subsequent pages just use GFP_NOWAIT, and if any of those fail, just
submit the bio and start a new one.
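
As a hedged sketch (a hypothetical helper, not the actual fscrypt/ext4
code), the allocation policy described above amounts to:

	static gfp_t bounce_page_gfp(bool first_page_in_bio)
	{
		/*
		 * Only the first page may block on the mempool (GFP_NOFS
		 * includes __GFP_DIRECT_RECLAIM).  Later pages use
		 * GFP_NOWAIT; if that fails, the caller submits the
		 * current bio and starts a new one.
		 */
		return first_page_in_bio ? GFP_NOFS : GFP_NOWAIT;
	}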

This will need to be fixed in f2fs too, but that's less straightforward.

Fixes: c9af28fdd4 ("ext4 crypto: don't let data integrity writebacks fail with ENOMEM")
Cc: stable@vger.kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
Link: https://lore.kernel.org/r/20191231181149.47619-1-ebiggers@kernel.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-01-17 16:24:54 -05:00
Naoto Kobayashi
8f27fd0ab5 ext4: Delete ext4_kvzalloc()
Since we're not using ext4_kvzalloc(), delete this function.

Signed-off-by: Naoto Kobayashi <naoto.kobayashi4c@gmail.com>
Link: https://lore.kernel.org/r/20191227080523.31808-2-naoto.kobayashi4c@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-01-17 16:24:53 -05:00
Eric Biggers
d85926474f ext4: re-enable extent zeroout optimization on encrypted files
For encrypted files, commit 36086d43f6 ("ext4 crypto: fix bugs in
ext4_encrypted_zeroout()") disabled the optimization where when a write
occurs to the middle of an unwritten extent, the head and/or tail of the
extent (when they aren't too large) are zeroed out, turned into an
initialized extent, and merged with the part being written to.  This
optimization helps prevent fragmentation of the extent tree.

However, disabling this optimization also made fscrypt_zeroout_range()
nearly impossible to test, as now it's only reachable via the very rare
case in ext4_split_extent_at() where allocating a new extent tree block
fails due to ENOSPC.  'gce-xfstests -c ext4/encrypt -g auto' doesn't
even hit this at all.

It's preferable to avoid really rare cases that are hard to test.

That commit also cited data corruption in xfstest generic/127 as a
reason to disable the extent zeroout optimization, but that's no longer
reproducible.  It also cited fscrypt_zeroout_range() having poor
performance, but I've written a patch to fix that.

Therefore, re-enable the extent zeroout optimization on encrypted files.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Link: https://lore.kernel.org/r/20191226161114.53606-1-ebiggers@kernel.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-01-17 16:24:53 -05:00
Eric Biggers
33b4cc2501 ext4: only use fscrypt_zeroout_range() on regular files
fscrypt_zeroout_range() is only for encrypted regular files, not for
encrypted directories or symlinks.

Fortunately, currently it seems it's never called on non-regular files.
But to be safe ext4 should explicitly check S_ISREG() before calling it.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Link: https://lore.kernel.org/r/20191226161022.53490-1-ebiggers@kernel.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-01-17 16:24:53 -05:00
Eric Biggers
457b1e353c ext4: allow ZERO_RANGE on encrypted files
When ext4 encryption support was first added, ZERO_RANGE was disallowed,
supposedly because test failures (e.g. ext4/001) were seen when enabling
it, and at the time there wasn't enough time/interest to debug it.

However, there's actually no reason why ZERO_RANGE can't work on
encrypted files.  And in fact it *does* work now.  Whole blocks in the
zeroed range are converted to unwritten extents, as usual; encryption
makes no difference for that part.  Partial blocks are zeroed in the
pagecache and then ->writepages() encrypts those blocks as usual.
ext4_block_zero_page_range() handles reading and decrypting the block if
needed before actually doing the pagecache write.

Also, f2fs has always supported ZERO_RANGE on encrypted files.

As far as I can tell, the reason that ext4/001 was failing in v4.1 was
actually because of one of the bugs fixed by commit 36086d43f6 ("ext4
crypto: fix bugs in ext4_encrypted_zeroout()").  The bug made
ext4_encrypted_zeroout() always return a positive value, which caused
unwritten extents in encrypted files to sometimes not be marked as
initialized after being written to.  This bug was not actually in
ZERO_RANGE; it just happened to trigger during the extents manipulation
done in ext4/001 (and probably other tests too).

So, let's enable ZERO_RANGE on encrypted files on ext4.

Tested with:
	gce-xfstests -c ext4/encrypt -g auto
	gce-xfstests -c ext4/encrypt_1k -g auto

Got the same set of test failures both with and without this patch.
But with this patch 6 fewer tests are skipped: ext4/001, generic/008,
generic/009, generic/033, generic/096, and generic/511.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Link: https://lore.kernel.org/r/20191226154216.4808-1-ebiggers@kernel.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-01-17 16:24:53 -05:00
Eric Biggers
834f1565fa ext4: handle decryption error in __ext4_block_zero_page_range()
fscrypt_decrypt_pagecache_blocks() can fail, because it uses
skcipher_request_alloc(), which uses kmalloc(), which can fail; and also
because it calls crypto_skcipher_decrypt(), which can fail depending on
the driver that actually implements the crypto.

Therefore it's not appropriate to WARN on decryption error in
__ext4_block_zero_page_range().

Remove the WARN and just handle the error instead.
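
A minimal sketch of the idea (not the exact diff):

	err = fscrypt_decrypt_pagecache_blocks(page, blocksize,
					       bh_offset(bh));
	if (err)
		goto unlock;	/* propagate the error instead of WARN_ON() */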

Signed-off-by: Eric Biggers <ebiggers@google.com>
Link: https://lore.kernel.org/r/20191226154105.4704-1-ebiggers@kernel.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-01-17 16:24:53 -05:00
Eric Biggers
284b3f6edb ext4: remove unnecessary selections from EXT3_FS
Since EXT3_FS already selects EXT4_FS, there's no reason for it to
redundantly select all the selections of EXT4_FS -- notwithstanding the
comments that claim otherwise.

Remove these redundant selections to avoid confusion.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Link: https://lore.kernel.org/r/20191226153920.4466-1-ebiggers@kernel.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2020-01-17 16:24:53 -05:00
zhengbin
4756ee183f ext4: use true,false for bool variable
Fixes coccicheck warning:

fs/ext4/extents.c:5271:6-12: WARNING: Assignment of 0/1 to bool variable
fs/ext4/extents.c:5287:4-10: WARNING: Assignment of 0/1 to bool variable

Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: zhengbin <zhengbin13@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/1577241959-138695-1-git-send-email-zhengbin13@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-01-17 16:24:53 -05:00
Eric Biggers
46797ad75a ext4: uninline ext4_inode_journal_mode()
Determining an inode's journaling mode has gotten more complicated over
time.  Move ext4_inode_journal_mode() from an inline function into
ext4_jbd2.c to reduce the compiled code size.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Link: https://lore.kernel.org/r/20191209233602.117778-1-ebiggers@kernel.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2020-01-17 16:24:52 -05:00
Eric Biggers
64c314ff82 ext4: remove unnecessary ifdefs in htree_dirblock_to_tree()
The ifdefs for CONFIG_FS_ENCRYPTION in htree_dirblock_to_tree() are
unnecessary, as the called functions are already stubbed out when
!CONFIG_FS_ENCRYPTION.  Remove them.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Link: https://lore.kernel.org/r/20191209213225.18477-1-ebiggers@kernel.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2020-01-17 16:24:52 -05:00
Chengguang Xu
7063743f68 ext4: remove unnecessary assignment in ext4_htree_store_dirent()
We have allocated the memory using kzalloc(), so we don't have
to set the last byte to 0 again.

Signed-off-by: Chengguang Xu <cgxu519@mykernel.net>
Link: https://lore.kernel.org/r/20191206054317.3107-1-cgxu519@mykernel.net
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-01-17 16:24:52 -05:00
Theodore Ts'o
d4c5e960bf ext4: avoid fetching btime in ext4_getattr() unless requested
Linus observed that in an allmodconfig build which does a lot of stat(2)
calls, ext4_getattr() accounted for a noticeable (1%) amount of CPU time,
due to the cache line for i_extra_isize getting pulled in.  Since the
normal stat system call doesn't return btime, this is a complete waste.
So only calculate btime when it is explicitly requested.

[ Fixed to check against request_mask instead of query_flags. ]
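
A hedged sketch of the resulting logic (field names follow the usual
statx/kstat conventions, with ei denoting the ext4 inode info; not
necessarily the exact ext4_getattr() code):

	if (request_mask & STATX_BTIME) {
		/* btime is only fetched when explicitly requested */
		stat->result_mask |= STATX_BTIME;
		stat->btime.tv_sec = ei->i_crtime.tv_sec;
		stat->btime.tv_nsec = ei->i_crtime.tv_nsec;
	}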

Link: https://lore.kernel.org/r/CAHk-=wivmk_j6KbTX+Er64mLrG8abXZo0M10PNdAnHc8fWXfsQ@mail.gmail.com
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-01-17 16:24:25 -05:00
Jan Kara
8cd115bdda ext4: Optimize ext4 DIO overwrites
Currently we start a transaction for mapping every extent written
using direct IO. This is unnecessary when we know we are overwriting
already allocated blocks, and the overhead of starting a transaction can
be significant, especially for multithreaded workloads doing small writes.
Use iomap operations that avoid starting a transaction for direct IO
overwrites.

This improves throughput of 4k random writes - fio jobfile:
[global]
rw=randrw
norandommap=1
invalidate=0
bs=4k
numjobs=16
time_based=1
ramp_time=30
runtime=120
group_reporting=1
ioengine=psync
direct=1
size=16G
filename=file1.0.0:file1.0.1:file1.0.2:file1.0.3:file1.0.4:file1.0.5:file1.0.6:file1.0.7:file1.0.8:file1.0.9:file1.0.10:file1.0.11:file1.0.12:file1.0.13:file1.0.14:file1.0.15:file1.0.16:file1.0.17:file1.0.18:file1.0.19:file1.0.20:file1.0.21:file1.0.22:file1.0.23:file1.0.24:file1.0.25:file1.0.26:file1.0.27:file1.0.28:file1.0.29:file1.0.30:file1.0.31
file_service_type=random
nrfiles=32

from 3018MB/s to 4059MB/s in my test VM running the test against a simulated
pmem device (note that before the iomap conversion, this workload was able
to achieve 3708MB/s because the old direct IO path avoided the transaction
start for overwrites as well). For dax, the win is even larger, improving
throughput from 3042MB/s to 4311MB/s.

Reported-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20191218174433.19380-1-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-12-26 11:57:18 -05:00
Theodore Ts'o
4549b49f82 ext4: export information about first/last errors via /sys/fs/ext4/<dev>
Make {first,last}_error_{ino,block,line,func,errcode} available via
sysfs.

Also add a missing newline for {first,last}_error_time.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-12-26 11:29:10 -05:00
Theodore Ts'o
46f870d690 ext4: simulate various I/O and checksum errors when reading metadata
This allows us to test various error handling code paths.

Link: https://lore.kernel.org/r/20191209012317.59398-1-tytso@mit.edu
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-12-26 11:28:31 -05:00
Theodore Ts'o
878520ac45 ext4: save the error code which triggered an ext4_error() in the superblock
This allows the cause of an ext4_error() report to be categorized
based on whether it was triggered due to an I/O error, a memory
allocation error, or other possible causes.  Most errors are caused by
a detected file system inconsistency, so the default code stored in
the superblock will be EXT4_ERR_EFSCORRUPTED.

Link: https://lore.kernel.org/r/20191204032335.7683-1-tytso@mit.edu
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-12-26 11:28:23 -05:00
Theodore Ts'o
a562c687d1 Merge branch 'rk/inode_lock' into dev
These are ilock patches which help improve the current inode lock scalability
problem in the ext4 DIO mixed read/write workload case. The problem was first
reported by Joseph [1]. This should help improve mixed read/write workload
cases for databases which use direct IO.

These patches are based upon upstream discussion with Jan Kara & Joseph [2].

The problem really is that in the case of DIO overwrites, we start with
an exclusive lock and then downgrade it later to a shared lock. This causes a
scalability problem in the mixed DIO read/write workload case:
if we have any ongoing DIO reads and a DIO write then arrives
(since writes start with an exclusive inode lock), it has to wait until the
shared lock is released (which only happens once the DIO reads complete).
The same is true the other way around.
This can be easily observed with perf-tools trace analysis [3].

For more details, including performance numbers, please see [4].

[1] https://lore.kernel.org/linux-ext4/1566871552-60946-4-git-send-email-joseph.qi@linux.alibaba.com/
[2] https://lore.kernel.org/linux-ext4/20190910215720.GA7561@quack2.suse.cz/
[3] https://raw.githubusercontent.com/riteshharjani/LinuxStudy/master/ext4/perf.report
[4] https://lore.kernel.org/r/20191212055557.11151-1-riteshh@linux.ibm.com
2019-12-26 10:02:07 -05:00
Ritesh Harjani
bc6385dab1 ext4: Move to shared i_rwsem even without dioread_nolock mount opt
We were using shared locking only with the dioread_nolock mount option in the
case of DIO overwrites. This mount-option condition is not needed anymore with
the current code, since:

1. There is no race between buffered writes & DIO overwrites: buffered writes
take the exclusive lock & DIO overwrites take the shared lock. Also the DIO
path will make sure to flush and wait for any dirty page cache data.

2. There is no race between buffered reads & DIO overwrites, since no block
allocation is possible with DIO overwrites. So no stale data exposure
should happen. The same is the case between DIO reads & DIO overwrites.

3. Other paths like truncate are also protected, since we wait there for any
DIO in flight to be over.

Reviewed-by: Jan Kara <jack@suse.cz>
Tested-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/20191212055557.11151-4-riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-12-22 23:57:27 -05:00
Ritesh Harjani
aa9714d0e3 ext4: Start with shared i_rwsem in case of DIO instead of exclusive
Earlier there was no shared lock in the DIO read path. But commit
(16c5468859: ext4: Allow parallel DIO reads)
simplified some of the locking mechanism while still allowing for parallel DIO
reads by adding a shared lock in the inode DIO read path.

But this created a problem with mixed read/write workloads. It is due to the
fact that in the DIO path, we first start with an exclusive lock and only when
we determine that it is an overwrite IO, we downgrade the lock. This causes the
problem, since we still have shared locking in DIO reads.

So, this patch tries to fix this issue by starting with a shared lock and then
switching to an exclusive lock only when required, based on
ext4_dio_write_checks().

Other than that, it also simplifies the cases below:

1. Simplified the ext4_unaligned_aio API to ext4_unaligned_io. The previous API
was abused in the sense that it was not really checking for AIO anywhere and it
also used to check for extending writes. So this API was renamed and simplified
to ext4_unaligned_io(), which actually only checks if the IO is really
unaligned.

Now, in case of unaligned direct IO, iomap_dio_rw needs to do zeroing of the
partial block, and that will require serialization against other direct IOs in
the same block. So we take an exclusive inode lock for any unaligned DIO. In
case of AIO we also need to wait for any outstanding IOs to complete, so that
the conversion from unwritten to written is completed before anyone tries to
map the overlapping block. Hence we take the exclusive inode lock and also call
inode_dio_wait() for the unaligned DIO case. Please note that since we are
anyway taking an exclusive lock for unaligned IO, inode_dio_wait() becomes a
no-op in the case of non-AIO DIO.

2. Added ext4_extending_io(). This checks if the IO is extending the file.

3. Added ext4_dio_write_checks(). In this we start with a shared inode lock and
only switch to an exclusive lock if required. So in most cases with aligned,
non-extending, dioread_nolock & overwrite IO, it tries to write with a shared
lock. If not, then we restart the operation in ext4_dio_write_checks(), after
acquiring the exclusive lock.
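
A hypothetical outline of that locking policy (not the exact ext4 code):

	/* Returns true if the write can proceed under the shared lock. */
	static bool dio_write_may_use_shared_lock(bool unaligned,
						  bool extending,
						  bool overwrite)
	{
		return !unaligned && !extending && overwrite;
	}

If this returns false after the checks done under the shared lock, the
shared lock is dropped and the whole operation restarts under the
exclusive lock.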

Reviewed-by: Jan Kara <jack@suse.cz>
Tested-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/20191212055557.11151-3-riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-12-22 23:57:27 -05:00
Ritesh Harjani
f629afe336 ext4: fix ext4_dax_read/write inode locking sequence for IOCB_NOWAIT
Apparently our current rwsem code doesn't like doing the trylock, then
lock-for-real scheme.  So change our dax read/write methods to just do the
trylock for the RWF_NOWAIT case.
This seems to fix an AIM7 regression in some scalable filesystems of up to
~25% in some cases, as claimed in commit 942491c9e6 ("xfs: fix AIM7 regression").
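
A hedged sketch of the RWF_NOWAIT handling described above (following the
common VFS pattern, not necessarily ext4's exact lines):

	if (iocb->ki_flags & IOCB_NOWAIT) {
		if (!inode_trylock_shared(inode))
			return -EAGAIN;	/* never block for RWF_NOWAIT */
	} else {
		inode_lock_shared(inode);
	}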

Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Matthew Bobrowski <mbobrowski@mbobrowski.org>
Tested-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/20191212055557.11151-2-riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-12-22 23:57:27 -05:00
Theodore Ts'o
cf2834a5ed ext4: treat buffers containing write errors as valid in ext4_sb_bread()
In commit 7963e5ac90 ("ext4: treat buffers with write errors as
containing valid data") we missed changing ext4_sb_bread() to use
ext4_buffer_uptodate().  So fix this oversight.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-12-22 23:57:08 -05:00
Linus Torvalds
9efa3ed504 Merge branch 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs fixes from Al Viro:
 "Eric's s_inodes softlockup fixes + Jan's fix for recent regression
  from pipe rework"

* 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  fs: call fsnotify_sb_delete after evict_inodes
  fs: avoid softlockups in s_inodes iterators
  pipe: Fix bogus dereference in iov_iter_alignment()
2019-12-22 17:00:04 -08:00
Linus Torvalds
c601747175 Fixes for 5.5:
- Minor documentation fixes
 - Fix a file corruption due to read racing with an insert range
 operation.
 - Fix log reservation overflows when allocating large rt extents
 - Fix a buffer log item flags check
 - Don't allow administrators to mount with sunit= options that will
 cause later xfs_repair complaints about the root directory being
 suspicious because the fs geometry appeared inconsistent
 - Fix a non-static helper that should have been static

Merge tag 'xfs-5.5-fixes-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull xfs fixes from Darrick Wong:
 "Fix a few bugs that could lead to corrupt files, fsck complaints, and
  filesystem crashes:

   - Minor documentation fixes

   - Fix a file corruption due to read racing with an insert range
     operation.

   - Fix log reservation overflows when allocating large rt extents

   - Fix a buffer log item flags check

   - Don't allow administrators to mount with sunit= options that will
     cause later xfs_repair complaints about the root directory being
     suspicious because the fs geometry appeared inconsistent

   - Fix a non-static helper that should have been static"

* tag 'xfs-5.5-fixes-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  xfs: Make the symbol 'xfs_rtalloc_log_count' static
  xfs: don't commit sunit/swidth updates to disk if that would cause repair failures
  xfs: split the sunit parameter update into two parts
  xfs: refactor agfl length computation function
  libxfs: resync with the userspace libxfs
  xfs: use bitops interface for buf log item AIL flag check
  xfs: fix log reservation overflows when allocating large rt extents
  xfs: stabilize insert range start boundary to avoid COW writeback race
  xfs: fix Sphinx documentation warning
2019-12-22 10:59:06 -08:00
Linus Torvalds
a396560706 Ext4 bug fixes (including a regression fix) for 5.5

Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 bug fixes from Ted Ts'o:
 "Ext4 bug fixes, including a regression fix"

* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: clarify impact of 'commit' mount option
  ext4: fix unused-but-set-variable warning in ext4_add_entry()
  jbd2: fix kernel-doc notation warning
  ext4: use RCU API in debug_print_tree
  ext4: validate the debug_want_extra_isize mount option at parse time
  ext4: reserve revoke credits in __ext4_new_inode
  ext4: unlock on error in ext4_expand_extra_isize()
  ext4: optimize __ext4_check_dir_entry()
  ext4: check for directory entries too close to block end
  ext4: fix ext4_empty_dir() for directories with holes
2019-12-22 10:41:48 -08:00
Jan Stancek
0dd1e3773a pipe: fix empty pipe check in pipe_write()
The LTP pipeio_1 test is hanging with v5.5-rc2-385-gb8e382a185eb,
with the read side observing an empty pipe and sleeping, and the write
side running out of space and then sleeping as well. In this
scenario there are 5 writers and 1 reader.

The problem is that after pipe_write() reacquires the pipe lock, it
re-checks for an empty pipe with a potentially stale 'head' and
doesn't wake up the read side anymore. pipe->tail can advance
beyond 'head', because there are multiple writers.

Use pipe->head for the empty pipe check after reacquiring the lock
to observe the current state.
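
Illustratively (a simplified sketch, not the exact pipe_write() code):

	/* after reacquiring the pipe lock */
	was_empty = pipe_empty(pipe->head, pipe->tail);	/* current head, not a stale copy */
	/* ... copy the data into the ring ... */
	if (was_empty)
		wake_up_interruptible_sync_poll(&pipe->wait, EPOLLIN | EPOLLRDNORM);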

Testing: With patch, LTP pipeio_1 ran successfully in loop for 1 hour.
         Without patch it hanged within a minute.

Fixes: 1b6b26ae70 ("pipe: fix and clarify pipe write wakeup logic")
Reported-by: Rachel Sibley <rasibley@redhat.com>
Signed-off-by: Jan Stancek <jstancek@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-12-22 09:47:47 -08:00
Yunfeng Ye
68d7b2d838 ext4: fix unused-but-set-variable warning in ext4_add_entry()
A warning is found when compiling with "-Wunused-but-set-variable":

fs/ext4/namei.c: In function ‘ext4_add_entry’:
fs/ext4/namei.c:2167:23: warning: variable ‘sbi’ set but not used
[-Wunused-but-set-variable]
  struct ext4_sb_info *sbi;
                       ^~~
Fix this by moving the variable @sbi under CONFIG_UNICODE.
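
A minimal sketch of the change (assuming the variable is only needed in the
CONFIG_UNICODE branch):

	#ifdef CONFIG_UNICODE
		struct ext4_sb_info *sbi = EXT4_SB(dir->i_sb);
	#endif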

Signed-off-by: Yunfeng Ye <yeyunfeng@huawei.com>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/cb5eb904-224a-9701-c38f-cb23514b1fff@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-12-21 21:00:53 -05:00
Linus Torvalds
f8f04d0859 io_uring-5.5-20191220

Merge tag 'io_uring-5.5-20191220' of git://git.kernel.dk/linux-block

Pull io_uring fixes from Jens Axboe:
 "Here's a set of fixes that should go into 5.5-rc3 for io_uring.

  This is bigger than I'd like it to be, mainly because we're fixing the
  case where an application reuses sqe data right after issue. This
  really must work, or it's confusing. With 5.5 we're flagging us as
  submit stable for the actual data, this must also be the case for
  SQEs.

  Honestly, I'd really like to add another series on top of this, since
  it cleans it up considerably and prevents any SQE reuse by design. I
  posted that here:

    https://lore.kernel.org/io-uring/20191220174742.7449-1-axboe@kernel.dk/T/#u

  and may still send it your way early next week once it's been looked
  at and had some more soak time (does pass all regression tests). With
  that series, we've unified the prep+issue handling, and only the prep
  phase even has access to the SQE.

  Anyway, outside of that, fixes in here for a few other issues that
  have been hit in testing or production"

* tag 'io_uring-5.5-20191220' of git://git.kernel.dk/linux-block:
  io_uring: io_wq_submit_work() should not touch req->rw
  io_uring: don't wait when under-submitting
  io_uring: warn about unhandled opcode
  io_uring: read opcode and user_data from SQE exactly once
  io_uring: make IORING_OP_TIMEOUT_REMOVE deferrable
  io_uring: make IORING_OP_CANCEL_ASYNC deferrable
  io_uring: make IORING_POLL_ADD and IORING_POLL_REMOVE deferrable
  io_uring: make HARDLINK imply LINK
  io_uring: any deferred command must have stable sqe data
  io_uring: remove 'sqe' parameter to the OP helpers that take it
  io_uring: fix pre-prepped issue with force_nonblock == true
  io-wq: re-add io_wq_current_is_worker()
  io_uring: fix sporadic -EFAULT from IORING_OP_RECVMSG
  io_uring: fix stale comment and a few typos
2019-12-20 13:30:49 -08:00
Chen Wandun
5084bf6b20 xfs: Make the symbol 'xfs_rtalloc_log_count' static
Fix the following sparse warning:

fs/xfs/libxfs/xfs_trans_resv.c:206:1: warning: symbol 'xfs_rtalloc_log_count' was not declared. Should it be static?

Fixes: b1de6fc752 ("xfs: fix log reservation overflows when allocating large rt extents")
Signed-off-by: Chen Wandun <chenwandun@huawei.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-12-20 08:07:31 -08:00
Darrick J. Wong
13eaec4b2a xfs: don't commit sunit/swidth updates to disk if that would cause repair failures
Alex Lyakas reported[1] that mounting an xfs filesystem with new sunit
and swidth values could cause xfs_repair to fail loudly.  The problem
here is that repair calculates where mkfs should have allocated the
root inode, based on the superblock geometry.  The allocation decisions
depend on sunit, which means that we really can't go updating sunit if
it would lead to a subsequent repair failure on an otherwise correct
filesystem.

Port from xfs_repair some code that computes the location of the root
inode and teach mount to skip the ondisk update if it would cause
problems for repair.  Along the way we'll update the documentation,
provide a function for computing the minimum AGFL size instead of
open-coding it, and cut down some indenting in the mount code.

Note that we allow the mount to proceed (and new allocations will
reflect this new geometry) because we've never screened this kind of
thing before.  We'll have to wait for a new future incompat feature to
enforce correct behavior, alas.

Note that the geometry reporting always uses the superblock values, not
the incore ones, so that is what xfs_info and xfs_growfs will report.

[1] https://lore.kernel.org/linux-xfs/20191125130744.GA44777@bfoster/T/#m00f9594b511e076e2fcdd489d78bc30216d72a7d

Reported-by: Alex Lyakas <alex@zadara.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-12-19 07:53:48 -08:00
Darrick J. Wong
4f5b1b3a8f xfs: split the sunit parameter update into two parts
If the administrator provided a sunit= mount option, we need to validate
the raw parameter, convert the mount option units (512b blocks) into the
internal unit (fs blocks), and then validate that the (now cooked)
parameter doesn't screw anything up on disk.  The incore inode geometry
computation can depend on the new sunit option, but a subsequent patch
will make validating the cooked value depends on the computed inode
geometry, so break the sunit update into two steps.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-12-19 07:53:48 -08:00
Darrick J. Wong
1cac233cfe xfs: refactor agfl length computation function
Refactor xfs_alloc_min_freelist to accept a NULL @pag argument, in which
case it returns the largest possible minimum length.  This will be used
in an upcoming patch to compute the length of the AGFL at mkfs time.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2019-12-19 07:53:48 -08:00
Darrick J. Wong
af952aeb4a libxfs: resync with the userspace libxfs
Prepare to resync the userspace libxfs with the kernel libxfs.  There
were a few things I missed -- a couple of static inline directory
functions that have to be exported for xfs_repair; a couple of directory
naming functions that make porting much easier if they're /not/ static
inline; and a u16 usage that should have been uint16_t.

None of these things are bugs in their own right; this just makes
porting xfsprogs easier.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
2019-12-19 07:53:47 -08:00
Brian Foster
826f7e3413 xfs: use bitops interface for buf log item AIL flag check
The xfs_log_item flags were converted to atomic bitops as of commit
22525c17ed ("xfs: log item flags are racy"). The assert check for
AIL presence in xfs_buf_item_relse() still uses the old value based
check. This likely went unnoticed as XFS_LI_IN_AIL evaluates to 0
and causes the assert to unconditionally pass. Fix up the check.
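
Illustratively (not necessarily the exact lines changed), the difference is:

	/* before: XFS_LI_IN_AIL is a bit number (0), so this always passes */
	ASSERT(!(lip->li_flags & XFS_LI_IN_AIL));

	/* after: use the bitops interface for the atomic bit flags */
	ASSERT(!test_bit(XFS_LI_IN_AIL, &lip->li_flags));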

Signed-off-by: Brian Foster <bfoster@redhat.com>
Fixes: 22525c17ed ("xfs: log item flags are racy")
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-12-19 07:53:47 -08:00
Jens Axboe
fd6c2e4c06 io_uring: io_wq_submit_work() should not touch req->rw
I've been chasing a weird and obscure crash that was userspace stack
corruption, and finally narrowed it down to a bit flip that made a
stack address invalid. io_wq_submit_work() unconditionally flips
the req->rw.ki_flags IOCB_NOWAIT bit, but since it's a generic work
handler, this isn't valid. Normal read/write operations own that
part of the request, on other types it could be something else.

Move the IOCB_NOWAIT clear to the read/write handlers where it belongs.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-12-18 12:19:41 -07:00
Pavel Begunkov
7c504e6520 io_uring: don't wait when under-submitting
There is no reliable way to submit and wait in a single syscall, as
io_submit_sqes() may under-consume sqes (in case of an early error).
Then it will wait for not-yet-submitted requests, deadlocking the user
in most cases.

Don't wait/poll if we can't submit all sqes.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-12-18 10:01:49 -07:00
Eric Sandeen
1edc8eb2e9 fs: call fsnotify_sb_delete after evict_inodes
When a filesystem is unmounted, we currently call fsnotify_sb_delete()
before evict_inodes(), which means that fsnotify_unmount_inodes()
must iterate over all inodes on the superblock looking for any inodes
with watches.  This is inefficient and can lead to livelocks as it
iterates over many unwatched inodes.

At this point, SB_ACTIVE is gone and dropping refcount to zero kicks
the inode out immediately, so anything processed by
fsnotify_sb_delete / fsnotify_unmount_inodes gets evicted in that loop.

After that, the call to evict_inodes will evict everything else with a
zero refcount.

This should speed things up overall, and avoid livelocks in
fsnotify_unmount_inodes().
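
A hedged sketch of the reordering in the unmount path (simplified; see the
patch for the exact call sites):

	evict_inodes(sb);	/* evict the bulk of (unwatched) inodes first */
	fsnotify_sb_delete(sb);	/* then only the remaining inodes need scanning */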

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-12-18 00:03:01 -05:00
Eric Sandeen
04646aebd3 fs: avoid softlockups in s_inodes iterators
Anything that walks all inodes on sb->s_inodes list without rescheduling
risks softlockups.

Previous efforts were made in 2 functions, see:

c27d82f fs/drop_caches.c: avoid softlockups in drop_pagecache_sb()
ac05fbb inode: don't softlockup when evicting inodes

but there hasn't been an audit of all walkers, so do that now.  This
also consistently moves the cond_resched() calls to the bottom of each
loop in cases where it already exists.

One loop remains: remove_dquot_ref(), because I'm not quite sure how
to deal with that one w/o taking the i_lock.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-12-18 00:03:01 -05:00
Jens Axboe
e781573e2f io_uring: warn about unhandled opcode
Now that we have all the opcodes handled in terms of command prep and
SQE reuse, add a printk_once() to warn about any potentially new and
unhandled ones.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-12-17 19:57:27 -07:00
Jens Axboe
d625c6ee49 io_uring: read opcode and user_data from SQE exactly once
If we defer a request, we can't be reading the opcode again. Ensure that
the user_data and opcode fields are stable. For the user_data we already
have a place for it, for the opcode we can fill a one byte hole and store
that as well. For both of them, assign them when we originally read the
SQE in io_get_sqring(). Any code that uses sqe->opcode or sqe->user_data
is switched to req->opcode and req->user_data.
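
A hedged sketch of the idea (field names as in struct io_uring_sqe; not
necessarily the exact io_uring code):

	/* in io_get_sqring(), capture the SQE fields exactly once */
	req->opcode = READ_ONCE(sqe->opcode);
	req->user_data = READ_ONCE(sqe->user_data);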

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-12-17 19:57:27 -07:00
Jens Axboe
b29472ee7b io_uring: make IORING_OP_TIMEOUT_REMOVE deferrable
If we defer this command as part of a link, we have to make sure that
the SQE data has been read upfront. Integrate the timeout remove op into
the prep handling to make it safe for SQE reuse.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-12-17 19:57:27 -07:00
Jens Axboe
fbf23849b1 io_uring: make IORING_OP_CANCEL_ASYNC deferrable
If we defer this command as part of a link, we have to make sure that
the SQE data has been read upfront. Integrate the async cancel op into
the prep handling to make it safe for SQE reuse.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-12-17 19:57:27 -07:00
Jens Axboe
0969e783e3 io_uring: make IORING_POLL_ADD and IORING_POLL_REMOVE deferrable
If we defer these commands as part of a link, we have to make sure that
the SQE data has been read upfront. Integrate the poll add/remove into
the prep handling to make it safe for SQE reuse.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-12-17 19:57:27 -07:00
Pavel Begunkov
ffbb8d6b76 io_uring: make HARDLINK imply LINK
The rules are as follows: if IOSQE_IO_HARDLINK is specified, then it's a
link and there is no need to set IOSQE_IO_LINK separately, though it
could be there. Add proper check and ensure that IOSQE_IO_HARDLINK
implies IOSQE_IO_LINK.
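
A minimal sketch of the rule (not necessarily the exact check added):

	if (sqe_flags & IOSQE_IO_HARDLINK)
		sqe_flags |= IOSQE_IO_LINK;	/* HARDLINK implies LINK */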

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-12-17 19:57:27 -07:00
Jens Axboe
8ed8d3c3bc io_uring: any deferred command must have stable sqe data
We're currently not retaining sqe data for accept, fsync, and
sync_file_range. None of these commands need data outside of what
is directly provided, hence it can't go stale when the request is
deferred. However, it can get reused, if an application reuses
SQE entries.

Ensure that we retain the information we need and only read the sqe
contents once, off the submission path. Most of this is just moving
code into a prep and finish function.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-12-17 19:57:20 -07:00
Jens Axboe
fc4df999e2 io_uring: remove 'sqe' parameter to the OP helpers that take it
We pass in req->sqe for all of them, no need to pass it in as the
request is always passed in. This is a necessary prep patch to be
able to cleanup/fix the request prep path.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-12-17 19:57:20 -07:00
Jens Axboe
b7bb4f7da0 io_uring: fix pre-prepped issue with force_nonblock == true
Some of these code paths assume that any force_nonblock == true issue
is not prepped, but that's not true if we did prep as part of link setup
earlier. Check if we already have an async context allocated before
setting up a new one.

Cleanup the async context setup in general, we have a lot of duplicated
code there.

Fixes: 03b1230ca1 ("io_uring: ensure async punted sendmsg/recvmsg requests copy data")
Fixes: f67676d160 ("io_uring: ensure async punted read/write requests copy iovec")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-12-17 19:57:20 -07:00