The debug message of decode_attr_lease_time incorrectly
says "file size". Fix it to "lease time".
Signed-off-by: Donald Buczek <buczek@molgen.mpg.de>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Some strings should be put into a sequence.
Thus use the corresponding function “seq_puts”.
This issue was detected by using the Coccinelle software.
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
A single character (line break) should be put into a sequence.
Thus use the corresponding function “seq_putc”.
This issue was detected by using the Coccinelle software.
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
This reverts commit be4c2d4723.
That commit caused a severe memory leak in nfs_readdir_make_qstr().
When listing a directory with more than 100 files (this is how many
struct nfs_cache_array_entry elements fit in one 4kB page), all
allocated file name strings past those 100 leak.
The root of the leakage is that those string pointers are managed in
pages which are never linked into the page cache.
fs/nfs/dir.c puts pages into the page cache by calling
read_cache_page(); the callback function nfs_readdir_filler() will
then fill the given page struct which was passed to it, which is
already linked in the page cache (by do_read_cache_page() calling
add_to_page_cache_lru()).
Commit be4c2d4723 added another (local) array of allocated pages, to
be filled with more data, instead of discarding excess items received
from the NFS server. Those additional pages can be used by the next
nfs_readdir_filler() call (from within the same nfs_readdir() call).
The leak happens when some of those additional pages are never used
(copied to the page cache using copy_highpage()). The pages will be
freed by nfs_readdir_free_pages(), but their contents will not. The
commit did not invoke nfs_readdir_clear_array() (and doing so would
have been dangerous, because it did not track which of those pages
were already copied to the page cache, risking double free bugs).
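The failure mode above can be modelled in userspace. This is only a sketch under stated assumptions: the toy_* names and the tiny ENTRIES_PER_PAGE are hypothetical and stand in for the real page and array handling in fs/nfs/dir.c. It shows that freeing a container which holds pointers to separately allocated names, without first clearing it, loses those names.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define ENTRIES_PER_PAGE 4	/* hypothetical; the commit cites about 100 per 4kB page */

/* Toy stand-in for a page that stores pointers to separately allocated names. */
struct toy_page {
	char *names[ENTRIES_PER_PAGE];
	int used;
};

static int live_names;		/* name allocations not yet freed */

static int toy_page_add(struct toy_page *p, const char *name)
{
	size_t len = strlen(name) + 1;
	char *copy;

	if (p->used == ENTRIES_PER_PAGE)
		return -1;
	copy = malloc(len);
	if (!copy)
		return -1;
	memcpy(copy, name, len);
	p->names[p->used++] = copy;
	live_names++;
	return 0;
}

/* Counterpart of clearing the array: frees what the page points to. */
static void toy_page_clear(struct toy_page *p)
{
	for (int i = 0; i < p->used; i++) {
		free(p->names[i]);
		live_names--;
	}
	p->used = 0;
}

/* Fill one page, optionally clear it, free it, and report the leaked names. */
int toy_leak_demo(int clear_first)
{
	struct toy_page *p = calloc(1, sizeof(*p));

	assert(p != NULL);
	live_names = 0;
	toy_page_add(p, "1_0123456789abcdef");
	toy_page_add(p, "2_0123456789abcdef");
	if (clear_first)
		toy_page_clear(p);
	free(p);		/* frees the page only, never its contents */
	return live_names;
}
```

In this model toy_leak_demo(0) reports two leaked names and toy_leak_demo(1) reports none, which corresponds to the difference between freeing the extra pages and clearing them first.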
How to reproduce the leak:
- Use a kernel with CONFIG_SLUB_DEBUG_ON.
- Create a directory on an NFS mount with more than 100 files with
names long enough to use the "kmalloc-32" slab (so we can easily
look up the allocation counts):
for i in `seq 110`; do touch ${i}_0123456789abcdef; done
- Drop all caches:
echo 3 >/proc/sys/vm/drop_caches
- Check the allocation counter:
grep nfs_readdir /sys/kernel/slab/kmalloc-32/alloc_calls
30564391 nfs_readdir_add_to_array+0x73/0xd0 age=534558/4791307/6540952 pid=370-1048386 cpus=0-47 nodes=0-1
- Request a directory listing and check the allocation counters again:
ls
[...]
grep nfs_readdir /sys/kernel/slab/kmalloc-32/alloc_calls
30564511 nfs_readdir_add_to_array+0x73/0xd0 age=207/4792999/6542663 pid=370-1048386 cpus=0-47 nodes=0-1
There are now 120 new allocations.
- Drop all caches and check the counters again:
echo 3 >/proc/sys/vm/drop_caches
grep nfs_readdir /sys/kernel/slab/kmalloc-32/alloc_calls
30564401 nfs_readdir_add_to_array+0x73/0xd0 age=735/4793524/6543176 pid=370-1048386 cpus=0-47 nodes=0-1
110 allocations are gone, but 10 have leaked and will never be freed.
Unhelpfully, those allocations are explicitly excluded from KMEMLEAK,
which is why my initial attempts with KMEMLEAK were not successful:
/*
* Avoid a kmemleak false positive. The pointer to the name is stored
* in a page cache page which kmemleak does not scan.
*/
kmemleak_not_leak(string->name);
It would be possible to solve this bug without reverting the whole
commit:
- keep track of which pages were not used, and call
nfs_readdir_clear_array() on them, or
- manually link those pages into the page cache
But for now I have decided to just revert the commit, because the real
fix would require complex considerations, risking more dangerous
(crash) bugs, which may seem unsuitable for the stable branches.
Signed-off-by: Max Kellermann <mk@cm4all.com>
Cc: stable@vger.kernel.org # v5.1+
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
NFSoRDMA client updates for 5.3
New features:
- Add a way to place MRs back on the free list
- Reduce context switching
- Add new trace events
Bugfixes and cleanups:
- Fix a BUG when tracing is enabled with NFSv4.1
- Fix a use-after-free in rpcrdma_post_recvs
- Replace use of xdr_stream_pos in rpcrdma_marshal_req
- Fix occasional transport deadlock
- Fix show_nfs_errors macros, other tracing improvements
- Remove RPCRDMA_REQ_F_PENDING and fr_state
- Various simplifications and refactors
When triggering an nfs_xdr_status trace point, record the task ID
and XID of the failing RPC to better pinpoint the problem.
This feels like a bit of a layering violation.
Suggested-by: Trond Myklebust <trondmy@hammerspace.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Add missing symbolic flag names and display flags variables in
hexadecimal to improve observability.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
For improved readability, add nfs_show_status() call-sites in the
generic NFS trace points so that the symbolic status code name is
displayed.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
I noticed that NFS status values stopped working again.
trace_print_symbols_seq() takes an unsigned long. Passing a negative
errno or negative NFSERR value just confuses it, and since we're
using C macros here and not static inline functions, all bets are
off due to implicit type conversion.
Straight-line the calling conventions so that error codes are stored
in the trace record as positive values in an unsigned long field,
mapped to symbolic as an unsigned long, and displayed as a negative
value, to continue to enable grepping on "error=-".
It's often the case that an error value that is positive is a byte
count but when it's negative, it's an error (e.g. nfs4_write). Fix
those cases so that the value that is eventually stored in the
error field is a positive NFS status or errno, or zero.
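A userspace sketch of the convention described above, assuming a simple lookup table in place of trace_print_symbols_seq(). The names lookup_status(), status_to_record() and format_status() are hypothetical and are not the trace macros themselves.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

struct sym {
	unsigned long value;
	const char *name;
};

/* Small illustrative table; the lookup key is an unsigned long. */
static const struct sym status_names[] = {
	{ 0, "OK" }, { EIO, "EIO" }, { ENOENT, "ENOENT" }, { EACCES, "EACCES" },
};

const char *lookup_status(unsigned long value)
{
	for (size_t i = 0; i < sizeof(status_names) / sizeof(status_names[0]); i++)
		if (status_names[i].value == value)
			return status_names[i].name;
	return "UNKNOWN";
}

/* Byte counts (positive) and success become 0; errors become positive. */
unsigned long status_to_record(long result)
{
	return result < 0 ? (unsigned long)(-result) : 0;
}

/* Display the stored positive value as a negative number. */
int format_status(char *buf, size_t len, unsigned long error)
{
	return snprintf(buf, len, "error=%ld (%s)", -(long)error,
			lookup_status(error));
}
```

Passing (unsigned long)-EIO straight to lookup_status() misses the table because the value wraps to a very large number, while status_to_record(-EIO) resolves to "EIO" and still formats with the "error=-" prefix.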
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Don't bail out before cleaning up a new allocation if the wait for
searching for a matching nfs client is interrupted. Otherwise the
allocation is leaked.
Reported-by: syzbot+7fe11b49c1cc30e3fce2@syzkaller.appspotmail.com
Fixes: 950a578c61 ("NFS: make nfs_match_client killable")
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
The NFS protocol doesn't support deduplication, so turn it off again.
Fixes: ce96e888fe ("Fix nfs4.2 return -EINVAL when do dedupe operation")
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
On the NFS client there is no low-impact way to determine the nfs4
lease time or whether the lease is expired, so add these to mountstats
with times displayed in seconds.
If the lease is not expired, display lease_expired=0. Otherwise,
display lease_expired=seconds_since_expired, similar to 'age:' line
in mountstats.
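The displayed value could be derived roughly as follows. This is a sketch only: the function and parameter names are hypothetical, and all times are assumed to be in seconds on the same clock.

```c
#include <assert.h>

/*
 * Returns 0 while the lease is still valid, otherwise the number of
 * seconds that have passed since it expired.
 */
unsigned long lease_expired_secs(unsigned long now,
				 unsigned long last_renewal,
				 unsigned long lease_time)
{
	unsigned long expiry = last_renewal + lease_time;

	assert(now >= last_renewal);	/* a renewal cannot lie in the future */
	return now > expiry ? now - expiry : 0;
}
```

With a 90 second lease renewed at t=1000, the sketch yields 0 at t=1050 and 10 at t=1100.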
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Now that the VM promises never to recurse back into the filesystem
layer on writeback, remove all the GFP_NOFS references etc from
the generic writeback code.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
With NFSv4.1, different network connections need to be explicitly
bound to a session. During session startup, this is not possible
so only a single connection must be used for session startup.
So add a task flag to disable the default round-robin choice of
connections (when nconnect > 1) and force the use of a single
connection.
Then use that flag on all requests for session management. For
consistency, include NFSv4.0 management (SETCLIENTID) and session
destruction.
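The selection logic can be pictured as below. This is a minimal sketch, not the SUNRPC code: TOY_TASK_NO_ROUND_ROBIN, struct conn_pool and pick_connection() are hypothetical names, and connection 0 is assumed to be the one used for session startup.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_TASK_NO_ROUND_ROBIN 0x1u	/* hypothetical flag name */

struct conn_pool {
	size_t nconnect;	/* number of connections to the server */
	size_t next;		/* next connection in round-robin order */
};

/* Returns the index of the connection a request should use. */
size_t pick_connection(struct conn_pool *pool, unsigned int task_flags)
{
	size_t chosen;

	assert(pool->nconnect > 0);
	if (task_flags & TOY_TASK_NO_ROUND_ROBIN)
		return 0;	/* session management stays on one connection */
	chosen = pool->next;
	pool->next = (pool->next + 1) % pool->nconnect;
	return chosen;
}
```

Requests carrying the flag do not advance the round-robin cursor, so ordinary traffic keeps cycling through all connections.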
Reported-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
If the user specifies the -onconnect=<number> mount option, and the transport
protocol is TCP, then set up <number> connections to the pNFS data server
as well. The connections will all go to the same IP address.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
If the user specifies the -onconnect=<number> mount option, and the transport
protocol is TCP, then set up <number> connections to the server. The
connections will all go to the same IP address.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Allow the user to specify that the client should use multiple connections
to the server. For the moment, this functionality will be limited to
TCP and to NFSv4.x (x>0).
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
In order to identify containers to the NFS client, we add a per-net
sysfs attribute that udev can fill with the appropriate identifier.
The identifier could be a unique hostname, but in most cases it
will probably be a persisted uuid.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
If the client detects that close-to-open cache consistency has been
violated, and that the file or directory has been changed on the
server, then do a cache invalidation when we're done working with
the file.
The reason we don't do an immediate cache invalidation is that we
want to avoid performance problems due to false positives. Also,
note that we cannot guarantee cache consistency in this situation
even if we do invalidate the cache.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
According to the open() manpage, Linux reserves the access mode 3
to mean "check for read and write permission on the file and return
a file descriptor that can't be used for reading or writing."
Currently, the NFSv4 code will ask the server to open the file,
and will use an incorrect share access mode of 0. Since it has
an incorrect share access mode, the client later forgets to send
a corresponding close, meaning it can leak stateids on the server.
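One way an access mode of 0 can arise is sketched below. It assumes the common "(flags + 1) & O_ACCMODE" idiom for turning open flags into read/write mode bits, and it assumes the Linux values of the O_* constants. Whether the NFSv4 path computes its share access exactly this way is not stated in this message. The toy_* names are hypothetical.

```c
#include <assert.h>
#include <fcntl.h>

#define TOY_FMODE_READ  0x1u
#define TOY_FMODE_WRITE 0x2u

/* Convert the access-mode bits of open flags into read/write mode bits. */
unsigned int toy_open_fmode(int flags)
{
	/* The sketch relies on O_RDONLY=0, O_WRONLY=1, O_RDWR=2. */
	assert(O_ACCMODE == 3);
	return (unsigned int)(flags + 1) & O_ACCMODE;
}
```

Under these assumptions O_RDONLY maps to TOY_FMODE_READ, O_WRONLY to TOY_FMODE_WRITE and O_RDWR to both, while access mode 3 wraps around to 0, i.e. neither read nor write.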
Fixes: ce4ef7c0a8 ("NFS: Split out NFS v4 file operations")
Cc: stable@vger.kernel.org # 3.6+
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
When mapping the NFSv4 context to an open mode and access mode,
we need to treat the FMODE_EXEC flag differently. For the open
mode, FMODE_EXEC means we need read share access. For the access
mode checking, we need to verify that the user actually has
execute access.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Pull vfs fixlet from Al Viro:
"Fix bogus default y in Kconfig (VALIDATE_FS_PARSER)
That thing should not be turned on by default, especially since it's
not quiet in case it finds no problems. Geert has sent the obvious fix
quite a few times, but it fell through the cracks"
* 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
fs: VALIDATE_FS_PARSER should default to n
Merge tag 'nfsd-5.2-2' of git://linux-nfs.org/~bfields/linux
Pull nfsd fixes from Bruce Fields:
"Two more quick bugfixes for nfsd: fixing a regression causing mount
failures on high-memory machines and fixing the DRC over RDMA"
* tag 'nfsd-5.2-2' of git://linux-nfs.org/~bfields/linux:
nfsd: Fix overflow causing non-working mounts on 1 TB machines
svcrdma: Ignore source port when computing DRC hash
CONFIG_VALIDATE_FS_PARSER is a debugging tool to check that the parser
tables are vaguely sane. It was set to default to 'Y' for the moment to
catch errors in upcoming fs conversion development.
Make sure it is not enabled by default in the final release of v5.1.
Fixes: 31d921c7fb ("vfs: Add configuration parser helpers")
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Merge tag 'dax-fix-5.2-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
Pull dax fix from Dan Williams:
"A single dax fix that has been soaking awaiting other fixes under
discussion to join it. As it is getting late in the cycle let's proceed
with this fix and save follow-on changes for post-v5.3-rc1.
- Fix xarray entry association for mixed mappings"
* tag 'dax-fix-5.2-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
dax: Fix xarray entry association for mixed mappings
When IOCB_CMD_POLL is used on a userfaultfd, aio_poll() disables IRQs
and takes kioctx::ctx_lock, then userfaultfd_ctx::fd_wqh.lock.
This may have to wait for userfaultfd_ctx::fd_wqh.lock to be released by
userfaultfd_ctx_read(), which in turn can be waiting for
userfaultfd_ctx::fault_pending_wqh.lock or
userfaultfd_ctx::event_wqh.lock.
But elsewhere the fault_pending_wqh and event_wqh locks are taken with
IRQs enabled. Since the IRQ handler may take kioctx::ctx_lock, lockdep
reports that a deadlock is possible.
Fix it by always disabling IRQs when taking the fault_pending_wqh and
event_wqh locks.
Commit ae62c16e10 ("userfaultfd: disable irqs when taking the
waitqueue lock") didn't fix this because it only accounted for the
fd_wqh lock, not the other locks nested inside it.
Link: http://lkml.kernel.org/r/20190627075004.21259-1-ebiggers@kernel.org
Fixes: bfe4037e72 ("aio: implement IOCB_CMD_POLL")
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reported-by: syzbot+fab6de82892b6b9c6191@syzkaller.appspotmail.com
Reported-by: syzbot+53c0b767f7ca0dc0c451@syzkaller.appspotmail.com
Reported-by: syzbot+a3accb352f9c22041cfa@syzkaller.appspotmail.com
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: <stable@vger.kernel.org> [4.19+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since commit 10a68cdf10 (nfsd: fix performance-limiting session
calculation) (Linux 5.1-rc1 and 4.19.31), shares from NFS servers with
1 TB of memory cannot be mounted anymore. The mount just hangs on the
client.
The gist of commit 10a68cdf10 is the change below.
-avail = clamp_t(int, avail, slotsize, avail/3);
+avail = clamp_t(int, avail, slotsize, total_avail/3);
Here are the macros.
#define min_t(type, x, y) __careful_cmp((type)(x), (type)(y), <)
#define clamp_t(type, val, lo, hi) min_t(type, max_t(type, val, lo), hi)
`total_avail` is 8,434,659,328 on the 1 TB machine. `clamp_t()` casts
the values to `int`, which for 32-bit integers can only hold values
−2,147,483,648 (−2^31) through 2,147,483,647 (2^31 − 1).
`avail` (in the function signature) is just 65536, so that no overflow
was happening. Before the commit the assignment would result in 21845,
and `num = 4`.
When using `total_avail`, it is causing the assignment to be
18446744072226137429 (printed as %lu), and `num` is then 4164608182.
My next guess is that `nfsd_drc_mem_used` is then exceeded, and the
server thinks there is no memory available any more for this client.
Updating the arguments of `clamp_t()` and `min_t()` to `unsigned long`
fixes the issue.
Now, `avail = 65536` (before commit 10a68cdf10 `avail = 21845`), but
`num = 4` remains the same.
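The arithmetic can be reproduced in userspace. This is a sketch with simplified stand-ins for the kernel macros (no type-safety checks); the toy_* and avail_* names and the slot size of 1024 used in the example are assumptions. Converting an out-of-range value to a signed 32-bit integer is implementation-defined in C; the sketch assumes the usual wrap-around behaviour of GCC and Clang.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified stand-ins for min_t(), max_t() and clamp_t(). */
#define toy_min_t(type, x, y) ((type)(x) < (type)(y) ? (type)(x) : (type)(y))
#define toy_max_t(type, x, y) ((type)(x) > (type)(y) ? (type)(x) : (type)(y))
#define toy_clamp_t(type, val, lo, hi) \
	toy_min_t(type, toy_max_t(type, val, lo), hi)

/* Clamp with a 32-bit signed type, as the broken code did. */
uint64_t avail_clamped_as_int(uint64_t avail, uint64_t slotsize,
			      uint64_t total_avail)
{
	/* total_avail / 3 no longer fits and turns negative after the cast. */
	return (uint64_t)toy_clamp_t(int32_t, avail, slotsize, total_avail / 3);
}

/* Clamp with a 64-bit unsigned type, as the fix does. */
uint64_t avail_clamped_as_ulong(uint64_t avail, uint64_t slotsize,
				uint64_t total_avail)
{
	return toy_clamp_t(uint64_t, avail, slotsize, total_avail / 3);
}
```

With avail = 65536 and total_avail = 8434659328, the upper bound total_avail/3 is 2811553109, which wraps to -1483414187 as a 32-bit int. The signed variant therefore returns 18446744072226137429 once converted back to a 64-bit unsigned value, and the unsigned variant returns 65536.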
Fixes: c54f24e338 (nfsd: fix performance-limiting session calculation)
Cc: stable@vger.kernel.org
Signed-off-by: Paul Menzel <pmenzel@molgen.mpg.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Merge tag '5.2-rc6-smb3-fix' of git://git.samba.org/sfrench/cifs-2.6
Pull cifs fix from Steve French:
"SMB3 fix (for stable as well) for crash mishandling one of the Windows
reparse point symlink tags"
* tag '5.2-rc6-smb3-fix' of git://git.samba.org/sfrench/cifs-2.6:
cifs: fix crash querying symlinks stored as reparse-points
sys_move_mount() crashes by dereferencing the pointer MNT_NS_INTERNAL,
a.k.a. ERR_PTR(-EINVAL), if the old mount is specified by fd for a
kernel object with an internal mount, such as a pipe or memfd.
Fix it by checking for this case and returning -EINVAL.
[AV: what we want is is_mounted(); use that instead of making the
condition even more convoluted]
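The idea behind the check can be modelled in userspace. This is a sketch only: the toy_* names are hypothetical, and the error-pointer encoding below merely imitates the kernel's ERR_PTR() convention, in which small negative errno values are stored in the top of the address range.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_ERRNO 4095

static inline void *toy_err_ptr(long error)
{
	return (void *)(intptr_t)error;
}

static inline bool toy_is_err_or_null(const void *p)
{
	return p == NULL || (uintptr_t)p >= (uintptr_t)-TOY_MAX_ERRNO;
}

struct toy_ns { int id; };
struct toy_mount { struct toy_ns *ns; };

/* Sentinel marking a kernel-internal mount (e.g. for a pipe). */
#define TOY_NS_INTERNAL ((struct toy_ns *)toy_err_ptr(-EINVAL))

bool toy_is_mounted(const struct toy_mount *m)
{
	return !toy_is_err_or_null(m->ns);
}

int toy_move_mount(const struct toy_mount *from)
{
	if (!toy_is_mounted(from))
		return -EINVAL;		/* never dereference the sentinel */
	(void)from->ns->id;		/* safe: ns is a real pointer here */
	return 0;
}
```

A mount whose namespace pointer is the sentinel is rejected with -EINVAL before the pointer is ever dereferenced.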
Reproducer:
#include <unistd.h>
#define __NR_move_mount 429
#define MOVE_MOUNT_F_EMPTY_PATH 0x00000004
int main()
{
int fds[2];
pipe(fds);
syscall(__NR_move_mount, fds[0], "", -1, "/", MOVE_MOUNT_F_EMPTY_PATH);
}
Reported-by: syzbot+6004acbaa1893ad013f0@syzkaller.appspotmail.com
Fixes: 2db154b3ea ("vfs: syscall: Add move_mount(2) to move mounts around")
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Merge tag 'xarray-5.2-rc6' of git://git.infradead.org/users/willy/linux-dax
Pull XArray fixes from Matthew Wilcox:
- Account XArray nodes for the page cache to the appropriate cgroup
(Johannes Weiner)
- Fix idr_get_next() when called under the RCU lock (Matthew Wilcox)
- Add a test for xa_insert() (Matthew Wilcox)
* tag 'xarray-5.2-rc6' of git://git.infradead.org/users/willy/linux-dax:
XArray tests: Add check_insert
idr: Fix idr_get_next race with idr_remove
mm: fix page cache convergence regression
Merge misc fixes from Andrew Morton:
"15 fixes"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
linux/kernel.h: fix overflow for DIV_ROUND_UP_ULL
mm, swap: fix THP swap out
fork,memcg: alloc_thread_stack_node needs to set tsk->stack
MAINTAINERS: add CLANG/LLVM BUILD SUPPORT info
mm/vmalloc.c: avoid bogus -Wmaybe-uninitialized warning
mm/page_idle.c: fix oops because end_pfn is larger than max_pfn
initramfs: fix populate_initrd_image() section mismatch
mm/oom_kill.c: fix uninitialized oc->constraint
mm: hugetlb: soft-offline: dissolve_free_huge_page() return zero on !PageHuge
mm: soft-offline: return -EBUSY if set_hwpoison_free_buddy_page() fails
signal: remove the wrong signal_pending() check in restore_user_sigmask()
fs/binfmt_flat.c: make load_flat_shared_library() work
mm/mempolicy.c: fix an incorrect rebind node in mpol_rebind_nodemask
fs/proc/array.c: allow reporting eip/esp for all coredumping threads
mm/dev_pfn: exclude MEMORY_DEVICE_PRIVATE while computing virtual address
Stable bugfixes:
- SUNRPC: Fix up calculation of client message length # 5.1+
- NFS/flexfiles: Use the correct TCP timeout for flexfiles I/O # 4.8+
Merge tag 'nfs-for-5.2-4' of git://git.linux-nfs.org/projects/anna/linux-nfs
Pull two more NFS client fixes from Anna Schumaker:
"These are both stable fixes.
One to calculate the correct client message length in the case of
partial transmissions. And the other to set the proper TCP timeout for
flexfiles"
* tag 'nfs-for-5.2-4' of git://git.linux-nfs.org/projects/anna/linux-nfs:
NFS/flexfiles: Use the correct TCP timeout for flexfiles I/O
SUNRPC: Fix up calculation of client message length
Merge tag 'ceph-for-5.2-rc7' of git://github.com/ceph/ceph-client
Pull ceph fix from Ilya Dryomov:
"A small fix for a potential -rc1 regression from Jeff"
* tag 'ceph-for-5.2-rc7' of git://github.com/ceph/ceph-client:
ceph: fix ceph_mdsc_build_path to not stop on first component
Merge tag 'for-linus-20190628' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
"Just two small fixes.
One from Paolo, fixing a silly mistake in BFQ. The other one is from
me, ensuring that we have ->file cleared in the io_uring request a bit
earlier. That avoids a use-before-free, if we encounter an error
before ->file is assigned"
* tag 'for-linus-20190628' of git://git.kernel.dk/linux-block:
block, bfq: fix operator in BFQQ_TOTALLY_SEEKY
io_uring: ensure req->file is cleared on allocation
This is the minimal fix for stable, I'll send cleanups later.
Commit 854a6ed568 ("signal: Add restore_user_sigmask()") introduced
the visible change which breaks user-space: a signal temporarily unblocked
by set_user_sigmask() can be delivered even if the caller returns
success or timeout.
Change restore_user_sigmask() to accept the additional "interrupted"
argument which should be used instead of signal_pending() check, and
update the callers.
Eric said:
: For clarity. I don't think this is required by posix, or fundamentally to
: remove the races in select. It is what linux has always done and we have
: applications who care so I agree this fix is needed.
:
: Further in any case where the semantic change that this patch rolls back
: (aka where allowing a signal to be delivered and the select like call to
: complete) would be advantage we can do as well if not better by using
: signalfd.
:
: Michael is there any chance we can get this guarantee of the linux
: implementation of pselect and friends clearly documented. The guarantee
: that if the system call completes successfully we are guaranteed that no
: signal that is unblocked by using sigmask will be delivered?
Link: http://lkml.kernel.org/r/20190604134117.GA29963@redhat.com
Fixes: 854a6ed568 ("signal: Add restore_user_sigmask()")
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reported-by: Eric Wong <e@80x24.org>
Tested-by: Eric Wong <e@80x24.org>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Deepa Dinamani <deepa.kernel@gmail.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: David Laight <David.Laight@ACULAB.COM>
Cc: <stable@vger.kernel.org> [5.0+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
load_flat_shared_library() is broken: It only calls load_flat_file() if
prepare_binprm() returns zero, but prepare_binprm() returns the number of
bytes read - so this only happens if the file is empty.
Instead, call into load_flat_file() if the number of bytes read is
non-negative. (Even if the number of bytes is zero - in that case,
load_flat_file() will see null bytes and return a nice -ENOEXEC.)
In addition, remove the code related to bprm creds and stop using
prepare_binprm() - this code is loading a library, not a main executable,
and it only actually uses the members "buf", "file" and "filename" of the
linux_binprm struct. Instead, call kernel_read() directly.
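The difference between the two conditions can be sketched as follows. The toy_* and load_library_* names and the outcome struct are hypothetical; read_result models a helper that returns the number of bytes read or a negative errno.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

struct outcome {
	bool loader_called;
	int result;
};

/* Toy loader: an empty buffer is not a valid flat binary. */
static int toy_load_flat_file(long bytes_read)
{
	return bytes_read == 0 ? -ENOEXEC : 0;
}

/* Broken condition: the loader only runs when zero bytes were read. */
struct outcome load_library_old(long read_result)
{
	struct outcome o = { false, (int)read_result };

	if (read_result == 0) {
		o.loader_called = true;
		o.result = toy_load_flat_file(read_result);
	}
	return o;
}

/* Fixed condition: any non-negative byte count reaches the loader. */
struct outcome load_library_new(long read_result)
{
	struct outcome o = { false, (int)read_result };

	if (read_result >= 0) {
		o.loader_called = true;
		o.result = toy_load_flat_file(read_result);
	}
	return o;
}
```

For a successful 128-byte read, the old condition never reaches the loader, while the new one does; read errors still bypass the loader in both.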
Link: http://lkml.kernel.org/r/20190524201817.16509-1-jannh@google.com
Fixes: 287980e49f ("remove lots of IS_ERR_VALUE abuses")
Signed-off-by: Jann Horn <jannh@google.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Kees Cook <keescook@chromium.org>
Cc: Nicolas Pitre <nicolas.pitre@linaro.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Greg Ungerer <gerg@linux-m68k.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
0a1eb2d474 ("fs/proc: Stop reporting eip and esp in /proc/PID/stat")
stopped reporting eip/esp and fd7d56270b ("fs/proc: Report eip/esp in
/prod/PID/stat for coredumping") reintroduced the feature to fix a
regression with userspace core dump handlers (such as minicoredumper).
Because PF_DUMPCORE is only set for the primary thread, this didn't fix
the original problem for secondary threads. Allow reporting the eip/esp
for all threads by checking for PF_EXITING as well. This is set for all
the other threads when they are killed. coredump_wait() waits for all the
tasks to become inactive before proceeding to invoke a core dumper.
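The changed condition amounts to testing two flag bits instead of one. A sketch, with hypothetical TOY_* names and flag values chosen for illustration only:

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_PF_EXITING  0x00000004u	/* illustrative value */
#define TOY_PF_DUMPCORE 0x00000200u	/* illustrative value */

/* Before: only the thread that dumps core gets eip/esp reported. */
bool report_regs_old(unsigned int task_flags)
{
	return (task_flags & TOY_PF_DUMPCORE) != 0;
}

/* After: threads killed as part of the dump qualify as well. */
bool report_regs_new(unsigned int task_flags)
{
	return (task_flags & (TOY_PF_EXITING | TOY_PF_DUMPCORE)) != 0;
}
```

A secondary thread that only has the exiting flag set is rejected by the old test and accepted by the new one; a thread with neither flag is rejected by both.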
Link: http://lkml.kernel.org/r/87y32p7i7a.fsf@linutronix.de
Link: http://lkml.kernel.org/r/20190522161614.628-1-jlu@pengutronix.de
Fixes: fd7d56270b ("fs/proc: Report eip/esp in /prod/PID/stat for coredumping")
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reported-by: Jan Luebbe <jlu@pengutronix.de>
Tested-by: Jan Luebbe <jlu@pengutronix.de>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix a typo where we're confusing the default TCP retrans value
(NFS_DEF_TCP_RETRANS) for the default TCP timeout value.
Fixes: 15d03055cf ("pNFS/flexfiles: Set reasonable default ...")
Cc: stable@vger.kernel.org # 4.8+
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
We never parsed/returned any data from .get_link() when the object is a
Windows reparse-point containing a symlink. This results in the VFS layer
oopsing when accessing an uninitialized buffer:
...
[ 171.407172] Call Trace:
[ 171.408039] readlink_copy+0x29/0x70
[ 171.408872] vfs_readlink+0xc1/0x1f0
[ 171.409709] ? readlink_copy+0x70/0x70
[ 171.410565] ? simple_attr_release+0x30/0x30
[ 171.411446] ? getname_flags+0x105/0x2a0
[ 171.412231] do_readlinkat+0x1b7/0x1e0
[ 171.412938] ? __ia32_compat_sys_newfstat+0x30/0x30
...
Fix this by adding code to handle these buffers and make sure we do return
a valid buffer to .get_link().
CC: Stable <stable@vger.kernel.org>
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Merge tag 'for-linus-20190627' of gitolite.kernel.org:pub/scm/linux/kernel/git/brauner/linux
Pull pidfd fixes from Christian Brauner:
"Userspace tools and libraries such as strace or glibc need a cheap and
reliable way to tell whether CLONE_PIDFD is supported. The easiest way
is to pass an invalid fd value in the return argument, perform the
syscall and verify the value in the return argument has been changed
to a valid fd.
However, if CLONE_PIDFD is specified we currently check if pidfd == 0
and return EINVAL if not.
The check for pidfd == 0 was originally added to enable us to abuse
the return argument for passing additional flags along with
CLONE_PIDFD in the future.
However, extending legacy clone this way would be a terrible idea and
with clone3 on the horizon and the ability to reuse CLONE_DETACHED
with CLONE_PIDFD there's no real need for this crutch. So remove the
pidfd == 0 check and help userspace out.
Also, according to Al, anon_inode_getfd() should only be used past the
point of no failure and ksys_close() should not be used at all since
it is far too easy to get wrong. Al's motto being "basically, once
it's in descriptor table, it's out of your control". So Al's patch
switches back to what we already had in v1 of the original patchset
and uses an anon_inode_getfile() + put_user() + fd_install() sequence
in the success path and a fput() + put_unused_fd() in the failure
path.
The other two changes should be trivial"
* tag 'for-linus-20190627' of gitolite.kernel.org:pub/scm/linux/kernel/git/brauner/linux:
proc: remove useless d_is_dir() check
copy_process(): don't use ksys_close() on cleanups
samples: make pidfd-metadata fail gracefully on older kernels
fork: don't check parent_tidptr with CLONE_PIDFD
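
The probing technique described in the pull message can be sketched in
userspace roughly as follows. This is a minimal illustration, not code from
the series: probe_clone_pidfd() is a hypothetical helper name, and the raw
clone argument order (parent_tid as the third argument) is assumed to match
x86-64 and arm64; other architectures may differ.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef CLONE_PIDFD
#define CLONE_PIDFD 0x00001000	/* value from the Linux UAPI headers */
#endif

/*
 * Hypothetical helper: returns 1 if the kernel replaced the invalid
 * sentinel with a pidfd, 0 if clone succeeded but left the sentinel
 * untouched, and -1 if clone itself failed.
 */
static int probe_clone_pidfd(void)
{
	int pidfd = -1;		/* deliberately invalid fd value */
	long pid;

	/*
	 * A raw clone with a NULL stack behaves like fork(). The third
	 * argument is assumed to be parent_tid (x86-64/arm64 layout),
	 * which is where CLONE_PIDFD stores the new descriptor.
	 */
	pid = syscall(SYS_clone, (unsigned long)(CLONE_PIDFD | SIGCHLD),
		      NULL, &pidfd, NULL, NULL);
	if (pid < 0)
		return -1;
	if (pid == 0)
		_exit(0);	/* child: nothing to do */

	waitpid((pid_t)pid, NULL, 0);	/* parent: reap the child */
	if (pidfd < 0)
		return 0;	/* sentinel unchanged */
	close(pidfd);
	return 1;
}
```

A caller would treat a return value of 1 as "CLONE_PIDFD is supported" and
anything else as "not supported or not determinable".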
Merge tag 'afs-fixes-20190620' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
Pull AFS fixes from David Howells:
"The in-kernel AFS client has been undergoing testing on opendev.org on
one of their mirror machines. They are using AFS to hold data that is
then served via apache, and Ian Wienand had reported seeing oopses,
spontaneous machine reboots and updates to volumes going missing. This
patch series appears to have fixed the problem, very probably due to
patch (2), but it's not 100% certain.
(1) Fix the printing of the "vnode modified" warning to exclude checks
on files for which we don't have a callback promise from the
server (and so don't expect the server to tell us when it
changes).
Without this, for every file or directory for which we still have
an in-core inode that gets changed on the server, we may get a
message logged when we next look at it. This can happen in bulk
if, for instance, someone does "vos release" to update a R/O
volume from a R/W volume and a whole set of files are all changed
together.
We only really want to log a message if the file changed and the
server didn't tell us about it or we failed to track the state
internally.
(2) Fix accidental corruption of either afs_vlserver struct objects or
the following memory locations (which could hold anything).
The issue is caused by a union that points to two different
structs in struct afs_call (to save space in the struct). The call
cleanup code assumes that it can simply call the cleanup for one
of those structs if not NULL - when it might be actually pointing
to the other struct.
This means that every Volume Location RPC op is going to corrupt
something.
(3) Fix an uninitialised spinlock. This isn't too bad, it just causes
a one-off warning if lockdep is enabled when "vos release" is
called, but the spinlock still behaves correctly.
(4) Fix the setting of i_blocks in the inode. The incorrect value causes
du, for example, to produce incorrect results, but otherwise should
not be dangerous to the kernel"
* tag 'afs-fixes-20190620' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
afs: Fix setting of i_blocks
afs: Fix uninitialised spinlock afs_volume::cb_break_lock
afs: Fix vlserver record corruption
afs: Fix over zealous "vnode modified" warnings
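
The class of bug described in item (2) can be illustrated with a small
userspace sketch. All names below (struct call, call_cleanup(), the
TARGET_* tags and the two ref structs) are hypothetical and are not the AFS
code. Recording which union member is live is one common way to make
cleanup safe; it is not necessarily the approach the actual patch takes.

```c
#include <assert.h>
#include <stddef.h>

/* Two unrelated object types that happen to share a pointer slot. */
struct server_ref { int refcount; };
struct vlserver_ref { int refcount; };

/* Tag recording which union member is currently valid. */
enum call_target { TARGET_NONE, TARGET_SERVER, TARGET_VLSERVER };

struct call {
	enum call_target kind;
	union {
		struct server_ref *server;
		struct vlserver_ref *vlserver;
	} u;
};

static void put_server(struct server_ref *s)
{
	assert(s->refcount > 0);
	s->refcount--;
}

static void put_vlserver(struct vlserver_ref *v)
{
	assert(v->refcount > 0);
	v->refcount--;
}

/*
 * Cleanup dispatches on the tag. Testing "u.server != NULL" alone would
 * also be true while u.vlserver is the live member, and the wrong put
 * routine would then modify an object of the other type.
 */
static void call_cleanup(struct call *c)
{
	switch (c->kind) {
	case TARGET_SERVER:
		put_server(c->u.server);
		break;
	case TARGET_VLSERVER:
		put_vlserver(c->u.vlserver);
		break;
	case TARGET_NONE:
		break;
	}
	c->kind = TARGET_NONE;
	c->u.server = NULL;
}
```

With the tag in place, cleaning up a call that holds a vlserver reference
only touches the vlserver object and leaves any server object alone.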