When we create a new lockd client, we want to be able to pass the
correct credential of the process that created the struct nlm_host.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
If two different containers that share the same network namespace attempt
to mount the same filesystem, we should not allow them to share the same
super block if they do not share the same user namespace, since the
user mappings on the wire will need to differ.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
When mapping NFS identities, we want to substitute for the uids and
gids on the wire as we would for the AUTH_UNIX creds.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
When mapping NFS identities using the NFSv4 idmapper, we want to substitute
for the uids and gids that would normally go on the wire as part of a
NFSv3 request. So we use the same mapping in the NFSv4 upcall as we
use in the NFSv3 RPC call (i.e. the mapping stored in the rpc_clnt cred).
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
When mapping NFS identities, we want to substitute for the uids and
gids on the wire as we would for the AUTH_UNIX creds.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Store the credential of the mount process so that we can determine
information such as the user namespace.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
When converting kuids to AUTH_UNIX creds, etc we will want to use the
same user namespace as the process that created the rpc client.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
The lock context already references and tracks the open context, so
take the opportunity to save some space in struct nfs_page.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Add a helper for when we remove the explicit pointer to the open
context.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Force the lock context to keep a reference to the parent open
context so that we can guarantee the validity of the latter.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
If the server is unable to immediately execute an RPC call, and returns
an NFS4ERR_DELAY then we can assume it is safe to interrupt the operation
in order to handle ordinary signals. This allows the application to
service timer interrupts that would otherwise have to wait until the
server is again able to respond.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
When the client is reading or writing using pNFS, and hits an error
on the DS, then it typically sends a LAYOUTERROR and/or LAYOUTRETURN
to the MDS, before redirtying the failed pages, and going for a new
round of reads/writebacks. The problem is that if the server has no
way to fix the DS, then we may need a way to interrupt this loop
after a set number of attempts have been made.
This patch adds an optional module parameter that allows the admin
to specify how many times to retry the read/writeback process before
failing with a fatal error.
The default behaviour is to retry forever.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
All the callers of nfs_create_request() are now creating page group
heads, so we can remove the redundant 'last' page argument.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
We require all NFS I/O subrequests to duplicate the lock context as well
as the open context.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Replace the NFS custom error reporting mechanism with the generic
mapping_set_error().
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
vfs_fsync() has the side effect of clearing unreported writeback errors,
so we need to make sure that we do not abuse it in situations where
applications might not normally expect us to report those errors.
The solution is to replace calls to vfs_fsync() with calls to nfs_wb_all().
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
The NFS read code can trigger writeback while holding the page lock.
If an error then triggers a call to nfs_write_error_remove_page(),
we can deadlock.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
When flushing out dirty pages, the fact that we may hit fatal errors
is not a reason to stop writeback. Those errors are reported through
fsync(), not through the flush mechanism.
Fixes: a6598813a4 ("NFS: Don't write back further requests if there...")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Add a mount option that exposes the ETIMEDOUT errors that occur during
soft timeouts to the application. This allows aware applications to
distinguish between server disk IO errors and client timeout errors.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
When we introduce the 'softerr' mount option, we will see the RPC
layer returning ETIMEDOUT errors if the server is unresponsive. We
want to consider those errors to be fatal on par with the EIO errors
that are returned by ordinary 'soft' timeouts..
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Clean up the RPC task sleep interfaces by replacing the task->tk_timeout
'hidden parameter' to rpc_sleep_on() with a new function that takes an
absolute timeout.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
None of the callers set the 'action' argument, so let's just remove it.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
The RPC_TASK_KILLED flag should really not be set from another context
because it can clobber data in the struct task when task->tk_flags is
changed non-atomically.
Let's therefore swap out RPC_TASK_KILLED with an atomic flag, and add
a function to set that flag and safely wake up the task.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAly7HukQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpla6D/9y7YyAyKDgv/pVQgAlDYaGSXufvrK5iK/f
uFdSWPvGuWMbx+xy/4hfSX1pV9ZRv1aRJeFOkL/qVyr+4izKrgevwj+6Kl3/mCUO
dhiqF76bnaGXNQC6YDn1IgZp2Za+WGpeNlEhwcg20Ve11U7DVBhcL/n/6NYphtUG
V7ZFoVw+yjOO9GvkUeHx24HIQdC0JrABMoXYldl/tX3H9WjB3d1ncZDS45TuemXJ
lwm/S4nyaNaDzLnO7Hv51u3tCFpuaJbcgBdKuZB/oSWhU68D26/6peW+8qAvN+ec
htibFrK6KPRQCLNMEaV2njZEyprkL/BZJz4YukwmWB8GAtsuquy3Q3wJPounGmm5
7fCG/T1asqkurwhVcHOC07R6+d8AT5ARyJn3QYFmoYCIoSwObu6xhZHHAv7Ct1Xn
lrU4it0WkYbTXVI1l4CaRUtshCIQTZwr2EsgppjAsBc1+V2KgtbxR1wkQq2q9tQZ
Fa/2KTv9Y1+7FUOf09LEvTbuUgZn4I6u4E07QwY4miFsQSEUufirfHZ5t62lIgA9
3YzUrlVQSP1PbG8IP4aCSX2o+dxhL1Js6ukdZAxM6w9RtjLqWI3zTImSSMJobjna
SF53kkpv1xuJYT+Z1YmNGbMauzLs/HhCB9ww56TUuQYW/rTDASqFc48l7+vsfPrZ
sTEkShVGOw==
=d9Ws
-----END PGP SIGNATURE-----
Merge tag 'for-linus-20190420' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
"A set of small fixes that should go into this series. This contains:
- Removal of unused queue member (Hou)
- Overflow bvec fix (Ming)
- Various little io_uring tweaks (me)
- kthread parking
- Only call cpu_possible() for verified CPU
- Drop unused 'file' argument to io_file_put()
- io_uring_enter vs io_uring_register deadlock fix
- CQ overflow fix
- BFQ internal depth update fix (me)"
* tag 'for-linus-20190420' of git://git.kernel.dk/linux-block:
block: make sure that bvec length can't be overflow
block: kill all_q_node in request_queue
io_uring: fix CQ overflow condition
io_uring: fix possible deadlock between io_uring_{enter,register}
io_uring: drop io_file_put() 'file' argument
bfq: update internal depth state when queue depth changes
io_uring: only test SQPOLL cpu after we've verified it
io_uring: park SQPOLL thread if it's percpu
The core dumping code has always run without holding the mmap_sem for
writing, despite that is the only way to ensure that the entire vma
layout will not change from under it. Only using some signal
serialization on the processes belonging to the mm is not nearly enough.
This was pointed out earlier. For example in Hugh's post from Jul 2017:
https://lkml.kernel.org/r/alpine.LSU.2.11.1707191716030.2055@eggly.anvils
"Not strictly relevant here, but a related note: I was very surprised
to discover, only quite recently, how handle_mm_fault() may be called
without down_read(mmap_sem) - when core dumping. That seems a
misguided optimization to me, which would also be nice to correct"
In particular because the growsdown and growsup can move the
vm_start/vm_end the various loops the core dump does around the vma will
not be consistent if page faults can happen concurrently.
Pretty much all users calling mmget_not_zero()/get_task_mm() and then
taking the mmap_sem had the potential to introduce unexpected side
effects in the core dumping code.
Adding mmap_sem for writing around the ->core_dump invocation is a
viable long term fix, but it requires removing all copy user and page
faults and to replace them with get_dump_page() for all binary formats
which is not suitable as a short term fix.
For the time being this solution manually covers the places that can
confuse the core dump either by altering the vma layout or the vma flags
while it runs. Once ->core_dump runs under mmap_sem for writing the
function mmget_still_valid() can be dropped.
Allowing mmap_sem protected sections to run in parallel with the
coredump provides some minor parallelism advantage to the swapoff code
(which seems to be safe enough by never mangling any vma field and can
keep doing swapins in parallel to the core dumping) and to some other
corner case.
In order to facilitate the backporting I added "Fixes: 86039bd3b4e6"
however the side effect of this same race condition in /proc/pid/mem
should be reproducible since before 2.6.12-rc2 so I couldn't add any
other "Fixes:" because there's no hash beyond the git genesis commit.
Because find_extend_vma() is the only location outside of the process
context that could modify the "mm" structures under mmap_sem for
reading, by adding the mmget_still_valid() check to it, all other cases
that take the mmap_sem for reading don't need the new check after
mmget_not_zero()/get_task_mm(). The expand_stack() in page fault
context also doesn't need the new check, because all tasks under core
dumping are frozen.
Link: http://lkml.kernel.org/r/20190325224949.11068-1-aarcange@redhat.com
Fixes: 86039bd3b4 ("userfaultfd: add new syscall to provide memory externalization")
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reported-by: Jann Horn <jannh@google.com>
Suggested-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Jann Horn <jannh@google.com>
Acked-by: Jason Gunthorpe <jgg@mellanox.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-----BEGIN PGP SIGNATURE-----
iQIVAwUAXLGSOfu3V2unywtrAQKrLA/+MJfgZC9ejNsXqjnnqq48LtBf4B3FhYrO
qdPZMKPuy5XxSWBIIGpXtgK7PspkpAD1NCe7qpvibQoa1SJqwa2gc46aN7+gzp3e
HETkb4UEXNV5yXSjEA+YOQVa0MiLG2yL6satlqhN4KqoDB5Rj9KoaPFwT79K4SIU
1iME9Q//tA2M7SU8B1FWqyx5kd350A38KoXmHOux7x8K1ZyjEB2DbyKqZBtn6Y/T
nfH9ZZioep1JbPFwOuG5NYQH0hP2jUj8ogDnrzzWzn1rQQTWs6wVfigwAh8tc/Dv
eDKA8rV9WtZT97ExOG6cQOe0UZsM3NFhnb4P/oSCpl+jtr9WHO3pDmJPLToCr0BA
8xXW7vA+Ejx15dNz4Snnz9yj3FS0UeURSFyhn3g2AFQMK4v9Dd3ZZKOgjR9Czi98
f1UNh768LKY+siOHmoOhgZNlrI8bgamCmXOucPayGzrQ/HVHqAwR6PiUGCzaH+vk
+LfDC/Z/ezSqrfQiH/Ji5lLkQkhmYiUTcePn5xdi+kZGWqHwl9qGplVAFUVafs5O
Hz+H2F7i8F6Oc4v5hzCHp5aDUP2oEKmruaXJimN+O9SOkUPi72aG8STRpBFnBqf7
77WxNVG/XDFL6I8HukkgRMRePFDXi+2NwGOjB9k8PpAqyA1Vi6sSkCXVkwgAhLRS
NCaOR2iDaqE=
=cqBe
-----END PGP SIGNATURE-----
Merge tag 'afs-fixes-20190413' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
Pull AFS fixes from David Howells:
- Stop using the deprecated get_seconds().
- Don't make tracepoint strings const as the section they go in isn't
read-only.
- Differentiate failure due to unmarshalling from other failure cases.
We shouldn't abort with RXGEN_CC/SS_UNMARSHAL if it's not due to
unmarshalling.
- Add a missing unlock_page().
- Fix the interaction between receiving a notification from a server
that it has invalidated all outstanding callback promises and a
client call that we're in the middle of making that will get a new
promise.
* tag 'afs-fixes-20190413' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
afs: Fix in-progess ops to ignore server-level callback invalidation
afs: Unlock pages for __pagevec_release()
afs: Differentiate abort due to unmarshalling from other errors
afs: Avoid section confusion in CM_NAME
afs: avoid deprecated get_seconds()
-----BEGIN PGP SIGNATURE-----
iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAly2PPMACgkQiiy9cAdy
T1EbQQwArLGfousJE2mkQZIdAlefjeyCLJdFxOLktA3dd3EbmgOtxg8FSBt5W04X
UqQ6i2cGL46Uecp+EaOTy2o0o3lc7eRGA83+0CYY+JLNLSS9g+RNgK2kIkcnj3h8
N/fovUg3yhdLO0tmh13FcwBDAYPqjSd4iija1icBXSkkZdp592ims3zJaGunHMGn
s237jaLPzx7xxqNaGDD/MtQtpPDzZzPlVCszTnO3anGaSHB1R7HNhR9twWTF/mmz
bfhLiOyquzzt53XyaTmtpnzMua5X+Rhii2Pq315Pb+71JQ5TBYWRIOx/8QKwdGS+
9EiiiEgqYfwmcDl1KPnm1ppTF2++mYkFBFq43cgosagDaL8YXdqtCIrsVkTxXnmv
+bKhiXpriPzsy5INuRgfWmODrbl2QLRY1qbS4I/lA0+uZxFu7palwmSBkMOgcorn
0TnejxsajTtYBpkzYKOuiRN7z4ukvR4FJd6xaxqSkyX+Jlda4wnwPTRVyBxOcL06
EaAxdi7X
=vsJs
-----END PGP SIGNATURE-----
Merge tag '5.1-rc5-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6
Pull smb3 fixes from Steve French:
"Five small SMB3 fixes, all also for stable - an important fix for an
oplock (lease) bug, a handle leak, and three bugs spotted by KASAN"
* tag '5.1-rc5-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6:
CIFS: keep FileInfo handle live during oplock break
cifs: fix handle leak in smb2_query_symlink()
cifs: Fix lease buffer length error
cifs: Fix use-after-free in SMB2_read
cifs: Fix use-after-free in SMB2_write
This is a leftover from when the rings initially were not free flowing,
and hence a test for tail + 1 == head would indicate full. Since we now
let them wrap instead of mask them with the size, we need to check if
they drift more than the ring size from each other.
This fixes a case where we'd overwrite CQ ring entries, if the user
failed to reap completions. Both cases would ultimately result in lost
completions as the application violated the depth it asked for. The only
difference is that before this fix we'd return invalid entries for the
overflowed completions, instead of properly flagging it in the
cq_ring->overflow variable.
Reported-by: Stefan Bühler <source@stbuehler.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pull networking fixes from David Miller:
1) Handle init flow failures properly in iwlwifi driver, from Shahar S
Matityahu.
2) mac80211 TXQs need to be unscheduled on powersave start, from Felix
Fietkau.
3) SKB memory accounting fix in A-MDSU aggregation, from Felix Fietkau.
4) Increase RCU lock hold time in mlx5 FPGA code, from Saeed Mahameed.
5) Avoid checksum complete with XDP in mlx5, also from Saeed.
6) Fix netdev feature clobbering in ibmvnic driver, from Thomas Falcon.
7) Partial sent TLS record leak fix from Jakub Kicinski.
8) Reject zero size iova range in vhost, from Jason Wang.
9) Allow pending work to complete before clcsock release from Karsten
Graul.
10) Fix XDP handling max MTU in thunderx, from Matteo Croce.
11) A lot of protocols look at the sa_family field of a sockaddr before
validating it's length is large enough, from Tetsuo Handa.
12) Don't write to free'd pointer in qede ptp error path, from Colin Ian
King.
13) Have to recompile IP options in ipv4_link_failure because it can be
invoked from ARP, from Stephen Suryaputra.
14) Doorbell handling fixes in qed from Denis Bolotin.
15) Revert net-sysfs kobject register leak fix, it causes new problems.
From Wang Hai.
16) Spectre v1 fix in ATM code, from Gustavo A. R. Silva.
17) Fix put of BROPT_VLAN_STATS_PER_PORT in bridging code, from Nikolay
Aleksandrov.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (111 commits)
socket: fix compat SO_RCVTIMEO_NEW/SO_SNDTIMEO_NEW
tcp: tcp_grow_window() needs to respect tcp_space()
ocelot: Clean up stats update deferred work
ocelot: Don't sleep in atomic context (irqs_disabled())
net: bridge: fix netlink export of vlan_stats_per_port option
qed: fix spelling mistake "faspath" -> "fastpath"
tipc: set sysctl_tipc_rmem and named_timeout right range
tipc: fix link established but not in session
net: Fix missing meta data in skb with vlan packet
net: atm: Fix potential Spectre v1 vulnerabilities
net/core: work around section mismatch warning for ptp_classifier
net: bridge: fix per-port af_packet sockets
bnx2x: fix spelling mistake "dicline" -> "decline"
route: Avoid crash from dereferencing NULL rt->from
MAINTAINERS: normalize Woojung Huh's email address
bonding: fix event handling for stacked bonds
Revert "net-sysfs: Fix memory leak in netdev_register_kobject"
rtnetlink: fix rtnl_valid_stats_req() nlmsg_len check
qed: Fix the DORQ's attentions handling
qed: Fix missing DORQ attentions
...
In the oplock break handler, writing pending changes from pages puts
the FileInfo handle. If the refcount reaches zero it closes the handle
and waits for any oplock break handler to return, thus causing a deadlock.
To prevent this situation:
* We add a wait flag to cifsFileInfo_put() to decide whether we should
wait for running/pending oplock break handlers
* We keep an additionnal reference of the SMB FileInfo handle so that
for the rest of the handler putting the handle won't close it.
- The ref is bumped everytime we queue the handler via the
cifs_queue_oplock_break() helper.
- The ref is decremented at the end of the handler
This bug was triggered by xfstest 464.
Also important fix to address the various reports of
oops in smb2_push_mandatory_locks
Signed-off-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
CC: Stable <stable@vger.kernel.org>
If we enter smb2_query_symlink() for something that is not a symlink
and where the SMB2_open() would succeed we would never end up
closing this handle and would thus leak a handle on the server.
Fix this by immediately calling SMB2_close() on successfull open.
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
CC: Stable <stable@vger.kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
There is a KASAN slab-out-of-bounds:
BUG: KASAN: slab-out-of-bounds in _copy_from_iter_full+0x783/0xaa0
Read of size 80 at addr ffff88810c35e180 by task mount.cifs/539
CPU: 1 PID: 539 Comm: mount.cifs Not tainted 4.19 #10
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
rel-1.12.0-0-ga698c8995f-prebuilt.qemu.org 04/01/2014
Call Trace:
dump_stack+0xdd/0x12a
print_address_description+0xa7/0x540
kasan_report+0x1ff/0x550
check_memory_region+0x2f1/0x310
memcpy+0x2f/0x80
_copy_from_iter_full+0x783/0xaa0
tcp_sendmsg_locked+0x1840/0x4140
tcp_sendmsg+0x37/0x60
inet_sendmsg+0x18c/0x490
sock_sendmsg+0xae/0x130
smb_send_kvec+0x29c/0x520
__smb_send_rqst+0x3ef/0xc60
smb_send_rqst+0x25a/0x2e0
compound_send_recv+0x9e8/0x2af0
cifs_send_recv+0x24/0x30
SMB2_open+0x35e/0x1620
open_shroot+0x27b/0x490
smb2_open_op_close+0x4e1/0x590
smb2_query_path_info+0x2ac/0x650
cifs_get_inode_info+0x1058/0x28f0
cifs_root_iget+0x3bb/0xf80
cifs_smb3_do_mount+0xe00/0x14c0
cifs_do_mount+0x15/0x20
mount_fs+0x5e/0x290
vfs_kern_mount+0x88/0x460
do_mount+0x398/0x31e0
ksys_mount+0xc6/0x150
__x64_sys_mount+0xea/0x190
do_syscall_64+0x122/0x590
entry_SYSCALL_64_after_hwframe+0x44/0xa9
It can be reproduced by the following step:
1. samba configured with: server max protocol = SMB2_10
2. mount -o vers=default
When parse the mount version parameter, the 'ops' and 'vals'
was setted to smb30, if negotiate result is smb21, just
update the 'ops' to smb21, but the 'vals' is still smb30.
When add lease context, the iov_base is allocated with smb21
ops, but the iov_len is initiallited with the smb30. Because
the iov_len is longer than iov_base, when send the message,
copy array out of bounds.
we need to keep the 'ops' and 'vals' consistent.
Fixes: 9764c02fcb ("SMB3: Add support for multidialect negotiate (SMB2.1 and later)")
Fixes: d5c7076b77 ("smb3: add smb3.1.1 to default dialect list")
Signed-off-by: ZhangXiaoxu <zhangxiaoxu5@huawei.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
CC: Stable <stable@vger.kernel.org>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
There is a KASAN use-after-free:
BUG: KASAN: use-after-free in SMB2_read+0x1136/0x1190
Read of size 8 at addr ffff8880b4e45e50 by task ln/1009
Should not release the 'req' because it will use in the trace.
Fixes: eccb4422cf ("smb3: Add ftrace tracepoints for improved SMB3 debugging")
Signed-off-by: ZhangXiaoxu <zhangxiaoxu5@huawei.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
CC: Stable <stable@vger.kernel.org> 4.18+
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
There is a KASAN use-after-free:
BUG: KASAN: use-after-free in SMB2_write+0x1342/0x1580
Read of size 8 at addr ffff8880b6a8e450 by task ln/4196
Should not release the 'req' because it will use in the trace.
Fixes: eccb4422cf ("smb3: Add ftrace tracepoints for improved SMB3 debugging")
Signed-off-by: ZhangXiaoxu <zhangxiaoxu5@huawei.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
CC: Stable <stable@vger.kernel.org> 4.18+
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
- Avoid a crash scenario with architectures like powerpc that require
'pgtable_deposit' for the zero page.
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJctNryAAoJEB7SkWpmfYgCzcMP/37LJbb4SYNwnDIW4BF33ril
ZwtPeJJVTR56Ojo+Dy1v9084zeyhUHHewz0Oqx15dm6k/N5SS19yKNFKQDOK+4OC
zbaWD5UOtllU3RQ2ORUOUoqNGF278+h4VVVQMntVaHhdt5f120tgHXxmKoB5Z5zH
Gcy0vZNHoJ5lVYfKjKYG0b0/dWWOD1ZEjTkZjTa4DjhVSQcFauN8DxJ4hSyumYqs
HDnZZt44RTTUS5W3BTlhuaSEcZaDOznmyj1HmKXNg3ghxguKACho4xhA7xFKqT8O
03WZxDBFnOXZb3yfKpHB6RclkJgrtmD5U5GStzl5SobLPb2E/TPQzCRhZ/kcFPZ8
RE2JkgdGl8gqCDRqRsC/tbF3dETO66vxUyf5utNv0ttBk7qLMwTGTKm3VQz7Xvu2
SLkwv6Rlw4UT6ML8nd2kNhf8xRkaLl6j1B6zWDy7wEoFPXWW+My0PPpsJZcbTeza
eib2ood7AlPHRU0/mW2ZrGHGabbS6kNGeQlod9U5sikkE7ZA/LwzyFl4b/uCqYNP
NKGQdz0iHVcq8lFPXEmZ7vP2krd6uUWIv9KaiwmjBBMf9w3ZAzS85c7HFAZD0zgC
tTHm6stMhpdS3ndyIxMBf0sL7AB/Q9BH7jJwDK/P8QObovezW2zZ4CPx/gQYJ2XU
LTeCJmQh3xcCpI3f/eka
=ijtJ
-----END PGP SIGNATURE-----
Merge tag 'fsdax-fix-5.1-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
Pull fsdax fix from Dan Williams:
"A single filesystem-dax fix. It has been lingering in -next for a long
while and there are no other fsdax fixes on the horizon:
- Avoid a crash scenario with architectures like powerpc that require
'pgtable_deposit' for the zero page"
* tag 'fsdax-fix-5.1-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
fs/dax: Deposit pagetable even when installing zero page
If we have multiple threads, one doing io_uring_enter() while the other
is doing io_uring_register(), we can run into a deadlock between the
two. io_uring_register() must wait for existing users of the io_uring
instance to exit. But it does so while holding the io_uring mutex.
Callers of io_uring_enter() may need this mutex to make progress (and
eventually exit). If we wait for users to exit in io_uring_register(),
we can't do so with the io_uring mutex held without potentially risking
a deadlock.
Drop the io_uring mutex while waiting for existing callers to exit. This
is safe and guaranteed to make forward progress, since we already killed
the percpu ref before doing so. Hence later callers of io_uring_enter()
will be rejected.
Reported-by: syzbot+16dc03452dee970a0c3e@syzkaller.appspotmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Merge page ref overflow branch.
Jann Horn reported that he can overflow the page ref count with
sufficient memory (and a filesystem that is intentionally extremely
slow).
Admittedly it's not exactly easy. To have more than four billion
references to a page requires a minimum of 32GB of kernel memory just
for the pointers to the pages, much less any metadata to keep track of
those pointers. Jann needed a total of 140GB of memory and a specially
crafted filesystem that leaves all reads pending (in order to not ever
free the page references and just keep adding more).
Still, we have a fairly straightforward way to limit the two obvious
user-controllable sources of page references: direct-IO like page
references gotten through get_user_pages(), and the splice pipe page
duplication. So let's just do that.
* branch page-refs:
fs: prevent page refcount overflow in pipe_buf_get
mm: prevent get_user_pages() from overflowing page refcount
mm: add 'try_get_page()' helper function
mm: make page ref count overflow check tighter and more explicit
Change pipe_buf_get() to return a bool indicating whether it succeeded
in raising the refcount of the page (if the thing in the pipe is a page).
This removes another mechanism for overflowing the page refcount. All
callers converted to handle a failure.
Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Matthew Wilcox <willy@infradead.org>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since the fget/fput handling was reworked in commit 09bb839434, we
never call io_file_put() with state == NULL (and hence file != NULL)
anymore. Remove that case.
Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We currently call cpu_possible() even if we don't use the CPU. Move the
test under the SQ_AFF branch, which is the only place where we'll use
the value. Do the cpu_possible() test AFTER we've limited it to a max
of NR_CPUS. This avoids triggering the following warning:
WARNING: CPU: 1 PID: 7600 at include/linux/cpumask.h:121 cpu_max_bits_warn
if CONFIG_DEBUG_PER_CPU_MAPS is enabled.
While in there, also move the SQ thread idle period assignment inside
SETUP_SQPOLL, as we don't use it otherwise either.
Reported-by: syzbot+cd714a07c6de2bc34293@syzkaller.appspotmail.com
Fixes: 6c271ce2f1 ("io_uring: add submission polling")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAlyw354QHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpiMyEAC4THUReCrTuv9oFRNg5uILVYIq51nP8dw7
XamC7A92jPXd6vl/QVjmvLwT34/Y2XvX0t62RBsk849CEjgGYTeF1/qI3tMkpN7c
huupab3aYM/Rrv4i1KSPQu6iIto3DYqfmREaGJJ1Ikbu/CKDuUGyEo+Z4wrKUPon
GWnE8QMS2fdc764eVzKKqB+GryaEiHmeD1N4NnPs+nla14ysueUvJUikkTt/Laef
h7nOmz9mrqE6u1xVHNpo0TlW0oJdLfaDIL9ghwHFJXqvriTh8Tg2tEHpXI6vSTTt
StnPbTA1s1uhHs4rWYl8J0UXSZnRRp0Ep8jCvqEb9CJ23uHCNyGEoy/R7q+x2quf
T+ruolMXY7IIJP30ZMHar374YfajJdw7EH/565nlbLnjSBXhqjmc07kQ7mIYvpg6
JgureSdDwOOHpfrJgVq5es48ndt5HBYUBPzkvVGTgkeSJkMydkkM1qZeYEnai105
8EnUFusRUnYZtb73HBPjKS7i0BZZvZlI1oKYHabiMtajqcKyvwDP2tTmhqXYLDLY
9uloW0u2B0lddfzCb9hTYZOroNWfifo4vuSU5DHvnJoKvf4z3auDxaFD9N8fGn6S
aZsRjMCpFqFd0YEnZPbsctgPg2Licrs02uPntlzBTJ0ByH20pX4OepYrvgQk3vao
tOQ1jRYMKw==
=cISy
-----END PGP SIGNATURE-----
Merge tag 'for-linus-20190412' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
"Set of fixes that should go into this round. This pull is larger than
I'd like at this time, but there's really no specific reason for that.
Some are fixes for issues that went into this merge window, others are
not. Anyway, this contains:
- Hardware queue limiting for virtio-blk/scsi (Dongli)
- Multi-page bvec fixes for lightnvm pblk
- Multi-bio dio error fix (Jason)
- Remove the cache hint from the io_uring tool side, since we didn't
move forward with that (me)
- Make io_uring SETUP_SQPOLL root restricted (me)
- Fix leak of page in error handling for pc requests (Jérôme)
- Fix BFQ regression introduced in this merge window (Paolo)
- Fix break logic for bio segment iteration (Ming)
- Fix NVMe cancel request error handling (Ming)
- NVMe pull request with two fixes (Christoph):
- fix the initial CSN for nvme-fc (James)
- handle log page offsets properly in the target (Keith)"
* tag 'for-linus-20190412' of git://git.kernel.dk/linux-block:
block: fix the return errno for direct IO
nvmet: fix discover log page when offsets are used
nvme-fc: correct csn initialization and increments on error
block: do not leak memory in bio_copy_user_iov()
lightnvm: pblk: fix crash in pblk_end_partial_read due to multipage bvecs
nvme: cancel request synchronously
blk-mq: introduce blk_mq_complete_request_sync()
scsi: virtio_scsi: limit number of hw queues by nr_cpu_ids
virtio-blk: limit number of hw queues by nr_cpu_ids
block, bfq: fix use after free in bfq_bfqq_expire
io_uring: restrict IORING_SETUP_SQPOLL to root
tools/io_uring: remove IOCQE_FLAG_CACHEHIT
block: don't use for-inside-for in bio_for_each_segment_all
Highlights include:
Stable fixes:
- Fix a deadlock in close() due to incorrect draining of RDMA queues
Bugfixes:
- Revert "SUNRPC: Micro-optimise when the task is known not to be sleeping"
as it is causing stack overflows
- Fix a regression where NFSv4 getacl and fs_locations stopped working
- Forbid setting AF_INET6 to "struct sockaddr_in"->sin_family.
- Fix xfstests failures due to incorrect copy_file_range() return values
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJcsfeVAAoJEA4mA3inWBJcPjAQAIPERRVWjg7xRz6CJzt2yoM1
ApPj965DCnC9bGcGAH2U+TbCWJOi3lJwaZOPTL0ut/Tcv9PpKETRqk+rrjUcFRy1
1b1HH16GivprOmHgCRyqo5Qj2ZiaGNpY3tJfxl/6eIiSpHKPZLa4zY+q2KfK/YNI
SOVyNU0Gq08p4AiKr3CG5VVZGdNgRMrnzBYJqeTh1zZ7erWE2nJoE+pmvcLhZR0w
uxshbTWbJT21KLEI+PXTyGtFkz5jNaKy4Ts07MRBJdQjDv73MUW8CcqFZicSjtqx
zdKYa1VH9pEOjFOs57xGELSnYRdB00Vgd9/b6MqKyWH8iJzXFbgjEusMWiU45aeF
NLg9ySSU8LeY93SxV66CHG57NIgHqwZu6P+lO3efRzuHgEGceDsz0WwDF2KNIZlm
/vOmbk0I+woneFUeNDWAXD9/ETUJ8RCNk1/b1UlbkUL7aD5WSLDp1bKPifk/WA6E
Mtgwmqz1Vso3cIPglWcAgsfEAYJZSJVDMfRIhm2dy7vVU0nfW12I00G8BShgr8f7
mxAxd/V+1/Q9ftPENgC9z5LWKYQjfjksnYRHXW1m5c92Yoe9TF0yiNyDmT5hBR6w
MvUN2j3yeQBqk6JHZxtH/mmdSRD0o5kxvFrEqMj1PpP8X8DpWupQA8SZKnHq0wlj
8Q7LRum+wmhbiKCmZ+1F
=vRPB
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-5.1-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client bugfixes from Trond Myklebust:
"Highlights include:
Stable fix:
- Fix a deadlock in close() due to incorrect draining of RDMA queues
Bugfixes:
- Revert "SUNRPC: Micro-optimise when the task is known not to be
sleeping" as it is causing stack overflows
- Fix a regression where NFSv4 getacl and fs_locations stopped
working
- Forbid setting AF_INET6 to "struct sockaddr_in"->sin_family.
- Fix xfstests failures due to incorrect copy_file_range() return
values"
* tag 'nfs-for-5.1-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
Revert "SUNRPC: Micro-optimise when the task is known not to be sleeping"
NFSv4.1 fix incorrect return value in copy_file_range
xprtrdma: Fix helper that drains the transport
NFS: Fix handling of reply page vector
NFS: Forbid setting AF_INET6 to "struct sockaddr_in"->sin_family.
The in-kernel afs filesystem client counts the number of server-level
callback invalidation events (CB.InitCallBackState* RPC operations) that it
receives from the server. This is stored in cb_s_break in various
structures, including afs_server and afs_vnode.
If an inode is examined by afs_validate(), say, the afs_server copy is
compared, along with other break counters, to those in afs_vnode, and if
one or more of the counters do not match, it is considered that the
server's callback promise is broken. At points where this happens,
AFS_VNODE_CB_PROMISED is cleared to indicate that the status must be
refetched from the server.
afs_validate() issues an FS.FetchStatus operation to get updated metadata -
and based on the updated data_version may invalidate the pagecache too.
However, the break counters are also used to determine whether to note a
new callback in the vnode (which would set the AFS_VNODE_CB_PROMISED flag)
and whether to cache the permit data included in the YFSFetchStatus record
by the server.
The problem comes when the server sends us a CB.InitCallBackState op. The
first such instance doesn't cause cb_s_break to be incremented, but rather
causes AFS_SERVER_FL_NEW to be cleared - but thereafter, say some hours
after last use and all the volumes have been automatically unmounted and
the server has forgotten about the client[*], this *will* likely cause an
increment.
[*] There are other circumstances too, such as the server restarting or
needing to make space in its callback table.
Note that the server won't send us a CB.InitCallBackState op until we talk
to it again.
So what happens is:
(1) A mount for a new volume is attempted, a inode is created for the root
vnode and vnode->cb_s_break and AFS_VNODE_CB_PROMISED aren't set
immediately, as we don't have a nominated server to talk to yet - and
we may iterate through a few to find one.
(2) Before the operation happens, afs_fetch_status(), say, notes in the
cursor (fc.cb_break) the break counter sum from the vnode, volume and
server counters, but the server->cb_s_break is currently 0.
(3) We send FS.FetchStatus to the server. The server sends us back
CB.InitCallBackState. We increment server->cb_s_break.
(4) Our FS.FetchStatus completes. The reply includes a callback record.
(5) xdr_decode_AFSCallBack()/xdr_decode_YFSCallBack() check to see whether
the callback promise was broken by checking the break counter sum from
step (2) against the current sum.
This fails because of step (3), so we don't set the callback record
and, importantly, don't set AFS_VNODE_CB_PROMISED on the vnode.
This does not preclude the syscall from progressing, and we don't loop here
rechecking the status, but rather assume it's good enough for one round
only and will need to be rechecked next time.
(6) afs_validate() it triggered on the vnode, probably called from
d_revalidate() checking the parent directory.
(7) afs_validate() notes that AFS_VNODE_CB_PROMISED isn't set, so doesn't
update vnode->cb_s_break and assumes the vnode to be invalid.
(8) afs_validate() needs to calls afs_fetch_status(). Go back to step (2)
and repeat, every time the vnode is validated.
This primarily affects volume root dir vnodes. Everything subsequent to
those inherit an already incremented cb_s_break upon mounting.
The issue is that we assume that the callback record and the cached permit
information in a reply from the server can't be trusted after getting a
server break - but this is wrong since the server makes sure things are
done in the right order, holding up our ops if necessary[*].
[*] There is an extremely unlikely scenario where a reply from before the
CB.InitCallBackState could get its delivery deferred till after - at
which point we think we have a promise when we don't. This, however,
requires unlucky mass packet loss to one call.
AFS_SERVER_FL_NEW tries to paper over the cracks for the initial mount from
a server we've never contacted before, but this should be unnecessary.
It's also further insulated from the problem on an initial mount by
querying the server first with FS.GetCapabilities, which triggers the
CB.InitCallBackState.
Fix this by
(1) Remove AFS_SERVER_FL_NEW.
(2) In afs_calc_vnode_cb_break(), don't include cb_s_break in the
calculation.
(3) In afs_cb_is_broken(), don't include cb_s_break in the check.
Signed-off-by: David Howells <dhowells@redhat.com>
__pagevec_release() complains loudly if any page in the vector is still
locked. The pages need to be locked for generic_error_remove_page(), but
that function doesn't actually unlock them.
Unlock the pages afterwards.
Signed-off-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Tested-by: Jonathan Billings <jsbillin@umich.edu>
Differentiate an abort due to an unmarshalling error from an abort due to
other errors, such as ENETUNREACH. It doesn't make sense to set abort code
RXGEN_*_UNMARSHAL in such a case, so use RX_USER_ABORT instead.
Signed-off-by: David Howells <dhowells@redhat.com>
__tracepoint_str cannot be const because the tracepoint_str
section is not read-only. Remove the stray const.
Cc: dhowells@redhat.com
Cc: viro@zeniv.linux.org.uk
Signed-off-by: Andi Kleen <ak@linux.intel.com>
get_seconds() has a limited range on 32-bit architectures and is
deprecated because of that. While AFS uses the same limits for
its inode timestamps on the wire protocol, let's just use the
simpler current_time() as we do for other file systems.
This will still zero out the 'tv_nsec' field of the timestamps
internally.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David Howells <dhowells@redhat.com>
Check the state of the rxrpc call backing an afs call in each iteration of
the call wait loop in case the rxrpc call has already been terminated at
the rxrpc layer.
Interrupt the wait loop and mark the afs call as complete if the rxrpc
layer call is complete.
There were cases where rxrpc errors were not passed up to afs, which could
result in this loop waiting forever for an afs call to transition to
AFS_CALL_COMPLETE while the rx call was already complete.
Signed-off-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>