There is a memory leak in case genlmsg_put fails.
Fix this by freeing *args* before return.
Addresses-Coverity-ID: 1476406 ("Resource leak")
Fixes: 46273cf7e0 ("tipc: fix a missing check of genlmsg_put")
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Muyu Yu provided a POC where user root with CAP_NET_ADMIN can create a CAN
frame modification rule that makes the data length code a higher value than
the available CAN frame data size. In combination with a configured checksum
calculation where the result is stored relatively to the end of the data
(e.g. cgw_csum_xor_rel) the tail of the skb (e.g. frag_list pointer in
skb_shared_info) can be rewritten which finally can cause a system crash.
Michael Kubecek suggested to drop frames that have a DLC exceeding the
available space after the modification process and provided a patch that can
handle CAN FD frames too. Within this patch we also limit the length for the
checksum calculations to the maximum of Classic CAN data length (8).
CAN frames that are dropped by these additional checks are counted with the
CGW_DELETED counter which indicates misconfigurations in can-gw rules.
This fixes CVE-2019-3701.
Reported-by: Muyu Yu <ieatmuttonchuan@gmail.com>
Reported-by: Marcus Meissner <meissner@suse.de>
Suggested-by: Michal Kubecek <mkubecek@suse.cz>
Tested-by: Muyu Yu <ieatmuttonchuan@gmail.com>
Tested-by: Oliver Hartkopp <socketcan@hartkopp.net>
Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Cc: linux-stable <stable@vger.kernel.org> # >= v3.2
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
I realized the last patch calls dev_get_by_index_rcu in a branch not
holding the rcu lock. Add the calls to rcu_read_lock and rcu_read_unlock.
Fixes: ec90ad3349 ("ipv6: Consider sk_bound_dev_if when binding a socket to a v4 mapped address")
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Similar to c5ee066333 ("ipv6: Consider sk_bound_dev_if when binding a
socket to an address"), binding a socket to v4 mapped addresses needs to
consider if the socket is bound to a device.
This problem also exists from the beginning of git history.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
syzbot was able to crash one host with the following stack trace :
kasan: GPF could be caused by NULL-ptr deref or user memory access
general protection fault: 0000 [#1] PREEMPT SMP KASAN
CPU: 0 PID: 8625 Comm: syz-executor4 Not tainted 4.20.0+ #8
RIP: 0010:dev_net include/linux/netdevice.h:2169 [inline]
RIP: 0010:icmp6_send+0x116/0x2d30 net/ipv6/icmp.c:426
icmpv6_send
smack_socket_sock_rcv_skb
security_sock_rcv_skb
sk_filter_trim_cap
__sk_receive_skb
dccp_v6_do_rcv
release_sock
This is because a RX packet found socket owned by user and
was stored into socket backlog. Before leaving RCU protected section,
skb->dev was cleared in __sk_receive_skb(). When socket backlog
was finally handled at release_sock() time, skb was fed to
smack_socket_sock_rcv_skb() then icmp6_send()
We could fix the bug in smack_socket_sock_rcv_skb(), or simply
make icmp6_send() more robust against such possibility.
In the future we might provide to icmp6_send() the net pointer
instead of infering it.
Fixes: d66a8acbda ("Smack: Inform peer that IPv6 traffic has been blocked")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Piotr Sawicki <p.sawicki2@partner.samsung.com>
Cc: Casey Schaufler <casey@schaufler-ca.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Acked-by: Casey Schaufler <casey@schaufler-ca.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
I forgot to deal with IPv6 in commit 11789039da ("fou: Prevent unbounded
recursion in GUE error handler").
Now syzbot reported what might be the same type of issue, caused by
gue6_err(), that is, handling exceptions for direct UDP encapsulation in
GUE (UDP-in-UDP) leads to unbounded recursion in the GUE exception
handler.
As it probably doesn't make sense to set up GUE this way, and it's
currently not even possible to configure this, skip exception handling for
UDP (or UDP-Lite) packets encapsulated in UDP (or UDP-Lite) packets with
GUE on IPv6.
Reported-by: syzbot+4ad25edc7a33e4ab91e0@syzkaller.appspotmail.com
Reported-by: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
Fixes: b8a51b38e4 ("fou, fou6: ICMP error handlers for FoU and GUE")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In commit 11789039da ("fou: Prevent unbounded recursion in GUE error
handler"), I didn't take care of the case where UDP-Lite is encapsulated
into UDP or UDP-Lite with GUE. From a syzbot report about a possibly
similar issue with GUE on IPv6, I just realised the same thing might
happen with a UDP-Lite inner payload.
Also skip exception handling for inner UDP-Lite protocol.
Fixes: 11789039da ("fou: Prevent unbounded recursion in GUE error handler")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The previous commit fa642f0883
("openvswitch: Derive IP protocol number for IPv6 later frags")
introduces IP protocol number parsing for IPv6 later frags that can mess
up the network header length calculation logic, i.e. nh_len < 0.
However, the network header length calculation is mainly for deriving
the transport layer header in the key extraction process which the later
fragment does not apply.
Therefore, this commit skips the network header length calculation to
fix the issue.
Reported-by: Chris Mi <chrism@mellanox.com>
Reported-by: Greg Rose <gvrose8192@gmail.com>
Fixes: fa642f0883 ("openvswitch: Derive IP protocol number for IPv6 later frags")
Signed-off-by: Yi-Hung Wei <yihung.wei@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit dcda9b0471 ("mm, tree wide: replace __GFP_REPEAT by
__GFP_RETRY_MAYFAIL with more useful semantic") replaced __GFP_REPEAT in
alloc_skb_with_frags() with __GFP_RETRY_MAYFAIL when the allocation may
directly reclaim.
The previous behavior would require reclaim up to 1 << order pages for
skb aligned header_len of order > PAGE_ALLOC_COSTLY_ORDER before failing,
otherwise the allocations in alloc_skb() would loop in the page allocator
looking for memory. __GFP_RETRY_MAYFAIL makes both allocations failable
under memory pressure, including for the HEAD allocation.
This can cause, among many other things, write() to fail with ENOTCONN
during RPC when under memory pressure.
These allocations should succeed as they did previous to dcda9b0471
even if it requires calling the oom killer and additional looping in the
page allocator to find memory. There is no way to specify the previous
behavior of __GFP_REPEAT, but it's unlikely to be necessary since the
previous behavior only guaranteed that 1 << order pages would be reclaimed
before failing for order > PAGE_ALLOC_COSTLY_ORDER. That reclaim is not
guaranteed to be contiguous memory, so repeating for such large orders is
usually not beneficial.
Removing the setting of __GFP_RETRY_MAYFAIL to restore the previous
behavior, specifically not allowing alloc_skb() to fail for small orders
and oom kill if necessary rather than allowing RPCs to fail.
Fixes: dcda9b0471 ("mm, tree wide: replace __GFP_REPEAT by __GFP_RETRY_MAYFAIL with more useful semantic")
Signed-off-by: David Rientjes <rientjes@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This commit fixes a regression in AF_INET/RTM_GETADDR and
AF_INET6/RTM_GETADDR.
Before this commit, the kernel would stop dumping addresses once the first
skb was full and end the stream with NLMSG_DONE(-EMSGSIZE). The error
shouldn't be sent back to netlink_dump so the callback is kept alive. The
userspace is expected to call back with a new empty skb.
Changes from V1:
- The error is not handled in netlink_dump anymore but rather in
inet_dump_ifaddr and inet6_dump_addr directly as suggested by
David Ahern.
Fixes: d7e38611b8 ("net/ipv4: Put target net when address dump fails due to bad attributes")
Fixes: 242afaa696 ("net/ipv6: Put target net when address dump fails due to bad attributes")
Cc: David Ahern <dsahern@gmail.com>
Cc: "David S . Miller" <davem@davemloft.net>
Cc: netdev@vger.kernel.org
Signed-off-by: Arthur Gautier <baloo@gandi.net>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Nobody has actually used the type (VERIFY_READ vs VERIFY_WRITE) argument
of the user address range verification function since we got rid of the
old racy i386-only code to walk page tables by hand.
It existed because the original 80386 would not honor the write protect
bit when in kernel mode, so you had to do COW by hand before doing any
user access. But we haven't supported that in a long time, and these
days the 'type' argument is a purely historical artifact.
A discussion about extending 'user_access_begin()' to do the range
checking resulted this patch, because there is no way we're going to
move the old VERIFY_xyz interface to that model. And it's best done at
the end of the merge window when I've done most of my merges, so let's
just get this done once and for all.
This patch was mostly done with a sed-script, with manual fix-ups for
the cases that weren't of the trivial 'access_ok(VERIFY_xyz' form.
There were a couple of notable cases:
- csky still had the old "verify_area()" name as an alias.
- the iter_iov code had magical hardcoded knowledge of the actual
values of VERIFY_{READ,WRITE} (not that they mattered, since nothing
really used it)
- microblaze used the type argument for a debug printout
but other than those oddities this should be a total no-op patch.
I tried to fix up all architectures, did fairly extensive grepping for
access_ok() uses, and the changes are trivial, but I may have missed
something. Any missed conversion should be trivially fixable, though.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull networking fixes from David Miller:
"Several fixes here. Basically split down the line between newly
introduced regressions and long existing problems:
1) Double free in tipc_enable_bearer(), from Cong Wang.
2) Many fixes to nf_conncount, from Florian Westphal.
3) op->get_regs_len() can throw an error, check it, from Yunsheng
Lin.
4) Need to use GFP_ATOMIC in *_add_hash_mac_address() of fsl/fman
driver, from Scott Wood.
5) Inifnite loop in fib_empty_table(), from Yue Haibing.
6) Use after free in ax25_fillin_cb(), from Cong Wang.
7) Fix socket locking in nr_find_socket(), also from Cong Wang.
8) Fix WoL wakeup enable in r8169, from Heiner Kallweit.
9) On 32-bit sock->sk_stamp is not thread-safe, from Deepa Dinamani.
10) Fix ptr_ring wrap during queue swap, from Cong Wang.
11) Missing shutdown callback in hinic driver, from Xue Chaojing.
12) Need to return NULL on error from ip6_neigh_lookup(), from Stefano
Brivio.
13) BPF out of bounds speculation fixes from Daniel Borkmann"
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (57 commits)
ipv6: Consider sk_bound_dev_if when binding a socket to an address
ipv6: Fix dump of specific table with strict checking
bpf: add various test cases to selftests
bpf: prevent out of bounds speculation on pointer arithmetic
bpf: fix check_map_access smin_value test when pointer contains offset
bpf: restrict unknown scalars of mixed signed bounds for unprivileged
bpf: restrict stack pointer arithmetic for unprivileged
bpf: restrict map value pointer arithmetic for unprivileged
bpf: enable access to ax register also from verifier rewrite
bpf: move tmp variable into ax register in interpreter
bpf: move {prev_,}insn_idx into verifier env
isdn: fix kernel-infoleak in capi_unlocked_ioctl
ipv6: route: Fix return value of ip6_neigh_lookup() on neigh_create() error
net/hamradio/6pack: use mod_timer() to rearm timers
net-next/hinic:add shutdown callback
net: hns3: call hns3_nic_net_open() while doing HNAE3_UP_CLIENT
ip: validate header length on virtual device xmit
tap: call skb_probe_transport_header after setting skb->dev
ptr_ring: wrap back ->producer in __ptr_ring_swap_queue()
net: rds: remove unnecessary NULL check
...
IPv6 does not consider if the socket is bound to a device when binding
to an address. The result is that a socket can be bound to eth0 and then
bound to the address of eth1. If the device is a VRF, the result is that
a socket can only be bound to an address in the default VRF.
Resolve by considering the device if sk_bound_dev_if is set.
This problem exists from the beginning of git history.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Dump of a specific table with strict checking enabled is looping. The
problem is that the end of the table dump is not marked in the cb. When
dumping a specific table, cb args 0 and 1 are not used (they are the hash
index and entry with an hash table index when dumping all tables). Re-use
args[0] to hold a 'done' flag for the specific table dump.
Fixes: 13e38901d4 ("net/ipv6: Plumb support for filtering route dumps")
Reported-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Note that there is a conflict with the rdma tree in this pull request, since
we delete a file that has been changed in the rdma tree. Hopefully that's
easy enough to resolve!
We also were unable to track down a maintainer for Neil Brown's changes to
the generic cred code that are prerequisites to his RPC cred cleanup patches.
We've been asking around for several months without any response, so
hopefully it's okay to include those patches in this pull request.
Stable bugfixes:
- xprtrdma: Yet another double DMA-unmap # v4.20
Features:
- Allow some /proc/sys/sunrpc entries without CONFIG_SUNRPC_DEBUG
- Per-xprt rdma receive workqueues
- Drop support for FMR memory registration
- Make port= mount option optional for RDMA mounts
Other bugfixes and cleanups:
- Remove unused nfs4_xdev_fs_type declaration
- Fix comments for behavior that has changed
- Remove generic RPC credentials by switching to 'struct cred'
- Fix crossing mountpoints with different auth flavors
- Various xprtrdma fixes from testing and auditing the close code
- Fixes for disconnect issues when using xprtrdma with krb5
- Clean up and improve xprtrdma trace points
- Fix NFS v4.2 async copy reboot recovery
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAlwtO50ACgkQ18tUv7Cl
QOtZWQ//e5Hhp2TnQZ6U+99YKedjwBHP6psH3GKSEdeHSNdlSpZ5ckgHxvMb9TBa
6t4ecgv5P/uYLIePQ0u2ubUFc9+TlyGi7Iacx13/YhK7kihGHDPnZhfl0QbYixV7
rwa9bFcKmOrXs8ld+Hw3P2UL22G1gMf/LHDhPNshbW7LFZmcshKz+mKTk70kwkq9
v7tFC59p6GwV8Sr2YI2NXn2fOWsUS00sQfgj2jceJYJ8PsNa+wHYF4wPj2IY5NsE
D5Oq2kLPbytBhCllOHgopNZaf4qb5BfqhVETyc1O+kDF3BZKUhQ1PoDi2FPinaHM
5/d8hS+5fr3eMBsQrPWQLXYjWQFUXnkQQJvU3Bo52AIgomsk/8uBq3FvH7XmFcBd
C8sgnuUAkAS8feMes8GCS50BTxclnGuYGdyFJyCRXoG9Kn9rMrw9EKitky6EVq0v
NmXhW79jK84a3yDXVlAIpZ8Y9BU/HQ3GviGX8lQEdZU9YiYRzDIHvpMFwzMgqaBi
XvLbr8PlLOm8GZokThS8QYT/G2Wu6IwfUq/AufVjVD4+HiL3duKKfWSGAvcm6aAa
GoRF6UG+OmjWlzKojtRc1dI+sy22Fzh+DW+Mx6tuf/b/66wkmYnW7eKcV4rt6Tm5
/JEhvTMo9q7elL/4FgCoMCcdoc5eXqQyXRXrQiOU7YHLzn2aWU0=
=DvVW
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-4.21-1' of git://git.linux-nfs.org/projects/anna/linux-nfs
Pull NFS client updates from Anna Schumaker:
"Stable bugfixes:
- xprtrdma: Yet another double DMA-unmap # v4.20
Features:
- Allow some /proc/sys/sunrpc entries without CONFIG_SUNRPC_DEBUG
- Per-xprt rdma receive workqueues
- Drop support for FMR memory registration
- Make port= mount option optional for RDMA mounts
Other bugfixes and cleanups:
- Remove unused nfs4_xdev_fs_type declaration
- Fix comments for behavior that has changed
- Remove generic RPC credentials by switching to 'struct cred'
- Fix crossing mountpoints with different auth flavors
- Various xprtrdma fixes from testing and auditing the close code
- Fixes for disconnect issues when using xprtrdma with krb5
- Clean up and improve xprtrdma trace points
- Fix NFS v4.2 async copy reboot recovery"
* tag 'nfs-for-4.21-1' of git://git.linux-nfs.org/projects/anna/linux-nfs: (63 commits)
sunrpc: convert to DEFINE_SHOW_ATTRIBUTE
sunrpc: Add xprt after nfs4_test_session_trunk()
sunrpc: convert unnecessary GFP_ATOMIC to GFP_NOFS
sunrpc: handle ENOMEM in rpcb_getport_async
NFS: remove unnecessary test for IS_ERR(cred)
xprtrdma: Prevent leak of rpcrdma_rep objects
NFSv4.2 fix async copy reboot recovery
xprtrdma: Don't leak freed MRs
xprtrdma: Add documenting comment for rpcrdma_buffer_destroy
xprtrdma: Replace outdated comment for rpcrdma_ep_post
xprtrdma: Update comments in frwr_op_send
SUNRPC: Fix some kernel doc complaints
SUNRPC: Simplify defining common RPC trace events
NFS: Fix NFSv4 symbolic trace point output
xprtrdma: Trace mapping, alloc, and dereg failures
xprtrdma: Add trace points for calls to transport switch methods
xprtrdma: Relocate the xprtrdma_mr_map trace points
xprtrdma: Clean up of xprtrdma chunk trace points
xprtrdma: Remove unused fields from rpcrdma_ia
xprtrdma: Cull dprintk() call sites
...
NFSv4.2 client, and cleaning up some convoluted backchannel server code
in the process. Otherwise, miscellaneous smaller bugfixes and cleanup.
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJcLR3nAAoJECebzXlCjuG+oyAQALrPSTH9Qg2AwP2eGm+AevUj
u/VFmimImIO9dYuT02t4w42w4qMIQ0/Y7R0UjT3DxG5Oixy/zA+ZaNXCCEKwSMIX
abGF4YalUISbDc6n0Z8J14/T33wDGslhy3IQ9Jz5aBCDCocbWlzXvFlmrowbb3ak
vtB0Fc3Xo6Z/Pu2GzNzlqR+f69IAmwQGJrRrAEp3JUWSIBKiSWBXTujDuVBqJNYj
ySLzbzyAc7qJfI76K635XziULR2ueM3y5JbPX7kTZ0l3OJ6Yc0PtOj16sIv5o0XK
DBYPrtvw3ZbxQE/bXqtJV9Zn6MG5ODGxKszG1zT1J3dzotc9l/LgmcAY8xVSaO+H
QNMdU9QuwmyUG20A9rMoo/XfUb5KZBHzH7HIYOmkfBidcaygwIInIKoIDtzimm4X
OlYq3TL/3QDY6rgTCZv6n2KEnwiIDpc5+TvFhXRWclMOJMcMSHJfKFvqERSv9V3o
90qrCebPA0K8Dnc0HMxcBXZ+0TqZ2QeXp/wfIjibCXqMwlg+BZhmbeA0ngZ7x7qf
2F33E9bfVJjL+VI5FcVYQf43bOTWZgD6ZKGk4T7keYl0CPH+9P70bfhl4KKy9dqc
GwYooy/y5FPb2CvJn/EETeILRJ9OyIHUrw7HBkpz9N8n9z+V6Qbp9yW7LKgaMphW
1T+GpHZhQjwuBPuJhDK0
=dRLp
-----END PGP SIGNATURE-----
Merge tag 'nfsd-4.21' of git://linux-nfs.org/~bfields/linux
Pull nfsd updates from Bruce Fields:
"Thanks to Vasily Averin for fixing a use-after-free in the
containerized NFSv4.2 client, and cleaning up some convoluted
backchannel server code in the process.
Otherwise, miscellaneous smaller bugfixes and cleanup"
* tag 'nfsd-4.21' of git://linux-nfs.org/~bfields/linux: (25 commits)
nfs: fixed broken compilation in nfs_callback_up_net()
nfs: minor typo in nfs4_callback_up_net()
sunrpc: fix debug message in svc_create_xprt()
sunrpc: make visible processing error in bc_svc_process()
sunrpc: remove unused xpo_prep_reply_hdr callback
sunrpc: remove svc_rdma_bc_class
sunrpc: remove svc_tcp_bc_class
sunrpc: remove unused bc_up operation from rpc_xprt_ops
sunrpc: replace svc_serv->sv_bc_xprt by boolean flag
sunrpc: use-after-free in svc_process_common()
sunrpc: use SVC_NET() in svcauth_gss_* functions
nfsd: drop useless LIST_HEAD
lockd: Show pid of lockd for remote locks
NFSD remove OP_CACHEME from 4.2 op_flags
nfsd: Return EPERM, not EACCES, in some SETATTR cases
sunrpc: fix cache_head leak due to queued request
nfsd: clean up indentation, increase indentation in switch statement
svcrdma: Optimize the logic that selects the R_key to invalidate
nfsd: fix a warning in __cld_pipe_upcall()
nfsd4: fix crash on writing v4_end_grace before nfsd startup
...
Missing prototype warning fix and a syzkaller fix when a 9p server
advertises a too small msize
----------------------------------------------------------------
Adeodato Simó (1):
net/9p: include trans_common.h to fix missing prototype warning.
Dominique Martinet (1):
9p/net: put a lower bound on msize
net/9p/client.c | 21 +++++++++++++++++++++
net/9p/trans_common.c | 1 +
2 files changed, 22 insertions(+)
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE/IPbcYBuWt0zoYhOq06b7GqY5nAFAlwlhYwACgkQq06b7GqY
5nAnuQ/+JLQaW+n6D1/noTBVI0RAVDZNpR00+xG0TP7CKkFvMzS8rW2+Plo3lZuA
aONbz0Tw2wOrNFNzq/kdN0avb8ZkpqBReCXTINMl6f96BxwvxV5iBR1Kjw1xjdp4
q4hWj5JktrBWerTWZn0CYb3v2aF2SE16eAMrBC66c1BmagBk0KPHVXrAtLsZuQ3T
MEnDCUYZCP0GPOZ3cyZqwpXvrBZACJaXfFSnlsx0MN6CJq4W/c/UoCIrFWwBpFH3
8Cmsht/C42Ob36PMUghLuulKhztFqFwgVHM6iGB5u1RbqyGdIvYXDbd9NwcihNjE
tbIudu0geYB8At05DcdPfQ5B7UOFxYwbLMbT//E/APxm/vjN90Fyylk62gLwFQM3
nejVFwHweL7Wzv3tTTLkJR9uCF8qiECP1BN2voyuLbAhivHM+VCggpV3U8nDtAOm
/zy5I1bkR9uhp0WInG+riq4fnW5xx1WISZrS76AwPgfHjS1i+191hEvD06b9u39f
zlha8CH2KSZUSwMhFfYOIE5iFDbA8iuu9dgZNMynMW/5DJXsvCeE8bc0LJgfeWtG
ziKG5rwKglzUuqjVFkKHKj3PSeb7fScyFdvLrMZb6d5Qc7xytf1xP4MyFRBDkIAQ
BrycbwKUE/60BLBKhcAPkq/ha7h2g+4ZETCfh9v+C6ppU0bfUfk=
=EdE3
-----END PGP SIGNATURE-----
Merge tag '9p-for-4.21' of git://github.com/martinetd/linux
Pull 9p updates from Dominique Martinet:
"Missing prototype warning fix and a syzkaller fix when a 9p server
advertises a too small msize"
* tag '9p-for-4.21' of git://github.com/martinetd/linux:
9p/net: put a lower bound on msize
net/9p: include trans_common.h to fix missing prototype warning.
In ip6_neigh_lookup(), we must not return errors coming from
neigh_create(): if creation of a neighbour entry fails, the lookup should
return NULL, in the same way as it's done in __neigh_lookup().
Otherwise, callers legitimately checking for a non-NULL return value of
the lookup function might dereference an invalid pointer.
For instance, on neighbour table overflow, ndisc_router_discovery()
crashes ndisc_update() by passing ERR_PTR(-ENOBUFS) as 'neigh' argument.
Reported-by: Jianlin Shi <jishi@redhat.com>
Fixes: f8a1b43b70 ("net/ipv6: Create a neigh_lookup for FIB entries")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use DEFINE_SHOW_ATTRIBUTE macro to simplify the code.
Signed-off-by: Yangtao Li <tiny.windzz@gmail.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Multipathing: In case of NFSv3, rpc_clnt_test_and_add_xprt() adds
the xprt to xprt switch (i.e. xps) if rpc_call_null_helper() returns
success. But in case of NFSv4.1, it needs to do EXCHANGEID to verify
the path along with check for session trunking.
Add the xprt in nfs4_test_session_trunk() only when
nfs4_detect_session_trunking() returns success. Also release refcount
hold by rpc_clnt_setup_test_and_add_xprt().
Signed-off-by: Santosh kumar pradhan <santoshkumar.pradhan@wdc.com>
Tested-by: Suresh Jayaraman <suresh.jayaraman@wdc.com>
Reported-by: Aditya Agnihotri <aditya.agnihotri@wdc.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
It's OK to sleep here, we just don't want to recurse into the filesystem
as a writeout could be waiting on this.
Future work: the documentation for GFP_NOFS says "Please try to avoid
using this flag directly and instead use memalloc_nofs_{save,restore} to
mark the whole scope which cannot/shouldn't recurse into the FS layer
with a short explanation why. All allocation requests will inherit
GFP_NOFS implicitly."
But I'm not sure where to do this. Should the workqueue be arranging
that for us in the case of workqueues created with WQ_MEM_RECLAIM?
Reported-by: Trond Myklebust <trondmy@hammer.space>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
If we ignore the error we'll hit a null dereference a little later.
Reported-by: syzbot+4b98281f2401ab849f4b@syzkaller.appspotmail.com
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
If a reply has been processed but the RPC is later retransmitted
anyway, the req->rl_reply field still contains the only pointer to
the old rpcrdma rep. When the next reply comes in, the reply handler
will stomp on the rl_reply field, leaking the old rep.
A trace event is added to capture such leaks.
This problem seems to be worsened by the restructuring of the RPC
Call path in v4.20. Fully addressing this issue will require at
least a re-architecture of the disconnect logic, which is not
appropriate during -rc.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Defensive clean up. Don't set frwr->fr_mr until we know that the
scatterlist allocation has succeeded.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Make a note of the function's dependency on an earlier ib_drain_qp.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Since commit 7c8d9e7c88 ("xprtrdma: Move Receive posting to
Receive handler"), rpcrdma_ep_post is no longer responsible for
posting Receive buffers. Update the documenting comment to reflect
this change.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Commit f287762308 ("xprtrdma: Chain Send to FastReg WRs") was
written before commit ce5b371782 ("xprtrdma: Replace all usage of
"frmr" with "frwr""), but was merged afterwards. Thus it still
refers to FRMR and MWs.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Clean up some warnings observed when building with "make W=1".
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
These are rare, but can be helpful at tracking down DMAR and other
problems.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Name them "trace_xprtrdma_op_*" so they can be easily enabled as a
group. No trace point is added where the generic layer already has
observability.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
The mr_map trace points were capturing information about the previous
use of the MR rather than about the segment that was just mapped.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
The chunk-related trace points capture nearly the same information
as the MR-related trace points.
Also, rename them so globbing can be used to enable or disable
these trace points more easily.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Clean up. The last use of these fields was in commit 173b8f49b3
("xprtrdma: Demote "connect" log messages") .
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Clean up: Remove dprintk() call sites that report rare or impossible
errors. Leave a few that display high-value low noise status
information.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Clean up: There's little chance of contention between the use of
rb_lock and rb_reqslock, so merge the two. This avoids having to
take both in some (possibly future) cases.
Transport tear-down is already serialized, thus there is no need for
locking at all when destroying rpcrdma_reqs.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
For better observability of parsing errors, return the error code
generated in the decoders to the upper layer consumer.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Since commit ffe1f0df58 ("rpcrdma: Merge svcrdma and xprtrdma
modules into one"), the forward and backchannel components are part
of the same kernel module. A separate request_module() call in the
backchannel code is no longer necessary.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Commit 431f6eb357 ("SUNRPC: Add a label for RPC calls that require
allocation on receive") didn't update similar logic in rpc_rdma.c.
I don't think this is a bug, per-se; the commit just adds more
careful checking for broken upper layer behavior.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Place the associated RPC transaction's XID in the upper 32 bits of
each RDMA segment's rdma_offset field. There are two reasons to do
this:
- The R_key only has 8 bits that are different from registration to
registration. The XID adds more uniqueness to each RDMA segment to
reduce the likelihood of a software bug on the server reading from
or writing into memory it's not supposed to.
- On-the-wire RDMA Read and Write requests do not otherwise carry
any identifier that matches them up to an RPC. The XID in the
upper 32 bits will act as an eye-catcher in network captures.
Suggested-by: Tom Talpey <ttalpey@microsoft.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Clean up: Now that there is only FRWR, there is no need for a memory
registration switch. The indirect calls to the memreg operations can
be replaced with faster direct calls.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
FMR is not supported on most recent RDMA devices. It is also less
secure than FRWR because an FMR memory registration can expose
adjacent bytes to remote reading or writing. As discussed during the
RDMA BoF at LPC 2018, it is time to remove support for FMR in the
NFS/RDMA client stack.
Note that NFS/RDMA server-side uses either local memory registration
or FRWR. FMR is not used.
There are a few Infiniband/RoCE devices in the kernel tree that do
not appear to support MEM_MGT_EXTENSIONS (FRWR), and therefore will
not support client-side NFS/RDMA after this patch. These are:
- mthca
- qib
- hns (RoCE)
Users of these devices can use NFS/TCP on IPoIB instead.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Some devices advertise a large max_fast_reg_page_list_len
capability, but perform optimally when MRs are significantly smaller
than that depth -- probably when the MR itself is no larger than a
page.
By default, the RDMA R/W core API uses max_sge_rd as the maximum
page depth for MRs. For some devices, the value of max_sge_rd is
1, which is also not optimal. Thus, when max_sge_rd is larger than
1, use that value. Otherwise use the value of the
max_fast_reg_page_list_len attribute.
I've tested this with CX-3 Pro, FastLinq, and CX-5 devices. It
reproducibly improves the throughput of large I/Os by several
percent.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
With certain combinations of krb5i/p, MR size, and r/wsize, I/O can
fail with EMSGSIZE. This is because the calculated value of
ri_max_segs (the max number of MRs per RPC) exceeded
RPCRDMA_MAX_HDR_SEGS, which caused Read or Write list encoding to
walk off the end of the transport header.
Once that was addressed, the ro_maxpages result has to be corrected
to account for the number of MRs needed for Reply chunks, which is
2 MRs smaller than a normal Read or Write chunk.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Transport disconnect processing does a "wake pending tasks" at
various points.
Suppose an RPC Reply is being processed. The RPC task that Reply
goes with is waiting on the pending queue. If a disconnect wake-up
happens before reply processing is done, that reply, even if it is
good, is thrown away, and the RPC has to be sent again.
This window apparently does not exist for socket transports because
there is a lock held while a reply is being received which prevents
the wake-up call until after reply processing is done.
To resolve this, all RPC replies being processed on an RPC-over-RDMA
transport have to complete before pending tasks are awoken due to a
transport disconnect.
Callers that already hold the transport write lock may invoke
->ops->close directly. Others use a generic helper that schedules
a close when the write lock can be taken safely.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
After thinking about this more, and auditing other kernel ULP imple-
mentations, I believe that a DISCONNECT cm_event will occur after a
fatal QP event. If that's the case, there's no need for an explicit
disconnect in the QP event handler.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
To address a connection-close ordering problem, we need the ability
to drain the RPC completions running on rpcrdma_receive_wq for just
one transport. Give each transport its own RPC completion workqueue,
and drain that workqueue when disconnecting the transport.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Clean up: Divide the work cleanly:
- rpcrdma_wc_receive is responsible only for RDMA Receives
- rpcrdma_reply_handler is responsible only for RPC Replies
- the posted send and receive counts both belong in rpcrdma_ep
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
The recovery case in frwr_op_unmap_sync needs to DMA unmap each MR.
frwr_release_mr does not DMA-unmap, but the recycle worker does.
Fixes: 61da886bf7 ("xprtrdma: Explicitly resetting MRs is ... ")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
While chasing yet another set of DMAR fault reports, I noticed that
the frwr recycler conflates whether or not an MR has been DMA
unmapped with frwr->fr_state. Actually the two have only an indirect
relationship. It's in fact impossible to guess reliably whether the
MR has been DMA unmapped based on its fr_state field, especially as
the surrounding code and its assumptions have changed over time.
A better approach is to track the DMA mapping status explicitly so
that the recycler is less brittle to unexpected situations, and
attempts to DMA-unmap a second time are prevented.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Cc: stable@vger.kernel.org # v4.20
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>