Commit Graph

59280 Commits

Author SHA1 Message Date
Vasily Averin
dc15af8e9d netfilter: nf_conntrack: ct_cpu_seq_next should increase position index
If .next function does not change position index,
following .show function will repeat output related
to current position index.

Cc: stable@vger.kernel.org
Fixes: 1f4aace60b ("fs/seq_file.c: simplify seq_file iteration code ...")
Link: https://bugzilla.kernel.org/show_bug.cgi?id=206283
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-04 02:27:18 +01:00
Russell King
8640f8dc6d net: dsa: fix phylink_start()/phylink_stop() calls
Place phylink_start()/phylink_stop() inside dsa_port_enable() and
dsa_port_disable(), which ensures that we call phylink_stop() before
tearing down phylink - which is a documented requirement.  Failure
to do so can cause use-after-free bugs.

Fixes: 0e27921816 ("net: dsa: Use PHYLINK for the CPU/DSA ports")
Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-03 15:45:49 -08:00
Hangbin Liu
d0098e4c6b net/ipv6: remove the old peer route if change it to a new one
When we modify the peer route and changed it to a new one, we should
remove the old route first. Before the fix:

+ ip addr add dev dummy1 2001:db8::1 peer 2001:db8::2
+ ip -6 route show dev dummy1
2001:db8::1 proto kernel metric 256 pref medium
2001:db8::2 proto kernel metric 256 pref medium
+ ip addr change dev dummy1 2001:db8::1 peer 2001:db8::3
+ ip -6 route show dev dummy1
2001:db8::1 proto kernel metric 256 pref medium
2001:db8::2 proto kernel metric 256 pref medium

After the fix:
+ ip addr change dev dummy1 2001:db8::1 peer 2001:db8::3
+ ip -6 route show dev dummy1
2001:db8::1 proto kernel metric 256 pref medium
2001:db8::3 proto kernel metric 256 pref medium

This patch depend on the previous patch "net/ipv6: need update peer route
when modify metric" to update new peer route after delete old one.

Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-03 14:43:16 -08:00
Hangbin Liu
617940123e net/ipv6: need update peer route when modify metric
When we modify the route metric, the peer address's route need also
be updated. Before the fix:

+ ip addr add dev dummy1 2001:db8::1 peer 2001:db8::2 metric 60
+ ip -6 route show dev dummy1
2001:db8::1 proto kernel metric 60 pref medium
2001:db8::2 proto kernel metric 60 pref medium
+ ip addr change dev dummy1 2001:db8::1 peer 2001:db8::2 metric 61
+ ip -6 route show dev dummy1
2001:db8::1 proto kernel metric 61 pref medium
2001:db8::2 proto kernel metric 60 pref medium

After the fix:
+ ip addr change dev dummy1 2001:db8::1 peer 2001:db8::2 metric 61
+ ip -6 route show dev dummy1
2001:db8::1 proto kernel metric 61 pref medium
2001:db8::2 proto kernel metric 61 pref medium

Fixes: 8308f3ff17 ("net/ipv6: Add support for specifying metric of connected routes")
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-03 14:43:16 -08:00
Jakub Kicinski
6ba3da4465 nfc: add missing attribute validation for vendor subcommand
Add missing attribute validation for vendor subcommand attributes
to the netlink policy.

Fixes: 9e58095f96 ("NFC: netlink: Implement vendor command support")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-03 13:28:49 -08:00
Jakub Kicinski
88e706d516 nfc: add missing attribute validation for deactivate target
Add missing attribute validation for NFC_ATTR_TARGET_INDEX
to the netlink policy.

Fixes: 4d63adfe12 ("NFC: Add NFC_CMD_DEACTIVATE_TARGET support")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-03 13:28:49 -08:00
Jakub Kicinski
361d23e41c nfc: add missing attribute validation for SE API
Add missing attribute validation for NFC_ATTR_SE_INDEX
to the netlink policy.

Fixes: 5ce3f32b52 ("NFC: netlink: SE API implementation")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-03 13:28:49 -08:00
Jakub Kicinski
213320a679 tipc: add missing attribute validation for MTU property
Add missing attribute validation for TIPC_NLA_PROP_MTU
to the netlink policy.

Fixes: 901271e040 ("tipc: implement configuration of UDP media MTU")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-03 13:28:49 -08:00
Jakub Kicinski
e13aaa0643 net: taprio: add missing attribute validation for txtime delay
Add missing attribute validation for TCA_TAPRIO_ATTR_TXTIME_DELAY
to the netlink policy.

Fixes: 4cfd5779bd ("taprio: Add support for txtime-assist mode")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-03 13:28:48 -08:00
Jakub Kicinski
7e6dc03eeb net: fq: add missing attribute validation for orphan mask
Add missing attribute validation for TCA_FQ_ORPHAN_MASK
to the netlink policy.

Fixes: 06eb395fa9 ("pkt_sched: fq: better control of DDOS traffic")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-03 13:28:48 -08:00
Jakub Kicinski
b5ab1f1be6 openvswitch: add missing attribute validation for hash
Add missing attribute validation for OVS_PACKET_ATTR_HASH
to the netlink policy.

Fixes: bd1903b7c4 ("net: openvswitch: add hash info to upcall")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Greg Rose <gvrose8192@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-03 13:28:48 -08:00
Jakub Kicinski
b60673c4c4 nl802154: add missing attribute validation for dev_type
Add missing attribute type validation for IEEE802154_ATTR_DEV_TYPE
to the netlink policy.

Fixes: 90c049b2c6 ("ieee802154: interface type to be added")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Acked-by: Stefan Schmidt <stefan@datenfreihafen.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-03 13:28:48 -08:00
Jakub Kicinski
9322cd7c4a nl802154: add missing attribute validation
Add missing attribute validation for several u8 types.

Fixes: 2c21d11518 ("net: add NL802154 interface for configuration of 802.15.4 devices")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Acked-by: Stefan Schmidt <stefan@datenfreihafen.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-03 13:28:48 -08:00
Jakub Kicinski
ff3b63b8c2 devlink: validate length of region addr/len
DEVLINK_ATTR_REGION_CHUNK_ADDR and DEVLINK_ATTR_REGION_CHUNK_LEN
lack entries in the netlink policy. Corresponding nla_get_u64()s
may read beyond the end of the message.

Fixes: 4e54795a27 ("devlink: Add support for region snapshot read command")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-03 13:28:48 -08:00
Jakub Kicinski
8750939b6a devlink: validate length of param values
DEVLINK_ATTR_PARAM_VALUE_DATA may have different types
so it's not checked by the normal netlink policy. Make
sure the attribute length is what we expect.

Fixes: e3b7ca18ad ("devlink: Add param set command")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-03 13:28:48 -08:00
Hangbin Liu
07758eb9ff net/ipv6: use configured metric when add peer route
When we add peer address with metric configured, IPv4 could set the dest
metric correctly, but IPv6 do not. e.g.

]# ip addr add 192.0.2.1 peer 192.0.2.2/32 dev eth1 metric 20
]# ip route show dev eth1
192.0.2.2 proto kernel scope link src 192.0.2.1 metric 20
]# ip addr add 2001:db8::1 peer 2001:db8::2/128 dev eth1 metric 20
]# ip -6 route show dev eth1
2001:db8::1 proto kernel metric 20 pref medium
2001:db8::2 proto kernel metric 256 pref medium

Fix this by using configured metric instead of default one.

Reported-by: Jianlin Shi <jishi@redhat.com>
Fixes: 8308f3ff17 ("net/ipv6: Add support for specifying metric of connected routes")
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-29 21:55:55 -08:00
Pablo Neira Ayuso
84b3268027 netlink: Use netlink header as base to calculate bad attribute offset
Userspace might send a batch that is composed of several netlink
messages. The netlink_ack() function must use the pointer to the netlink
header as base to calculate the bad attribute offset.

Fixes: 2d4bc93368 ("netlink: extended ACK reporting")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-29 21:21:23 -08:00
Stefano Garzarella
3f74957fcb vsock: fix potential deadlock in transport->release()
Some transports (hyperv, virtio) acquire the sock lock during the
.release() callback.

In the vsock_stream_connect() we call vsock_assign_transport(); if
the socket was previously assigned to another transport, the
vsk->transport->release() is called, but the sock lock is already
held in the vsock_stream_connect(), causing a deadlock reported by
syzbot:

    INFO: task syz-executor280:9768 blocked for more than 143 seconds.
      Not tainted 5.6.0-rc1-syzkaller #0
    "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    syz-executor280 D27912  9768   9766 0x00000000
    Call Trace:
     context_switch kernel/sched/core.c:3386 [inline]
     __schedule+0x934/0x1f90 kernel/sched/core.c:4082
     schedule+0xdc/0x2b0 kernel/sched/core.c:4156
     __lock_sock+0x165/0x290 net/core/sock.c:2413
     lock_sock_nested+0xfe/0x120 net/core/sock.c:2938
     virtio_transport_release+0xc4/0xd60 net/vmw_vsock/virtio_transport_common.c:832
     vsock_assign_transport+0xf3/0x3b0 net/vmw_vsock/af_vsock.c:454
     vsock_stream_connect+0x2b3/0xc70 net/vmw_vsock/af_vsock.c:1288
     __sys_connect_file+0x161/0x1c0 net/socket.c:1857
     __sys_connect+0x174/0x1b0 net/socket.c:1874
     __do_sys_connect net/socket.c:1885 [inline]
     __se_sys_connect net/socket.c:1882 [inline]
     __x64_sys_connect+0x73/0xb0 net/socket.c:1882
     do_syscall_64+0xfa/0x790 arch/x86/entry/common.c:294
     entry_SYSCALL_64_after_hwframe+0x49/0xbe

To avoid this issue, this patch remove the lock acquiring in the
.release() callback of hyperv and virtio transports, and it holds
the lock when we call vsk->transport->release() in the vsock core.

Reported-by: syzbot+731710996d79d0d58fbc@syzkaller.appspotmail.com
Fixes: 408624af4c ("vsock: use local transport when it is loaded")
Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-27 12:03:56 -08:00
David S. Miller
5c05a164d4 unix: It's CONFIG_PROC_FS not CONFIG_PROCFS
Fixes: 3a12500ed5 ("unix: define and set show_fdinfo only if procfs is enabled")
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-27 11:52:35 -08:00
Tobias Klauser
3a12500ed5 unix: define and set show_fdinfo only if procfs is enabled
Follow the pattern used with other *_show_fdinfo functions and only
define unix_show_fdinfo and set it in proto_ops if CONFIG_PROCFS
is set.

Fixes: 3c32da19a8 ("unix: Show number of pending scm files of receive queue in fdinfo")
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-27 11:11:57 -08:00
Karsten Graul
a2f2ef4a54 net/smc: check for valid ib_client_data
In smc_ib_remove_dev() check if the provided ib device was actually
initialized for SMC before.

Reported-by: syzbot+84484ccebdd4e5451d91@syzkaller.appspotmail.com
Fixes: a4cf0443c4 ("smc: introduce SMC as an IB-client")
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-26 20:56:25 -08:00
Paolo Abeni
dc24f8b4ec mptcp: add dummy icsk_sync_mss()
syzbot noted that the master MPTCP socket lacks the icsk_sync_mss
callback, and was able to trigger a null pointer dereference:

BUG: kernel NULL pointer dereference, address: 0000000000000000
PGD 8e171067 P4D 8e171067 PUD 93fa2067 PMD 0
Oops: 0010 [#1] PREEMPT SMP KASAN
CPU: 0 PID: 8984 Comm: syz-executor066 Not tainted 5.6.0-rc2-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:0x0
Code: Bad RIP value.
RSP: 0018:ffffc900020b7b80 EFLAGS: 00010246
RAX: 1ffff110124ba600 RBX: 0000000000000000 RCX: ffff88809fefa600
RDX: ffff8880994cdb18 RSI: 0000000000000000 RDI: ffff8880925d3140
RBP: ffffc900020b7bd8 R08: ffffffff870225be R09: fffffbfff140652a
R10: fffffbfff140652a R11: 0000000000000000 R12: ffff8880925d35d0
R13: ffff8880925d3140 R14: dffffc0000000000 R15: 1ffff110124ba6ba
FS:  0000000001a0b880(0000) GS:ffff8880aea00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffffffffffffffd6 CR3: 00000000a6d6f000 CR4: 00000000001406f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 cipso_v4_sock_setattr+0x34b/0x470 net/ipv4/cipso_ipv4.c:1888
 netlbl_sock_setattr+0x2a7/0x310 net/netlabel/netlabel_kapi.c:989
 smack_netlabel security/smack/smack_lsm.c:2425 [inline]
 smack_inode_setsecurity+0x3da/0x4a0 security/smack/smack_lsm.c:2716
 security_inode_setsecurity+0xb2/0x140 security/security.c:1364
 __vfs_setxattr_noperm+0x16f/0x3e0 fs/xattr.c:197
 vfs_setxattr fs/xattr.c:224 [inline]
 setxattr+0x335/0x430 fs/xattr.c:451
 __do_sys_fsetxattr fs/xattr.c:506 [inline]
 __se_sys_fsetxattr+0x130/0x1b0 fs/xattr.c:495
 __x64_sys_fsetxattr+0xbf/0xd0 fs/xattr.c:495
 do_syscall_64+0xf7/0x1c0 arch/x86/entry/common.c:294
 entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x440199
Code: 18 89 d0 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 fb 13 fc ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007ffcadc19e48 EFLAGS: 00000246 ORIG_RAX: 00000000000000be
RAX: ffffffffffffffda RBX: 00000000004002c8 RCX: 0000000000440199
RDX: 0000000020000200 RSI: 00000000200001c0 RDI: 0000000000000003
RBP: 00000000006ca018 R08: 0000000000000003 R09: 00000000004002c8
R10: 0000000000000009 R11: 0000000000000246 R12: 0000000000401a20
R13: 0000000000401ab0 R14: 0000000000000000 R15: 0000000000000000
Modules linked in:
CR2: 0000000000000000

Address the issue adding a dummy icsk_sync_mss callback.
To properly sync the subflows mss and options list we need some
additional infrastructure, which will land to net-next.

Reported-by: syzbot+f4dfece964792d80b139@syzkaller.appspotmail.com
Fixes: 2303f994b3 ("mptcp: Associate MPTCP context with TCP socket")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-26 20:49:50 -08:00
Eric Dumazet
b6f6118901 ipv6: restrict IPV6_ADDRFORM operation
IPV6_ADDRFORM is able to transform IPv6 socket to IPv4 one.
While this operation sounds illogical, we have to support it.

One of the things it does for TCP socket is to switch sk->sk_prot
to tcp_prot.

We now have other layers playing with sk->sk_prot, so we should make
sure to not interfere with them.

This patch makes sure sk_prot is the default pointer for TCP IPv6 socket.

syzbot reported :
BUG: kernel NULL pointer dereference, address: 0000000000000000
PGD a0113067 P4D a0113067 PUD a8771067 PMD 0
Oops: 0010 [#1] PREEMPT SMP KASAN
CPU: 0 PID: 10686 Comm: syz-executor.0 Not tainted 5.6.0-rc2-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:0x0
Code: Bad RIP value.
RSP: 0018:ffffc9000281fce0 EFLAGS: 00010246
RAX: 1ffffffff15f48ac RBX: ffffffff8afa4560 RCX: dffffc0000000000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8880a69a8f40
RBP: ffffc9000281fd10 R08: ffffffff86ed9b0c R09: ffffed1014d351f5
R10: ffffed1014d351f5 R11: 0000000000000000 R12: ffff8880920d3098
R13: 1ffff1101241a613 R14: ffff8880a69a8f40 R15: 0000000000000000
FS:  00007f2ae75db700(0000) GS:ffff8880aea00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffffffffffffffd6 CR3: 00000000a3b85000 CR4: 00000000001406f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 inet_release+0x165/0x1c0 net/ipv4/af_inet.c:427
 __sock_release net/socket.c:605 [inline]
 sock_close+0xe1/0x260 net/socket.c:1283
 __fput+0x2e4/0x740 fs/file_table.c:280
 ____fput+0x15/0x20 fs/file_table.c:313
 task_work_run+0x176/0x1b0 kernel/task_work.c:113
 tracehook_notify_resume include/linux/tracehook.h:188 [inline]
 exit_to_usermode_loop arch/x86/entry/common.c:164 [inline]
 prepare_exit_to_usermode+0x480/0x5b0 arch/x86/entry/common.c:195
 syscall_return_slowpath+0x113/0x4a0 arch/x86/entry/common.c:278
 do_syscall_64+0x11f/0x1c0 arch/x86/entry/common.c:304
 entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x45c429
Code: ad b6 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 7b b6 fb ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007f2ae75dac78 EFLAGS: 00000246 ORIG_RAX: 0000000000000036
RAX: 0000000000000000 RBX: 00007f2ae75db6d4 RCX: 000000000045c429
RDX: 0000000000000001 RSI: 000000000000011a RDI: 0000000000000004
RBP: 000000000076bf20 R08: 0000000000000038 R09: 0000000000000000
R10: 0000000020000180 R11: 0000000000000246 R12: 00000000ffffffff
R13: 0000000000000a9d R14: 00000000004ccfb4 R15: 000000000076bf2c
Modules linked in:
CR2: 0000000000000000
---[ end trace 82567b5207e87bae ]---
RIP: 0010:0x0
Code: Bad RIP value.
RSP: 0018:ffffc9000281fce0 EFLAGS: 00010246
RAX: 1ffffffff15f48ac RBX: ffffffff8afa4560 RCX: dffffc0000000000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8880a69a8f40
RBP: ffffc9000281fd10 R08: ffffffff86ed9b0c R09: ffffed1014d351f5
R10: ffffed1014d351f5 R11: 0000000000000000 R12: ffff8880920d3098
R13: 1ffff1101241a613 R14: ffff8880a69a8f40 R15: 0000000000000000
FS:  00007f2ae75db700(0000) GS:ffff8880aea00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffffffffffffffd6 CR3: 00000000a3b85000 CR4: 00000000001406f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400

Fixes: 604326b41a ("bpf, sockmap: convert to generic sk_msg interface")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot+1938db17e275e85dc328@syzkaller.appspotmail.com
Cc: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-26 20:20:58 -08:00
Ursula Braun
51e3dfa890 net/smc: fix cleanup for linkgroup setup failures
If an SMC connection to a certain peer is setup the first time,
a new linkgroup is created. In case of setup failures, such a
linkgroup is unusable and should disappear. As a first step the
linkgroup is removed from the linkgroup list in smc_lgr_forget().

There are 2 problems:
smc_listen_decline() might be called before linkgroup creation
resulting in a crash due to calling smc_lgr_forget() with
parameter NULL.
If a setup failure occurs after linkgroup creation, the connection
is never unregistered from the linkgroup, preventing linkgroup
freeing.

This patch introduces an enhanced smc_lgr_cleanup_early() function
which
* contains a linkgroup check for early smc_listen_decline()
  invocations
* invokes smc_conn_free() to guarantee unregistering of the
  connection.
* schedules fast linkgroup removal of the unusable linkgroup

And the unused function smcd_conn_free() is removed from smc_core.h.

Fixes: 3b2dec2603 ("net/smc: restructure client and server code in af_smc")
Fixes: 2a0674fffb ("net/smc: improve abnormal termination of link groups")
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-26 20:18:07 -08:00
Jiri Pirko
1521a67e60 sched: act: count in the size of action flags bitfield
The put of the flags was added by the commit referenced in fixes tag,
however the size of the message was not extended accordingly.

Fix this by adding size of the flags bitfield to the message size.

Fixes: e382267860 ("net: sched: update action implementations to support flags")
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-26 17:10:44 -08:00
Madhuparna Bhowmik
2eb51c75dc net: core: devlink.c: Use built-in RCU list checking
list_for_each_entry_rcu() has built-in RCU and lock checking.

Pass cond argument to list_for_each_entry_rcu() to silence
false lockdep warning when CONFIG_PROVE_RCU_LIST is enabled.

The devlink->lock is held when devlink_dpipe_table_find()
is called in non RCU read side section. Therefore, pass struct devlink
to devlink_dpipe_table_find() for lockdep checking.

Signed-off-by: Madhuparna Bhowmik <madhuparnabhowmik10@gmail.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-26 16:59:18 -08:00
David S. Miller
574b238f64 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf
Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patchset contains Netfilter fixes:

1) Perform garbage collection from workqueue to fix rcu detected
   stall in ipset hash set types, from Jozsef Kadlecsik.

2) Fix the forceadd evaluation path, also from Jozsef.

3) Fix nft_set_pipapo selftest, from Stefano Brivio.

4) Crash when add-flush-add element in pipapo set, also from Stefano.
   Add test to cover this crash.

5) Remove sysctl entry under mutex in hashlimit, from Cong Wang.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-26 16:30:17 -08:00
Cong Wang
99b79c3900 netfilter: xt_hashlimit: unregister proc file before releasing mutex
Before releasing the global mutex, we only unlink the hashtable
from the hash list, its proc file is still not unregistered at
this point. So syzbot could trigger a race condition where a
parallel htable_create() could register the same file immediately
after the mutex is released.

Move htable_remove_proc_entry() back to mutex protection to
fix this. And, fold htable_destroy() into htable_put() to make
the code slightly easier to understand.

Reported-and-tested-by: syzbot+d195fd3b9a364ddd6731@syzkaller.appspotmail.com
Fixes: c4a3922d2d ("netfilter: xt_hashlimit: reduce hashlimit_mutex scope for htable_put()")
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-02-26 23:25:07 +01:00
Michal Kubecek
e34f1753ee ethtool: limit bitset size
Syzbot reported that ethnl_compact_sanity_checks() can be tricked into
reading past the end of ETHTOOL_A_BITSET_VALUE and ETHTOOL_A_BITSET_MASK
attributes and even the message by passing a value between (u32)(-31)
and (u32)(-1) as ETHTOOL_A_BITSET_SIZE.

The problem is that DIV_ROUND_UP(attr_nbits, 32) is 0 for such values so
that zero length ETHTOOL_A_BITSET_VALUE will pass the length check but
ethnl_bitmap32_not_zero() check would try to access up to 512 MB of
attribute "payload".

Prevent this overflow byt limiting the bitset size. Technically, compact
bitset format would allow bitset sizes up to almost 2^18 (so that the
nest size does not exceed U16_MAX) but bitsets used by ethtool are much
shorter. S16_MAX, the largest value which can be directly used as an
upper limit in policy, should be a reasonable compromise.

Fixes: 10b518d4e6 ("ethtool: netlink bitset handling")
Reported-by: syzbot+7fd4ed5b4234ab1fdccd@syzkaller.appspotmail.com
Reported-by: syzbot+709b7a64d57978247e44@syzkaller.appspotmail.com
Reported-by: syzbot+983cb8fb2d17a7af549d@syzkaller.appspotmail.com
Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-26 11:27:31 -08:00
Amritha Nambiar
6e11d1578f net: Fix Tx hash bound checking
Fixes the lower and upper bounds when there are multiple TCs and
traffic is on the the same TC on the same device.

The lower bound is represented by 'qoffset' and the upper limit for
hash value is 'qcount + qoffset'. This gives a clean Rx to Tx queue
mapping when there are multiple TCs, as the queue indices for upper TCs
will be offset by 'qoffset'.

v2: Fixed commit description based on comments.

Fixes: 1b837d489e ("net: Revoke export for __skb_tx_hash, update it to just be static skb_tx_hash")
Fixes: eadec877ce ("net: Add support for subordinate traffic classes to netdev_pick_tx")
Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
Reviewed-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Reviewed-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-26 11:14:10 -08:00
Stefano Brivio
212d58c106 nft_set_pipapo: Actually fetch key data in nft_pipapo_remove()
Phil reports that adding elements, flushing and re-adding them
right away:

  nft add table t '{ set s { type ipv4_addr . inet_service; flags interval; }; }'
  nft add element t s '{ 10.0.0.1 . 22-25, 10.0.0.1 . 10-20 }'
  nft flush set t s
  nft add element t s '{ 10.0.0.1 . 10-20, 10.0.0.1 . 22-25 }'

triggers, almost reliably, a crash like this one:

  [   71.319848] general protection fault, probably for non-canonical address 0x6f6b6e696c2e756e: 0000 [#1] PREEMPT SMP PTI
  [   71.321540] CPU: 3 PID: 1201 Comm: kworker/3:2 Not tainted 5.6.0-rc1-00377-g2bb07f4e1d861 #192
  [   71.322746] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190711_202441-buildvm-armv7-10.arm.fedoraproject.org-2.fc31 04/01/2014
  [   71.324430] Workqueue: events nf_tables_trans_destroy_work [nf_tables]
  [   71.325387] RIP: 0010:nft_set_elem_destroy+0xa5/0x110 [nf_tables]
  [   71.326164] Code: 89 d4 84 c0 74 0e 8b 77 44 0f b6 f8 48 01 df e8 41 ff ff ff 45 84 e4 74 36 44 0f b6 63 08 45 84 e4 74 2c 49 01 dc 49 8b 04 24 <48> 8b 40 38 48 85 c0 74 4f 48 89 e7 4c 8b
  [   71.328423] RSP: 0018:ffffc9000226fd90 EFLAGS: 00010282
  [   71.329225] RAX: 6f6b6e696c2e756e RBX: ffff88813ab79f60 RCX: ffff88813931b5a0
  [   71.330365] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff88813ab79f9a
  [   71.331473] RBP: ffff88813ab79f60 R08: 0000000000000008 R09: 0000000000000000
  [   71.332627] R10: 000000000000021c R11: 0000000000000000 R12: ffff88813ab79fc2
  [   71.333615] R13: ffff88813b3adf50 R14: dead000000000100 R15: ffff88813931b8a0
  [   71.334596] FS:  0000000000000000(0000) GS:ffff88813bd80000(0000) knlGS:0000000000000000
  [   71.335780] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [   71.336577] CR2: 000055ac683710f0 CR3: 000000013a222003 CR4: 0000000000360ee0
  [   71.337533] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  [   71.338557] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  [   71.339718] Call Trace:
  [   71.340093]  nft_pipapo_destroy+0x7a/0x170 [nf_tables_set]
  [   71.340973]  nft_set_destroy+0x20/0x50 [nf_tables]
  [   71.341879]  nf_tables_trans_destroy_work+0x246/0x260 [nf_tables]
  [   71.342916]  process_one_work+0x1d5/0x3c0
  [   71.343601]  worker_thread+0x4a/0x3c0
  [   71.344229]  kthread+0xfb/0x130
  [   71.344780]  ? process_one_work+0x3c0/0x3c0
  [   71.345477]  ? kthread_park+0x90/0x90
  [   71.346129]  ret_from_fork+0x35/0x40
  [   71.346748] Modules linked in: nf_tables_set nf_tables nfnetlink 8021q [last unloaded: nfnetlink]
  [   71.348153] ---[ end trace 2eaa8149ca759bcc ]---
  [   71.349066] RIP: 0010:nft_set_elem_destroy+0xa5/0x110 [nf_tables]
  [   71.350016] Code: 89 d4 84 c0 74 0e 8b 77 44 0f b6 f8 48 01 df e8 41 ff ff ff 45 84 e4 74 36 44 0f b6 63 08 45 84 e4 74 2c 49 01 dc 49 8b 04 24 <48> 8b 40 38 48 85 c0 74 4f 48 89 e7 4c 8b
  [   71.350017] RSP: 0018:ffffc9000226fd90 EFLAGS: 00010282
  [   71.350019] RAX: 6f6b6e696c2e756e RBX: ffff88813ab79f60 RCX: ffff88813931b5a0
  [   71.350019] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff88813ab79f9a
  [   71.350020] RBP: ffff88813ab79f60 R08: 0000000000000008 R09: 0000000000000000
  [   71.350021] R10: 000000000000021c R11: 0000000000000000 R12: ffff88813ab79fc2
  [   71.350022] R13: ffff88813b3adf50 R14: dead000000000100 R15: ffff88813931b8a0
  [   71.350025] FS:  0000000000000000(0000) GS:ffff88813bd80000(0000) knlGS:0000000000000000
  [   71.350026] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [   71.350027] CR2: 000055ac683710f0 CR3: 000000013a222003 CR4: 0000000000360ee0
  [   71.350028] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  [   71.350028] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  [   71.350030] Kernel panic - not syncing: Fatal exception
  [   71.350412] Kernel Offset: disabled
  [   71.365922] ---[ end Kernel panic - not syncing: Fatal exception ]---

which is caused by dangling elements that have been deactivated, but
never removed.

On a flush operation, nft_pipapo_walk() walks through all the elements
in the mapping table, which are then deactivated by nft_flush_set(),
one by one, and added to the commit list for removal. Element data is
then freed.

On transaction commit, nft_pipapo_remove() is called, and failed to
remove these elements, leading to the stale references in the mapping.
The first symptom of this, revealed by KASan, is a one-byte
use-after-free in subsequent calls to nft_pipapo_walk(), which is
usually not enough to trigger a panic. When stale elements are used
more heavily, though, such as double-free via nft_pipapo_destroy()
as in Phil's case, the problem becomes more noticeable.

The issue comes from that fact that, on a flush operation,
nft_pipapo_remove() won't get the actual key data via elem->key,
elements to be deleted upon commit won't be found by the lookup via
pipapo_get(), and removal will be skipped. Key data should be fetched
via nft_set_ext_key(), instead.

Reported-by: Phil Sutter <phil@nwl.cc>
Fixes: 3c4287f620 ("nf_tables: Add set type for arbitrary concatenation of ranges")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-02-26 14:33:09 +01:00
Pablo Neira Ayuso
9ea4894ba4 Merge branch 'master' of git://blackhole.kfki.hu/nf
Jozsef Kadlecsik says:

====================
ipset patches for nf

The first one is larger than usual, but the issue could not be solved simpler.
Also, it's a resend of the patch I submitted a few days ago, with a one line
fix on top of that: the size of the comment extensions was not taken into
account at reporting the full size of the set.

- Fix "INFO: rcu detected stall in hash_xxx" reports of syzbot
  by introducing region locking and using workqueue instead of timer based
  gc of timed out entries in hash types of sets in ipset.
- Fix the forceadd evaluation path - the bug was also uncovered by the syzbot.
====================

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-02-26 13:55:15 +01:00
David S. Miller
3614d05b5e A few fixes:
* remove a double mutex-unlock
  * fix a leak in an error path
  * NULL pointer check
  * include if_vlan.h where needed
  * avoid RCU list traversal when not under RCU
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEH1e1rEeCd0AIMq6MB8qZga/fl8QFAl5UGA8ACgkQB8qZga/f
 l8Qk4g/8DstHoC7R74x4jrOyGffw7ss7XC8YJi2x3IENg7mAYdEFvnUC7GQmrCso
 aMypkOWac0H4cAw2BId8CFevbWZbanJ9uvC4JicQpaNa9bZkqBaDMFwO3k66S1hA
 JBvmOqPy8FNuFRsXTicMPw7/I6QazR+sW7Ed68L7ygikkklvTeovH9qkL+zHHMcK
 dAVS6ZKB2jSkEOtnlG01QMRwdIijTkD6KjGDECVMzau9XC+fGowQeuy+xuAU0THq
 Myef9CpzU2Q47Sa61fEyXVC54Izp9B1ZuohM2wfrN2zC3SKxsmPM1lm4GhrJOfUF
 wJTQY75z1kAmCAJDKpwBrFkyIEfUPQV1v9HHit0LX4y11NYNWL22liLhPv+kZDmD
 9ht1IxAP1Nw/DSVKaXUmwk1tFkETfk2BvCkNjVrwKlt88QgvQAL/uzgKoz5E6VIM
 hU1hp0mJCzsmyt/COWOf0l/QpO2nN3/64YHZgFiqOGnhXr5r0L74Jf3lcdWyaXRU
 hcJ0M/EcfGLWOlHFcRkNdWCkBFk5Daf4/zzQ3vO9weAlb7wOLDo9mHoXMBgP6R9l
 kCnxjxgIqmpQLvMo/sBt2cLaf3bJU40p/BYU7hqI7CerODFc3JKkCs1ODZPbYDaR
 Fk60mlElkBD3P4EU8/6nbXHOfG9FvEiZjhT7kRcUdPx0dLA/KMk=
 =7JH4
 -----END PGP SIGNATURE-----

Merge tag 'mac80211-for-net-2020-02-24' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211

Johannes Berg

====================
A few fixes:
 * remove a double mutex-unlock
 * fix a leak in an error path
 * NULL pointer check
 * include if_vlan.h where needed
 * avoid RCU list traversal when not under RCU
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-24 15:43:38 -08:00
Nikolay Aleksandrov
823d81b0fa net: bridge: fix stale eth hdr pointer in br_dev_xmit
In br_dev_xmit() we perform vlan filtering in br_allowed_ingress() but
if the packet has the vlan header inside (e.g. bridge with disabled
tx-vlan-offload) then the vlan filtering code will use skb_vlan_untag()
to extract the vid before filtering which in turn calls pskb_may_pull()
and we may end up with a stale eth pointer. Moreover the cached eth header
pointer will generally be wrong after that operation. Remove the eth header
caching and just use eth_hdr() directly, the compiler does the right thing
and calculates it only once so we don't lose anything.

Fixes: 057658cb33 ("bridge: suppress arp pkts on BR_NEIGH_SUPPRESS ports")
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-24 11:11:19 -08:00
Madhuparna Bhowmik
253216ffb2 mac80211: rx: avoid RCU list traversal under mutex
local->sta_mtx is held in __ieee80211_check_fast_rx_iface().
No need to use list_for_each_entry_rcu() as it also requires
a cond argument to avoid false lockdep warnings when not used in
RCU read-side section (with CONFIG_PROVE_RCU_LIST).
Therefore use list_for_each_entry();

Signed-off-by: Madhuparna Bhowmik <madhuparnabhowmik10@gmail.com>
Link: https://lore.kernel.org/r/20200223143302.15390-1-madhuparnabhowmik10@gmail.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-02-24 10:42:38 +01:00
Johannes Berg
e3ae39edbc nl80211: explicitly include if_vlan.h
We use that here, and do seem to get it through some recursive
include, but better include it explicitly.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Link: https://lore.kernel.org/r/20200224093814.1b9c258fec67.I45ac150d4e11c72eb263abec9f1f0c7add9bef2b@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-02-24 10:41:13 +01:00
Madhuparna Bhowmik
6132c1d903 net: core: devlink.c: Hold devlink->lock from the beginning of devlink_dpipe_table_register()
devlink_dpipe_table_find() should be called under either
rcu_read_lock() or devlink->lock. devlink_dpipe_table_register()
calls devlink_dpipe_table_find() without holding the lock
and acquires it later. Therefore hold the devlink->lock
from the beginning of devlink_dpipe_table_register().

Suggested-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Madhuparna Bhowmik <madhuparnabhowmik10@gmail.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-23 21:17:37 -08:00
Neal Cardwell
dad8cea7ad tcp: fix TFO SYNACK undo to avoid double-timestamp-undo
In a rare corner case the new logic for undo of SYNACK RTO could
result in triggering the warning in tcp_fastretrans_alert() that says:
        WARN_ON(tp->retrans_out != 0);

The warning looked like:

WARNING: CPU: 1 PID: 1 at net/ipv4/tcp_input.c:2818 tcp_ack+0x13e0/0x3270

The sequence that tickles this bug is:
 - Fast Open server receives TFO SYN with data, sends SYNACK
 - (client receives SYNACK and sends ACK, but ACK is lost)
 - server app sends some data packets
 - (N of the first data packets are lost)
 - server receives client ACK that has a TS ECR matching first SYNACK,
   and also SACKs suggesting the first N data packets were lost
    - server performs TS undo of SYNACK RTO, then immediately
      enters recovery
    - buggy behavior then performed a *second* undo that caused
      the connection to be in CA_Open with retrans_out != 0

Basically, the incoming ACK packet with SACK blocks causes us to first
undo the cwnd reduction from the SYNACK RTO, but then immediately
enters fast recovery, which then makes us eligible for undo again. And
then tcp_rcv_synrecv_state_fastopen() accidentally performs an undo
using a "mash-up" of state from two different loss recovery phases: it
uses the timestamp info from the ACK of the original SYNACK, and the
undo_marker from the fast recovery.

This fix refines the logic to only invoke the tcp_try_undo_loss()
inside tcp_rcv_synrecv_state_fastopen() if the connection is still in
CA_Loss.  If peer SACKs triggered fast recovery, then
tcp_rcv_synrecv_state_fastopen() can't safely undo.

Fixes: 794200d662 ("tcp: undo cwnd on Fast Open spurious SYNACK retransmit")
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-23 17:23:35 -08:00
Paolo Abeni
39f3b41aa7 net: genetlink: return the error code when attribute parsing fails.
Currently if attribute parsing fails and the genl family
does not support parallel operation, the error code returned
by __nlmsg_parse() is discarded by genl_family_rcv_msg_attrs_parse().

Be sure to report the error for all genl families.

Fixes: c10e6cf85e ("net: genetlink: push attrbuf allocation and parsing to a separate function")
Fixes: ab5b526da0 ("net: genetlink: always allocate separate attrs for dumpit ops")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-22 21:58:33 -08:00
Matteo Croce
3e72dfdf82 ipv4: ensure rcu_read_lock() in cipso_v4_error()
Similarly to commit c543cb4a5f ("ipv4: ensure rcu_read_lock() in
ipv4_link_failure()"), __ip_options_compile() must be called under rcu
protection.

Fixes: 3da1ed7ac3 ("net: avoid use IPCB in cipso_v4_error")
Suggested-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: Matteo Croce <mcroce@redhat.com>
Acked-by: Paul Moore <paul@paul-moore.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-22 21:45:55 -08:00
Jozsef Kadlecsik
8af1c6fbd9 netfilter: ipset: Fix forceadd evaluation path
When the forceadd option is enabled, the hash:* types should find and replace
the first entry in the bucket with the new one if there are no reuseable
(deleted or timed out) entries. However, the position index was just not set
to zero and remained the invalid -1 if there were no reuseable entries.

Reported-by: syzbot+6a86565c74ebe30aea18@syzkaller.appspotmail.com
Fixes: 23c42a403a ("netfilter: ipset: Introduction of new commands and protocol version 7")
Signed-off-by: Jozsef Kadlecsik <kadlec@netfilter.org>
2020-02-22 12:13:45 +01:00
Jozsef Kadlecsik
f66ee0410b netfilter: ipset: Fix "INFO: rcu detected stall in hash_xxx" reports
In the case of huge hash:* types of sets, due to the single spinlock of
a set the processing of the whole set under spinlock protection could take
too long.

There were four places where the whole hash table of the set was processed
from bucket to bucket under holding the spinlock:

- During resizing a set, the original set was locked to exclude kernel side
  add/del element operations (userspace add/del is excluded by the
  nfnetlink mutex). The original set is actually just read during the
  resize, so the spinlocking is replaced with rcu locking of regions.
  However, thus there can be parallel kernel side add/del of entries.
  In order not to loose those operations a backlog is added and replayed
  after the successful resize.
- Garbage collection of timed out entries was also protected by the spinlock.
  In order not to lock too long, region locking is introduced and a single
  region is processed in one gc go. Also, the simple timer based gc running
  is replaced with a workqueue based solution. The internal book-keeping
  (number of elements, size of extensions) is moved to region level due to
  the region locking.
- Adding elements: when the max number of the elements is reached, the gc
  was called to evict the timed out entries. The new approach is that the gc
  is called just for the matching region, assuming that if the region
  (proportionally) seems to be full, then the whole set does. We could scan
  the other regions to check every entry under rcu locking, but for huge
  sets it'd mean a slowdown at adding elements.
- Listing the set header data: when the set was defined with timeout
  support, the garbage collector was called to clean up timed out entries
  to get the correct element numbers and set size values. Now the set is
  scanned to check non-timed out entries, without actually calling the gc
  for the whole set.

Thanks to Florian Westphal for helping me to solve the SOFTIRQ-safe ->
SOFTIRQ-unsafe lock order issues during working on the patch.

Reported-by: syzbot+4b0e9d4ff3cf117837e5@syzkaller.appspotmail.com
Reported-by: syzbot+c27b8d5010f45c666ed1@syzkaller.appspotmail.com
Reported-by: syzbot+68a806795ac89df3aa1c@syzkaller.appspotmail.com
Fixes: 23c42a403a ("netfilter: ipset: Introduction of new commands and protocol version 7")
Signed-off-by: Jozsef Kadlecsik <kadlec@netfilter.org>
2020-02-22 12:00:06 +01:00
Linus Torvalds
3dc55dba67 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from David Miller:

 1) Limit xt_hashlimit hash table size to avoid OOM or hung tasks, from
    Cong Wang.

 2) Fix deadlock in xsk by publishing global consumer pointers when NAPI
    is finished, from Magnus Karlsson.

 3) Set table field properly to RT_TABLE_COMPAT when necessary, from
    Jethro Beekman.

 4) NLA_STRING attributes are not necessary NULL terminated, deal wiht
    that in IFLA_ALT_IFNAME. From Eric Dumazet.

 5) Fix checksum handling in atlantic driver, from Dmitry Bezrukov.

 6) Handle mtu==0 devices properly in wireguard, from Jason A.
    Donenfeld.

 7) Fix several lockdep warnings in bonding, from Taehee Yoo.

 8) Fix cls_flower port blocking, from Jason Baron.

 9) Sanitize internal map names in libbpf, from Toke Høiland-Jørgensen.

10) Fix RDMA race in qede driver, from Michal Kalderon.

11) Fix several false lockdep warnings by adding conditions to
    list_for_each_entry_rcu(), from Madhuparna Bhowmik.

12) Fix sleep in atomic in mlx5 driver, from Huy Nguyen.

13) Fix potential deadlock in bpf_map_do_batch(), from Yonghong Song.

14) Hey, variables declared in switch statement before any case
    statements are not initialized. I learn something every day. Get
    rids of this stuff in several parts of the networking, from Kees
    Cook.

* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (99 commits)
  bnxt_en: Issue PCIe FLR in kdump kernel to cleanup pending DMAs.
  bnxt_en: Improve device shutdown method.
  net: netlink: cap max groups which will be considered in netlink_bind()
  net: thunderx: workaround BGX TX Underflow issue
  ionic: fix fw_status read
  net: disable BRIDGE_NETFILTER by default
  net: macb: Properly handle phylink on at91rm9200
  s390/qeth: fix off-by-one in RX copybreak check
  s390/qeth: don't warn for napi with 0 budget
  s390/qeth: vnicc Fix EOPNOTSUPP precedence
  openvswitch: Distribute switch variables for initialization
  net: ip6_gre: Distribute switch variables for initialization
  net: core: Distribute switch variables for initialization
  udp: rehash on disconnect
  net/tls: Fix to avoid gettig invalid tls record
  bpf: Fix a potential deadlock with bpf_map_do_batch
  bpf: Do not grab the bucket spinlock by default on htab batch ops
  ice: Wait for VF to be reset/ready before configuration
  ice: Don't tell the OS that link is going down
  ice: Don't reject odd values of usecs set by user
  ...
2020-02-21 11:59:51 -08:00
Andrei Otcheretianski
0daa63ed4c mac80211: Remove a redundant mutex unlock
The below-mentioned commit changed the code to unlock *inside*
the function, but previously the unlock was *outside*. It failed
to remove the outer unlock, however, leading to double unlock.

Fix this.

Fixes: 33483a6b88 ("mac80211: fix missing unlock on error in ieee80211_mark_sta_auth()")
Signed-off-by: Andrei Otcheretianski <andrei.otcheretianski@intel.com>
Link: https://lore.kernel.org/r/20200221104719.cce4741cf6eb.I671567b185c8a4c2409377e483fd149ce590f56d@changeid
[rewrite commit message to better explain what happened]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-02-21 10:59:36 +01:00
Johannes Berg
a7ee7d44b5 cfg80211: check reg_rule for NULL in handle_channel_custom()
We may end up with a NULL reg_rule after the loop in
handle_channel_custom() if the bandwidth didn't fit,
check if this is the case and bail out if so.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Link: https://lore.kernel.org/r/20200221104449.3b558a50201c.I4ad3725c4dacaefd2d18d3cc65ba6d18acd5dbfe@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-02-21 10:59:27 +01:00
Johannes Berg
9951ebfcdf nl80211: fix potential leak in AP start
If nl80211_parse_he_obss_pd() fails, we leak the previously
allocated ACL memory. Free it in this case.

Fixes: 796e90f42b ("cfg80211: add support for parsing OBBS_PD attributes")
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Link: https://lore.kernel.org/r/20200221104142.835aba4cdd14.I1923b55ba9989c57e13978f91f40bfdc45e60cbd@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-02-21 10:50:26 +01:00
Nikolay Aleksandrov
3a20773bee net: netlink: cap max groups which will be considered in netlink_bind()
Since nl_groups is a u32 we can't bind more groups via ->bind
(netlink_bind) call, but netlink has supported more groups via
setsockopt() for a long time and thus nlk->ngroups could be over 32.
Recently I added support for per-vlan notifications and increased the
groups to 33 for NETLINK_ROUTE which exposed an old bug in the
netlink_bind() code causing out-of-bounds access on archs where unsigned
long is 32 bits via test_bit() on a local variable. Fix this by capping the
maximum groups in netlink_bind() to BITS_PER_TYPE(u32), effectively
capping them at 32 which is the minimum of allocated groups and the
maximum groups which can be bound via netlink_bind().

CC: Christophe Leroy <christophe.leroy@c-s.fr>
CC: Richard Guy Briggs <rgb@redhat.com>
Fixes: 4f52090052 ("netlink: have netlink per-protocol bind function return an error code.")
Reported-by: Erhard F. <erhard_f@mailbox.org>
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-20 16:02:08 -08:00
Roman Kiryanov
98bda63e20 net: disable BRIDGE_NETFILTER by default
The description says 'If unsure, say N.' but
the module is built as M by default (once
the dependencies are satisfied).

When the module is selected (Y or M), it enables
NETFILTER_FAMILY_BRIDGE and SKB_EXTENSIONS
which alter kernel internal structures.

We (Android Studio Emulator) currently do not
use this module and think this it is more consistent
to have it disabled by default as opposite to
disabling it explicitly to prevent enabling
NETFILTER_FAMILY_BRIDGE and SKB_EXTENSIONS.

Signed-off-by: Roman Kiryanov <rkir@google.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-20 15:02:02 -08:00
Kees Cook
16a556eeb7 openvswitch: Distribute switch variables for initialization
Variables declared in a switch statement before any case statements
cannot be automatically initialized with compiler instrumentation (as
they are not part of any execution flow). With GCC's proposed automatic
stack variable initialization feature, this triggers a warning (and they
don't get initialized). Clang's automatic stack variable initialization
(via CONFIG_INIT_STACK_ALL=y) doesn't throw a warning, but it also
doesn't initialize such variables[1]. Note that these warnings (or silent
skipping) happen before the dead-store elimination optimization phase,
so even when the automatic initializations are later elided in favor of
direct initializations, the warnings remain.

To avoid these problems, move such variables into the "case" where
they're used or lift them up into the main function body.

net/openvswitch/flow_netlink.c: In function ‘validate_set’:
net/openvswitch/flow_netlink.c:2711:29: warning: statement will never be executed [-Wswitch-unreachable]
 2711 |  const struct ovs_key_ipv4 *ipv4_key;
      |                             ^~~~~~~~

[1] https://bugs.llvm.org/show_bug.cgi?id=44916

Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-20 10:00:19 -08:00
Kees Cook
46d30cb104 net: ip6_gre: Distribute switch variables for initialization
Variables declared in a switch statement before any case statements
cannot be automatically initialized with compiler instrumentation (as
they are not part of any execution flow). With GCC's proposed automatic
stack variable initialization feature, this triggers a warning (and they
don't get initialized). Clang's automatic stack variable initialization
(via CONFIG_INIT_STACK_ALL=y) doesn't throw a warning, but it also
doesn't initialize such variables[1]. Note that these warnings (or silent
skipping) happen before the dead-store elimination optimization phase,
so even when the automatic initializations are later elided in favor of
direct initializations, the warnings remain.

To avoid these problems, move such variables into the "case" where
they're used or lift them up into the main function body.

net/ipv6/ip6_gre.c: In function ‘ip6gre_err’:
net/ipv6/ip6_gre.c:440:32: warning: statement will never be executed [-Wswitch-unreachable]
  440 |   struct ipv6_tlv_tnl_enc_lim *tel;
      |                                ^~~

net/ipv6/ip6_tunnel.c: In function ‘ip6_tnl_err’:
net/ipv6/ip6_tunnel.c:520:32: warning: statement will never be executed [-Wswitch-unreachable]
  520 |   struct ipv6_tlv_tnl_enc_lim *tel;
      |                                ^~~

[1] https://bugs.llvm.org/show_bug.cgi?id=44916

Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-20 10:00:19 -08:00
Kees Cook
161d179261 net: core: Distribute switch variables for initialization
Variables declared in a switch statement before any case statements
cannot be automatically initialized with compiler instrumentation (as
they are not part of any execution flow). With GCC's proposed automatic
stack variable initialization feature, this triggers a warning (and they
don't get initialized). Clang's automatic stack variable initialization
(via CONFIG_INIT_STACK_ALL=y) doesn't throw a warning, but it also
doesn't initialize such variables[1]. Note that these warnings (or silent
skipping) happen before the dead-store elimination optimization phase,
so even when the automatic initializations are later elided in favor of
direct initializations, the warnings remain.

To avoid these problems, move such variables into the "case" where
they're used or lift them up into the main function body.

net/core/skbuff.c: In function ‘skb_checksum_setup_ip’:
net/core/skbuff.c:4809:7: warning: statement will never be executed [-Wswitch-unreachable]
 4809 |   int err;
      |       ^~~

[1] https://bugs.llvm.org/show_bug.cgi?id=44916

Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-20 10:00:19 -08:00
David S. Miller
41f57cfde1 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Alexei Starovoitov says:

====================
pull-request: bpf 2020-02-19

The following pull-request contains BPF updates for your *net* tree.

We've added 10 non-merge commits during the last 10 day(s) which contain
a total of 10 files changed, 93 insertions(+), 31 deletions(-).

The main changes are:

1) batched bpf hashtab fixes from Brian and Yonghong.

2) various selftests and libbpf fixes.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-19 16:42:35 -08:00
Willem de Bruijn
303d0403b8 udp: rehash on disconnect
As of the below commit, udp sockets bound to a specific address can
coexist with one bound to the any addr for the same port.

The commit also phased out the use of socket hashing based only on
port (hslot), in favor of always hashing on {addr, port} (hslot2).

The change broke the following behavior with disconnect (AF_UNSPEC):

    server binds to 0.0.0.0:1337
    server connects to 127.0.0.1:80
    server disconnects
    client connects to 127.0.0.1:1337
    client sends "hello"
    server reads "hello"	// times out, packet did not find sk

On connect the server acquires a specific source addr suitable for
routing to its destination. On disconnect it reverts to the any addr.

The connect call triggers a rehash to a different hslot2. On
disconnect, add the same to return to the original hslot2.

Skip this step if the socket is going to be unhashed completely.

Fixes: 4cdeeee925 ("net: udp: prefer listeners bound to an address")
Reported-by: Pavel Roskin <plroskin@gmail.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-19 16:34:11 -08:00
Rohit Maheshwari
06f5201c63 net/tls: Fix to avoid gettig invalid tls record
Current code doesn't check if tcp sequence number is starting from (/after)
1st record's start sequnce number. It only checks if seq number is before
1st record's end sequnce number. This problem will always be a possibility
in re-transmit case. If a record which belongs to a requested seq number is
already deleted, tls_get_record will start looking into list and as per the
check it will look if seq number is before the end seq of 1st record, which
will always be true and will return 1st record always, it should in fact
return NULL.
As part of the fix, start looking each record only if the sequence number
lies in the list else return NULL.
There is one more check added, driver look for the start marker record to
handle tcp packets which are before the tls offload start sequence number,
hence return 1st record if the record is tls start marker and seq number is
before the 1st record's starting sequence number.

Fixes: e8f6979981 ("net/tls: Add generic NIC offload infrastructure")
Signed-off-by: Rohit Maheshwari <rohitm@chelsio.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-19 16:32:06 -08:00
Madhuparna Bhowmik
33c4acbe2f bridge: br_stp: Use built-in RCU list checking
list_for_each_entry_rcu() has built-in RCU and lock checking.

Pass cond argument to list_for_each_entry_rcu() to silence
false lockdep warning when CONFIG_PROVE_RCU_LIST is enabled
by default.

Signed-off-by: Madhuparna Bhowmik <madhuparnabhowmik10@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-19 11:13:43 -08:00
Amol Grover
a7a9456e8d net: hsr: Pass lockdep expression to RCU lists
node_db is traversed using list_for_each_entry_rcu
outside an RCU read-side critical section but under the protection
of hsr->list_lock.

Hence, add corresponding lockdep expression to silence false-positive
warnings, and harden RCU lists.

Signed-off-by: Amol Grover <frextrite@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-19 10:55:42 -08:00
David S. Miller
7c8c1697c7 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf
Pablo Neira Ayuso says:

====================
Netfilter fixes for net

This batch contains Netfilter fixes for net:

1) Restrict hashlimit size to 1048576, from Cong Wang.

2) Check for offload flags from nf_flow_table_offload_setup(),
   this fixes a crash in case the hardware offload is disabled.
   From Florian Westphal.

3) Three preparation patches to extend the conntrack clash resolution,
   from Florian.

4) Extend clash resolution to deal with DNS packets from the same flow
   racing to set up the NAT configuration.

5) Small documentation fix in pipapo, from Stefano Brivio.

6) Remove misleading unlikely() from pipapo_refill(), also from Stefano.

7) Reduce hashlimit mutex scope, from Cong Wang. This patch is actually
   triggering another problem, still under discussion, another patch to
   fix this will follow up.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-18 15:44:13 -08:00
Stefano Brivio
9a7712048f netfilter: nft_set_pipapo: Don't abuse unlikely() in pipapo_refill()
I originally used unlikely() in the if (match_only) clause, which
we hit on the mapping table for the last field in a set, to ensure
we avoid branching to the rest of for loop body, which is executed
more frequently.

However, Pablo reports, this is confusing as it gives the impression
that this is not a common case, and it's actually not the intended
usage of unlikely().

I couldn't observe any statistical difference in matching rates on
x864_64 and aarch64 without it, so just drop it.

Reported-by: Pablo Neira Ayuso <pablo@netfilter.org>
Fixes: 3c4287f620 ("nf_tables: Add set type for arbitrary concatenation of ranges")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-02-18 22:07:09 +01:00
Stefano Brivio
bd97ad51a7 netfilter: nft_set_pipapo: Fix mapping table example in comments
In both insertion and lookup examples, the two element pointers
of rule mapping tables were swapped. Fix that.

Reported-by: Pablo Neira Ayuso <pablo@netfilter.org>
Fixes: 3c4287f620 ("nf_tables: Add set type for arbitrary concatenation of ranges")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-02-18 22:06:54 +01:00
Madhuparna Bhowmik
a2cfb96cc3 flow_table.c: Use built-in RCU list checking
hlist_for_each_entry_rcu() has built-in RCU and lock checking.

Pass cond argument to list_for_each_entry_rcu() to silence
false lockdep warning when CONFIG_PROVE_RCU_LIST is enabled
by default.

Signed-off-by: Madhuparna Bhowmik <madhuparnabhowmik10@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-18 12:46:27 -08:00
Madhuparna Bhowmik
53742e69e8 datapath.c: Use built-in RCU list checking
hlist_for_each_entry_rcu() has built-in RCU and lock checking.

Pass cond argument to list_for_each_entry_rcu() to silence
false lockdep warning when CONFIG_PROVE_RCU_LIST is enabled
by default.

Signed-off-by: Madhuparna Bhowmik <madhuparnabhowmik10@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-18 12:46:27 -08:00
Madhuparna Bhowmik
fed48423f1 vport.c: Use built-in RCU list checking
hlist_for_each_entry_rcu() has built-in RCU and lock checking.

Pass cond argument to list_for_each_entry_rcu() to silence
false lockdep warning when CONFIG_PROVE_RCU_LIST is enabled
by default.

Signed-off-by: Madhuparna Bhowmik <madhuparnabhowmik10@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-18 12:46:27 -08:00
Madhuparna Bhowmik
7790614616 meter.c: Use built-in RCU list checking
hlist_for_each_entry_rcu() has built-in RCU and lock checking.

Pass cond argument to list_for_each_entry_rcu() to silence
false lockdep warning when CONFIG_PROVE_RCU_LIST is enabled
by default.

Signed-off-by: Madhuparna Bhowmik <madhuparnabhowmik10@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-18 12:46:26 -08:00
Madhuparna Bhowmik
9facfdb546 netlabel_domainhash.c: Use built-in RCU list checking
list_for_each_entry_rcu() has built-in RCU and lock checking.

Pass cond argument to list_for_each_entry_rcu() to silence
false lockdep warning when CONFIG_PROVE_RCU_LIST is enabled
by default.

Signed-off-by: Madhuparna Bhowmik <madhuparnabhowmik10@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-18 12:44:23 -08:00
Madhuparna Bhowmik
8c70c3d728 net: netlabel: Use built-in RCU list checking
list_for_each_entry_rcu() has built-in RCU and lock checking.

Pass cond argument to list_for_each_entry_rcu() to silence
false lockdep warning when CONFIG_PROVE_RCU_LIST is enabled
by default.

Signed-off-by: Madhuparna Bhowmik <madhuparnabhowmik10@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-18 12:43:06 -08:00
Paolo Abeni
379349e9bc Revert "net: dev: introduce support for sch BYPASS for lockless qdisc"
This reverts commit ba27b4cdaa

Ahmed reported ouf-of-order issues bisected to commit ba27b4cdaa
("net: dev: introduce support for sch BYPASS for lockless qdisc").
I can't find any working solution other than a plain revert.

This will introduce some minor performance regressions for
pfifo_fast qdisc. I plan to address them in net-next with more
indirect call wrapper boilerplate for qdiscs.

Reported-by: Ahmad Fatoum <a.fatoum@pengutronix.de>
Fixes: ba27b4cdaa ("net: dev: introduce support for sch BYPASS for lockless qdisc")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-18 12:37:45 -08:00
Florian Westphal
d99bfed58d mptcp: fix bogus socket flag values
Dan Carpenter reports static checker warnings due to bogus BIT() usage:

net/mptcp/subflow.c:571 subflow_write_space() warn: test_bit() takes a bit number
net/mptcp/subflow.c:694 subflow_state_change() warn: test_bit() takes a bit number
net/mptcp/protocol.c:261 ssk_check_wmem() warn: test_bit() takes a bit number
[..]

This is harmless (we use bits 1 & 2 instead of 0 and 1), but would
break eventually when adding BIT(5) (or 6, depends on size of 'long').

Just use 0 and 1, the values are only passed to test/set/clear_bit
functions.

Fixes: 648ef4b886 ("mptcp: Implement MPTCP receive path")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-18 12:05:53 -08:00
Sven Eckelmann
8e8ce08198 batman-adv: Don't schedule OGM for disabled interface
A transmission scheduling for an interface which is currently dropped by
batadv_iv_ogm_iface_disable could still be in progress. The B.A.T.M.A.N. V
is simply cancelling the workqueue item in an synchronous way but this is
not possible with B.A.T.M.A.N. IV because the OGM submissions are
intertwined.

Instead it has to stop submitting the OGM when it detect that the buffer
pointer is set to NULL.

Reported-by: syzbot+a98f2016f40b9cd3818a@syzkaller.appspotmail.com
Reported-by: syzbot+ac36b6a33c28a491e929@syzkaller.appspotmail.com
Fixes: c6c8fea297 ("net: Add batman-adv meshing protocol")
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Cc: Hillf Danton <hdanton@sina.com>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2020-02-18 09:07:55 +01:00
Xin Long
245709ec8b sctp: move the format error check out of __sctp_sf_do_9_1_abort
When T2 timer is to be stopped, the asoc should also be deleted,
otherwise, there will be no chance to call sctp_association_free
and the asoc could last in memory forever.

However, in sctp_sf_shutdown_sent_abort(), after adding the cmd
SCTP_CMD_TIMER_STOP for T2 timer, it may return error due to the
format error from __sctp_sf_do_9_1_abort() and miss adding
SCTP_CMD_ASSOC_FAILED where the asoc will be deleted.

This patch is to fix it by moving the format error check out of
__sctp_sf_do_9_1_abort(), and do it before adding the cmd
SCTP_CMD_TIMER_STOP for T2 timer.

Thanks Hangbin for reporting this issue by the fuzz testing.

v1->v2:
  - improve the comment in the code as Marcelo's suggestion.

Fixes: 96ca468b86 ("sctp: check invalid value of length parameter in error cause")
Reported-by: Hangbin Liu <liuhangbin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-17 21:56:03 -08:00
Jason Baron
8a9093c798 net: sched: correct flower port blocking
tc flower rules that are based on src or dst port blocking are sometimes
ineffective due to uninitialized stack data. __skb_flow_dissect() extracts
ports from the skb for tc flower to match against. However, the port
dissection is not done when when the FLOW_DIS_IS_FRAGMENT bit is set in
key_control->flags. All callers of __skb_flow_dissect(), zero-out the
key_control field except for fl_classify() as used by the flower
classifier. Thus, the FLOW_DIS_IS_FRAGMENT may be set on entry to
__skb_flow_dissect(), since key_control is allocated on the stack
and may not be initialized.

Since key_basic and key_control are present for all flow keys, let's
make sure they are initialized.

Fixes: 62230715fd ("flow_dissector: do not dissect l4 ports for fragments")
Co-developed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Jason Baron <jbaron@akamai.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-17 21:33:28 -08:00
Florian Westphal
6a757c07e5 netfilter: conntrack: allow insertion of clashing entries
This patch further relaxes the need to drop an skb due to a clash with
an existing conntrack entry.

Current clash resolution handles the case where the clash occurs between
two identical entries (distinct nf_conn objects with same tuples), i.e.:

                    Original                        Reply
existing: 10.2.3.4:42 -> 10.8.8.8:53      10.2.3.4:42 <- 10.0.0.6:5353
clashing: 10.2.3.4:42 -> 10.8.8.8:53      10.2.3.4:42 <- 10.0.0.6:5353

... existing handling will discard the unconfirmed clashing entry and
makes skb->_nfct point to the existing one.  The skb can then be
processed normally just as if the clash would not have existed in the
first place.

For other clashes, the skb needs to be dropped.
This frequently happens with DNS resolvers that send A and AAAA queries
back-to-back when NAT rules are present that cause packets to get
different DNAT transformations applied, for example:

-m statistics --mode random ... -j DNAT --dnat-to 10.0.0.6:5353
-m statistics --mode random ... -j DNAT --dnat-to 10.0.0.7:5353

In this case the A or AAAA query is dropped which incurs a costly
delay during name resolution.

This patch also allows this collision type:
                       Original                   Reply
existing: 10.2.3.4:42 -> 10.8.8.8:53      10.2.3.4:42 <- 10.0.0.6:5353
clashing: 10.2.3.4:42 -> 10.8.8.8:53      10.2.3.4:42 <- 10.0.0.7:5353

In this case, clash is in original direction -- the reply direction
is still unique.

The change makes it so that when the 2nd colliding packet is received,
the clashing conntrack is tagged with new IPS_NAT_CLASH_BIT, gets a fixed
1 second timeout and is inserted in the reply direction only.

The entry is hidden from 'conntrack -L', it will time out quickly
and it can be early dropped because it will never progress to the
ASSURED state.

To avoid special-casing the delete code path to special case
the ORIGINAL hlist_nulls node, a new helper, "hlist_nulls_add_fake", is
added so hlist_nulls_del() will work.

Example:

      CPU A:                               CPU B:
1.  10.2.3.4:42 -> 10.8.8.8:53 (A)
2.                                         10.2.3.4:42 -> 10.8.8.8:53 (AAAA)
3.  Apply DNAT, reply changed to 10.0.0.6
4.                                         10.2.3.4:42 -> 10.8.8.8:53 (AAAA)
5.                                         Apply DNAT, reply changed to 10.0.0.7
6. confirm/commit to conntrack table, no collisions
7.                                         commit clashing entry

Reply comes in:

10.2.3.4:42 <- 10.0.0.6:5353 (A)
 -> Finds a conntrack, DNAT is reversed & packet forwarded to 10.2.3.4:42
10.2.3.4:42 <- 10.0.0.7:5353 (AAAA)
 -> Finds a conntrack, DNAT is reversed & packet forwarded to 10.2.3.4:42
    The conntrack entry is deleted from table, as it has the NAT_CLASH
    bit set.

In case of a retransmit from ORIGINAL dir, all further packets will get
the DNAT transformation to 10.0.0.6.

I tried to come up with other solutions but they all have worse
problems.

Alternatives considered were:
1.  Confirm ct entries at allocation time, not in postrouting.
 a. will cause uneccesarry work when the skb that creates the
    conntrack is dropped by ruleset.
 b. in case nat is applied, ct entry would need to be moved in
    the table, which requires another spinlock pair to be taken.
 c. breaks the 'unconfirmed entry is private to cpu' assumption:
    we would need to guard all nfct->ext allocation requests with
    ct->lock spinlock.

2. Make the unconfirmed list a hash table instead of a pcpu list.
   Shares drawback c) of the first alternative.

3. Document this is expected and force users to rearrange their
   ruleset (e.g. by using "-m cluster" instead of "-m statistics").
   nft has the 'jhash' expression which can be used instead of 'numgen'.

   Major drawback: doesn't fix what I consider a bug, not very realistic
   and I believe its reasonable to have the existing rulesets to 'just
   work'.

4. Document this is expected and force users to steer problematic
   packets to the same CPU -- this would serialize the "allocate new
   conntrack entry/nat table evaluation/perform nat/confirm entry", so
   no race can occur.  Similar drawback to 3.

Another advantage of this patch compared to 1) and 2) is that there are
no changes to the hot path; things are handled in the udp tracker and
the clash resolution path.

Cc: rcu@vger.kernel.org
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Jozsef Kadlecsik <kadlec@netfilter.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-02-17 10:55:14 +01:00
Randy Dunlap
8955b4357d skbuff: remove stale bit mask comments
Remove stale comments since this flag is no longer a bit mask
but is a bit field.

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-16 19:49:51 -08:00
Matthieu Baerts
357b41caf9 mptcp: select CRYPTO
Without this modification and if CRYPTO is not selected, we have this
warning:

  WARNING: unmet direct dependencies detected for CRYPTO_LIB_SHA256
    Depends on [n]: CRYPTO [=n]
    Selected by [y]:
    - MPTCP [=y] && NET [=y] && INET [=y]

MPTCP selects CRYPTO_LIB_SHA256 which seems to depend on CRYPTO. CRYPTO
is now selected to avoid this issue.

Even though the config system prints that warning, it looks like
sha256.c is compiled and linked even without CONFIG_CRYPTO. Since MPTCP
will end up needing CONFIG_CRYPTO anyway in future commits -- currently
in preparation for net-next -- we propose to add it now to fix the
warning.

The dependency in the config system comes from the fact that
CRYPTO_LIB_SHA256 is defined in "lib/crypto/Kconfig" which is sourced
from "crypto/Kconfig" only if CRYPTO is selected.

Fixes: 65492c5a6a (mptcp: move from sha1 (v0) to sha256 (v1))
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-16 19:37:16 -08:00
Taehee Yoo
7151affeef net: export netdev_next_lower_dev_rcu()
netdev_next_lower_dev_rcu() will be used to implement a function,
which is to walk all lower interfaces.
There are already functions that they walk their lower interface.
(netdev_walk_all_lower_dev_rcu, netdev_walk_all_lower_dev()).
But, there would be cases that couldn't be covered by given
netdev_walk_all_lower_dev_{rcu}() function.
So, some modules would want to implement own function,
which is to walk all lower interfaces.

In the next patch, netdev_next_lower_dev_rcu() will be used.
In addition, this patch removes two unused prototypes in netdevice.h.

Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-16 19:32:11 -08:00
Michal Kubecek
6699170376 ethtool: fix application of verbose no_mask bitset
A bitset without mask in a _SET request means we want exactly the bits in
the bitset to be set. This works correctly for compact format but when
verbose format is parsed, ethnl_update_bitset32_verbose() only sets the
bits present in the request bitset but does not clear the rest. This can
cause incorrect results like

  lion:~ # ethtool eth0 | grep Wake
          Supports Wake-on: pumbg
          Wake-on: g
  lion:~ # ethtool -s eth0 wol u
  lion:~ # ethtool eth0 | grep Wake
          Supports Wake-on: pumbg
          Wake-on: ug

when the second ethtool command issues request

ETHTOOL_MSG_WOL_SET
    ETHTOOL_A_WOL_HEADER
        ETHTOOL_A_HEADER_DEV_NAME = "eth0"
    ETHTOOL_A_WOL_MODES
        ETHTOOL_A_BITSET_NOMASK
        ETHTOOL_A_BITSET_BITS
            ETHTOOL_A_BITSET_BITS_BIT
                ETHTOOL_BITSET_BIT_INDEX = 1

Fix the logic by clearing the whole target bitmap before we start iterating
through the request bits.

Fixes: 10b518d4e6 ("ethtool: netlink bitset handling")
Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-16 19:28:51 -08:00
Mat Martineau
b6e4a1aeeb mptcp: Protect subflow socket options before connection completes
Userspace should not be able to directly manipulate subflow socket
options before a connection is established since it is not yet known if
it will be an MPTCP subflow or a TCP fallback subflow. TCP fallback
subflows can be more directly controlled by userspace because they are
regular TCP connections, while MPTCP subflow sockets need to be
configured for the specific needs of MPTCP. Use the same logic as
sendmsg/recvmsg to ensure that socket option calls are only passed
through to known TCP fallback subflows.

Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-16 19:20:18 -08:00
Eric Dumazet
e08ad80551 net: add strict checks in netdev_name_node_alt_destroy()
netdev_name_node_alt_destroy() does a lookup over all
device names of a namespace.

We need to make sure the name belongs to the device
of interest, and that we do not destroy its primary
name, since we rely on it being not deleted :
dev->name_node would indeed point to freed memory.

syzbot report was the following :

BUG: KASAN: use-after-free in dev_net include/linux/netdevice.h:2206 [inline]
BUG: KASAN: use-after-free in mld_force_mld_version net/ipv6/mcast.c:1172 [inline]
BUG: KASAN: use-after-free in mld_in_v2_mode_only net/ipv6/mcast.c:1180 [inline]
BUG: KASAN: use-after-free in mld_in_v1_mode+0x203/0x230 net/ipv6/mcast.c:1190
Read of size 8 at addr ffff88809886c588 by task swapper/1/0

CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.6.0-rc1-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 <IRQ>
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x197/0x210 lib/dump_stack.c:118
 print_address_description.constprop.0.cold+0xd4/0x30b mm/kasan/report.c:374
 __kasan_report.cold+0x1b/0x32 mm/kasan/report.c:506
 kasan_report+0x12/0x20 mm/kasan/common.c:641
 __asan_report_load8_noabort+0x14/0x20 mm/kasan/generic_report.c:135
 dev_net include/linux/netdevice.h:2206 [inline]
 mld_force_mld_version net/ipv6/mcast.c:1172 [inline]
 mld_in_v2_mode_only net/ipv6/mcast.c:1180 [inline]
 mld_in_v1_mode+0x203/0x230 net/ipv6/mcast.c:1190
 mld_send_initial_cr net/ipv6/mcast.c:2083 [inline]
 mld_dad_timer_expire+0x24/0x230 net/ipv6/mcast.c:2118
 call_timer_fn+0x1ac/0x780 kernel/time/timer.c:1404
 expire_timers kernel/time/timer.c:1449 [inline]
 __run_timers kernel/time/timer.c:1773 [inline]
 __run_timers kernel/time/timer.c:1740 [inline]
 run_timer_softirq+0x6c3/0x1790 kernel/time/timer.c:1786
 __do_softirq+0x262/0x98c kernel/softirq.c:292
 invoke_softirq kernel/softirq.c:373 [inline]
 irq_exit+0x19b/0x1e0 kernel/softirq.c:413
 exiting_irq arch/x86/include/asm/apic.h:546 [inline]
 smp_apic_timer_interrupt+0x1a3/0x610 arch/x86/kernel/apic/apic.c:1146
 apic_timer_interrupt+0xf/0x20 arch/x86/entry/entry_64.S:829
 </IRQ>
RIP: 0010:native_safe_halt+0xe/0x10 arch/x86/include/asm/irqflags.h:61
Code: 68 73 c5 f9 eb 8a cc cc cc cc cc cc e9 07 00 00 00 0f 00 2d 94 be 59 00 f4 c3 66 90 e9 07 00 00 00 0f 00 2d 84 be 59 00 fb f4 <c3> cc 55 48 89 e5 41 57 41 56 41 55 41 54 53 e8 de 2a 74 f9 e8 09
RSP: 0018:ffffc90000d3fd68 EFLAGS: 00000282 ORIG_RAX: ffffffffffffff13
RAX: 1ffffffff136761a RBX: ffff8880a99fc340 RCX: 0000000000000000
RDX: dffffc0000000000 RSI: 0000000000000006 RDI: ffff8880a99fcbd4
RBP: ffffc90000d3fd98 R08: ffff8880a99fc340 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: dffffc0000000000
R13: ffffffff8aa5a1c0 R14: 0000000000000000 R15: 0000000000000001
 arch_cpu_idle+0xa/0x10 arch/x86/kernel/process.c:686
 default_idle_call+0x84/0xb0 kernel/sched/idle.c:94
 cpuidle_idle_call kernel/sched/idle.c:154 [inline]
 do_idle+0x3c8/0x6e0 kernel/sched/idle.c:269
 cpu_startup_entry+0x1b/0x20 kernel/sched/idle.c:361
 start_secondary+0x2f4/0x410 arch/x86/kernel/smpboot.c:264
 secondary_startup_64+0xa4/0xb0 arch/x86/kernel/head_64.S:242

Allocated by task 10229:
 save_stack+0x23/0x90 mm/kasan/common.c:72
 set_track mm/kasan/common.c:80 [inline]
 __kasan_kmalloc mm/kasan/common.c:515 [inline]
 __kasan_kmalloc.constprop.0+0xcf/0xe0 mm/kasan/common.c:488
 kasan_kmalloc+0x9/0x10 mm/kasan/common.c:529
 __do_kmalloc_node mm/slab.c:3616 [inline]
 __kmalloc_node+0x4e/0x70 mm/slab.c:3623
 kmalloc_node include/linux/slab.h:578 [inline]
 kvmalloc_node+0x68/0x100 mm/util.c:574
 kvmalloc include/linux/mm.h:645 [inline]
 kvzalloc include/linux/mm.h:653 [inline]
 alloc_netdev_mqs+0x98/0xe40 net/core/dev.c:9797
 rtnl_create_link+0x22d/0xaf0 net/core/rtnetlink.c:3047
 __rtnl_newlink+0xf9f/0x1790 net/core/rtnetlink.c:3309
 rtnl_newlink+0x69/0xa0 net/core/rtnetlink.c:3377
 rtnetlink_rcv_msg+0x45e/0xaf0 net/core/rtnetlink.c:5438
 netlink_rcv_skb+0x177/0x450 net/netlink/af_netlink.c:2477
 rtnetlink_rcv+0x1d/0x30 net/core/rtnetlink.c:5456
 netlink_unicast_kernel net/netlink/af_netlink.c:1302 [inline]
 netlink_unicast+0x59e/0x7e0 net/netlink/af_netlink.c:1328
 netlink_sendmsg+0x91c/0xea0 net/netlink/af_netlink.c:1917
 sock_sendmsg_nosec net/socket.c:652 [inline]
 sock_sendmsg+0xd7/0x130 net/socket.c:672
 __sys_sendto+0x262/0x380 net/socket.c:1998
 __do_compat_sys_socketcall net/compat.c:771 [inline]
 __se_compat_sys_socketcall net/compat.c:719 [inline]
 __ia32_compat_sys_socketcall+0x530/0x710 net/compat.c:719
 do_syscall_32_irqs_on arch/x86/entry/common.c:337 [inline]
 do_fast_syscall_32+0x27b/0xe16 arch/x86/entry/common.c:408
 entry_SYSENTER_compat+0x70/0x7f arch/x86/entry/entry_64_compat.S:139

Freed by task 10229:
 save_stack+0x23/0x90 mm/kasan/common.c:72
 set_track mm/kasan/common.c:80 [inline]
 kasan_set_free_info mm/kasan/common.c:337 [inline]
 __kasan_slab_free+0x102/0x150 mm/kasan/common.c:476
 kasan_slab_free+0xe/0x10 mm/kasan/common.c:485
 __cache_free mm/slab.c:3426 [inline]
 kfree+0x10a/0x2c0 mm/slab.c:3757
 __netdev_name_node_alt_destroy+0x1ff/0x2a0 net/core/dev.c:322
 netdev_name_node_alt_destroy+0x57/0x80 net/core/dev.c:334
 rtnl_alt_ifname net/core/rtnetlink.c:3518 [inline]
 rtnl_linkprop.isra.0+0x575/0x6f0 net/core/rtnetlink.c:3567
 rtnl_dellinkprop+0x46/0x60 net/core/rtnetlink.c:3588
 rtnetlink_rcv_msg+0x45e/0xaf0 net/core/rtnetlink.c:5438
 netlink_rcv_skb+0x177/0x450 net/netlink/af_netlink.c:2477
 rtnetlink_rcv+0x1d/0x30 net/core/rtnetlink.c:5456
 netlink_unicast_kernel net/netlink/af_netlink.c:1302 [inline]
 netlink_unicast+0x59e/0x7e0 net/netlink/af_netlink.c:1328
 netlink_sendmsg+0x91c/0xea0 net/netlink/af_netlink.c:1917
 sock_sendmsg_nosec net/socket.c:652 [inline]
 sock_sendmsg+0xd7/0x130 net/socket.c:672
 ____sys_sendmsg+0x753/0x880 net/socket.c:2343
 ___sys_sendmsg+0x100/0x170 net/socket.c:2397
 __sys_sendmsg+0x105/0x1d0 net/socket.c:2430
 __compat_sys_sendmsg net/compat.c:642 [inline]
 __do_compat_sys_sendmsg net/compat.c:649 [inline]
 __se_compat_sys_sendmsg net/compat.c:646 [inline]
 __ia32_compat_sys_sendmsg+0x7a/0xb0 net/compat.c:646
 do_syscall_32_irqs_on arch/x86/entry/common.c:337 [inline]
 do_fast_syscall_32+0x27b/0xe16 arch/x86/entry/common.c:408
 entry_SYSENTER_compat+0x70/0x7f arch/x86/entry/entry_64_compat.S:139

The buggy address belongs to the object at ffff88809886c000
 which belongs to the cache kmalloc-4k of size 4096
The buggy address is located 1416 bytes inside of
 4096-byte region [ffff88809886c000, ffff88809886d000)
The buggy address belongs to the page:
page:ffffea0002621b00 refcount:1 mapcount:0 mapping:ffff8880aa402000 index:0x0 compound_mapcount: 0
flags: 0xfffe0000010200(slab|head)
raw: 00fffe0000010200 ffffea0002610d08 ffffea0002607608 ffff8880aa402000
raw: 0000000000000000 ffff88809886c000 0000000100000001 0000000000000000
page dumped because: kasan: bad access detected

Memory state around the buggy address:
 ffff88809886c480: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
 ffff88809886c500: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>ffff88809886c580: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                      ^
 ffff88809886c600: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
 ffff88809886c680: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb

Fixes: 36fbf1e52b ("net: rtnetlink: add linkprop commands to add and delete alternative ifnames")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Cc: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-16 19:04:33 -08:00
Eric Dumazet
44bfa9c5e5 net: rtnetlink: fix bugs in rtnl_alt_ifname()
Since IFLA_ALT_IFNAME is an NLA_STRING, we have no
guarantee it is nul terminated.

We should use nla_strdup() instead of kstrdup(), since this
helper will make sure not accessing out-of-bounds data.

BUG: KMSAN: uninit-value in strlen+0x5e/0xa0 lib/string.c:535
CPU: 1 PID: 19157 Comm: syz-executor.5 Not tainted 5.5.0-rc5-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x1c9/0x220 lib/dump_stack.c:118
 kmsan_report+0xf7/0x1e0 mm/kmsan/kmsan_report.c:118
 __msan_warning+0x58/0xa0 mm/kmsan/kmsan_instr.c:215
 strlen+0x5e/0xa0 lib/string.c:535
 kstrdup+0x7f/0x1a0 mm/util.c:59
 rtnl_alt_ifname net/core/rtnetlink.c:3495 [inline]
 rtnl_linkprop+0x85d/0xc00 net/core/rtnetlink.c:3553
 rtnl_newlinkprop+0x9d/0xb0 net/core/rtnetlink.c:3568
 rtnetlink_rcv_msg+0x1153/0x1570 net/core/rtnetlink.c:5424
 netlink_rcv_skb+0x451/0x650 net/netlink/af_netlink.c:2477
 rtnetlink_rcv+0x50/0x60 net/core/rtnetlink.c:5442
 netlink_unicast_kernel net/netlink/af_netlink.c:1302 [inline]
 netlink_unicast+0xf9e/0x1100 net/netlink/af_netlink.c:1328
 netlink_sendmsg+0x1248/0x14d0 net/netlink/af_netlink.c:1917
 sock_sendmsg_nosec net/socket.c:639 [inline]
 sock_sendmsg net/socket.c:659 [inline]
 ____sys_sendmsg+0x12b6/0x1350 net/socket.c:2330
 ___sys_sendmsg net/socket.c:2384 [inline]
 __sys_sendmsg+0x451/0x5f0 net/socket.c:2417
 __do_sys_sendmsg net/socket.c:2426 [inline]
 __se_sys_sendmsg+0x97/0xb0 net/socket.c:2424
 __x64_sys_sendmsg+0x4a/0x70 net/socket.c:2424
 do_syscall_64+0xb8/0x160 arch/x86/entry/common.c:296
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x45b3b9
Code: ad b6 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 7b b6 fb ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007ff1c7b1ac78 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
RAX: ffffffffffffffda RBX: 00007ff1c7b1b6d4 RCX: 000000000045b3b9
RDX: 0000000000000000 RSI: 0000000020000040 RDI: 0000000000000003
RBP: 000000000075bf20 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00000000ffffffff
R13: 00000000000009cb R14: 00000000004cb3dd R15: 000000000075bf2c

Uninit was created at:
 kmsan_save_stack_with_flags mm/kmsan/kmsan.c:144 [inline]
 kmsan_internal_poison_shadow+0x66/0xd0 mm/kmsan/kmsan.c:127
 kmsan_slab_alloc+0x8a/0xe0 mm/kmsan/kmsan_hooks.c:82
 slab_alloc_node mm/slub.c:2774 [inline]
 __kmalloc_node_track_caller+0xb40/0x1200 mm/slub.c:4382
 __kmalloc_reserve net/core/skbuff.c:141 [inline]
 __alloc_skb+0x2fd/0xac0 net/core/skbuff.c:209
 alloc_skb include/linux/skbuff.h:1049 [inline]
 netlink_alloc_large_skb net/netlink/af_netlink.c:1174 [inline]
 netlink_sendmsg+0x7d3/0x14d0 net/netlink/af_netlink.c:1892
 sock_sendmsg_nosec net/socket.c:639 [inline]
 sock_sendmsg net/socket.c:659 [inline]
 ____sys_sendmsg+0x12b6/0x1350 net/socket.c:2330
 ___sys_sendmsg net/socket.c:2384 [inline]
 __sys_sendmsg+0x451/0x5f0 net/socket.c:2417
 __do_sys_sendmsg net/socket.c:2426 [inline]
 __se_sys_sendmsg+0x97/0xb0 net/socket.c:2424
 __x64_sys_sendmsg+0x4a/0x70 net/socket.c:2424
 do_syscall_64+0xb8/0x160 arch/x86/entry/common.c:296
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

Fixes: 36fbf1e52b ("net: rtnetlink: add linkprop commands to add and delete alternative ifnames")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Jiri Pirko <jiri@mellanox.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-16 18:51:59 -08:00
Jethro Beekman
540e585a79 net: fib_rules: Correctly set table field when table number exceeds 8 bits
In 709772e6e0, RT_TABLE_COMPAT was added to
allow legacy software to deal with routing table numbers >= 256, but the
same change to FIB rule queries was overlooked.

Signed-off-by: Jethro Beekman <jethro@fortanix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-16 18:38:24 -08:00
Leon Romanovsky
0d4597c8c5 net/rds: Track user mapped pages through special API
Convert net/rds to use the newly introduces pin_user_pages() API,
which properly sets FOLL_PIN. Setting FOLL_PIN is now required for
code that requires tracking of pinned pages.

Note that this effectively changes the code's behavior: it now
ultimately calls set_page_dirty_lock(), instead of set_page_dirty().
This is probably more accurate.

As Christoph Hellwig put it, "set_page_dirty() is only safe if we are
dealing with a file backed page where we have reference on the inode it
hangs off." [1]

[1] https://lore.kernel.org/r/20190723153640.GB720@lst.de

Cc: Hans Westgaard Ry <hans.westgaard.ry@oracle.com>
Cc: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-16 18:37:09 -08:00
Benjamin Poirier
afecdb376b ipv6: Fix nlmsg_flags when splitting a multipath route
When splitting an RTA_MULTIPATH request into multiple routes and adding the
second and later components, we must not simply remove NLM_F_REPLACE but
instead replace it by NLM_F_CREATE. Otherwise, it may look like the netlink
message was malformed.

For example,
	ip route add 2001:db8::1/128 dev dummy0
	ip route change 2001:db8::1/128 nexthop via fe80::30:1 dev dummy0 \
		nexthop via fe80::30:2 dev dummy0
results in the following warnings:
[ 1035.057019] IPv6: RTM_NEWROUTE with no NLM_F_CREATE or NLM_F_REPLACE
[ 1035.057517] IPv6: NLM_F_CREATE should be set when creating new route

This patch makes the nlmsg sequence look equivalent for __ip6_ins_rt() to
what it would get if the multipath route had been added in multiple netlink
operations:
	ip route add 2001:db8::1/128 dev dummy0
	ip route change 2001:db8::1/128 nexthop via fe80::30:1 dev dummy0
	ip route append 2001:db8::1/128 nexthop via fe80::30:2 dev dummy0

Fixes: 2759647247 ("ipv6: fix ECMP route replacement")
Signed-off-by: Benjamin Poirier <bpoirier@cumulusnetworks.com>
Reviewed-by: Michal Kubecek <mkubecek@suse.cz>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-16 18:34:31 -08:00
Benjamin Poirier
e404b8c7cf ipv6: Fix route replacement with dev-only route
After commit 2759647247 ("ipv6: fix ECMP route replacement") it is no
longer possible to replace an ECMP-able route by a non ECMP-able route.
For example,
	ip route add 2001:db8::1/128 via fe80::1 dev dummy0
	ip route replace 2001:db8::1/128 dev dummy0
does not work as expected.

Tweak the replacement logic so that point 3 in the log of the above commit
becomes:
3. If the new route is not ECMP-able, and no matching non-ECMP-able route
exists, replace matching ECMP-able route (if any) or add the new route.

We can now summarize the entire replace semantics to:
When doing a replace, prefer replacing a matching route of the same
"ECMP-able-ness" as the replace argument. If there is no such candidate,
fallback to the first route found.

Fixes: 2759647247 ("ipv6: fix ECMP route replacement")
Signed-off-by: Benjamin Poirier <bpoirier@cumulusnetworks.com>
Reviewed-by: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-16 18:34:31 -08:00
Linus Torvalds
829e694469 NFS Client Bugfixes for Linux 5.6-rc2
Stable Bugfixes:
 - Fix DMA scatter-gather list mapping imbalance
 
 Other Fixes:
 - Fix directory verifier races
 - Fix races between open and dentry revalidation
 - Fix revalidation of dentries with delegations
 - Fix "cachethis" setting for writes
 - Fix delegation and delegation cred pinning
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAl5HEa4ACgkQ18tUv7Cl
 QOvwAA/9FAIwiE5UU+OrbpJ47IKFay2lfu6ywWVt0Q9i/tDeTRawa9aZDNKmKuYk
 rViVzRhIrmdkhpSVeTl3BdV/U9oXmNQ9ubFxWkQ0TI1j/tkTotT4dA0/kjeampZJ
 dZeW+NtRnFxYtfyV6fV/Bp/GB1RE9tAr1hTzHYEpNfnKxmAgkOeIXsVebuWKvUxz
 XCeRED30Ekt2n1JQqb1m3VAMpYmprF2i9TErAVR8ILt0dkkJD8nNlPogrZwMjzuD
 GtB7BSJ7rIDHQomuF3HVrz+SpMhM530mpsZiVGYCKjChJWaPm12+AZGPtv3UcpAf
 /M2VTa+z8MJDwirzjWogo3OvYAqZGI6KUG0o2t5dKmguQoi1RrDlPbFMLXlxow2e
 5gi/qKFbnBOt6iIV9FffMb50jpnShBE6hQS4/rb2inGmAKYgEXcL+ouBY6308soS
 fnWJSViTQfGbF9uVxGeOYl766zEqlNleclfjKBCNXMC3HOfIslIBpilMDwCoOHeB
 MqYSqmnwAH9ZrqjNmR3Qh2RlcOxN8CTUMhkUW2KLMAgnEAd9WINaYu4u2Y50Fe6j
 yDa6YEAmaNSGVYVofh8qA1R1Zi4tnKlg1/WFdy/SnEQKi8mQqXASGiVu2bXjhlF1
 22cHVXNIjp1PMijZKnJtimuxChygj9xxgzw4Kstd1WCLm2VirAc=
 =Ase9
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-5.6-2' of git://git.linux-nfs.org/projects/anna/linux-nfs

Pull NFS client bugfixes from Anna Schumaker:
 "The only stable fix this time is the DMA scatter-gather list bug fixed
  by Chuck.

  The rest fix up races and refcounting issues that have been found
  during testing.

  Stable fix:
   - fix DMA scatter-gather list mapping imbalance

  The rest:
   - fix directory verifier races
   - fix races between open and dentry revalidation
   - fix revalidation of dentries with delegations
   - fix "cachethis" setting for writes
   - fix delegation and delegation cred pinning"

* tag 'nfs-for-5.6-2' of git://git.linux-nfs.org/projects/anna/linux-nfs:
  NFSv4: Ensure the delegation cred is pinned when we call delegreturn
  NFSv4: Ensure the delegation is pinned in nfs_do_return_delegation()
  NFSv4.1 make cachethis=no for writes
  xprtrdma: Fix DMA scatter-gather list mapping imbalance
  NFSv4: Fix revalidation of dentries with delegations
  NFSv4: Fix races between open and dentry revalidation
  NFS: Fix up directory verifier races
2020-02-14 14:46:11 -08:00
Linus Torvalds
2019fc96af Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from David Miller:

 1) Fix interrupt name truncation in mv88e6xxx dsa driver, from Andrew
    Lunn.

 2) Process generic XDP even if SKB is cloned, from Toke Høiland-Jørgensen.

 3) Fix leak of kernel memory to userspace in smc, from Eric Dumazet.

 4) Add some missing netlink attribute validation to matchall and
    flower, from Davide Caratti.

 5) Send icmp responses properly when NAT has been applied to the frame
    before we get to the tunnel emitting the icmp, from Jason Donenfeld.

 6) Make sure there is enough SKB headroom when adding dsa tags for qca
    and ar9331. From Per Forlin.

* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (62 commits)
  netdevice.h: fix all kernel-doc and Sphinx warnings
  net: dsa: tag_ar9331: Make sure there is headroom for tag
  net: dsa: tag_qca: Make sure there is headroom for tag
  net, ip6_tunnel: enhance tunnel locate with link check
  net/smc: no peer ID in CLC decline for SMCD
  net/smc: transfer fasync_list in case of fallback
  net: hns3: fix a copying IPv6 address error in hclge_fd_get_flow_tuples()
  net: hns3: fix VF bandwidth does not take effect in some case
  net: hns3: add management table after IMP reset
  mac80211: fix wrong 160/80+80 MHz setting
  cfg80211: add missing policy for NL80211_ATTR_STATUS_CODE
  xfrm: interface: use icmp_ndo_send helper
  wireguard: device: use icmp_ndo_send helper
  sunvnet: use icmp_ndo_send helper
  gtp: use icmp_ndo_send helper
  icmp: introduce helper for nat'd source address in network device context
  net/sched: flower: add missing validation of TCA_FLOWER_FLAGS
  net/sched: matchall: add missing validation of TCA_MATCHALL_FLAGS
  net/flow_dissector: remove unexist field description
  page_pool: refill page when alloc.count of pool is zero
  ...
2020-02-14 12:40:38 -08:00
Per Forlin
ddc9abaf5d net: dsa: tag_ar9331: Make sure there is headroom for tag
Passing tag size to skb_cow_head will make sure
there is enough headroom for the tag data.
This change does not introduce any overhead in case there
is already available headroom for tag.

Signed-off-by: Per Forlin <perfn@axis.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-14 07:34:51 -08:00
Per Forlin
04fb91243a net: dsa: tag_qca: Make sure there is headroom for tag
Passing tag size to skb_cow_head will make sure
there is enough headroom for the tag data.
This change does not introduce any overhead in case there
is already available headroom for tag.

Signed-off-by: Per Forlin <perfn@axis.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-14 07:34:51 -08:00
William Dauchy
5fdcce211b net, ip6_tunnel: enhance tunnel locate with link check
With ipip, it is possible to create an extra interface explicitly
attached to a given physical interface:

  # ip link show tunl0
  4: tunl0@NONE: <NOARP> mtu 1480 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ipip 0.0.0.0 brd 0.0.0.0
  # ip link add tunl1 type ipip dev eth0
  # ip link show tunl1
  6: tunl1@eth0: <NOARP> mtu 1480 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ipip 0.0.0.0 brd 0.0.0.0

But it is not possible with ip6tnl:

  # ip link show ip6tnl0
  5: ip6tnl0@NONE: <NOARP> mtu 1452 qdisc noop state DOWN mode DEFAULT group default qlen 1000
      link/tunnel6 :: brd ::
  # ip link add ip6tnl1 type ip6tnl dev eth0
  RTNETLINK answers: File exists

This patch aims to make it possible by adding link comparaison in both
tunnel locate and lookup functions; we also modify mtu calculation when
attached to an interface with a lower mtu.

This permits to make use of x-netns communication by moving the newly
created tunnel in a given netns.

Signed-off-by: William Dauchy <w.dauchy@criteo.com>
Reviewed-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-14 07:31:48 -08:00
David S. Miller
b32cb6fcf1 Just a few fixes:
* avoid running out of tracking space for frames that need
    to be reported to userspace by using more bits
  * fix beacon handling suppression by adding some relevant
    elements to the CRC calculation
  * fix quiet mode in action frames
  * fix crash in ethtool for virt_wifi and similar
  * add a missing policy entry
  * fix 160 & 80+80 bandwidth to take local capabilities into
    account
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEH1e1rEeCd0AIMq6MB8qZga/fl8QFAl5GZQ8ACgkQB8qZga/f
 l8TQYQ//fcrDPndtwJWE5IaXJq7OmnM2NsYED6r1gVK7CWeVhqfwES6x9LVbhW/B
 QPwqFfL3i2II8XFVG/+BCUcEhp5FZQKGEDXaYcWXqQCuc6mU7ZKBuaEJmznMoez1
 6b96vdkVa78QnNK0QjSjAMua9KaBCIS2bPP/0E5o/8T9YsRKhuC4ItpLHYAGLsfM
 1QzU8ZBr/Ug4XIx28lDYOs8MHF5jm8dUgSpB4WgSQUKGTmgvyO/SefGtaJQ36R9B
 tAVR3Mmstm00ufsXT3OhiqXgN6bjUENvM+osVjD6E1eqjtPlrUYTCO8b/ZC1wean
 4G077eE186uao+9xQ8cLk7sL6ZJnlRwJ15RdBB5bsLZ8nG2FpclKCMxpcxc4PL/M
 oIXjEU8uNhSAVDmI3c2k/sxRW0hdQoblgl8PQMZ3k/6nvpwnIrG8yoeAwP4B457Y
 vgiRE212YkOaXhtwrg1EuOIhk1N3YRG8DpSdDrKlj+y0LOijTrHCnqoUwnne+RpX
 2Zqm5Sz3R2Nel9iIVs1JOxIGvyg+KIoM+WjmmrjMSqlrvMMisq8O3BuMfgd6jd+t
 NuZE9MkukBPMiaB4QDMDgBcaTMrLRItD+SSSQcMvkORTfGrpjUhpiQTVwOPXCYqZ
 793BU8kI+gwrz9UfloYxSv2+2H12a6LDQY6Q/mJVKAuhdw6SXsE=
 =L+Ik
 -----END PGP SIGNATURE-----

Merge tag 'mac80211-for-net-2020-02-14' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211

Johannes Berg says:

====================
Just a few fixes:
 * avoid running out of tracking space for frames that need
   to be reported to userspace by using more bits
 * fix beacon handling suppression by adding some relevant
   elements to the CRC calculation
 * fix quiet mode in action frames
 * fix crash in ethtool for virt_wifi and similar
 * add a missing policy entry
 * fix 160 & 80+80 bandwidth to take local capabilities into
   account
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-14 07:16:08 -08:00
Ursula Braun
369537c970 net/smc: no peer ID in CLC decline for SMCD
Just SMCR requires a CLC Peer ID, but not SMCD. The field should be
zero for SMCD.

Fixes: c758dfddc1 ("net/smc: add SMC-D support in CLC messages")
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-14 07:09:36 -08:00
Ursula Braun
67f562e3e1 net/smc: transfer fasync_list in case of fallback
SMC does not work together with FASTOPEN. If sendmsg() is called with
flag MSG_FASTOPEN in SMC_INIT state, the SMC-socket switches to
fallback mode. To handle the previous ioctl FIOASYNC call correctly
in this case, it is necessary to transfer the socket wait queue
fasync_list to the internal TCP socket.

Reported-by: syzbot+4b1fe8105f8044a26162@syzkaller.appspotmail.com
Fixes: ee9dfbef02 ("net/smc: handle sockopts forcing fallback")
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-14 07:09:36 -08:00
Shay Bar
33181ea7f5 mac80211: fix wrong 160/80+80 MHz setting
Before this patch, STA's would set new width of 160/80+80 MHz based on AP capability only.
This is wrong because STA may not support > 80MHz BW.
Fix is to verify STA has 160/80+80 MHz capability before increasing its width to > 80MHz.

The "support_80_80" and "support_160" setting is based on:
"Table 9-272 — Setting of the Supported Channel Width Set subfield and Extended NSS BW
Support subfield at a STA transmitting the VHT Capabilities Information field"
From "Draft P802.11REVmd_D3.0.pdf"

Signed-off-by: Aviad Brikman <aviad.brikman@celeno.com>
Signed-off-by: Shay Bar <shay.bar@celeno.com>
Link: https://lore.kernel.org/r/20200210130728.23674-1-shay.bar@celeno.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-02-14 09:54:12 +01:00
Sergey Matyukevich
ea75080110 cfg80211: add missing policy for NL80211_ATTR_STATUS_CODE
The nl80211_policy is missing for NL80211_ATTR_STATUS_CODE attribute.
As a result, for strictly validated commands, it's assumed to not be
supported.

Signed-off-by: Sergey Matyukevich <sergey.matyukevich.os@quantenna.com>
Link: https://lore.kernel.org/r/20200213131608.10541-2-sergey.matyukevich.os@quantenna.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-02-14 09:50:37 +01:00
Jason A. Donenfeld
45942ba890 xfrm: interface: use icmp_ndo_send helper
Because xfrmi is calling icmp from network device context, it should use
the ndo helper so that the rate limiting applies correctly.

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Cc: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-13 14:19:00 -08:00
Jason A. Donenfeld
0b41713b60 icmp: introduce helper for nat'd source address in network device context
This introduces a helper function to be called only by network drivers
that wraps calls to icmp[v6]_send in a conntrack transformation, in case
NAT has been used. We don't want to pollute the non-driver path, though,
so we introduce this as a helper to be called by places that actually
make use of this, as suggested by Florian.

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Cc: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-13 14:19:00 -08:00
Davide Caratti
e2debf0852 net/sched: flower: add missing validation of TCA_FLOWER_FLAGS
unlike other classifiers that can be offloaded (i.e. users can set flags
like 'skip_hw' and 'skip_sw'), 'cls_flower' doesn't validate the size of
netlink attribute 'TCA_FLOWER_FLAGS' provided by user: add a proper entry
to fl_policy.

Fixes: 5b33f48842 ("net/flower: Introduce hardware offload support")
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-13 14:16:35 -08:00
Davide Caratti
1afa3cc90f net/sched: matchall: add missing validation of TCA_MATCHALL_FLAGS
unlike other classifiers that can be offloaded (i.e. users can set flags
like 'skip_hw' and 'skip_sw'), 'cls_matchall' doesn't validate the size
of netlink attribute 'TCA_MATCHALL_FLAGS' provided by user: add a proper
entry to mall_policy.

Fixes: b87f7936a9 ("net/sched: Add match-all classifier hw offloading.")
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-13 14:16:35 -08:00
Li RongQing
304db6cb76 page_pool: refill page when alloc.count of pool is zero
"do {} while" in page_pool_refill_alloc_cache will always
refill page once whether refill is true or false, and whether
alloc.count of pool is less than PP_ALLOC_CACHE_REFILL or not
this is wrong, and will cause overflow of pool->alloc.cache

the caller of __page_pool_get_cached should provide guarantee
that pool->alloc.cache is safe to access, so in_serving_softirq
should be removed as suggested by Jesper Dangaard Brouer in
https://patchwork.ozlabs.org/patch/1233713/

so fix this issue by calling page_pool_refill_alloc_cache()
only when pool->alloc.count is zero

Fixes: 44768decb7 ("page_pool: handle page recycle for NUMA_NO_NODE condition")
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Suggested-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-13 14:11:51 -08:00
Chuck Lever
ca1c671302 xprtrdma: Fix DMA scatter-gather list mapping imbalance
The @nents value that was passed to ib_dma_map_sg() has to be passed
to the matching ib_dma_unmap_sg() call. If ib_dma_map_sg() choses to
concatenate sg entries, it will return a different nents value than
it was passed.

The bug was exposed by recent changes to the AMD IOMMU driver, which
enabled sg entry concatenation.

Looking all the way back to commit 4143f34e01 ("xprtrdma: Port to
new memory registration API") and reviewing other kernel ULPs, it's
not clear that the frwr_map() logic was ever correct for this case.

Reported-by: Andre Tomt <andre@tomt.net>
Suggested-by: Robin Murphy <robin.murphy@arm.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Cc: stable@vger.kernel.org
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-02-13 15:35:33 -05:00
Eric Dumazet
457fed775c net/smc: fix leak of kernel memory to user space
As nlmsg_put() does not clear the memory that is reserved,
it this the caller responsability to make sure all of this
memory will be written, in order to not reveal prior content.

While we are at it, we can provide the socket cookie even
if clsock is not set.

syzbot reported :

BUG: KMSAN: uninit-value in __arch_swab32 arch/x86/include/uapi/asm/swab.h:10 [inline]
BUG: KMSAN: uninit-value in __fswab32 include/uapi/linux/swab.h:59 [inline]
BUG: KMSAN: uninit-value in __swab32p include/uapi/linux/swab.h:179 [inline]
BUG: KMSAN: uninit-value in __be32_to_cpup include/uapi/linux/byteorder/little_endian.h:82 [inline]
BUG: KMSAN: uninit-value in get_unaligned_be32 include/linux/unaligned/access_ok.h:30 [inline]
BUG: KMSAN: uninit-value in ____bpf_skb_load_helper_32 net/core/filter.c:240 [inline]
BUG: KMSAN: uninit-value in ____bpf_skb_load_helper_32_no_cache net/core/filter.c:255 [inline]
BUG: KMSAN: uninit-value in bpf_skb_load_helper_32_no_cache+0x14a/0x390 net/core/filter.c:252
CPU: 1 PID: 5262 Comm: syz-executor.5 Not tainted 5.5.0-rc5-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x1c9/0x220 lib/dump_stack.c:118
 kmsan_report+0xf7/0x1e0 mm/kmsan/kmsan_report.c:118
 __msan_warning+0x58/0xa0 mm/kmsan/kmsan_instr.c:215
 __arch_swab32 arch/x86/include/uapi/asm/swab.h:10 [inline]
 __fswab32 include/uapi/linux/swab.h:59 [inline]
 __swab32p include/uapi/linux/swab.h:179 [inline]
 __be32_to_cpup include/uapi/linux/byteorder/little_endian.h:82 [inline]
 get_unaligned_be32 include/linux/unaligned/access_ok.h:30 [inline]
 ____bpf_skb_load_helper_32 net/core/filter.c:240 [inline]
 ____bpf_skb_load_helper_32_no_cache net/core/filter.c:255 [inline]
 bpf_skb_load_helper_32_no_cache+0x14a/0x390 net/core/filter.c:252

Uninit was created at:
 kmsan_save_stack_with_flags mm/kmsan/kmsan.c:144 [inline]
 kmsan_internal_poison_shadow+0x66/0xd0 mm/kmsan/kmsan.c:127
 kmsan_kmalloc_large+0x73/0xc0 mm/kmsan/kmsan_hooks.c:128
 kmalloc_large_node_hook mm/slub.c:1406 [inline]
 kmalloc_large_node+0x282/0x2c0 mm/slub.c:3841
 __kmalloc_node_track_caller+0x44b/0x1200 mm/slub.c:4368
 __kmalloc_reserve net/core/skbuff.c:141 [inline]
 __alloc_skb+0x2fd/0xac0 net/core/skbuff.c:209
 alloc_skb include/linux/skbuff.h:1049 [inline]
 netlink_dump+0x44b/0x1ab0 net/netlink/af_netlink.c:2224
 __netlink_dump_start+0xbb2/0xcf0 net/netlink/af_netlink.c:2352
 netlink_dump_start include/linux/netlink.h:233 [inline]
 smc_diag_handler_dump+0x2ba/0x300 net/smc/smc_diag.c:242
 sock_diag_rcv_msg+0x211/0x610 net/core/sock_diag.c:256
 netlink_rcv_skb+0x451/0x650 net/netlink/af_netlink.c:2477
 sock_diag_rcv+0x63/0x80 net/core/sock_diag.c:275
 netlink_unicast_kernel net/netlink/af_netlink.c:1302 [inline]
 netlink_unicast+0xf9e/0x1100 net/netlink/af_netlink.c:1328
 netlink_sendmsg+0x1248/0x14d0 net/netlink/af_netlink.c:1917
 sock_sendmsg_nosec net/socket.c:639 [inline]
 sock_sendmsg net/socket.c:659 [inline]
 kernel_sendmsg+0x433/0x440 net/socket.c:679
 sock_no_sendpage+0x235/0x300 net/core/sock.c:2740
 kernel_sendpage net/socket.c:3776 [inline]
 sock_sendpage+0x1e1/0x2c0 net/socket.c:937
 pipe_to_sendpage+0x38c/0x4c0 fs/splice.c:458
 splice_from_pipe_feed fs/splice.c:512 [inline]
 __splice_from_pipe+0x539/0xed0 fs/splice.c:636
 splice_from_pipe fs/splice.c:671 [inline]
 generic_splice_sendpage+0x1d5/0x2d0 fs/splice.c:844
 do_splice_from fs/splice.c:863 [inline]
 do_splice fs/splice.c:1170 [inline]
 __do_sys_splice fs/splice.c:1447 [inline]
 __se_sys_splice+0x2380/0x3350 fs/splice.c:1427
 __x64_sys_splice+0x6e/0x90 fs/splice.c:1427
 do_syscall_64+0xb8/0x160 arch/x86/entry/common.c:296
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

Fixes: f16a7dd5cf ("smc: netlink interface for SMC sockets")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Ursula Braun <ubraun@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-11 17:04:42 -08:00
Toke Høiland-Jørgensen
ad1e03b2b3 core: Don't skip generic XDP program execution for cloned SKBs
The current generic XDP handler skips execution of XDP programs entirely if
an SKB is marked as cloned. This leads to some surprising behaviour, as
packets can end up being cloned in various ways, which will make an XDP
program not see all the traffic on an interface.

This was discovered by a simple test case where an XDP program that always
returns XDP_DROP is installed on a veth device. When combining this with
the Scapy packet sniffer (which uses an AF_PACKET) socket on the sending
side, SKBs reliably end up in the cloned state, causing them to be passed
through to the receiving interface instead of being dropped. A minimal
reproducer script for this is included below.

This patch fixed the issue by simply triggering the existing linearisation
code for cloned SKBs instead of skipping the XDP program execution. This
behaviour is in line with the behaviour of the native XDP implementation
for the veth driver, which will reallocate and copy the SKB data if the SKB
is marked as shared.

Reproducer Python script (requires BCC and Scapy):

from scapy.all import TCP, IP, Ether, sendp, sniff, AsyncSniffer, Raw, UDP
from bcc import BPF
import time, sys, subprocess, shlex

SKB_MODE = (1 << 1)
DRV_MODE = (1 << 2)
PYTHON=sys.executable

def client():
    time.sleep(2)
    # Sniffing on the sender causes skb_cloned() to be set
    s = AsyncSniffer()
    s.start()

    for p in range(10):
        sendp(Ether(dst="aa:aa:aa:aa:aa:aa", src="cc:cc:cc:cc:cc:cc")/IP()/UDP()/Raw("Test"),
              verbose=False)
        time.sleep(0.1)

    s.stop()
    return 0

def server(mode):
    prog = BPF(text="int dummy_drop(struct xdp_md *ctx) {return XDP_DROP;}")
    func = prog.load_func("dummy_drop", BPF.XDP)
    prog.attach_xdp("a_to_b", func, mode)

    time.sleep(1)

    s = sniff(iface="a_to_b", count=10, timeout=15)
    if len(s):
        print(f"Got {len(s)} packets - should have gotten 0")
        return 1
    else:
        print("Got no packets - as expected")
        return 0

if len(sys.argv) < 2:
    print(f"Usage: {sys.argv[0]} <skb|drv>")
    sys.exit(1)

if sys.argv[1] == "client":
    sys.exit(client())
elif sys.argv[1] == "server":
    mode = SKB_MODE if sys.argv[2] == 'skb' else DRV_MODE
    sys.exit(server(mode))
else:
    try:
        mode = sys.argv[1]
        if mode not in ('skb', 'drv'):
            print(f"Usage: {sys.argv[0]} <skb|drv>")
            sys.exit(1)
        print(f"Running in {mode} mode")

        for cmd in [
                'ip netns add netns_a',
                'ip netns add netns_b',
                'ip -n netns_a link add a_to_b type veth peer name b_to_a netns netns_b',
                # Disable ipv6 to make sure there's no address autoconf traffic
                'ip netns exec netns_a sysctl -qw net.ipv6.conf.a_to_b.disable_ipv6=1',
                'ip netns exec netns_b sysctl -qw net.ipv6.conf.b_to_a.disable_ipv6=1',
                'ip -n netns_a link set dev a_to_b address aa:aa:aa:aa:aa:aa',
                'ip -n netns_b link set dev b_to_a address cc:cc:cc:cc:cc:cc',
                'ip -n netns_a link set dev a_to_b up',
                'ip -n netns_b link set dev b_to_a up']:
            subprocess.check_call(shlex.split(cmd))

        server = subprocess.Popen(shlex.split(f"ip netns exec netns_a {PYTHON} {sys.argv[0]} server {mode}"))
        client = subprocess.Popen(shlex.split(f"ip netns exec netns_b {PYTHON} {sys.argv[0]} client"))

        client.wait()
        server.wait()
        sys.exit(server.returncode)

    finally:
        subprocess.run(shlex.split("ip netns delete netns_a"))
        subprocess.run(shlex.split("ip netns delete netns_b"))

Fixes: d445516966 ("net: xdp: support xdp generic on virtual devices")
Reported-by: Stepan Horacek <shoracek@redhat.com>
Suggested-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-11 17:01:29 -08:00