Commit Graph

50087 Commits

Author SHA1 Message Date
Steffen Klassert
9a3fb9fb84 xfrm: Fix transport mode skb control buffer usage.
A recent commit introduced a new struct xfrm_trans_cb
that is used with the sk_buff control buffer. Unfortunately
it placed the structure in front of the control buffer and
overlooked that the IPv4/IPv6 control buffer is still needed
for some layer 4 protocols. As a result the IPv4/IPv6 control
buffer is overwritten with this structure. Fix this by setting
a apropriate header in front of the structure.

Fixes acf568ee85 ("xfrm: Reinject transport-mode packets ...")
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2018-03-23 07:56:04 +01:00
Stefano Brivio
f8a554b4aa vti6: Fix dev->max_mtu setting
We shouldn't allow a tunnel to have IP_MAX_MTU as MTU, because
another IPv6 header is going on top of our packets. Without this
patch, we might end up building packets bigger than IP_MAX_MTU.

Fixes: b96f9afee4 ("ipv4/6: use core net MTU range checking")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Acked-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2018-03-19 08:45:50 +01:00
Stefano Brivio
7a67e69a33 vti6: Keep set MTU on link creation or change, validate it
In vti6_link_config(), if MTU is already given on link creation
or change, validate and use it instead of recomputing it. To do
that, we need to propagate the knowledge that MTU was set by
userspace all the way down to vti6_link_config().

To keep this simple, vti6_dev_init() sets the new 'keep_mtu'
argument of vti6_link_config() to true: on initialization, we
don't have convenient access to netlink attributes there, but we
will anyway check whether dev->mtu is set in vti6_link_config().
If it's non-zero, it was set to the value of the IFLA_MTU
attribute during creation. Otherwise, determine a reasonable
value.

Fixes: ed1efb2aef ("ipv6: Add support for IPsec virtual tunnel interfaces")
Fixes: 53c81e95df ("ip6_vti: adjust vti mtu according to mtu of lower device")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Acked-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2018-03-19 08:45:50 +01:00
Stefano Brivio
c6741fbed6 vti6: Properly adjust vti6 MTU from MTU of lower device
If a lower device is found, we don't need to subtract
LL_MAX_HEADER to calculate our MTU: just use its MTU, the link
layer headers are already taken into account by it.

If the lower device is not found, start from ETH_DATA_LEN
instead, and only in this case subtract a worst-case
LL_MAX_HEADER.

We then need to subtract our additional IPv6 header from the
calculation.

While at it, note that vti6 doesn't have a hardware header, so
it doesn't need to set dev->hard_header_len. And as
vti6_link_config() now always sets the MTU, there's no need to
set a default value in vti6_dev_setup().

This makes the behaviour consistent with IPv4 vti, after
commit a32452366b ("vti4: Don't count header length twice."),
which was accidentally reverted by merge commit f895f0cfbb
("Merge branch 'master' of
git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec").

While commit 53c81e95df ("ip6_vti: adjust vti mtu according to
mtu of lower device") improved on the original situation, this
was still not ideal. As reported in that commit message itself,
if we start from an underlying veth MTU of 9000, we end up with
an MTU of 8832, that is, 9000 - LL_MAX_HEADER - sizeof(ipv6hdr).
This should simply be 8880, or 9000 - sizeof(ipv6hdr) instead:
we found the lower device (veth) and we know we don't have any
additional link layer header, so there's no need to subtract an
hypothetical worst-case number.

Fixes: 53c81e95df ("ip6_vti: adjust vti mtu according to mtu of lower device")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Acked-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2018-03-19 08:45:50 +01:00
Stefano Brivio
03080e5ec7 vti4: Don't override MTU passed on link creation via IFLA_MTU
Don't hardcode a MTU value on vti tunnel initialization,
ip_tunnel_newlink() is able to deal with this already. See also
commit ffc2b6ee41 ("ip_gre: fix IFLA_MTU ignored on NEWLINK").

Fixes: 1181412c1a ("net/ipv4: VTI support new module for ip_vti.")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Acked-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2018-03-19 08:45:50 +01:00
Stefano Brivio
24fc79798b ip_tunnel: Clamp MTU to bounds on new link
Otherwise, it's possible to specify invalid MTU values directly
on creation of a link (via 'ip link add'). This is already
prevented on subsequent MTU changes by commit b96f9afee4
("ipv4/6: use core net MTU range checking").

Fixes: c544193214 ("GRE: Refactor GRE tunneling code.")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Acked-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2018-03-19 08:45:50 +01:00
Stefano Brivio
dd1df24737 vti4: Don't count header length twice on tunnel setup
This re-introduces the effect of commit a32452366b ("vti4:
Don't count header length twice.") which was accidentally
reverted by merge commit f895f0cfbb ("Merge branch 'master' of
git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec").

The commit message from Steffen Klassert said:

    We currently count the size of LL_MAX_HEADER and struct iphdr
    twice for vti4 devices, this leads to a wrong device mtu.
    The size of LL_MAX_HEADER and struct iphdr is already counted in
    ip_tunnel_bind_dev(), so don't do it again in vti_tunnel_init().

And this is still the case now: ip_tunnel_bind_dev() already
accounts for the header length of the link layer (not
necessarily LL_MAX_HEADER, if the output device is found), plus
one IP header.

For example, with a vti device on top of veth, with MTU of 1500,
the existing implementation would set the initial vti MTU to
1332, accounting once for LL_MAX_HEADER (128, included in
hard_header_len by vti) and twice for the same IP header (once
from hard_header_len, once from ip_tunnel_bind_dev()).

It should instead be 1480, because ip_tunnel_bind_dev() is able
to figure out that the output device is veth, so no additional
link layer header is attached, and will properly count one
single IP header.

The existing issue had the side effect of avoiding PMTUD for
most xfrm policies, by arbitrarily lowering the initial MTU.
However, the only way to get a consistent PMTU value is to let
the xfrm PMTU discovery do its course, and commit d6af1a31cc
("vti: Add pmtu handling to vti_xmit.") now takes care of local
delivery cases where the application ignores local socket
notifications.

Fixes: b9959fd3b0 ("vti: switch to new ip tunnel code")
Fixes: f895f0cfbb ("Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Acked-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2018-03-19 08:45:50 +01:00
Taehee Yoo
46c0ef6e1e xfrm: fix rcu_read_unlock usage in xfrm_local_error
In the xfrm_local_error, rcu_read_unlock should be called when afinfo
is not NULL. because xfrm_state_get_afinfo calls rcu_read_unlock
if afinfo is NULL.

Fixes: af5d27c4e1 ("xfrm: remove xfrm_state_put_afinfo")
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2018-03-16 08:26:02 +01:00
Roman Mashak
51d4740f88 net sched actions: return explicit error when tunnel_key mode is not specified
If set/unset mode of the tunnel_key action is not provided, ->init() still
returns 0, and the caller proceeds with bogus 'struct tc_action *' object,
this results in crash:

% tc actions add action tunnel_key src_ip 1.1.1.1 dst_ip 2.2.2.1 id 7 index 1

[   35.805515] general protection fault: 0000 [#1] SMP PTI
[   35.806161] Modules linked in: act_tunnel_key kvm_intel kvm irqbypass
crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel aes_x86_64
crypto_simd glue_helper cryptd serio_raw
[   35.808233] CPU: 1 PID: 428 Comm: tc Not tainted 4.16.0-rc4+ #286
[   35.808929] RIP: 0010:tcf_action_init+0x90/0x190
[   35.809457] RSP: 0018:ffffb8edc068b9a0 EFLAGS: 00010206
[   35.810053] RAX: 1320c000000a0003 RBX: 0000000000000001 RCX: 0000000000000000
[   35.810866] RDX: 0000000000000070 RSI: 0000000000007965 RDI: ffffb8edc068b910
[   35.811660] RBP: ffffb8edc068b9d0 R08: 0000000000000000 R09: ffffb8edc068b808
[   35.812463] R10: ffffffffc02bf040 R11: 0000000000000040 R12: ffffb8edc068bb38
[   35.813235] R13: 0000000000000000 R14: 0000000000000000 R15: ffffb8edc068b910
[   35.814006] FS:  00007f3d0d8556c0(0000) GS:ffff91d1dbc40000(0000)
knlGS:0000000000000000
[   35.814881] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   35.815540] CR2: 000000000043f720 CR3: 0000000019248001 CR4: 00000000001606a0
[   35.816457] Call Trace:
[   35.817158]  tc_ctl_action+0x11a/0x220
[   35.817795]  rtnetlink_rcv_msg+0x23d/0x2e0
[   35.818457]  ? __slab_alloc+0x1c/0x30
[   35.819079]  ? __kmalloc_node_track_caller+0xb1/0x2b0
[   35.819544]  ? rtnl_calcit.isra.30+0xe0/0xe0
[   35.820231]  netlink_rcv_skb+0xce/0x100
[   35.820744]  netlink_unicast+0x164/0x220
[   35.821500]  netlink_sendmsg+0x293/0x370
[   35.822040]  sock_sendmsg+0x30/0x40
[   35.822508]  ___sys_sendmsg+0x2c5/0x2e0
[   35.823149]  ? pagecache_get_page+0x27/0x220
[   35.823714]  ? filemap_fault+0xa2/0x640
[   35.824423]  ? page_add_file_rmap+0x108/0x200
[   35.825065]  ? alloc_set_pte+0x2aa/0x530
[   35.825585]  ? finish_fault+0x4e/0x70
[   35.826140]  ? __handle_mm_fault+0xbc1/0x10d0
[   35.826723]  ? __sys_sendmsg+0x41/0x70
[   35.827230]  __sys_sendmsg+0x41/0x70
[   35.827710]  do_syscall_64+0x68/0x120
[   35.828195]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[   35.828859] RIP: 0033:0x7f3d0ca4da67
[   35.829331] RSP: 002b:00007ffc9f284338 EFLAGS: 00000246 ORIG_RAX:
000000000000002e
[   35.830304] RAX: ffffffffffffffda RBX: 00007ffc9f284460 RCX: 00007f3d0ca4da67
[   35.831247] RDX: 0000000000000000 RSI: 00007ffc9f2843b0 RDI: 0000000000000003
[   35.832167] RBP: 000000005aa6a7a9 R08: 0000000000000001 R09: 0000000000000000
[   35.833075] R10: 00000000000005f1 R11: 0000000000000246 R12: 0000000000000000
[   35.833997] R13: 00007ffc9f2884c0 R14: 0000000000000001 R15: 0000000000674640
[   35.834923] Code: 24 30 bb 01 00 00 00 45 31 f6 eb 5e 8b 50 08 83 c2 07 83 e2
fc 83 c2 70 49 8b 07 48 8b 40 70 48 85 c0 74 10 48 89 14 24 4c 89 ff <ff> d0 48
8b 14 24 48 01 c2 49 01 d6 45 85 ed 74 05 41 83 47 2c
[   35.837442] RIP: tcf_action_init+0x90/0x190 RSP: ffffb8edc068b9a0
[   35.838291] ---[ end trace a095c06ee4b97a26 ]---

Fixes: d0f6dd8a91 ("net/sched: Introduce act_tunnel_key")
Signed-off-by: Roman Mashak <mrv@mojatatu.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-15 14:43:41 -04:00
Ursula Braun
3d50206759 net/smc: simplify wait when closing listen socket
Closing of a listen socket wakes up kernel_accept() of
smc_tcp_listen_worker(), and then has to wait till smc_tcp_listen_worker()
gives up the internal clcsock. The wait logic introduced with
commit 127f497058 ("net/smc: release clcsock from tcp_listen_worker")
might wait longer than necessary. This patch implements the idea to
implement the wait just with flush_work(), and gets rid of the extra
smc_close_wait_listen_clcsock() function.

Fixes: 127f497058 ("net/smc: release clcsock from tcp_listen_worker")
Reported-by: Hans Wippel <hwippel@linux.vnet.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-15 09:49:13 -04:00
Sabrina Dubroca
d52e5a7e7c ipv4: lock mtu in fnhe when received PMTU < net.ipv4.route.min_pmtu
Prior to the rework of PMTU information storage in commit
2c8cec5c10 ("ipv4: Cache learned PMTU information in inetpeer."),
when a PMTU event advertising a PMTU smaller than
net.ipv4.route.min_pmtu was received, we would disable setting the DF
flag on packets by locking the MTU metric, and set the PMTU to
net.ipv4.route.min_pmtu.

Since then, we don't disable DF, and set PMTU to
net.ipv4.route.min_pmtu, so the intermediate router that has this link
with a small MTU will have to drop the packets.

This patch reestablishes pre-2.6.39 behavior by splitting
rtable->rt_pmtu into a bitfield with rt_mtu_locked and rt_pmtu.
rt_mtu_locked indicates that we shouldn't set the DF bit on that path,
and is checked in ip_dont_fragment().

One possible workaround is to set net.ipv4.route.min_pmtu to a value low
enough to accommodate the lowest MTU encountered.

Fixes: 2c8cec5c10 ("ipv4: Cache learned PMTU information in inetpeer.")
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-14 13:37:36 -04:00
Eric Dumazet
4dcb31d464 net: use skb_to_full_sk() in skb_update_prio()
Andrei Vagin reported a KASAN: slab-out-of-bounds error in
skb_update_prio()

Since SYNACK might be attached to a request socket, we need to
get back to the listener socket.
Since this listener is manipulated without locks, add const
qualifiers to sock_cgroup_prioidx() so that the const can also
be used in skb_update_prio()

Also add the const qualifier to sock_cgroup_classid() for consistency.

Fixes: ca6fb06518 ("tcp: attach SYNACK messages to request sockets instead of listener")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Andrei Vagin <avagin@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-14 12:53:23 -04:00
David S. Miller
d2ddf628e9 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec
Steffen Klassert says:

====================
pull request (net): ipsec 2018-03-13

1) Refuse to insert 32 bit userspace socket policies on 64
   bit systems like we do it for standard policies. We don't
   have a compat layer, so inserting socket policies from
   32 bit userspace will lead to a broken configuration.

2) Make the policy hold queue work without the flowcache.
   Dummy bundles are not chached anymore, so we need to
   generate a new one on each lookup as long as the SAs
   are not yet in place.

3) Fix the validation of the esn replay attribute. The
   The sanity check in verify_replay() is bypassed if
   the XFRM_STATE_ESN flag is not set. Fix this by doing
   the sanity check uncoditionally.
   From Florian Westphal.

4) After most of the dst_entry garbage collection code
   is removed, we may leak xfrm_dst entries as they are
   neither cached nor tracked somewhere. Fix this by
   reusing the 'uncached_list' to track xfrm_dst entries
   too. From Xin Long.

5) Fix a rcu_read_lock/rcu_read_unlock imbalance in
   xfrm_get_tos() From Xin Long.

6) Fix an infinite loop in xfrm_get_dst_nexthop. On
   transport mode we fetch the child dst_entry after
   we continue, so this pointer is never updated.
   Fix this by fetching it before we continue.

7) Fix ESN sequence number gap after IPsec GSO packets.
    We accidentally increment the sequence number counter
    on the xfrm_state by one packet too much in the ESN
    case. Fix this by setting the sequence number to the
    correct value.

8) Reset the ethernet protocol after decapsulation only if a
   mac header was set. Otherwise it breaks configurations
   with TUN devices. From Yossi Kuperman.

9) Fix __this_cpu_read() usage in preemptible code. Use
   this_cpu_read() instead in ipcomp_alloc_tfms().
   From Greg Hackmann.

Please pull or let me know if there are problems.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-13 10:38:07 -04:00
Greg Hackmann
0dcd787602 net: xfrm: use preempt-safe this_cpu_read() in ipcomp_alloc_tfms()
f7c83bcbfa ("net: xfrm: use __this_cpu_read per-cpu helper") added a
__this_cpu_read() call inside ipcomp_alloc_tfms().

At the time, __this_cpu_read() required the caller to either not care
about races or to handle preemption/interrupt issues.  3.15 tightened
the rules around some per-cpu operations, and now __this_cpu_read()
should never be used in a preemptible context.  On 3.15 and later, we
need to use this_cpu_read() instead.

syzkaller reported this leading to the following kernel BUG while
fuzzing sendmsg:

BUG: using __this_cpu_read() in preemptible [00000000] code: repro/3101
caller is ipcomp_init_state+0x185/0x990
CPU: 3 PID: 3101 Comm: repro Not tainted 4.16.0-rc4-00123-g86f84779d8e9 #154
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
Call Trace:
 dump_stack+0xb9/0x115
 check_preemption_disabled+0x1cb/0x1f0
 ipcomp_init_state+0x185/0x990
 ? __xfrm_init_state+0x876/0xc20
 ? lock_downgrade+0x5e0/0x5e0
 ipcomp4_init_state+0xaa/0x7c0
 __xfrm_init_state+0x3eb/0xc20
 xfrm_init_state+0x19/0x60
 pfkey_add+0x20df/0x36f0
 ? pfkey_broadcast+0x3dd/0x600
 ? pfkey_sock_destruct+0x340/0x340
 ? pfkey_seq_stop+0x80/0x80
 ? __skb_clone+0x236/0x750
 ? kmem_cache_alloc+0x1f6/0x260
 ? pfkey_sock_destruct+0x340/0x340
 ? pfkey_process+0x62a/0x6f0
 pfkey_process+0x62a/0x6f0
 ? pfkey_send_new_mapping+0x11c0/0x11c0
 ? mutex_lock_io_nested+0x1390/0x1390
 pfkey_sendmsg+0x383/0x750
 ? dump_sp+0x430/0x430
 sock_sendmsg+0xc0/0x100
 ___sys_sendmsg+0x6c8/0x8b0
 ? copy_msghdr_from_user+0x3b0/0x3b0
 ? pagevec_lru_move_fn+0x144/0x1f0
 ? find_held_lock+0x32/0x1c0
 ? do_huge_pmd_anonymous_page+0xc43/0x11e0
 ? lock_downgrade+0x5e0/0x5e0
 ? get_kernel_page+0xb0/0xb0
 ? _raw_spin_unlock+0x29/0x40
 ? do_huge_pmd_anonymous_page+0x400/0x11e0
 ? __handle_mm_fault+0x553/0x2460
 ? __fget_light+0x163/0x1f0
 ? __sys_sendmsg+0xc7/0x170
 __sys_sendmsg+0xc7/0x170
 ? SyS_shutdown+0x1a0/0x1a0
 ? __do_page_fault+0x5a0/0xca0
 ? lock_downgrade+0x5e0/0x5e0
 SyS_sendmsg+0x27/0x40
 ? __sys_sendmsg+0x170/0x170
 do_syscall_64+0x19f/0x640
 entry_SYSCALL_64_after_hwframe+0x42/0xb7
RIP: 0033:0x7f0ee73dfb79
RSP: 002b:00007ffe14fc15a8 EFLAGS: 00000207 ORIG_RAX: 000000000000002e
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f0ee73dfb79
RDX: 0000000000000000 RSI: 00000000208befc8 RDI: 0000000000000004
RBP: 00007ffe14fc15b0 R08: 00007ffe14fc15c0 R09: 00007ffe14fc15c0
R10: 0000000000000000 R11: 0000000000000207 R12: 0000000000400440
R13: 00007ffe14fc16b0 R14: 0000000000000000 R15: 0000000000000000

Signed-off-by: Greg Hackmann <ghackmann@google.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2018-03-13 07:46:37 +01:00
Florian Fainelli
5a9f8df68e net: dsa: Fix dsa_is_user_port() test inversion
During the conversion to dsa_is_user_port(), a condition ended up being
reversed, which would prevent the creation of any user port when using
the legacy binding and/or platform data, fix that.

Fixes: 4a5b85ffe2 ("net: dsa: use dsa_is_user_port everywhere")
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-12 21:04:55 -04:00
Paolo Abeni
b954f94023 l2tp: fix races with ipv4-mapped ipv6 addresses
The l2tp_tunnel_create() function checks for v4mapped ipv6
sockets and cache that flag, so that l2tp core code can
reusing it at xmit time.

If the socket is provided by the userspace, the connection
status of the tunnel sockets can change between the tunnel
creation and the xmit call, so that syzbot is able to
trigger the following splat:

BUG: KASAN: use-after-free in ip6_dst_idev include/net/ip6_fib.h:192
[inline]
BUG: KASAN: use-after-free in ip6_xmit+0x1f76/0x2260
net/ipv6/ip6_output.c:264
Read of size 8 at addr ffff8801bd949318 by task syz-executor4/23448

CPU: 0 PID: 23448 Comm: syz-executor4 Not tainted 4.16.0-rc4+ #65
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
Google 01/01/2011
Call Trace:
  __dump_stack lib/dump_stack.c:17 [inline]
  dump_stack+0x194/0x24d lib/dump_stack.c:53
  print_address_description+0x73/0x250 mm/kasan/report.c:256
  kasan_report_error mm/kasan/report.c:354 [inline]
  kasan_report+0x23c/0x360 mm/kasan/report.c:412
  __asan_report_load8_noabort+0x14/0x20 mm/kasan/report.c:433
  ip6_dst_idev include/net/ip6_fib.h:192 [inline]
  ip6_xmit+0x1f76/0x2260 net/ipv6/ip6_output.c:264
  inet6_csk_xmit+0x2fc/0x580 net/ipv6/inet6_connection_sock.c:139
  l2tp_xmit_core net/l2tp/l2tp_core.c:1053 [inline]
  l2tp_xmit_skb+0x105f/0x1410 net/l2tp/l2tp_core.c:1148
  pppol2tp_sendmsg+0x470/0x670 net/l2tp/l2tp_ppp.c:341
  sock_sendmsg_nosec net/socket.c:630 [inline]
  sock_sendmsg+0xca/0x110 net/socket.c:640
  ___sys_sendmsg+0x767/0x8b0 net/socket.c:2046
  __sys_sendmsg+0xe5/0x210 net/socket.c:2080
  SYSC_sendmsg net/socket.c:2091 [inline]
  SyS_sendmsg+0x2d/0x50 net/socket.c:2087
  do_syscall_64+0x281/0x940 arch/x86/entry/common.c:287
  entry_SYSCALL_64_after_hwframe+0x42/0xb7
RIP: 0033:0x453e69
RSP: 002b:00007f819593cc68 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
RAX: ffffffffffffffda RBX: 00007f819593d6d4 RCX: 0000000000453e69
RDX: 0000000000000081 RSI: 000000002037ffc8 RDI: 0000000000000004
RBP: 000000000072bea0 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00000000ffffffff
R13: 00000000000004c3 R14: 00000000006f72e8 R15: 0000000000000000

This change addresses the issues:
* explicitly checking for TCP_ESTABLISHED for user space provided sockets
* dropping the v4mapped flag usage - it can become outdated - and
  explicitly invoking ipv6_addr_v4mapped() instead

The issue is apparently there since ancient times.

v1 -> v2: (many thanks to Guillaume)
 - with csum issue introduced in v1
 - replace pr_err with pr_debug
 - fix build issue with IPV6 disabled
 - move l2tp_sk_is_v4mapped in l2tp_core.c

v2 -> v3:
 - don't update inet_daddr for v4mapped address, unneeded
 - drop rendundant check at creation time

Reported-and-tested-by: syzbot+92fa328176eb07e4ac1a@syzkaller.appspotmail.com
Fixes: 3557baabf2 ("[L2TP]: PPP over L2TP driver core")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-12 15:11:09 -04:00
Paolo Abeni
2f987a76a9 net: ipv6: keep sk status consistent after datagram connect failure
On unsuccesful ip6_datagram_connect(), if the failure is caused by
ip6_datagram_dst_update(), the sk peer information are cleared, but
the sk->sk_state is preserved.

If the socket was already in an established status, the overall sk
status is inconsistent and fouls later checks in datagram code.

Fix this saving the old peer information and restoring them in
case of failure. This also aligns ipv6 datagram connect() behavior
with ipv4.

v1 -> v2:
 - added missing Fixes tag

Fixes: 85cb73ff9b ("net: ipv6: reset daddr and dport in sk if connect() fails")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-12 15:10:54 -04:00
David S. Miller
b747594829 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf
Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patchset contains Netfilter fixes for your net tree, they are:

1) Fixed hashtable representation doesn't support timeout flag, skip it
   otherwise rules to add elements from the packet fail bogusly fail with
   EOPNOTSUPP.

2) Fix bogus error with 32-bits ebtables userspace and 64-bits kernel,
   patch from Florian Westphal.

3) Sanitize proc names in several x_tables extensions, also from Florian.

4) Add sanitization to ebt_among wormhash logic, from Florian.

5) Missing release of hook array in flowtable.
====================
2018-03-12 12:49:30 -04:00
Xin Long
bf2ae2e4bf sock_diag: request _diag module only when the family or proto has been registered
Now when using 'ss' in iproute, kernel would try to load all _diag
modules, which also causes corresponding family and proto modules
to be loaded as well due to module dependencies.

Like after running 'ss', sctp, dccp, af_packet (if it works as a module)
would be loaded.

For example:

  $ lsmod|grep sctp
  $ ss
  $ lsmod|grep sctp
  sctp_diag              16384  0
  sctp                  323584  5 sctp_diag
  inet_diag              24576  4 raw_diag,tcp_diag,sctp_diag,udp_diag
  libcrc32c              16384  3 nf_conntrack,nf_nat,sctp

As these family and proto modules are loaded unintentionally, it
could cause some problems, like:

- Some debug tools use 'ss' to collect the socket info, which loads all
  those diag and family and protocol modules. It's noisy for identifying
  issues.

- Users usually expect to drop sctp init packet silently when they
  have no sense of sctp protocol instead of sending abort back.

- It wastes resources (especially with multiple netns), and SCTP module
  can't be unloaded once it's loaded.

...

In short, it's really inappropriate to have these family and proto
modules loaded unexpectedly when just doing debugging with inet_diag.

This patch is to introduce sock_load_diag_module() where it loads
the _diag module only when it's corresponding family or proto has
been already registered.

Note that we can't just load _diag module without the family or
proto loaded, as some symbols used in _diag module are from the
family or proto module.

v1->v2:
  - move inet proto check to inet_diag to avoid a compiling err.
v2->v3:
  - define sock_load_diag_module in sock.c and export one symbol
    only.
  - improve the changelog.

Reported-by: Sabrina Dubroca <sd@queasysnail.net>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: Phil Sutter <phil@nwl.cc>
Acked-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-12 11:03:42 -04:00
zhangliping
ddc502dfed openvswitch: meter: fix the incorrect calculation of max delta_t
Max delat_t should be the full_bucket/rate instead of the full_bucket.
Also report EINVAL if the rate is zero.

Fixes: 96fbc13d7e ("openvswitch: Add meter infrastructure")
Cc: Andy Zhou <azhou@ovn.org>
Signed-off-by: zhangliping <zhangliping02@baidu.com>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-11 22:48:59 -04:00
Pablo Neira Ayuso
c04a3f7300 netfilter: nf_tables: release flowtable hooks
Otherwise we leak this array.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-03-11 21:24:56 +01:00
Florian Westphal
c8d70a700a netfilter: bridge: ebt_among: add more missing match size checks
ebt_among is special, it has a dynamic match size and is exempt
from the central size checks.

commit c4585a2823 ("bridge: ebt_among: add missing match size checks")
added validation for pool size, but missed fact that the macros
ebt_among_wh_src/dst can already return out-of-bound result because
they do not check value of wh_src/dst_ofs (an offset) vs. the size
of the match that userspace gave to us.

v2:
check that offset has correct alignment.
Paolo Abeni points out that we should also check that src/dst
wormhash arrays do not overlap, and src + length lines up with
start of dst (or vice versa).
v3: compact wormhash_sizes_valid() part

NB: Fixes tag is intentionally wrong, this bug exists from day
one when match was added for 2.6 kernel. Tag is there so stable
maintainers will notice this one too.

Tested with same rules from the earlier patch.

Fixes: c4585a2823 ("bridge: ebt_among: add missing match size checks")
Reported-by: <syzbot+bdabab6f1983a03fc009@syzkaller.appspotmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-03-11 21:24:49 +01:00
Florian Westphal
b1d0a5d0cb netfilter: x_tables: add and use xt_check_proc_name
recent and hashlimit both create /proc files, but only check that
name is 0 terminated.

This can trigger WARN() from procfs when name is "" or "/".
Add helper for this and then use it for both.

Cc: Eric Dumazet <eric.dumazet@gmail.com>
Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
Reported-by: <syzbot+0502b00edac2a0680b61@syzkaller.appspotmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-03-11 21:24:29 +01:00
Florian Westphal
932909d9b2 netfilter: ebtables: fix erroneous reject of last rule
The last rule in the blob has next_entry offset that is same as total size.
This made "ebtables32 -A OUTPUT -d de:ad:be:ef:01:02" fail on 64 bit kernel.

Fixes: b718121685 ("netfilter: ebtables: CONFIG_COMPAT: don't trust userland offsets")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-03-11 21:24:00 +01:00
William Tu
e41c7c68ea ip6erspan: make sure enough headroom at xmit.
The patch adds skb_cow_header() to ensure enough headroom
at ip6erspan_tunnel_xmit before pushing the erspan header
to the skb.

Signed-off-by: William Tu <u9012063@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-09 13:03:57 -05:00
William Tu
d6aa71197f ip6erspan: improve error handling for erspan version number.
When users fill in incorrect erspan version number through
the struct erspan_metadata uapi, current code skips pushing
the erspan header but continue pushing the gre header, which
is incorrect.  The patch fixes it by returning error.

Signed-off-by: William Tu <u9012063@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-09 13:03:56 -05:00
William Tu
3b04caab81 ip6gre: add erspan v2 to tunnel lookup
The patch adds the erspan v2 proto in ip6gre_tunnel_lookup
so the erspan v2 tunnel can be found correctly.

Signed-off-by: William Tu <u9012063@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-09 13:03:56 -05:00
Daniel Axtens
1dd27cde30 net: use skb_is_gso_sctp() instead of open-coding
As well as the basic conversion, I noticed that a lot of the
SCTP code checks gso_type without first checking skb_is_gso()
so I have added that where appropriate.

Also, document the helper.

Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: Daniel Axtens <dja@axtens.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-09 11:41:47 -05:00
Eric Dumazet
ca0edb131b ieee802154: 6lowpan: fix possible NULL deref in lowpan_device_event()
A tun device type can trivially be set to arbitrary value using
TUNSETLINK ioctl().

Therefore, lowpan_device_event() must really check that ieee802154_ptr
is not NULL.

Fixes: 2c88b5283f ("ieee802154: 6lowpan: remove check on null")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Alexander Aring <alex.aring@gmail.com>
Cc: Stefan Schmidt <stefan@osg.samsung.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Acked-by: Stefan Schmidt <stefan@osg.samsung.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-09 11:19:26 -05:00
Lorenzo Bianconi
9f62c15f28 ipv6: fix access to non-linear packet in ndisc_fill_redirect_hdr_option()
Fix the following slab-out-of-bounds kasan report in
ndisc_fill_redirect_hdr_option when the incoming ipv6 packet is not
linear and the accessed data are not in the linear data region of orig_skb.

[ 1503.122508] ==================================================================
[ 1503.122832] BUG: KASAN: slab-out-of-bounds in ndisc_send_redirect+0x94e/0x990
[ 1503.123036] Read of size 1184 at addr ffff8800298ab6b0 by task netperf/1932

[ 1503.123220] CPU: 0 PID: 1932 Comm: netperf Not tainted 4.16.0-rc2+ #124
[ 1503.123347] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.10.2-2.fc27 04/01/2014
[ 1503.123527] Call Trace:
[ 1503.123579]  <IRQ>
[ 1503.123638]  print_address_description+0x6e/0x280
[ 1503.123849]  kasan_report+0x233/0x350
[ 1503.123946]  memcpy+0x1f/0x50
[ 1503.124037]  ndisc_send_redirect+0x94e/0x990
[ 1503.125150]  ip6_forward+0x1242/0x13b0
[...]
[ 1503.153890] Allocated by task 1932:
[ 1503.153982]  kasan_kmalloc+0x9f/0xd0
[ 1503.154074]  __kmalloc_track_caller+0xb5/0x160
[ 1503.154198]  __kmalloc_reserve.isra.41+0x24/0x70
[ 1503.154324]  __alloc_skb+0x130/0x3e0
[ 1503.154415]  sctp_packet_transmit+0x21a/0x1810
[ 1503.154533]  sctp_outq_flush+0xc14/0x1db0
[ 1503.154624]  sctp_do_sm+0x34e/0x2740
[ 1503.154715]  sctp_primitive_SEND+0x57/0x70
[ 1503.154807]  sctp_sendmsg+0xaa6/0x1b10
[ 1503.154897]  sock_sendmsg+0x68/0x80
[ 1503.154987]  ___sys_sendmsg+0x431/0x4b0
[ 1503.155078]  __sys_sendmsg+0xa4/0x130
[ 1503.155168]  do_syscall_64+0x171/0x3f0
[ 1503.155259]  entry_SYSCALL_64_after_hwframe+0x42/0xb7

[ 1503.155436] Freed by task 1932:
[ 1503.155527]  __kasan_slab_free+0x134/0x180
[ 1503.155618]  kfree+0xbc/0x180
[ 1503.155709]  skb_release_data+0x27f/0x2c0
[ 1503.155800]  consume_skb+0x94/0xe0
[ 1503.155889]  sctp_chunk_put+0x1aa/0x1f0
[ 1503.155979]  sctp_inq_pop+0x2f8/0x6e0
[ 1503.156070]  sctp_assoc_bh_rcv+0x6a/0x230
[ 1503.156164]  sctp_inq_push+0x117/0x150
[ 1503.156255]  sctp_backlog_rcv+0xdf/0x4a0
[ 1503.156346]  __release_sock+0x142/0x250
[ 1503.156436]  release_sock+0x80/0x180
[ 1503.156526]  sctp_sendmsg+0xbb0/0x1b10
[ 1503.156617]  sock_sendmsg+0x68/0x80
[ 1503.156708]  ___sys_sendmsg+0x431/0x4b0
[ 1503.156799]  __sys_sendmsg+0xa4/0x130
[ 1503.156889]  do_syscall_64+0x171/0x3f0
[ 1503.156980]  entry_SYSCALL_64_after_hwframe+0x42/0xb7

[ 1503.157158] The buggy address belongs to the object at ffff8800298ab600
                which belongs to the cache kmalloc-1024 of size 1024
[ 1503.157444] The buggy address is located 176 bytes inside of
                1024-byte region [ffff8800298ab600, ffff8800298aba00)
[ 1503.157702] The buggy address belongs to the page:
[ 1503.157820] page:ffffea0000a62a00 count:1 mapcount:0 mapping:0000000000000000 index:0x0 compound_mapcount: 0
[ 1503.158053] flags: 0x4000000000008100(slab|head)
[ 1503.158171] raw: 4000000000008100 0000000000000000 0000000000000000 00000001800e000e
[ 1503.158350] raw: dead000000000100 dead000000000200 ffff880036002600 0000000000000000
[ 1503.158523] page dumped because: kasan: bad access detected

[ 1503.158698] Memory state around the buggy address:
[ 1503.158816]  ffff8800298ab900: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 1503.158988]  ffff8800298ab980: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 1503.159165] >ffff8800298aba00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 1503.159338]                    ^
[ 1503.159436]  ffff8800298aba80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 1503.159610]  ffff8800298abb00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 1503.159785] ==================================================================
[ 1503.159964] Disabling lock debugging due to kernel taint

The test scenario to trigger the issue consists of 4 devices:
- H0: data sender, connected to LAN0
- H1: data receiver, connected to LAN1
- GW0 and GW1: routers between LAN0 and LAN1. Both of them have an
  ethernet connection on LAN0 and LAN1
On H{0,1} set GW0 as default gateway while on GW0 set GW1 as next hop for
data from LAN0 to LAN1.
Moreover create an ip6ip6 tunnel between H0 and H1 and send 3 concurrent
data streams (TCP/UDP/SCTP) from H0 to H1 through ip6ip6 tunnel (send
buffer size is set to 16K). While data streams are active flush the route
cache on HA multiple times.
I have not been able to identify a given commit that introduced the issue
since, using the reproducer described above, the kasan report has been
triggered from 4.14 and I have not gone back further.

Reported-by: Jianlin Shi <jishi@redhat.com>
Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Lorenzo Bianconi <lorenzo.bianconi@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-09 11:16:19 -05:00
David S. Miller
cfda06d736 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says:

====================
pull-request: bpf 2018-03-08

The following pull-request contains BPF updates for your *net* tree.

The main changes are:

1) Fix various BPF helpers which adjust the skb and its GSO information
   with regards to SCTP GSO. The latter is a special case where gso_size
   is of value GSO_BY_FRAGS, so mangling that will end up corrupting
   the skb, thus bail out when seeing SCTP GSO packets, from Daniel(s).

2) Fix a compilation error in bpftool where BPF_FS_MAGIC is not defined
   due to too old kernel headers in the system, from Jiri.

3) Increase the number of x64 JIT passes in order to allow larger images
   to converge instead of punting them to interpreter or having them
   rejected when the interpreter is not built into the kernel, from Daniel.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-07 20:27:51 -05:00
Paul Moore
b51f26b146 net: don't unnecessarily load kernel modules in dev_ioctl()
Starting with v4.16-rc1 we've been seeing a higher than usual number
of requests for the kernel to load networking modules, even on events
which shouldn't trigger a module load (e.g. ioctl(TCGETS)).  Stephen
Smalley suggested the problem may lie in commit 44c02a2c3d
("dev_ioctl(): move copyin/copyout to callers") which moves changes
the network dev_ioctl() function to always call dev_load(),
regardless of the requested ioctl.

This patch moves the dev_load() calls back into the individual ioctls
while preserving the rest of the original patch.

Reported-by: Dominick Grift <dac.override@gmail.com>
Suggested-by: Stephen Smalley <sds@tycho.nsa.gov>
Signed-off-by: Paul Moore <paul@paul-moore.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-07 15:12:58 -05:00
Soheil Hassas Yeganeh
e05836ac07 tcp: purge write queue upon aborting the connection
When the connection is aborted, there is no point in
keeping the packets on the write queue until the connection
is closed.

Similar to a27fd7a8ed ('tcp: purge write queue upon RST'),
this is essential for a correct MSG_ZEROCOPY implementation,
because userspace cannot call close(fd) before receiving
zerocopy signals even when the connection is aborted.

Fixes: f214f915e7 ("tcp: enable MSG_ZEROCOPY")
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-07 15:01:03 -05:00
Alexey Kodanev
67f93df79a dccp: check sk for closed state in dccp_sendmsg()
dccp_disconnect() sets 'dp->dccps_hc_tx_ccid' tx handler to NULL,
therefore if DCCP socket is disconnected and dccp_sendmsg() is
called after it, it will cause a NULL pointer dereference in
dccp_write_xmit().

This crash and the reproducer was reported by syzbot. Looks like
it is reproduced if commit 69c64866ce ("dccp: CVE-2017-8824:
use-after-free in DCCP code") is applied.

Reported-by: syzbot+f99ab3887ab65d70f816@syzkaller.appspotmail.com
Signed-off-by: Alexey Kodanev <alexey.kodanev@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-07 13:38:56 -05:00
Eric Dumazet
17cfe79a65 l2tp: do not accept arbitrary sockets
syzkaller found an issue caused by lack of sufficient checks
in l2tp_tunnel_create()

RAW sockets can not be considered as UDP ones for instance.

In another patch, we shall replace all pr_err() by less intrusive
pr_debug() so that syzkaller can find other bugs faster.
Acked-by: Guillaume Nault <g.nault@alphalink.fr>
Acked-by: James Chapman <jchapman@katalix.com>

==================================================================
BUG: KASAN: slab-out-of-bounds in setup_udp_tunnel_sock+0x3ee/0x5f0 net/ipv4/udp_tunnel.c:69
dst_release: dst:00000000d53d0d0f refcnt:-1
Write of size 1 at addr ffff8801d013b798 by task syz-executor3/6242

CPU: 1 PID: 6242 Comm: syz-executor3 Not tainted 4.16.0-rc2+ #253
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:17 [inline]
 dump_stack+0x194/0x24d lib/dump_stack.c:53
 print_address_description+0x73/0x250 mm/kasan/report.c:256
 kasan_report_error mm/kasan/report.c:354 [inline]
 kasan_report+0x23b/0x360 mm/kasan/report.c:412
 __asan_report_store1_noabort+0x17/0x20 mm/kasan/report.c:435
 setup_udp_tunnel_sock+0x3ee/0x5f0 net/ipv4/udp_tunnel.c:69
 l2tp_tunnel_create+0x1354/0x17f0 net/l2tp/l2tp_core.c:1596
 pppol2tp_connect+0x14b1/0x1dd0 net/l2tp/l2tp_ppp.c:707
 SYSC_connect+0x213/0x4a0 net/socket.c:1640
 SyS_connect+0x24/0x30 net/socket.c:1621
 do_syscall_64+0x280/0x940 arch/x86/entry/common.c:287
 entry_SYSCALL_64_after_hwframe+0x42/0xb7

Fixes: fd558d186d ("l2tp: Split pppol2tp patch into separate l2tp and ppp parts")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-07 13:33:59 -05:00
Kirill Tkhai
a560002437 net: Fix hlist corruptions in inet_evict_bucket()
inet_evict_bucket() iterates global list, and
several tasks may call it in parallel. All of
them hash the same fq->list_evictor to different
lists, which leads to list corruption.

This patch makes fq be hashed to expired list
only if this has not been made yet by another
task. Since inet_frag_alloc() allocates fq
using kmem_cache_zalloc(), we may rely on
list_evictor is initially unhashed.

The problem seems to exist before async
pernet_operations, as there was possible to have
exit method to be executed in parallel with
inet_frags::frags_work, so I add two Fixes tags.
This also may go to stable.

Fixes: d1fe19444d "inet: frag: don't re-use chainlist for evictor"
Fixes: f84c6821aa "net: Convert pernet_subsys, registered from inet_init()"
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-07 13:29:28 -05:00
Stefano Brivio
e9fa1495d7 ipv6: Reflect MTU changes on PMTU of exceptions for MTU-less routes
Currently, administrative MTU changes on a given netdevice are
not reflected on route exceptions for MTU-less routes, with a
set PMTU value, for that device:

 # ip -6 route get 2001:db8::b
 2001:db8::b from :: dev vti_a proto kernel src 2001:db8::a metric 256 pref medium
 # ping6 -c 1 -q -s10000 2001:db8::b > /dev/null
 # ip netns exec a ip -6 route get 2001:db8::b
 2001:db8::b from :: dev vti_a src 2001:db8::a metric 0
     cache expires 571sec mtu 4926 pref medium
 # ip link set dev vti_a mtu 3000
 # ip -6 route get 2001:db8::b
 2001:db8::b from :: dev vti_a src 2001:db8::a metric 0
     cache expires 571sec mtu 4926 pref medium
 # ip link set dev vti_a mtu 9000
 # ip -6 route get 2001:db8::b
 2001:db8::b from :: dev vti_a src 2001:db8::a metric 0
     cache expires 571sec mtu 4926 pref medium

The first issue is that since commit fb56be83e4 ("net-ipv6: on
device mtu change do not add mtu to mtu-less routes") we don't
call rt6_exceptions_update_pmtu() from rt6_mtu_change_route(),
which handles administrative MTU changes, if the regular route
is MTU-less.

However, PMTU exceptions should be always updated, as long as
RTAX_MTU is not locked. Keep the check for MTU-less main route,
as introduced by that commit, but, for exceptions,
call rt6_exceptions_update_pmtu() regardless of that check.

Once that is fixed, one problem remains: MTU changes are not
reflected if the new MTU is higher than the previous one,
because rt6_exceptions_update_pmtu() doesn't allow that. We
should instead allow PMTU increase if the old PMTU matches the
local MTU, as that implies that the old MTU was the lowest in the
path, and PMTU discovery might lead to different results.

The existing check in rt6_mtu_change_route() correctly took that
case into account (for regular routes only), so factor it out
and re-use it also in rt6_exceptions_update_pmtu().

While at it, fix comments style and grammar, and try to be a bit
more descriptive.

Reported-by: Xiumei Mu <xmu@redhat.com>
Fixes: fb56be83e4 ("net-ipv6: on device mtu change do not add mtu to mtu-less routes")
Fixes: f5bbe7ee79 ("ipv6: prepare rt6_mtu_change() for exception table")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-07 13:17:50 -05:00
Alexey Kodanev
35d889d10b sch_netem: fix skb leak in netem_enqueue()
When we exceed current packets limit and we have more than one
segment in the list returned by skb_gso_segment(), netem drops
only the first one, skipping the rest, hence kmemleak reports:

unreferenced object 0xffff880b5d23b600 (size 1024):
  comm "softirq", pid 0, jiffies 4384527763 (age 2770.629s)
  hex dump (first 32 bytes):
    00 80 23 5d 0b 88 ff ff 00 00 00 00 00 00 00 00  ..#]............
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace:
    [<00000000d8a19b9d>] __alloc_skb+0xc9/0x520
    [<000000001709b32f>] skb_segment+0x8c8/0x3710
    [<00000000c7b9bb88>] tcp_gso_segment+0x331/0x1830
    [<00000000c921cba1>] inet_gso_segment+0x476/0x1370
    [<000000008b762dd4>] skb_mac_gso_segment+0x1f9/0x510
    [<000000002182660a>] __skb_gso_segment+0x1dd/0x620
    [<00000000412651b9>] netem_enqueue+0x1536/0x2590 [sch_netem]
    [<0000000005d3b2a9>] __dev_queue_xmit+0x1167/0x2120
    [<00000000fc5f7327>] ip_finish_output2+0x998/0xf00
    [<00000000d309e9d3>] ip_output+0x1aa/0x2c0
    [<000000007ecbd3a4>] tcp_transmit_skb+0x18db/0x3670
    [<0000000042d2a45f>] tcp_write_xmit+0x4d4/0x58c0
    [<0000000056a44199>] tcp_tasklet_func+0x3d9/0x540
    [<0000000013d06d02>] tasklet_action+0x1ca/0x250
    [<00000000fcde0b8b>] __do_softirq+0x1b4/0x5a3
    [<00000000e7ed027c>] irq_exit+0x1e2/0x210

Fix it by adding the rest of the segments, if any, to skb 'to_free'
list. Add new __qdisc_drop_all() and qdisc_drop_all() functions
because they can be useful in the future if we need to drop segmented
GSO packets in other places.

Fixes: 6071bd1aa1 ("netem: Segment GSO packets on enqueue")
Signed-off-by: Alexey Kodanev <alexey.kodanev@oracle.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-07 11:18:14 -05:00
David S. Miller
89036a2a98 Merge branch 'for-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth
Johan Hedberg says:

====================
pull request: bluetooth 2018-03-05

Here are a few more Bluetooth fixes for the 4.16 kernel:

 - btusb: reset/resume fixes for Yoga 920 and Dell OptiPlex 3060
 - Fix for missing encryption refresh with the Security Manager protocol
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-07 10:51:14 -05:00
Yossi Kuperman
87cdf3148b xfrm: Verify MAC header exists before overwriting eth_hdr(skb)->h_proto
Artem Savkov reported that commit 5efec5c655 leads to a packet loss under
IPSec configuration. It appears that his setup consists of a TUN device,
which does not have a MAC header.

Make sure MAC header exists.

Note: TUN device sets a MAC header pointer, although it does not have one.

Fixes: 5efec5c655 ("xfrm: Fix eth_hdr(skb)->h_proto to reflect inner IP version")
Reported-by: Artem Savkov <artem.savkov@gmail.com>
Tested-by: Artem Savkov <artem.savkov@gmail.com>
Signed-off-by: Yossi Kuperman <yossiku@mellanox.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2018-03-07 10:54:29 +01:00
David Ahern
2cbb4ea7de net: Only honor ifindex in IP_PKTINFO if non-0
Only allow ifindex from IP_PKTINFO to override SO_BINDTODEVICE settings
if the index is actually set in the message.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-06 16:36:44 -05:00
Pablo Neira Ayuso
0d3601d299 netfilter: nft_set_hash: skip fixed hash if timeout is specified
Fixed hash supports to timeouts, so skip it. Otherwise, userspace hits
EOPNOTSUPP.

Fixes: 6c03ae210c ("netfilter: nft_set_hash: add non-resizable hashtable implementation")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-03-06 12:23:33 +01:00
Linus Torvalds
547046141f Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) Use an appropriate TSQ pacing shift in mac80211, from Toke
    Høiland-Jørgensen.

 2) Just like ipv4's ip_route_me_harder(), we have to use skb_to_full_sk
    in ip6_route_me_harder, from Eric Dumazet.

 3) Fix several shutdown races and similar other problems in l2tp, from
    James Chapman.

 4) Handle missing XDP flush properly in tuntap, for real this time.
    From Jason Wang.

 5) Out-of-bounds access in powerpc ebpf tailcalls, from Daniel
    Borkmann.

 6) Fix phy_resume() locking, from Andrew Lunn.

 7) IFLA_MTU values are ignored on newlink for some tunnel types, fix
    from Xin Long.

 8) Revert F-RTO middle box workarounds, they only handle one dimension
    of the problem. From Yuchung Cheng.

 9) Fix socket refcounting in RDS, from Ka-Cheong Poon.

10) Don't allow ppp unit registration to an unregistered channel, from
    Guillaume Nault.

11) Various hv_netvsc fixes from Stephen Hemminger.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (98 commits)
  hv_netvsc: propagate rx filters to VF
  hv_netvsc: filter multicast/broadcast
  hv_netvsc: defer queue selection to VF
  hv_netvsc: use napi_schedule_irqoff
  hv_netvsc: fix race in napi poll when rescheduling
  hv_netvsc: cancel subchannel setup before halting device
  hv_netvsc: fix error unwind handling if vmbus_open fails
  hv_netvsc: only wake transmit queue if link is up
  hv_netvsc: avoid retry on send during shutdown
  virtio-net: re enable XDP_REDIRECT for mergeable buffer
  ppp: prevent unregistered channels from connecting to PPP units
  tc-testing: skbmod: fix match value of ethertype
  mlxsw: spectrum_switchdev: Check success of FDB add operation
  net: make skb_gso_*_seglen functions private
  net: xfrm: use skb_gso_validate_network_len() to check gso sizes
  net: sched: tbf: handle GSO_BY_FRAGS case in enqueue
  net: rename skb_gso_validate_mtu -> skb_gso_validate_network_len
  rds: Incorrect reference counting in TCP socket creation
  net: ethtool: don't ignore return from driver get_fecparam method
  vrf: check forwarding on the original netdevice when generating ICMP dest unreachable
  ...
2018-03-05 11:29:24 -08:00
Daniel Axtens
a4a77718ee net: make skb_gso_*_seglen functions private
They're very hard to use properly as they do not consider the
GSO_BY_FRAGS case. Code should use skb_gso_validate_network_len
and skb_gso_validate_mac_len as they do consider this case.

Make the seglen functions static, which stops people using them
outside of skbuff.c

Signed-off-by: Daniel Axtens <dja@axtens.net>
Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-04 17:49:17 -05:00
Daniel Axtens
80f5974d15 net: xfrm: use skb_gso_validate_network_len() to check gso sizes
Replace skb_gso_network_seglen() with
skb_gso_validate_network_len(), as it considers the GSO_BY_FRAGS
case.

Signed-off-by: Daniel Axtens <dja@axtens.net>
Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-04 17:49:17 -05:00
Daniel Axtens
ee78bbef8d net: sched: tbf: handle GSO_BY_FRAGS case in enqueue
tbf_enqueue() checks the size of a packet before enqueuing it.
However, the GSO size check does not consider the GSO_BY_FRAGS
case, and so will drop GSO SCTP packets, causing a massive drop
in throughput.

Use skb_gso_validate_mac_len() instead, as it does consider that
case.

Signed-off-by: Daniel Axtens <dja@axtens.net>
Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-04 17:49:17 -05:00
Daniel Axtens
779b7931b2 net: rename skb_gso_validate_mtu -> skb_gso_validate_network_len
If you take a GSO skb, and split it into packets, will the network
length (L3 headers + L4 headers + payload) of those packets be small
enough to fit within a given MTU?

skb_gso_validate_mtu gives you the answer to that question. However,
we recently added to add a way to validate the MAC length of a split GSO
skb (L2+L3+L4+payload), and the names get confusing, so rename
skb_gso_validate_mtu to skb_gso_validate_network_len

Signed-off-by: Daniel Axtens <dja@axtens.net>
Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-04 17:49:17 -05:00
David S. Miller
4e00f5d5f9 Here are some batman-adv bugfixes:
- fix skb checksum issues, by Matthias Schiffer (2 patches)
 
  - fix exception handling when dumping data objects through netlink,
    by Sven Eckelmann (4 patches)
 
  - fix handling of interface indices, by Sven Eckelmann
 -----BEGIN PGP SIGNATURE-----
 
 iQJKBAABCgA0FiEE1ilQI7G+y+fdhnrfoSvjmEKSnqEFAlqZjskWHHN3QHNpbW9u
 d3VuZGVybGljaC5kZQAKCRChK+OYQpKeoZgAEADcXmuvwW+QVU0L6N8ZDV7Titdp
 oIK+w2XDPLa2DcIkJ8ZAGJSK4ZZBrye/jAaRL/1CpBQiGDgP6pGodSihDPNUqCjs
 aqw8Zrwfg6BAvIeEFp3ThkCB89MADk6tMRXM4wZ3daNJvfCgHEUZeSl/1/PWWQmq
 yBn+DMeXZPOPJZgTTPMQUQ11kpf7UxtT4T9N8lq0KH6ISTbaZChPFlpPPIivpG+9
 JLjH40v8BErpIAsZHT2LJ/JPDRPXXul61p3y2sdnVnqUuSUnsc/ElJGLARXNId8P
 tkKtOONlZEwnt9ZrXOLqeqxzN+KFs97JSNIuap/hRfpgbG+Qqn4pePiIQVXNHMUC
 eBUfbGSWNtRDX/JXcwk6O4nMkTX79UDq97npcDv5sTkWLcIeznacqpXJCA8+2E05
 3F34XTDr0HY4QTxDQSEOfQ6qwo7qCxhc8u2Mr8RYNtzSiyRYS3Sy8jdFMMK/RgqT
 ggR8mIdR9MSRYhN0VFVHWWL7NL/OImke9ALvb4KioDbvDTFMFhsa/Pp1+K/NE3fK
 /E5mwicFFUB0MUmRnZiuN9gXPOy06fWOShouMgqF86Ofx4QpmT/Ph/gGK/HgFFKq
 SEfU6UunC3r6BISE4Zu5O7xTEH+AQhEZgyeSTTrTr+HLfNkvyqwGIlhG1HXzyCTx
 KWA50ikvjiKGeOCoMA==
 =tcl9
 -----END PGP SIGNATURE-----

Merge tag 'batadv-net-for-davem-20180302' of git://git.open-mesh.org/linux-merge

Simon Wunderlich says:

====================
Here are some batman-adv bugfixes:

 - fix skb checksum issues, by Matthias Schiffer (2 patches)

 - fix exception handling when dumping data objects through netlink,
   by Sven Eckelmann (4 patches)

 - fix handling of interface indices, by Sven Eckelmann
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-03 23:52:49 -05:00
Daniel Axtens
d02f51cbcf bpf: fix bpf_skb_adjust_net/bpf_skb_proto_xlat to deal with gso sctp skbs
SCTP GSO skbs have a gso_size of GSO_BY_FRAGS, so any sort of
unconditionally mangling of that will result in nonsense value
and would corrupt the skb later on.

Therefore, i) add two helpers skb_increase_gso_size() and
skb_decrease_gso_size() that would throw a one time warning and
bail out for such skbs and ii) refuse and return early with an
error in those BPF helpers that are affected. We do need to bail
out as early as possible from there before any changes on the
skb have been performed.

Fixes: 6578171a7f ("bpf: add bpf_skb_change_proto helper")
Co-authored-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Daniel Axtens <dja@axtens.net>
Cc: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-03-03 13:01:11 -08:00
David S. Miller
4a0c7191c7 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf
Pablo Neira Ayuso says:

====================
Netfilter/IPVS fixes for net

The following patchset contains Netfilter fixes for your net tree,
they are:

1) Put back reference on CLUSTERIP configuration structure from the
   error path, patch from Florian Westphal.

2) Put reference on CLUSTERIP configuration instead of freeing it,
   another cpu may still be walking over it, also from Florian.

3) Refetch pointer to IPv6 header from nf_nat_ipv6_manip_pkt() given
   packet manipulation may reallocation the skbuff header, from Florian.

4) Missing match size sanity checks in ebt_among, from Florian.

5) Convert BUG_ON to WARN_ON in ebtables, from Florian.

6) Sanity check userspace offsets from ebtables kernel, from Florian.

7) Missing checksum replace call in flowtable IPv4 DNAT, from Felix
   Fietkau.

8) Bump the right stats on checksum error from bridge netfilter,
   from Taehee Yoo.

9) Unset interface flag in IPv6 fib lookups otherwise we get
   misleading routing lookup results, from Florian.

10) Missing sk_to_full_sk() in ip6_route_me_harder() from Eric Dumazet.

11) Don't allow devices to be part of multiple flowtables at the same
    time, this may break setups.

12) Missing netlink attribute validation in flowtable deletion.

13) Wrong array index in nf_unregister_net_hook() call from error path
    in flowtable addition path.

14) Fix FTP IPVS helper when NAT mangling is in place, patch from
    Julian Anastasov.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-02 20:32:15 -05:00