linux/net
Jakub Sitnicki 00bc0ef588 ipv6: Skip XFRM lookup if dst_entry in socket cache is valid
At present we perform an xfrm_lookup() for each UDPv6 message we
send. The lookup involves querying the flow cache (flow_cache_lookup)
and, in case of a cache miss, creating an XFRM bundle.

If we miss the flow cache, we can end up creating a new bundle and
deriving the path MTU (xfrm_init_pmtu) from on an already transformed
dst_entry, which we pass from the socket cache (sk->sk_dst_cache) down
to xfrm_lookup(). This can happen only if we're caching the dst_entry
in the socket, that is when we're using a connected UDP socket.

To put it another way, the path MTU shrinks each time we miss the flow
cache, which later on leads to incorrectly fragmented payload. It can
be observed with ESPv6 in transport mode:

  1) Set up a transformation and lower the MTU to trigger fragmentation
    # ip xfrm policy add dir out src ::1 dst ::1 \
      tmpl src ::1 dst ::1 proto esp spi 1
    # ip xfrm state add src ::1 dst ::1 \
      proto esp spi 1 enc 'aes' 0x0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b
    # ip link set dev lo mtu 1500

  2) Monitor the packet flow and set up an UDP sink
    # tcpdump -ni lo -ttt &
    # socat udp6-listen:12345,fork /dev/null &

  3) Send a datagram that needs fragmentation with a connected socket
    # perl -e 'print "@" x 1470 | socat - udp6:[::1]:12345
    2016/06/07 18:52:52 socat[724] E read(3, 0x555bb3d5ba00, 8192): Protocol error
    00:00:00.000000 IP6 ::1 > ::1: frag (0|1448) ESP(spi=0x00000001,seq=0x2), length 1448
    00:00:00.000014 IP6 ::1 > ::1: frag (1448|32)
    00:00:00.000050 IP6 ::1 > ::1: ESP(spi=0x00000001,seq=0x3), length 1272
    (^ ICMPv6 Parameter Problem)
    00:00:00.000022 IP6 ::1 > ::1: ESP(spi=0x00000001,seq=0x5), length 136

  4) Compare it to a non-connected socket
    # perl -e 'print "@" x 1500' | socat - udp6-sendto:[::1]:12345
    00:00:40.535488 IP6 ::1 > ::1: frag (0|1448) ESP(spi=0x00000001,seq=0x6), length 1448
    00:00:00.000010 IP6 ::1 > ::1: frag (1448|64)

What happens in step (3) is:

  1) when connecting the socket in __ip6_datagram_connect(), we
     perform an XFRM lookup, miss the flow cache, create an XFRM
     bundle, and cache the destination,

  2) afterwards, when sending the datagram, we perform an XFRM lookup,
     again, miss the flow cache (due to mismatch of flowi6_iif and
     flowi6_oif, which is an issue of its own), and recreate an XFRM
     bundle based on the cached (and already transformed) destination.

To prevent the recreation of an XFRM bundle, avoid an XFRM lookup
altogether whenever we already have a destination entry cached in the
socket. This prevents the path MTU shrinkage and brings us on par with
UDPv4.

The fix also benefits connected PINGv6 sockets, another user of
ip6_sk_dst_lookup_flow(), who also suffer messages being transformed
twice.

Joint work with Hannes Frederic Sowa.

Reported-by: Jan Tluka <jtluka@redhat.com>
Signed-off-by: Jakub Sitnicki <jkbs@redhat.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-08 11:16:06 -07:00
..
6lowpan 6lowpan: move mac802154 header 2016-04-13 10:41:10 +02:00
9p remove lots of IS_ERR_VALUE abuses 2016-05-27 15:26:11 -07:00
802
8021q vlan: Propagate MAC address to VLANs 2016-05-31 11:56:48 -07:00
appletalk appletalk: fix erroneous return value 2016-02-18 14:59:34 -05:00
atm net/atm: sk_err_soft must be positive 2016-05-23 13:51:10 -07:00
ax25 ax25: add link layer header validation function 2016-03-09 22:13:01 -05:00
batman-adv batman-adv: initialize ELP orig address on secondary interfaces 2016-05-18 11:49:44 +08:00
bluetooth Bluetooth: fix power_on vs close race 2016-05-13 16:50:23 +02:00
bridge bridge: Don't insert unnecessary local fdb entry on changing mac address 2016-06-08 00:31:38 -07:00
caif net: caif: fix misleading indentation 2016-03-14 13:09:50 -04:00
can sock: enable timestamping using control messages 2016-04-04 15:50:30 -04:00
ceph libceph: make ceph_osdc_wait_request() uninterruptible 2016-05-26 01:15:40 +02:00
core net-sysfs: fix missing <linux/of_net.h> 2016-06-08 00:37:58 -07:00
dcb
dccp dccp: do not assume DCCP code is non preemptible 2016-05-02 17:02:25 -04:00
decnet decnet: Do not build routes to devices without decnet private data. 2016-04-10 23:01:30 -04:00
dns_resolver KEYS: Add a facility to restrict new links into a keyring 2016-04-11 22:37:37 +01:00
dsa dsa: Rename switch chip data to cd 2016-05-11 19:36:28 -04:00
ethernet eth: Pull header from first fragment via eth_get_headlen 2016-02-24 13:58:05 -05:00
hsr net/hsr: Use setup_timer and mod_timer. 2016-05-16 14:00:43 -04:00
ieee802154 ieee802154: fix logic error in ieee802154_llsec_parse_dev_addr 2016-05-29 22:36:25 -07:00
ipv4 Possible problem with e6afc8ac ("udp: remove headers from UDP packets before queueing") 2016-06-02 18:29:49 -04:00
ipv6 ipv6: Skip XFRM lookup if dst_entry in socket cache is valid 2016-06-08 11:16:06 -07:00
ipx
irda TTY and Serial driver update for 4.7-rc1 2016-05-20 20:57:27 -07:00
iucv af_iucv: Validate socket address length in iucv_sock_bind() 2016-01-19 14:21:08 -05:00
kcm kcm: fix a signedness in kcm_splice_read() 2016-05-19 11:26:51 -07:00
key
l2tp l2tp: fix configuration passed to setup_udp_tunnel_sock() 2016-06-08 11:11:53 -07:00
l3mdev net: l3mdev: Allow send on enslaved interface 2016-05-09 22:33:52 -04:00
lapb net/lapb: tuse %*ph to dump buffers 2016-05-29 22:33:25 -07:00
llc Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2016-05-09 15:59:24 -04:00
mac80211 mac80211: fix fast_tx header alignment 2016-05-31 12:14:04 +02:00
mac802154 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next 2016-03-19 10:05:34 -07:00
mpls gso: Remove arbitrary checks for unsupported GSO 2016-05-20 18:03:15 -04:00
netfilter ipvs: update real-server binding of outgoing connections in SIP-pe 2016-06-06 09:47:25 +09:00
netlabel netlabel: fix a problem with netlbl_secattr_catmap_setrng() 2016-04-05 16:10:47 -04:00
netlink netlink: Fix dump skb leak/double free 2016-05-16 22:05:15 -04:00
netrom
nfc nfc: nci: Add nci_nfcc_loopback to the nci core 2016-05-04 01:48:16 +02:00
openvswitch openvswitch: update checksum in {push,pop}_mpls 2016-05-31 13:51:42 -07:00
packet Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2016-04-23 18:51:33 -04:00
phonet sock: struct proto hash function may error 2016-02-11 03:54:14 -05:00
qrtr Merge tag 'qcom-soc-for-4.7-2' into net-next 2016-05-17 14:11:19 -04:00
rds RDS: TCP: fix race windows in send-path quiescence by rds_tcp_accept_one() 2016-06-07 15:10:15 -07:00
rfkill rfkill: Use switch to demux userspace operations 2016-04-05 10:48:53 +02:00
rose
rxrpc rxrpc: fix ptr_ret.cocci warnings 2016-06-07 15:30:21 -07:00
sched net: sched: fix tc_should_offload for specific clsact classes 2016-06-07 16:59:53 -07:00
sctp sctp: sctp_diag should dump sctp socket type 2016-05-31 11:59:06 -07:00
sunrpc NFS client updates for Linux 4.7 2016-05-26 10:33:33 -07:00
switchdev switchdev: pass pointer to fib_info instead of copy 2016-05-17 13:58:49 -04:00
tipc tipc: fix an infoleak in tipc_nl_compat_link_dump 2016-06-02 21:32:37 -07:00
unix constify security_path_{mkdir,mknod,symlink} 2016-03-28 00:47:27 -04:00
vmw_vsock Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2016-05-09 15:59:24 -04:00
wimax
wireless mm/page_ref: use page_ref helper instead of direct modification of _count 2016-05-19 19:12:14 -07:00
x25 net: fix a kernel infoleak in x25 module 2016-05-09 22:45:33 -04:00
xfrm Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2016-05-09 15:59:24 -04:00
compat.c soreuseport: add compat case for setsockopt SO_ATTACH_REUSEPORT_CBPF 2016-06-06 15:21:04 -07:00
Kconfig bpf: add generic constant blinding for use in jits 2016-05-16 13:49:32 -04:00
Makefile net: Add Qualcomm IPC router 2016-05-08 23:46:14 -04:00
socket.c fs: poll/select/recvmmsg: use timespec64 for timeout events 2016-05-19 19:12:14 -07:00
sysctl_net.c