linux/net/ipv4
David S. Miller 5e9965c15b Merge branch 'kill_rtcache'
The ipv4 routing cache is non-deterministic, performance wise, and is
subject to reasonably easy to launch denial of service attacks.

The routing cache works great for well behaved traffic, and the world
was a much friendlier place when the tradeoffs that led to the routing
cache's design were considered.

What it boils down to is that the performance of the routing cache is
a product of the traffic patterns seen by a system rather than of the
contents of the routing tables, and those traffic patterns can be
controlled by external entities.
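A deliberately tiny model makes the point concrete. The sketch below is a hypothetical direct-mapped cache keyed only by destination address. The names `route_cache`, `cache_lookup` and `cache_hit_rate`, and the slot count, are assumptions, and the real routing cache was considerably more elaborate. The routing table never changes, yet the hit rate swings from nearly 100% to zero depending purely on which destinations arrive.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, heavily simplified direct-mapped route cache; not kernel
 * code. It only models whether a lookup hits or misses. */
#define CACHE_BITS 8
#define CACHE_SLOTS (1u << CACHE_BITS)

struct cache_slot {
	bool valid;
	uint32_t daddr;
};

struct route_cache {
	struct cache_slot slot[CACHE_SLOTS];
	unsigned long hits;
	unsigned long misses;
};

static void cache_init(struct route_cache *c)
{
	memset(c, 0, sizeof(*c));
}

/* Returns true on a hit. On a miss, the expensive FIB lookup would run
 * here and its result would evict whatever occupied the slot. */
static bool cache_lookup(struct route_cache *c, uint32_t daddr)
{
	/* Multiplicative hash: the top CACHE_BITS bits select the slot. */
	uint32_t idx = (uint32_t)(daddr * 2654435761u) >> (32 - CACHE_BITS);
	struct cache_slot *s = &c->slot[idx];

	if (s->valid && s->daddr == daddr) {
		c->hits++;
		return true;
	}
	s->valid = true;
	s->daddr = daddr;
	c->misses++;
	return false;
}

static double cache_hit_rate(const struct route_cache *c)
{
	unsigned long total = c->hits + c->misses;

	assert(total > 0);
	return (double)c->hits / (double)total;
}
```

In this model, traffic that never repeats a destination forces every lookup down the slow path, and any remote sender can generate such traffic.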

Even for "well behaved" legitimate traffic, high volume sites can see
hit rates in the routing cache of only ~10%.

The general flow of this patch series is that first the routing cache
is removed.  We build a completely new rtable entry for every lookup
request.
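Without the cache, the output path reduces to two steps: consult the FIB, then allocate. The following is a hypothetical sketch that assumes a two-entry toy FIB and illustrative names (`fib_result_sketch`, `rtable_sketch`, `toy_fib_lookup`, `route_output_nocache`) rather than the kernel's own. In this model, two identical requests produce two distinct entries.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-ins; names and fields are illustrative only. */
struct fib_result_sketch {
	uint32_t gw; /* 0: destination is on-link */
	int oif;     /* output interface index */
};

struct rtable_sketch {
	uint32_t gw;
	int oif;
};

/* Toy FIB: 10.0.0.0/8 on-link on ifindex 2, 172.16.0.0/12 via 10.0.0.1. */
static int toy_fib_lookup(uint32_t daddr, struct fib_result_sketch *res)
{
	assert(res != NULL);
	if ((daddr & 0xff000000u) == 0x0a000000u) {
		res->gw = 0;
		res->oif = 2;
		return 0;
	}
	if ((daddr & 0xfff00000u) == 0xac100000u) {
		res->gw = 0x0a000001u;
		res->oif = 2;
		return 0;
	}
	return -1; /* no route */
}

/* No cache: consult the FIB and build a brand-new entry on every call.
 * The caller owns the result and must free() it. */
static struct rtable_sketch *route_output_nocache(uint32_t daddr)
{
	struct fib_result_sketch res;
	struct rtable_sketch *rt;

	if (toy_fib_lookup(daddr, &res) != 0)
		return NULL;
	rt = malloc(sizeof(*rt));
	if (!rt)
		return NULL;
	rt->gw = res.gw;
	rt->oif = res.oif;
	return rt;
}
```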

Next we make some simplifications, because removing the routing cache
makes several members of struct rtable unnecessary.

Then we need to make some amendments so that we can legally cache
pre-constructed routes in the FIB nexthops.  Firstly, we need to
invalidate routes which are hit with nexthop exceptions.  Secondly, we
have to change the semantics of rt->rt_gateway such that zero means
that the destination is on-link, and a non-zero value is the address
of the nexthop gateway.

Now that the preparations are ready, we start caching precomputed
routes in the FIB nexthops.  Output and input routes need different
kinds of care when determining if we can legally do such caching or
not.  The details are in the commit log messages for those changes.
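A rough model of nexthop route caching, including the exception invalidation mentioned above, might look like the following. It is a simplified, hypothetical sketch with these assumptions:

- A single per-destination exception (a learned PMTU) stands in for the general mechanism.
- `obsolete` plays the role of the check a holder performs before reusing a cached route.
- Refcounts are plain integers rather than atomics.

It ignores the extra conditions that the real input and output paths impose.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical route object; not struct rtable. */
struct rt_sketch {
	uint32_t gateway;    /* 0: on-link */
	uint32_t pmtu;       /* 0: use the device MTU */
	bool obsolete;       /* set when holders must look the route up again */
	unsigned int refcnt; /* plain int for brevity; real code is atomic */
};

/* Hypothetical nexthop with room for one per-destination exception. */
struct nh_sketch {
	uint32_t gw;
	bool exc_valid;
	uint32_t exc_daddr;
	uint32_t exc_pmtu;
	struct rt_sketch *cached; /* shared precomputed route, or NULL */
};

static struct rt_sketch *rt_alloc(uint32_t gw, uint32_t pmtu)
{
	struct rt_sketch *rt = malloc(sizeof(*rt));

	if (rt) {
		rt->gateway = gw;
		rt->pmtu = pmtu;
		rt->obsolete = false;
		rt->refcnt = 1;
	}
	return rt;
}

static void rt_put(struct rt_sketch *rt)
{
	if (!rt)
		return;
	assert(rt->refcnt > 0);
	if (--rt->refcnt == 0)
		free(rt);
}

/* Learn e.g. a PMTU for one destination, then invalidate the shared
 * route so that anyone still holding it performs a fresh lookup. */
static void nh_add_exception(struct nh_sketch *nh, uint32_t daddr,
			     uint32_t pmtu)
{
	nh->exc_valid = true;
	nh->exc_daddr = daddr;
	nh->exc_pmtu = pmtu;
	if (nh->cached) {
		nh->cached->obsolete = true;
		rt_put(nh->cached); /* drop the cache's own reference */
		nh->cached = NULL;
	}
}

/* Output route for daddr once the FIB has selected nh.
 * The caller owns one reference and must rt_put() it. */
static struct rt_sketch *nh_output_route(struct nh_sketch *nh, uint32_t daddr)
{
	if (nh->exc_valid && nh->exc_daddr == daddr)
		return rt_alloc(nh->gw, nh->exc_pmtu); /* private, never cached */

	if (!nh->cached) {
		nh->cached = rt_alloc(nh->gw, 0); /* reference held by the cache */
		if (!nh->cached)
			return NULL;
	}
	nh->cached->refcnt++; /* reference for the caller */
	return nh->cached;
}
```

Ordinary destinations share one precomputed object. Recording an exception invalidates that object so existing holders look the route up again. The destination with the exception then gets a private route that never enters the cache.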

The patch series then winds down with some more struct rtable
simplifications and other tidy ups that remove unnecessary overhead.

On a SPARC-T3, output route lookups take ~876 cycles.  Input route
lookups take ~1169 cycles with rpfilter disabled and ~1468 cycles
with rpfilter enabled.

These measurements were taken with the kbench_mod test module in the
net_test_tools GIT tree:

git://git.kernel.org/pub/scm/linux/kernel/git/davem/net_test_tools.git

That GIT tree also includes a udpflood tester tool, which stresses
route lookups on packet output.

For example, on the same SPARC-T3 system we can run:

	time ./udpflood -l 10000000 10.2.2.11

with routing cache:
real    1m21.955s       user    0m6.530s        sys     1m15.390s

without routing cache:
real    1m31.678s       user    0m6.520s        sys     1m25.140s
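A little arithmetic turns the udpflood numbers into relative terms. The helpers below are hypothetical. Treating `-l 10000000` as the number of datagrams sent is an assumption about the tool's option. User time is essentially unchanged (6.530 s vs 6.520 s), so the extra cost lands in the kernel. Without the routing cache, the run takes roughly 11.9% more wall-clock time and 12.9% more system time, which is about 975 ns of extra system time per datagram.

```c
#include <assert.h>

/* Relative slowdown in percent: (after - before) / before * 100. */
static double slowdown_pct(double before_s, double after_s)
{
	assert(before_s > 0.0);
	return (after_s - before_s) / before_s * 100.0;
}

/* Extra time per operation in nanoseconds. */
static double extra_ns_per_op(double before_s, double after_s, double ops)
{
	assert(ops > 0.0);
	return (after_s - before_s) / ops * 1e9;
}
```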

Performance can undoubtedly be improved further.

For example, fib_table_lookup() performs a lot of unnecessary
computation with all the masking and shifting, some of it
conditionalized to deal with edge cases.
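One example of the kind of edge-case conditional meant here is computing an IPv4 netmask from a prefix length. This is a generic sketch, not fib_trie's code, and `prefix_mask` and `prefix_match` are illustrative names. A /0 prefix has to be special-cased because shifting a 32-bit value by 32 bits is undefined behavior in C.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Netmask in host byte order for a prefix length of 0..32. */
static uint32_t prefix_mask(unsigned int plen)
{
	assert(plen <= 32);
	/* plen == 0 would need a shift by 32, which is undefined. */
	return plen == 0 ? 0 : ~(uint32_t)0 << (32 - plen);
}

/* True if addr falls inside prefix/plen. */
static bool prefix_match(uint32_t addr, uint32_t prefix, unsigned int plen)
{
	return ((addr ^ prefix) & prefix_mask(plen)) == 0;
}
```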

Also, Eric's no-ref optimization for input route lookups could be
reinstated for the FIB nexthop caching code path.  I would be really
pleased if someone worked on that.
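A no-ref lookup skips the reference count increment and decrement when the caller can guarantee that the route outlives its use. In the kernel, that guarantee typically comes from an RCU read-side section. The kernel's sk_buff encodes this with a tagged pointer and an `SKB_DST_NOREF` flag bit. The sketch below imitates that scheme with hypothetical names (`dst_sketch`, `dst_attach`, `dst_from`, `dst_detach`) and non-atomic counts. It does not show how the nexthop-cache path would actually adopt the idea.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical refcounted route object; real code would use atomics. */
struct dst_sketch {
	unsigned int refcnt;
};

#define DST_NOREF_BIT 1UL /* low pointer bit: "no reference held" */

/* Store dst in a tagged word, taking a reference only if !noref. */
static uintptr_t dst_attach(struct dst_sketch *dst, bool noref)
{
	assert(((uintptr_t)dst & DST_NOREF_BIT) == 0); /* needs alignment */
	if (noref)
		return (uintptr_t)dst | DST_NOREF_BIT;
	dst->refcnt++;
	return (uintptr_t)dst;
}

static struct dst_sketch *dst_from(uintptr_t word)
{
	return (struct dst_sketch *)(word & ~(uintptr_t)DST_NOREF_BIT);
}

/* Release: only words that actually hold a reference drop one. */
static void dst_detach(uintptr_t word)
{
	if (!(word & DST_NOREF_BIT)) {
		assert(dst_from(word)->refcnt > 0);
		dst_from(word)->refcnt--;
	}
}
```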

In fact, anyone suitably motivated can simply fire up perf while
loading the net_test_tools benchmark kernel module.  I spend much of
my time running:

bash# perf record insmod ./kbench_mod.ko dst=172.30.42.22 src=74.128.0.1 iif=2
bash# perf report

Thanks to helpful feedback from Joe Perches, Eric Dumazet, Ben
Hutchings, and others.

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-22 17:04:15 -07:00
netfilter ipv4: Adjust semantics of rt->rt_gateway. 2012-07-20 13:31:20 -07:00
af_inet.c net-tcp: Fast Open client - sendmsg(MSG_FASTOPEN) 2012-07-19 11:02:03 -07:00
ah4.c ipv4: Add redirect support to all protocol icmp error handlers. 2012-07-11 21:27:49 -07:00
arp.c ipv4: Adjust semantics of rt->rt_gateway. 2012-07-20 13:31:20 -07:00
cipso_ipv4.c cipso: don't follow a NULL pointer when setsockopt() is called 2012-07-18 09:01:12 -07:00
datagram.c ipv4: Lock socket and use cork flow in ip4_datagram_connect(). 2011-05-08 13:48:57 -07:00
devinet.c ipv4: Add interface option to enable routing of 127.0.0.0/8 2012-06-12 15:25:46 -07:00
esp4.c ipv4: Add redirect support to all protocol icmp error handlers. 2012-07-11 21:27:49 -07:00
fib_frontend.c ipv4: Delete routing cache. 2012-07-20 13:30:27 -07:00
fib_lookup.h ipv4: Fix nexthop caching wrt. scoping. 2011-03-24 18:06:47 -07:00
fib_rules.c ipv4: Don't store a rule pointer in fib_result. 2012-07-13 08:21:29 -07:00
fib_semantics.c ipv4: Cache input routes in fib_info nexthops. 2012-07-20 13:36:40 -07:00
fib_trie.c ipv4: Remove tb_peers from fib_table. 2012-07-12 09:39:28 -07:00
gre.c net: ipv4: Standardize prefixes for message logging 2012-03-12 17:05:21 -07:00
icmp.c ipv4: Put proper checks into icmp_socket_deliver(). 2012-07-12 08:06:04 -07:00
igmp.c ipv4: fix checkpatch errors 2012-04-15 12:37:19 -04:00
inet_connection_sock.c ipv4: Kill FLOWI_FLAG_RT_NOCACHE and associated code. 2012-07-20 13:36:54 -07:00
inet_diag.c net: make sock diag per-namespace 2012-07-16 22:31:34 -07:00
inet_fragment.c inetpeer: add parameter net for inet_getpeer_v4,v6 2012-06-08 14:27:23 -07:00
inet_hashtables.c ipv4: fix checkpatch errors 2012-04-15 12:37:19 -04:00
inet_lro.c net: add skb frag size accessors 2011-10-19 03:10:46 -04:00
inet_timewait_sock.c net: ipv4 and ipv6: Convert printk(KERN_DEBUG to pr_debug 2012-05-16 01:01:03 -04:00
inetpeer.c ipv4: Maintain redirect and PMTU info in struct rtable again. 2012-07-10 22:40:14 -07:00
ip_forward.c snmp: fix OutOctets counter to include forwarded datagrams 2012-06-07 14:50:56 -07:00
ip_fragment.c ipv4: Kill ip_route_input_noref(). 2012-07-20 13:30:59 -07:00
ip_gre.c ipv4: Adjust semantics of rt->rt_gateway. 2012-07-20 13:31:20 -07:00
ip_input.c ipv4: Kill ip_route_input_noref(). 2012-07-20 13:30:59 -07:00
ip_options.c ipv4: optimize fib_compute_spec_dst call in ip_options_echo 2012-07-19 08:30:49 -07:00
ip_output.c Merge branch 'kill_rtcache' 2012-07-22 17:04:15 -07:00
ip_sockglue.c ipv4: Create and use fib_compute_spec_dst() helper. 2012-06-28 03:59:11 -07:00
ip_vti.c net/ipv4: VTI support new module for ip_vti. 2012-07-18 09:36:12 -07:00
ipcomp.c ipv4: Add redirect support to all protocol icmp error handlers. 2012-07-11 21:27:49 -07:00
ipconfig.c net/ipv4/ipconfig: neaten __setup placement 2012-05-20 04:06:16 -04:00
ipip.c ipv4: Adjust semantics of rt->rt_gateway. 2012-07-20 13:31:20 -07:00
ipmr.c ipv4: Kill rt->rt_oif 2012-07-20 13:38:34 -07:00
Kconfig net/ipv4: VTI support new module for ip_vti. 2012-07-18 09:36:12 -07:00
Makefile net-tcp: Fast Open base 2012-07-19 10:55:36 -07:00
netfilter.c net: Delete all remaining instances of ctl_path 2012-04-20 21:22:30 -04:00
ping.c ipv4: Add redirect support to all protocol icmp error handlers. 2012-07-11 21:27:49 -07:00
proc.c net-tcp: Fast Open client - sending SYN-data 2012-07-19 11:02:03 -07:00
protocol.c inet: Sanitize inet{,6} protocol demux. 2012-06-19 18:56:21 -07:00
raw.c ipv4: Add redirect support to all protocol icmp error handlers. 2012-07-11 21:27:49 -07:00
route.c ipv4: Kill rt->fi 2012-07-20 13:40:07 -07:00
syncookies.c net-tcp: Fast Open base 2012-07-19 10:55:36 -07:00
sysctl_net_ipv4.c net-tcp: Fast Open base 2012-07-19 10:55:36 -07:00
tcp_bic.c tcp: fix undo after RTO for BIC 2012-01-20 14:17:26 -05:00
tcp_cong.c tcp: fix ABC in tcp_slow_start() 2012-07-20 10:59:41 -07:00
tcp_cubic.c tcp: fix undo after RTO for CUBIC 2012-01-20 14:17:26 -05:00
tcp_diag.c inet_diag: Rename inet_diag_req into inet_diag_req_v2 2012-01-11 12:56:06 -08:00
tcp_fastopen.c net-tcp: Fast Open base 2012-07-19 10:55:36 -07:00
tcp_highspeed.c tcp: mark tcp_congestion_ops read_mostly 2011-03-10 00:40:17 -08:00
tcp_htcp.c tcp: mark tcp_congestion_ops read_mostly 2011-03-10 00:40:17 -08:00
tcp_hybla.c tcp: bool conversions 2012-05-17 14:59:59 -04:00
tcp_illinois.c tcp: mark tcp_congestion_ops read_mostly 2011-03-10 00:40:17 -08:00
tcp_input.c tcp: Return bool instead of int where appropriate 2012-07-20 10:59:41 -07:00
tcp_ipv4.c ipv4: Kill FLOWI_FLAG_RT_NOCACHE and associated code. 2012-07-20 13:36:54 -07:00
tcp_lp.c Fix common misspellings 2011-03-31 11:26:23 -03:00
tcp_memcontrol.c memcg: decrement static keys at real destroy time 2012-05-29 16:22:28 -07:00
tcp_metrics.c tcp: use hash_32() in tcp_metrics 2012-07-20 10:59:41 -07:00
tcp_minisocks.c net-tcp: Fast Open base 2012-07-19 10:55:36 -07:00
tcp_output.c tcp: improve latencies of timer triggered events 2012-07-20 10:59:41 -07:00
tcp_probe.c net: cleanup unsigned to unsigned int 2012-04-15 12:44:40 -04:00
tcp_scalable.c tcp: mark tcp_congestion_ops read_mostly 2011-03-10 00:40:17 -08:00
tcp_timer.c tcp: improve latencies of timer triggered events 2012-07-20 10:59:41 -07:00
tcp_vegas.c tcp: mark tcp_congestion_ops read_mostly 2011-03-10 00:40:17 -08:00
tcp_vegas.h
tcp_veno.c tcp: mark tcp_congestion_ops read_mostly 2011-03-10 00:40:17 -08:00
tcp_westwood.c tcp: mark tcp_congestion_ops read_mostly 2011-03-10 00:40:17 -08:00
tcp_yeah.c Fix common misspellings 2011-03-31 11:26:23 -03:00
tcp.c net-tcp: Fast Open client - sendmsg(MSG_FASTOPEN) 2012-07-19 11:02:03 -07:00
tunnel4.c net: Convert printks to pr_<level> 2012-03-11 23:42:51 -07:00
udp_diag.c net: make sock diag per-namespace 2012-07-16 22:31:34 -07:00
udp_impl.h ipv4: fix checkpatch errors 2012-04-15 12:37:19 -04:00
udp.c ipv4: Add redirect support to all protocol icmp error handlers. 2012-07-11 21:27:49 -07:00
udplite.c net: ipv4: Standardize prefixes for message logging 2012-03-12 17:05:21 -07:00
xfrm4_input.c ipv4: Kill ip_route_input_noref(). 2012-07-20 13:30:59 -07:00
xfrm4_mode_beet.c ipsec: be careful of non existing mac headers 2012-02-23 16:50:45 -05:00
xfrm4_mode_transport.c
xfrm4_mode_tunnel.c net/ipv4: VTI support rx-path hook in xfrm4_mode_tunnel. 2012-07-18 09:36:12 -07:00
xfrm4_output.c xfrm4: Don't call icmp_send on local error 2011-07-01 17:33:19 -07:00
xfrm4_policy.c ipv4: Turn rt->rt_route_iif into rt->rt_is_input. 2012-07-20 13:40:02 -07:00
xfrm4_state.c net: Add export.h for EXPORT_SYMBOL/THIS_MODULE to non-modules 2011-10-31 19:30:30 -04:00
xfrm4_tunnel.c net: ipv4: Standardize prefixes for message logging 2012-03-12 17:05:21 -07:00