linux

Author	SHA1	Message	Date
Julian Anastasov	e6b45241c5	ipv4: reset flowi parameters on route connect Eric Dumazet found that commit `813b3b5db8` (ipv4: Use caller's on-stack flowi as-is in output route lookups.) that comes in 3.0 added a regression. The problem appears to be that resulting flowi4_oif is used incorrectly as input parameter to some routing lookups. The result is that when connecting to local port without listener if the IP address that is used is not on a loopback interface we incorrectly assign RTN_UNICAST to the output route because no route is matched by oif=lo. The RST packet can not be sent immediately by tcp_v4_send_reset because it expects RTN_LOCAL. So, change ip_route_connect and ip_route_newports to update the flowi4 fields that are input parameters because we do not want unnecessary binding to oif. To make it clear what are the input parameters that can be modified during lookup and to show which fields of floiw4 are reused add a new function to update the flowi4 structure: flowi4_update_output. Thanks to Yurij M. Plotnikov for providing a bug report including a program to reproduce the problem. Thanks to Eric Dumazet for tracking the problem down to tcp_v4_send_reset and providing initial fix. Reported-by: Yurij M. Plotnikov <Yurij.Plotnikov@oktetlabs.ru> Signed-off-by: Julian Anastasov <ja@ssi.bg> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2012-02-04 19:29:48 -05:00
Steffen Klassert	b8400f3718	route: struct rtable can be const in rt_is_input_route and rt_is_output_route Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-11-26 14:29:51 -05:00
David S. Miller	a48eff1288	ipv4: Pass explicit destination address to rt_bind_peer(). Signed-off-by: David S. Miller <davem@davemloft.net>	2011-05-18 18:42:43 -04:00
David S. Miller	ed2361e66e	ipv4: Pass explicit destination address to rt_get_peer(). This will next trickle down to rt_bind_peer(). Signed-off-by: David S. Miller <davem@davemloft.net>	2011-05-18 18:38:54 -04:00
David S. Miller	8e36360ae8	ipv4: Remove route key identity dependencies in ip_rt_get_source(). Pass in the sk_buff so that we can fetch the necessary keys from the packet header when working with input routes. Signed-off-by: David S. Miller <davem@davemloft.net>	2011-05-13 17:29:41 -04:00
David S. Miller	cbb1e85f9c	ipv4: Kill rt->rt_{src, dst} usage in IP GRE tunnels. First, make callers pass on-stack flowi4 to ip_route_output_gre() so they can get at the fully resolved flow key. Next, use that in ipgre_tunnel_xmit() to avoid the need to use rt->rt_{dst,src}. Signed-off-by: David S. Miller <davem@davemloft.net>	2011-05-04 12:55:07 -07:00
David S. Miller	31e4543db2	ipv4: Make caller provide on-stack flow key to ip_route_output_ports(). Signed-off-by: David S. Miller <davem@davemloft.net>	2011-05-03 20:25:42 -07:00
David S. Miller	475949d8e8	ipv4: Renamt struct rtable's rt_tos to rt_key_tos. To more accurately reflect that it is purely a routing cache lookup key and is used in no other context. Signed-off-by: David S. Miller <davem@davemloft.net>	2011-05-03 19:45:15 -07:00
David S. Miller	6706b6ebab	ipv4: Remove now superfluous code in ip_route_connect(). Now that output route lookups update the flow with source address et al. selections, the fl4->{saddr,daddr} assignments here are no longer necessary. Signed-off-by: David S. Miller <davem@davemloft.net>	2011-04-28 23:13:17 -07:00
David S. Miller	813b3b5db8	ipv4: Use caller's on-stack flowi as-is in output route lookups. Signed-off-by: David S. Miller <davem@davemloft.net>	2011-04-28 22:26:00 -07:00
David S. Miller	b678027cb7	ipv4: Kill RTO_CONN. It's not used by anything in the kernel, and defined in net/route.h so never exported to userspace. Therefore we can safely remove it. Signed-off-by: David S. Miller <davem@davemloft.net>	2011-04-27 13:59:05 -07:00
David S. Miller	2d7192d6cb	ipv4: Sanitize and simplify ip_route_{connect,newports}() These functions are used together as a unit for route resolution during connect(). They address the chicken-and-egg problem that exists when ports need to be allocated during connect() processing, yet such port allocations require addressing information from the routing code. It's currently more heavy handed than it needs to be, and in particular we allocate and initialize a flow object twice. Let the callers provide the on-stack flow object. That way we only need to initialize it once in the ip_route_connect() call. Later, if ip_route_newports() needs to do anything, it re-uses that flow object as-is except for the ports which it updates before the route re-lookup. Also, describe why this set of facilities are needed and how it works in a big comment. Signed-off-by: David S. Miller <davem@davemloft.net> Reviewed-by: Eric Dumazet <eric.dumazet@gmail.com>	2011-04-27 13:59:04 -07:00
David S. Miller	2a9e950701	net: Remove __KERNEL__ cpp checks from include/net These header files are never installed to user consumption, so any __KERNEL__ cpp checks are superfluous. Projects should also not copy these files into their userland utility sources and try to use them there. If they insist on doing so, the onus is on them to sanitize the headers as needed. Signed-off-by: David S. Miller <davem@davemloft.net>	2011-04-24 10:54:56 -07:00
Eric Dumazet	b71d1d426d	inet: constify ip headers and in6_addr Add const qualifiers to structs iphdr, ipv6hdr and in6_addr pointers where possible, to make code intention more obvious. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-04-22 11:04:14 -07:00
David S. Miller	c1e48efc70	Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 Conflicts: drivers/net/benet/be_main.c	2011-04-07 14:05:23 -07:00
OGAWA Hirofumi	1b86a58f9d	ipv4: Fix "Set rt->rt_iif more sanely on output routes." Commit `1018b5c016` ("Set rt->rt_iif more sanely on output routes.") breaks rt_is_{output,input}_route. This became the cause to return "IP_PKTINFO's ->ipi_ifindex == 0". To fix it, this does: 1) Add "int rt_route_iif;" to struct rtable 2) For input routes, always set rt_route_iif to same value as rt_iif 3) For output routes, always set rt_route_iif to zero. Set rt_iif as it is done currently. 4) Change rt_is_{output,input}_route() to test rt_route_iif Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-04-07 14:04:08 -07:00
David S. Miller	94b92b8834	ipv4: Use flowi4_init_output() in net/route.h Signed-off-by: David S. Miller <davem@davemloft.net>	2011-03-31 04:52:59 -07:00
Steffen Klassert	6df59a84ec	route: Take the right src and dst addresses in ip_route_newports When we set up the flow informations in ip_route_newports(), we take the address informations from the the rt_key_src and rt_key_dst fields of the rtable. They appear to be empty. So take the address informations from rt_src and rt_dst instead. This issue was introduced by commit `5e2b61f784` ("ipv4: Remove flowi from struct rtable.") Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-03-25 01:28:45 -07:00
Julian Anastasov	e6abbaa272	ipv4: fix route deletion for IPs on many subnets Alex Sidorenko reported for problems with local routes left after IP addresses are deleted. It happens when same IPs are used in more than one subnet for the device. Fix fib_del_ifaddr to restrict the checks for duplicate local and broadcast addresses only to the IFAs that use our primary IFA or another primary IFA with same address. And we expect the prefsrc to be matched when the routes are deleted because it is possible they to differ only by prefsrc. This patch prevents local and broadcast routes to be leaked until their primary IP is deleted finally from the box. As the secondary address promotion needs to delete the routes for all secondaries that used the old primary IFA, add option to ignore these secondaries from the checks and to assume they are already deleted, so that we can safely delete the route while these IFAs are still on the device list. Reported-by: Alex Sidorenko <alexandre.sidorenko@hp.com> Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-03-22 01:06:32 -07:00
David S. Miller	9cce96df5b	net: Put fl4_* macros to struct flowi4 and use them again. Signed-off-by: David S. Miller <davem@davemloft.net>	2011-03-12 15:08:54 -08:00
David S. Miller	9d6ec93801	ipv4: Use flowi4 in public route lookup interfaces. Signed-off-by: David S. Miller <davem@davemloft.net>	2011-03-12 15:08:48 -08:00
David S. Miller	6281dcc94a	net: Make flowi ports AF dependent. Create two sets of port member accessors, one set prefixed by fl4_* and the other prefixed by fl6_* This will let us to create AF optimal flow instances. It will work because every context in which we access the ports, we have to be fully aware of which AF the flowi is anyways. Signed-off-by: David S. Miller <davem@davemloft.net>	2011-03-12 15:08:46 -08:00
David S. Miller	1d28f42c1b	net: Put flowi_* prefix on AF independent members of struct flowi I intend to turn struct flowi into a union of AF specific flowi structs. There will be a common structure that each variant includes first, much like struct sock_common. This is the first step to move in that direction. Signed-off-by: David S. Miller <davem@davemloft.net>	2011-03-12 15:08:44 -08:00
David S. Miller	78fbfd8a65	ipv4: Create and use route lookup helpers. The idea here is this minimizes the number of places one has to edit in order to make changes to how flows are defined and used. Signed-off-by: David S. Miller <davem@davemloft.net>	2011-03-12 15:08:42 -08:00
David S. Miller	5e2b61f784	ipv4: Remove flowi from struct rtable. The only necessary parts are the src/dst addresses, the interface indexes, the TOS, and the mark. The rest is unnecessary bloat, which amounts to nearly 50 bytes on 64-bit. Signed-off-by: David S. Miller <davem@davemloft.net>	2011-03-04 21:55:31 -08:00
David S. Miller	4157434c23	ipv4: Use passed-in protocol in ip_route_newports(). Signed-off-by: David S. Miller <davem@davemloft.net>	2011-03-04 21:31:48 -08:00
David S. Miller	5bfa787fb2	ipv4: ip_route_output_key() is better as an inline. This avoid a stack frame at zero cost. Signed-off-by: David S. Miller <davem@davemloft.net>	2011-03-02 14:56:30 -08:00
David S. Miller	b23dd4fe42	ipv4: Make output route lookup return rtable directly. Instead of on the stack. Signed-off-by: David S. Miller <davem@davemloft.net>	2011-03-02 14:31:35 -08:00
David S. Miller	2774c131b1	xfrm: Handle blackhole route creation via afinfo. That way we don't have to potentially do this in every xfrm_lookup() caller. Signed-off-by: David S. Miller <davem@davemloft.net>	2011-03-01 14:59:04 -08:00
David S. Miller	273447b352	ipv4: Kill can_sleep arg to ip_route_output_flow() This boolean state is now available in the flow flags. Signed-off-by: David S. Miller <davem@davemloft.net>	2011-03-01 14:27:04 -08:00
David S. Miller	5df65e5567	net: Add FLOWI_FLAG_CAN_SLEEP. And set is in contexts where the route resolution can sleep. Signed-off-by: David S. Miller <davem@davemloft.net>	2011-03-01 14:22:19 -08:00
David S. Miller	420d44daa7	ipv4: Make final arg to ip_route_output_flow to be boolean "can_sleep" Since that is what the current vague "flags" argument means. Signed-off-by: David S. Miller <davem@davemloft.net>	2011-03-01 14:19:23 -08:00
David S. Miller	abdf7e7239	ipv4: Can final ip_route_connect() arg to boolean "can_sleep". Since that's what the current vague "flags" thing means. Signed-off-by: David S. Miller <davem@davemloft.net>	2011-03-01 14:15:24 -08:00
David S. Miller	dca8b089c9	ipv4: Rearrange how ip_route_newports() gets port keys. ip_route_newports() is the only place in the entire kernel that cares about the port members in the routing cache entry's lookup flow key. Therefore the only reason we store an entire flow inside of the struct rtentry is for this one special case. Rewrite ip_route_newports() such that: 1) The caller passes in the original port values, so we don't need to use the rth->fl.fl_ip_{s,d}port values to remember them. 2) The lookup flow is constructed by hand instead of being copied from the routing cache entry's flow. Signed-off-by: David S. Miller <davem@davemloft.net>	2011-02-24 13:38:12 -08:00
David S. Miller	6431cbc25f	inet: Create a mechanism for upward inetpeer propagation into routes. If we didn't have a routing cache, we would not be able to properly propagate certain kinds of dynamic path attributes, for example PMTU information and redirects. The reason is that if we didn't have a routing cache, then there would be no way to lookup all of the active cached routes hanging off of sockets, tunnels, IPSEC bundles, etc. Consider the case where we created a cached route, but no inetpeer entry existed and also we were not asked to pre-COW the route metrics and therefore did not force the creation a new inetpeer entry. If we later get a PMTU message, or a redirect, and store this information in a new inetpeer entry, there is no way to teach that cached route about the newly existing inetpeer entry. The facilities implemented here handle this problem. First we create a generation ID. When we create a cached route of any kind, we remember the generation ID at the time of attachment. Any time we force-create an inetpeer entry in response to new path information, we bump that generation ID. The dst_ops->check() callback is where the knowledge of this event is propagated. If the global generation ID does not equal the one stored in the cached route, and the cached route has not attached to an inetpeer yet, we look it up and attach if one is found. Now that we've updated the cached route's information, we update the route's generation ID too. This clears the way for implementing PMTU and redirects directly in the inetpeer cache. There is absolutely no need to consult cached route information in order to maintain this information. At this point nothing bumps the inetpeer genids, that comes in the later changes which handle PMTUs and redirects using inetpeers. Signed-off-by: David S. Miller <davem@davemloft.net>	2011-02-10 13:33:41 -08:00
David S. Miller	a4daad6b09	net: Pre-COW metrics for TCP. TCP is going to record metrics for the connection, so pre-COW the route metrics at route cache entry creation time. This avoids several atomic operations that have to occur if we COW the metrics after the entry reaches global visibility. Signed-off-by: David S. Miller <davem@davemloft.net>	2011-01-27 22:01:53 -08:00
David S. Miller	62fa8a846d	net: Implement read-only protection and COW'ing of metrics. Routing metrics are now copy-on-write. Initially a route entry points it's metrics at a read-only location. If a routing table entry exists, it will point there. Else it will point at the all zero metric place-holder called 'dst_default_metrics'. The writeability state of the metrics is stored in the low bits of the metrics pointer, we have two bits left to spare if we want to store more states. For the initial implementation, COW is implemented simply via kmalloc. However future enhancements will change this to place the writable metrics somewhere else, in order to increase sharing. Very likely this "somewhere else" will be the inetpeer cache. Note also that this means that metrics updates may transiently fail if we cannot COW the metrics successfully. But even by itself, this patch should decrease memory usage and increase cache locality especially for routing workloads. In those cases the read-only metric copies stay in place and never get written to. TCP workloads where metrics get updated, and those rare cases where PMTU triggers occur, will take a very slight performance hit. But that hit will be alleviated when the long-term writable metrics move to a more sharable location. Since the metrics storage went from a u32 array of RTAX_MAX entries to what is essentially a pointer, some retooling of the dst_entry layout was necessary. Most importantly, we need to preserve the alignment of the reference count so that it doesn't share cache lines with the read-mostly state, as per Eric Dumazet's alignment assertion checks. The only non-trivial bit here is the move of the 'flags' member into the writeable cacheline. This is OK since we are always accessing the flags around the same moment when we made a modification to the reference count. Signed-off-by: David S. Miller <davem@davemloft.net>	2011-01-26 20:51:05 -08:00
David S. Miller	6561a3b12d	ipv4: Flush per-ns routing cache more sanely. Flush the routing cache only of entries that match the network namespace in which the purge event occurred. Signed-off-by: David S. Miller <davem@davemloft.net> Acked-by: Eric Dumazet <eric.dumazet@gmail.com>	2010-12-20 10:37:19 -08:00
David S. Miller	323e126f0c	ipv4: Don't pre-seed hoplimit metric. Always go through a new ip4_dst_hoplimit() helper, just like ipv6. This allowed several simplifications: 1) The interim dst_metric_hoplimit() can go as it's no longer userd. 2) The sysctl_ip_default_ttl entry no longer needs to use ipv4_doint_and_flush, since the sysctl is not cached in routing cache metrics any longer. 3) ipv4_doint_and_flush no longer needs to be exported and therefore can be marked static. When ipv4_doint_and_flush_strategy was removed some time ago, the external declaration in ip.h was mistakenly left around so kill that off too. We have to move the sysctl_ip_default_ttl declaration into ipv4's route cache definition header net/route.h, because currently net/ip.h (where the declaration lives now) has a back dependency on net/route.h Signed-off-by: David S. Miller <davem@davemloft.net>	2010-12-12 22:08:17 -08:00
Changli Gao	5811662b15	net: use the macros defined for the members of flowi Use the macros defined for the members of flowi to clean the code up. Signed-off-by: Changli Gao <xiaosuo@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-11-17 12:27:45 -08:00
David S. Miller	c753796769	ipv4: Make rt->fl.iif tests lest obscure. When we test rt->fl.iif against zero, we're seeing if it's an output or an input route. Make that explicit with some helper functions. Signed-off-by: David S. Miller <davem@davemloft.net>	2010-11-11 17:07:48 -08:00
Eric Dumazet	72cdd1d971	net: get rid of rtable->idev It seems idev field in struct rtable has no special purpose, but adding extra atomic ops. We hold refcounts on the device itself (using percpu data, so pretty cheap in current kernel). infiniband case is solved using dst.dev instead of idev->dev Removal of this field means routing without route cache is now using shared data, percpu data, and only potential contention is a pair of atomic ops on struct neighbour per forwarded packet. About 5% speedup on routing test. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Roland Dreier <rolandd@cisco.com> Cc: Sean Hefty <sean.hefty@intel.com> Cc: Hal Rosenstock <hal.rosenstock@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-11-11 10:29:40 -08:00
Ulrich Weber	fb0c5f0bc8	tproxy: check for transparent flag in ip_route_newports as done in ip_route_connect() Signed-off-by: Ulrich Weber <uweber@astaro.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-09-27 15:03:33 -07:00
Changli Gao	d8d1f30b95	net-next: remove useless union keyword remove useless union keyword in rtable, rt6_info and dn_route. Since there is only one member in a union, the union keyword isn't useful. Signed-off-by: Changli Gao <xiaosuo@gmail.com> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-06-10 23:31:35 -07:00
Eric Dumazet	407eadd996	net: implements ip_route_input_noref() ip_route_input() is the version returning a refcounted dst, while ip_route_input_noref() returns a non refcounted one. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-05-17 17:18:51 -07:00
Tejun Heo	7d720c3e4f	percpu: add __percpu sparse annotations to net Add __percpu sparse annotations to net. These annotations are to make sparse consider percpu variables to be in a different address space and warn if accessed without going through percpu accessors. This patch doesn't affect normal builds. The macro and type tricks around snmp stats make things a bit interesting. DEFINE/DECLARE_SNMP_STAT() macros mark the target field as __percpu and SNMP_UPD_PO_STATS() macro is updated accordingly. All snmp_mib_() users which used to cast the argument to (void ) are updated to cast it to (void __percpu *). Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: David S. Miller <davem@davemloft.net> Cc: Patrick McHardy <kaber@trash.net> Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net> Cc: Vlad Yasevich <vladislav.yasevich@hp.com> Cc: netdev@vger.kernel.org Signed-off-by: David S. Miller <davem@davemloft.net>	2010-02-16 23:05:38 -08:00
Eric W. Biederman	a5ee155136	net: NETDEV_UNREGISTER_PERNET -> NETDEV_UNREGISTER_BATCH The motivation for an additional notifier in batched netdevice notification (rt_do_flush) only needs to be called once per batch not once per namespace. For further batching improvements I need a guarantee that the netdevices are unregistered in order allowing me to unregister an all of the network devices in a network namespace at the same time with the guarantee that the loopback device is really and truly unregistered last. Additionally it appears that we moved the route cache flush after the final synchronize_net, which seems wrong and there was no explanation. So I have restored the original location of the final synchronize_net. Cc: Octavian Purdila <opurdila@ixiacom.com> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2009-12-01 16:15:50 -08:00
Eric Dumazet	fd2c3ef761	net: cleanup include/net This cleanup patch puts struct/union/enum opening braces, in first line to ease grep games. struct something { becomes : struct something { Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2009-11-04 05:06:25 -08:00
Eric Dumazet	511c3f92ad	net: skb->rtable accessor Define skb_rtable(const struct sk_buff *skb) accessor to get rtable from skb Delete skb->rtable field Setting rtable is not allowed, just set dst instead as rtable is an alias. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2009-06-03 02:51:02 -07:00
KOVACS Krisztian	79876874ce	ipv4: Conditionally enable transparent flow flag when connecting Set FLOWI_FLAG_ANYSRC in flowi->flags if the socket has the transparent socket option set. This way we selectively enable certain connections with non-local source addresses to be routed. Signed-off-by: KOVACS Krisztian <hidden@sch.bme.hu> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-10-01 07:35:39 -07:00

1 2

88 Commits