linux

mirror of https://github.com/torvalds/linux.git synced 2024-11-28 23:21:31 +00:00

Author	SHA1	Message	Date
David Forster	94d9f1c590	ipv4: panic in leaf_walk_rcu due to stale node pointer Panic occurs when issuing "cat /proc/net/route" whilst populating FIB with > 1M routes. Use of cached node pointer in fib_route_get_idx is unsafe. BUG: unable to handle kernel paging request at ffffc90001630024 IP: [<ffffffff814cf6a0>] leaf_walk_rcu+0x10/0xe0 PGD 11b08d067 PUD 11b08e067 PMD dac4b067 PTE 0 Oops: 0000 [#1] SMP Modules linked in: nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscac snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep virti acpi_cpufreq button parport_pc ppdev lp parport autofs4 ext4 crc16 mbcache jbd tio_ring virtio floppy uhci_hcd ehci_hcd usbcore usb_common libata scsi_mod CPU: 1 PID: 785 Comm: cat Not tainted 4.2.0-rc8+ #4 Hardware name: Bochs Bochs, BIOS Bochs 01/01/2007 task: ffff8800da1c0bc0 ti: ffff88011a05c000 task.ti: ffff88011a05c000 RIP: 0010:[<ffffffff814cf6a0>] [<ffffffff814cf6a0>] leaf_walk_rcu+0x10/0xe0 RSP: 0018:ffff88011a05fda0 EFLAGS: 00010202 RAX: ffff8800d8a40c00 RBX: ffff8800da4af940 RCX: ffff88011a05ff20 RDX: ffffc90001630020 RSI: 0000000001013531 RDI: ffff8800da4af950 RBP: 0000000000000000 R08: ffff8800da1f9a00 R09: 0000000000000000 R10: ffff8800db45b7e4 R11: 0000000000000246 R12: ffff8800da4af950 R13: ffff8800d97a74c0 R14: 0000000000000000 R15: ffff8800d97a7480 FS: 00007fd3970e0700(0000) GS:ffff88011fd00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: ffffc90001630024 CR3: 000000011a7e4000 CR4: 00000000000006e0 Stack: ffffffff814d00d3 0000000000000000 ffff88011a05ff20 ffff8800da1f9a00 ffffffff811dd8b9 0000000000000800 0000000000020000 00007fd396f35000 ffffffff811f8714 0000000000003431 ffffffff8138dce0 0000000000000f80 Call Trace: [<ffffffff814d00d3>] ? fib_route_seq_start+0x93/0xc0 [<ffffffff811dd8b9>] ? seq_read+0x149/0x380 [<ffffffff811f8714>] ? fsnotify+0x3b4/0x500 [<ffffffff8138dce0>] ? process_echoes+0x70/0x70 [<ffffffff8121cfa7>] ? proc_reg_read+0x47/0x70 [<ffffffff811bb823>] ? __vfs_read+0x23/0xd0 [<ffffffff811bbd42>] ? rw_verify_area+0x52/0xf0 [<ffffffff811bbe61>] ? vfs_read+0x81/0x120 [<ffffffff811bcbc2>] ? SyS_read+0x42/0xa0 [<ffffffff81549ab2>] ? entry_SYSCALL_64_fastpath+0x16/0x75 Code: 48 85 c0 75 d8 f3 c3 31 c0 c3 f3 c3 66 66 66 66 66 66 2e 0f 1f 84 00 00 a 04 89 f0 33 02 44 89 c9 48 d3 e8 0f b6 4a 05 49 89 RIP [<ffffffff814cf6a0>] leaf_walk_rcu+0x10/0xe0 RSP <ffff88011a05fda0> CR2: ffffc90001630024 Signed-off-by: Dave Forster <dforster@brocade.com> Acked-by: Alexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-08-06 00:10:05 -04:00
Linus Torvalds	34229b2774	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Pull networking fixes from David Miller: "This looks like a lot but it's a mixture of regression fixes as well as fixes for longer standing issues. 1) Fix on-channel cancellation in mac80211, from Johannes Berg. 2) Handle CHECKSUM_COMPLETE properly in xt_TCPMSS netfilter xtables module, from Eric Dumazet. 3) Avoid infinite loop in UDP SO_REUSEPORT logic, also from Eric Dumazet. 4) Avoid a NULL deref if we try to set SO_REUSEPORT after a socket is bound, from Craig Gallek. 5) GRO key comparisons don't take lightweight tunnels into account, from Jesse Gross. 6) Fix struct pid leak via SCM credentials in AF_UNIX, from Eric Dumazet. 7) We need to set the rtnl_link_ops of ipv6 SIT tunnels before we register them, otherwise the NEWLINK netlink message is missing the proper attributes. From Thadeu Lima de Souza Cascardo. 8) Several Spectrum chip bug fixes for mlxsw switch driver, from Ido Schimmel 9) Handle fragments properly in ipv4 easly socket demux, from Eric Dumazet. 10) Don't ignore the ifindex key specifier on ipv6 output route lookups, from Paolo Abeni" * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (128 commits) tcp: avoid cwnd undo after receiving ECN irda: fix a potential use-after-free in ircomm_param_request net: tg3: avoid uninitialized variable warning net: nb8800: avoid uninitialized variable warning net: vxge: avoid unused function warnings net: bgmac: clarify CONFIG_BCMA dependency net: hp100: remove unnecessary #ifdefs net: davinci_cpdma: use dma_addr_t for DMA address ipv6/udp: use sticky pktinfo egress ifindex on connect() ipv6: enforce flowi6_oif usage in ip6_dst_lookup_tail() netlink: not trim skb for mmaped socket when dump vxlan: fix a out of bounds access in __vxlan_find_mac net: dsa: mv88e6xxx: fix port VLAN maps fib_trie: Fix shift by 32 in fib_table_lookup net: moxart: use correct accessors for DMA memory ipv4: ipconfig: avoid unused ic_proto_used symbol bnxt_en: Fix crash in bnxt_free_tx_skbs() during tx timeout. bnxt_en: Exclude rx_drop_pkts hw counter from the stack's rx_dropped counter. bnxt_en: Ring free response from close path should use completion ring net_sched: drr: check for NULL pointer in drr_dequeue ...	2016-02-01 15:56:08 -08:00
Alexander Duyck	a5829f536b	fib_trie: Fix shift by 32 in fib_table_lookup The fib_table_lookup function had a shift by 32 that triggered a UBSAN warning. This was due to the fact that I had placed the shift first and then followed it with the check for the suffix length to ignore the undefined behavior. If we reorder this so that we verify the suffix is less than 32 before shifting the value we can avoid the issue. Reported-by: Toralf Förster <toralf.foerster@gmx.de> Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2016-01-29 19:41:00 -08:00
Tetsuo Handa	1d5cfdb076	tree wide: use kvfree() than conditional kfree()/vfree() There are many locations that do if (memory_was_allocated_by_vmalloc) vfree(ptr); else kfree(ptr); but kvfree() can handle both kmalloc()ed memory and vmalloc()ed memory using is_vmalloc_addr(). Unless callers have special reasons, we can replace this branch with kvfree(). Please check and reply if you found problems. Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Jan Kara <jack@suse.com> Acked-by: Russell King <rmk+kernel@arm.linux.org.uk> Reviewed-by: Andreas Dilger <andreas.dilger@intel.com> Acked-by: "Rafael J. Wysocki" <rjw@rjwysocki.net> Acked-by: David Rientjes <rientjes@google.com> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Oleg Drokin <oleg.drokin@intel.com> Cc: Boris Petkov <bp@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-01-22 17:02:18 -08:00
Alexander Duyck	c2229fe143	fib_trie: leaf_walk_rcu should not compute key if key is less than pn->key We were computing the child index in cases where the key value we were looking for was actually less than the base key of the tnode. As a result we were getting incorrect index values that would cause us to skip over some children. To fix this I have added a test that will force us to use child index 0 if the key we are looking for is less than the key of the current tnode. Fixes: `8be33e955c` ("fib_trie: Fib walk rcu should take a tnode and key instead of a trie and a leaf") Reported-by: Brian Rak <brak@gameservers.com> Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-10-27 18:14:51 -07:00
David Ahern	58189ca7b2	net: Fix vti use case with oif in dst lookups Steffen reported that the recent change to add oif to dst lookups breaks the VTI use case. The problem is that with the oif set in the flow struct the comparison to the nh_oif is triggered. Fix by splitting the FLOWI_FLAG_VRFSRC into 2 flags -- one that triggers the vrf device cache bypass (FLOWI_FLAG_VRFSRC) and another telling the lookup to not compare nh oif (FLOWI_FLAG_SKIP_NH_OIF). Fixes: `42a7b32b73` ("xfrm: Add oif to dst lookups") Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Acked-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-09-17 16:36:34 -07:00
David Ahern	f6d3c19274	net: FIB tracepoints A few useful tracepoints developing VRF driver. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-08-29 13:05:16 -07:00
David S. Miller	dc25b25897	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Conflicts: drivers/net/usb/qmi_wwan.c Overlapping additions of new device IDs to qmi_wwan.c Signed-off-by: David S. Miller <davem@davemloft.net>	2015-08-21 11:44:04 -07:00
David Ahern	613d09b30f	net: Use VRF device index for lookups on TX As with ingress use the index of VRF master device for route lookups on egress. However, the oif should only be used to direct the lookups to a specific table. Routes in the table are not based on the VRF device but rather interfaces that are part of the VRF so do not consider the oif for lookups within the table. The FLOWI_FLAG_VRFSRC is used to control this latter part. Signed-off-by: Shrijeet Mukherjee <shm@cumulusnetworks.com> Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-08-13 22:43:20 -07:00
Andy Whitcroft	25b97c016b	ipv4: off-by-one in continuation handling in /proc/net/route When generating /proc/net/route we emit a header followed by a line for each route. When a short read is performed we will restart this process based on the open file descriptor. When calculating the start point we fail to take into account that the 0th entry is the header. This leads us to skip the first entry when doing a continuation read. This can be easily seen with the comparison below: while read l; do echo "$l"; done </proc/net/route >A cat /proc/net/route >B diff -bu A B \| grep '^[+-]' On my example machine I have approximatly 10KB of route output. There we see the very first non-title element is lost in the while read case, and an entry around the 8K mark in the cat case: +wlan0 00000000 02021EAC 0003 0 0 400 00000000 0 0 0 -tun1 00C0AC0A 00000000 0001 0 0 950 00C0FFFF 0 0 0 Fix up the off-by-one when reaquiring position on continuation. Fixes: `8be33e955c` ("fib_trie: Fib walk rcu should take a tnode and key instead of a trie and a leaf") BugLink: http://bugs.launchpad.net/bugs/1483440 Acked-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: Andy Whitcroft <apw@canonical.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-08-13 21:16:29 -07:00
Alexander Duyck	1513069edc	fib_trie: Drop unnecessary calls to leaf_pull_suffix It was reported that update_suffix was taking a long time on systems where a large number of leaves were attached to a single node. As it turns out fib_table_flush was calling update_suffix for each leaf that didn't have all of the aliases stripped from it. As a result, on this large node removing one leaf would result in us calling update_suffix for every other leaf on the node. The fix is to just remove the calls to leaf_pull_suffix since they are redundant as we already have a call in resize that will go through and update the suffix length for the node before we exit out of fib_table_flush or fib_table_flush_external. Reported-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Tested-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-07-27 14:29:11 -07:00
Julian Anastasov	2392debc2b	ipv4: consider TOS in fib_select_default fib_select_default considers alternative routes only when res->fi is for the first alias in res->fa_head. In the common case this can happen only when the initial lookup matches the first alias with highest TOS value. This prevents the alternative routes to require specific TOS. This patch solves the problem as follows: - routes that require specific TOS should be returned by fib_select_default only when TOS matches, as already done in fib_table_lookup. This rule implies that depending on the TOS we can have many different lists of alternative gateways and we have to keep the last used gateway (fa_default) in first alias for the TOS instead of using single tb_default value. - as the aliases are ordered by many keys (TOS desc, fib_priority asc), we restrict the possible results to routes with matching TOS and lowest metric (fib_priority) and routes that match any TOS, again with lowest metric. For example, packet with TOS 8 can not use gw3 (not lowest metric), gw4 (different TOS) and gw6 (not lowest metric), all other gateways can be used: tos 8 via gw1 metric 2 <--- res->fa_head and res->fi tos 8 via gw2 metric 2 tos 8 via gw3 metric 3 tos 4 via gw4 tos 0 via gw5 tos 0 via gw6 metric 1 Reported-by: Hagen Paul Pfeifer <hagen@jauu.net> Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-07-24 22:46:11 -07:00
Andy Gospodarek	0eeb075fad	net: ipv4 sysctl option to ignore routes when nexthop link is down This feature is only enabled with the new per-interface or ipv4 global sysctls called 'ignore_routes_with_linkdown'. net.ipv4.conf.all.ignore_routes_with_linkdown = 0 net.ipv4.conf.default.ignore_routes_with_linkdown = 0 net.ipv4.conf.lo.ignore_routes_with_linkdown = 0 ... When the above sysctls are set, will report to userspace that a route is dead and will no longer resolve to this nexthop when performing a fib lookup. This will signal to userspace that the route will not be selected. The signalling of a RTNH_F_DEAD is only passed to userspace if the sysctl is enabled and link is down. This was done as without it the netlink listeners would have no idea whether or not a nexthop would be selected. The kernel only sets RTNH_F_DEAD internally if the interface has IFF_UP cleared. With the new sysctl set, the following behavior can be observed (interface p8p1 is link-down): default via 10.0.5.2 dev p9p1 10.0.5.0/24 dev p9p1 proto kernel scope link src 10.0.5.15 70.0.0.0/24 dev p7p1 proto kernel scope link src 70.0.0.1 80.0.0.0/24 dev p8p1 proto kernel scope link src 80.0.0.1 dead linkdown 90.0.0.0/24 via 80.0.0.2 dev p8p1 metric 1 dead linkdown 90.0.0.0/24 via 70.0.0.2 dev p7p1 metric 2 90.0.0.1 via 70.0.0.2 dev p7p1 src 70.0.0.1 cache local 80.0.0.1 dev lo src 80.0.0.1 cache <local> 80.0.0.2 via 10.0.5.2 dev p9p1 src 10.0.5.15 cache While the route does remain in the table (so it can be modified if needed rather than being wiped away as it would be if IFF_UP was cleared), the proper next-hop is chosen automatically when the link is down. Now interface p8p1 is linked-up: default via 10.0.5.2 dev p9p1 10.0.5.0/24 dev p9p1 proto kernel scope link src 10.0.5.15 70.0.0.0/24 dev p7p1 proto kernel scope link src 70.0.0.1 80.0.0.0/24 dev p8p1 proto kernel scope link src 80.0.0.1 90.0.0.0/24 via 80.0.0.2 dev p8p1 metric 1 90.0.0.0/24 via 70.0.0.2 dev p7p1 metric 2 192.168.56.0/24 dev p2p1 proto kernel scope link src 192.168.56.2 90.0.0.1 via 80.0.0.2 dev p8p1 src 80.0.0.1 cache local 80.0.0.1 dev lo src 80.0.0.1 cache <local> 80.0.0.2 dev p8p1 src 80.0.0.1 cache and the output changes to what one would expect. If the sysctl is not set, the following output would be expected when p8p1 is down: default via 10.0.5.2 dev p9p1 10.0.5.0/24 dev p9p1 proto kernel scope link src 10.0.5.15 70.0.0.0/24 dev p7p1 proto kernel scope link src 70.0.0.1 80.0.0.0/24 dev p8p1 proto kernel scope link src 80.0.0.1 linkdown 90.0.0.0/24 via 80.0.0.2 dev p8p1 metric 1 linkdown 90.0.0.0/24 via 70.0.0.2 dev p7p1 metric 2 Since the dead flag does not appear, there should be no expectation that the kernel would skip using this route due to link being down. v2: Split kernel changes into 2 patches, this actually makes a behavioral change if the sysctl is set. Also took suggestion from Alex to simplify code by only checking sysctl during fib lookup and suggestion from Scott to add a per-interface sysctl. v3: Code clean-ups to make it more readable and efficient as well as a reverse path check fix. v4: Drop binary sysctl v5: Whitespace fixups from Dave v6: Style changes from Dave and checkpatch suggestions v7: One more checkpatch fixup Signed-off-by: Andy Gospodarek <gospo@cumulusnetworks.com> Signed-off-by: Dinesh Dutt <ddutt@cumulusnetworks.com> Acked-by: Scott Feldman <sfeldma@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-06-24 02:15:54 -07:00
Roopa Prabhu	a2bb6d7d6f	ipv4: include NLM_F_APPEND flag in append route notifications This patch adds NLM_F_APPEND flag to struct nlmsg_hdr->nlmsg_flags in newroute notifications if the route add was an append. (This is similar to how NLM_F_REPLACE is already part of new route replace notifications today) This helps userspace determine if the route add operation was an append. Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Acked-by: Scott Feldman <sfeldma@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-06-21 10:23:04 -07:00
Firo Yang	f38b24c905	fib_trie: coding style: Use pointer after check As Alexander Duyck pointed out that: struct tnode { ... struct key_vector kv[1]; } The kv[1] member of struct tnode is an arry that refernced by a null pointer will not crash the system, like this: struct tnode p = NULL; struct key_vector kv = p->kv; As such p->kv doesn't actually dereference anything, it is simply a means for getting the offset to the array from the pointer p. This patch make the code more regular to avoid making people feel odd when they look at the code. Signed-off-by: Firo Yang <firogm@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-06-07 23:45:40 -07:00
David S. Miller	ffa915d071	ipv4: Fix fib_trie.c build, missing linux/vmalloc.h include. We used to get this indirectly I supposed, but no longer do. Either way, an explicit include should have been done in the first place. net/ipv4/fib_trie.c: In function '__node_free_rcu': >> net/ipv4/fib_trie.c:293:3: error: implicit declaration of function 'vfree' [-Werror=implicit-function-declaration] vfree(n); ^ net/ipv4/fib_trie.c: In function 'tnode_alloc': >> net/ipv4/fib_trie.c:312:3: error: implicit declaration of function 'vzalloc' [-Werror=implicit-function-declaration] return vzalloc(size); ^ >> net/ipv4/fib_trie.c:312:3: warning: return makes pointer from integer without a cast cc1: some warnings being treated as errors Reported-by: kbuild test robot <fengguang.wu@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-27 00:19:03 -04:00
David S. Miller	36583eb54d	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Conflicts: drivers/net/ethernet/cadence/macb.c drivers/net/phy/phy.c include/linux/skbuff.h net/ipv4/tcp.c net/switchdev/switchdev.c Switchdev was a case of RTNH_H_{EXTERNAL --> OFFLOAD} renaming overlapping with net-next changes of various sorts. phy.c was a case of two changes, one adding a local variable to a function whilst the second was removing one. tcp.c overlapped a deadlock fix with the addition of new tcp_info statistic values. macb.c involved the addition of two zyncq device entries. skbuff.h involved adding back ipv4_daddr to nf_bridge_info whilst net-next changes put two other existing members of that struct into a union. Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-23 01:22:35 -04:00
Michal Kubeček	d4e64c2909	ipv4: fill in table id when replacing a route When replacing an IPv4 route, tb_id member of the new fib_alias structure is not set in the replace code path so that the new route is ignored. Fixes: `0ddcf43d5d` ("ipv4: FIB Local/MAIN table collapse") Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Acked-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-22 14:33:17 -04:00
Roopa Prabhu	eea39946a1	rename RTNH_F_EXTERNAL to RTNH_F_OFFLOAD RTNH_F_EXTERNAL today is printed as "offload" in iproute2 output. This patch renames the flag to be consistent with what the user sees. Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-14 22:45:39 -04:00
Jiri Pirko	ebb9a03a59	switchdev: s/netdev_switch_/switchdev_/ and s/NETDEV_SWITCH_/SWITCHDEV_/ Turned out that "switchdev" sticks. So just unify all related terms to use this prefix. Signed-off-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: Scott Feldman <sfeldma@gmail.com> Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com> Acked-by: Andy Gospodarek <gospo@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-12 18:43:52 -04:00
Ian Morris	00db41243e	ipv4: coding style: comparison for inequality with NULL The ipv4 code uses a mixture of coding styles. In some instances check for non-NULL pointer is done as x != NULL and sometimes as x. x is preferred according to checkpatch and this patch makes the code consistent by adopting the latter form. No changes detected by objdiff. Signed-off-by: Ian Morris <ipm@chirality.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-04-03 12:11:15 -04:00
Ian Morris	51456b2914	ipv4: coding style: comparison for equality with NULL The ipv4 code uses a mixture of coding styles. In some instances check for NULL pointer is done as x == NULL and sometimes as !x. !x is preferred according to checkpatch and this patch makes the code consistent by adopting the latter form. No changes detected by objdiff. Signed-off-by: Ian Morris <ipm@chirality.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-04-03 12:11:15 -04:00
Alexander Duyck	b6f15f828d	fib_trie: Fix regression in handling of inflate/halve failure When I updated the code to address a possible null pointer dereference in resize I ended up reverting an exception handling fix for the suffix length in the event that inflate or halve failed. This change is meant to correct that by reverting the earlier fix and instead simply getting the parent again after inflate has been completed to avoid the possible null pointer issue. Fixes: `ddb4b9a13` ("fib_trie: Address possible NULL pointer dereference in resize") Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-23 16:58:32 -04:00
Alexander Duyck	0b65bd97ba	fib_trie: Provide a deterministic order for fib_alias w/ tables merged This change makes it so that we should always have a deterministic ordering for the main and local aliases within the merged table when two leaves overlap. So for example if we have a leaf with a key of 192.168.254.0. If we previously added two aliases with a prefix length of 24 from both local and main the first entry would be first and the second would be second. When I was coding this I had added a WARN_ON should such a situation occur as I wasn't sure how likely it would be. However this WARN_ON has been triggered so this is something that should be addressed. With this patch the ordering of the aliases is as follows. First they are sorted on prefix length, then on their table ID, then tos, and finally priority. This way what we end up doing is essentially interleaving the two tables on what used to be leaf_info structure boundaries. Fixes: `0ddcf43d5` ("ipv4: FIB Local/MAIN table collapse") Reported-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-12 18:26:51 -04:00
Alexander Duyck	654eff4516	fib_trie: Only display main table in /proc/net/route When we merged the tries for local and main I had overlooked the iterator for /proc/net/route. As a result it was outputting both local and main when the two tries were merged. This patch resolves that by only providing output for aliases that are actually in the main trie. As a result we should go back to the original behavior which I assume will be necessary to maintain legacy support. Fixes: `0ddcf43d5` ("ipv4: FIB Local/MAIN table collapse") Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-11 21:24:32 -04:00
Alexander Duyck	0ddcf43d5d	ipv4: FIB Local/MAIN table collapse This patch is meant to collapse local and main into one by converting tb_data from an array to a pointer. Doing this allows us to point the local table into the main while maintaining the same variables in the table. As such the tb_data was converted from an array to a pointer, and a new array called data is added in order to still provide an object for tb_data to point to. In order to track the origin of the fib aliases a tb_id value was added in a hole that existed on 64b systems. Using this we can also reverse the merge in the event that custom FIB rules are enabled. With this patch I am seeing an improvement of 20ns to 30ns for routing lookups as long as custom rules are not enabled, with custom rules enabled we fall back to split tables and the original behavior. Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-11 16:22:14 -04:00
Alexander Duyck	ddb4b9a132	fib_trie: Address possible NULL pointer dereference in resize If the inflate call failed it would return NULL. As a result tp would be set to NULL and cause use to trigger a NULL pointer dereference in should_halve if the inflate failed on the first attempt. In order to prevent this we should decrement max_work before we actually attempt to inflate as this will force us to exit before attempting to halve a node we should have inflated. In order to keep things symmetric between inflate and halve I went ahead and also moved the decrement of max_work for the halve case as well so we take care of that before we actually attempt to halve the tnode. Fixes: `88bae714` ("fib_trie: Add key vector to root, return parent key_vector in resize") Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-10 18:36:56 -04:00
Alexander Duyck	3ec320dd5c	fib_trie: Correctly handle case of key == 0 in leaf_walk_rcu In the case of a trie that had no tnodes with a key of 0 the initial look-up would fail resulting in an out-of-bounds cindex on the first tnode. This resulted in an entire trie being skipped. In order resolve this I have updated the cindex logic in the initial look-up so that if the key is zero we will always traverse the child zero path. Fixes: `8be33e95` ("fib_trie: Fib walk rcu should take a tnode and key instead of a trie and a leaf") Reported-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Tested-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-10 16:13:55 -04:00
Scott Feldman	f8f2147150	switchdev: add netlink flags to IPv4 FIB add op Pass in the netlink flags (NLM_F_*) into switchdev driver for IPv4 FIB add op to allow driver to 1) optimize hardware updates, 2) handle ip route prepend and append commands correctly. Suggested-by: Jamal Hadi Salim <jhs@mojatatu.com> Suggested-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: Scott Feldman <sfeldma@gmail.com> Reviewed-by: Simon Horman <simon.horman@netronome.com> Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-09 23:56:52 -04:00
Alexander Duyck	88bae7149a	fib_trie: Add key vector to root, return parent key_vector in resize This change makes it so that the root of the trie contains a key_vector, by doing this we make room to essentially collapse the entire trie by at least one cache line as we can store the information about the tnode or leaf that is pointed to in the root. Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-06 15:49:28 -05:00
Alexander Duyck	f23e59fbd7	fib_trie: Move parent from key_vector to tnode This change pulls the parent pointer from the key_vector and places it in the tnode structure. Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-06 15:49:28 -05:00
Alexander Duyck	6e22d174ba	fib_trie: Pull empty_children and full_children into tnode This pulls the information about the child array out of the key_vector and places it in the tnode since that is where it is needed. Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-06 15:49:28 -05:00
Alexander Duyck	56ca2adf6a	fib_trie: Move rcu from key_vector to tnode, add accessors. RCU is only needed once for the entire node, not once per key_vector so we can pull that out and move it to the tnode structure. In addition add accessors to be used inside the RCU functions so that we can more easily get from the key vector to either the tnode or the trie pointers. Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-06 15:49:28 -05:00
Alexander Duyck	dc35dbeda3	fib_trie: Add tnode struct as a container for fields not needed in key_vector This change pulls the fields not explicitly needed in the key_vector and placed them in the new tnode structure. By doing this we will eventually be able to reduce the key_vector down to 16 bytes on 64 bit systems, and 12 bytes on 32 bit systems. Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-06 15:49:28 -05:00
Alexander Duyck	2e1ac88a48	fib_trie: Rename tnode_child_length to child_length We are now checking the length of a key_vector instead of a tnode so it makes sense to probably just rename this to child_length since it would probably even be applicable to a leaf. Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-06 15:49:28 -05:00
Alexander Duyck	754baf8dec	fib_trie: replace tnode_get_child functions with get_child macros I am replacing the tnode_get_child call with get_child since we are techically pulling the child out of a key_vector now and not a tnode. Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-06 15:49:27 -05:00
Alexander Duyck	35c6edac19	fib_trie: Rename tnode to key_vector Rename the tnode to key_vector. The key_vector will be the eventual container for all of the information needed by either a leaf or a tnode. The final result should be much smaller than the 40 bytes currently needed for either one. This also updates the trie struct so that it contains an array of size 1 of tnode pointers. This is to bring the structure more inline with how an actual tnode itself is configured. Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-06 15:49:27 -05:00
Alexander Duyck	8d8e810ca8	fib_trie: Return pointer to tnode pointer in resize/inflate/halve Resize related functions now all return a pointer to the pointer that references the object that was resized. Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-06 15:49:27 -05:00
Alexander Duyck	72be72607a	fib_trie: Minor cleanups to fib_table_flush_external This change just does a couple of minor cleanups on fib_table_flush_external. Specifically it addresses the fact that resize was being called even though nothing was being removed from the table, and it drops an unecessary indent since we could just call continue on the inverse of the fi && flag check. Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-06 15:49:27 -05:00
David S. Miller	23375a0fd5	ipv4: Fix unused variable warnings in fib_table_flush_external. net/ipv4/fib_trie.c: In function ‘fib_table_flush_external’: net/ipv4/fib_trie.c:1572:6: warning: unused variable ‘found’ [-Wunused-variable] int found = 0; ^ net/ipv4/fib_trie.c:1571:16: warning: unused variable ‘slen’ [-Wunused-variable] unsigned char slen; ^ Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-06 00:38:35 -05:00
Scott Feldman	8e05fd7166	fib: hook IPv4 fib for hardware offload Call into the switchdev driver any time an IPv4 fib entry is added/modified/deleted from the kernel's FIB. The switchdev driver may or may not install the route to the offload device. In the case where the driver tries to install the route and something goes wrong (device's routing table is full, etc), then all of the offloaded routes will be flushed from the device, route forwarding falls back to the kernel, and no more routes are offloading. We can refine this logic later. For now, use the simplist model of offloading routes up to the point of failure, and then on failure, undo everything and mark IPv4 offloading disabled. Signed-off-by: Scott Feldman <sfeldma@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-06 00:24:58 -05:00
Scott Feldman	104616e74e	switchdev: don't support custom ip rules, for now Keep switchdev FIB offload model simple for now and don't allow custom ip rules. Signed-off-by: Scott Feldman <sfeldma@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-06 00:24:58 -05:00
Alexander Duyck	1de3d87bcd	fib_trie: Prevent allocating tnode if bits is too big for size_t This patch adds code to prevent us from attempting to allocate a tnode with a size larger than what can be represented by size_t. Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-04 23:35:18 -05:00
Alexander Duyck	71e8b67d0f	fib_trie: Update last spot w/ idx >> n->bits code and explanation This change updates the fib_table_lookup function so that it is in sync with the fib_find_node function in terms of the explanation for the index check based on the bits value. I have also updated it from doing a mask to just doing a compare as I have found that seems to provide more options to the compiler as I have seen it turn this into a shift of the value and test under some circumstances. In addition I addressed one minor issue in which we kept computing the key ^ n->key when checking the fib aliases. I pulled the xor out of the loop in order to reduce the number of memory reads in the lookup. As a result we should save a couple cycles since the xor is only done once much earlier in the lookup. Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-04 23:35:18 -05:00
Alexander Duyck	a7e5353123	fib_trie: Make fib_table rcu safe The fib_table was wrapped in several places with an rcu_read_lock/rcu_read_unlock however after looking over the code I found several spots where the tables were being accessed as just standard pointers without any protections. This change fixes that so that all of the proper protections are in place when accessing the table to take RCU replacement or removal of the table into account. Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-04 23:35:18 -05:00
Alexander Duyck	41b489fd6c	fib_trie: move leaf and tnode to occupy the same spot in the key vector If we are going to compact the leaf and tnode we first need to make sure the fields are all in the same place. In that regard I am moving the leaf pointer which represents the fib_alias hash list to occupy what is currently the first key_vector pointer. Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-04 23:35:18 -05:00
Alexander Duyck	d5d6487cb8	fib_trie: Update insert and delete to make use of tp from find_node This change makes it so that the insert and delete functions make use of the tnode pointer returned in the fib_find_node call. By doing this we will not have to rely on the parent pointer in the leaf which will be going away soon. Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-04 23:35:18 -05:00
Alexander Duyck	d4a975e83f	fib_trie: Fib find node should return parent This change makes it so that the parent pointer is returned by reference in fib_find_node. By doing this I can use it to find the parent node when I am performing an insertion and I don't have to look for it again in fib_insert_node. Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-04 23:35:17 -05:00
Alexander Duyck	8be33e955c	fib_trie: Fib walk rcu should take a tnode and key instead of a trie and a leaf This change makes it so that leaf_walk_rcu takes a tnode and a key instead of the trie and a leaf. The main idea behind this is to avoid using the leaf parent pointer as that can have additional overhead in the future as I am trying to reduce the size of a leaf down to 16 bytes on 64b systems and 12b on 32b systems. Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-04 23:35:17 -05:00
Alexander Duyck	7289e6ddb6	fib_trie: Only resize tnodes once instead of on each leaf removal in fib_table_flush This change makes it so that we only call resize on the tnodes, instead of from each of the leaves. By doing this we can significantly reduce the amount of time spent resizing as we can update all of the leaves in the tnode first before we make any determinations about resizing. As a result we can simply free the tnode in the case that all of the leaves from a given tnode are flushed instead of resizing with each leaf removed. Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-04 23:35:17 -05:00

1 2 3 4 5 ...

262 Commits