linux

Author	SHA1	Message	Date
Eric W. Biederman	ddb3b6033c	net: Remove protocol from struct dst_ops After my change to neigh_hh_init to obtain the protocol from the neigh_table there are no more users of protocol in struct dst_ops. Remove the protocol field from dst_ops and all of it's initializers. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-09 16:06:10 -04:00
David S. Miller	5428aef811	Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next Pablo Neira Ayuso says: ==================== Netfilter updates for net-next The following patchset contains Netfilter updates for your net-next tree. Basically, improvements for the packet rejection infrastructure, deprecation of CLUSTERIP, cleanups for nf_tables and some untangling for br_netfilter. More specifically they are: 1) Send packet to reset flow if checksum is valid, from Florian Westphal. 2) Fix nf_tables reject bridge from the input chain, also from Florian. 3) Deprecate the CLUSTERIP target, the cluster match supersedes it in functionality and it's known to have problems. 4) A couple of cleanups for nf_tables rule tracing infrastructure, from Patrick McHardy. 5) Another cleanup to place transaction declarations at the bottom of nf_tables.h, also from Patrick. 6) Consolidate Kconfig dependencies wrt. NF_TABLES. 7) Limit table names to 32 bytes in nf_tables. 8) mac header copying in bridge netfilter is already required when calling ip_fragment(), from Florian Westphal. 9) move nf_bridge_update_protocol() to br_netfilter.c, also from Florian. 10) Small refactor in br_netfilter in the transmission path, again from Florian. 11) Move br_nf_pre_routing_finish_bridge_slow() to br_netfilter. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-09 15:58:21 -04:00
Alexander Duyck	88bae7149a	fib_trie: Add key vector to root, return parent key_vector in resize This change makes it so that the root of the trie contains a key_vector, by doing this we make room to essentially collapse the entire trie by at least one cache line as we can store the information about the tnode or leaf that is pointed to in the root. Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-06 15:49:28 -05:00
Alexander Duyck	f23e59fbd7	fib_trie: Move parent from key_vector to tnode This change pulls the parent pointer from the key_vector and places it in the tnode structure. Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-06 15:49:28 -05:00
Alexander Duyck	6e22d174ba	fib_trie: Pull empty_children and full_children into tnode This pulls the information about the child array out of the key_vector and places it in the tnode since that is where it is needed. Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-06 15:49:28 -05:00
Alexander Duyck	56ca2adf6a	fib_trie: Move rcu from key_vector to tnode, add accessors. RCU is only needed once for the entire node, not once per key_vector so we can pull that out and move it to the tnode structure. In addition add accessors to be used inside the RCU functions so that we can more easily get from the key vector to either the tnode or the trie pointers. Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-06 15:49:28 -05:00
Alexander Duyck	dc35dbeda3	fib_trie: Add tnode struct as a container for fields not needed in key_vector This change pulls the fields not explicitly needed in the key_vector and placed them in the new tnode structure. By doing this we will eventually be able to reduce the key_vector down to 16 bytes on 64 bit systems, and 12 bytes on 32 bit systems. Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-06 15:49:28 -05:00
Alexander Duyck	2e1ac88a48	fib_trie: Rename tnode_child_length to child_length We are now checking the length of a key_vector instead of a tnode so it makes sense to probably just rename this to child_length since it would probably even be applicable to a leaf. Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-06 15:49:28 -05:00
Alexander Duyck	754baf8dec	fib_trie: replace tnode_get_child functions with get_child macros I am replacing the tnode_get_child call with get_child since we are techically pulling the child out of a key_vector now and not a tnode. Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-06 15:49:27 -05:00
Alexander Duyck	35c6edac19	fib_trie: Rename tnode to key_vector Rename the tnode to key_vector. The key_vector will be the eventual container for all of the information needed by either a leaf or a tnode. The final result should be much smaller than the 40 bytes currently needed for either one. This also updates the trie struct so that it contains an array of size 1 of tnode pointers. This is to bring the structure more inline with how an actual tnode itself is configured. Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-06 15:49:27 -05:00
Alexander Duyck	8d8e810ca8	fib_trie: Return pointer to tnode pointer in resize/inflate/halve Resize related functions now all return a pointer to the pointer that references the object that was resized. Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-06 15:49:27 -05:00
Alexander Duyck	72be72607a	fib_trie: Minor cleanups to fib_table_flush_external This change just does a couple of minor cleanups on fib_table_flush_external. Specifically it addresses the fact that resize was being called even though nothing was being removed from the table, and it drops an unecessary indent since we could just call continue on the inverse of the fi && flag check. Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-06 15:49:27 -05:00
Fan Du	05cbc0db03	ipv4: Create probe timer for tcp PMTU as per RFC4821 As per RFC4821 7.3. Selecting Probe Size, a probe timer should be armed once probing has converged. Once this timer expired, probing again to take advantage of any path PMTU change. The recommended probing interval is 10 minutes per RFC1981. Probing interval could be sysctled by sysctl_tcp_probe_interval. Eric Dumazet suggested to implement pseudo timer based on 32bits jiffies tcp_time_stamp instead of using classic timer for such rare event. Signed-off-by: Fan Du <fan.du@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-06 14:57:42 -05:00
Fan Du	6b58e0a5f3	ipv4: Use binary search to choose tcp PMTU probe_size Current probe_size is chosen by doubling mss_cache, the probing process will end shortly with a sub-optimal mss size, and the link mtu will not be taken full advantage of, in return, this will make user to tweak tcp_base_mss with care. Use binary search to choose probe_size in a fine granularity manner, an optimal mss will be found to boost performance as its maxmium. In addition, introduce a sysctl_tcp_probe_threshold to control when probing will stop in respect to the width of search range. Test env: Docker instance with vxlan encapuslation(82599EB) iperf -c 10.0.0.24 -t 60 before this patch: 1.26 Gbits/sec After this patch: increase 26% 1.59 Gbits/sec Signed-off-by: Fan Du <fan.du@intel.com> Acked-by: John Heffner <johnwheffner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-06 14:57:41 -05:00
David S. Miller	23375a0fd5	ipv4: Fix unused variable warnings in fib_table_flush_external. net/ipv4/fib_trie.c: In function ‘fib_table_flush_external’: net/ipv4/fib_trie.c:1572:6: warning: unused variable ‘found’ [-Wunused-variable] int found = 0; ^ net/ipv4/fib_trie.c:1571:16: warning: unused variable ‘slen’ [-Wunused-variable] unsigned char slen; ^ Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-06 00:38:35 -05:00
Scott Feldman	8e05fd7166	fib: hook IPv4 fib for hardware offload Call into the switchdev driver any time an IPv4 fib entry is added/modified/deleted from the kernel's FIB. The switchdev driver may or may not install the route to the offload device. In the case where the driver tries to install the route and something goes wrong (device's routing table is full, etc), then all of the offloaded routes will be flushed from the device, route forwarding falls back to the kernel, and no more routes are offloading. We can refine this logic later. For now, use the simplist model of offloading routes up to the point of failure, and then on failure, undo everything and mark IPv4 offloading disabled. Signed-off-by: Scott Feldman <sfeldma@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-06 00:24:58 -05:00
Scott Feldman	104616e74e	switchdev: don't support custom ip rules, for now Keep switchdev FIB offload model simple for now and don't allow custom ip rules. Signed-off-by: Scott Feldman <sfeldma@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-06 00:24:58 -05:00
Eric Dumazet	496127290f	inet_diag: remove duplicate code from inet_twsk_diag_dump() timewait sockets now share a common base with established sockets. inet_twsk_diag_dump() can use inet_diag_bc_sk() instead of duplicating code, granted that inet_diag_bc_sk() does proper userlocks initialization. twsk_build_assert() will catch any future changes that could break the assumptions. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-05 22:55:44 -05:00
Pablo Neira Ayuso	f04e599e20	netfilter: nf_tables: consolidate Kconfig options Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2015-03-06 01:21:15 +01:00
Pablo Neira Ayuso	43270b1bc5	netfilter: ipt_CLUSTERIP: deprecate it in favour of xt_cluster xt_cluster supersedes ipt_CLUSTERIP since it can be also used in gateway configurations (not only from the backend side). ipt_CLUSTER is also known to leak the netdev that it uses on device removal, which requires a rather large fix to workaround the problem: http://patchwork.ozlabs.org/patch/358629/ So let's deprecate this so we can probably kill code this in the future. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2015-03-06 01:21:05 +01:00
Alexander Duyck	1de3d87bcd	fib_trie: Prevent allocating tnode if bits is too big for size_t This patch adds code to prevent us from attempting to allocate a tnode with a size larger than what can be represented by size_t. Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-04 23:35:18 -05:00
Alexander Duyck	71e8b67d0f	fib_trie: Update last spot w/ idx >> n->bits code and explanation This change updates the fib_table_lookup function so that it is in sync with the fib_find_node function in terms of the explanation for the index check based on the bits value. I have also updated it from doing a mask to just doing a compare as I have found that seems to provide more options to the compiler as I have seen it turn this into a shift of the value and test under some circumstances. In addition I addressed one minor issue in which we kept computing the key ^ n->key when checking the fib aliases. I pulled the xor out of the loop in order to reduce the number of memory reads in the lookup. As a result we should save a couple cycles since the xor is only done once much earlier in the lookup. Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-04 23:35:18 -05:00
Alexander Duyck	a7e5353123	fib_trie: Make fib_table rcu safe The fib_table was wrapped in several places with an rcu_read_lock/rcu_read_unlock however after looking over the code I found several spots where the tables were being accessed as just standard pointers without any protections. This change fixes that so that all of the proper protections are in place when accessing the table to take RCU replacement or removal of the table into account. Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-04 23:35:18 -05:00
Alexander Duyck	41b489fd6c	fib_trie: move leaf and tnode to occupy the same spot in the key vector If we are going to compact the leaf and tnode we first need to make sure the fields are all in the same place. In that regard I am moving the leaf pointer which represents the fib_alias hash list to occupy what is currently the first key_vector pointer. Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-04 23:35:18 -05:00
Alexander Duyck	d5d6487cb8	fib_trie: Update insert and delete to make use of tp from find_node This change makes it so that the insert and delete functions make use of the tnode pointer returned in the fib_find_node call. By doing this we will not have to rely on the parent pointer in the leaf which will be going away soon. Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-04 23:35:18 -05:00
Alexander Duyck	d4a975e83f	fib_trie: Fib find node should return parent This change makes it so that the parent pointer is returned by reference in fib_find_node. By doing this I can use it to find the parent node when I am performing an insertion and I don't have to look for it again in fib_insert_node. Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-04 23:35:17 -05:00
Alexander Duyck	8be33e955c	fib_trie: Fib walk rcu should take a tnode and key instead of a trie and a leaf This change makes it so that leaf_walk_rcu takes a tnode and a key instead of the trie and a leaf. The main idea behind this is to avoid using the leaf parent pointer as that can have additional overhead in the future as I am trying to reduce the size of a leaf down to 16 bytes on 64b systems and 12b on 32b systems. Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-04 23:35:17 -05:00
Alexander Duyck	7289e6ddb6	fib_trie: Only resize tnodes once instead of on each leaf removal in fib_table_flush This change makes it so that we only call resize on the tnodes, instead of from each of the leaves. By doing this we can significantly reduce the amount of time spent resizing as we can update all of the leaves in the tnode first before we make any determinations about resizing. As a result we can simply free the tnode in the case that all of the leaves from a given tnode are flushed instead of resizing with each leaf removed. Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-04 23:35:17 -05:00
Eric W. Biederman	60395a20ff	neigh: Factor out ___neigh_lookup_noref While looking at the mpls code I found myself writing yet another version of neigh_lookup_noref. We currently have __ipv4_lookup_noref and __ipv6_lookup_noref. So to make my work a little easier and to make it a smidge easier to verify/maintain the mpls code in the future I stopped and wrote ___neigh_lookup_noref. Then I rewote __ipv4_lookup_noref and __ipv6_lookup_noref in terms of this new function. I tested my new version by verifying that the same code is generated in ip_finish_output2 and ip6_finish_output2 where these functions are inlined. To get to ___neigh_lookup_noref I added a new neighbour cache table function key_eq. So that the static size of the key would be available. I also added __neigh_lookup_noref for people who want to to lookup a neighbour table entry quickly but don't know which neibhgour table they are going to look up. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-04 00:23:23 -05:00
David S. Miller	71a83a6db6	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Conflicts: drivers/net/ethernet/rocker/rocker.c The rocker commit was two overlapping changes, one to rename the ->vport member to ->pport, and another making the bitmask expression use '1ULL' instead of plain '1'. Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-03 21:16:48 -05:00
Michal Kubeček	acf8dd0a9d	udp: only allow UFO for packets from SOCK_DGRAM sockets If an over-MTU UDP datagram is sent through a SOCK_RAW socket to a UFO-capable device, ip_ufo_append_data() sets skb->ip_summed to CHECKSUM_PARTIAL unconditionally as all GSO code assumes transport layer checksum is to be computed on segmentation. However, in this case, skb->csum_start and skb->csum_offset are never set as raw socket transmit path bypasses udp_send_skb() where they are usually set. As a result, driver may access invalid memory when trying to calculate the checksum and store the result (as observed in virtio_net driver). Moreover, the very idea of modifying the userspace provided UDP header is IMHO against raw socket semantics (I wasn't able to find a document clearly stating this or the opposite, though). And while allowing CHECKSUM_NONE in the UFO case would be more efficient, it would be a bit too intrusive change just to handle a corner case like this. Therefore disallowing UFO for packets from SOCK_DGRAM seems to be the best option. Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-02 22:19:29 -05:00
Florian Westphal	ee586bbc28	netfilter: reject: don't send icmp error if csum is invalid tcp resets are never emitted if the packet that triggers the reject/reset has an invalid checksum. For icmp error responses there was no such check. It allows to distinguish icmp response generated via iptables -I INPUT -p udp --dport 42 -j REJECT and those emitted by network stack (won't respond if csum is invalid, REJECT does). Arguably its possible to avoid this by using conntrack and only using REJECT with -m conntrack NEW/RELATED. However, this doesn't work when connection tracking is not in use or when using nf_conntrack_checksum=0. Furthermore, sending errors in response to invalid csums doesn't make much sense so just add similar test as in nf_send_reset. Validate csum if needed and only send the response if it is ok. Reference: http://bugzilla.redhat.com/show_bug.cgi?id=1169829 Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2015-03-03 02:10:35 +01:00
Eric W. Biederman	bdf53c5849	neigh: Don't require dst in neigh_hh_init - Add protocol to neigh_tbl so that dst->ops->protocol is not needed - Acquire the device from neigh->dev This results in a neigh_hh_init that will cache the samve values regardless of the packets flowing through it. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-02 16:43:41 -05:00
Eric W. Biederman	59b2af26b9	arp: Kill arp_find There are no more callers so kill this function. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-02 16:43:41 -05:00
Eric W. Biederman	21bfb8e933	arp: Remove special case to give AX25 it's open arp operations. The special case has been pushed out into ax25_neigh_construct so there is no need to keep this code in arp.c Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-02 16:43:40 -05:00
Ying Xue	1b78414047	net: Remove iocb argument from sendmsg and recvmsg After TIPC doesn't depend on iocb argument in its internal implementations of sendmsg() and recvmsg() hooks defined in proto structure, no any user is using iocb argument in them at all now. Then we can drop the redundant iocb argument completely from kinds of implementations of both sendmsg() and recvmsg() in the entire networking stack. Cc: Christoph Hellwig <hch@lst.de> Suggested-by: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-02 13:06:31 -05:00
Eyal Birger	b4772ef879	net: use common macro for assering skb->cb[] available size in protocol families As part of an effort to move skb->dropcount to skb->cb[] use a common macro in protocol families using skb->cb[] for ancillary data to validate available room in skb->cb[]. Signed-off-by: Eyal Birger <eyal.birger@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-03-02 00:19:30 -05:00
Eric Dumazet	74abc20ced	tcp: cleanup static functions tcp_fastopen_create_child() is static and should not be exported. tcp4_gso_segment() and tcp6_gso_segment() should be static. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-02-28 16:56:51 -05:00
Eric Dumazet	a0ea700e40	tcp: tso: allow CA_CWR state in tcp_tso_should_defer() Another TCP issue is triggered by ECN. Under pressure, receiver gets ECN marks, and send back ACK packets with ECE TCP flag. Senders enter CA_CWR state. In this state, tcp_tso_should_defer() is short cut : if (icsk->icsk_ca_state != TCP_CA_Open) goto send_now; This means that about all ACK packets we receive are triggering a partial send, and because cwnd is kept small, we can only send a small amount of data for each incoming ACK, which in return generate more ACK packets. Allowing CA_Open and CA_CWR states to enable TSO defer in tcp_tso_should_defer() brings performance back : TSO autodefer has more chance to defer under pressure. This patch increases TSO and LRO/GRO efficiency back to normal levels, and does not impact overall ECN behavior. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-02-28 15:10:39 -05:00
Eric Dumazet	50c8339e92	tcp: tso: restore IW10 after TSO autosizing With sysctl_tcp_min_tso_segs being 4, it is very possible that tcp_tso_should_defer() decides not sending last 2 MSS of initial window of 10 packets. This also applies if autosizing decides to send X MSS per GSO packet, and cwnd is not a multiple of X. This patch implements an heuristic based on age of first skb in write queue : If it was sent very recently (less than half srtt), we can predict that no ACK packet will come in less than half rtt, so deferring might cause an under utilization of our window. This is visible on initial send (IW10) on web servers, but more generally on some RPC, as the last part of the message might need an extra RTT to get delivered. Tested: Ran following packetdrill test // A simple server-side test that sends exactly an initial window (IW10) // worth of packets. `sysctl -e -q net.ipv4.tcp_min_tso_segs=4` 0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3 +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0 +0 bind(3, ..., ...) = 0 +0 listen(3, 1) = 0 +.1 < S 0:0(0) win 32792 <mss 1460,sackOK,nop,nop,nop,wscale 7> +0 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 6> +.1 < . 1:1(0) ack 1 win 257 +0 accept(3, ..., ...) = 4 +0 write(4, ..., 14600) = 14600 +0 > . 1:5841(5840) ack 1 win 457 +0 > . 5841:11681(5840) ack 1 win 457 // Following packet should be sent right now. +0 > P. 11681:14601(2920) ack 1 win 457 +.1 < . 1:1(0) ack 14601 win 257 +0 close(4) = 0 +0 > F. 14601:14601(0) ack 1 +.1 < F. 1:1(0) ack 14602 win 257 +0 > . 14602:14602(0) ack 2 Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-02-28 15:10:39 -05:00
Eric Dumazet	5f852eb536	tcp: tso: remove tp->tso_deferred TSO relies on ability to defer sending a small amount of packets. Heuristic is to wait for future ACKS in hope to send more packets at once. Current algorithm uses a per socket tso_deferred field as a pseudo timer. This pseudo timer relies on future ACK, but there is no guarantee we receive them in time. Fix would be to use a real timer, but cost of such timer is probably too expensive for typical cases. This patch changes the logic to test the time of last transmit, because we should not add bursts of more than 1ms for any given flow. We've used this patch for about two years at Google, before FQ/pacing as it would reduce a fair amount of bursts. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-02-28 15:10:39 -05:00
Alexander Duyck	79e5ad2ceb	fib_trie: Remove leaf_info At this point the leaf_info hash is redundant. By adding the suffix length to the fib_alias hash list we no longer have need of leaf_info as we can determine the prefix length from fa_slen. So we can compress things by dropping the leaf_info structure from fib_trie and instead directly connect the leaves to the fib_alias hash list. Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-02-27 16:37:07 -05:00
Alexander Duyck	9b6ebad5c3	fib_trie: Add slen to fib alias Make use of an empty spot in the alias to store the suffix length so that we don't need to pull that information from the leaf_info structure. This patch also makes a slight change to the user statistics. Instead of incrementing semantic_match_miss once per leaf_info miss we now just increment it once per leaf if a match was not found. Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-02-27 16:37:07 -05:00
Alexander Duyck	5786ec6054	fib_trie: Replace plen with slen in leaf_info This replaces the prefix length variable in the leaf_info structure with a suffix length value, or host identifier length in bits. By doing this it makes it easier to sort out since the tnodes and leaf are carrying this value as well since it is compatible with the ->pos field in tnodes. I also cleaned up one spot that had some list manipulation that could be simplified. I basically updated it so that we just use hlist_add_head_rcu instead of calling hlist_add_before_rcu on the first node in the list. Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-02-27 16:37:06 -05:00
Alexander Duyck	56315f9e6e	fib_trie: Convert fib_alias to hlist from list There isn't any advantage to having it as a list and by making it an hlist we make the fib_alias more compatible with the list_info in terms of the type of list used. Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-02-27 16:37:06 -05:00
Madhu Challa	93a714d6b5	multicast: Extend ip address command to enable multicast group join/leave on Joining multicast group on ethernet level via "ip maddr" command would not work if we have an Ethernet switch that does igmp snooping since the switch would not replicate multicast packets on ports that did not have IGMP reports for the multicast addresses. Linux vxlan interfaces created via "ip link add vxlan" have the group option that enables then to do the required join. By extending ip address command with option "autojoin" we can get similar functionality for openvswitch vxlan interfaces as well as other tunneling mechanisms that need to receive multicast traffic. The kernel code is structured similar to how the vxlan driver does a group join / leave. example: ip address add 224.1.1.10/24 dev eth5 autojoin ip address del 224.1.1.10/24 dev eth5 Signed-off-by: Madhu Challa <challa@noironetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-02-27 16:25:25 -05:00
Tom Herbert	723b8e460d	udp: In udp_flow_src_port use random hash value if skb_get_hash fails In the unlikely event that skb_get_hash is unable to deduce a hash in udp_flow_src_port we use a consistent random value instead. This is specified in GRE/UDP draft section 3.2.1: https://tools.ietf.org/html/draft-ietf-tsvwg-gre-in-udp-encap-04 Signed-off-by: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-02-27 16:00:01 -05:00
Neal Cardwell	6514890f7a	tcp: fix tcp_should_expand_sndbuf() to use tcp_packets_in_flight() tcp_should_expand_sndbuf() does not expand the send buffer if we have filled the congestion window. However, it should use tcp_packets_in_flight() instead of tp->packets_out to make this check. Testing has established that the difference matters a lot if there are many SACKed packets, causing a needless performance shortfall. Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Nandita Dukkipati <nanditad@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-02-22 23:07:11 -05:00
Eric Dumazet	959d10f6bb	igmp: add __ip_mc_{join\|leave}_group() There is a need to perform igmp join/leave operations while RTNL is held. Make ip_mc_{join\|leave}_group() wrappers around __ip_mc_{join\|leave}_group() to avoid the proliferation of work queues. For example, vxlan_igmp_join() could possibly be removed. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-02-20 15:24:04 -05:00
Alexander Drozdov	fba04a9e0c	ipv4: ip_check_defrag should correctly check return value of skb_copy_bits skb_copy_bits() returns zero on success and negative value on error, so it is needed to invert the condition in ip_check_defrag(). Fixes: `1bf3751ec9` ("ipv4: ip_check_defrag must not modify skb before unsharing") Signed-off-by: Alexander Drozdov <al.drozdov@gmail.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-02-20 15:22:38 -05:00

1 2 3 4 5 ...

6498 Commits