linux

Author	SHA1	Message	Date
Tom Herbert	fe881ef11c	gue: Use checksum partial with remote checksum offload Change remote checksum handling to set checksum partial as default behavior. Added an iflink parameter to configure not using checksum partial (calling csum_partial to update checksum). Signed-off-by: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-02-11 15:12:13 -08:00
Tom Herbert	15e2396d4e	net: Infrastructure for CHECKSUM_PARTIAL with remote checsum offload This patch adds infrastructure so that remote checksum offload can set CHECKSUM_PARTIAL instead of calling csum_partial and writing the modfied checksum field. Add skb_remcsum_adjust_partial function to set an skb for using CHECKSUM_PARTIAL with remote checksum offload. Changed skb_remcsum_process and skb_gro_remcsum_process to take a boolean argument to indicate if checksum partial can be set or the checksum needs to be modified using the normal algorithm. Signed-off-by: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-02-11 15:12:12 -08:00
Tom Herbert	6db93ea13b	udp: Set SKB_GSO_UDP_TUNNEL* in UDP GRO path Properly set GSO types and skb->encapsulation in the UDP tunnel GRO complete so that packets are properly represented for GSO. This sets SKB_GSO_UDP_TUNNEL or SKB_GSO_UDP_TUNNEL_CSUM depending on whether non-zero checksums were received, and sets SKB_GSO_TUNNEL_REMCSUM if the remote checksum option was processed. Signed-off-by: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-02-11 15:12:10 -08:00
Tom Herbert	26c4f7da3e	net: Fix remcsum in GRO path to not change packet Remote checksum offload processing is currently the same for both the GRO and non-GRO path. When the remote checksum offload option is encountered, the checksum field referred to is modified in the packet. So in the GRO case, the packet is modified in the GRO path and then the operation is skipped when the packet goes through the normal path based on skb->remcsum_offload. There is a problem in that the packet may be modified in the GRO path, but then forwarded off host still containing the remote checksum option. A remote host will again perform RCO but now the checksum verification will fail since GRO RCO already modified the checksum. To fix this, we ensure that GRO restores a packet to it's original state before returning. In this model, when GRO processes a remote checksum option it still changes the checksum per the algorithm but on return from lower layer processing the checksum is restored to its original value. In this patch we add define gro_remcsum structure which is passed to skb_gro_remcsum_process to save offset and delta for the checksum being changed. After lower layer processing, skb_gro_remcsum_cleanup is called to restore the checksum before returning from GRO. Signed-off-by: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-02-11 15:12:09 -08:00
Paul Moore	04f81f0154	cipso: don't use IPCB() to locate the CIPSO IP option Using the IPCB() macro to get the IPv4 options is convenient, but unfortunately NetLabel often needs to examine the CIPSO option outside of the scope of the IP layer in the stack. While historically IPCB() worked above the IP layer, due to the inclusion of the inet_skb_param struct at the head of the {tcp,udp}_skb_cb structs, recent commit `971f10ec` ("tcp: better TCP_SKB_CB layout to reduce cache line misses") reordered the tcp_skb_cb struct and invalidated this IPCB() trick. This patch fixes the problem by creating a new function, cipso_v4_optptr(), which locates the CIPSO option inside the IP header without calling IPCB(). Unfortunately, this isn't as fast as a simple lookup so some additional tweaks were made to limit the use of this new function. Cc: <stable@vger.kernel.org> # 3.18 Reported-by: Casey Schaufler <casey@schaufler-ca.com> Signed-off-by: Paul Moore <pmoore@redhat.com> Tested-by: Casey Schaufler <casey@schaufler-ca.com>	2015-02-11 14:46:37 -05:00
Fan Du	b0f9ca53cb	ipv4: Namespecify TCP PMTU mechanism Packetization Layer Path MTU Discovery works separately beside Path MTU Discovery at IP level, different net namespace has various requirements on which one to chose, e.g., a virutalized container instance would require TCP PMTU to probe an usable effective mtu for underlying tunnel, while the host would employ classical ICMP based PMTU to function. Hence making TCP PMTU mechanism per net namespace to decouple two functionality. Furthermore the probe base MSS should also be configured separately for each namespace. Signed-off-by: Fan Du <fan.du@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-02-09 18:45:00 -08:00
David S. Miller	2573beec56	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Signed-off-by: David S. Miller <davem@davemloft.net>	2015-02-09 14:35:57 -08:00
Yuchung Cheng	531c94a968	tcp: don't include Fast Open option in SYN-ACK on pure SYN-data If a server has enabled Fast Open and it receives a pure SYN-data packet (without a Fast Open option), it won't accept the data but it incorrectly returns a SYN-ACK with a Fast Open cookie and also increments the SNMP stat LINUX_MIB_TCPFASTOPENPASSIVEFAIL. This patch makes the server include a Fast Open cookie in SYN-ACK only if the SYN has some Fast Open option (i.e., when client requests or presents a cookie). Signed-off-by: Yuchung Cheng <ycheng@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-02-09 14:26:55 -08:00
Eric Dumazet	567e4b7973	net: rfs: add hash collision detection Receive Flow Steering is a nice solution but suffers from hash collisions when a mix of connected and unconnected traffic is received on the host, when flow hash table is populated. Also, clearing flow in inet_release() makes RFS not very good for short lived flows, as many packets can follow close(). (FIN , ACK packets, ...) This patch extends the information stored into global hash table to not only include cpu number, but upper part of the hash value. I use a 32bit value, and dynamically split it in two parts. For host with less than 64 possible cpus, this gives 6 bits for the cpu number, and 26 (32-6) bits for the upper part of the hash. Since hash bucket selection use low order bits of the hash, we have a full hash match, if /proc/sys/net/core/rps_sock_flow_entries is big enough. If the hash found in flow table does not match, we fallback to RPS (if it is enabled for the rxqueue). This means that a packet for an non connected flow can avoid the IPI through a unrelated/victim CPU. This also means we no longer have to clear the table at socket close time, and this helps short lived flows performance. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-02-08 16:53:57 -08:00
Sabrina Dubroca	3e97fa7059	gre/ipip: use be16 variants of netlink functions encap.sport and encap.dport are __be16, use nla_{get,put}_be16 instead of nla_{get,put}_u16. Fixes the sparse warnings: warning: incorrect type in assignment (different base types) expected restricted __be32 [addressable] [usertype] o_key got restricted __be16 [addressable] [usertype] i_flags warning: incorrect type in assignment (different base types) expected restricted __be16 [usertype] sport got unsigned short warning: incorrect type in assignment (different base types) expected restricted __be16 [usertype] dport got unsigned short warning: incorrect type in argument 3 (different base types) expected unsigned short [unsigned] [usertype] value got restricted __be16 [usertype] sport warning: incorrect type in argument 3 (different base types) expected unsigned short [unsigned] [usertype] value got restricted __be16 [usertype] dport Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-02-08 16:28:06 -08:00
Neal Cardwell	4fb17a6091	tcp: mitigate ACK loops for connections as tcp_timewait_sock Ensure that in state FIN_WAIT2 or TIME_WAIT, where the connection is represented by a tcp_timewait_sock, we rate limit dupacks in response to incoming packets (a) with TCP timestamps that fail PAWS checks, or (b) with sequence numbers that are out of the acceptable window. We do not send a dupack in response to out-of-window packets if it has been less than sysctl_tcp_invalid_ratelimit (default 500ms) since we last sent a dupack in response to an out-of-window packet. Reported-by: Avery Fay <avery@mixpanel.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-02-08 01:03:13 -08:00
Neal Cardwell	f2b2c582e8	tcp: mitigate ACK loops for connections as tcp_sock Ensure that in state ESTABLISHED, where the connection is represented by a tcp_sock, we rate limit dupacks in response to incoming packets (a) with TCP timestamps that fail PAWS checks, or (b) with sequence numbers or ACK numbers that are out of the acceptable window. We do not send a dupack in response to out-of-window packets if it has been less than sysctl_tcp_invalid_ratelimit (default 500ms) since we last sent a dupack in response to an out-of-window packet. There is already a similar (although global) rate-limiting mechanism for "challenge ACKs". When deciding whether to send a challence ACK, we first consult the new per-connection rate limit, and then the global rate limit. Reported-by: Avery Fay <avery@mixpanel.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-02-08 01:03:12 -08:00
Neal Cardwell	a9b2c06dbe	tcp: mitigate ACK loops for connections as tcp_request_sock In the SYN_RECV state, where the TCP connection is represented by tcp_request_sock, we now rate-limit SYNACKs in response to a client's retransmitted SYNs: we do not send a SYNACK in response to client SYN if it has been less than sysctl_tcp_invalid_ratelimit (default 500ms) since we last sent a SYNACK in response to a client's retransmitted SYN. This allows the vast majority of legitimate client connections to proceed unimpeded, even for the most aggressive platforms, iOS and MacOS, which actually retransmit SYNs 1-second intervals for several times in a row. They use SYN RTO timeouts following the progression: 1,1,1,1,1,2,4,8,16,32. Reported-by: Avery Fay <avery@mixpanel.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-02-08 01:03:12 -08:00
Neal Cardwell	032ee42369	tcp: helpers to mitigate ACK loops by rate-limiting out-of-window dupacks Helpers for mitigating ACK loops by rate-limiting dupacks sent in response to incoming out-of-window packets. This patch includes: - rate-limiting logic - sysctl to control how often we allow dupacks to out-of-window packets - SNMP counter for cases where we rate-limited our dupack sending The rate-limiting logic in this patch decides to not send dupacks in response to out-of-window segments if (a) they are SYNs or pure ACKs and (b) the remote endpoint is sending them faster than the configured rate limit. We rate-limit our responses rather than blocking them entirely or resetting the connection, because legitimate connections can rely on dupacks in response to some out-of-window segments. For example, zero window probes are typically sent with a sequence number that is below the current window, and ZWPs thus expect to thus elicit a dupack in response. We allow dupacks in response to TCP segments with data, because these may be spurious retransmissions for which the remote endpoint wants to receive DSACKs. This is safe because segments with data can't realistically be part of ACK loops, which by their nature consist of each side sending pure/data-less ACKs to each other. The dupack interval is controlled by a new sysctl knob, tcp_invalid_ratelimit, given in milliseconds, in case an administrator needs to dial this upward in the face of a high-rate DoS attack. The name and units are chosen to be analogous to the existing analogous knob for ICMP, icmp_ratelimit. The default value for tcp_invalid_ratelimit is 500ms, which allows at most one such dupack per 500ms. This is chosen to be 2x faster than the 1-second minimum RTO interval allowed by RFC 6298 (section 2, rule 2.4). We allow the extra 2x factor because network delay variations can cause packets sent at 1 second intervals to be compressed and arrive much closer. Reported-by: Avery Fay <avery@mixpanel.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-02-08 01:03:12 -08:00
David S. Miller	6e03f896b5	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Conflicts: drivers/net/vxlan.c drivers/vhost/net.c include/linux/if_vlan.h net/core/dev.c The net/core/dev.c conflict was the overlap of one commit marking an existing function static whilst another was adding a new function. In the include/linux/if_vlan.h case, the type used for a local variable was changed in 'net', whereas the function got rewritten to fix a stacked vlan bug in 'net-next'. In drivers/vhost/net.c, Al Viro's iov_iter conversions in 'net-next' overlapped with an endainness fix for VHOST 1.0 in 'net'. In drivers/net/vxlan.c, vxlan_find_vni() added a 'flags' parameter in 'net-next' whereas in 'net' there was a bug fix to pass in the correct network namespace pointer in calls to this function. Signed-off-by: David S. Miller <davem@davemloft.net>	2015-02-05 14:33:28 -08:00
David S. Miller	f2683b743f	Merge branch 'for-davem' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs More iov_iter work from Al Viro. Signed-off-by: David S. Miller <davem@davemloft.net>	2015-02-04 20:46:55 -08:00
Eric Dumazet	9878196578	tcp: do not pace pure ack packets When we added pacing to TCP, we decided to let sch_fq take care of actual pacing. All TCP had to do was to compute sk->pacing_rate using simple formula: sk->pacing_rate = 2 * cwnd * mss / rtt It works well for senders (bulk flows), but not very well for receivers or even RPC : cwnd on the receiver can be less than 10, rtt can be around 100ms, so we can end up pacing ACK packets, slowing down the sender. Really, only the sender should pace, according to its own logic. Instead of adding a new bit in skb, or call yet another flow dissection, we tweak skb->truesize to a small value (2), and we instruct sch_fq to use new helper and not pace pure ack. Note this also helps TCP small queue, as ack packets present in qdisc/NIC do not prevent sending a data packet (RPC workload) This helps to reduce tx completion overhead, ack packets can use regular sock_wfree() instead of tcp_wfree() which is a bit more expensive. This has no impact in the case packets are sent to loopback interface, as we do not coalesce ack packets (were we would detect skb->truesize lie) In case netem (with a delay) is used, skb_orphan_partial() also sets skb->truesize to 1. This patch is a combination of two patches we used for about one year at Google. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-02-04 20:36:31 -08:00
Tom Herbert	dcdc899469	net: add skb functions to process remote checksum offload This patch adds skb_remcsum_process and skb_gro_remcsum_process to perform the appropriate adjustments to the skb when receiving remote checksum offload. Updated vxlan and gue to use these functions. Tested: Ran TCP_RR and TCP_STREAM netperf for VXLAN and GUE, did not see any change in performance. Signed-off-by: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-02-04 13:54:07 -08:00
Al Viro	21226abb4e	net: switch memcpy_fromiovec()/memcpy_fromiovecend() users to copy_from_iter() That takes care of the majority of ->sendmsg() instances - most of them via memcpy_to_msg() or assorted getfrag() callbacks. One place where we still keep memcpy_fromiovecend() is tipc - there we potentially read the same data over and over; separate patch, that... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-02-04 01:34:15 -05:00
Al Viro	57be5bdad7	ip: convert tcp_sendmsg() to iov_iter primitives patch is actually smaller than it seems to be - most of it is unindenting the inner loop body in tcp_sendmsg() itself... the bit in tcp_input.c is going to get reverted very soon - that's what memcpy_from_msg() will become, but not in this commit; let's keep it reasonably contained... There's one potentially subtle change here: in case of short copy from userland, mainline tcp_send_syn_data() discards the skb it has allocated and falls back to normal path, where we'll send as much as possible after rereading the same data again. This patch trims SYN+data skb instead - that way we don't need to copy from the same place twice. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-02-04 01:34:14 -05:00
Al Viro	cacdc7d2f9	ip: stash a pointer to msghdr in struct ping_fakehdr ... instead of storing its ->mgs_iter.iov there Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-02-04 01:34:14 -05:00
Al Viro	7ae9abfd9d	ipv4: raw_send_hdrinc(): pass msghdr Switch from passing msg->iov_iter.iov to passing msg itself Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2015-02-04 01:34:13 -05:00
Florian Westphal	843c2fdf7a	net: dctcp: loosen requirement to assert ECT(0) during 3WHS One deployment requirement of DCTCP is to be able to run in a DC setting along with TCP traffic. As Glenn Judd's NSDI'15 paper "Attaining the Promise and Avoiding the Pitfalls of TCP in the Datacenter" [1] (tba) explains, one way to solve this on switch side is to split DCTCP and TCP traffic in two queues per switch port based on the DSCP: one queue soley intended for DCTCP traffic and one for non-DCTCP traffic. For the DCTCP queue, there's the marking threshold K as explained in commit `e3118e8359` ("net: tcp: add DCTCP congestion control algorithm") for RED marking ECT(0) packets with CE. For the non-DCTCP queue, there's f.e. a classic tail drop queue. As already explained in `e3118e8359`, running DCTCP at scale when not marking SYN/SYN-ACK packets with ECT(0) has severe consequences as for non-ECT(0) packets, traversing the RED marking DCTCP queue will result in a severe reduction of connection probability. This is due to the DCTCP queue being dominated by ECT(0) traffic and switches handle non-ECT traffic in the RED marking queue after passing K as drops, where K is usually a low watermark in order to leave enough tailroom for bursts. Splitting DCTCP traffic among several queues (ECN and non-ECN queue) is being considered a terrible idea in the network community as it splits single flows across multiple network paths. Therefore, commit `e3118e8359` implements this on Linux as ECT(0) marked traffic, as we argue that marking all packets of a DCTCP flow is the only viable solution and also doesn't speak against the draft. However, recently, a DCTCP implementation for FreeBSD hit also their mainline kernel [2]. In order to let them play well together with Linux' DCTCP, we would need to loosen the requirement that ECT(0) has to be asserted during the 3WHS as not implemented in FreeBSD. This simplifies the ECN test and lets DCTCP work together with FreeBSD. Joint work with Daniel Borkmann. [1] https://www.usenix.org/conference/nsdi15/technical-sessions/presentation/judd [2] `8ad8794452` Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Cc: Glenn Judd <glenn.judd@morganstanley.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-02-02 18:48:55 -08:00
Willem de Bruijn	49ca0d8bfa	net-timestamp: no-payload option Add timestamping option SOF_TIMESTAMPING_OPT_TSONLY. For transmit timestamps, this loops timestamps on top of empty packets. Doing so reduces the pressure on SO_RCVBUF. Payload inspection and cmsg reception (aside from timestamps) are no longer possible. This works together with a follow on patch that allows administrators to only allow tx timestamping if it does not loop payload or metadata. Signed-off-by: Willem de Bruijn <willemb@google.com> ---- Changes (rfc -> v1) - add documentation - remove unnecessary skb->len test (thanks to Richard Cochran) Signed-off-by: David S. Miller <davem@davemloft.net>	2015-02-02 18:46:51 -08:00
Eric Dumazet	bdbbb8527b	ipv4: tcp: get rid of ugly unicast_sock In commit `be9f4a44e7` ("ipv4: tcp: remove per net tcp_sock") I tried to address contention on a socket lock, but the solution I chose was horrible : commit `3a7c384ffd` ("ipv4: tcp: unicast_sock should not land outside of TCP stack") addressed a selinux regression. commit `0980e56e50` ("ipv4: tcp: set unicast_sock uc_ttl to -1") took care of another regression. commit `b5ec8eeac4` ("ipv4: fix ip_send_skb()") fixed another regression. commit `811230cd85` ("tcp: ipv4: initialize unicast_sock sk_pacing_rate") was another shot in the dark. Really, just use a proper socket per cpu, and remove the skb_orphan() call, to re-enable flow control. This solves a serious problem with FQ packet scheduler when used in hostile environments, as we do not want to allocate a flow structure for every RST packet sent in response to a spoofed packet. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-02-01 23:06:19 -08:00
Eric Dumazet	349c9e3c73	ipv4: icmp: use percpu allocation Get rid of nr_cpu_ids and use modern percpu allocation. Note that the sockets themselves are not yet allocated using NUMA affinity. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-01-31 17:48:18 -08:00
Kenneth Klette Jonassen	932eb7638a	tcp: use SACK RTTs for CC Current behavior only passes RTTs from sequentially acked data to CC. If sender gets a combined ACK for segment 1 and SACK for segment 3, then the computed RTT for CC is the time between sending segment 1 and receiving SACK for segment 3. Pass the minimum computed RTT from any acked data to CC, i.e. time between sending segment 3 and receiving SACK for segment 3. Signed-off-by: Kenneth Klette Jonassen <kennetkl@ifi.uio.no> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-01-31 17:25:37 -08:00
Daniel Borkmann	207895fd38	net: mark some potential candidates __read_mostly They are all either written once or extremly rarely (e.g. from init code), so we can move them to the .data..read_mostly section. Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-01-30 17:58:39 -08:00
Li Wei	3cdaa5be9e	ipv4: Don't increase PMTU with Datagram Too Big message. RFC 1191 said, "a host MUST not increase its estimate of the Path MTU in response to the contents of a Datagram Too Big message." Signed-off-by: Li Wei <lw@cn.fujitsu.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-01-29 15:28:59 -08:00
Eric Dumazet	811230cd85	tcp: ipv4: initialize unicast_sock sk_pacing_rate When I added sk_pacing_rate field, I forgot to initialize its value in the per cpu unicast_sock used in ip_send_unicast_reply() This means that for sch_fq users, RST packets, or ACK packets sent on behalf of TIME_WAIT sockets might be sent to slowly or even dropped once we reach the per flow limit. Signed-off-by: Eric Dumazet <edumazet@google.com> Fixes: `95bd09eb27` ("tcp: TSO packets automatic sizing") Signed-off-by: David S. Miller <davem@davemloft.net>	2015-01-28 23:24:47 -08:00
Jesse Gross	b8693877ae	openvswitch: Add support for checksums on UDP tunnels. Currently, it isn't possible to request checksums on the outer UDP header of tunnels - the TUNNEL_CSUM flag is ignored. This adds support for requesting that UDP checksums be computed on transmit and properly reported if they are present on receive. Signed-off-by: Jesse Gross <jesse@nicira.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-01-28 23:04:15 -08:00
Neal Cardwell	d6b1a8a92a	tcp: fix timing issue in CUBIC slope calculation This patch fixes a bug in CUBIC that causes cwnd to increase slightly too slowly when multiple ACKs arrive in the same jiffy. If cwnd is supposed to increase at a rate of more than once per jiffy, then CUBIC was sometimes too slow. Because the bic_target is calculated for a future point in time, calculated with time in jiffies, the cwnd can increase over the course of the jiffy while the bic_target calculated as the proper CUBIC cwnd at time t=tcp_time_stamp+rtt does not increase, because tcp_time_stamp only increases on jiffy tick boundaries. So since the cnt is set to: ca->cnt = cwnd / (bic_target - cwnd); as cwnd increases but bic_target does not increase due to jiffy granularity, the cnt becomes too large, causing cwnd to increase too slowly. For example: - suppose at the beginning of a jiffy, cwnd=40, bic_target=44 - so CUBIC sets: ca->cnt = cwnd / (bic_target - cwnd) = 40 / (44 - 40) = 40/4 = 10 - suppose we get 10 acks, each for 1 segment, so tcp_cong_avoid_ai() increases cwnd to 41 - so CUBIC sets: ca->cnt = cwnd / (bic_target - cwnd) = 41 / (44 - 41) = 41 / 3 = 13 So now CUBIC will wait for 13 packets to be ACKed before increasing cwnd to 42, insted of 10 as it should. The fix is to avoid adjusting the slope (determined by ca->cnt) multiple times within a jiffy, and instead skip to compute the Reno cwnd, the "TCP friendliness" code path. Reported-by: Eyal Perry <eyalpe@mellanox.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-01-28 22:18:38 -08:00
Neal Cardwell	9cd981dcf1	tcp: fix stretch ACK bugs in CUBIC Change CUBIC to properly handle stretch ACKs in additive increase mode by passing in the count of ACKed packets to tcp_cong_avoid_ai(). In addition, because we are now precisely accounting for stretch ACKs, including delayed ACKs, we can now remove the delayed ACK tracking and estimation code that tracked recent delayed ACK behavior in ca->delayed_ack. Reported-by: Eyal Perry <eyalpe@mellanox.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-01-28 22:18:38 -08:00
Neal Cardwell	c22bdca947	tcp: fix stretch ACK bugs in Reno Change Reno to properly handle stretch ACKs in additive increase mode by passing in the count of ACKed packets to tcp_cong_avoid_ai(). In addition, if snd_cwnd crosses snd_ssthresh during slow start processing, and we then exit slow start mode, we need to carry over any remaining "credit" for packets ACKed and apply that to additive increase by passing this remaining "acked" count to tcp_cong_avoid_ai(). Reported-by: Eyal Perry <eyalpe@mellanox.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-01-28 22:18:38 -08:00
Neal Cardwell	814d488c61	tcp: fix the timid additive increase on stretch ACKs tcp_cong_avoid_ai() was too timid (snd_cwnd increased too slowly) on "stretch ACKs" -- cases where the receiver ACKed more than 1 packet in a single ACK. For example, suppose w is 10 and we get a stretch ACK for 20 packets, so acked is 20. We ought to increase snd_cwnd by 2 (since acked/w = 20/10 = 2), but instead we were only increasing cwnd by 1. This patch fixes that behavior. Reported-by: Eyal Perry <eyalpe@mellanox.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-01-28 22:18:37 -08:00
Neal Cardwell	e73ebb0881	tcp: stretch ACK fixes prep LRO, GRO, delayed ACKs, and middleboxes can cause "stretch ACKs" that cover more than the RFC-specified maximum of 2 packets. These stretch ACKs can cause serious performance shortfalls in common congestion control algorithms that were designed and tuned years ago with receiver hosts that were not using LRO or GRO, and were instead politely ACKing every other packet. This patch series fixes Reno and CUBIC to handle stretch ACKs. This patch prepares for the upcoming stretch ACK bug fix patches. It adds an "acked" parameter to tcp_cong_avoid_ai() to allow for future fixes to tcp_cong_avoid_ai() to correctly handle stretch ACKs, and changes all congestion control algorithms to pass in 1 for the ACKed count. It also changes tcp_slow_start() to return the number of packet ACK "credits" that were not processed in slow start mode, and can be processed by the congestion control module in additive increase mode. In future patches we will fix tcp_cong_avoid_ai() to handle stretch ACKs, and fix Reno and CUBIC handling of stretch ACKs in slow start and additive increase mode. Reported-by: Eyal Perry <eyalpe@mellanox.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-01-28 22:18:37 -08:00
David S. Miller	95f873f2ff	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Conflicts: arch/arm/boot/dts/imx6sx-sdb.dts net/sched/cls_bpf.c Two simple sets of overlapping changes. Signed-off-by: David S. Miller <davem@davemloft.net>	2015-01-27 16:59:56 -08:00
subashab@codeaurora.org	fc752f1f43	ping: Fix race in free in receive path An exception is seen in ICMP ping receive path where the skb destructor sock_rfree() tries to access a freed socket. This happens because ping_rcv() releases socket reference with sock_put() and this internally frees up the socket. Later icmp_rcv() will try to free the skb and as part of this, skb destructor is called and which leads to a kernel panic as the socket is freed already in ping_rcv(). -->\|exception -007\|sk_mem_uncharge -007\|sock_rfree -008\|skb_release_head_state -009\|skb_release_all -009\|__kfree_skb -010\|kfree_skb -011\|icmp_rcv -012\|ip_local_deliver_finish Fix this incorrect free by cloning this skb and processing this cloned skb instead. This patch was suggested by Eric Dumazet Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org> Cc: Eric Dumazet <edumazet@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-01-27 00:04:53 -08:00
Herbert Xu	86f3cddbc3	udp_diag: Fix socket skipping within chain While working on rhashtable walking I noticed that the UDP diag dumping code is buggy. In particular, the socket skipping within a chain never happens, even though we record the number of sockets that should be skipped. As this code was supposedly copied from TCP, this patch does what TCP does and resets num before we walk a chain. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Acked-by: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-01-27 00:02:41 -08:00
Hannes Frederic Sowa	df4d92549f	ipv4: try to cache dst_entries which would cause a redirect Not caching dst_entries which cause redirects could be exploited by hosts on the same subnet, causing a severe DoS attack. This effect aggravated since commit `f886497212` ("ipv4: fix dst race in sk_dst_get()"). Lookups causing redirects will be allocated with DST_NOCACHE set which will force dst_release to free them via RCU. Unfortunately waiting for RCU grace period just takes too long, we can end up with >1M dst_entries waiting to be released and the system will run OOM. rcuos threads cannot catch up under high softirq load. Attaching the flag to emit a redirect later on to the specific skb allows us to cache those dst_entries thus reducing the pressure on allocation and deallocation. This issue was discovered by Marcelo Leitner. Cc: Julian Anastasov <ja@ssi.bg> Signed-off-by: Marcelo Leitner <mleitner@redhat.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-01-26 17:28:27 -08:00
Alexander Duyck	64c6272351	fib_trie: Various clean-ups for handling slen While doing further work on the fib_trie I noted a few items. First I was using calls that were far more complicated than they needed to be for determining when to push/pull the suffix length. I have updated the code to reflect the simplier logic. The second issue is that I realised we weren't necessarily handling the case of a leaf_info struct surviving a flush. I have updated the logic so that now we will call pull_suffix in the event of having a leaf info value left in the leaf after flushing it. Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-01-25 14:47:16 -08:00
Alexander Duyck	02525368f4	fib_trie: Move fib_find_alias to file where it is used The function fib_find_alias is only accessed by functions in fib_trie.c as such it makes sense to relocate it and cast it as static so that the compiler can take advantage of optimizations it can do to it as a local function. Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-01-25 14:47:16 -08:00
Alexander Duyck	30cfe7c9c8	fib_trie: Use empty_children instead of counting empty nodes in stats collection It doesn't make much sense to count the pointers ourselves when empty_children already has a count for the number of NULL pointers stored in the tnode. As such save ourselves the cycles and just use empty_children. Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-01-25 14:47:16 -08:00
Alexander Duyck	95f60ea3e9	fib_trie: Add collapse() and should_collapse() to resize This patch really does two things. First it pulls the logic for determining if we should collapse one node out of the tree and the actual code doing the collapse into a separate pair of functions. This helps to make the changes to these areas more readable. Second it encodes the upper 32b of the empty_children value onto the full_children value in the case of bits == KEYLENGTH. By doing this we are able to handle the case of a 32b node where empty_children would appear to be 0 when it was actually 1ul << 32. Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-01-25 14:47:16 -08:00
Alexander Duyck	a80e89d4c6	fib_trie: Fall back to slen update on inflate/halve failure This change corrects an issue where if inflate or halve fails we were exiting the resize function without at least updating the slen for the node. To correct this I have moved the update of max_size into the while loop so that it is only decremented on a successful call to either inflate or halve. Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-01-25 14:47:16 -08:00
Alexander Duyck	69fa57b1e4	fib_trie: Fix RCU bug and merge similar bits of inflate/halve This patch addresses two issues. The first issue is the fact that I believe I had the RCU freeing sequence slightly out of order. As a result we could get into an issue if a caller went into a child of a child of the new node, then backtraced into the to be freed parent, and then attempted to access a child of a child that may have been consumed in a resize of one of the new nodes children. To resolve this I have moved the resize after we have freed the oldtnode. The only side effect of this is that we will now be calling resize on more nodes in the case of inflate due to the fact that we don't have a good way to test to see if a full_tnode on the new node was there before or after the allocation. This should have minimal impact however since the node should already be correctly size so it is just the cost of calling should_inflate that we will be taking on the node which is only a couple of cycles. The second issue is the fact that inflate and halve were essentially doing the same thing after the new node was added to the trie replacing the old one. As such it wasn't really necessary to keep the code in both functions so I have split it out into two other functions, called replace and update_children. Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-01-25 14:47:15 -08:00
Alexander Duyck	b3832117b4	fib_trie: Use index & (~0ul << n->bits) instead of index >> n->bits In doing performance testing and analysis of the changes I recently found that by shifting the index I had created an unnecessary dependency. I have updated the code so that we instead shift a mask by bits and then just test against that as that should save us about 2 CPU cycles since we can generate the mask while the key and pos are being processed. Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-01-25 14:47:15 -08:00
Tom Herbert	d998f8efa4	udp: Do not require sock in udp_tunnel_xmit_skb The UDP tunnel transmit functions udp_tunnel_xmit_skb and udp_tunnel6_xmit_skb include a socket argument. The socket being passed to the functions (from VXLAN) is a UDP created for receive side. The only thing that the socket is used for in the transmit functions is to get the setting for checksum (enabled or zero). This patch removes the argument and and adds a nocheck argument for checksum setting. This eliminates the unnecessary dependency on a UDP socket for UDP tunnel transmit. Signed-off-by: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-01-24 23:15:40 -08:00
Florian Fainelli	728c02089a	net: ipv4: handle DSA enabled master network devices The logic to configure a network interface for kernel IP auto-configuration is very simplistic, and does not handle the case where a device is stacked onto another such as with DSA. This causes the kernel not to open and configure the master network device in a DSA switch tree, and therefore slave network devices using this master network devices as conduit device cannot be open. This restriction comes from a check in net/dsa/slave.c, which is basically checking the master netdev flags for IFF_UP and returns -ENETDOWN if it is not the case. Automatically bringing-up DSA master network devices allows DSA slave network devices to be used as valid interfaces for e.g: NFS root booting by allowing kernel IP autoconfiguration to succeed on these interfaces. On the reverse path, make sure we do not attempt to close a DSA-enabled device as this would implicitely prevent the slave DSA network device from operating. Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-01-19 15:45:10 -05:00
Nicolas Dichtel	1728d4fabd	tunnels: advertise link netns via netlink Implement rtnl_link_ops->get_link_net() callback so that IFLA_LINK_NETNSID is added to rtnetlink messages. Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-01-19 14:32:03 -05:00
David S. Miller	7b46a644a4	netlink: Fix bugs in nlmsg_end() conversions. Commit `053c095a82` ("netlink: make nlmsg_end() and genlmsg_end() void") didn't catch all of the cases where callers were breaking out on the return value being equal to zero, which they no longer should when zero means success. Fix all such cases. Reported-by: Marcel Holtmann <marcel@holtmann.org> Reported-by: Scott Feldman <sfeldma@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-01-18 23:36:08 -05:00
Johannes Berg	053c095a82	netlink: make nlmsg_end() and genlmsg_end() void Contrary to common expectations for an "int" return, these functions return only a positive value -- if used correctly they cannot even return 0 because the message header will necessarily be in the skb. This makes the very common pattern of if (genlmsg_end(...) < 0) { ... } be a whole bunch of dead code. Many places also simply do return nlmsg_end(...); and the caller is expected to deal with it. This also commonly (at least for me) causes errors, because it is very common to write if (my_function(...)) /* error condition */ and if my_function() does "return nlmsg_end()" this is of course wrong. Additionally, there's not a single place in the kernel that actually needs the message length returned, and if anyone needs it later then it'll be very easy to just use skb->len there. Remove this, and make the functions void. This removes a bunch of dead code as described above. The patch adds lines because I did - return nlmsg_end(...); + nlmsg_end(...); + return 0; I could have preserved all the function's return values by returning skb->len, but instead I've audited all the places calling the affected functions and found that none cared. A few places actually compared the return value with <= 0 in dump functionality, but that could just be changed to < 0 with no change in behaviour, so I opted for the more efficient version. One instance of the error I've made numerous times now is also present in net/phonet/pn_netlink.c in the route_dumpit() function - it didn't check for <0 or <=0 and thus broke out of the loop every single time. I've preserved this since it will (I think) have caused the messages to userspace to be formatted differently with just a single message for every SKB returned to userspace. It's possible that this isn't needed for the tools that actually use this, but I don't even know what they are so couldn't test that changing this behaviour would be acceptable. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-01-18 01:03:45 -05:00
Willem de Bruijn	f812116b17	ip: zero sockaddr returned on error queue The sockaddr is returned in IP(V6)_RECVERR as part of errhdr. That structure is defined and allocated on the stack as struct { struct sock_extended_err ee; struct sockaddr_in(6) offender; } errhdr; The second part is only initialized for certain SO_EE_ORIGIN values. Always initialize it completely. An MTU exceeded error on a SOCK_RAW/IPPROTO_RAW is one example that would return uninitialized bytes. Signed-off-by: Willem de Bruijn <willemb@google.com> ---- Also verified that there is no padding between errhdr.ee and errhdr.offender that could leak additional kernel data. Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-01-15 19:41:16 -05:00
Eric Dumazet	5055c371bf	ipv4: per cpu uncached list RAW sockets with hdrinc suffer from contention on rt_uncached_lock spinlock. One solution is to use percpu lists, since most routes are destroyed by the cpu that created them. It is unclear why we even have to put these routes in uncached_list, as all outgoing packets should be freed when a device is dismantled. Signed-off-by: Eric Dumazet <edumazet@google.com> Fixes: `caacf05e5a` ("ipv4: Properly purge netdev references on uncached routes.") Signed-off-by: David S. Miller <davem@davemloft.net>	2015-01-15 18:26:16 -05:00
David S. Miller	3f3558bb51	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Conflicts: drivers/net/xen-netfront.c Minor overlapping changes in xen-netfront.c, mostly to do with some buffer management changes alongside the split of stats into TX and RX. Signed-off-by: David S. Miller <davem@davemloft.net>	2015-01-15 00:53:17 -05:00
Tom Herbert	a2b12f3c7a	udp: pass udp_offload struct to UDP gro callbacks This patch introduces udp_offload_callbacks which has the same GRO functions (but not a GSO function) as offload_callbacks, except there is an argument to a udp_offload struct passed to gro_receive and gro_complete functions. This additional argument can be used to retrieve the per port structure of the encapsulation for use in gro processing (mostly by doing container_of on the structure). Signed-off-by: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-01-14 15:20:04 -05:00
Jiri Pirko	df8a39defa	net: rename vlan_tx_* helpers since "tx" is misleading there The same macros are used for rx as well. So rename it. Signed-off-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-01-13 17:51:08 -05:00
Sébastien Barré	08abdffa1c	tcp: avoid reducing cwnd when ACK+DSACK is received With TLP, the peer may reply to a probe with an ACK+D-SACK, with ack value set to tlp_high_seq. In the current code, such ACK+DSACK will be missed and only at next, higher ack will the TLP episode be considered done. Since the DSACK is not present anymore, this will cost a cwnd reduction. This patch ensures that this scenario does not cause a cwnd reduction, since receiving an ACK+DSACK indicates that both the initial segment and the probe have been received by the peer. The following packetdrill test, from Neal Cardwell, validates this patch: // Establish a connection. 0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3 +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0 +0 bind(3, ..., ...) = 0 +0 listen(3, 1) = 0 +0 < S 0:0(0) win 32792 <mss 1000,sackOK,nop,nop,nop,wscale 7> +0 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 6> +.020 < . 1:1(0) ack 1 win 257 +0 accept(3, ..., ...) = 4 // Send 1 packet. +0 write(4, ..., 1000) = 1000 +0 > P. 1:1001(1000) ack 1 // Loss probe retransmission. // packets_out == 1 => schedule PTO in max(2RTT, 1.5RTT + 200ms) // In this case, this means: 1.5*RTT + 200ms = 230ms +.230 > P. 1:1001(1000) ack 1 +0 %{ assert tcpi_snd_cwnd == 10 }% // Receiver ACKs at tlp_high_seq with a DSACK, // indicating they received the original packet and probe. +.020 < . 1:1(0) ack 1001 win 257 <sack 1:1001,nop,nop> +0 %{ assert tcpi_snd_cwnd == 10 }% // Send another packet. +0 write(4, ..., 1000) = 1000 +0 > P. 1001:2001(1000) ack 1 // Receiver ACKs above tlp_high_seq, which should end the TLP episode // if we haven't already. We should not reduce cwnd. +.020 < . 1:1(0) ack 2001 win 257 +0 %{ assert tcpi_snd_cwnd == 10, tcpi_snd_cwnd }% Credits: -Gregory helped in finding that tcp_process_tlp_ack was where the cwnd got reduced in our MPTCP tests. -Neal wrote the packetdrill test above -Yuchung reworked the patch to make it more readable. Cc: Gregory Detal <gregory.detal@uclouvain.be> Cc: Nandita Dukkipati <nanditad@google.com> Tested-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Yuchung Cheng <ycheng@google.com> Reviewed-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Sébastien Barré <sebastien.barre@uclouvain.be> Acked-by: Eric Dumazet <edumazet@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-01-13 14:22:02 -05:00
David S. Miller	2bd8221804	Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf Pablo Neira Ayuso says: ==================== netfilter/ipvs fixes for net The following patchset contains netfilter/ipvs fixes, they are: 1) Small fix for the FTP helper in IPVS, a diff variable may be left unset when CONFIG_IP_VS_IPV6 is set. Patch from Dan Carpenter. 2) Fix nf_tables port NAT in little endian archs, patch from leroy christophe. 3) Fix race condition between conntrack confirmation and flush from userspace. This is the second reincarnation to resolve this problem. 4) Make sure inner messages in the batch come with the nfnetlink header. 5) Relax strict check from nfnetlink_bind() that may break old userspace applications using all 1s group mask. 6) Schedule removal of chains once no sets and rules refer to them in the new nf_tables ruleset flush command. Reported by Asbjoern Sloth Toennesen. Note that this batch comes later than usual because of the short winter holidays. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2015-01-12 00:14:49 -05:00
David S. Miller	44d84d7272	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net	2015-01-06 22:29:20 -05:00
Daniel Borkmann	81164413ad	net: tcp: add per route congestion control This work adds the possibility to define a per route/destination congestion control algorithm. Generally, this opens up the possibility for a machine with different links to enforce specific congestion control algorithms with optimal strategies for each of them based on their network characteristics, even transparently for a single application listening on all links. For our specific use case, this additionally facilitates deployment of DCTCP, for example, applications can easily serve internal traffic/dsts in DCTCP and external one with CUBIC. Other scenarios would also allow for utilizing e.g. long living, low priority background flows for certain destinations/routes while still being able for normal traffic to utilize the default congestion control algorithm. We also thought about a per netns setting (where different defaults are possible), but given its actually a link specific property, we argue that a per route/destination setting is the most natural and flexible. The administrator can utilize this through ip-route(8) by appending "congctl [lock] <name>", where <name> denotes the name of a congestion control algorithm and the optional lock parameter allows to enforce the given algorithm so that applications in user space would not be allowed to overwrite that algorithm for that destination. The dst metric lookups are being done when a dst entry is already available in order to avoid a costly lookup and still before the algorithms are being initialized, thus overhead is very low when the feature is not being used. While the client side would need to drop the current reference on the module, on server side this can actually even be avoided as we just got a flat-copied socket clone. Joint work with Florian Westphal. Suggested-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-01-05 22:55:24 -05:00
Daniel Borkmann	ea69763999	net: tcp: add RTAX_CC_ALGO fib handling This patch adds the minimum necessary for the RTAX_CC_ALGO congestion control metric to be set up and dumped back to user space. While the internal representation of RTAX_CC_ALGO is handled as a u32 key, we avoided to expose this implementation detail to user space, thus instead, we chose the netlink attribute that is being exchanged between user space to be the actual congestion control algorithm name, similarly as in the setsockopt(2) API in order to allow for maximum flexibility, even for 3rd party modules. It is a bit unfortunate that RTAX_QUICKACK used up a whole RTAX slot as it should have been stored in RTAX_FEATURES instead, we first thought about reusing it for the congestion control key, but it brings more complications and/or confusion than worth it. Joint work with Florian Westphal. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-01-05 22:55:24 -05:00
Daniel Borkmann	c5c6a8ab45	net: tcp: add key management to congestion control This patch adds necessary infrastructure to the congestion control framework for later per route congestion control support. For a per route congestion control possibility, our aim is to store a unique u32 key identifier into dst metrics, which can then be mapped into a tcp_congestion_ops struct. We argue that having a RTAX key entry is the most simple, generic and easy way to manage, and also keeps the memory footprint of dst entries lower on 64 bit than with storing a pointer directly, for example. Having a unique key id also allows for decoupling actual TCP congestion control module management from the FIB layer, i.e. we don't have to care about expensive module refcounting inside the FIB at this point. We first thought of using an IDR store for the realization, which takes over dynamic assignment of unused key space and also performs the key to pointer mapping in RCU. While doing so, we stumbled upon the issue that due to the nature of dynamic key distribution, it just so happens, arguably in very rare occasions, that excessive module loads and unloads can lead to a possible reuse of previously used key space. Thus, previously stale keys in the dst metric are now being reassigned to a different congestion control algorithm, which might lead to unexpected behaviour. One way to resolve this would have been to walk FIBs on the actually rare occasion of a module unload and reset the metric keys for each FIB in each netns, but that's just very costly. Therefore, we argue a better solution is to reuse the unique congestion control algorithm name member and map that into u32 key space through jhash. For that, we split the flags attribute (as it currently uses 2 bits only anyway) into two u32 attributes, flags and key, so that we can keep the cacheline boundary of 2 cachelines on x86_64 and cache the precalculated key at registration time for the fast path. On average we might expect 2 - 4 modules being loaded worst case perhaps 15, so a key collision possibility is extremely low, and guaranteed collision-free on LE/BE for all in-tree modules. Overall this results in much simpler code, and all without the overhead of an IDR. Due to the deterministic nature, modules can now be unloaded, the congestion control algorithm for a specific but unloaded key will fall back to the default one, and on module reload time it will switch back to the expected algorithm transparently. Joint work with Florian Westphal. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-01-05 22:55:24 -05:00
Daniel Borkmann	29ba4fffd3	net: tcp: refactor reinitialization of congestion control We can just move this to an extra function and make the code a bit more readable, no functional change. Joint work with Florian Westphal. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-01-05 22:55:24 -05:00
Tom Herbert	ad6f939ab1	ip: Add offset parameter to ip_cmsg_recv Add ip_cmsg_recv_offset function which takes an offset argument that indicates the starting offset in skb where data is being received from. This will be useful in the case of UDP and provided checksum to user space. ip_cmsg_recv is an inline call to ip_cmsg_recv_offset with offset of zero. Signed-off-by: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-01-05 22:44:46 -05:00
Tom Herbert	5961de9f19	ip: Add offset parameter to ip_cmsg_recv Add ip_cmsg_recv_offset function which takes an offset argument that indicates the starting offset in skb where data is being received from. This will be useful in the case of UDP and provided checksum to user space. ip_cmsg_recv is an inline call to ip_cmsg_recv_offset with offset of zero. Signed-off-by: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-01-05 22:44:46 -05:00
Tom Herbert	c44d13d6f3	ip: IP cmsg cleanup Move the IP_CMSG_* constants from ip_sockglue.c to inet_sock.h so that they can be referenced in other source files. Restructure ip_cmsg_recv to not go through flags using shift, check for flags by 'and'. This eliminates both the shift and a conditional per flag check. Signed-off-by: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-01-05 22:44:46 -05:00
Tom Herbert	224d019c4f	ip: Move checksum convert defines to inet Move convert_csum from udp_sock to inet_sock. This allows the possibility that we can use convert checksum for different types of sockets and also allows convert checksum to be enabled from inet layer (what we'll want to do when enabling IP_CHECKSUM cmsg). Signed-off-by: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-01-05 22:44:46 -05:00
Jesse Gross	46b1e4f911	geneve: Check family when reusing sockets. When searching for an existing socket to reuse, the address family is not taken into account - only port number. This means that an IPv4 socket could be used for IPv6 traffic and vice versa, which is sure to cause problems when passing packets. It is not possible to trigger this problem currently because the only user of Geneve creates just IPv4 sockets. However, that is likely to change in the near future. Signed-off-by: Jesse Gross <jesse@nicira.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-01-04 22:21:33 -05:00
Jesse Gross	df5dba8e52	geneve: Remove socket hash table. The hash table for open Geneve ports is used only on creation and deletion time. It is not performance critical and is not likely to grow to a large number of items. Therefore, this can be changed to use a simple linked list. Signed-off-by: Jesse Gross <jesse@nicira.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-01-04 22:21:33 -05:00
Jesse Gross	829a3ada9c	geneve: Simplify locking. The existing Geneve locking scheme was pulled over directly from VXLAN. However, VXLAN has a number of built in mechanisms which make the locking more complex and are unlikely to be necessary with Geneve. This simplifies the locking to use a basic scheme of a mutex when doing updates plus RCU on receive. In addition to making the code easier to read, this also avoids the possibility of a race when creating or destroying sockets since UDP sockets and the list of Geneve sockets are protected by different locks. After this change, the entire operation is atomic. Signed-off-by: Jesse Gross <jesse@nicira.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-01-04 22:21:33 -05:00
Jesse Gross	61f3cade76	geneve: Remove workqueue. The work queue is used only to free the UDP socket upon destruction. This is not necessary with Geneve and generally makes the code more difficult to reason about. It also introduces nondeterministic behavior such as when a socket is rapidly deleted and recreated, which could fail as the the deletion happens asynchronously. Signed-off-by: Jesse Gross <jesse@nicira.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-01-04 22:21:33 -05:00
Herbert Xu	843925f33f	tcp: Do not apply TSO segment limit to non-TSO packets Thomas Jarosch reported IPsec TCP stalls when a PMTU event occurs. In fact the problem was completely unrelated to IPsec. The bug is also reproducible if you just disable TSO/GSO. The problem is that when the MSS goes down, existing queued packet on the TX queue that have not been transmitted yet all look like TSO packets and get treated as such. This then triggers a bug where tcp_mss_split_point tells us to generate a zero-sized packet on the TX queue. Once that happens we're screwed because the zero-sized packet can never be removed by ACKs. Fixes: `1485348d24` ("tcp: Apply device TSO segment limit earlier") Reported-by: Thomas Jarosch <thomas.jarosch@intra2net.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Cheers, Signed-off-by: David S. Miller <davem@davemloft.net>	2015-01-02 16:13:20 -05:00
Joe Stringer	a4c9ea5e8f	geneve: Add Geneve GRO support This results in an approximately 30% increase in throughput when handling encapsulated bulk traffic. Signed-off-by: Joe Stringer <joestringer@nicira.com> Signed-off-by: Jesse Gross <jesse@nicira.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-01-02 15:46:41 -05:00
Alexander Duyck	5405afd1a3	fib_trie: Add tracking value for suffix length This change adds a tracking value for the maximum suffix length of all prefixes stored in any given tnode. With this value we can determine if we need to backtrace or not based on if the suffix is greater than the pos value. By doing this we can reduce the CPU overhead for lookups in the local table as many of the prefixes there are 32b long and have a suffix length of 0 meaning we can immediately backtrace to the root node without needing to test any of the nodes between it and where we ended up. Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2014-12-31 18:25:55 -05:00
Alexander Duyck	21d1f11db0	fib_trie: Remove checks for index >= tnode_child_length from tnode_get_child For some reason the compiler doesn't seem to understand that when we are in a loop that runs from tnode_child_length - 1 to 0 we don't expect the value of tn->bits to change. As such every call to tnode_get_child was rerunning tnode_chile_length which ended up consuming quite a bit of space in the resultant assembly code. I have gone though and verified that in all cases where tnode_get_child is used we are either winding though a fixed loop from tnode_child_length - 1 to 0, or are in a fastpath case where we are verifying the value by either checking for any remaining bits after shifting index by bits and testing for leaf, or by using tnode_child_length. size net/ipv4/fib_trie.o Before: text data bss dec hex filename 15506 376 8 15890 3e12 net/ipv4/fib_trie.o After: text data bss dec hex filename 14827 376 8 15211 3b6b net/ipv4/fib_trie.o Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2014-12-31 18:25:55 -05:00
Alexander Duyck	12c081a5c8	fib_trie: inflate/halve nodes in a more RCU friendly way This change pulls the node_set_parent functionality out of put_child_reorg and instead leaves that to the function to take care of as well. By doing this we can fully construct the new cluster of tnodes and all of the pointers out of it before we start routing pointers into it. I am suspecting this will likely fix some concurency issues though I don't have a good test to show as such. Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2014-12-31 18:25:55 -05:00
Alexander Duyck	fc86a93b46	fib_trie: Push tnode flushing down to inflate/halve This change pushes the tnode freeing down into the inflate and halve functions. It makes more sense here as we have a better grasp of what is going on and when a given cluster of nodes is ready to be freed. I believe this may address a bug in the freeing logic as well. For some reason if the freelist got to a certain size we would call synchronize_rcu(). I'm assuming that what they meant to do is call synchronize_rcu() after they had handed off that much memory via call_rcu(). As such that is what I have updated the behavior to be. Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2014-12-31 18:25:55 -05:00
Alexander Duyck	ff181ed876	fib_trie: Push assignment of child to parent down into inflate/halve This change makes it so that the assignment of the tnode to the parent is handled directly within whatever function is currently handling the node be it inflate, halve, or resize. By doing this we can avoid some of the need to set NULL pointers in the tree while we are resizing the subnodes. Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2014-12-31 18:25:55 -05:00
Alexander Duyck	f05a48198b	fib_trie: Add functions should_inflate and should_halve This change pulls the logic for if we should inflate/halve the nodes out into separate functions. It also addresses what I believe is a bug where 1 full node is all that is needed to keep a node from ever being halved. Simple script to reproduce the issue: modprobe dummy; ifconfig dummy0 up for i in `seq 0 255`; do ifconfig dummy0:$i 10.0.${i}.1/24 up; done ifconfig dummy0:256 10.0.255.33/16 up for i in `seq 0 254`; do ifconfig dummy0:$i down; done Results from /proc/net/fib_triestat Before: Local: Aver depth: 3.00 Max depth: 4 Leaves: 17 Prefixes: 18 Internal nodes: 11 1: 8 2: 2 10: 1 Pointers: 1048 Null ptrs: 1021 Total size: 11 kB After: Local: Aver depth: 3.41 Max depth: 5 Leaves: 17 Prefixes: 18 Internal nodes: 12 1: 8 2: 3 3: 1 Pointers: 36 Null ptrs: 8 Total size: 3 kB Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2014-12-31 18:25:54 -05:00
Alexander Duyck	cf3637bb8f	fib_trie: Move resize to after inflate/halve This change consists of a cut/paste of resize to behind inflate and halve so that I could remove the two function prototypes. Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2014-12-31 18:25:54 -05:00
Alexander Duyck	345e9b5426	fib_trie: Push rcu_read_lock/unlock to callers This change is to start cleaning up some of the rcu_read_lock/unlock handling. I realized while reviewing the code there are several spots that I don't believe are being handled correctly or are masking warnings by locally calling rcu_read_lock/unlock instead of calling them at the correct level. A common example is a call to fib_get_table followed by fib_table_lookup. The rcu_read_lock/unlock ought to wrap both but there are several spots where they were not wrapped. Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2014-12-31 18:25:54 -05:00
Alexander Duyck	98293e8d2f	fib_trie: Use unsigned long for anything dealing with a shift by bits This change makes it so that anything that can be shifted by, or compared to a value shifted by bits is updated to be an unsigned long. This is mostly a precaution against an insanely huge address space that somehow starts coming close to the 2^32 root node size which would require something like 1.5 billion addresses. I chose unsigned long instead of unsigned long long since I do not believe it is possible to allocate a 32 bit tnode on a 32 bit system as the memory consumed would be 16GB + 28B which exceeds the addressible space for any one process. Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2014-12-31 18:25:54 -05:00
Alexander Duyck	e9b44019d4	fib_trie: Update meaning of pos to represent unchecked bits This change moves the pos value to the other side of the "bits" field. By doing this it actually simplifies a significant amount of code in the trie. For example when halving a tree we know that the bit lost exists at oldnode->pos, and if we inflate the tree the new bit being add is at tn->pos. Previously to find those bits you would have to subtract pos and bits from the keylength or start with a value of (1 << 31) and then shift that. There are a number of spots throughout the code that benefit from this. In the case of the hot-path searches the main advantage is that we can drop 2 or more operations from the search path as we no longer need to compute the value for the index to be shifted by and can instead just use the raw pos value. In addition the tkey_extract_bits is now defunct and can be replaced by get_index since the two operations were doing the same thing, but now get_index does it much more quickly as it is only an xor and shift versus a pair of shifts and a subtraction. Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2014-12-31 18:25:54 -05:00
Alexander Duyck	836a0123c9	fib_trie: Optimize fib_table_insert This patch updates the fib_table_insert function to take advantage of the changes made to improve the performance of fib_table_lookup. As a result the code should be smaller and run faster then the original. Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2014-12-31 18:25:54 -05:00
Alexander Duyck	939afb0657	fib_trie: Optimize fib_find_node This patch makes use of the same features I made use of for fib_table_lookup to streamline fib_find_node. The resultant code should be smaller and run faster than the original. Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2014-12-31 18:25:54 -05:00
Alexander Duyck	9f9e636d4f	fib_trie: Optimize fib_table_lookup to avoid wasting time on loops/variables This patch is meant to reduce the complexity of fib_table_lookup by reducing the number of variables to the bare minimum while still keeping the same if not improved functionality versus the original. Most of this change was started off by the desire to rid the function of chopped_off and current_prefix_length as they actually added very little to the function since they only applied when computing the cindex. I was able to replace them mostly with just a check for the prefix match. As long as the prefix between the key and the node being tested was the same we know we can search the tnode fully versus just testing cindex 0. The second portion of the change ended up being a massive reordering. Originally the calls to check_leaf were up near the start of the loop, and the backtracing and descending into lower levels of tnodes was later. This didn't make much sense as the structure of the tree means the leaves are always the last thing to be tested. As such I reordered things so that we instead have a loop that will delve into the tree and only exit when we have either found a leaf or we have exhausted the tree. The advantage of rearranging things like this is that we can fully inline check_leaf since there is now only one reference to it in the function. Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2014-12-31 18:25:54 -05:00
Alexander Duyck	adaf981685	fib_trie: Merge leaf into tnode This change makes it so that leaf and tnode are the same struct. As a result there is no need for rt_trie_node anymore since everyting can be merged into tnode. On 32b systems this results in the leaf being 4 bytes larger, however I don't know if that is really an issue as this and an eariler patch that added bits & pos have increased the size from 20 to 28. If I am not mistaken slub/slab allocate on power of 2 sizes so 20 was likely being rounded up to 32 anyway. Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2014-12-31 18:25:54 -05:00
Alexander Duyck	37fd30f2da	fib_trie: Merge tnode_free and leaf_free into node_free Both the leaf and the tnode had an rcu_head in them, but they had them in slightly different places. Since we now have them in the same spot and know that any node with bits == 0 is a leaf and the rest are either vmalloc or kmalloc tnodes depending on the value of bits it makes it easy to combine the functions and reduce overhead. In addition I have taken advantage of the rcu_head pointer to go ahead and put together a simple linked list instead of using the tnode pointer as this way we can merge either type of structure for freeing. Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2014-12-31 18:25:54 -05:00
Alexander Duyck	64c9b6fb26	fib_trie: Make leaf and tnode more uniform This change makes some fundamental changes to the way leaves and tnodes are constructed. The big differences are: 1. Leaves now populate pos and bits indicating their full key size. 2. Trie nodes now mask out their lower bits to be consistent with the leaf 3. Both structures have been reordered so that rt_trie_node now consisists of a much larger region including the pos, bits, and rcu portions of the tnode structure. On 32b systems this will result in the leaf being 4B larger as the pos and bits values were added to a hole created by the key as it was only 4B in length. Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2014-12-31 18:25:54 -05:00
Alexander Duyck	8274a97aa4	fib_trie: Update usage stats to be percpu instead of global variables The trie usage stats were currently being shared by all threads that were calling fib_table_lookup. As a result when multiple threads were performing lookups simultaneously the trie would begin to cache bounce between those threads. In order to prevent this I have updated the usage stats to use a set of percpu variables. By doing this we should be able to avoid the cache bouncing and still make use of these stats. Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2014-12-31 18:25:53 -05:00
stephen hemminger	bec94d430f	gre: allow live address change The GRE tap device supports Ethernet over GRE, but doesn't care about the source address of the tunnel, therefore it can be changed without bring device down. Signed-off-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2014-12-31 14:18:28 -05:00
Pravin B Shelar	997e068ebc	openvswitch: Fix vport_send double free Today vport-send has complex error handling because it involves freeing skb and updating stats depending on return value from vport send implementation. This can be simplified by delegating responsibility of freeing skb to the vport implementation for all cases. So that vport-send needs just update stats. Fixes: `91b7514cdf` ("openvswitch: Unify vport error stats handling") Signed-off-by: Pravin B Shelar <pshelar@nicira.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2014-12-23 23:57:31 -05:00
leroy christophe	7b5bca4676	netfilter: nf_tables: fix port natting in little endian archs Make sure this fetches 16-bits port data from the register. Remove casting to make sparse happy, not needed anymore. Signed-off-by: leroy christophe <christophe.leroy@c-s.fr> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2014-12-23 15:34:28 +01:00
Jesse Gross	12069401d8	geneve: Fix races between socket add and release. Currently, searching for a socket to add a reference to is not synchronized with deletion of sockets. This can result in use after free if there is another operation that is removing a socket at the same time. Solving this requires both holding the appropriate lock and checking the refcount to ensure that it has not already hit zero. Inspired by a related (but not exactly the same) issue in the VXLAN driver. Fixes: `0b5e8b8e` ("net: Add Geneve tunneling protocol driver") CC: Andy Zhou <azhou@nicira.com> Signed-off-by: Jesse Gross <jesse@nicira.com> Acked-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>	2014-12-18 12:38:13 -05:00
Jesse Gross	7ed767f731	geneve: Remove socket and offload handlers at destruction. Sockets aren't currently removed from the the global list when they are destroyed. In addition, offload handlers need to be cleaned up as well. Fixes: `0b5e8b8e` ("net: Add Geneve tunneling protocol driver") CC: Andy Zhou <azhou@nicira.com> Signed-off-by: Jesse Gross <jesse@nicira.com> Acked-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>	2014-12-18 12:38:13 -05:00
Thomas Graf	f1fb521f7d	ip_tunnel: Add missing validation of encap type to ip_tunnel_encap_setup() The encap->type comes straight from Netlink. Validate it against max supported encap types just like ip_encap_hlen() already does. Fixes: a8c5f9 ("ip_tunnel: Ops registration for secondary encap (fou, gue)") Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>	2014-12-16 15:20:41 -05:00
Thomas Graf	bb1553c800	ip_tunnel: Add sanity checks to ip_tunnel_encap_add_ops() The symbols are exported and could be used by external modules. Fixes: a8c5f9 ("ip_tunnel: Ops registration for secondary encap (fou, gue)") Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>	2014-12-16 15:20:41 -05:00
Timo Teräs	8a0033a947	gre: fix the inner mac header in nbma tunnel xmit path The NBMA GRE tunnels temporarily push GRE header that contain the per-packet NBMA destination on the skb via header ops early in xmit path. It is the later pulled before the real GRE header is constructed. The inner mac was thus set differently in nbma case: the GRE header has been pushed by neighbor layer, and mac header points to beginning of the temporary gre header (set by dev_queue_xmit). Now that the offloads expect mac header to point to the gre payload, fix the xmit patch to: - pull first the temporary gre header away - and reset mac header to point to gre payload This fixes tso to work again with nbma tunnels. Fixes: `14051f0452` ("gre: Use inner mac length when computing tunnel length") Signed-off-by: Timo Teräs <timo.teras@iki.fi> Cc: Tom Herbert <therbert@google.com> Cc: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2014-12-15 11:46:04 -05:00
Alexander Duyck	e962f30297	fib_trie: Fix trie balancing issue if new node pushes down existing node This patch addresses an issue with the level compression of the fib_trie. Specifically in the case of adding a new leaf that triggers a new node to be added that takes the place of the old node. The result is a trie where the 1 child tnode is on one side and one leaf is on the other which gives you a very deep trie. Below is the script I used to generate a trie on dummy0 with a 10.X.X.X family of addresses. ip link add type dummy ipval=184549374 bit=2 for i in `seq 1 23` do ifconfig dummy0:$bit $ipval/8 ipval=`expr $ipval - $bit` bit=`expr $bit \* 2` done cat /proc/net/fib_triestat Running the script before the patch: Local: Aver depth: 10.82 Max depth: 23 Leaves: 29 Prefixes: 30 Internal nodes: 27 1: 26 2: 1 Pointers: 56 Null ptrs: 1 Total size: 5 kB After applying the patch and repeating: Local: Aver depth: 4.72 Max depth: 9 Leaves: 29 Prefixes: 30 Internal nodes: 12 1: 3 2: 2 3: 7 Pointers: 70 Null ptrs: 30 Total size: 4 kB What this fix does is start the rebalance at the newly created tnode instead of at the parent tnode. This way if there is a gap between the parent and the new node it doesn't prevent the new tnode from being coalesced with any pre-existing nodes that may have been pushed into one of the new nodes child branches. Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2014-12-12 10:58:53 -05:00
Linus Torvalds	70e71ca0af	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next Pull networking updates from David Miller: 1) New offloading infrastructure and example 'rocker' driver for offloading of switching and routing to hardware. This work was done by a large group of dedicated individuals, not limited to: Scott Feldman, Jiri Pirko, Thomas Graf, John Fastabend, Jamal Hadi Salim, Andy Gospodarek, Florian Fainelli, Roopa Prabhu 2) Start making the networking operate on IOV iterators instead of modifying iov objects in-situ during transfers. Thanks to Al Viro and Herbert Xu. 3) A set of new netlink interfaces for the TIPC stack, from Richard Alpe. 4) Remove unnecessary looping during ipv6 routing lookups, from Martin KaFai Lau. 5) Add PAUSE frame generation support to gianfar driver, from Matei Pavaluca. 6) Allow for larger reordering levels in TCP, which are easily achievable in the real world right now, from Eric Dumazet. 7) Add a variable of napi_schedule that doesn't need to disable cpu interrupts, from Eric Dumazet. 8) Use a doubly linked list to optimize neigh_parms_release(), from Nicolas Dichtel. 9) Various enhancements to the kernel BPF verifier, and allow eBPF programs to actually be attached to sockets. From Alexei Starovoitov. 10) Support TSO/LSO in sunvnet driver, from David L Stevens. 11) Allow controlling ECN usage via routing metrics, from Florian Westphal. 12) Remote checksum offload, from Tom Herbert. 13) Add split-header receive, BQL, and xmit_more support to amd-xgbe driver, from Thomas Lendacky. 14) Add MPLS support to openvswitch, from Simon Horman. 15) Support wildcard tunnel endpoints in ipv6 tunnels, from Steffen Klassert. 16) Do gro flushes on a per-device basis using a timer, from Eric Dumazet. This tries to resolve the conflicting goals between the desired handling of bulk vs. RPC-like traffic. 17) Allow userspace to ask for the CPU upon what a packet was received/steered, via SO_INCOMING_CPU. From Eric Dumazet. 18) Limit GSO packets to half the current congestion window, from Eric Dumazet. 19) Add a generic helper so that all drivers set their RSS keys in a consistent way, from Eric Dumazet. 20) Add xmit_more support to enic driver, from Govindarajulu Varadarajan. 21) Add VLAN packet scheduler action, from Jiri Pirko. 22) Support configurable RSS hash functions via ethtool, from Eyal Perry. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1820 commits) Fix race condition between vxlan_sock_add and vxlan_sock_release net/macb: fix compilation warning for print_hex_dump() called with skb->mac_header net/mlx4: Add support for A0 steering net/mlx4: Refactor QUERY_PORT net/mlx4_core: Add explicit error message when rule doesn't meet configuration net/mlx4: Add A0 hybrid steering net/mlx4: Add mlx4_bitmap zone allocator net/mlx4: Add a check if there are too many reserved QPs net/mlx4: Change QP allocation scheme net/mlx4_core: Use tasklet for user-space CQ completion events net/mlx4_core: Mask out host side virtualization features for guests net/mlx4_en: Set csum level for encapsulated packets be2net: Export tunnel offloads only when a VxLAN tunnel is created gianfar: Fix dma check map error when DMA_API_DEBUG is enabled cxgb4/csiostor: Don't use MASTER_MUST for fw_hello call net: fec: only enable mdio interrupt before phy device link up net: fec: clear all interrupt events to support i.MX6SX net: fec: reset fep link status in suspend function net: sock: fix access via invalid file descriptor net: introduce helper macro for_each_cmsghdr ...	2014-12-11 14:27:06 -08:00
Gu Zheng	f95b414edb	net: introduce helper macro for_each_cmsghdr Introduce helper macro for_each_cmsghdr as a wrapper of the enumerating cmsghdr from msghdr, just cleanup. Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2014-12-10 22:41:55 -05:00
Linus Torvalds	b6da0076ba	Merge branch 'akpm' (patchbomb from Andrew) Merge first patchbomb from Andrew Morton: - a few minor cifs fixes - dma-debug upadtes - ocfs2 - slab - about half of MM - procfs - kernel/exit.c - panic.c tweaks - printk upates - lib/ updates - checkpatch updates - fs/binfmt updates - the drivers/rtc tree - nilfs - kmod fixes - more kernel/exit.c - various other misc tweaks and fixes * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (190 commits) exit: pidns: fix/update the comments in zap_pid_ns_processes() exit: pidns: alloc_pid() leaks pid_namespace if child_reaper is exiting exit: exit_notify: re-use "dead" list to autoreap current exit: reparent: call forget_original_parent() under tasklist_lock exit: reparent: avoid find_new_reaper() if no children exit: reparent: introduce find_alive_thread() exit: reparent: introduce find_child_reaper() exit: reparent: document the ->has_child_subreaper checks exit: reparent: s/while_each_thread/for_each_thread/ in find_new_reaper() exit: reparent: fix the cross-namespace PR_SET_CHILD_SUBREAPER reparenting exit: reparent: fix the dead-parent PR_SET_CHILD_SUBREAPER reparenting exit: proc: don't try to flush /proc/tgid/task/tgid exit: release_task: fix the comment about group leader accounting exit: wait: drop tasklist_lock before psig->c* accounting exit: wait: don't use zombie->real_parent exit: wait: cleanup the ptrace_reparented() checks usermodehelper: kill the kmod_thread_locker logic usermodehelper: don't use CLONE_VFORK for ____call_usermodehelper() fs/hfs/catalog.c: fix comparison bug in hfs_cat_keycmp nilfs2: fix the nilfs_iget() vs. nilfs_new_inode() races ...	2014-12-10 18:34:42 -08:00
Johannes Weiner	3e32cb2e0a	mm: memcontrol: lockless page counters Memory is internally accounted in bytes, using spinlock-protected 64-bit counters, even though the smallest accounting delta is a page. The counter interface is also convoluted and does too many things. Introduce a new lockless word-sized page counter API, then change all memory accounting over to it. The translation from and to bytes then only happens when interfacing with userspace. The removed locking overhead is noticable when scaling beyond the per-cpu charge caches - on a 4-socket machine with 144-threads, the following test shows the performance differences of 288 memcgs concurrently running a page fault benchmark: vanilla: 18631648.500498 task-clock (msec) # 140.643 CPUs utilized ( +- 0.33% ) 1,380,638 context-switches # 0.074 K/sec ( +- 0.75% ) 24,390 cpu-migrations # 0.001 K/sec ( +- 8.44% ) 1,843,305,768 page-faults # 0.099 M/sec ( +- 0.00% ) 50,134,994,088,218 cycles # 2.691 GHz ( +- 0.33% ) <not supported> stalled-cycles-frontend <not supported> stalled-cycles-backend 8,049,712,224,651 instructions # 0.16 insns per cycle ( +- 0.04% ) 1,586,970,584,979 branches # 85.176 M/sec ( +- 0.05% ) 1,724,989,949 branch-misses # 0.11% of all branches ( +- 0.48% ) 132.474343877 seconds time elapsed ( +- 0.21% ) lockless: 12195979.037525 task-clock (msec) # 133.480 CPUs utilized ( +- 0.18% ) 832,850 context-switches # 0.068 K/sec ( +- 0.54% ) 15,624 cpu-migrations # 0.001 K/sec ( +- 10.17% ) 1,843,304,774 page-faults # 0.151 M/sec ( +- 0.00% ) 32,811,216,801,141 cycles # 2.690 GHz ( +- 0.18% ) <not supported> stalled-cycles-frontend <not supported> stalled-cycles-backend 9,999,265,091,727 instructions # 0.30 insns per cycle ( +- 0.10% ) 2,076,759,325,203 branches # 170.282 M/sec ( +- 0.12% ) 1,656,917,214 branch-misses # 0.08% of all branches ( +- 0.55% ) 91.369330729 seconds time elapsed ( +- 0.45% ) On top of improved scalability, this also gets rid of the icky long long types in the very heart of memcg, which is great for 32 bit and also makes the code a lot more readable. Notable differences between the old and new API: - res_counter_charge() and res_counter_charge_nofail() become page_counter_try_charge() and page_counter_charge() resp. to match the more common kernel naming scheme of try_do()/do() - res_counter_uncharge_until() is only ever used to cancel a local counter and never to uncharge bigger segments of a hierarchy, so it's replaced by the simpler page_counter_cancel() - res_counter_set_limit() is replaced by page_counter_limit(), which expects its callers to serialize against themselves - res_counter_memparse_write_strategy() is replaced by page_counter_limit(), which rounds down to the nearest page size - rather than up. This is more reasonable for explicitely requested hard upper limits. - to keep charging light-weight, page_counter_try_charge() charges speculatively, only to roll back if the result exceeds the limit. Because of this, a failing bigger charge can temporarily lock out smaller charges that would otherwise succeed. The error is bounded to the difference between the smallest and the biggest possible charge size, so for memcg, this means that a failing THP charge can send base page charges into reclaim upto 2MB (4MB) before the limit would have been reached. This should be acceptable. [akpm@linux-foundation.org: add includes for WARN_ON_ONCE and memparse] [akpm@linux-foundation.org: add includes for WARN_ON_ONCE, memparse, strncmp, and PAGE_SIZE] Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Vladimir Davydov <vdavydov@parallels.com> Cc: Tejun Heo <tj@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2014-12-10 17:41:04 -08:00
Linus Torvalds	cbfe0de303	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull VFS changes from Al Viro: "First pile out of several (there _definitely_ will be more). Stuff in this one: - unification of d_splice_alias()/d_materialize_unique() - iov_iter rewrite - killing a bunch of ->f_path.dentry users (and f_dentry macro). Getting that completed will make life much simpler for unionmount/overlayfs, since then we'll be able to limit the places sensitive to file _dentry_ to reasonably few. Which allows to have file_inode(file) pointing to inode in a covered layer, with dentry pointing to (negative) dentry in union one. Still not complete, but much closer now. - crapectomy in lustre (dead code removal, mostly) - "let's make seq_printf return nothing" preparations - assorted cleanups and fixes There _definitely_ will be more piles" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits) copy_from_iter_nocache() new helper: iov_iter_kvec() csum_and_copy_..._iter() iov_iter.c: handle ITER_KVEC directly iov_iter.c: convert copy_to_iter() to iterate_and_advance iov_iter.c: convert copy_from_iter() to iterate_and_advance iov_iter.c: get rid of bvec_copy_page_{to,from}_iter() iov_iter.c: convert iov_iter_zero() to iterate_and_advance iov_iter.c: convert iov_iter_get_pages_alloc() to iterate_all_kinds iov_iter.c: convert iov_iter_get_pages() to iterate_all_kinds iov_iter.c: convert iov_iter_npages() to iterate_all_kinds iov_iter.c: iterate_and_advance iov_iter.c: macros for iterating over iov_iter kill f_dentry macro dcache: fix kmemcheck warning in switch_names new helper: audit_file() nfsd_vfs_write(): use file_inode() ncpfs: use file_inode() kill f_dentry uses lockd: get rid of ->f_path.dentry->d_sb ...	2014-12-10 16:10:49 -08:00
David S. Miller	22f10923dd	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Conflicts: drivers/net/ethernet/amd/xgbe/xgbe-desc.c drivers/net/ethernet/renesas/sh_eth.c Overlapping changes in both conflict cases. Signed-off-by: David S. Miller <davem@davemloft.net>	2014-12-10 15:48:20 -05:00
David S. Miller	6e5f59aacb	Merge branch 'for-davem-2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs More iov_iter work for the networking from Al Viro. Signed-off-by: David S. Miller <davem@davemloft.net>	2014-12-10 13:17:23 -05:00
Eric Dumazet	0f85feae6b	tcp: fix more NULL deref after prequeue changes When I cooked commit `c3658e8d0f` ("tcp: fix possible NULL dereference in tcp_vX_send_reset()") I missed other spots we could deref a NULL skb_dst(skb) Again, if a socket is provided, we do not need skb_dst() to get a pointer to network namespace : sock_net(sk) is good enough. Reported-by: Dann Frazier <dann.frazier@canonical.com> Bisected-by: Dann Frazier <dann.frazier@canonical.com> Tested-by: Dann Frazier <dann.frazier@canonical.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Fixes: `ca777eff51` ("tcp: remove dst refcount false sharing for prequeue mode") Signed-off-by: David S. Miller <davem@davemloft.net>	2014-12-09 21:38:44 -05:00
Eric Dumazet	605ad7f184	tcp: refine TSO autosizing Commit `95bd09eb27` ("tcp: TSO packets automatic sizing") tried to control TSO size, but did this at the wrong place (sendmsg() time) At sendmsg() time, we might have a pessimistic view of flow rate, and we end up building very small skbs (with 2 MSS per skb). This is bad because : - It sends small TSO packets even in Slow Start where rate quickly increases. - It tends to make socket write queue very big, increasing tcp_ack() processing time, but also increasing memory needs, not necessarily accounted for, as fast clones overhead is currently ignored. - Lower GRO efficiency and more ACK packets. Servers with a lot of small lived connections suffer from this. Lets instead fill skbs as much as possible (64KB of payload), but split them at xmit time, when we have a precise idea of the flow rate. skb split is actually quite efficient. Patch looks bigger than necessary, because TCP Small Queue decision now has to take place after the eventual split. As Neal suggested, introduce a new tcp_tso_autosize() helper, so that tcp_tso_should_defer() can be synchronized on same goal. Rename tp->xmit_size_goal_segs to tp->gso_segs, as this variable contains number of mss that we can put in GSO packet, and is not related to the autosizing goal anymore. Tested: 40 ms rtt link nstat >/dev/null netperf -H remote -l -2000000 -- -s 1000000 nstat \| egrep "IpInReceives\|IpOutRequests\|TcpOutSegs\|IpExtOutOctets" Before patch : Recv Send Send Socket Socket Message Elapsed Size Size Size Time Throughput bytes bytes bytes secs. 10^6bits/s 87380 2000000 2000000 0.36 44.22 IpInReceives 600 0.0 IpOutRequests 599 0.0 TcpOutSegs 1397 0.0 IpExtOutOctets 2033249 0.0 After patch : Recv Send Send Socket Socket Message Elapsed Size Size Size Time Throughput bytes bytes bytes secs. 10^6bits/sec 87380 2000000 2000000 0.36 44.27 IpInReceives 221 0.0 IpOutRequests 232 0.0 TcpOutSegs 1397 0.0 IpExtOutOctets 2013953 0.0 Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2014-12-09 16:39:22 -05:00
Al Viro	c0371da604	put iov_iter into msghdr Note that the code _using_ ->msg_iter at that point will be very unhappy with anything other than unshifted iovec-backed iov_iter. We still need to convert users to proper primitives. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2014-12-09 16:29:03 -05:00
Al Viro	f4362a2c95	switch tcp_sock->ucopy from iovec (ucopy.iov) to msghdr (ucopy.msg) Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2014-12-09 16:28:22 -05:00
Al Viro	f69e6d131f	ip_generic_getfrag, udplite_getfrag: switch to passing msghdr Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2014-12-09 16:28:22 -05:00
Al Viro	b61e9dcc5e	raw.c: stick msghdr into raw_frag_vec we'll want access to ->msg_iter Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2014-12-09 16:28:21 -05:00
Eric Dumazet	42eef7a0bb	tcp_cubic: refine Hystart delay threshold In commit `2b4636a5f8` ("tcp_cubic: make the delay threshold of HyStart less sensitive"), HYSTART_DELAY_MIN was changed to 4 ms. The remaining problem is that using delay_min + (delay_min/16) as the threshold is too sensitive. 6.25 % of variation is too small for rtt above 60 ms, which are not uncommon. Lets use 12.5 % instead (delay_min + (delay_min/8)) Tested: 80 ms RTT between peers, FQ/pacing packet scheduler on sender. 10 bulk transfers of 10 seconds : nstat >/dev/null for i in `seq 1 10` do netperf -H remote -- -k THROUGHPUT \| grep THROUGHPUT done nstat \| grep Hystart With the 6.25 % threshold : THROUGHPUT=20.66 THROUGHPUT=249.38 THROUGHPUT=254.10 THROUGHPUT=14.94 THROUGHPUT=251.92 THROUGHPUT=237.73 THROUGHPUT=19.18 THROUGHPUT=252.89 THROUGHPUT=21.32 THROUGHPUT=15.58 TcpExtTCPHystartTrainDetect 2 0.0 TcpExtTCPHystartTrainCwnd 4756 0.0 TcpExtTCPHystartDelayDetect 5 0.0 TcpExtTCPHystartDelayCwnd 180 0.0 With the 12.5 % threshold THROUGHPUT=251.09 THROUGHPUT=247.46 THROUGHPUT=250.92 THROUGHPUT=248.91 THROUGHPUT=250.88 THROUGHPUT=249.84 THROUGHPUT=250.51 THROUGHPUT=254.15 THROUGHPUT=250.62 THROUGHPUT=250.89 TcpExtTCPHystartTrainDetect 1 0.0 TcpExtTCPHystartTrainCwnd 3175 0.0 Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Tested-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2014-12-09 14:58:23 -05:00
Eric Dumazet	6e3a8a937c	tcp_cubic: add SNMP counters to track how effective is Hystart When deploying FQ pacing, one thing we noticed is that CUBIC Hystart triggers too soon. Having SNMP counters to have an idea of how often the various Hystart methods trigger is useful prior to any modifications. This patch adds SNMP counters tracking, how many time "ack train" or "Delay" based Hystart triggers, and cumulative sum of cwnd at the time Hystart decided to end SS (Slow Start) myhost:~# nstat -a \| grep Hystart TcpExtTCPHystartTrainDetect 9 0.0 TcpExtTCPHystartTrainCwnd 20650 0.0 TcpExtTCPHystartDelayDetect 10 0.0 TcpExtTCPHystartDelayCwnd 360 0.0 -> Train detection was triggered 9 times, and average cwnd was 20650/9=2294, Delay detection was triggered 10 times and average cwnd was 36 Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2014-12-09 14:58:23 -05:00
Al Viro	ba00410b81	Merge branch 'iov_iter' into for-next	2014-12-08 20:39:29 -05:00
Joe Perches	60c04aecd8	udp: Neaten and reduce size of compute_score functions The compute_score functions are a bit difficult to read. Neaten them a bit to reduce object sizes and make them a bit more intelligible. Return early to avoid indentation and avoid unnecessary initializations. (allyesconfig, but w/ -O2 and no profiling) $ size net/ipv[46]/udp.o.* text data bss dec hex filename 28680 1184 25 29889 74c1 net/ipv4/udp.o.new 28756 1184 25 29965 750d net/ipv4/udp.o.old 17600 1010 2 18612 48b4 net/ipv6/udp.o.new 17632 1010 2 18644 48d4 net/ipv6/udp.o.old Signed-off-by: Joe Perches <joe@perches.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2014-12-08 20:28:47 -05:00
Willem de Bruijn	829ae9d611	net-timestamp: allow reading recv cmsg on errqueue with origin tstamp Allow reading of timestamps and cmsg at the same time on all relevant socket families. One use is to correlate timestamps with egress device, by asking for cmsg IP_PKTINFO. on AF_INET sockets, call the relevant function (ip_cmsg_recv). To avoid changing legacy expectations, only do so if the caller sets a new timestamping flag SOF_TIMESTAMPING_OPT_CMSG. on AF_INET6 sockets, IPV6_PKTINFO and all other recv cmsg are already returned for all origins. only change is to set ifindex, which is not initialized for all error origins. In both cases, only generate the pktinfo message if an ifindex is known. This is not the case for ACK timestamps. The difference between the protocol families is probably a historical accident as a result of the different conditions for generating cmsg in the relevant ip(v6)_recv_error function: ipv4: if (serr->ee.ee_origin == SO_EE_ORIGIN_ICMP) { ipv6: if (serr->ee.ee_origin != SO_EE_ORIGIN_LOCAL) { At one time, this was the same test bar for the ICMP/ICMP6 distinction. This is no longer true. Signed-off-by: Willem de Bruijn <willemb@google.com> ---- Changes v1 -> v2 large rewrite - integrate with existing pktinfo cmsg generation code - on ipv4: only send with new flag, to maintain legacy behavior - on ipv6: send at most a single pktinfo cmsg - on ipv6: initialize fields if not yet initialized The recv cmsg interfaces are also relevant to the discussion of whether looping packet headers is problematic. For v6, cmsgs that identify many headers are already returned. This patch expands that to v4. If it sounds reasonable, I will follow with patches 1. request timestamps without payload with SOF_TIMESTAMPING_OPT_TSONLY (http://patchwork.ozlabs.org/patch/366967/) 2. sysctl to conditionally drop all timestamps that have payload or cmsg from users without CAP_NET_RAW. Signed-off-by: David S. Miller <davem@davemloft.net>	2014-12-08 20:20:48 -05:00
Willem de Bruijn	7ce875e5ec	ipv4: warn once on passing AF_INET6 socket to ip_recv_error One line change, in response to catching an occurrence of this bug. See also fix `f4713a3dfa` ("net-timestamp: make tcp_recvmsg call ...") Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2014-12-08 20:20:48 -05:00
Tom Herbert	6fb2a75673	gre: Set inner mac header in gro complete Set the inner mac header to point to the GRE payload when doing GRO. This is needed if we proceed to send the packet through GRE GSO which now uses the inner mac header instead of inner network header to determine the length of encapsulation headers. Fixes: `14051f0452` ("gre: Use inner mac length when computing tunnel length") Reported-by: Wolfgang Walter <linux@stwm.de> Signed-off-by: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2014-12-05 21:18:34 -08:00
David S. Miller	244ebd9f8f	Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next Pablo Neira Ayuso says: ==================== Netfilter updates for net-next The following batch contains netfilter updates for net-next. Basically, enhancements for xt_recent, skip zeroing of timer in conntrack, fix linking problem with recent redirect support for nf_tables, ipset updates and a couple of cleanups. More specifically, they are: 1) Rise maximum number per IP address to be remembered in xt_recent while retaining backward compatibility, from Florian Westphal. 2) Skip zeroing timer area in nf_conn objects, also from Florian. 3) Inspect IPv4 and IPv6 traffic from the bridge to allow filtering using using meta l4proto and transport layer header, from Alvaro Neira. 4) Fix linking problems in the new redirect support when CONFIG_IPV6=n and IP6_NF_IPTABLES=n. And ipset updates from Jozsef Kadlecsik: 5) Support updating element extensions when the set is full (fixes netfilter bugzilla id 880). 6) Fix set match with 32-bits userspace / 64-bits kernel. 7) Indicate explicitly when /0 networks are supported in ipset. 8) Simplify cidr handling for hash:net types. 9) Allocate the proper size of memory when /0 networks are supported. 10) Explicitly add padding elements to hash:net,net and hash:net,port, because the elements must be u32 sized for the used hash function. Jozsef is also cooking ipset RCU conversion which should land soon if they reach the merge window in time. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2014-12-05 20:56:46 -08:00
David S. Miller	60b7379dc5	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net	2014-11-29 20:47:48 -08:00
Pablo Neira Ayuso	b59eaf9e28	netfilter: combine IPv4 and IPv6 nf_nat_redirect code in one module This resolves linking problems with CONFIG_IPV6=n: net/built-in.o: In function `redirect_tg6': xt_REDIRECT.c:(.text+0x6d021): undefined reference to `nf_nat_redirect_ipv6' Reported-by: Andreas Ruprecht <rupran@einserver.de> Reported-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2014-11-27 13:08:42 +01:00
Willem de Bruijn	f4713a3dfa	net-timestamp: make tcp_recvmsg call ipv6_recv_error for AF_INET6 socks TCP timestamping introduced MSG_ERRQUEUE handling for TCP sockets. If the socket is of family AF_INET6, call ipv6_recv_error instead of ip_recv_error. This change is more complex than a single branch due to the loadable ipv6 module. It reuses a pre-existing indirect function call from ping. The ping code is safe to call, because it is part of the core ipv6 module and always present when AF_INET6 sockets are active. Fixes: `4ed2d765` (net-timestamp: TCP timestamping) Signed-off-by: Willem de Bruijn <willemb@google.com> ---- It may also be worthwhile to add WARN_ON_ONCE(sk->family == AF_INET6) to ip_recv_error. Signed-off-by: David S. Miller <davem@davemloft.net>	2014-11-26 15:45:04 -05:00
Tom Herbert	4fd671ded1	gue: Call remcsum_adjust Change remote checksum offload to call remcsum_adjust. This also eliminates the optimization to skip an IP header as part of the adjustment (really does not seem to be much of a win). Signed-off-by: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2014-11-26 12:25:44 -05:00
David S. Miller	d3fc6b3fdd	Merge branch 'for-davem' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs More work from Al Viro to move away from modifying iovecs by using iov_iter instead. Signed-off-by: David S. Miller <davem@davemloft.net>	2014-11-25 20:02:51 -05:00
Eric Dumazet	c3658e8d0f	tcp: fix possible NULL dereference in tcp_vX_send_reset() After commit `ca777eff51` ("tcp: remove dst refcount false sharing for prequeue mode") we have to relax check against skb dst in tcp_v[46]_send_reset() if prequeue dropped the dst. If a socket is provided, a full lookup was done to find this socket, so the dst test can be skipped. Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=88191 Reported-by: Jaša Bartelj <jasa.bartelj@gmail.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Daniel Borkmann <dborkman@redhat.com> Fixes: `ca777eff51` ("tcp: remove dst refcount false sharing for prequeue mode") Signed-off-by: David S. Miller <davem@davemloft.net>	2014-11-25 14:29:18 -05:00
Jane Zhou	91a0b60346	net/ping: handle protocol mismatching scenario ping_lookup() may return a wrong sock if sk_buff's and sock's protocols dont' match. For example, sk_buff's protocol is ETH_P_IPV6, but sock's sk_family is AF_INET, in that case, if sk->sk_bound_dev_if is zero, a wrong sock will be returned. the fix is to "continue" the searching, if no matching, return NULL. Cc: "David S. Miller" <davem@davemloft.net> Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> Cc: James Morris <jmorris@namei.org> Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org> Cc: Patrick McHardy <kaber@trash.net> Cc: netdev@vger.kernel.org Cc: stable@vger.kernel.org Signed-off-by: Jane Zhou <a17711@motorola.com> Signed-off-by: Yiwei Zhao <gbjc64@motorola.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2014-11-24 16:48:20 -05:00
David S. Miller	958d03b016	Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next Pablo Neira Ayuso says: ==================== netfilter/ipvs updates for net-next The following patchset contains Netfilter updates for your net-next tree, this includes the NAT redirection support for nf_tables, the cgroup support for nft meta and conntrack zone support for the connlimit match. Coming after those, a bunch of sparse warning fixes, missing netns bits and cleanups. More specifically, they are: 1) Prepare IPv4 and IPv6 NAT redirect code to use it from nf_tables, patches from Arturo Borrero. 2) Introduce the nf_tables redir expression, from Arturo Borrero. 3) Remove an unnecessary assignment in ip_vs_xmit/__ip_vs_get_out_rt(). Patch from Alex Gartrell. 4) Add nft_log_dereference() macro to the nf_log infrastructure, patch from Marcelo Leitner. 5) Add some extra validation when registering logger families, also from Marcelo. 6) Some spelling cleanups from stephen hemminger. 7) Fix sparse warning in nf_logger_find_get(). 8) Add cgroup support to nf_tables meta, patch from Ana Rey. 9) A Kconfig fix for the new redir expression and fix sparse warnings in the new redir expression. 10) Fix several sparse warnings in the netfilter tree, from Florian Westphal. 11) Reduce verbosity when OOM in nfnetlink_log. User can basically do nothing when this situation occurs. 12) Add conntrack zone support to xt_connlimit, again from Florian. 13) Add netnamespace support to the h323 conntrack helper, contributed by Vasily Averin. 14) Remove unnecessary nul-pointer checks before free_percpu() and module_put(), from Markus Elfring. 15) Use pr_fmt in nfnetlink_log, again patch from Marcelo Leitner. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2014-11-24 16:00:58 -05:00
Al Viro	7eab8d9e8a	new helper: memcpy_to_msg() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2014-11-24 04:28:51 -05:00
Al Viro	6ce8e9ce59	new helper: memcpy_from_msg() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2014-11-24 04:28:48 -05:00
Al Viro	227158db16	new helper: skb_copy_and_csum_datagram_msg() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2014-11-24 04:28:44 -05:00
lucien	20ea60ca99	ip_tunnel: the lack of vti_link_ops' dellink() cause kernel panic Now the vti_link_ops do not point the .dellink, for fb tunnel device (ip_vti0), the net_device will be removed as the default .dellink is unregister_netdevice_queue,but the tunnel still in the tunnel list, then if we add a new vti tunnel, in ip_tunnel_find(): hlist_for_each_entry_rcu(t, head, hash_node) { if (local == t->parms.iph.saddr && remote == t->parms.iph.daddr && link == t->parms.link && ==> type == t->dev->type && ip_tunnel_key_match(&t->parms, flags, key)) break; } the panic will happen, cause dev of ip_tunnel *t is null: [ 3835.072977] IP: [<ffffffffa04103fd>] ip_tunnel_find+0x9d/0xc0 [ip_tunnel] [ 3835.073008] PGD b2c21067 PUD b7277067 PMD 0 [ 3835.073008] Oops: 0000 [#1] SMP ..... [ 3835.073008] Stack: [ 3835.073008] ffff8800b72d77f0 ffffffffa0411924 ffff8800bb956000 ffff8800b72d78e0 [ 3835.073008] ffff8800b72d78a0 0000000000000000 ffffffffa040d100 ffff8800b72d7858 [ 3835.073008] ffffffffa040b2e3 0000000000000000 0000000000000000 0000000000000000 [ 3835.073008] Call Trace: [ 3835.073008] [<ffffffffa0411924>] ip_tunnel_newlink+0x64/0x160 [ip_tunnel] [ 3835.073008] [<ffffffffa040b2e3>] vti_newlink+0x43/0x70 [ip_vti] [ 3835.073008] [<ffffffff8150d4da>] rtnl_newlink+0x4fa/0x5f0 [ 3835.073008] [<ffffffff812f68bb>] ? nla_strlcpy+0x5b/0x70 [ 3835.073008] [<ffffffff81508fb0>] ? rtnl_link_ops_get+0x40/0x60 [ 3835.073008] [<ffffffff8150d11f>] ? rtnl_newlink+0x13f/0x5f0 [ 3835.073008] [<ffffffff81509cf4>] rtnetlink_rcv_msg+0xa4/0x270 [ 3835.073008] [<ffffffff8126adf5>] ? sock_has_perm+0x75/0x90 [ 3835.073008] [<ffffffff81509c50>] ? rtnetlink_rcv+0x30/0x30 [ 3835.073008] [<ffffffff81529e39>] netlink_rcv_skb+0xa9/0xc0 [ 3835.073008] [<ffffffff81509c48>] rtnetlink_rcv+0x28/0x30 .... modprobe ip_vti ip link del ip_vti0 type vti ip link add ip_vti0 type vti rmmod ip_vti do that one or more times, kernel will panic. fix it by assigning ip_tunnel_dellink to vti_link_ops' dellink, in which we skip the unregister of fb tunnel device. do the same on ip6_vti. Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: Cong Wang <cwang@twopensource.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2014-11-23 21:11:17 -05:00
David S. Miller	1459143386	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Conflicts: drivers/net/ieee802154/fakehard.c A bug fix went into 'net' for ieee802154/fakehard.c, which is removed in 'net-next'. Add build fix into the merge from Stephen Rothwell in openvswitch, the logging macros take a new initial 'log' argument, a new call was added in 'net' so when we merge that in here we have to explicitly add the new 'log' arg to it else the build fails. Signed-off-by: David S. Miller <davem@davemloft.net>	2014-11-21 22:28:24 -05:00
Calvin Owens	0c228e833c	tcp: Restore RFC5961-compliant behavior for SYN packets Commit `c3ae62af8e` ("tcp: should drop incoming frames without ACK flag set") was created to mitigate a security vulnerability in which a local attacker is able to inject data into locally-opened sockets by using TCP protocol statistics in procfs to quickly find the correct sequence number. This broke the RFC5961 requirement to send a challenge ACK in response to spurious RST packets, which was subsequently fixed by commit `7b514a886b` ("tcp: accept RST without ACK flag"). Unfortunately, the RFC5961 requirement that spurious SYN packets be handled in a similar manner remains broken. RFC5961 section 4 states that: ... the handling of the SYN in the synchronized state SHOULD be performed as follows: 1) If the SYN bit is set, irrespective of the sequence number, TCP MUST send an ACK (also referred to as challenge ACK) to the remote peer: <SEQ=SND.NXT><ACK=RCV.NXT><CTL=ACK> After sending the acknowledgment, TCP MUST drop the unacceptable segment and stop processing further. By sending an ACK, the remote peer is challenged to confirm the loss of the previous connection and the request to start a new connection. A legitimate peer, after restart, would not have a TCB in the synchronized state. Thus, when the ACK arrives, the peer should send a RST segment back with the sequence number derived from the ACK field that caused the RST. This RST will confirm that the remote peer has indeed closed the previous connection. Upon receipt of a valid RST, the local TCP endpoint MUST terminate its connection. The local TCP endpoint should then rely on SYN retransmission from the remote end to re-establish the connection. This patch lets SYN packets through the discard added in `c3ae62af8e`, so that spurious SYN packets are properly dealt with as per the RFC. The challenge ACK is sent unconditionally and is rate-limited, so the original vulnerability is not reintroduced by this patch. Signed-off-by: Calvin Owens <calvinowens@fb.com> Acked-by: Eric Dumazet <edumazet@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2014-11-21 15:33:50 -05:00
Jiri Pirko	5968250c86	vlan: introduce *vlan_hwaccel_push_inside helpers Use them to push skb->vlan_tci into the payload and avoid code duplication. Signed-off-by: Jiri Pirko <jiri@resnulli.us> Acked-by: Pravin B Shelar <pshelar@nicira.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2014-11-21 14:20:17 -05:00
Jiri Pirko	62749e2cb3	vlan: rename __vlan_put_tag to vlan_insert_tag_set_proto Name fits better. Plus there's going to be introduced __vlan_insert_tag later on. Signed-off-by: Jiri Pirko <jiri@resnulli.us> Acked-by: Pravin B Shelar <pshelar@nicira.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2014-11-21 14:20:17 -05:00
Eric Dumazet	355a901e6c	tcp: make connect() mem charging friendly While working on sk_forward_alloc problems reported by Denys Fedoryshchenko, we found that tcp connect() (and fastopen) do not call sk_wmem_schedule() for SYN packet (and/or SYN/DATA packet), so sk_forward_alloc is negative while connect is in progress. We can fix this by calling regular sk_stream_alloc_skb() both for the SYN packet (in tcp_connect()) and the syn_data packet in tcp_send_syn_data() Then, tcp_send_syn_data() can avoid copying syn_data as we simply can manipulate syn_data->cb[] to remove SYN flag (and increment seq) Instead of open coding memcpy_fromiovecend(), simply use this helper. This leaves in socket write queue clean fast clone skbs. This was tested against our fastopen packetdrill tests. Reported-by: Denys Fedoryshchenko <nuclearcat@nuclearcat.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2014-11-19 14:57:01 -05:00
Rick Jones	e3e3217029	icmp: Remove some spurious dropped packet profile hits from the ICMP path If icmp_rcv() has successfully processed the incoming ICMP datagram, we should use consume_skb() rather than kfree_skb() because a hit on the likes of perf -e skb:kfree_skb is not called-for. Signed-off-by: Rick Jones <rick.jones2@hp.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2014-11-18 15:28:28 -05:00
Daniel Borkmann	feb91a02cc	ipv6: mld: fix add_grhead skb_over_panic for devs with large MTUs It has been reported that generating an MLD listener report on devices with large MTUs (e.g. 9000) and a high number of IPv6 addresses can trigger a skb_over_panic(): skbuff: skb_over_panic: text:ffffffff80612a5d len:3776 put:20 head:ffff88046d751000 data:ffff88046d751010 tail:0xed0 end:0xec0 dev:port1 ------------[ cut here ]------------ kernel BUG at net/core/skbuff.c:100! invalid opcode: 0000 [#1] SMP Modules linked in: ixgbe(O) CPU: 3 PID: 0 Comm: swapper/3 Tainted: G O 3.14.23+ #4 [...] Call Trace: <IRQ> [<ffffffff80578226>] ? skb_put+0x3a/0x3b [<ffffffff80612a5d>] ? add_grhead+0x45/0x8e [<ffffffff80612e3a>] ? add_grec+0x394/0x3d4 [<ffffffff80613222>] ? mld_ifc_timer_expire+0x195/0x20d [<ffffffff8061308d>] ? mld_dad_timer_expire+0x45/0x45 [<ffffffff80255b5d>] ? call_timer_fn.isra.29+0x12/0x68 [<ffffffff80255d16>] ? run_timer_softirq+0x163/0x182 [<ffffffff80250e6f>] ? __do_softirq+0xe0/0x21d [<ffffffff8025112b>] ? irq_exit+0x4e/0xd3 [<ffffffff802214bb>] ? smp_apic_timer_interrupt+0x3b/0x46 [<ffffffff8063f10a>] ? apic_timer_interrupt+0x6a/0x70 mld_newpack() skb allocations are usually requested with dev->mtu in size, since commit `72e09ad107` ("ipv6: avoid high order allocations") we have changed the limit in order to be less likely to fail. However, in MLD/IGMP code, we have some rather ugly AVAILABLE(skb) macros, which determine if we may end up doing an skb_put() for adding another record. To avoid possible fragmentation, we check the skb's tailroom as skb->dev->mtu - skb->len, which is a wrong assumption as the actual max allocation size can be much smaller. The IGMP case doesn't have this issue as commit `57e1ab6ead` ("igmp: refine skb allocations") stores the allocation size in the cb[]. Set a reserved_tailroom to make it fit into the MTU and use skb_availroom() helper instead. This also allows to get rid of igmp_skb_size(). Reported-by: Wei Liu <lw1a2.jing@gmail.com> Fixes: `72e09ad107` ("ipv6: avoid high order allocations") Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Hannes Frederic Sowa <hannes@stressinduktion.org> Cc: David L Stevens <david.stevens@oracle.com> Acked-by: Eric Dumazet <edumazet@google.com> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2014-11-16 16:55:06 -05:00
David S. Miller	f1227c5c1b	Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf Pablo Neira Ayuso says: ==================== Netfilter/IPVS fixes for net The following patchset contains Netfilter updates for your net tree, they are: 1) Fix missing initialization of the range structure (allocated in the stack) in nft_masq_{ipv4, ipv6}_eval, from Daniel Borkmann. 2) Make sure the data we receive from userspace contains the req_version structure, otherwise return an error incomplete on truncated input. From Dan Carpenter. 3) Fix handling og skb->sk which may cause incorrect handling of connections from a local process. Via Simon Horman, patch from Calvin Owens. 4) Fix wrong netns in nft_compat when setting target and match params structure. 5) Relax chain type validation in nft_compat that was recently included, this broke the matches that need to be run from the route chain type. Now iptables-test.py automated regression tests report success again and we avoid the only possible problematic case, which is the use of nat targets out of nat chain type. 6) Use match->table to validate the tablename, instead of the match->name. Again patch for nft_compat. 7) Restore the synchronous release of objects from the commit and abort path in nf_tables. This is causing two major problems: splats when using nft_compat, given that matches and targets may sleep and call_rcu is invoked from softirq context. Moreover Patrick reported possible event notification reordering when rules refer to anonymous sets. 8) Fix race condition in between packets that are being confirmed by conntrack and the ctnetlink flush operation. This happens since the removal of the central spinlock. Thanks to Jesper D. Brouer to looking into this. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2014-11-16 14:23:56 -05:00
Panu Matilainen	49dd18ba46	ipv4: Fix incorrect error code when adding an unreachable route Trying to add an unreachable route incorrectly returns -ESRCH if if custom FIB rules are present: [root@localhost ~]# ip route add 74.125.31.199 dev eth0 via 1.2.3.4 RTNETLINK answers: Network is unreachable [root@localhost ~]# ip rule add to 55.66.77.88 table 200 [root@localhost ~]# ip route add 74.125.31.199 dev eth0 via 1.2.3.4 RTNETLINK answers: No such process [root@localhost ~]# Commit `83886b6b63` ("[NET]: Change "not found" return value for rule lookup") changed fib_rules_lookup() to use -ESRCH as a "not found" code internally, but for user space it should be translated into -ENETUNREACH. Handle the translation centrally in ipv4-specific fib_lookup(), leaving the DECnet case alone. On a related note, commit `b7a71b51ee` ("ipv4: removed redundant conditional") removed a similar translation from ip_route_input_slow() prematurely AIUI. Fixes: `b7a71b51ee` ("ipv4: removed redundant conditional") Signed-off-by: Panu Matilainen <pmatilai@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2014-11-16 14:11:45 -05:00
David S. Miller	076ce44825	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Conflicts: drivers/net/ethernet/chelsio/cxgb4vf/sge.c drivers/net/ethernet/intel/ixgbe/ixgbe_phy.c sge.c was overlapping two changes, one to use the new __dev_alloc_page() in net-next, and one to use s->fl_pg_order in net. ixgbe_phy.c was a set of overlapping whitespace changes. Signed-off-by: David S. Miller <davem@davemloft.net>	2014-11-14 01:01:12 -05:00
Eric Dumazet	d649a7a81f	tcp: limit GSO packets to half cwnd In DC world, GSO packets initially cooked by tcp_sendmsg() are usually big, as sk_pacing_rate is high. When network is congested, cwnd can be smaller than the GSO packets found in socket write queue. tcp_write_xmit() splits GSO packets using the available cwnd, and we end up sending a single GSO packet, consuming all available cwnd. With GRO aggregation on the receiver, we might handle a single GRO packet, sending back a single ACK. 1) This single ACK might be lost TLP or RTO are forced to attempt a retransmit. 2) This ACK releases a full cwnd, sender sends another big GSO packet, in a ping pong mode. This behavior does not fill the pipes in the best way, because of scheduling artifacts. Make sure we always have at least two GSO packets in flight. This allows us to safely increase GRO efficiency without risking spurious retransmits. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2014-11-13 15:21:44 -05:00
Thomas Graf	882288c05e	FOU: Fix no return statement warning for !CONFIG_NET_FOU_IP_TUNNELS net/ipv4/fou.c: In function ‘ip_tunnel_encap_del_fou_ops’: net/ipv4/fou.c:861:1: warning: no return statement in function returning non-void [-Wreturn-type] Fixes: `a8c5f90fb5` ("ip_tunnel: Ops registration for secondary encap (fou, gue)") Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>	2014-11-13 14:32:00 -05:00
Florian Westphal	5676864431	netfilter: fix various sparse warnings net/bridge/br_netfilter.c:870:6: symbol 'br_netfilter_enable' was not declared. Should it be static? no; add include net/ipv4/netfilter/nft_reject_ipv4.c:22:6: symbol 'nft_reject_ipv4_eval' was not declared. Should it be static? yes net/ipv6/netfilter/nf_reject_ipv6.c:16:6: symbol 'nf_send_reset6' was not declared. Should it be static? no; add include net/ipv6/netfilter/nft_reject_ipv6.c:22:6: symbol 'nft_reject_ipv6_eval' was not declared. Should it be static? yes net/netfilter/core.c:33:32: symbol 'nf_ipv6_ops' was not declared. Should it be static? no; add include net/netfilter/xt_DSCP.c:40:57: cast truncates bits from constant value (ffffff03 becomes 3) net/netfilter/xt_DSCP.c:57:59: cast truncates bits from constant value (ffffff03 becomes 3) add __force, 3 is what we want. net/ipv4/netfilter/nf_log_arp.c:77:6: symbol 'nf_log_arp_packet' was not declared. Should it be static? yes net/ipv4/netfilter/nf_reject_ipv4.c:17:6: symbol 'nf_send_reset' was not declared. Should it be static? no; add include Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2014-11-13 12:14:42 +01:00
Tom Herbert	a8c5f90fb5	ip_tunnel: Ops registration for secondary encap (fou, gue) Instead of calling fou and gue functions directly from ip_tunnel use ops for these that were previously registered. This patch adds the logic to add and remove encapsulation operations for ip_tunnel, and modified fou (and gue) to register with ip_tunnels. This patch also addresses a circular dependency between ip_tunnel and fou that was causing link errors when CONFIG_NET_IP_TUNNEL=y and CONFIG_NET_FOU=m. References to fou an gue have been removed from ip_tunnel.c Reported-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2014-11-12 15:01:35 -05:00
Joe Perches	4243cdc2c1	udp: Neaten function pointer calls and add braces Standardize function pointer uses. Convert calling style from: (*foo)(args...); to: foo(args...); Other miscellanea: o Add braces around loops with single ifs on multiple lines o Realign arguments around these functions o Invert logic in if to return immediately. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2014-11-12 14:51:59 -05:00
Eric Dumazet	5337b5b75c	ipv6: fix IPV6_PKTINFO with v4 mapped Use IS_ENABLED(CONFIG_IPV6), to enable this code if IPv6 is a module. Signed-off-by: Eric Dumazet <edumazet@google.com> Fixes: `c8e6ad0829` ("ipv6: honor IPV6_PKTINFO with v4 mapped addresses on sendmsg") Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2014-11-11 15:32:45 -05:00
WANG Cong	d7480fd3b1	neigh: remove dynamic neigh table registration support Currently there are only three neigh tables in the whole kernel: arp table, ndisc table and decnet neigh table. What's more, we don't support registering multiple tables per family. Therefore we can just make these tables statically built-in. Cc: David S. Miller <davem@davemloft.net> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2014-11-11 15:23:54 -05:00

1 2 3 4 5 ...

6541 Commits