8500 Commits

Author SHA1 Message Date
Eric Dumazet
ed2dfd9009 tcp/dccp: warn user for preferred ip_local_port_range
After commit 07f4c90062 ("tcp/dccp: try to not exhaust
ip_local_port_range in connect()") it is advised to have an even number
of ports described in /proc/sys/net/ipv4/ip_local_port_range

This means start/end values should have a different parity.

Let's warn sysadmins of this, so that they can update their settings
if they want to.

Suggested-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-27 14:35:36 -04:00
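Illustration for ed2dfd9009 (not part of the commit): a minimal userspace sketch of the parity rule described in the message. The function name is hypothetical, and it only inspects the "low high" text as it would be read from /proc/sys/net/ipv4/ip_local_port_range; the kernel's own warning logic may differ.

```python
def port_range_has_even_count(text: str) -> bool:
    """Return True if the inclusive 'low high' range covers an even number of ports."""
    low, high = (int(field) for field in text.split())
    if low > high:
        raise ValueError("low port must not exceed high port")
    # The range is inclusive, so it holds high - low + 1 ports. That count is
    # even exactly when low and high have different parity.
    return (high - low + 1) % 2 == 0
```

A range such as "32768 60999" passes the check, while "32768 61000" does not, because both of its bounds are even.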
Florian Westphal
d6b915e29f ip_fragment: don't forward defragmented DF packet
We currently always send fragments without DF bit set.

Thus, given the following setup:

mtu1500 - mtu1500:1400 - mtu1400:1280 - mtu1280
   A           R1              R2         B

Where R1 and R2 run linux with netfilter defragmentation/conntrack
enabled, then if Host A sent a fragmented packet _with_ DF set to B, R1
will respond with icmp too big error if one of these fragments exceeded
1400 bytes.

However, if R1 receives fragment sizes 1200 and 100, it would
forward the reassembled packet without refragmenting, i.e.
R2 will send an icmp error in response to a packet that was never sent,
citing mtu that the original sender never exceeded.

The other minor issue is that a refragmentation on R1 will conceal the
MTU of R2-B since refragmentation does not set DF bit on the fragments.

This modifies ip_fragment so that we track largest fragment size seen
both for DF and non-DF packets, and set frag_max_size to the largest
value.

If the DF fragment size is larger than or equal to the non-DF one, we will
consider the packet a path mtu probe:
We set DF bit on the reassembled skb and also tag it with a new IPCB flag
to force refragmentation even if skb fits outdev mtu.

We will also set DF bit on each fragment in this case.

Joint work with Hannes Frederic Sowa.

Reported-by: Jesse Gross <jesse@nicira.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-27 13:03:31 -04:00
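Illustration for d6b915e29f (not part of the commit): a small sketch of the bookkeeping the message describes, tracking the largest DF and non-DF fragment sizes separately. The class and method names are hypothetical. The extra guard that at least one DF fragment was seen is an assumption added here so that a packet without any DF fragment is never treated as a probe.

```python
from dataclasses import dataclass


@dataclass
class ReassemblyState:
    max_df_size: int = 0     # largest fragment seen with DF set
    max_nondf_size: int = 0  # largest fragment seen without DF

    def add_fragment(self, size: int, df: bool) -> None:
        if df:
            self.max_df_size = max(self.max_df_size, size)
        else:
            self.max_nondf_size = max(self.max_nondf_size, size)

    @property
    def frag_max_size(self) -> int:
        # The largest value seen, regardless of the DF bit.
        return max(self.max_df_size, self.max_nondf_size)

    def is_pmtu_probe(self) -> bool:
        # Assumption: require at least one DF fragment before comparing sizes.
        return self.max_df_size > 0 and self.max_df_size >= self.max_nondf_size
```

When is_pmtu_probe() is true, the message says the reassembled packet gets the DF bit and is refragmented to the recorded size even if it would fit the outgoing MTU.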
Eric Dumazet
095dc8e0c3 tcp: fix/cleanup inet_ehash_locks_alloc()
If the tcp ehash table is constrained to a very small number of buckets
(e.g. boot parameter thash_entries=128), then we can crash if the spinlock
array has more entries.

While we are at it, un-inline inet_ehash_locks_alloc() and make the
following changes:

- Budget 2 cache lines per cpu worth of 'spinlocks'
- Try to kmalloc() the array to avoid extra TLB pressure.
  (Most servers at Google allocate 8192 bytes for this hash table)
- Get rid of various #ifdef

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-26 19:48:46 -04:00
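Illustration for 095dc8e0c3 (not part of the commit): a sketch of how a lock-array size could be derived from the "2 cache lines per cpu" budget and then capped at the bucket count so it never exceeds the hash table. The function names, the power-of-two rounding, and the parameter values in the example are assumptions, not taken from the kernel source.

```python
def roundup_pow_of_two(n: int) -> int:
    """Smallest power of two that is >= n, for n >= 1."""
    return 1 << (n - 1).bit_length()


def ehash_lock_count(lock_size: int, cache_line_bytes: int,
                     possible_cpus: int, hash_buckets: int) -> int:
    # Budget two cache lines worth of locks per cpu, but at least one lock.
    per_cpu = max(2 * cache_line_bytes // lock_size, 1)
    locks = roundup_pow_of_two(per_cpu * possible_cpus)
    # Never allocate more locks than there are hash buckets.
    return min(locks, hash_buckets)
```

With 4-byte locks, 64-byte cache lines and 8 cpus the budget is 256 locks, which the cap reduces to 128 for a table built with thash_entries=128.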
Eric Dumazet
7f1598678d ipv6: ipv6_select_ident() returns a __be32
ipv6_select_ident() returns a 32bit value in network order.

Fixes: 286c2349f6 ("ipv6: Clean up ipv6_select_ident() and ip6_fragment()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-25 20:27:11 -04:00
Martin KaFai Lau
d52d3997f8 ipv6: Create percpu rt6_info
After the patch
'ipv6: Only create RTF_CACHE routes after encountering pmtu exception',
we need to compensate for the performance hit (bouncing dst->__refcnt).

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-25 13:25:35 -04:00
Martin KaFai Lau
8d0b94afdc ipv6: Keep track of DST_NOCACHE routes in case of iface down/unregister
This patch keeps track of the DST_NOCACHE routes in a list and replaces their
dev with loopback during the iface down/unregister event.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-25 13:25:34 -04:00
Martin KaFai Lau
3da59bd945 ipv6: Create RTF_CACHE clone when FLOWI_FLAG_KNOWN_NH is set
This patch always creates RTF_CACHE clone with DST_NOCACHE
when FLOWI_FLAG_KNOWN_NH is set so that the rt6i_dst is set to
the fl6->daddr.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Tested-by: Julian Anastasov <ja@ssi.bg>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-25 13:25:34 -04:00
Martin KaFai Lau
b197df4f0f ipv6: Add rt6_get_cookie() function
Instead of doing the rt6->rt6i_node check whenever we need
to get the route's cookie, refactor it into rt6_get_cookie().
This is prep work to handle FLOWI_FLAG_KNOWN_NH and also
percpu rt6_info later.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-25 13:25:34 -04:00
Martin KaFai Lau
45e4fd2668 ipv6: Only create RTF_CACHE routes after encountering pmtu exception
This patch creates RTF_CACHE routes only after encountering a pmtu
exception.

After ip6_rt_update_pmtu() has inserted the RTF_CACHE route to the fib6
tree, the rt->rt6i_node->fn_sernum is bumped which will fail the
ip6_dst_check() and trigger a relookup.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-25 13:25:33 -04:00
Martin KaFai Lau
2647a9b070 ipv6: Remove external dependency on rt6i_gateway and RTF_ANYCAST
When creating a RTF_CACHE route, RTF_ANYCAST is set based on rt6i_dst.
Also, rt6i_gateway is always set to the nexthop while the nexthop
could be a gateway or the rt6i_dst.addr.

After removing the rt6i_dst and rt6i_src dependency in the last patch,
we also need to stop the caller from depending on rt6i_gateway and
RTF_ANYCAST.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-25 13:25:33 -04:00
Martin KaFai Lau
fd0273d793 ipv6: Remove external dependency on rt6i_dst and rt6i_src
This patch removes the assumptions that the returned rt is always
a RTF_CACHE entry with the rt6i_dst and rt6i_src containing the
destination and source address.  The dst and src can be recovered from
the calling site.

We may consider renaming (rt6i_dst, rt6i_src) to
(rt6i_key_dst, rt6i_key_src) later.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-25 13:25:32 -04:00
Martin KaFai Lau
286c2349f6 ipv6: Clean up ipv6_select_ident() and ip6_fragment()
This patch changes the ipv6_select_ident() signature to return a
fragment id instead of taking a whole frag_hdr as a param to
only set the frag_hdr->identification.

It also cleans up ip6_fragment() to obtain the fragment id at the
beginning instead of using multiple "if" checks later to test whether the
fragment id has been generated.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-25 13:25:32 -04:00
David S. Miller
36583eb54d Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/ethernet/cadence/macb.c
	drivers/net/phy/phy.c
	include/linux/skbuff.h
	net/ipv4/tcp.c
	net/switchdev/switchdev.c

Switchdev was a case of RTNH_H_{EXTERNAL --> OFFLOAD}
renaming overlapping with net-next changes of various
sorts.

phy.c was a case of two changes, one adding a local
variable to a function whilst the second was removing
one.

tcp.c overlapped a deadlock fix with the addition of new tcp_info
statistic values.

macb.c involved the addition of two zynq device entries.

skbuff.h involved adding back ipv4_daddr to nf_bridge_info
whilst net-next changes put two other existing members of
that struct into a union.

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-23 01:22:35 -04:00
Eric Dumazet
f5af1f57a2 inet_hashinfo: remove bsocket counter
We no longer need bsocket atomic counter, as inet_csk_get_port()
calls bind_conflict() regardless of its value, after commit
2b05ad33e1 ("tcp: bind() fix autoselection to share ports")

This patch removes overhead of maintaining this counter and
double inet_csk_get_port() calls under pressure.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Marcelo Ricardo Leitner <mleitner@redhat.com>
Cc: Flavio Leitner <fbl@redhat.com>
Acked-by: Flavio Leitner <fbl@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-21 18:55:32 -04:00
Eric Dumazet
eb9344781a tcp: add a force_schedule argument to sk_stream_alloc_skb()
In commit 8e4d980ac2 ("tcp: fix behavior for epoll edge trigger")
we fixed a possible hang of TCP sockets under memory pressure,
by allowing sk_stream_alloc_skb() to use sk_forced_mem_schedule()
if no packet is in socket write queue.

It turns out there are other cases where we want to force memory
schedule :

tcp_fragment() & tso_fragment() need to split a big TSO packet into
two smaller ones. If we block here because of TCP memory pressure,
we can effectively block TCP socket from sending new data.
If no further ACK is coming, this hang would be permanent, and the socket
has no chance to effectively reduce its memory usage.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-21 16:56:40 -04:00
Andy Zhou
06b2c61c92 ip: remove unused function prototype
ip_do_nat() function was removed prior to kernel 3.4. Remove the
unnecessary function prototype as well.

Reported-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Andy Zhou <azhou@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-19 16:54:36 -04:00
Daniel Borkmann
492135557d tcp: add rfc3168, section 6.1.1.1. fallback
This work is a follow-up of commit f7b3bec6f5 ("net: allow setting ecn
via routing table") and adds RFC3168 section 6.1.1.1. fallback for outgoing
ECN connections. In other words, this work adds a retry with a non-ECN
setup SYN packet, as suggested by the RFC on the first timeout:

  [...] A host that receives no reply to an ECN-setup SYN within the
  normal SYN retransmission timeout interval MAY resend the SYN and
  any subsequent SYN retransmissions with CWR and ECE cleared. [...]

Schematic client-side view when assuming the server is in tcp_ecn=2 mode,
that is, Linux default since 2009 via commit 255cac91c3 ("tcp: extend
ECN sysctl to allow server-side only ECN"):

 1) Normal ECN-capable path:

    SYN ECE CWR ----->
                <----- SYN ACK ECE
            ACK ----->

 2) Path with broken middlebox, when client has fallback:

    SYN ECE CWR ----X crappy middlebox drops packet
                      (timeout, rtx)
            SYN ----->
                <----- SYN ACK
            ACK ----->

In case we would not have the fallback implemented, the middlebox drop
point would basically end up as:

    SYN ECE CWR ----X crappy middlebox drops packet
                      (timeout, rtx)
    SYN ECE CWR ----X crappy middlebox drops packet
                      (timeout, rtx)
    SYN ECE CWR ----X crappy middlebox drops packet
                      (timeout, rtx)

In any case, it's rather a smaller percentage of sites where such
additional setup latency would occur: it was found at the end of 2014 that
~56% of IPv4 and 65% of IPv6 servers of the Alexa 1 million list would
negotiate ECN (aka tcp_ecn=2 default), while 0.42% of these webservers will
fail to connect when trying to negotiate with ECN (tcp_ecn=1) due to
timeouts, which the fallback would mitigate with a slight latency trade-off.
Recent related paper on this topic:

  Brian Trammell, Mirja Kühlewind, Damiano Boppart, Iain Learmonth,
  Gorry Fairhurst, and Richard Scheffenegger:
    "Enabling Internet-Wide Deployment of Explicit Congestion Notification."
    Proc. PAM 2015, New York.
  http://ecn.ethz.ch/ecn-pam15.pdf

Thus, when net.ipv4.tcp_ecn=1 is being set, the patch will perform RFC3168,
section 6.1.1.1. fallback on timeout. For users who explicitly do not want
this, which can be the case in DC deployments, we add a
net.ipv4.tcp_ecn_fallback knob that allows for disabling the fallback.

tp->ecn_flags are not being cleared in tcp_ecn_clear_syn() on output, but
rather we let tcp_ecn_rcv_synack() take that over on input path in case a
SYN ACK ECE was delayed. Thus a spurious SYN retransmission will not prevent
ECN being negotiated eventually in that case.

Reference: https://www.ietf.org/proceedings/92/slides/slides-92-iccrg-1.pdf
Reference: https://www.ietf.org/proceedings/89/slides/slides-89-tsvarea-1.pdf
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Mirja Kühlewind <mirja.kuehlewind@tik.ee.ethz.ch>
Signed-off-by: Brian Trammell <trammell@tik.ee.ethz.ch>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Dave Taht <dave.taht@gmail.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-19 16:53:37 -04:00
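Illustration for 492135557d (not part of the commit): a sketch of the client-side flag choice shown in the diagrams of the message. The function and constant names are hypothetical; it models only which TCP flags each SYN attempt carries, not the kernel's handling of tp->ecn_flags.

```python
ECN_SETUP_SYN = frozenset({"SYN", "ECE", "CWR"})
PLAIN_SYN = frozenset({"SYN"})


def syn_flags(attempt: int, ecn_requested: bool, fallback_enabled: bool) -> frozenset:
    """Flags for SYN number `attempt` (0 is the initial SYN, >= 1 are retransmissions)."""
    if not ecn_requested:
        return PLAIN_SYN
    if attempt > 0 and fallback_enabled:
        # RFC 3168 section 6.1.1.1: after a timeout, resend with CWR and ECE cleared.
        return PLAIN_SYN
    return ECN_SETUP_SYN
```

With the fallback disabled, every retransmission keeps ECE and CWR set, which corresponds to the repeated-drop sequence shown in the message.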
Eric Dumazet
b5d721d761 inet: properly align icsk_ca_priv
tcp_illinois and upcoming tcp_cdg require 64bit alignment of
icsk_ca_priv

x86 does not care, but other architectures might.

Fixes: 05cbc0db03 ("ipv4: Create probe timer for tcp PMTU as per RFC4821")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Fan Du <fan.du@intel.com>
Acked-by: Fan Du <fan.du@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-19 11:08:00 -04:00
Andy Zhou
49d16b23cd bridge_netfilter: No ICMP packet on IPv4 fragmentation error
When bridge netfilter re-fragments an IP packet for output, all
packets that can not be re-fragmented to their original input size
should be silently discarded.

However, the current bridge netfilter output path generates an ICMP packet
with a 'size exceeded MTU' message for such packets. This is a bug.

This patch refactors the ip_fragment() API to allow two separate
use cases. The bridge netfilter use case will not
send ICMP; the routing output will, as before.

Signed-off-by: Andy Zhou <azhou@nicira.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-19 00:15:39 -04:00
Andy Zhou
5cf4228082 ipv4: introduce frag_expire_skip_icmp()
Improve readability of the logic that skips ICMP on de-fragmentation expiration.
This change will also make the logic easier to maintain when the
following patches in this series are applied.

Signed-off-by: Andy Zhou <azhou@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-19 00:15:26 -04:00
WANG Cong
de133464c9 netns: make nsid_lock per net
The spinlock is used to protect netns_ids which is per net,
so there is no need to use a global spinlock.

Cc: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-17 23:41:11 -04:00
Samudrala, Sridhar
45d4122ca7 switchdev: add support for fdb add/del/dump via switchdev_port_obj ops.
- introduce port fdb obj and generic switchdev_port_fdb_add/del/dump()
- use switchdev_port_fdb_add/del/dump in rocker/team/bonding ndo ops.
- add support for fdb obj in switchdev_port_obj_add/del/dump()
- switch rocker to implement fdb ops via switchdev_ops

v3: updated to sync with named union changes.

Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-17 22:49:09 -04:00
Eric Dumazet
b8da51ebb1 tcp: introduce tcp_under_memory_pressure()
Introduce an optimized version of sk_under_memory_pressure()
for TCP. Our intent is to use it in fast paths.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-17 22:45:48 -04:00
Eric Dumazet
a6c5ea4ccf tcp: rename sk_forced_wmem_schedule() to sk_forced_mem_schedule()
We plan to use sk_forced_wmem_schedule() in input path as well,
so make it non static and rename it.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-17 22:45:48 -04:00
Eric Dumazet
1a24e04e4b net: fix sk_mem_reclaim_partial()
The goal of sk_mem_reclaim_partial() is to ensure each socket has
one SK_MEM_QUANTUM forward allocation. This is needed both for
performance and better handling of memory pressure situations in
follow up patches.

SK_MEM_QUANTUM is currently a page, but might be reduced to 4096 bytes
as some arches have 64KB pages.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-17 22:45:48 -04:00
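Illustration for 1a24e04e4b (not part of the commit): a worked sketch of the arithmetic behind "keep one SK_MEM_QUANTUM of forward allocation". The function name, the 4096-byte quantum, and the exact rounding are assumptions chosen for illustration; the kernel implementation may differ.

```python
SK_MEM_QUANTUM = 4096  # assumed quantum size in bytes


def reclaim_partial(forward_alloc: int) -> tuple[int, int]:
    """Return (remaining forward allocation in bytes, whole quanta given back)."""
    if forward_alloc <= SK_MEM_QUANTUM:
        # Nothing beyond one quantum is held, so nothing is returned.
        return forward_alloc, 0
    # Give back whole quanta, but always leave between 1 byte and one
    # full quantum with the socket.
    quanta = (forward_alloc - 1) // SK_MEM_QUANTUM
    return forward_alloc - quanta * SK_MEM_QUANTUM, quanta
```

For example, a socket holding 10000 bytes of forward allocation would give back two quanta and keep 1808 bytes under these assumptions.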
Eric Dumazet
d53a2aa3a1 net: fix sparse error in csum_replace4()
make C=2 CF=-D__CHECK_ENDIAN__ net/ipv4/netfilter/nf_nat_l3proto_ipv4.o
  CHECK   net/ipv4/netfilter/nf_nat_l3proto_ipv4.c
include/net/checksum.h:125:64: warning: incorrect type in argument 2 (different base types)
include/net/checksum.h:125:64:    expected restricted __wsum [usertype] addend
include/net/checksum.h:125:64:    got restricted __be32 [usertype] from
include/net/checksum.h:125:71: warning: incorrect type in argument 2 (different base types)
include/net/checksum.h:125:71:    expected restricted __wsum [usertype] addend
include/net/checksum.h:125:71:    got restricted __be32 [usertype] to
include/net/checksum.h:125:64: warning: incorrect type in argument 2 (different base types)
include/net/checksum.h:125:64:    expected restricted __wsum [usertype] addend
include/net/checksum.h:125:64:    got restricted __be32 [usertype] from
include/net/checksum.h:125:71: warning: incorrect type in argument 2 (different base types)
include/net/checksum.h:125:71:    expected restricted __wsum [usertype] addend
include/net/checksum.h:125:71:    got restricted __be32 [usertype] to

Fixes: 4565af0d40 ("net: optimise csum_replace4()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-17 13:08:29 -04:00
Eric Dumazet
264ea103a7 tcp: syncookies: extend validity range
Now that we allow storing more request socks per listener, we might
hit syncookie mode less often and hit the following bug in our stack:

When we send a burst of syncookies, then exit this mode,
tcp_synq_no_recent_overflow() can return false if the ACK packets coming
from clients are coming three seconds after the end of syncookie
episode.

This is far too strong a requirement and conflicts with the rest of the
syncookie code, which allows an ACK to be aged up to 2 minutes.

Perfectly valid ACK packets are dropped just because clients might be
in a crowded wifi environment or on another planet.

So let's fix this, and also change tcp_synq_overflow() to not
dirty a cache line for every syncookie we send, as we are under attack.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Florian Westphal <fw@strlen.de>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-14 22:32:17 -04:00
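Illustration for 264ea103a7 (not part of the commit): a sketch contrasting the three-second window with the two-minute window from the message, plus a throttled timestamp update. All names, the tick rate, and the one-second throttle interval are assumptions for illustration, not values read from the patch.

```python
HZ = 1000                # assumed ticks per second
OLD_WINDOW = 3 * HZ      # the three seconds the message calls too strict
NEW_WINDOW = 2 * 60 * HZ # the two minutes allowed by the rest of the syncookie code


def no_recent_overflow(now: int, last_overflow: int, window: int) -> bool:
    """True if the last syncookie episode is older than `window` ticks."""
    return now - last_overflow > window


def note_overflow(now: int, last_overflow: int) -> int:
    """Return the timestamp to store; skip the write if the stored one is fresh."""
    # Assumption: refreshing at most once per second avoids dirtying the
    # cache line for every syncookie sent.
    return now if now - last_overflow > HZ else last_overflow
```

An ACK arriving ten seconds after the episode ended is rejected with the old window and accepted with the new one.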
John W. Linville
35d32e8fe4 geneve: move definition of geneve_hdr() to geneve.h
This is a static inline with identical definitions in multiple places...

Signed-off-by: John W. Linville <linville@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-13 15:59:13 -04:00
Jiri Pirko
59346afe7a flow_dissector: change port array into src, dst tuple
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-13 15:19:47 -04:00
Jiri Pirko
67a900cc04 flow_dissector: introduce support for Ethernet addresses
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-13 15:19:47 -04:00
Jiri Pirko
b924933cbb flow_dissector: introduce support for ipv6 addressses
So far, only hashes made out of ipv6 addresses could be dissected. This
patch introduces support for dissection of full ipv6 addresses.

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-13 15:19:47 -04:00
Jiri Pirko
c3f8eaeb6e flow_dissector: add missing header includes
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-13 15:19:47 -04:00
Jiri Pirko
06635a35d1 flow_dissect: use programable dissector in skb_flow_dissect and friends
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-13 15:19:47 -04:00
Jiri Pirko
fbff949e3b flow_dissector: introduce programable flow_dissector
Introduce dissector infrastructure which allows the user to specify which
parts of the skb to dissect.

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-13 15:19:47 -04:00
Jiri Pirko
9c684b5083 net: move __skb_get_hash function declaration to flow_dissector.h
Since the definition of the function is in flow_dissector.c, it makes
sense to have the declaration in flow_dissector.h

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-13 15:19:46 -04:00
Jiri Pirko
10b89ee43e net: move *skb_get_poff declarations into correct header
Since these functions are defined in flow_dissector.c, move header
declarations from skbuff.h into flow_dissector.h

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-13 15:19:45 -04:00
Jiri Pirko
b0a31431b4 flow_dissector: remove unused function flow_get_hlen declaration
commit 56193d1bce ("net: Add function for parsing the header length out
of linear ethernet frames") added this function declaration but it is
defined nowhere.

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-13 15:19:45 -04:00
Jiri Pirko
1bd758eb1c net: change name of flow_dissector header to match the .c file name
Also add a couple of empty lines along the way.

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-13 15:19:45 -04:00
Florian Westphal
e578d9c025 net: sched: use counter to break reclassify loops
Seems all we want here is to avoid endless 'goto reclassify' loop.
tc_classify_compat even resets this counter when something other
than TC_ACT_RECLASSIFY is returned, so this skb-counter doesn't
break hypothetical loops induced by something other than perpetual
TC_ACT_RECLASSIFY return values.

skb_act_clone is now identical to skb_clone, so just use that.

Tested with following (bogus) filter:
tc filter add dev eth0 parent ffff: \
 protocol ip u32 match u32 0 0 police rate 10Kbit burst \
 64000 mtu 1500 action reclassify

Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-13 15:08:14 -04:00
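Illustration for e578d9c025 (not part of the commit): a sketch of breaking a reclassify loop with a plain local counter. The function name, the verdict strings, and the limit of 4 are assumptions for illustration; they stand in for the kernel's TC_ACT_* return values.

```python
def classify(classify_once, packet, max_loops: int = 4) -> str:
    """Run `classify_once` until it stops asking for reclassification."""
    loops = 0
    while True:
        verdict = classify_once(packet)
        if verdict != "reclassify":
            return verdict
        loops += 1
        if loops >= max_loops:
            # Give up instead of looping forever; the packet is dropped.
            return "shot"
```

A filter like the bogus one in the message, which always returns reclassify, ends after max_loops passes instead of spinning.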
David S. Miller
b04096ff33 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Four minor merge conflicts:

1) qca_spi.c renamed the local variable used for the SPI device
   from spi_device to spi, meanwhile the spi_set_drvdata() call
   got moved further up in the probe function.

2) Two changes were both adding new members to codel params
   structure, and thus we had overlapping changes to the
   initializer function.

3) 'net' was making a fix to sk_release_kernel() which is
   completely removed in 'net-next'.

4) In net_namespace.c, the rtnl_net_fill() call for GET operations
   had the command value fixed, meanwhile 'net-next' adjusted the
   argument signature a bit.

This also matches example merge resolutions posted by Stephen
Rothwell over the past two days.

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-13 14:31:43 -04:00
Scott Feldman
42275bd8fc switchdev: don't use anonymous union on switchdev attr/obj structs
Older gcc versions (e.g.  gcc version 4.4.6) don't like anonymous unions,
which was causing build issues on the newly added switchdev attr/obj
structs.  Fix this by using a named union in the structs.

Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Reported-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-13 14:20:59 -04:00
Scott Feldman
5eb764edee switchdev: align comment with other comments in block
Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-13 12:26:27 -04:00
Ying Xue
9449c3cd90 net: make skb_dst_pop routine static
As xfrm_output_one() is the only caller of skb_dst_pop(), we should
make skb_dst_pop() localized.

Signed-off-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-12 23:19:49 -04:00
Scott Feldman
58c2cb16b1 switchdev: convert fib_ipv4_add/del over to switchdev_port_obj_add/del
The IPv4 FIB ops convert nicely to the switchdev objs and we're left with
only four switchdev ops: port get/set and port add/del.  Other objs will
follow, such as FDB.  So go ahead and convert IPv4 FIB over to switchdev
obj for consistency, anticipating more objs to come.

Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Acked-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-12 18:43:55 -04:00
Scott Feldman
8793d0a664 switchdev: add new switchdev_port_bridge_getlink
Like bridge_setlink, add switchdev wrapper to handle bridge_getlink and
call into port driver to get port attrs.  For now, only BR_LEARNING and
BR_LEARNING_SYNC are returned.  To add more, we'll probably want to break
away from ndo_dflt_bridge_getlink() and build the netlink skb directly in
the switchdev code.

Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Acked-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-12 18:43:55 -04:00
Scott Feldman
87a5dae59e switchdev: remove unused switchdev_port_bridge_dellink
Now we can remove old wrappers for dellink.

Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Acked-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-12 18:43:55 -04:00
Scott Feldman
5c34e02214 switchdev: add new switchdev_port_bridge_dellink
Same change as setlink.  Provide the wrapper op for SELF ndo_bridge_dellink
and call into the switchdev driver to delete afspec VLANs.

Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Acked-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-12 18:43:55 -04:00
Scott Feldman
e71f220b34 switchdev: remove old switchdev_port_bridge_setlink
New attr-based bridge_setlink can recurse lower devs and recover on err, so
remove old wrapper (including ndo_dflt_switchdev_port_bridge_setlink).

Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Acked-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-12 18:43:54 -04:00
Scott Feldman
6004c86718 switchdev: add bridge port flags attr
rocker: use switchdev get/set attr for bridge port flags

Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Acked-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-12 18:43:54 -04:00
Scott Feldman
6fc3016da7 switchdev: add port vlan obj
VLAN obj has flags (PVID and untagged) as well as start and end vid ranges.
The switchdev driver can optimize programming the device using the ranges.

Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Acked-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-12 18:43:53 -04:00