Conflicts:
Documentation/devicetree/bindings/net/micrel-ks8851.txt
net/core/netpoll.c
The net/core/netpoll.c conflict is a bug fix in 'net' happening
to code which is completely removed in 'net-next'.
In micrel-ks8851.txt we simply have overlapping changes.
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 812e44dd18 ("ip6mr: advertise new mfc entries via rtnl") reuses the
function ip6mr_fill_mroute() to notify mfc events.
But this function was used only for dump and thus was always setting the
flag NLM_F_MULTI, which is wrong in case of a single notification.
Libraries like libnl will wait forever for NLMSG_DONE.
CC: Thomas Graf <tgraf@suug.ch>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
In ip6_append_data_mtu(), when the xfrm mode is not tunnel(such as
transport),the ipsec header need to be added in the first fragment, so the mtu
will decrease to reserve space for it, then the second fragment come, the mtu
should be turn back, as the commit 0c1833797a
said. however, in the commit a493e60ac4bbe2e977e7129d6d8cbb0dd236be, it use
*mtu = min(*mtu, ...) to change the mtu, which lead to the new mtu is alway
equal with the first fragment's. and cannot turn back.
when I test through ping6 -c1 -s5000 $ip (mtu=1280):
...frag (0|1232) ESP(spi=0x00002000,seq=0xb), length 1232
...frag (1232|1216)
...frag (2448|1216)
...frag (3664|1216)
...frag (4880|164)
which should be:
...frag (0|1232) ESP(spi=0x00001000,seq=0x1), length 1232
...frag (1232|1232)
...frag (2464|1232)
...frag (3696|1232)
...frag (4928|116)
so delete the min() when change back the mtu.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Fixes: 75a493e60a ("ipv6: ip6_append_data_mtu did not care about pmtudisc and frag_size")
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Steffen Klassert says:
====================
One patch to rename a newly introduced struct. The rest is
the rework of the IPsec virtual tunnel interface for ipv6 to
support inter address family tunneling and namespace crossing.
1) Rename the newly introduced struct xfrm_filter to avoid a
conflict with iproute2. From Nicolas Dichtel.
2) Introduce xfrm_input_afinfo to access the address family
dependent tunnel callback functions properly.
3) Add and use a IPsec protocol multiplexer for ipv6.
4) Remove dst_entry caching. vti can lookup multiple different
dst entries, dependent of the configured xfrm states. Therefore
it does not make to cache a dst_entry.
5) Remove caching of flow informations. vti6 does not use the the
tunnel endpoint addresses to do route and xfrm lookups.
6) Update the vti6 to use its own receive hook.
7) Remove the now unused xfrm_tunnel_notifier. This was used from vti
and is replaced by the IPsec protocol multiplexer hooks.
8) Support inter address family tunneling for vti6.
9) Check if the tunnel endpoints of the xfrm state and the vti interface
are matching and return an error otherwise.
10) Enable namespace crossing for vti devices.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Replace the bh safe variant with the hard irq safe variant.
We need a hard irq safe variant to deal with netpoll transmitting
packets from hard irq context, and we need it in most if not all of
the places using the bh safe variant.
Except on 32bit uni-processor the code is exactly the same so don't
bother with a bh variant, just have a hard irq safe variant that
everyone can use.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/usb/r8152.c
drivers/net/xen-netback/netback.c
Both the r8152 and netback conflicts were simple overlapping
changes.
Signed-off-by: David S. Miller <davem@davemloft.net>
The tunnel endpoints of the xfrm_state we got from the xfrm_lookup
must match the tunnel endpoints of the vti interface. This patch
ensures this matching.
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
With this patch we can tunnel ipv4 traffic via a vti6
interface. A vti6 interface can now have an ipv4 address
and ipv4 traffic can be routed via a vti6 interface.
The resulting traffic is xfrm transformed and tunneled
through ipv6 if matching IPsec policies and states are
present.
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
This was used from vti and is replaced by the IPsec protocol
multiplexer hooks. It is now unused, so remove it.
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
With this patch, vti6 uses the IPsec protocol multiplexer to
register its own receive side hooks for ESP, AH and IPCOMP.
Vti6 now does the following on receive side:
1. Do an input policy check for the IPsec packet we received.
This is required because this packet could be already
prosecces by IPsec, so an inbuond policy check is needed.
2. Mark the packet with the i_key. The policy and the state
must match this key now. Policy and state belong to the vti
namespace and policy enforcement is done at the further layers.
3. Call the generic xfrm layer to do decryption and decapsulation.
4. Wait for a callback from the xfrm layer to properly clean the
skb to not leak informations on namespace transitions and
update the device statistics.
On transmit side:
1. Mark the packet with the o_key. The policy and the state
must match this key now.
2. Do a xfrm_lookup on the original packet with the mark applied.
3. Check if we got an IPsec route.
4. Clean the skb to not leak informations on namespace
transitions.
5. Attach the dst_enty we got from the xfrm_lookup to the skb.
6. Call dst_output to do the IPsec processing.
7. Do the device statistics.
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Unlike ip6_tunnel, vti6 does not use the the tunnel
endpoint addresses to do route and xfrm lookups.
So no need to cache the flow informations. It also
does not make sense to calculate the mtu based on
such flow informations, so remove this too.
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Unlike ip6_tunnel, vti6 can lookup multiple different dst entries,
dependent of the configured xfrm states. Therefore it does not make
sense to cache a dst_entry.
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
This patch adds an IPsec protocol multiplexer for ipv6. With
this it is possible to add alternative protocol handlers, as
needed for IPsec virtual tunnel interfaces.
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
tmp_prefered_lft is an offset to ifp->tstamp, not now. Therefore
age needs to be added to the condition.
Age calculation in ipv6_create_tempaddr is different from the one
in addrconf_verify and doesn't consider ADDRCONF_TIMER_FUZZ_MINUS.
This can cause age in ipv6_create_tempaddr to be less than the one
in addrconf_verify and therefore unnecessary temporary address to
be generated.
Use age calculation as in addrconf_modify to avoid this.
Signed-off-by: Heiner Kallweit <heiner.kallweit@web.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Packets which have L2 address different from ours should be
already filtered before entering into ip6_forward().
Perform that check at the beginning to avoid processing such packets.
Signed-off-by: Li RongQing <roy.qing.li@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
DST_NOCOUNT should only be used if an authorized user adds routes
locally. In case of routes which are added on behalf of router
advertisments this flag must not get used as it allows an unlimited
number of routes getting added remotely.
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Without this fix, ipv6_exthdrs_offload_init doesn't register IPPROTO_DSTOPTS
offload, but returns 0 (as the IPPROTO_ROUTING registration actually succeeds).
This then causes the ipv6_gso_segment to drop IPv6 packets with IPPROTO_DSTOPTS
header.
The issue detected and the fix verified by running MS HCK Offload LSO test on
top of QEMU Windows guests, as this test sends IPv6 packets with
IPPROTO_DSTOPTS.
Signed-off-by: Anton Nayshtut <anton@swortex.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit e688a60480 ("net: introduce DST_NOPEER dst flag") introduced
DST_NOPEER because because of crashes in ipv6_select_ident called from
udp6_ufo_fragment.
Since commit 916e4cf46d ("ipv6: reuse ip6_frag_id from
ip6_ufo_append_data") we don't call ipv6_select_ident any more from
ip6_ufo_append_data, thus this flag lost its purpose and can be removed.
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/wireless/ath/ath9k/recv.c
drivers/net/wireless/mwifiex/pcie.c
net/ipv6/sit.c
The SIT driver conflict consists of a bug fix being done by hand
in 'net' (missing u64_stats_init()) whilst in 'net-next' a helper
was created (netdev_alloc_pcpu_stats()) which takes care of this.
The two wireless conflicts were overlapping changes.
Signed-off-by: David S. Miller <davem@davemloft.net>
Add the following snmp stats:
TCPFastOpenActiveFail: Fast Open attempts (SYN/data) failed beacuse
the remote does not accept it or the attempts timed out.
TCPSynRetrans: number of SYN and SYN/ACK retransmits to break down
retransmissions into SYN, fast-retransmits, timeout retransmits, etc.
TCPOrigDataSent: number of outgoing packets with original data (excluding
retransmission but including data-in-SYN). This counter is different from
TcpOutSegs because TcpOutSegs also tracks pure ACKs. TCPOrigDataSent is
more useful to track the TCP retransmission rate.
Change TCPFastOpenActive to track only successful Fast Opens to be symmetric to
TCPFastOpenPassive.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Nandita Dukkipati <nanditad@google.com>
Signed-off-by: Lawrence Brakmo <brakmo@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The commit 9195bb8e38 ("ipv6: improve
ipv6_find_hdr() to skip empty routing headers") broke ipv6_find_hdr().
When a target is specified like IPPROTO_ICMPV6 ipv6_find_hdr()
returns -ENOENT when it's found, not the header as expected.
A part of IPVS is broken and possible also nft_exthdr_eval().
When target is -1 which it is most cases, it works.
This patch exits the do while loop if the specific header is found
so the nexthdr could be returned as expected.
Reported-by: Art -kwaak- van Breemen <ard@telegraafnet.nl>
Signed-off-by: Hans Schillstrom <hans@schillstrom.com>
CC:Ansis Atteka <aatteka@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Steffen Klassert says:
====================
1) Build fix for ip_vti when NET_IP_TUNNEL is not set.
We need this set to have ip_tunnel_get_stats64()
available.
2) Fix a NULL pointer dereference on sub policy usage.
We try to access a xfrm_state from the wrong array.
3) Take xfrm_state_lock in xfrm_migrate_state_find(),
we need it to traverse through the state lists.
4) Clone states properly on migration, otherwise we crash
when we migrate a state with aead algorithm attached.
5) Fix unlink race when between thread context and timer
when policies are deleted.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
This option has the same semantic as IP_PMTUDISC_OMIT for IPv4 which
got recently introduced. It doesn't honor the path mtu discovered by the
host but in contrary to IPV6_PMTUDISC_INTERFACE allows the generation of
fragments if the packet size exceeds the MTU of the outgoing interface
MTU.
Fixes: 93b36cf342 ("ipv6: support IPV6_PMTU_INTERFACE on sockets")
Cc: Florian Weimer <fweimer@redhat.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
These info messages are rather pointless without any means to identify
the source of the bogus packets. Logging the src and dst addresses and
ports may help a bit.
Cc: Joe Perches <joe@perches.com>
Signed-off-by: Bjørn Mork <bjorn@mork.no>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently the UFO fragmentation process does not correctly handle inner
UDP frames.
(The following tcpdumps are captured on the parent interface with ufo
disabled while tunnel has ufo enabled, 2000 bytes payload, mtu 1280,
both sit device):
IPv6:
16:39:10.031613 IP (tos 0x0, ttl 64, id 3208, offset 0, flags [DF], proto IPv6 (41), length 1300)
192.168.122.151 > 1.1.1.1: IP6 (hlim 64, next-header Fragment (44) payload length: 1240) 2001::1 > 2001::8: frag (0x00000001:0|1232) 44883 > distinct: UDP, length 2000
16:39:10.031709 IP (tos 0x0, ttl 64, id 3209, offset 0, flags [DF], proto IPv6 (41), length 844)
192.168.122.151 > 1.1.1.1: IP6 (hlim 64, next-header Fragment (44) payload length: 784) 2001::1 > 2001::8: frag (0x00000001:0|776) 58979 > 46366: UDP, length 5471
We can see that fragmentation header offset is not correctly updated.
(fragmentation id handling is corrected by 916e4cf46d ("ipv6: reuse
ip6_frag_id from ip6_ufo_append_data")).
IPv4:
16:39:57.737761 IP (tos 0x0, ttl 64, id 3209, offset 0, flags [DF], proto IPIP (4), length 1296)
192.168.122.151 > 1.1.1.1: IP (tos 0x0, ttl 64, id 57034, offset 0, flags [none], proto UDP (17), length 1276)
192.168.99.1.35961 > 192.168.99.2.distinct: UDP, length 2000
16:39:57.738028 IP (tos 0x0, ttl 64, id 3210, offset 0, flags [DF], proto IPIP (4), length 792)
192.168.122.151 > 1.1.1.1: IP (tos 0x0, ttl 64, id 57035, offset 0, flags [none], proto UDP (17), length 772)
192.168.99.1.13531 > 192.168.99.2.20653: UDP, length 51109
In this case fragmentation id is incremented and offset is not updated.
First, I aligned inet_gso_segment and ipv6_gso_segment:
* align naming of flags
* ipv6_gso_segment: setting skb->encapsulation is unnecessary, as we
always ensure that the state of this flag is left untouched when
returning from upper gso segmenation function
* ipv6_gso_segment: move skb_reset_inner_headers below updating the
fragmentation header data, we don't care for updating fragmentation
header data
* remove currently unneeded comment indicating skb->encapsulation might
get changed by upper gso_segment callback (gre and udp-tunnel reset
encapsulation after segmentation on each fragment)
If we encounter an IPIP or SIT gso skb we now check for the protocol ==
IPPROTO_UDP and that we at least have already traversed another ip(6)
protocol header.
The reason why we have to special case GSO_IPIP and GSO_SIT is that
we reset skb->encapsulation to 0 while skb_mac_gso_segment the inner
protocol of GSO_UDP_TUNNEL or GSO_GRE packets.
Reported-by: Wolfgang Walter <linux@stwm.de>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Tom Herbert <therbert@google.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Steffen Klassert says:
====================
1) Introduce skb_to_sgvec_nomark function to add further data to the sg list
without calling sg_unmark_end first. Needed to add extended sequence
number informations. From Fan Du.
2) Add IPsec extended sequence numbers support to the Authentication Header
protocol for ipv4 and ipv6. From Fan Du.
3) Make the IPsec flowcache namespace aware, from Fan Du.
4) Avoid creating temporary SA for every packet when no key manager is
registered. From Horia Geanta.
5) Support filtering of SA dumps to show only the SAs that match a
given filter. From Nicolas Dichtel.
6) Remove caching of xfrm_policy_sk_bundles. The cached socket policy bundles
are never used, instead we create a new cache entry whenever xfrm_lookup()
is called on a socket policy. Most protocols cache the used routes to the
socket, so this caching is not needed.
7) Fix a forgotten SADB_X_EXT_FILTER length check in pfkey, from Nicolas
Dichtel.
8) Cleanup error handling of xfrm_state_clone.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently we generate a new fragmentation id on UFO segmentation. It
is pretty hairy to identify the correct net namespace and dst there.
Especially tunnels use IFF_XMIT_DST_RELEASE and thus have no skb_dst
available at all.
This causes unreliable or very predictable ipv6 fragmentation id
generation while segmentation.
Luckily we already have pregenerated the ip6_frag_id in
ip6_ufo_append_data and can use it here.
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Bug introduced by commit 7d442fab0a ("ipv4: Cache dst in tunnels").
Because sit code does not call ip_tunnel_init(), the dst_cache was not
initialized.
CC: Tom Herbert <therbert@google.com>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since commit 469bdcefdc ip6_vti uses ip_tunnel_get_stats64(),
so we need to select NET_IP_TUNNEL to have this function available.
Fixes: 469bdcefdc ("ipv6: fix the use of pcpu_tstats in ip6_vti.c")
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following patchset contains Netfilter fixes for your net tree,
they are:
* Fix nf_trace in nftables if XT_TRACE=n, from Florian Westphal.
* Don't use the fast payload operation in nf_tables if the length is
not power of 2 or it is not aligned, from Nikolay Aleksandrov.
* Fix missing break statement the inet flavour of nft_reject, which
results in evaluating IPv4 packets with the IPv6 evaluation routine,
from Patrick McHardy.
* Fix wrong kconfig symbol in nft_meta to match the routing realm,
from Paul Bolle.
* Allocate the NAT null binding when creating new conntracks via
ctnetlink to avoid that several packets race at initializing the
the conntrack NAT extension, original patch from Florian Westphal,
revisited version from me.
* Fix DNAT handling in the snmp NAT helper, the same handling was being
done for SNAT and DNAT and 2.4 already contains that fix, from
Francois-Xavier Le Bail.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/bonding/bond_3ad.h
drivers/net/bonding/bond_main.c
Two minor conflicts in bonding, both of which were overlapping
changes.
Signed-off-by: David S. Miller <davem@davemloft.net>
These include are here since kernel 2.2.7, but probably never used.
Signed-off-by: Florent Fourcot <florent.fourcot@enst-bretagne.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
This bug was reported by Steinar H. Gunderson and was introduced by commit
f7cb888633 ("sit/gre6: don't try to add the same route two times").
root@morgental:~# ip tunnel add foo mode gre remote 1.2.3.4 ttl 64
root@morgental:~# ip link set foo up mtu 1468
root@morgental:~# ip -6 route show dev foo
fe80::/64 proto kernel metric 256
but after the above commit, no such route shows up.
There is no link local route because dev->dev_addr is 0 (because local ipv4
address is 0), hence no link local address is configured.
In this scenario, the link local address is added manually: 'ip -6 addr add
fe80::1 dev foo' and because prefix is /128, no link local route is added by the
kernel.
Even if the right things to do is to add the link local address with a /64
prefix, we need to restore the previous behavior to avoid breaking userpace.
Reported-by: Steinar H. Gunderson <sesse@samfundet.no>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When using nftables with CONFIG_NETFILTER_XT_TARGET_TRACE=n, we get
lots of "TRACE: filter:output:policy:1 IN=..." warnings as several
places will leave skb->nf_trace uninitialised.
Unlike iptables tracing functionality is not conditional in nftables,
so always copy/zero nf_trace setting when nftables is enabled.
Move this into __nf_copy() helper.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
There are many drivers calling alloc_percpu() to allocate pcpu stats
and then initializing ->syncp. So just introduce a helper function for them.
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Marcelo Ricardo Leitner reported problems when the forwarding link path
has a lower mtu than the incoming one if the inbound interface supports GRO.
Given:
Host <mtu1500> R1 <mtu1200> R2
Host sends tcp stream which is routed via R1 and R2. R1 performs GRO.
In this case, the kernel will fail to send ICMP fragmentation needed
messages (or pkt too big for ipv6), as GSO packets currently bypass dstmtu
checks in forward path. Instead, Linux tries to send out packets exceeding
the mtu.
When locking route MTU on Host (i.e., no ipv4 DF bit set), R1 does
not fragment the packets when forwarding, and again tries to send out
packets exceeding R1-R2 link mtu.
This alters the forwarding dstmtu checks to take the individual gso
segment lengths into account.
For ipv6, we send out pkt too big error for gso if the individual
segments are too big.
For ipv4, we either send icmp fragmentation needed, or, if the DF bit
is not set, perform software segmentation and let the output path
create fragments when the packet is leaving the machine.
It is not 100% correct as the error message will contain the headers of
the GRO skb instead of the original/segmented one, but it seems to
work fine in my (limited) tests.
Eric Dumazet suggested to simply shrink mss via ->gso_size to avoid
sofware segmentation.
However it turns out that skb_segment() assumes skb nr_frags is related
to mss size so we would BUG there. I don't want to mess with it considering
Herbert and Eric disagree on what the correct behavior should be.
Hannes Frederic Sowa notes that when we would shrink gso_size
skb_segment would then also need to deal with the case where
SKB_MAX_FRAGS would be exceeded.
This uses sofware segmentation in the forward path when we hit ipv4
non-DF packets and the outgoing link mtu is too small. Its not perfect,
but given the lack of bug reports wrt. GRO fwd being broken this is a
rare case anyway. Also its not like this could not be improved later
once the dust settles.
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Reported-by: Marcelo Ricardo Leitner <mleitner@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch add esn support for AH input stage by attaching upper 32bits
sequence number right after packet payload as specified by RFC 4302.
Then the ICV value will guard upper 32bits sequence number as well when
packet going in.
Signed-off-by: Fan Du <fan.du@windriver.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
This patch add esn support for AH output stage by attaching upper 32bits
sequence number right after packet payload as specified by RFC 4302.
Then the ICV value will guard upper 32bits sequence number as well when
packet going out.
Signed-off-by: Fan Du <fan.du@windriver.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
The patch 446fab5933 ("ipv6: enable anycast addresses
as source addresses in ICMPv6 error messages") causes an Oops when pinging a not
set up IPv6 peer on a sit tunnel.
The problem is that ipv6_anycast_destination() uses unconditionally skb_dst(skb),
which is NULL in this case.
The solution is to use instead the ipv6_chk_acast_addr_src() function.
Here are the steps to reproduce it:
modprobe sit
ip link add sit1 type sit remote 10.16.0.121 local 10.16.0.249
ip l s sit1 up
ip -6 a a dev sit1 2001🔢:123 remote 2001🔢:121
ping6 2001🔢:121
Reported-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Tested-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: Francois-Xavier Le Bail <fx.lebail@yahoo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a reject module for NFPROTO_INET. It does nothing but dispatch
to the AF-specific modules based on the hook family.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Currently the nft_reject module depends on symbols from ipv6. This is
wrong since no generic module should force IPv6 support to be loaded.
Split up the module into AF-specific and a generic part.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
I see a memory leak when using a transparent HTTP proxy using TPROXY
together with TCP early demux and Kernel v3.8.13.15 (Ubuntu stable):
unreferenced object 0xffff88008cba4a40 (size 1696):
comm "softirq", pid 0, jiffies 4294944115 (age 8907.520s)
hex dump (first 32 bytes):
0a e0 20 6a 40 04 1b 37 92 be 32 e2 e8 b4 00 00 .. j@..7..2.....
02 00 07 01 00 00 00 00 00 00 00 00 00 00 00 00 ................
backtrace:
[<ffffffff810b710a>] kmem_cache_alloc+0xad/0xb9
[<ffffffff81270185>] sk_prot_alloc+0x29/0xc5
[<ffffffff812702cf>] sk_clone_lock+0x14/0x283
[<ffffffff812aaf3a>] inet_csk_clone_lock+0xf/0x7b
[<ffffffff8129a893>] netlink_broadcast+0x14/0x16
[<ffffffff812c1573>] tcp_create_openreq_child+0x1b/0x4c3
[<ffffffff812c033e>] tcp_v4_syn_recv_sock+0x38/0x25d
[<ffffffff812c13e4>] tcp_check_req+0x25c/0x3d0
[<ffffffff812bf87a>] tcp_v4_do_rcv+0x287/0x40e
[<ffffffff812a08a7>] ip_route_input_noref+0x843/0xa55
[<ffffffff812bfeca>] tcp_v4_rcv+0x4c9/0x725
[<ffffffff812a26f4>] ip_local_deliver_finish+0xe9/0x154
[<ffffffff8127a927>] __netif_receive_skb+0x4b2/0x514
[<ffffffff8127aa77>] process_backlog+0xee/0x1c5
[<ffffffff8127c949>] net_rx_action+0xa7/0x200
[<ffffffff81209d86>] add_interrupt_randomness+0x39/0x157
But there are many more, resulting in the machine going OOM after some
days.
From looking at the TPROXY code, and with help from Florian, I see
that the memory leak is introduced in tcp_v4_early_demux():
void tcp_v4_early_demux(struct sk_buff *skb)
{
/* ... */
iph = ip_hdr(skb);
th = tcp_hdr(skb);
if (th->doff < sizeof(struct tcphdr) / 4)
return;
sk = __inet_lookup_established(dev_net(skb->dev), &tcp_hashinfo,
iph->saddr, th->source,
iph->daddr, ntohs(th->dest),
skb->skb_iif);
if (sk) {
skb->sk = sk;
where the socket is assigned unconditionally to skb->sk, also bumping
the refcnt on it. This is problematic, because in our case the skb
has already a socket assigned in the TPROXY target. This then results
in the leak I see.
The very same issue seems to be with IPv6, but haven't tested.
Reviewed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Holger Eitzenberger <holger@eitzenberger.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit 25fb6ca4ed
"net IPv6 : Fix broken IPv6 routing table after loopback down-up"
allocates addrconf router for ipv6 address when lo device up.
but commit a881ae1f62
"ipv6:don't call addrconf_dst_alloc again when enable lo" breaks
this behavior.
Since the addrconf router is moved to the garbage list when
lo device down, we should release this router and rellocate
a new one for ipv6 address when lo device up.
This patch solves bug 67951 on bugzilla
https://bugzilla.kernel.org/show_bug.cgi?id=67951
change from v1:
use ip6_rt_put to repleace ip6_del_rt, thanks Hannes!
change code style, suggested by Sergei.
CC: Sabrina Dubroca <sd@queasysnail.net>
CC: Hannes Frederic Sowa <hannes@stressinduktion.org>
Reported-by: Weilong Chen <chenweilong@huawei.com>
Signed-off-by: Weilong Chen <chenweilong@huawei.com>
Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This change allows to consider an anycast address valid as source address
when given via an IPV6_PKTINFO or IPV6_2292PKTINFO ancillary data item.
So, when sending a datagram with ancillary data, the unicast and anycast
addresses are handled in the same way.
- Adds ipv6_chk_acast_addr_src() to check if an anycast address is link-local
on given interface or is global.
- Uses it in ip6_datagram_send_ctl().
Signed-off-by: Francois-Xavier Le Bail <fx.lebail@yahoo.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Some ipv6 protocols cannot handle ipv4 addresses, so we must not allow
connecting and binding to them. sendmsg logic does already check msg->name
for this but must trust already connected sockets which could be set up
for connection to ipv4 address family.
Per-socket flag ipv6only is of no use here, as it is under users control
by setsockopt.
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
- Uses ipv6_anycast_destination() in icmp6_send().
Suggested-by: Bill Fink <billfink@mindspring.com>
Signed-off-by: Francois-Xavier Le Bail <fx.lebail@yahoo.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
As tcp_rcv_state_process() has already calls tcp_mtup_init() for non-fastopen
sock, we can delete the redundant calls of tcp_mtup_init() in
tcp_{v4,v6}_syn_recv_sock().
Signed-off-by: Weiping Pan <panweiping3@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ipv6_link_dev_addr sorts newly added addresses by scope in
ifp->addr_list. Smaller scope addresses are added to the tail of the
list. Use this fact to iterate in reverse over addr_list and break out
as soon as a higher scoped one showes up, so we can spare some cycles
on machines with lot's of addresses.
The ordering of the addresses is not relevant and we are more likely to
get the eui64 generated address with this change anyway.
Suggested-by: Brian Haley <brian.haley@hp.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
We currently don't report IPV6_RECVPKTINFO in cmsg access ancillary data
for IPv4 datagrams on IPv6 sockets.
This patch splits the ip6_datagram_recv_ctl into two functions, one
which handles both protocol families, AF_INET and AF_INET6, while the
ip6_datagram_recv_specific_ctl only handles IPv6 cmsg data.
ip6_datagram_recv_*_ctl never reported back any errors, so we can make
them return void. Also provide a helper for protocols which don't offer dual
personality to further use ip6_datagram_recv_ctl, which is exported to
modules.
I needed to shuffle the code for ping around a bit to make it easier to
implement dual personality for ping ipv6 sockets in future.
Reported-by: Gert Doering <gert@space.net>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
With the introduction of IPV6_FL_F_REFLECT, there is no guarantee of
flow label unicity. This patch introduces a new sysctl to protect the old
behaviour, enable by default.
Changelog of V3:
* rename ip6_flowlabel_consistency to flowlabel_consistency
* use net_info_ratelimited()
* checkpatch cleanups
Signed-off-by: Florent Fourcot <florent.fourcot@enst-bretagne.fr>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This information is already available via IPV6_FLOWINFO
of IPV6_2292PKTOPTIONS, and them a filtering to get the flow label
information. But it is probably logical and easier for users to add this
here, and to control both sent/received flow label values with the
IPV6_FLOWLABEL_MGR option.
Signed-off-by: Florent Fourcot <florent.fourcot@enst-bretagne.fr>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
With this option, the socket will reply with the flow label value read
on received packets.
The goal is to have a connection with the same flow label in both
direction of the communication.
Changelog of V4:
* Do not erase the flow label on the listening socket. Use pktopts to
store the received value
Signed-off-by: Florent Fourcot <florent.fourcot@enst-bretagne.fr>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is a follow-up patch to f3d3342602 ("net: rework recvmsg
handler msg_name and msg_namelen logic").
DECLARE_SOCKADDR validates that the structure we use for writing the
name information to is not larger than the buffer which is reserved
for msg->msg_name (which is 128 bytes). Also use DECLARE_SOCKADDR
consistently in sendmsg code paths.
Signed-off-by: Steffen Hurrle <steffen@hurrle.net>
Suggested-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
net/ipv4/tcp_metrics.c
Overlapping changes between the "don't create two tcp metrics objects
with the same key" race fix in net and the addition of the destination
address in the lookup key in net-next.
Minor overlapping changes in bnx2x driver.
Signed-off-by: David S. Miller <davem@davemloft.net>
The RFC 3810 defines two type of messages for multicast
listeners. The "Current State Report" message, as the name
implies, refreshes the *current* state to the querier.
Since the querier sends Query messages periodically, there
is no need to retransmit the report.
On the other hand, any change should be reported immediately
using "State Change Report" messages. Since it's an event
triggered by a change and that it can be affected by packet
loss, the rfc states it should be retransmitted [RobVar] times
to make sure routers will receive timely.
Currently, we are sending "Current State Reports" after
DAD is completed. Before that, we send messages using
unspecified address (::) which should be silently discarded
by routers.
This patch changes to send "State Change Report" messages
after DAD is completed fixing the behavior to be RFC compliant
and also to pass TAHI IPv6 testsuite.
Signed-off-by: Flavio Leitner <fbl@redhat.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
In commit 1ec047eb47 ("ipv6: introduce per-interface counter for
dad-completed ipv6 addresses") I build the detection of the first
operational link-local address much to complex. Additionally this code
now has a race condition.
Replace it with a much simpler variant, which just scans the address
list when duplicate address detection completes, to check if this is
the first valid link local address and send RS and MLD reports then.
Fixes: 1ec047eb47 ("ipv6: introduce per-interface counter for dad-completed ipv6 addresses")
Reported-by: Jiri Pirko <jiri@resnulli.us>
Cc: Flavio Leitner <fbl@redhat.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: Flavio Leitner <fbl@redhat.com>
Acked-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch is following the commit b903d324be (ipv6: tcp: fix TCLASS
value in ACK messages sent from TIME_WAIT).
For the same reason than tclass, we have to store the flow label in the
inet_timewait_sock to provide consistency of flow label on the last ACK.
Signed-off-by: Florent Fourcot <florent.fourcot@enst-bretagne.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
Refactor the deletion/update of prefix routes when removing an
address. Now also consider IFA_F_NOPREFIXROUTE and if there is an address
present with this flag, to not cleanup the route. Instead, assume
that userspace is taking care of this route.
Also perform the same cleanup, when userspace changes an existing address
to add NOPREFIXROUTE (to an address that didn't have this flag). This is
done because when the address was added, a prefix route was created for it.
Since the user now wants to handle this route by himself, we cleanup this
route.
This cleanup of the route is not totally robust. There is no guarantee,
that the route we are about to delete was really the one added by the
kernel. This behavior does not change by the patch, and in practice it
should work just fine.
Signed-off-by: Thomas Haller <thaller@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When adding/modifying an IPv6 address, the userspace application needs
a way to suppress adding a prefix route. This is for example relevant
together with IFA_F_MANAGERTEMPADDR, where userspace creates autoconf
generated addresses, but depending on on-link, no route for the
prefix should be added.
Signed-off-by: Thomas Haller <thaller@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Two places defined IPV6_TCLASS_SHIFT, so we should move it into ipv6.h,
and use this macro as possible. And define ip6_tclass helper to return
tclass
Signed-off-by: Li RongQing <roy.qing.li@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This change move anycast_src_echo_reply sysctl with other ipv6 sysctls.
Suggested-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Francois-Xavier Le Bail <fx.lebail@yahoo.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Bob Falken reported that after 4G packets, multicast forwarding stopped
working. This was because of a rule reference counter overflow which
freed the rule as soon as the overflow happend.
This patch solves this by adding the FIB_LOOKUP_NOREF flag to
fib_rules_lookup calls. This is safe even from non-rcu locked sections
as in this case the flag only implies not taking a reference to the rule,
which we don't need at all.
Rules only hold references to the namespace, which are guaranteed to be
available during the call of the non-rcu protected function reg_vif_xmit
because of the interface reference which itself holds a reference to
the net namespace.
Fixes: f0ad0860d0 ("ipv4: ipmr: support multiple tables")
Fixes: d1db275dd3 ("ipv6: ip6mr: support multiple tables")
Reported-by: Bob Falken <NetFestivalHaveFun@gmx.com>
Cc: Patrick McHardy <kaber@trash.net>
Cc: Thomas Graf <tgraf@suug.ch>
Cc: Julian Anastasov <ja@ssi.bg>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch removes the net_random and net_srandom macros and replaces
them with direct calls to the prandom ones. As new commits only seem to
use prandom_u32 there is no use to keep them around.
This change makes it easier to grep for users of prandom_u32.
Signed-off-by: Aruna-Hewapathirane <aruna.hewapathirane@gmail.com>
Suggested-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Suggested-by: Simon Schneider <simon-schneider@gmx.net>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
In the IPv6 forwarding path we are only concerend about the outgoing
interface MTU, but also respect locked MTUs on routes. Tunnel provider
or IPSEC already have to recheck and if needed send PtB notifications
to the sending host in case the data does not fit into the packet with
added headers (we only know the final header sizes there, while also
using path MTU information).
The reason for this change is, that path MTU information can be injected
into the kernel via e.g. icmp_err protocol handler without verification
of local sockets. As such, this could cause the IPv6 forwarding path to
wrongfully emit Packet-too-Big errors and drop IPv6 packets.
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: David Miller <davem@davemloft.net>
Cc: John Heffner <johnwheffner@gmail.com>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
In the past the IFA_PERMANENT flag indicated, that the valid and preferred
lifetime where ignored. Since change fad8da3e08 ("ipv6 addrconf: fix
preferred lifetime state-changing behavior while valid_lft is infinity")
we honour at least the preferred lifetime on those addresses. As such
the valid lifetime gets recalculated and updated to 0.
If loopback address is added manually this problem does not occur.
Also if NetworkManager manages IPv6, those addresses will get added via
inet6_rtm_newaddr and thus will have a correct lifetime, too.
Reported-by: François-Xavier Le Bail <fx.lebail@yahoo.com>
Reported-by: Damien Wyart <damien.wyart@gmail.com>
Fixes: fad8da3e08 ("ipv6 addrconf: fix preferred lifetime state-changing behavior while valid_lft is infinity")
Cc: Yasushi Asano <yasushi.asano@jp.fujitsu.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
nf_tables updates for net-next
The following patchset contains the following nf_tables updates,
mostly updates from Patrick McHardy, they are:
* Add the "inet" table and filter chain type for this new netfilter
family: NFPROTO_INET. This special table/chain allows IPv4 and IPv6
rules, this should help to simplify the burden in the administration
of dual stack firewalls. This also includes several patches to prepare
the infrastructure for this new table and a new meta extension to
match the layer 3 and 4 protocol numbers, from Patrick McHardy.
* Load both IPv4 and IPv6 conntrack modules in nft_ct if the rule is used
in NFPROTO_INET, as we don't certainly know which one would be used,
also from Patrick McHardy.
* Do not allow to delete a table that contains sets, otherwise these
sets become orphan, from Patrick McHardy.
* Hold a reference to the corresponding nf_tables family module when
creating a table of that family type, to avoid the module deletion
when in use, from Patrick McHardy.
* Update chain counters before setting the chain policy to ensure that
we don't leave the chain in inconsistent state in case of errors (aka.
restore chain atomicity). This also fixes a possible leak if it fails
to allocate the chain counters if no counters are passed to be restored,
from Patrick McHardy.
* Don't check for overflows in the table counter if we are just renaming
a chain, from Patrick McHardy.
* Replay the netlink request after dropping the nfnl lock to load the
module that supports provides a chain type, from Patrick.
* Fix chain type module references, from Patrick.
* Several cleanups, function renames, constification and code
refactorizations also from Patrick McHardy.
* Add support to set the connmark, this can be used to set it based on
the meta mark (similar feature to -j CONNMARK --restore), from
Kristian Evensen.
* A couple of fixes to the recently added meta/set support and nft_reject,
and fix missing chain type unregistration if we fail to register our
the family table/filter chain type, from myself.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
We don't encode argument types into function names and since besides
nft_do_chain() there are only AF-specific versions, there is no risk
of confusion.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Minor nf_chain_type cleanups:
- reorder struct to plug a hoe
- rename struct module member to "owner" for consistency
- rename nf_hookfn array to "hooks" for consistency
- reorder initializers for better readability
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
In some cases we neither take a reference to the AF info nor to the
chain type, allowing the module to be unloaded while in use.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This patch adds a new table family and a new filter chain that you can
use to attach IPv4 and IPv6 rules. This should help to simplify
rule-set maintainance in dual-stack setups.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Add support to register chains to multiple hooks for different address
families for mixed IPv4/IPv6 tables.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Currently the AF-specific hook functions override the chain-type specific
hook functions. That doesn't make too much sense since the chain types
are a special case of the AF-specific hooks.
Make the AF-specific hook functions the default and make the optional
chain type hooks override them.
As a side effect, the necessary code restructuring reduces the code size,
f.i. in case of nf_tables_ipv4.o:
nf_tables_ipv4_init_net | -24
nft_do_chain_ipv4 | -113
2 functions changed, 137 bytes removed, diff: -137
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This patch built on top of Commit 299603e837
("net-gro: Prepare GRO stack for the upcoming tunneling support") to add
the support of the standard GRE (RFC1701/RFC2784/RFC2890) to the GRO
stack. It also serves as an example for supporting other encapsulation
protocols in the GRO stack in the future.
The patch supports version 0 and all the flags (key, csum, seq#) but
will flush any pkt with the S (seq#) flag. This is because the S flag
is not support by GSO, and a GRO pkt may end up in the forwarding path,
thus requiring GSO support to break it up correctly.
Currently the "packet_offload" structure only contains L3 (ETH_P_IP/
ETH_P_IPV6) GRO offload support so the encapped pkts are limited to
IP pkts (i.e., w/o L2 hdr). But support for other protocol type can
be easily added, so is the support for GRE variations like NVGRE.
The patch also support csum offload. Specifically if the csum flag is on
and the h/w is capable of checksumming the payload (CHECKSUM_COMPLETE),
the code will take advantage of the csum computed by the h/w when
validating the GRE csum.
Note that commit 60769a5dcd "ipv4: gre:
add GRO capability" already introduces GRO capability to IPv4 GRE
tunnels, using the gro_cells infrastructure. But GRO is done after
GRE hdr has been removed (i.e., decapped). The following patch applies
GRO when pkts first come in (before hitting the GRE tunnel code). There
is some performance advantage for applying GRO as early as possible.
Also this approach is transparent to other subsystem like Open vSwitch
where GRE decap is handled outside of the IP stack hence making it
harder for the gro_cells stuff to apply. On the other hand, some NICs
are still not capable of hashing on the inner hdr of a GRE pkt (RSS).
In that case the GRO processing of pkts from the same remote host will
all happen on the same CPU and the performance may be suboptimal.
I'm including some rough preliminary performance numbers below. Note
that the performance will be highly dependent on traffic load, mix as
usual. Moreover it also depends on NIC offload features hence the
following is by no means a comprehesive study. Local testing and tuning
will be needed to decide the best setting.
All tests spawned 50 copies of netperf TCP_STREAM and ran for 30 secs.
(super_netperf 50 -H 192.168.1.18 -l 30)
An IP GRE tunnel with only the key flag on (e.g., ip tunnel add gre1
mode gre local 10.246.17.18 remote 10.246.17.17 ttl 255 key 123)
is configured.
The GRO support for pkts AFTER decap are controlled through the device
feature of the GRE device (e.g., ethtool -K gre1 gro on/off).
1.1 ethtool -K gre1 gro off; ethtool -K eth0 gro off
thruput: 9.16Gbps
CPU utilization: 19%
1.2 ethtool -K gre1 gro on; ethtool -K eth0 gro off
thruput: 5.9Gbps
CPU utilization: 15%
1.3 ethtool -K gre1 gro off; ethtool -K eth0 gro on
thruput: 9.26Gbps
CPU utilization: 12-13%
1.4 ethtool -K gre1 gro on; ethtool -K eth0 gro on
thruput: 9.26Gbps
CPU utilization: 10%
The following tests were performed on a different NIC that is capable of
csum offload. I.e., the h/w is capable of computing IP payload csum
(CHECKSUM_COMPLETE).
2.1 ethtool -K gre1 gro on (hence will use gro_cells)
2.1.1 ethtool -K eth0 gro off; csum offload disabled
thruput: 8.53Gbps
CPU utilization: 9%
2.1.2 ethtool -K eth0 gro off; csum offload enabled
thruput: 8.97Gbps
CPU utilization: 7-8%
2.1.3 ethtool -K eth0 gro on; csum offload disabled
thruput: 8.83Gbps
CPU utilization: 5-6%
2.1.4 ethtool -K eth0 gro on; csum offload enabled
thruput: 8.98Gbps
CPU utilization: 5%
2.2 ethtool -K gre1 gro off
2.2.1 ethtool -K eth0 gro off; csum offload disabled
thruput: 5.93Gbps
CPU utilization: 9%
2.2.2 ethtool -K eth0 gro off; csum offload enabled
thruput: 5.62Gbps
CPU utilization: 8%
2.2.3 ethtool -K eth0 gro on; csum offload disabled
thruput: 7.69Gbps
CPU utilization: 8%
2.2.4 ethtool -K eth0 gro on; csum offload enabled
thruput: 8.96Gbps
CPU utilization: 5-6%
Signed-off-by: H.K. Jerry Chu <hkchu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This change allows to follow a recommandation of RFC4942.
- Add "anycast_src_echo_reply" sysctl to control the use of anycast addresses
as source addresses for ICMPv6 echo reply. This sysctl is false by default
to preserve existing behavior.
- Add inline check ipv6_anycast_destination().
- Use them in icmpv6_echo_reply().
Reference:
RFC4942 - IPv6 Transition/Coexistence Security Considerations
(http://tools.ietf.org/html/rfc4942#section-2.1.6)
2.1.6. Anycast Traffic Identification and Security
[...]
To avoid exposing knowledge about the internal structure of the
network, it is recommended that anycast servers now take advantage of
the ability to return responses with the anycast address as the
source address if possible.
Signed-off-by: Francois-Xavier Le Bail <fx.lebail@yahoo.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/ethernet/qlogic/qlcnic/qlcnic_sriov_pf.c
net/ipv6/ip6_tunnel.c
net/ipv6/ip6_vti.c
ipv6 tunnel statistic bug fixes conflicting with consolidation into
generic sw per-cpu net stats.
qlogic conflict between queue counting bug fix and the addition
of multiple MAC address support.
Signed-off-by: David S. Miller <davem@davemloft.net>
It does not make sense to create an anycast address for an /128-prefix.
Suppress it.
As 32019e651c ("ipv6: Do not leave router anycast address for /127
prefixes.") shows we also may not leave them, because we could accidentally
remove an anycast address the user has allocated or got added via another
prefix.
Cc: François-Xavier Le Bail <fx.lebail@yahoo.com>
Cc: Thomas Haller <thaller@redhat.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says: <pablo@netfilter.org>
====================
nftables updates for net-next
The following patchset contains nftables updates for your net-next tree,
they are:
* Add set operation to the meta expression by means of the select_ops()
infrastructure, this allows us to set the packet mark among other things.
From Arturo Borrero Gonzalez.
* Fix wrong format in sscanf in nf_tables_set_alloc_name(), from Daniel
Borkmann.
* Add new queue expression to nf_tables. These comes with two previous patches
to prepare this new feature, one to add mask in nf_tables_core to
evaluate the queue verdict appropriately and another to refactor common
code with xt_NFQUEUE, from Eric Leblond.
* Do not hide nftables from Kconfig if nfnetlink is not enabled, also from
Eric Leblond.
* Add the reject expression to nf_tables, this adds the missing TCP RST
support. It comes with an initial patch to refactor common code with
xt_NFQUEUE, again from Eric Leblond.
* Remove an unused variable assignment in nf_tables_dump_set(), from Michal
Nazarewicz.
* Remove the nft_meta_target code, now that Arturo added the set operation
to the meta expression, from me.
* Add help information for nf_tables to Kconfig, also from me.
* Allow to dump all sets by specifying NFPROTO_UNSPEC, similar feature is
available to other nf_tables objects, requested by Arturo, from me.
* Expose the table usage counter, so we can know how many chains are using
this table without dumping the list of chains, from Tomasz Bursztyka.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
They are same, so unify them as one, pcpu_sw_netstats.
Define pcpu_sw_netstat in netdevice.h, remove pcpu_tstats
from if_tunnel and remove br_cpu_netstats from br_private.h
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: Li RongQing <roy.qing.li@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
when read/write the 64bit data, the correct lock should be hold.
and we can use the generic vti6_get_stats to return stats, and
not define a new one in ip6_vti.c
Fixes: 87b6d218f3 ("tunnel: implement 64 bits statistics")
Cc: Stephen Hemminger <stephen@networkplumber.org>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Li RongQing <roy.qing.li@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
when read/write the 64bit data, the correct lock should be hold.
Fixes: 87b6d218f3 ("tunnel: implement 64 bits statistics")
Cc: Stephen Hemminger <stephen@networkplumber.org>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Li RongQing <roy.qing.li@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fixed a problem with setting the lifetime of an IPv6
address. When setting preferred_lft to a value not zero or
infinity, while valid_lft is infinity(0xffffffff) preferred
lifetime is set to forever and does not update. Therefore
preferred lifetime never becomes deprecated. valid lifetime
and preferred lifetime should be set independently, even if
valid lifetime is infinity, preferred lifetime must expire
correctly (meaning it must eventually become deprecated)
Signed-off-by: Yasushi Asano <yasushi.asano@jp.fujitsu.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
since the prune parameter for fib6_clean_all always is 0, remove it.
Signed-off-by: Li RongQing <roy.qing.li@gmail.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Running 'make namespacecheck' shows:
net/ipv6/route.o
ipv6_route_table_template
rt6_bind_peer
net/ipv6/icmp.o
icmpv6_route_lookup
ipv6_icmp_table_template
This addresses some of those warnings by:
* make icmpv6_route_lookup static
* move inline's out of ip6_route.h since only used into route.c
* move rt6_bind_peer into route.c
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The function __rtnl_af_register is never called outside this
code, and the return value is always 0.
Signed-off-by: David S. Miller <davem@davemloft.net>
when read/write the 64bit data, the correct lock should be hold.
Signed-off-by: Li RongQing <roy.qing.li@gmail.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>