Commit Graph

3591 Commits

Author SHA1 Message Date
Daniel Borkmann
bf84a01063 net: sock: make sock_tx_timestamp void
Currently, sock_tx_timestamp() always returns 0. The comment that
describes the sock_tx_timestamp() function wrongly says that it
returns an error when an invalid argument is passed (from commit
20d4947353, ``net: socket infrastructure for SO_TIMESTAMPING'').
Make the function void, so that we can also remove all the unneeded
if conditions that check for such a _non-existant_ error case in the
output path.

Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-14 15:41:49 -04:00
Cong Wang
f88c91ddba ipv6: statically link register_inet6addr_notifier()
Tomas reported the following build error:

net/built-in.o: In function `ieee80211_unregister_hw':
(.text+0x10f0e1): undefined reference to `unregister_inet6addr_notifier'
net/built-in.o: In function `ieee80211_register_hw':
(.text+0x10f610): undefined reference to `register_inet6addr_notifier'
make: *** [vmlinux] Error 1

when built IPv6 as a module.

So we have to statically link these symbols.

Reported-by: Tomas Melin <tomas.melin@iki.fi>
Cc: Tomas Melin <tomas.melin@iki.fi>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: YOSHIFUJI Hidaki <yoshfuji@linux-ipv6.org>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-14 15:24:17 -04:00
David S. Miller
16e3d9648a Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next
Steffen Klassert says:

====================
1)  Allow to avoid copying DSCP during encapsulation
    by setting a SA flag. From Nicolas Dichtel.

2) Constify the netlink dispatch table, no need to modify it
   at runtime. From Mathias Krause.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-11 16:14:37 -04:00
Al Viro
d9dda78bad procfs: new helper - PDE_DATA(inode)
The only part of proc_dir_entry the code outside of fs/proc
really cares about is PDE(inode)->data.  Provide a helper
for that; static inline for now, eventually will be moved
to fs/proc, along with the knowledge of struct proc_dir_entry
layout.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-04-09 14:13:32 -04:00
Daniel Borkmann
617fe29d45 net: ipv6: only invalidate previously tokenized addresses
Instead of invalidating all IPv6 addresses with global scope
when one decides to use IPv6 tokens, we should only invalidate
previous tokens and leave the rest intact until they expire
eventually (or are intact forever). For doing this less greedy
approach, we're adding a bool at the end of inet6_ifaddr structure
instead, for two reasons: i) per-inet6_ifaddr flag space is
already used up, making it wider might not be a good idea,
since ii) also we do not necessarily need to export this
information into user space.

Suggested-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-09 13:12:23 -04:00
Daniel Borkmann
fc403832f7 net: ipv6: also allow token to be set when device not ready
When we set the iftoken in inet6_set_iftoken(), we return -EINVAL
when the device does not have flag IF_READY. This is however not
necessary and rather an artificial usability barrier, since we
simply can set the token despite that, and in case the device is
ready, we just send out our rs, otherwise ifup et al. will do
this for us anyway.

Suggested-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-09 13:12:23 -04:00
Daniel Borkmann
914faa147b net: ipv6: minor: use in6addr_any in token init
Since we check for !ipv6_addr_any(&in6_dev->token) in
addrconf_prefix_rcv(), make the token initialization on
device setup more intuitive by using in6addr_any as an
initializer.

Suggested-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-09 13:12:23 -04:00
Daniel Borkmann
f53adae4ea net: ipv6: add tokenized interface identifier support
This patch adds support for IPv6 tokenized IIDs, that allow
for administrators to assign well-known host-part addresses
to nodes whilst still obtaining global network prefix from
Router Advertisements. It is currently in draft status.

  The primary target for such support is server platforms
  where addresses are usually manually configured, rather
  than using DHCPv6 or SLAAC. By using tokenised identifiers,
  hosts can still determine their network prefix by use of
  SLAAC, but more readily be automatically renumbered should
  their network prefix change. [...]

  The disadvantage with static addresses is that they are
  likely to require manual editing should the network prefix
  in use change.  If instead there were a method to only
  manually configure the static identifier part of the IPv6
  address, then the address could be automatically updated
  when a new prefix was introduced, as described in [RFC4192]
  for example.  In such cases a DNS server might be
  configured with such a tokenised interface identifier of
  ::53, and SLAAC would use the token in constructing the
  interface address, using the advertised prefix. [...]

  http://tools.ietf.org/html/draft-chown-6man-tokenised-ipv6-identifiers-02

The implementation is partially based on top of Mark K.
Thompson's proof of concept. However, it uses the Netlink
interface for configuration resp. data retrival, so that
it can be easily extended in future. Successfully tested
by myself.

Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Cc: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Cc: Thomas Graf <tgraf@suug.ch>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-08 16:55:28 -04:00
Patrick McHardy
aaa795ad25 netfilter: nat: propagate errors from xfrm_me_harder()
Propagate errors from ip_xfrm_me_harder() instead of returning EPERM in
all cases.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2013-04-08 12:34:01 +02:00
Patrick McHardy
58e35d1471 netfilter: ipv6: propagate routing errors from ip6_route_me_harder()
Propagate routing errors from ip_route_me_harder() when dropping a packet
using NF_DROP_ERR(). This makes userspace get the proper error instead of
EPERM for everything.

# ip -6 r a unreachable default table 100
# ip -6 ru add fwmark 0x1 lookup 100
# ip6tables -t mangle -A OUTPUT -d 2001:4860:4860::8888 -j MARK --set-mark 0x1

Old behaviour:

PING 2001:4860:4860::8888(2001:4860:4860::8888) 56 data bytes
ping: sendmsg: Operation not permitted
ping: sendmsg: Operation not permitted
ping: sendmsg: Operation not permitted

New behaviour:

PING 2001:4860:4860::8888(2001:4860:4860::8888) 56 data bytes
ping: sendmsg: Network is unreachable
ping: sendmsg: Network is unreachable
ping: sendmsg: Network is unreachable

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2013-04-08 12:34:01 +02:00
David S. Miller
d978a6361a Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/nfc/microread/mei.c
	net/netfilter/nfnetlink_queue_core.c

Pull in 'net' to get Eric Biederman's AF_UNIX fix, upon which
some cleanups are going to go on-top.

Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-07 18:37:01 -04:00
Christoph Paasch
50a75a8914 ipv6/tcp: Stop processing ICMPv6 redirect messages
Tetja Rediske found that if the host receives an ICMPv6 redirect message
after sending a SYN+ACK, the connection will be reset.

He bisected it down to 093d04d (ipv6: Change skb->data before using
icmpv6_notify() to propagate redirect), but the origin of the bug comes
from ec18d9a26 (ipv6: Add redirect support to all protocol icmp error
handlers.). The bug simply did not trigger prior to 093d04d, because
skb->data did not point to the inner IP header and thus icmpv6_notify
did not call the correct err_handler.

This patch adds the missing "goto out;" in tcp_v6_err. After receiving
an ICMPv6 Redirect, we should not continue processing the ICMP in
tcp_v6_err, as this may trigger the removal of request-socks or setting
sk_err(_soft).

Reported-by: Tetja Rediske <tetja@tetja.de>
Signed-off-by: Christoph Paasch <christoph.paasch@uclouvain.be>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-07 12:36:08 -04:00
David S. Miller
d16658206a Merge branch 'master' of git://1984.lsi.us.es/nf-next
Pablo Neira Ayuso says:

====================
The following patchset contains Netfilter and IPVS updates for
your net-next tree, most relevantly they are:

* Add net namespace support to NFLOG, ULOG and ebt_ulog and NFQUEUE.
  The LOG and ebt_log target has been also adapted, but they still
  depend on the syslog netnamespace that seems to be missing, from
  Gao Feng.

* Don't lose indications of congestion in IPv6 fragmentation handling,
  from Hannes Frederic Sowa.i

* IPVS conversion to use RCU, including some code consolidation patches
  and optimizations, also some from Julian Anastasov.

* cpu fanout support for NFQUEUE, from Holger Eitzenberger.

* Better error reporting to userspace when dropping packets from
  all our _*_[xfrm|route]_me_harder functions, from Patrick McHardy.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-07 12:22:06 -04:00
Hannes Frederic Sowa
b8dd6a223e netfilter: implement RFC3168 5.3 (ecn protection) for ipv6 fragmentation handling
This change brings netfilter reassembly logic on par with
reassembly.c. The corresponding change in net-next is
(eec2e61 ipv6: implement RFC3168 5.3 (ecn protection) for
ipv6 fragmentation handling)

Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Jesper Dangaard Brouer <jbrouer@redhat.com>
Cc: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2013-04-06 13:06:37 +02:00
Gao feng
30e0c6a6be netfilter: nf_log: prepare net namespace support for loggers
This patch adds netns support to nf_log and it prepares netns
support for existing loggers. It is composed of four major
changes.

1) nf_log_register has been split to two functions: nf_log_register
   and nf_log_set. The new nf_log_register is used to globally
   register the nf_logger and nf_log_set is used for enabling
   pernet support from nf_loggers.

   Per netns is not yet complete after this patch, it comes in
   separate follow up patches.

2) Add net as a parameter of nf_log_bind_pf. Per netns is not
   yet complete after this patch, it only allows to bind the
   nf_logger to the protocol family from init_net and it skips
   other cases.

3) Adapt all nf_log_packet callers to pass netns as parameter.
   After this patch, this function only works for init_net.

4) Make the sysctl net/netfilter/nf_log pernet.

Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2013-04-05 20:12:54 +02:00
David S. Miller
4f4ecd5f2a Merge branch 'master' of git://1984.lsi.us.es/nf
Pablo Neira Ayuso says:

====================
The following patchset contains netfilter updates for your net tree,
they are:

* Fix missing the skb->trace reset in nf_reset, noticed by Gao Feng
  while using the TRACE target with several net namespaces.

* Fix prefix translation in IPv6 NPT if non-multiple of 32 prefixes
  are used, from Matthias Schiffer.

* Fix invalid nfacct objects with empty name, they are now rejected
  with -EINVAL, spotted by Michael Zintakis, patch from myself.

* A couple of fixes for wrong return values in the error path of
  nfnetlink_queue and nf_conntrack, from Wei Yongjun.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-04 17:41:53 -04:00
Matthias Schiffer
906b1c394d netfilter: ip6t_NPT: Fix translation for non-multiple of 32 prefix lengths
The bitmask used for the prefix mangling was being calculated
incorrectly, leading to the wrong part of the address being replaced
when the prefix length wasn't a multiple of 32.

Signed-off-by: Matthias Schiffer <mschiffer@universe-factory.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2013-04-03 12:24:56 +02:00
David S. Miller
d662483264 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull net into net-next to get the synchronize_net() bug fix in
bonding.

Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-03 01:31:54 -04:00
Balakumaran Kannan
25fb6ca4ed net IPv6 : Fix broken IPv6 routing table after loopback down-up
IPv6 Routing table becomes broken once we do ifdown, ifup of the loopback(lo)
interface. After down-up, routes of other interface's IPv6 addresses through
'lo' are lost.

IPv6 addresses assigned to all interfaces are routed through 'lo' for internal
communication. Once 'lo' is down, those routing entries are removed from routing
table. But those removed entries are not being re-created properly when 'lo' is
brought up. So IPv6 addresses of other interfaces becomes unreachable from the
same machine. Also this breaks communication with other machines because of
NDISC packet processing failure.

This patch fixes this issue by reading all interface's IPv6 addresses and adding
them to IPv6 routing table while bringing up 'lo'.

==Testing==
Before applying the patch:
$ route -A inet6
Kernel IPv6 routing table
Destination                    Next Hop                   Flag Met Ref Use If
2000::20/128                   ::                         U    256 0     0 eth0
fe80::/64                      ::                         U    256 0     0 eth0
::/0                           ::                         !n   -1  1     1 lo
::1/128                        ::                         Un   0   1     0 lo
2000::20/128                   ::                         Un   0   1     0 lo
fe80::xxxx:xxxx:xxxx:xxxx/128  ::                         Un   0   1     0 lo
ff00::/8                       ::                         U    256 0     0 eth0
::/0                           ::                         !n   -1  1     1 lo
$ sudo ifdown lo
$ sudo ifup lo
$ route -A inet6
Kernel IPv6 routing table
Destination                    Next Hop                   Flag Met Ref Use If
2000::20/128                   ::                         U    256 0     0 eth0
fe80::/64                      ::                         U    256 0     0 eth0
::/0                           ::                         !n   -1  1     1 lo
::1/128                        ::                         Un   0   1     0 lo
ff00::/8                       ::                         U    256 0     0 eth0
::/0                           ::                         !n   -1  1     1 lo
$

After applying the patch:
$ route -A inet6
Kernel IPv6 routing
table
Destination                    Next Hop                   Flag Met Ref Use If
2000::20/128                   ::                         U    256 0     0 eth0
fe80::/64                      ::                         U    256 0     0 eth0
::/0                           ::                         !n   -1  1     1 lo
::1/128                        ::                         Un   0   1     0 lo
2000::20/128                   ::                         Un   0   1     0 lo
fe80::xxxx:xxxx:xxxx:xxxx/128  ::                         Un   0   1     0 lo
ff00::/8                       ::                         U    256 0     0 eth0
::/0                           ::                         !n   -1  1     1 lo
$ sudo ifdown lo
$ sudo ifup lo
$ route -A inet6
Kernel IPv6 routing table
Destination                    Next Hop                   Flag Met Ref Use If
2000::20/128                   ::                         U    256 0     0 eth0
fe80::/64                      ::                         U    256 0     0 eth0
::/0                           ::                         !n   -1  1     1 lo
::1/128                        ::                         Un   0   1     0 lo
2000::20/128                   ::                         Un   0   1     0 lo
fe80::xxxx:xxxx:xxxx:xxxx/128  ::                         Un   0   1     0 lo
ff00::/8                       ::                         U    256 0     0 eth0
::/0                           ::                         !n   -1  1     1 lo
$

Signed-off-by: Balakumaran Kannan <Balakumaran.Kannan@ap.sony.com>
Signed-off-by: Maruthi Thotad <Maruthi.Thotad@ap.sony.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-02 14:37:19 -04:00
David S. Miller
a210576cf8 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	net/mac80211/sta_info.c
	net/wireless/core.h

Two minor conflicts in wireless.  Overlapping additions of extern
declarations in net/wireless/core.h and a bug fix overlapping with
the addition of a boolean parameter to __ieee80211_key_free().

Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-01 13:36:50 -04:00
Hannes Frederic Sowa
1c4a154e52 ipv6: don't accept node local multicast traffic from the wire
Erik Hugne's errata proposal (Errata ID: 3480) to RFC4291 has been
verified: http://www.rfc-editor.org/errata_search.php?eid=3480

We have to check for pkt_type and loopback flag because either the
packets are allowed to travel over the loopback interface (in which case
pkt_type is PACKET_HOST and IFF_LOOPBACK flag is set) or they travel
over a non-loopback interface back to us (in which case PACKET_TYPE is
PACKET_LOOPBACK and IFF_LOOPBACK flag is not set).

Cc: Erik Hugne <erik.hugne@ericsson.com>
Cc: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-03-29 14:57:33 -04:00
Hong zhi guo
573ce260b3 net-next: replace obsolete NLMSG_* with type safe nlmsg_*
Signed-off-by: Hong Zhiguo <honkiko@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-03-28 14:25:25 -04:00
David S. Miller
e2a553dbf1 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	include/net/ipip.h

The changes made to ipip.h in 'net' were already included
in 'net-next' before that header was moved to another location.

Signed-off-by: David S. Miller <davem@davemloft.net>
2013-03-27 13:52:49 -04:00
YOSHIFUJI Hideaki / 吉藤英明
cb6bf35502 firewire net, ipv6: IPv6 over Firewire (RFC3146) support.
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-03-26 12:32:13 -04:00
Pravin B Shelar
f61dd388a9 Tunneling: use IP Tunnel stats APIs.
Use common function get calculate rtnl_link_stats64 stats.

Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-03-26 12:27:19 -04:00
Pravin B Shelar
c544193214 GRE: Refactor GRE tunneling code.
Following patch refactors GRE code into ip tunneling code and GRE
specific code. Common tunneling code is moved to ip_tunnel module.
ip_tunnel module is written as generic library which can be used
by different tunneling implementations.

ip_tunnel module contains following components:
 - packet xmit and rcv generic code. xmit flow looks like
   (gre_xmit/ipip_xmit)->ip_tunnel_xmit->ip_local_out.
 - hash table of all devices.
 - lookup for tunnel devices.
 - control plane operations like device create, destroy, ioctl, netlink
   operations code.
 - registration for tunneling modules, like gre, ipip etc.
 - define single pcpu_tstats dev->tstats.
 - struct tnl_ptk_info added to pass parsed tunnel packet parameters.

ipip.h header is renamed to ip_tunnel.h

Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-03-26 12:27:18 -04:00
Hong Zhiguo
a79ca223e0 ipv6: fix bad free of addrconf_init_net
Signed-off-by: Hong Zhiguo <honkiko@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-03-25 13:55:38 -04:00
David S. Miller
da13482534 Merge branch 'master' of git://1984.lsi.us.es/nf-next
Pablo Neira Ayuso says:

====================
The following patchset contains Netfilter/IPVS updates for
your net-next tree, they are:

* Better performance in nfnetlink_queue by avoiding copy from the
  packet to netlink message, from Eric Dumazet.

* Remove unnecessary locking in the exit path of ebt_ulog, from Gao Feng.

* Use new function ipv6_iface_scope_id in nf_ct_ipv6, from Hannes Frederic Sowa.

* A couple of sparse fixes for IPVS, from Julian Anastasov.

* Use xor hashing in nfnetlink_queue, as suggested by Eric Dumazet, from
  myself.

* Allow to dump expectations per master conntrack via ctnetlink, from myself.

* A couple of cleanups to use PTR_RET in module init path, from Silviu-Mihai
  Popescu.

* Remove nf_conntrack module a bit faster if netns are in use, from
  Vladimir Davydov.

* Use checksum_partial in ip6t_NPT, from YOSHIFUJI Hideaki.

* Sparse fix for nf_conntrack, from Stephen Hemminger.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2013-03-25 12:11:44 -04:00
Hannes Frederic Sowa
eec2e6185f ipv6: implement RFC3168 5.3 (ecn protection) for ipv6 fragmentation handling
Hello!

After patch 1 got accepted to net-next I will also send a patch to
netfilter-devel to make the corresponding changes to the netfilter
reassembly logic.

Thanks,

  Hannes

-- >8 --
[PATCH 2/2] ipv6: implement RFC3168 5.3 (ecn protection) for ipv6 fragmentation handling

This patch also ensures that INET_ECN_CE is propagated if one fragment
had the codepoint set.

Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Jesper Dangaard Brouer <jbrouer@redhat.com>
Cc: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-03-24 17:16:30 -04:00
Nicolas Dichtel
63998ac24f ipv6: provide addr and netconf dump consistency info
This patch adds a dev_addr_genid for IPv6. The goal is to use it, combined with
dev_base_seq to check if a change occurs during a netlink dump.
If a change is detected, the flag NLM_F_DUMP_INTR is set in the first message
after the dump was interrupted.

Note that only dump of unicast addresses is checked (multicast and anycast are
not checked).

Reported-by: Junwei Zhang <junwei.zhang@6wind.com>
Reported-by: Hongjun Li <hongjun.li@6wind.com>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-03-24 17:16:29 -04:00
Thomas Graf
661d2967b3 rtnetlink: Remove passing of attributes into rtnl_doit functions
With decnet converted, we can finally get rid of rta_buf and its
computations around it. It also gets rid of the minimal header
length verification since all message handlers do that explicitly
anyway.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-03-22 10:31:16 -04:00
David S. Miller
61816596d1 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull in the 'net' tree to get Daniel Borkmann's flow dissector
infrastructure change.

Signed-off-by: David S. Miller <davem@davemloft.net>
2013-03-20 12:46:26 -04:00
Tom Parkin
44046a593e udp: add encap_destroy callback
Users of udp encapsulation currently have an encap_rcv callback which they can
use to hook into the udp receive path.

In situations where a encapsulation user allocates resources associated with a
udp encap socket, it may be convenient to be able to also hook the proto
.destroy operation.  For example, if an encap user holds a reference to the
udp socket, the destroy hook might be used to relinquish this reference.

This patch adds a socket destroy hook into udp, which is set and enabled
in the same way as the existing encap_rcv hook.

Signed-off-by: Tom Parkin <tparkin@katalix.com>
Signed-off-by: James Chapman <jchapman@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-03-20 12:10:38 -04:00
David S. Miller
90b2621fd4 Merge branch 'master' of git://1984.lsi.us.es/nf
Pablo Neira Ayuso says:

====================
The following patchset contains 7 Netfilter/IPVS fixes for 3.9-rc, they are:

* Restrict IPv6 stateless NPT targets to the mangle table. Many users are
  complaining that this target does not work in the nat table, which is the
  wrong table for it, from Florian Westphal.

* Fix possible use before initialization in the netns init path of several
  conntrack protocol trackers (introduced recently while improving conntrack
  netns support), from Gao Feng.

* Fix incorrect initialization of copy_range in nfnetlink_queue, spotted
  by Eric Dumazet during the NFWS2013, patch from myself.

* Fix wrong calculation of next SCTP chunk in IPVS, from Julian Anastasov.

* Remove rcu_read_lock section in IPVS while calling ipv4_update_pmtu
  not required anymore after change introduced in 3.7, again from Julian.

* Fix SYN looping in IPVS state sync if the backup is used a real server
  in DR/TUN modes, this required a new /proc entry to disable the director
  function when acting as backup, also from Julian.

* Remove leftover IP_NF_QUEUE Kconfig after ip_queue removal, noted by
  Paul Bolle.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2013-03-20 10:23:52 -04:00
Hannes Frederic Sowa
5a3da1fe95 inet: limit length of fragment queue hash table bucket lists
This patch introduces a constant limit of the fragment queue hash
table bucket list lengths. Currently the limit 128 is choosen somewhat
arbitrary and just ensures that we can fill up the fragment cache with
empty packets up to the default ip_frag_high_thresh limits. It should
just protect from list iteration eating considerable amounts of cpu.

If we reach the maximum length in one hash bucket a warning is printed.
This is implemented on the caller side of inet_frag_find to distinguish
between the different users of inet_fragment.c.

I dropped the out of memory warning in the ipv4 fragment lookup path,
because we already get a warning by the slab allocator.

Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Jesper Dangaard Brouer <jbrouer@redhat.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-03-19 10:28:36 -04:00
Eric Dumazet
0d4f060861 tcp: dont handle MTU reduction on LISTEN socket
When an ICMP ICMP_FRAG_NEEDED (or ICMPV6_PKT_TOOBIG) message finds a
LISTEN socket, and this socket is currently owned by the user, we
set TCP_MTU_REDUCED_DEFERRED flag in listener tsq_flags.

This is bad because if we clone the parent before it had a chance to
clear the flag, the child inherits the tsq_flags value, and next
tcp_release_cb() on the child will decrement sk_refcnt.

Result is that we might free a live TCP socket, as reported by
Dormando.

IPv4: Attempt to release TCP socket in state 1

Fix this issue by testing sk_state against TCP_LISTEN early, so that we
set TCP_MTU_REDUCED_DEFERRED on appropriate sockets (not a LISTEN one)

This bug was introduced in commit 563d34d057
(tcp: dont drop MTU reduction indications)

Reported-by: dormando <dormando@rydia.net>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-03-18 13:31:28 -04:00
Christoph Paasch
1a2c6181c4 tcp: Remove TCPCT
TCPCT uses option-number 253, reserved for experimental use and should
not be used in production environments.
Further, TCPCT does not fully implement RFC 6013.

As a nice side-effect, removing TCPCT increases TCP's performance for
very short flows:

Doing an apache-benchmark with -c 100 -n 100000, sending HTTP-requests
for files of 1KB size.

before this patch:
	average (among 7 runs) of 20845.5 Requests/Second
after:
	average (among 7 runs) of 21403.6 Requests/Second

Signed-off-by: Christoph Paasch <christoph.paasch@uclouvain.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-03-17 14:35:13 -04:00
Florian Westphal
a82783c91d netfilter: ip6t_NPT: restrict to mangle table
As the translation is stateless, using it in nat table
doesn't work (only initial packet is translated).
filter table OUTPUT works but won't re-route the packet after translation.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2013-03-15 12:58:21 +01:00
Hannes Frederic Sowa
d00bd3d4fb netfilter: nf_ct_ipv6: use ipv6_iface_scope_id in conntrack to return scope id
As in (842df07 ipv6: use newly introduced __ipv6_addr_needs_scope_id and
ipv6_iface_scope_id).

Cc: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2013-03-15 11:48:43 +01:00
YOSHIFUJI Hideaki
2d2fd8c50a netfilter: ip6t_NPT: Use csum_partial()
[ Some fixes went into mainstream before this patch, so I needed
  to rebase it upon the current tree, that's why it's different from
  the original one posted on the list --pablo ]

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2013-03-15 11:02:04 +01:00
David S. Miller
e5f2ef7ab4 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/ethernet/intel/e1000e/netdev.c

Minor conflict in e1000e, a line that got fixed in 'net'
has been removed in 'net-next'.

Signed-off-by: David S. Miller <davem@davemloft.net>
2013-03-12 05:52:22 -04:00
Cong Wang
e8f72ea4a1 ipv6: introduce ip6tunnel_xmit() helper
Similar to iptunnel_xmit(), group these operations into a
helper function.

This by the way fixes the missing u64_stats_update_begin()
and u64_stats_update_end() for 32 bit arch.

Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Pravin B Shelar <pshelar@nicira.com>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-03-10 16:53:34 -04:00
Mathias Krause
22c352195e ipv6: remove superfluous nla_data() NULL pointer checks
nla_data() cannot return NULL, so these NULL pointer checks are
superfluous.

Signed-off-by: Mathias Krause <minipli@googlemail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-03-10 16:46:09 -04:00
Cong Wang
6aed0c8bf7 tunnel: use iptunnel_xmit() again
With recent patches from Pravin, most tunnels can't use iptunnel_xmit()
any more, due to ip_select_ident() and skb->ip_summed. But we can just
move these operations out of iptunnel_xmit(), so that tunnels can
use it again.

This by the way fixes a bug in vxlan (missing nf_reset()) for net-next.

Cc: Pravin B Shelar <pshelar@nicira.com>
Cc: Stephen Hemminger <stephen@networkplumber.org>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-03-10 03:05:44 -04:00
Pravin B Shelar
7313626745 tunneling: Add generic Tunnel segmentation.
Adds generic tunneling offloading support for IPv4-UDP based
tunnels.
GSO type is added to request this offload for a skb.
netdev feature NETIF_F_UDP_TUNNEL is added for hardware offloaded
udp-tunnel support. Currently no device supports this feature,
software offload is used.

This can be used by tunneling protocols like VXLAN.

CC: Jesse Gross <jesse@nicira.com>
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-03-09 16:09:17 -05:00
Pravin B Shelar
ec5f061564 net: Kill link between CSUM and SG features.
Earlier SG was unset if CSUM was not available for given device to
force skb copy to avoid sending inconsistent csum.
Commit c9af6db4c1 (net: Fix possible wrong checksum generation)
added explicit flag to force copy to fix this issue.  Therefore
there is no need to link SG and CSUM, following patch kills this
link between there two features.

This patch is also required following patch in series.

Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-03-09 16:08:57 -05:00
Hannes Frederic Sowa
3868b7aa76 ipv6: report sin6_scope_id if sockopt RECVORIGDSTADDR is set
v4:
a) unchanged

Cc: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-03-08 12:29:23 -05:00
Hannes Frederic Sowa
842df07397 ipv6: use newly introduced __ipv6_addr_needs_scope_id and ipv6_iface_scope_id
This patch requires multicast interface-scoped addresses to supply a
sin6_scope_id. Because the sin6_scope_id is now also correctly used
in case of interface-scoped multicast traffic this enables one to use
interface scoped addresses over interfaces which are not targeted by the
default multicast route (the route has to be put there manually, though).

getsockname() and getpeername() now return the correct sin6_scope_id in
case of interface-local mc addresses.

v2:
a) rebased ontop of patch 1/4 (now uses ipv6_addr_props)

v3:
a) reverted changes for ipv6_addr_props

v4:
a) unchanged

Cc: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>dave
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-03-08 12:29:22 -05:00
Thomas Graf
80580d4b20 ipv6: ndisc: remove redundant check for !dev->addr_len
send_sllao is already initialized with the value of dev->addr_len

Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-03-08 12:29:22 -05:00
Hannes Frederic Sowa
ddf64354af ipv6: stop multicast forwarding to process interface scoped addresses
v2:
a) used struct ipv6_addr_props

v3:
a) reverted changes for ipv6_addr_props

v4:
a) do not use __ipv6_addr_needs_scope_id

Cc: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-03-08 12:28:20 -05:00
Eric Dumazet
7f0e44ac9f ipv6 flowlabel: add __rcu annotations
Commit 18367681a1 (ipv6 flowlabel: Convert np->ipv6_fl_list to RCU.)
omitted proper __rcu annotations.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-03-07 16:33:10 -05:00
Nicolas Dichtel
7a6742003f netconf: add the handler to dump entries
It's useful to be able to get the initial state of all entries. The patch adds
the support for IPv4 and IPv6.

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-03-06 15:40:53 -05:00
Nicolas Dichtel
a947b0a93e xfrm: allow to avoid copying DSCP during encapsulation
By default, DSCP is copying during encapsulation.
Copying the DSCP in IPsec tunneling may be a bit dangerous because packets with
different DSCP may get reordered relative to each other in the network and then
dropped by the remote IPsec GW if the reordering becomes too big compared to the
replay window.

It is possible to avoid this copy with netfilter rules, but it's very convenient
to be able to configure it for each SA directly.

This patch adds a toogle for this purpose. By default, it's not set to maintain
backward compatibility.

Field flags in struct xfrm_usersa_info is full, hence I add a new attribute.

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2013-03-06 07:02:45 +01:00
Flavio Leitner
dd9f319d94 tcp: ipv6: bind() use stronger condition for bind_conflict
We must try harder to get unique (addr, port) pairs when
doing port autoselection for sockets with SO_REUSEADDR
option set.

This is a continuation of commit aacd9289af
for IPv6.

Signed-off-by: Flavio Leitner <fbl@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-03-05 23:40:00 -05:00
Linus Torvalds
9da060d0ed Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:
 "A moderately sized pile of fixes, some specifically for merge window
  introduced regressions although others are for longer standing items
  and have been queued up for -stable.

  I'm kind of tired of all the RDS protocol bugs over the years, to be
  honest, it's way out of proportion to the number of people who
  actually use it.

   1) Fix missing range initialization in netfilter IPSET, from Jozsef
      Kadlecsik.

   2) ieee80211_local->tim_lock needs to use BH disabling, from Johannes
      Berg.

   3) Fix DMA syncing in SFC driver, from Ben Hutchings.

   4) Fix regression in BOND device MAC address setting, from Jiri
      Pirko.

   5) Missing usb_free_urb in ISDN Hisax driver, from Marina Makienko.

   6) Fix UDP checksumming in bnx2x driver for 57710 and 57711 chips,
      fix from Dmitry Kravkov.

   7) Missing cfgspace_lock initialization in BCMA driver.

   8) Validate parameter size for SCTP assoc stats getsockopt(), from
      Guenter Roeck.

   9) Fix SCTP association hangs, from Lee A Roberts.

  10) Fix jumbo frame handling in r8169, from Francois Romieu.

  11) Fix phy_device memory leak, from Petr Malat.

  12) Omit trailing FCS from frames received in BGMAC driver, from Hauke
      Mehrtens.

  13) Missing socket refcount release in L2TP, from Guillaume Nault.

  14) sctp_endpoint_init should respect passed in gfp_t, rather than use
      GFP_KERNEL unconditionally.  From Dan Carpenter.

  15) Add AISX AX88179 USB driver, from Freddy Xin.

  16) Remove MAINTAINERS entries for drivers deleted during the merge
      window, from Cesar Eduardo Barros.

  17) RDS protocol can try to allocate huge amounts of memory, check
      that the user's request length makes sense, from Cong Wang.

  18) SCTP should use the provided KMALLOC_MAX_SIZE instead of it's own,
      bogus, definition.  From Cong Wang.

  19) Fix deadlocks in FEC driver by moving TX reclaim into NAPI poll,
      from Frank Li.  Also, fix a build error introduced in the merge
      window.

  20) Fix bogus purging of default routes in ipv6, from Lorenzo Colitti.

  21) Don't double count RTT measurements when we leave the TCP receive
      fast path, from Neal Cardwell."

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (61 commits)
  tcp: fix double-counted receiver RTT when leaving receiver fast path
  CAIF: fix sparse warning for caif_usb
  rds: simplify a warning message
  net: fec: fix build error in no MXC platform
  net: ipv6: Don't purge default router if accept_ra=2
  net: fec: put tx to napi poll function to fix dead lock
  sctp: use KMALLOC_MAX_SIZE instead of its own MAX_KMALLOC_SIZE
  rds: limit the size allocated by rds_message_alloc()
  MAINTAINERS: remove eexpress
  MAINTAINERS: remove drivers/net/wan/cycx*
  MAINTAINERS: remove 3c505
  caif_dev: fix sparse warnings for caif_flow_cb
  ax88179_178a: ASIX AX88179_178A USB 3.0/2.0 to gigabit ethernet adapter driver
  sctp: use the passed in gfp flags instead GFP_KERNEL
  ipv[4|6]: correct dropwatch false positive in local_deliver_finish
  l2tp: Restore socket refcount when sendmsg succeeds
  net/phy: micrel: Disable asymmetric pause for KSZ9021
  bgmac: omit the fcs
  phy: Fix phy_device_free memory leak
  bnx2x: Fix KR2 work-around condition
  ...
2013-03-05 18:42:29 -08:00
Lorenzo Colitti
3e8b0ac3e4 net: ipv6: Don't purge default router if accept_ra=2
Setting net.ipv6.conf.<interface>.accept_ra=2 causes the kernel
to accept RAs even when forwarding is enabled. However, enabling
forwarding purges all default routes on the system, breaking
connectivity until the next RA is received. Fix this by not
purging default routes on interfaces that have accept_ra=2.

Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-03-04 14:12:07 -05:00
Neil Horman
d8c6f4b9b7 ipv[4|6]: correct dropwatch false positive in local_deliver_finish
I had a report recently of a user trying to use dropwatch to localise some frame
loss, and they were getting false positives.  Turned out they were using a user
space SCTP stack that used raw sockets to grab frames.  When we don't have a
registered protocol for a given packet, we record it as a drop, even if a raw
socket receieves the frame.  We should only record the drop in the event a raw
socket doesnt exist to receive the frames

Tested by the reported successfully

Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Reported-by: William Reich <reich@ulticom.com>
Tested-by: William Reich <reich@ulticom.com>
CC: "David S. Miller" <davem@davemloft.net>
CC: William Reich <reich@ulticom.com>
CC: eric.dumazet@gmail.com
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-03-01 15:56:29 -05:00
Sasha Levin
b67bfe0d42 hlist: drop the node parameter from iterators
I'm not sure why, but the hlist for each entry iterators were conceived

        list_for_each_entry(pos, head, member)

The hlist ones were greedy and wanted an extra parameter:

        hlist_for_each_entry(tpos, pos, head, member)

Why did they need an extra pos parameter? I'm not quite sure. Not only
they don't really need it, it also prevents the iterator from looking
exactly like the list iterator, which is unfortunate.

Besides the semantic patch, there was some manual work required:

 - Fix up the actual hlist iterators in linux/list.h
 - Fix up the declaration of other iterators based on the hlist ones.
 - A very small amount of places were using the 'node' parameter, this
 was modified to use 'obj->member' instead.
 - Coccinelle didn't handle the hlist_for_each_entry_safe iterator
 properly, so those had to be fixed up manually.

The semantic patch which is mostly the work of Peter Senna Tschudin is here:

@@
iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host;

type T;
expression a,c,d,e;
identifier b;
statement S;
@@

-T b;
    <+... when != b
(
hlist_for_each_entry(a,
- b,
c, d) S
|
hlist_for_each_entry_continue(a,
- b,
c) S
|
hlist_for_each_entry_from(a,
- b,
c) S
|
hlist_for_each_entry_rcu(a,
- b,
c, d) S
|
hlist_for_each_entry_rcu_bh(a,
- b,
c, d) S
|
hlist_for_each_entry_continue_rcu_bh(a,
- b,
c) S
|
for_each_busy_worker(a, c,
- b,
d) S
|
ax25_uid_for_each(a,
- b,
c) S
|
ax25_for_each(a,
- b,
c) S
|
inet_bind_bucket_for_each(a,
- b,
c) S
|
sctp_for_each_hentry(a,
- b,
c) S
|
sk_for_each(a,
- b,
c) S
|
sk_for_each_rcu(a,
- b,
c) S
|
sk_for_each_from
-(a, b)
+(a)
S
+ sk_for_each_from(a) S
|
sk_for_each_safe(a,
- b,
c, d) S
|
sk_for_each_bound(a,
- b,
c) S
|
hlist_for_each_entry_safe(a,
- b,
c, d, e) S
|
hlist_for_each_entry_continue_rcu(a,
- b,
c) S
|
nr_neigh_for_each(a,
- b,
c) S
|
nr_neigh_for_each_safe(a,
- b,
c, d) S
|
nr_node_for_each(a,
- b,
c) S
|
nr_node_for_each_safe(a,
- b,
c, d) S
|
- for_each_gfn_sp(a, c, d, b) S
+ for_each_gfn_sp(a, c, d) S
|
- for_each_gfn_indirect_valid_sp(a, c, d, b) S
+ for_each_gfn_indirect_valid_sp(a, c, d) S
|
for_each_host(a,
- b,
c) S
|
for_each_host_safe(a,
- b,
c, d) S
|
for_each_mesh_entry(a,
- b,
c, d) S
)
    ...+>

[akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c]
[akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c]
[akpm@linux-foundation.org: checkpatch fixes]
[akpm@linux-foundation.org: fix warnings]
[akpm@linux-foudnation.org: redo intrusive kvm changes]
Tested-by: Peter Senna Tschudin <peter.senna@gmail.com>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-27 19:10:24 -08:00
Linus Torvalds
7c2db36e73 Merge branch 'akpm' (incoming from Andrew)
Merge misc patches from Andrew Morton:

 - Florian has vanished so I appear to have become fbdev maintainer
   again :(

 - Joel and Mark are distracted to welcome to the new OCFS2 maintainer

 - The backlight queue

 - Small core kernel changes

 - lib/ updates

 - The rtc queue

 - Various random bits

* akpm: (164 commits)
  rtc: rtc-davinci: use devm_*() functions
  rtc: rtc-max8997: use devm_request_threaded_irq()
  rtc: rtc-max8907: use devm_request_threaded_irq()
  rtc: rtc-da9052: use devm_request_threaded_irq()
  rtc: rtc-wm831x: use devm_request_threaded_irq()
  rtc: rtc-tps80031: use devm_request_threaded_irq()
  rtc: rtc-lp8788: use devm_request_threaded_irq()
  rtc: rtc-coh901331: use devm_clk_get()
  rtc: rtc-vt8500: use devm_*() functions
  rtc: rtc-tps6586x: use devm_request_threaded_irq()
  rtc: rtc-imxdi: use devm_clk_get()
  rtc: rtc-cmos: use dev_warn()/dev_dbg() instead of printk()/pr_debug()
  rtc: rtc-pcf8583: use dev_warn() instead of printk()
  rtc: rtc-sun4v: use pr_warn() instead of printk()
  rtc: rtc-vr41xx: use dev_info() instead of printk()
  rtc: rtc-rs5c313: use pr_err() instead of printk()
  rtc: rtc-at91rm9200: use dev_dbg()/dev_err() instead of printk()/pr_debug()
  rtc: rtc-rs5c372: use dev_dbg()/dev_warn() instead of printk()/pr_debug()
  rtc: rtc-ds2404: use dev_err() instead of printk()
  rtc: rtc-efi: use dev_err()/dev_warn()/pr_err() instead of printk()
  ...
2013-02-21 17:38:49 -08:00
Christian Kujau
242260fb85 sun.com documentation fixes
After I came across a help text for SUNGEM mentioning a broken sun.com
URL, I felt like fixing those up, as they are now pointing to oracle.com
URLs.

Signed-off-by: Christian Kujau <lists@nerdbynature.de>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-21 17:22:20 -08:00
Linus Torvalds
06991c28f3 Driver core patches for 3.9-rc1
Here is the big driver core merge for 3.9-rc1
 
 There are two major series here, both of which touch lots of drivers all
 over the kernel, and will cause you some merge conflicts:
   - add a new function called devm_ioremap_resource() to properly be
     able to check return values.
   - remove CONFIG_EXPERIMENTAL
 
 If you need me to provide a merged tree to handle these resolutions,
 please let me know.
 
 Other than those patches, there's not much here, some minor fixes and
 updates.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.19 (GNU/Linux)
 
 iEYEABECAAYFAlEmV0cACgkQMUfUDdst+yncCQCfbmnQZju7kzWXk6PjdFuKspT9
 weAAoMCzcAtEzzc4LXuUxxG/sXBVBCjW
 =yWAQ
 -----END PGP SIGNATURE-----

Merge tag 'driver-core-3.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core

Pull driver core patches from Greg Kroah-Hartman:
 "Here is the big driver core merge for 3.9-rc1

  There are two major series here, both of which touch lots of drivers
  all over the kernel, and will cause you some merge conflicts:

   - add a new function called devm_ioremap_resource() to properly be
     able to check return values.

   - remove CONFIG_EXPERIMENTAL

  Other than those patches, there's not much here, some minor fixes and
  updates"

Fix up trivial conflicts

* tag 'driver-core-3.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (221 commits)
  base: memory: fix soft/hard_offline_page permissions
  drivercore: Fix ordering between deferred_probe and exiting initcalls
  backlight: fix class_find_device() arguments
  TTY: mark tty_get_device call with the proper const values
  driver-core: constify data for class_find_device()
  firmware: Ignore abort check when no user-helper is used
  firmware: Reduce ifdef CONFIG_FW_LOADER_USER_HELPER
  firmware: Make user-mode helper optional
  firmware: Refactoring for splitting user-mode helper code
  Driver core: treat unregistered bus_types as having no devices
  watchdog: Convert to devm_ioremap_resource()
  thermal: Convert to devm_ioremap_resource()
  spi: Convert to devm_ioremap_resource()
  power: Convert to devm_ioremap_resource()
  mtd: Convert to devm_ioremap_resource()
  mmc: Convert to devm_ioremap_resource()
  mfd: Convert to devm_ioremap_resource()
  media: Convert to devm_ioremap_resource()
  iommu: Convert to devm_ioremap_resource()
  drm: Convert to devm_ioremap_resource()
  ...
2013-02-21 12:05:51 -08:00
YOSHIFUJI Hideaki / 吉藤英明
ecd9883724 ipv6: fix race condition regarding dst->expires and dst->from.
Eric Dumazet wrote:
| Some strange crashes happen in rt6_check_expired(), with access
| to random addresses.
|
| At first glance, it looks like the RTF_EXPIRES and
| stuff added in commit 1716a96101
| (ipv6: fix problem with expired dst cache)
| are racy : same dst could be manipulated at the same time
| on different cpus.
|
| At some point, our stack believes rt->dst.from contains a dst pointer,
| while its really a jiffie value (as rt->dst.expires shares the same area
| of memory)
|
| rt6_update_expires() should be fixed, or am I missing something ?
|
| CC Neil because of https://bugzilla.redhat.com/show_bug.cgi?id=892060

Because we do not have any locks for dst_entry, we cannot change
essential structure in the entry; e.g., we cannot change reference
to other entity.

To fix this issue, split 'from' and 'expires' field in dst_entry
out of union.  Once it is 'from' is assigned in the constructor,
keep the reference until the very last stage of the life time of
the object.

Of course, it is unsafe to change 'from', so make rt6_set_from simple
just for fresh entries.

Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
Reported-by: Neil Horman <nhorman@tuxdriver.com>
CC: Gao Feng <gaofeng@cn.fujitsu.com>
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reported-by: Steinar H. Gunderson <sesse@google.com>
Reviewed-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-02-20 15:11:45 -05:00
David S. Miller
2ccba5433b Merge branch 'master' of git://1984.lsi.us.es/nf-next
Pablo Neira Ayuso says:

====================
The following patchset contain updates for your net-next tree, they are:

* Fix (for just added) connlabel dependencies, from Florian Westphal.

* Add aliasing support for conntrack, thus users can either use -m state
  or -m conntrack from iptables while using the same kernel module, from
  Jozsef Kadlecsik.

* Some code refactoring for the CT target to merge common code in
  revision 0 and 1, from myself.

* Add aliasing support for CT, based on patch from Jozsef Kadlecsik.

* Add one mutex per nfnetlink subsystem, from myself.

* Improved logging for packets that are dropped by helpers, from myself.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2013-02-18 23:42:09 -05:00
David S. Miller
6338a53a2b Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net into net
Pull in 'net' to take in the bug fixes that didn't make it into
3.8-final.

Also, deal with the semantic conflict of the change made to
net/ipv6/xfrm6_policy.c   A missing rt6->n neighbour release
was added to 'net', but in 'net-next' we no longer cache the
neighbour entries in the ipv6 routes so that change is not
appropriate there.

Signed-off-by: David S. Miller <davem@davemloft.net>
2013-02-18 23:34:21 -05:00
Pablo Neira Ayuso
b20ab9cc63 netfilter: nf_ct_helper: better logging for dropped packets
Connection tracking helpers have to drop packets under exceptional
situations. Currently, the user gets the following logging message
in case that happens:

	nf_ct_%s: dropping packet ...

However, depending on the helper, there are different reasons why a
packet can be dropped.

This patch modifies the existing code to provide more specific
error message in the scope of each helper to help users to debug
the reason why the packet has been dropped, ie:

	nf_ct_%s: dropping packet: reason ...

Thanks to Joe Perches for many formatting suggestions.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2013-02-19 02:48:05 +01:00
Eric Dumazet
8064b3cf75 ipv6: fix a sparse warning
net/ipv6/reassembly.c:82:72: warning: incorrect type in argument 3 (different base types)
net/ipv6/reassembly.c:82:72:    expected unsigned int [unsigned] [usertype] c
net/ipv6/reassembly.c:82:72:    got restricted __be32 [usertype] id

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-02-18 15:28:00 -05:00
Romain KUNTZ
18cf0d0784 xfrm: release neighbor upon dst destruction
Neighbor is cloned in xfrm6_fill_dst but seems to never be released.
Neighbor entry should be released when XFRM6 dst entry is destroyed
in xfrm6_dst_destroy, otherwise references may be kept forever on
the device pointed by the neighbor entry.

I may not have understood all the subtleties of XFRM & dst so I would
be happy to receive comments on this patch.

Signed-off-by: Romain Kuntz <r.kuntz@ipflavors.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-02-18 14:57:29 -05:00
Gao feng
ece31ffd53 net: proc: change proc_net_remove to remove_proc_entry
proc_net_remove is only used to remove proc entries
that under /proc/net,it's not a general function for
removing proc entries of netns. if we want to remove
some proc entries which under /proc/net/stat/, we still
need to call remove_proc_entry.

this patch use remove_proc_entry to replace proc_net_remove.
we can remove proc_net_remove after this patch.

Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-02-18 14:53:08 -05:00
Gao feng
d4beaa66ad net: proc: change proc_net_fops_create to proc_create
Right now, some modules such as bonding use proc_create
to create proc entries under /proc/net/, and other modules
such as ipv4 use proc_net_fops_create.

It looks a little chaos.this patch changes all of
proc_net_fops_create to proc_create. we can remove
proc_net_fops_create after this patch.

Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-02-18 14:53:08 -05:00
stephen hemminger
0d4bfa297c ipv6: fix warning in xfrm6_mode_tunnel_input
Should not use assignment in conditional:
 warning: suggest parentheses around assignment used as truth value [-Wparentheses]

Problem introduced by:
commit 14bbd6a565
Author: Pravin B Shelar <pshelar@nicira.com>
Date:   Thu Feb 14 09:44:49 2013 +0000

    net: Add skb_unclone() helper function.

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-02-18 12:42:47 -05:00
Eric Dumazet
279e9f2ff7 ipv6: optimize inet6_hash_frag()
Use ipv6_addr_hash() and a single jhash invocation.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-02-18 12:23:51 -05:00
Pravin B Shelar
68c3316311 v4 GRE: Add TCP segmentation offload for GRE
Following patch adds GRE protocol offload handler so that
skb_gso_segment() can segment GRE packets.
SKB GSO CB is added to keep track of total header length so that
skb_segment can push entire header. e.g. in case of GRE, skb_segment
need to push inner and outer headers to every segment.
New NETIF_F_GRE_GSO feature is added for devices which support HW
GRE TSO offload. Currently none of devices support it therefore GRE GSO
always fall backs to software GSO.

[ Compute pkt_len before ip_local_out() invocation. -DaveM ]

Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-02-15 15:17:11 -05:00
Pravin B Shelar
14bbd6a565 net: Add skb_unclone() helper function.
This function will be used in next GRE_GSO patch. This patch does
not change any functionality.

Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Acked-by: Eric Dumazet <edumazet@google.com>
2013-02-15 15:10:37 -05:00
David S. Miller
e0376d0043 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next
Steffen Klassert says:

====================
1) Remove a duplicated call to skb_orphan() in pf_key, from Cong Wang.

2) Prepare xfrm and pf_key for algorithms without pf_key support,
   from Jussi Kivilinna.

3) Fix an unbalanced lock in xfrm_output_one(), from Li RongQing.

4) Add an IPsec state resolution packet queue to handle
   packets that are send before the states are resolved.

5) xfrm4_policy_fini() is unused since 2.6.11, time to remove it.
   From Michal Kubecek.

6) The xfrm gc threshold was configurable just in the initial
   namespace, make it configurable in all namespaces. From
   Michal Kubecek.

7) We currently can not insert policies with mark and mask
   such that some flows would be matched from both policies.
   Allow this if the priorities of these policies are different,
   the one with the higher priority is used in this case.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2013-02-14 13:29:20 -05:00
Michal Kubeček
894e2ac82b netfilter: nf_ct_reasm: fix per-netns sysctl initialization
Adjusting of data pointers in net/netfilter/nf_conntrack_frag6_*
sysctl table for other namespaces points to wrong netns_frags
structure and has reversed order of entries.

Problem introduced by commit c038a767cd in 3.7-rc1

Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2013-02-13 20:46:06 +01:00
Pravin B Shelar
c9af6db4c1 net: Fix possible wrong checksum generation.
Patch cef401de7b (net: fix possible wrong checksum
generation) fixed wrong checksum calculation but it broke TSO by
defining new GSO type but not a netdev feature for that type.
net_gso_ok() would not allow hardware checksum/segmentation
offload of such packets without the feature.

Following patch fixes TSO and wrong checksum. This patch uses
same logic that Eric Dumazet used. Patch introduces new flag
SKBTX_SHARED_FRAG if at least one frag can be modified by
the user. but SKBTX_SHARED_FRAG flag is kept in skb shared
info tx_flags rather than gso_type.

tx_flags is better compared to gso_type since we can have skb with
shared frag without gso packet. It does not link SHARED_FRAG to
GSO, So there is no need to define netdev feature for this.

Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-02-13 13:30:10 -05:00
Andrey Vagin
ee684b6f28 tcp: send packets with a socket timestamp
A socket timestamp is a sum of the global tcp_time_stamp and
a per-socket offset.

A socket offset is added in places where externally visible
tcp timestamp option is parsed/initialized.

Connections in the SYN_RECV state are not supported, global
tcp_time_stamp is used for them, because repair mode doesn't support
this state. In a future it can be implemented by the similar way
as for TIME_WAIT sockets.

Cc: "David S. Miller" <davem@davemloft.net>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: James Morris <jmorris@namei.org>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: Patrick McHardy <kaber@trash.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-02-13 13:22:16 -05:00
David S. Miller
9f6d98c298 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c

The bnx2x gso_type setting bug fix in 'net' conflicted with
changes in 'net-next' that broke the gso_* setting logic
out into a seperate function, which also fixes the bug in
question.  Thus, use the 'net-next' version.

Signed-off-by: David S. Miller <davem@davemloft.net>
2013-02-12 18:58:28 -05:00
Hannes Frederic Sowa
2c5e893384 ipv6: by default join ff01::1 and in case of forwarding ff01::2 and ff05:2
Cc: Erik Hugne <erik.hugne@ericsson.com>
Cc: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-02-11 14:16:26 -05:00
Hannes Frederic Sowa
20314092c1 ipv6: don't accept multicast traffic with scope 0
v2:
a) moved before multicast source address check
b) changed comment to netdev style

Cc: Erik Hugne <erik.hugne@ericsson.com>
Cc: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-02-11 14:00:54 -05:00
Hannes Frederic Sowa
dd40851521 ipv6: don't let node/interface scoped multicast traffic escape on the wire
Reported-by: Erik Hugne <erik.hugne@ericsson.com>
Cc: Erik Hugne <erik.hugne@ericsson.com>
Cc: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-02-11 14:00:54 -05:00
YOSHIFUJI Hideaki / 吉藤英明
ec16ef2228 ipv6 mcast: Do not join device multicast for interface-local multicasts.
RFC4291 (IPv6 addressing architecture) says that interface-Local scope
spans only a single interface on a node.  We should not join L2 device
multicast list for addresses in interface-local (or smaller) scope.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-02-11 00:21:44 -05:00
David S. Miller
cfa82e020b Merge branch 'master' of git://1984.lsi.us.es/nf
Pablo Neira Ayuso says:

====================
The following patchset contains Netfilter/IPVS fixes for 3.8-rc7, they are:

* Fix oops in IPVS state-sync due to releasing a random memory area due
  to unitialized pointer, from Dan Carpenter.

* Fix SCTP flow establishment due to bad checksumming mangling in IPVS,
  from Daniel Borkmann.

* Three fixes for the recently added IPv6 NPT, all from YOSHIFUJI Hideaki,
  with an amendment collapsed into those patches from Ulrich Weber. They
  fiix adjustment calculation, fix prefix mangling and ensure LSB of
  prefixes are zeroes (as required by RFC).

Specifically, it took me a while to validate the 1's complement arithmetics/
checksumming approach in the IPv6 NPT code.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2013-02-10 20:44:08 -05:00
David S. Miller
fd5023111c Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Synchronize with 'net' in order to sort out some l2tp, wireless, and
ipv6 GRE fixes that will be built on top of in 'net-next'.

Signed-off-by: David S. Miller <davem@davemloft.net>
2013-02-08 18:02:14 -05:00
Amerigo Wang
6a98dcf032 ipv6: fix a RCU warning in net/ipv6/ip6_flowlabel.c
This patch fixes the following RCU warning:

[   51.680236] ===============================
[   51.681914] [ INFO: suspicious RCU usage. ]
[   51.683610] 3.8.0-rc6-next-20130206-sasha-00028-g83214f7-dirty #276 Tainted: G        W
[   51.686703] -------------------------------
[   51.688281] net/ipv6/ip6_flowlabel.c:671 suspicious rcu_dereference_check() usage!

we should use rcu_dereference_bh() when we hold rcu_read_lock_bh().

Reported-by: Sasha Levin <sasha.levin@oracle.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-02-08 17:55:48 -05:00
YOSHIFUJI Hideaki / 吉藤英明
edb27228db netfilter: ip6t_NPT: Ensure to check lower part of prefixes are zero
RFC 6296 points that address bits that are not part of the prefix
has to be zeroed.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2013-02-07 18:40:27 +01:00
YOSHIFUJI Hideaki / 吉藤英明
d4c38fa87d netfilter: ip6t_NPT: Fix prefix mangling
Make sure only the bits that are part of the prefix are mangled.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2013-02-07 18:40:26 +01:00
YOSHIFUJI Hideaki / 吉藤英明
f5271fff56 netfilter: ip6t_NPT: Fix adjustment calculation
Cast __wsum from/to __sum16 is wrong.  Instead, apply appropriate
conversion function: csum_unfold() or csum_fold().

[ The original patch has been modified to undo the final ~ that
  csum_fold returns. We only need to fold the 32-bit word that
  results from the checksum calculation into a 16-bit to ensure
  that the original subnet is restored appropriately. Spotted by
  Ulrich Weber. ]

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2013-02-07 18:37:41 +01:00
Tommi Rantala
41ab3e31bd ipv6/ip6_gre: fix error case handling in ip6gre_tunnel_xmit()
ip6gre_tunnel_xmit() is leaking the skb when we hit this error branch,
and the -1 return value from this function is bogus. Use the error
handling we already have in place in ip6gre_tunnel_xmit() for this error
case to fix this.

Signed-off-by: Tommi Rantala <tt.rantala@gmail.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-02-06 16:02:05 -05:00
Steffen Klassert
f4e53e292a ipv6: Don't send packet to big messages to self
Calling icmpv6_send() on a local message size error leads to an
incorrect update of the path mtu in the case when IPsec is used.
So use ipv6_local_error() instead to notify the socket about the
error.

Reported-by: Jiri Bohac <jbohac@suse.cz>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-02-06 15:12:39 -05:00
Michal Kubecek
8d068875ca xfrm: make gc_thresh configurable in all namespaces
The xfrm gc threshold can be configured via xfrm{4,6}_gc_thresh
sysctl but currently only in init_net, other namespaces always
use the default value. This can substantially limit the number
of IPsec tunnels that can be effectively used.

Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2013-02-06 11:36:29 +01:00
David S. Miller
188d1f76d0 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/ethernet/intel/e1000e/ethtool.c
	drivers/net/vmxnet3/vmxnet3_drv.c
	drivers/net/wireless/iwlwifi/dvm/tx.c
	net/ipv6/route.c

The ipv6 route.c conflict is simple, just ignore the 'net' side change
as we fixed the same problem in 'net-next' by eliminating cached
neighbours from ipv6 routes.

The e1000e conflict is an addition of a new statistic in the ethtool
code, trivial.

The vmxnet3 conflict is about one change in 'net' removing a guarding
conditional, whilst in 'net-next' we had a netdev_info() conversion.

The iwlwifi conflict is dealing with a WARN_ON() conversion in
'net-next' vs. a revert happening in 'net'.

Signed-off-by: David S. Miller <davem@davemloft.net>
2013-02-05 14:12:20 -05:00
Jean Sacren
56db1c5f41 mcast: do not check 'rv' twice in a row
With the loop, don't check 'rv' twice in a row. Without the loop, 'rv'
doesn't even need to be checked.

Make the comment more grammar-friendly.

Signed-off-by: Jean Sacren <sakiwit@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-02-04 13:26:49 -05:00
Vijay Subramanian
5f1e942cb4 tcp: ipv6: Update MIB counters for drops
This patch updates LINUX_MIB_LISTENDROPS and LINUX_MIB_LISTENOVERFLOWS in
tcp_v6_conn_request() and tcp_v6_err(). tcp_v6_conn_request() in particular can
drop SYNs for various reasons which are not currently tracked.

Signed-off-by: Vijay Subramanian <subramanian.vijay@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-02-04 13:06:27 -05:00
Tom Parkin
8e72d37eb3 ipv6: export ip6_datagram_recv_ctl
ip6_datagram_recv_ctl and ip6_datagram_send_ctl are used for handling IPv6
ancillary data.  Since ip6_datagram_send_ctl is already publicly exported for
use in modules, ip6_datagram_recv_ctl should also be available to support
ancillary data in the receive path.

Signed-off-by: Tom Parkin <tparkin@katalix.com>
Signed-off-by: James Chapman <jchapman@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-01-31 13:53:08 -05:00
Tom Parkin
73df66f8b1 ipv6: rename datagram_send_ctl and datagram_recv_ctl
The datagram_*_ctl functions in net/ipv6/datagram.c are IPv6-specific.  Since
datagram_send_ctl is publicly exported it should be appropriately named to
reflect the fact that it's for IPv6 only.

Signed-off-by: Tom Parkin <tparkin@katalix.com>
Signed-off-by: James Chapman <jchapman@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-01-31 13:53:08 -05:00
YOSHIFUJI Hideaki / 吉藤英明
c33e7b05f1 ipv6 anycast: Convert ipv6_sk_ac_lock to spinlock.
Since all users are write-lock, it does not make sense to use
rwlock here.  Use simple spinlock.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-01-30 22:41:13 -05:00
YOSHIFUJI Hideaki / 吉藤英明
18367681a1 ipv6 flowlabel: Convert np->ipv6_fl_list to RCU.
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-01-30 22:41:13 -05:00
YOSHIFUJI Hideaki / 吉藤英明
d3aedd5ebd ipv6 flowlabel: Convert hash list to RCU.
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-01-30 22:41:13 -05:00
YOSHIFUJI Hideaki / 吉藤英明
f256dc59d0 ipv6 flowlabel: Ensure to take lock when modifying np->ip6_sk_fl_list.
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-01-30 22:41:12 -05:00
Marcelo Ricardo Leitner
bd30e94720 ipv6: do not create neighbor entries for local delivery
They will be created at output, if ever needed. This avoids creating
empty neighbor entries when TPROXYing/Forwarding packets for addresses
that are not even directly reachable.

Note that IPv4 already handles it this way. No neighbor entries are
created for local input.

Tested by myself and customer.

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Marcelo Ricardo Leitner <mleitner@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-01-30 20:26:07 -05:00
YOSHIFUJI Hideaki / 吉藤英明
d9e85655b5 netfilter ip6table_mangle: Use ipv6_addr_equal() where appropriate.
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-01-29 22:58:40 -05:00
YOSHIFUJI Hideaki / 吉藤英明
ff88b30c71 xfrm: Use ipv6_addr_equal() where appropriate.
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-01-29 22:58:40 -05:00
YOSHIFUJI Hideaki / 吉藤英明
07c2fecc36 ipv6 mcast: Use ipv6_addr_equal() in ip6_mc_source().
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-01-29 22:58:40 -05:00
YOSHIFUJI Hideaki / 吉藤英明
5e98a36ed4 ipv6 addrconf: Fix interface identifiers of 802.15.4 devices.
The "Universal/Local" (U/L) bit must be complmented according to RFC4944
and RFC2464.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-01-29 15:43:04 -05:00
David S. Miller
f1e7b73acc Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Bring in the 'net' tree so that we can get some ipv4/ipv6 bug
fixes that some net-next work will build upon.

Signed-off-by: David S. Miller <davem@davemloft.net>
2013-01-29 15:32:13 -05:00
Hannes Frederic Sowa
218774dc34 ipv6: add anti-spoofing checks for 6to4 and 6rd
This patch adds anti-spoofing checks in sit.c as specified in RFC3964
section 5.2 for 6to4 and RFC5969 section 12 for 6rd. I left out the
checks which could easily be implemented with netfilter.

Specifically this patch adds following logic (based loosely on the
pseudocode in RFC3964 section 5.2):

if prefix (inner_src_v6) == rd6_prefix (2002::/16 is the default)
        and outer_src_v4 != embedded_ipv4 (inner_src_v6)
                drop
if prefix (inner_dst_v6) == rd6_prefix (or 2002::/16 is the default)
        and outer_dst_v4 != embedded_ipv4 (inner_dst_v6)
                drop
accept

To accomplish the specified security checks proposed by above RFCs,
it is still necessary to employ uRPF filters with netfilter. These new
checks only kick in if the employed addresses are within the 2002::/16 or
another range specified by the 6rd-prefix (which defaults to 2002::/16).

Cc: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Cc: David Miller <davem@davemloft.net>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-01-29 15:22:03 -05:00
Eric W. Biederman
243bb4c6d7 ipv6: Fix inet6_csk_bind_conflict so it builds with user namespaces enabled
When attempting to build linux-next with user namespaces enabled I ran
into this fun build error.

  CC      net/ipv6/inet6_connection_sock.o
.../net/ipv6/inet6_connection_sock.c: In function ‘inet6_csk_bind_conflict’:
.../net/ipv6/inet6_connection_sock.c:37:12: error: incompatible types when initializing type ‘int’ using
 type ‘kuid_t’
.../net/ipv6/inet6_connection_sock.c:54:30: error: incompatible type for argument 1 of ‘uid_eq’
.../include/linux/uidgid.h:48:20: note: expected ‘kuid_t’ but argument is of type ‘int’
make[3]: *** [net/ipv6/inet6_connection_sock.o] Error 1
make[2]: *** [net/ipv6] Error 2
make[2]: *** Waiting for unfinished jobs....

Using kuid_t instead of int to hold the uid fixes this.

Cc: Tom Herbert <therbert@google.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-01-29 15:20:12 -05:00
Jiri Pirko
5c766d642b ipv4: introduce address lifetime
There are some usecase when lifetime of ipv4 addresses might be helpful.
For example:
1) initramfs networkmanager uses a DHCP daemon to learn network
configuration parameters
2) initramfs networkmanager addresses, routes and DNS configuration
3) initramfs networkmanager is requested to stop
4) initramfs networkmanager stops all daemons including dhclient
5) there are addresses and routes configured but no daemon running. If
the system doesn't start networkmanager for some reason, addresses and
routes will be used forever, which violates RFC 2131.

This patch is essentially a backport of ivp6 address lifetime mechanism
for ipv4 addresses.

Current "ip" tool supports this without any patch (since it does not
distinguish between ipv4 and ipv6 addresses in this perspective.

Also, this should be back-compatible with all current netlink users.

Reported-by: Pavel Šimerda <psimerda@redhat.com>
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-01-29 13:59:57 -05:00
Jesper Dangaard Brouer
3ef0eb0db4 net: frag, move LRU list maintenance outside of rwlock
Updating the fragmentation queues LRU (Least-Recently-Used) list,
required taking the hash writer lock.  However, the LRU list isn't
tied to the hash at all, so we can use a separate lock for it.

Original-idea-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-01-29 13:36:24 -05:00
Jesper Dangaard Brouer
d433673e5f net: frag helper functions for mem limit tracking
This change is primarily a preparation to ease the extension of memory
limit tracking.

The change does reduce the number atomic operation, during freeing of
a frag queue.  This does introduce a some performance improvement, as
these atomic operations are at the core of the performance problems
seen on NUMA systems.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-01-29 13:36:24 -05:00
Eric Dumazet
cef401de7b net: fix possible wrong checksum generation
Pravin Shelar mentioned that GSO could potentially generate
wrong TX checksum if skb has fragments that are overwritten
by the user between the checksum computation and transmit.

He suggested to linearize skbs but this extra copy can be
avoided for normal tcp skbs cooked by tcp_sendmsg().

This patch introduces a new SKB_GSO_SHARED_FRAG flag, set
in skb_shinfo(skb)->gso_type if at least one frag can be
modified by the user.

Typical sources of such possible overwrites are {vm}splice(),
sendfile(), and macvtap/tun/virtio_net drivers.

Tested:

$ netperf -H 7.7.8.84
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
7.7.8.84 () port 0 AF_INET
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

 87380  16384  16384    10.00    3959.52

$ netperf -H 7.7.8.84 -t TCP_SENDFILE
TCP SENDFILE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.8.84 ()
port 0 AF_INET
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

 87380  16384  16384    10.00    3216.80

Performance of the SENDFILE is impacted by the extra allocation and
copy, and because we use order-0 pages, while the TCP_STREAM uses
bigger pages.

Reported-by: Pravin Shelar <pshelar@nicira.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-01-28 00:27:15 -05:00
Dan Carpenter
75356a8143 ip6mr: limit IPv6 MRT_TABLE identifiers
We did this for IPv4 in b49d3c1e1c "net: ipmr: limit MRT_TABLE
identifiers" but we need to do it for IPv6 as well.  On IPv6 the name
is "pim6reg" instead of "pimreg" so there is one less digit allowed.

The strcpy() is in ip6mr_reg_vif().

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-01-27 19:31:03 -05:00
David S. Miller
b640bee6d9 Merge branch 'master' of git://1984.lsi.us.es/nf-next
Pablo Neira Ayuso says:

====================
This batch contains netfilter updates for you net-next tree, they are:

* The new connlabel extension for x_tables, that allows us to attach
  labels to each conntrack flow. The kernel implementation uses a
  bitmask and there's a file in user-space that maps the bits with the
  corresponding string for each existing label. By now, you can attach
  up to 128 overlapping labels. From Florian Westphal.

* A new round of improvements for the netns support for conntrack.
  Gao feng has moved many of the initialization code of each module
  of the netns init path. He also made several code refactoring, that
  code looks cleaner to me now.

* Added documentation for all possible tweaks for nf_conntrack via
  sysctl, from Jiri Pirko.

* Cisco 7941/7945 IP phone support for our SIP conntrack helper,
  from Kevin Cernekee.

* Missing header file in the snmp helper, from Stephen Hemminger.

* Finally, a couple of fixes to resolve minor issues with these
  changes, from myself.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2013-01-27 00:56:10 -05:00
Tom Herbert
72289b96c9 soreuseport: UDP/IPv6 implementation
Motivation for soreuseport would be something like a DNS server.  An
alternative would be to recv on the same socket from multiple threads.
As in the case of TCP, the load across these threads tends to be
disproportionate and we also see a lot of contection on the socket lock.
Note that SO_REUSEADDR already allows multiple UDP sockets to bind to
the same port, however there is no provision to prevent hijacking and
nothing to distribute packets across all the sockets sharing the same
bound port.  This patch does not change the semantics of SO_REUSEADDR,
but provides usable functionality of it for unicast.

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-01-23 13:44:01 -05:00
Tom Herbert
5ba24953e9 soreuseport: TCP/IPv6 implementation
Motivation for soreuseport would be something like a web server
binding to port 80 running with multiple threads, where each thread
might have it's own listener socket.  This could be done as an
alternative to other models: 1) have one listener thread which
dispatches completed connections to workers. 2) accept on a single
listener socket from multiple threads.  In case #1 the listener thread
can easily become the bottleneck with high connection turn-over rate.
In case #2, the proportion of connections accepted per thread tends
to be uneven under high connection load (assuming simple event loop:
while (1) { accept(); process() }, wakeup does not promote fairness
among the sockets.  We have seen the  disproportion to be as high
as 3:1 ratio between thread accepting most connections and the one
accepting the fewest.  With so_reusport the distribution is
uniform.

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-01-23 13:44:01 -05:00
Gao feng
c296bb4d5d netfilter: nf_conntrack: refactor l4proto support for netns
Move the code that register/unregister l4proto to the
module_init/exit context.

Given that we have to modify some interfaces to accomodate
these changes, it is a good time to use shorter function names
for this using the nf_ct_* prefix instead of nf_conntrack_*,
that is:

nf_ct_l4proto_register
nf_ct_l4proto_pernet_register
nf_ct_l4proto_unregister
nf_ct_l4proto_pernet_unregister

We same many line breaks with it.

Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2013-01-23 14:40:53 +01:00
Gao feng
6330750d56 netfilter: nf_conntrack: refactor l3proto support for netns
Move the code that register/unregister l3proto to the
module_init/exit context.

Given that we have to modify some interfaces to accomodate
these changes, it is a good time to use shorter function names
for this using the nf_ct_* prefix instead of nf_conntrack_*,
that is:

nf_ct_l3proto_register
nf_ct_l3proto_pernet_register
nf_ct_l3proto_unregister
nf_ct_l3proto_pernet_unregister

We same many line breaks with it.

Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2013-01-23 14:39:20 +01:00
Cong Wang
9647bb80a5 ipv6: remove duplicated declaration of ip6_fragment()
It is declared in:
include/net/ip6_route.h:187:int ip6_fragment(struct sk_buff *skb, int (*output)(struct sk_buff *));

and net/ip6_route.h is already included.

Cc: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-01-22 23:18:59 -05:00
YOSHIFUJI Hideaki / 吉藤英明
0cc8d8df9b netfilter: Use IS_ERR_OR_NULL().
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-01-22 14:28:29 -05:00
YOSHIFUJI Hideaki / 吉藤英明
3f0d2ba0bd ipv6: Use IS_ERR_OR_NULL().
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-01-22 14:28:28 -05:00
David S. Miller
0c8729c9b9 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec
Steffen Klassert says:

====================
1) The transport header did not point to the right place after
   esp/ah processing on tunnel mode in the receive path. As a
   result, the ECN field of the inner header was not set correctly,
   fixes from Li RongQing.

2) We did a null check too late in one of the xfrm_replay advance
   functions. This can lead to a division by zero, fix from
   Nickolai Zeldovich.

3) The size calculation of the hash table missed the muiltplication
   with the actual struct size when the hash table is freed.
   We might call the wrong free function, fix from Michal Kubecek.

4) On IPsec pmtu events we can't access the transport headers of
   the original packet, so force a relookup for all routes
   to notify about the pmtu event.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2013-01-22 14:20:28 -05:00
YOSHIFUJI Hideaki / 吉藤英明
b820bb6b99 ndisc: Do not try to update "updated" time if neighbour has already gone.
Commit 2152caea ("ipv6: Do not depend on rt->n in rt6_probe().")
introduce a bug to try to update "updated" time in neighbour
structure.
Update the "updated" time only if neighbour is available.

Bug was found by Dan Carpenter <dan.carpenter@oracle.com>

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-01-21 15:41:41 -05:00
Nicolas Dichtel
660b26dc1a mcast: add multicast proxy support (IPv4 and IPv6)
This patch add the support of proxy multicast, ie being able to build a static
multicast tree. It adds the support of (*,*) and (*,G) entries.

The user should define an (*,*) entry which is not used for real forwarding.
This entry defines the upstream in iif and contains all interfaces from the
static tree in its oifs. It will be used to forward packet upstream when they
come from an interface belonging to the static tree.
Hence, the user should define (*,G) entries to build its static tree. Note that
upstream interface must be part of oifs: packets are sent to all oifs
interfaces except the input interface. This ensures to always join the whole
static tree, even if the packet is not coming from the upstream interface.

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Acked-by: David L Stevens <dlstevens@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-01-21 13:55:14 -05:00
YOSHIFUJI Hideaki / 吉藤英明
4d5c152e86 ndisc: Use compound literals to build redirect message.
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-01-21 13:33:18 -05:00
YOSHIFUJI Hideaki / 吉藤英明
1cb3fe513f ndisc: Break down ndisc_build_skb() and build message directly.
Construct NS/NA/RS message directly using C99 compound literals.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-01-21 13:33:18 -05:00
YOSHIFUJI Hideaki / 吉藤英明
b44b5f4ae9 ndisc: Break down __ndisc_send().
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-01-21 13:33:17 -05:00
YOSHIFUJI Hideaki / 吉藤英明
7b3d9b06d8 ndisc: Fill in ICMPv6 checksum and IPv6 header in ndisc_send_skb().
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-01-21 13:33:17 -05:00
YOSHIFUJI Hideaki / 吉藤英明
f4de84c64e ndisc: Use ndisc_send_skb() for redirect.
Reuse dst if one is attached with skb.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-01-21 13:33:17 -05:00
YOSHIFUJI Hideaki / 吉藤英明
aa4bdd4b3f ndisc: Remove icmp6h argument from ndisc_send_skb().
skb_transport_header() (thus icmp6_hdr()) is available here,
use it.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-01-21 13:33:17 -05:00
YOSHIFUJI Hideaki / 吉藤英明
5f5a011563 ndisc: Make ndisc_fill_xxx_option() for sk_buff.
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-01-21 13:33:16 -05:00
YOSHIFUJI Hideaki / 吉藤英明
2ce1357614 ndisc: Calculate message body length and option length separately.
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-01-21 13:33:16 -05:00
YOSHIFUJI Hideaki / 吉藤英明
5135e633f9 ndisc: Reset skb->trasport_headner inside ndisc_alloc_send_skb().
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-01-21 13:33:16 -05:00
YOSHIFUJI Hideaki / 吉藤英明
527a150fb2 ndisc: Defer building IPv6 header.
Build ICMPv6 message first and make buffer management easier;
we can use skb->len when filling checksum in ICMPv6 header,
and then build IP header with length field.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-01-21 13:33:16 -05:00
YOSHIFUJI Hideaki / 吉藤英明
af9a997629 ndisc: Remove dev argument for ndisc_send_skb().
Since we have skb->dev, use it.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-01-21 13:33:15 -05:00
YOSHIFUJI Hideaki / 吉藤英明
f382d03ad0 ndisc: Set skb->dev and skb->protocol inside ndisc_alloc_skb().
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-01-21 13:33:15 -05:00
YOSHIFUJI Hideaki / 吉藤英明
c8d6c380d9 ndisc: Simplify arguments for ip6_nd_hdr().
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-01-21 13:33:15 -05:00
YOSHIFUJI Hideaki / 吉藤英明
2576f17dfa ipv6: Unshare ip6_nd_hdr() and change return type to void.
- move ip6_nd_hdr() to its users' source files.
  In net/ipv6/mcast.c, it will be called ip6_mc_hdr().
- make return type to void since this function never fails.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-01-21 13:33:15 -05:00
YOSHIFUJI Hideaki / 吉藤英明
de09334b93 ndisc: Introduce ndisc_alloc_skb() helper.
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-01-21 13:33:15 -05:00
YOSHIFUJI Hideaki / 吉藤英明
9c86dafe94 ndisc: Introduce ndisc_fill_redirect_hdr_option().
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-01-21 13:33:14 -05:00
YOSHIFUJI Hideaki / 吉藤英明
6bce6b4e16 ndisc: Use skb_linearize() instead of pskb_may_pull(skb, skb->len).
Suggested by Eric Dumazet <edumazet@google.com>.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-01-21 13:33:14 -05:00
YOSHIFUJI Hideaki / 吉藤英明
c558e9fca8 ndisc: Move ndisc_opt_addr_space() to include/net/ndisc.h.
This also makes ndisc_opt_addr_data() and ndisc_fill_addr_option()
use ndisc_opt_addr_space().

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-01-21 13:33:14 -05:00
YOSHIFUJI Hideaki / 吉藤英明
315ff09dba ndisc: Reduce number of arguments for ndisc_fill_addr_option().
Add pointer to struct net_device (dev) and remove
data_len (= dev->addr_len) and addr_type (= dev->type).

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-01-21 13:33:14 -05:00
YOSHIFUJI Hideaki / 吉藤英明
fb568637e5 ndisc: Make several arguments for ndisc_send_na() boolean.
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-01-20 22:29:49 -05:00
YOSHIFUJI Hideaki / 吉藤英明
ca97a644d7 ipv6: Introduce ipv6_addr_is_solict_mult() to check Solicited Node Multicast Addresses.
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-01-20 22:29:49 -05:00
Hannes Frederic Sowa
1ad759d847 ipv6: remove unneeded check to pskb_may_pull in ipip6_rcv
This is already checked by the caller (tunnel64_rcv) and brings ipip6_rcv
in line with ipip_rcv.

Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-01-18 14:43:51 -05:00
YOSHIFUJI Hideaki / 吉藤英明
115b0aa6b4 ndisc: Check NS message length before access.
Check message length before accessing "target" field,
as we do for other types.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-01-18 14:41:13 -05:00
YOSHIFUJI Hideaki / 吉藤英明
12fd84f438 ipv6: Remove unused neigh argument for icmp6_dst_alloc() and its callers.
Because of rt->n removal, we do not need neigh argument any more.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-01-18 14:41:13 -05:00
Steffen Klassert
6f809da27c ipv6: Add an error handler for icmp6
pmtu and redirect events are now handled in the protocols error handler,
so add an error handler for icmp6 to do this. It is needed in the case
when we have no socket context. Based on a patch by Duan Jiong.

Reported-by: Duan Jiong <djduanjiong@gmail.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-01-18 14:19:42 -05:00
Greg Kroah-Hartman
ed408f7c0f Merge 3.9-rc4 into driver-core-next
This is to fix up a build problem with a wireless driver due to the
dynamic-debug patches in this branch.

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-01-17 19:48:18 -08:00
YOSHIFUJI Hideaki / 吉藤英明
887c95cc1d ipv6: Complete neighbour entry removal from dst_entry.
CC: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-01-17 18:38:19 -05:00
YOSHIFUJI Hideaki / 吉藤英明
6fd6ce2056 ipv6: Do not depend on rt->n in ip6_finish_output2().
If neigh is not found, create new one.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-01-17 18:38:19 -05:00
YOSHIFUJI Hideaki / 吉藤英明
707be1ff3d ipv6: Do not depend on rt->n in ip6_dst_lookup_tail().
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-01-17 18:38:19 -05:00
YOSHIFUJI Hideaki / 吉藤英明
2152caea71 ipv6: Do not depend on rt->n in rt6_probe().
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-01-17 18:38:19 -05:00
YOSHIFUJI Hideaki / 吉藤英明
145a36217a ipv6: Do not depend on rt->n in rt6_check_neigh().
CC: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-01-17 18:38:19 -05:00
YOSHIFUJI Hideaki / 吉藤英明
c440f1609b ipv6: Do not depend on rt->n in ip6_pol_route().
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-01-17 18:38:19 -05:00
YOSHIFUJI Hideaki / 吉藤英明
dd0cbf29b1 ipv6 route: Dump gateway based on RTF_GATEWAY flag and rt->rt6i_gateway.
Do not depend on rt->n.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-01-17 18:38:18 -05:00
YOSHIFUJI Hideaki / 吉藤英明
8e022ee63f ndisc: Remove tbl argument for __ipv6_neigh_lookup().
We can refer to nd_tbl directly.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-01-17 18:38:18 -05:00
YOSHIFUJI Hideaki / 吉藤英明
7ff74a596b ndisc: Update neigh->updated with write lock.
neigh->nud_state and neigh->updated are under protection of
neigh->lock.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-01-17 18:38:18 -05:00
Romain KUNTZ
7efdba5bd9 ipv6: fix header length calculation in ip6_append_data()
Commit 299b0767 (ipv6: Fix IPsec slowpath fragmentation problem)
has introduced a error in the header length calculation that
provokes corrupted packets when non-fragmentable extensions
headers (Destination Option or Routing Header Type 2) are used.

rt->rt6i_nfheader_len is the length of the non-fragmentable
extension header, and it should be substracted to
rt->dst.header_len, and not to exthdrlen, as it was done before
commit 299b0767.

This patch reverts to the original and correct behavior. It has
been successfully tested with and without IPsec on packets
that include non-fragmentable extensions headers.

Signed-off-by: Romain Kuntz <r.kuntz@ipflavors.com>
Acked-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-01-17 03:37:13 -05:00
David S. Miller
4b87f92259 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	Documentation/networking/ip-sysctl.txt
	drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c

Both conflicts were simply overlapping context.

A build fix for qlcnic is in here too, simply removing the added
devinit annotations which no longer exist.

Signed-off-by: David S. Miller <davem@davemloft.net>
2013-01-15 15:05:59 -05:00
YOSHIFUJI Hideaki / 吉藤英明
6059283378 ipv6 netevent: Remove old_neigh from netevent_redirect.
The only user is cxgb3 driver.

old_neigh is used to check device change, but it must not happen
on redirect.  In this sense, we can remove old_neigh argument.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-01-14 15:04:59 -05:00
YOSHIFUJI Hideaki / 吉藤英明
dd3332bfcb ipv6: Store Router Alert option in IP6CB directly.
Router Alert option is very small and we can store the value
itself in the skb.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-01-13 20:17:14 -05:00
YOSHIFUJI Hideaki / 吉藤英明
2b464f61f0 ipv6 xfrm: Use ipv6_addr_hash() in xfrm6_tunnel_spi_hash_byaddr().
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-01-13 20:17:14 -05:00
YOSHIFUJI Hideaki / 吉藤英明
c08977bb2b ipv6 route: Use ipv6_addr_hash() in rt6_info_hash_nhsfn().
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-01-13 20:17:14 -05:00
YOSHIFUJI Hideaki / 吉藤英明
daad151263 ipv6: Make ipv6_is_mld() inline and use it from ip6_mc_input().
Move generalized version of ipv6_is_mld() to header,
and use it from ip6_mc_input().

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-01-13 20:17:14 -05:00
YOSHIFUJI Hideaki / 吉藤英明
e7219858ac ipv6: Use ipv6_get_dsfield() instead of ipv6_tclass().
Commit 7a3198a8 ("ipv6: helper function to get tclass") introduced
ipv6_tclass(), but similar function is already available as
ipv6_get_dsfield().

We might be able to call ipv6_tclass() from ipv6_get_dsfield(),
but it is confusing to have two versions.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-01-13 20:17:14 -05:00
YOSHIFUJI Hideaki / 吉藤英明
6502ca527f ipv6: Introduce ip6_flowinfo() to extract flowinfo (tclass + flowlabel).
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-01-13 20:17:13 -05:00
YOSHIFUJI Hideaki / 吉藤英明
3e4e4c1f2d ipv6: Introduce ip6_flow_hdr() to fill version, tclass and flowlabel.
This is not only for readability but also for optimization.
What we do here is to build the 32bit word at the beginning of the ipv6
header (the "ip6_flow" virtual member of struct ip6_hdr in RFC3542) and
we do not need to read the tclass portion of the target buffer.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-01-13 20:17:13 -05:00
Kees Cook
f9ceb16ec4 net/ipv6: remove depends on CONFIG_EXPERIMENTAL
The CONFIG_EXPERIMENTAL config item has not carried much meaning for a
while now and is almost always enabled by default. As agreed during the
Linux kernel summit, remove it from any "depends on" lines in Kconfigs.

CC: "David S. Miller" <davem@davemloft.net>
CC: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
CC: James Morris <jmorris@namei.org>
CC: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
CC: Patrick McHardy <kaber@trash.net>
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: David S. Miller <davem@davemloft.net>
2013-01-11 11:40:01 -08:00
Romain Kuntz
21caa6622b ipv6: use addrconf_get_prefix_route for prefix route lookup [v2]
Replace ip6_route_lookup() with addrconf_get_prefix_route() when
looking up for a prefix route. This ensures that the connected prefix
is looked up in the main table, and avoids the selection of other
matching routes located in different tables as well as blackhole
or prohibited entries.

In addition, this fixes an Opps introduced by commit 64c6d08e (ipv6:
del unreachable route when an addr is deleted on lo), that would occur
when a blackhole or prohibited entry is selected by ip6_route_lookup().
Such entries have a NULL rt6i_table argument, which is accessed by
__ip6_del_rt() when trying to lock rt6i_table->tb6_lock.

The function addrconf_is_prefix_route() is not used anymore and is
removed.

[v2] Minor indentation cleanup and log updates.

Signed-off-by: Romain Kuntz <r.kuntz@ipflavors.com>
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-01-10 14:22:54 -08:00
Romain Kuntz
85da53bf1c ipv6: fix the noflags test in addrconf_get_prefix_route
The tests on the flags in addrconf_get_prefix_route() does no make
much sense: the 'noflags' parameter contains the set of flags that
must not match with the route flags, so the test must be done
against 'noflags', and not against 'flags'.

Signed-off-by: Romain Kuntz <r.kuntz@ipflavors.com>
Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-01-10 14:13:33 -08:00
YOSHIFUJI Hideaki / 吉藤英明
6c40d100ce ipv6: Use container_of macro instead of magic number to get ipv6 header.
In ipv6_recv_error(), addr_offset points to daddr field of the ip header.
To get ipv6 header, use container_of() macro instead of substracting magic
number (24).

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-01-09 23:59:53 -08:00
YOSHIFUJI Hideaki / 吉藤英明
ba96bcbcd2 ipv6: Use FIELD_SIZEOF() in inet6_init().
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-01-09 23:38:23 -08:00
Cong Wang
acb3e04119 ipv6: move csum_ipv6_magic() and udp6_csum_init() into static library
As suggested by David, udp6_csum_init() is too big to be inlined,
move it to ipv6 static library, net/ipv6/ip6_checksum.c.

And the generic csum_ipv6_magic() too.

Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-01-08 17:56:10 -08:00
Li RongQing
a9403f8aeb ah6/esp6: set transport header correctly for IPsec tunnel mode.
IPsec tunnel does not set ECN field to CE in inner header when
the ECN field in the outer header is CE, and the ECN field in
the inner header is ECT(0) or ECT(1).

The cause is ipip6_hdr() does not return the correct address of
inner header since skb->transport-header is not the inner header
after esp6_input_done2(), or ah6_input().

Signed-off-by: Li RongQing <roy.qing.li@gmail.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2013-01-08 12:41:30 +01:00
Hannes Frederic Sowa
5d134f1c1f tcp: make sysctl_tcp_ecn namespace aware
As per suggestion from Eric Dumazet this patch makes tcp_ecn sysctl
namespace aware.  The reason behind this patch is to ease the testing
of ecn problems on the internet and allows applications to tune their
own use of ecn.

Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: David Miller <davem@davemloft.net>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-01-06 21:09:56 -08:00
YOSHIFUJI Hideaki / 吉藤英明
71bcdba06d ndisc: Use struct rd_msg for redirect message.
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-01-06 21:08:38 -08:00
YOSHIFUJI Hideaki / 吉藤英明
b7dc8c3959 ndisc: Remove unused space at tail of skb for ndisc messages. (TAKE 3)
Currently, the size of skb allocated for NDISC is MAX_HEADER +
LL_RESERVED_SPACE(dev) + packet length + dev->needed_tailroom,
but only LL_RESERVED_SPACE(dev) bytes is "reserved" for headers.
As a result, the skb looks like this (after construction of the
message):

head       data                   tail                       end
+--------------------------------------------------------------+
+           |                      |          |                |
+--------------------------------------------------------------+
|<-hlen---->|<---ipv6 packet------>|<--tlen-->|<--MAX_HEADER-->|
    =LL_                               = dev
     RESERVED_                           ->needed_
     SPACE(dev)                            tailroom

As the name implies, "MAX_HEADER" is used for headers, and should
be "reserved" in prior to packet construction.  Or, if some space
is really required at the tail of ther skb, it should be
explicitly documented.

We have several option after construction of NDISC message:

Option 1:

head       data                   tail       end
+---------------------------------------------+
+           |                      |          |
+---------------------------------------------+
|<-hlen---->|<---ipv6 packet------>|<--tlen-->|
   =LL_                                = dev
    RESERVED_                           ->needed_
    SPACE(dev)                            tailroom

Option 2:

head            data                   tail       end
+--------------------------------------------------+
+                |                      |          |
+--------------------------------------------------+
|<--MAX_HEADER-->|<---ipv6 packet------>|<--tlen-->|
                                            = dev
                                             ->needed_
                                               tailroom

Option 3:

head                        data                   tail       end
+--------------------------------------------------------------+
+                |           |                      |          |
+--------------------------------------------------------------+
|<--MAX_HEADER-->|<-hlen---->|<---ipv6 packet------>|<--tlen-->|
                    =LL_                                = dev
                     RESERVED_                          ->needed_
                     SPACE(dev)                           tailroom

Our tunnel drivers try expanding headroom and the space for tunnel
encapsulation was not a mandatory space -- so we are not seeing
bugs here --, but just for optimization for performance critial
situations.

Since NDISC messages are not performance critical unlike TCP,
and as we know outgoing device, LL_RESERVED_SPACE(dev) should be
just enough for the device in most (if not all) cases:
  LL_RESERVED_SPACE(dev) <= LL_MAX_HEADER <= MAX_HEADER
Note that LL_RESERVED_SPACE(dev) is also enough for NDISC over
SIT (e.g., ISATAP).

So, I think Option 1 is just fine here.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-01-04 15:16:44 -08:00
Ulrich Weber
429da4c0b1 netfilter: ip6t_NPT: fix IPv6 NTP checksum calculation
csum16_add() has a broken carry detection, should be:
sum += sum < (__force u16)b;

Instead of fixing csum16_add, remove the custom checksum
functions and use the generic csum_add/csum_sub ones.

Signed-off-by: Ulrich Weber <ulrich.weber@sophos.com>
Acked-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2013-01-04 20:03:02 +01:00
David S. Miller
ac196f8c92 Merge branch 'master' of git://1984.lsi.us.es/nf
Pablo Neira Ayuso says:

====================
The following batch contains Netfilter fixes for 3.8-rc1. They are
a mixture of old bugs that have passed unnoticed (I'll pass these to
stable) and more fresh ones from the previous merge window, they are:

* Fix for MAC address in 6in4 tunnels via NFLOG that results in ulogd
  showing up wrong address, from Bob Hockney.

* Fix a comment in nf_conntrack_ipv6, from Florent Fourcot.

* Fix a leak an error path in ctnetlink while creating an expectation,
  from Jesper Juhl.

* Fix missing ICMP time exceeded in the IPv6 defragmentation code, from
  Haibo Xi.

* Fix inconsistent handling of routing changes in MASQUERADE for the
  new connections case, from Andrew Collins.

* Fix a missing skb_reset_transport in ip[6]t_REJECT that leads to
  crashes in the ixgbe driver (since it seems to access the transport
  header with TSO enabled), from Mukund Jampala.

* Recover obsoleted NOTRACK target by including it into the CT and spot
  a warning via printk about being obsoleted. Many people don't check the
  scheduled to be removal file under Documentation, so we follow some
  less agressive approach to kill this in a year or so. Spotted by Florian
  Westphal, patch from myself.

* Fix race condition in xt_hashlimit that allows to create two or more
  entries, from myself.

* Fix crash if the CT is used due to the recently added facilities to
  consult the dying and unconfirmed conntrack lists, from myself.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-12-28 14:28:17 -08:00
Isaku Yamahata
ae782bb16c ipv6/ip6_gre: set transport header correctly
ip6gre_xmit2() incorrectly sets transport header to inner payload
instead of GRE header. It seems copy-and-pasted from ipip.c.
Set transport header to gre header.
(In ipip case the transport header is the inner ip header, so that's
correct.)

Found by inspection. In practice the incorrect transport header
doesn't matter because the skb usually is sent to another net_device
or socket, so the transport header isn't referenced.

Signed-off-by: Isaku Yamahata <yamahata@valinux.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-12-26 15:19:56 -08:00
Cong Ding
bd7790286b ipv6: addrconf.c: remove unnecessary "if"
the value of err is always negative if it goes to errout, so we don't need to
check the value of err.

Signed-off-by: Cong Ding <dinggnu@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-12-19 12:50:06 -08:00
Haibo Xi
97cf00e93c netfilter: nf_ct_reasm: fix conntrack reassembly expire code
Commit b836c99fd6 (ipv6: unify conntrack reassembly expire
code with standard one) use the standard IPv6 reassembly
code(ip6_expire_frag_queue) to handle conntrack reassembly expire.

In ip6_expire_frag_queue, it invoke dev_get_by_index_rcu to get
which device received this expired packet.so we must save ifindex
when NF_conntrack get this packet.

With this patch applied, I can see ICMP Time Exceeded sent
from the receiver when the sender sent out 1/2 fragmented
IPv6 packet.

Signed-off-by: Haibo Xi <haibbo@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2012-12-16 23:41:25 +01:00
Florent Fourcot
d7a769ff0e netfilter: nf_conntrack_ipv6: fix comment for packets without data
Remove ambiguity of double negation.

Signed-off-by: Florent Fourcot <florent.fourcot@enst-bretagne.fr>
Acked-by: Rick Jones <rick.jones2@hp.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2012-12-16 23:28:31 +01:00
Andrew Collins
c65ef8dc7b netfilter: nf_nat: Also handle non-ESTABLISHED routing changes in MASQUERADE
Since (a0ecb85 netfilter: nf_nat: Handle routing changes in MASQUERADE
target), the MASQUERADE target handles routing changes which affect
the output interface of a connection, but only for ESTABLISHED
connections.  It is also possible for NEW connections which
already have a conntrack entry to be affected by routing changes.

This adds a check to drop entries in the NEW+conntrack state
when the oif has changed.

Signed-off-by: Andrew Collins <bsderandrew@gmail.com>
Acked-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2012-12-16 23:28:30 +01:00
Mukund Jampala
c6f408996c netfilter: ip[6]t_REJECT: fix wrong transport header pointer in TCP reset
The problem occurs when iptables constructs the tcp reset packet.
It doesn't initialize the pointer to the tcp header within the skb.
When the skb is passed to the ixgbe driver for transmit, the ixgbe
driver attempts to access the tcp header and crashes.
Currently, other drivers (such as our 1G e1000e or igb drivers) don't
access the tcp header on transmit unless the TSO option is turned on.

<1>BUG: unable to handle kernel NULL pointer dereference at 0000000d
<1>IP: [<d081621c>] ixgbe_xmit_frame_ring+0x8cc/0x2260 [ixgbe]
<4>*pdpt = 0000000085e5d001 *pde = 0000000000000000
<0>Oops: 0000 [#1] SMP
[...]
<4>Pid: 0, comm: swapper Tainted: P            2.6.35.12 #1 Greencity/Thurley
<4>EIP: 0060:[<d081621c>] EFLAGS: 00010246 CPU: 16
<4>EIP is at ixgbe_xmit_frame_ring+0x8cc/0x2260 [ixgbe]
<4>EAX: c7628820 EBX: 00000007 ECX: 00000000 EDX: 00000000
<4>ESI: 00000008 EDI: c6882180 EBP: dfc6b000 ESP: ced95c48
<4> DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
<0>Process swapper (pid: 0, ti=ced94000 task=ced73bd0 task.ti=ced94000)
<0>Stack:
<4> cbec7418 c779e0d8 c77cc888 c77cc8a8 0903010a 00000000 c77c0008 00000002
<4><0> cd4997c0 00000010 dfc6b000 00000000 d0d176c9 c77cc8d8 c6882180 cbec7318
<4><0> 00000004 00000004 cbec7230 cbec7110 00000000 cbec70c0 c779e000 00000002
<0>Call Trace:
<4> [<d0d176c9>] ? 0xd0d176c9
<4> [<d0d18a4d>] ? 0xd0d18a4d
<4> [<411e243e>] ? dev_hard_start_xmit+0x218/0x2d7
<4> [<411f03d7>] ? sch_direct_xmit+0x4b/0x114
<4> [<411f056a>] ? __qdisc_run+0xca/0xe0
<4> [<411e28b0>] ? dev_queue_xmit+0x2d1/0x3d0
<4> [<411e8120>] ? neigh_resolve_output+0x1c5/0x20f
<4> [<411e94a1>] ? neigh_update+0x29c/0x330
<4> [<4121cf29>] ? arp_process+0x49c/0x4cd
<4> [<411f80c9>] ? nf_hook_slow+0x3f/0xac
<4> [<4121ca8d>] ? arp_process+0x0/0x4cd
<4> [<4121ca8d>] ? arp_process+0x0/0x4cd
<4> [<4121c6d5>] ? T.901+0x38/0x3b
<4> [<4121c918>] ? arp_rcv+0xa3/0xb4
<4> [<4121ca8d>] ? arp_process+0x0/0x4cd
<4> [<411e1173>] ? __netif_receive_skb+0x32b/0x346
<4> [<411e19e1>] ? netif_receive_skb+0x5a/0x5f
<4> [<411e1ea9>] ? napi_skb_finish+0x1b/0x30
<4> [<d0816eb4>] ? ixgbe_xmit_frame_ring+0x1564/0x2260 [ixgbe]
<4> [<41013468>] ? lapic_next_event+0x13/0x16
<4> [<410429b2>] ? clockevents_program_event+0xd2/0xe4
<4> [<411e1b03>] ? net_rx_action+0x55/0x127
<4> [<4102da1a>] ? __do_softirq+0x77/0xeb
<4> [<4102dab1>] ? do_softirq+0x23/0x27
<4> [<41003a67>] ? do_IRQ+0x7d/0x8e
<4> [<41002a69>] ? common_interrupt+0x29/0x30
<4> [<41007bcf>] ? mwait_idle+0x48/0x4d
<4> [<4100193b>] ? cpu_idle+0x37/0x4c
<0>Code: df 09 d7 0f 94 c2 0f b6 d2 e9 e7 fb ff ff 31 db 31 c0 e9 38
ff ff ff 80 78 06 06 0f 85 3e fb ff ff 8b 7c 24 38 8b 8f b8 00 00 00
<0f> b6 51 0d f6 c2 01 0f 85 27 fb ff ff 80 e2 02 75 0d 8b 6c 24
<0>EIP: [<d081621c>] ixgbe_xmit_frame_ring+0x8cc/0x2260 [ixgbe] SS:ESP

Signed-off-by: Mukund Jampala <jbmukund@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2012-12-16 23:27:35 +01:00
Simon Arlott
df48419107 ipv6: Fix Makefile offload objects
The following commit breaks IPv6 TCP transmission for me:
	Commit 75fe83c322
	Author: Vlad Yasevich <vyasevic@redhat.com>
	Date:   Fri Nov 16 09:41:21 2012 +0000
	ipv6: Preserve ipv6 functionality needed by NET

This patch fixes the typo "ipv6_offload" which should be
"ipv6-offload".

I don't know why not including the offload modules should
break TCP. Disabling all offload options on the NIC didn't
help. Outgoing pulseaudio traffic kept stalling.

Signed-off-by: Simon Arlott <simon@fire.lp0.eu>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-12-16 09:15:53 -08:00
Christoph Paasch
e337e24d66 inet: Fix kmemleak in tcp_v4/6_syn_recv_sock and dccp_v4/6_request_recv_sock
If in either of the above functions inet_csk_route_child_sock() or
__inet_inherit_port() fails, the newsk will not be freed:

unreferenced object 0xffff88022e8a92c0 (size 1592):
  comm "softirq", pid 0, jiffies 4294946244 (age 726.160s)
  hex dump (first 32 bytes):
    0a 01 01 01 0a 01 01 02 00 00 00 00 a7 cc 16 00  ................
    02 00 03 01 00 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace:
    [<ffffffff8153d190>] kmemleak_alloc+0x21/0x3e
    [<ffffffff810ab3e7>] kmem_cache_alloc+0xb5/0xc5
    [<ffffffff8149b65b>] sk_prot_alloc.isra.53+0x2b/0xcd
    [<ffffffff8149b784>] sk_clone_lock+0x16/0x21e
    [<ffffffff814d711a>] inet_csk_clone_lock+0x10/0x7b
    [<ffffffff814ebbc3>] tcp_create_openreq_child+0x21/0x481
    [<ffffffff814e8fa5>] tcp_v4_syn_recv_sock+0x3a/0x23b
    [<ffffffff814ec5ba>] tcp_check_req+0x29f/0x416
    [<ffffffff814e8e10>] tcp_v4_do_rcv+0x161/0x2bc
    [<ffffffff814eb917>] tcp_v4_rcv+0x6c9/0x701
    [<ffffffff814cea9f>] ip_local_deliver_finish+0x70/0xc4
    [<ffffffff814cec20>] ip_local_deliver+0x4e/0x7f
    [<ffffffff814ce9f8>] ip_rcv_finish+0x1fc/0x233
    [<ffffffff814cee68>] ip_rcv+0x217/0x267
    [<ffffffff814a7bbe>] __netif_receive_skb+0x49e/0x553
    [<ffffffff814a7cc3>] netif_receive_skb+0x50/0x82

This happens, because sk_clone_lock initializes sk_refcnt to 2, and thus
a single sock_put() is not enough to free the memory. Additionally, things
like xfrm, memcg, cookie_values,... may have been initialized.
We have to free them properly.

This is fixed by forcing a call to tcp_done(), ending up in
inet_csk_destroy_sock, doing the final sock_put(). tcp_done() is necessary,
because it ends up doing all the cleanup on xfrm, memcg, cookie_values,
xfrm,...

Before calling tcp_done, we have to set the socket to SOCK_DEAD, to
force it entering inet_csk_destroy_sock. To avoid the warning in
inet_csk_destroy_sock, inet_num has to be set to 0.
As inet_csk_destroy_sock does a dec on orphan_count, we first have to
increase it.

Calling tcp_done() allows us to remove the calls to
tcp_clear_xmit_timer() and tcp_cleanup_congestion_control().

A similar approach is taken for dccp by calling dccp_done().

This is in the kernel since 093d282321 (tproxy: fix hash locking issue
when using port redirection in __inet_inherit_port()), thus since
version >= 2.6.37.

Signed-off-by: Christoph Paasch <christoph.paasch@uclouvain.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-12-14 13:14:07 -05:00
Duan Jiong
093d04d42f ipv6: Change skb->data before using icmpv6_notify() to propagate redirect
In function ndisc_redirect_rcv(), the skb->data points to the transport
header, but function icmpv6_notify() need the skb->data points to the
inner IP packet. So before using icmpv6_notify() to propagate redirect,
change skb->data to point the inner IP packet that triggered the sending
of the Redirect, and introduce struct rd_msg to make it easy.

Signed-off-by: Duan Jiong <djduanjiong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-12-14 13:14:07 -05:00
YOSHIFUJI Hideaki / 吉藤英明
7bdc1b4aba ndisc: Fix padding error in link-layer address option.
If a natural number n exists where 2 + data_len <= 8n < 2 + data_len + pad,
post padding is not initialized correctly.

(Un)fortunately, the only type that requires pad is Infiniband,
whose pad is 2 and data_len is 20, and this logical error has not
become obvious, but it is better to fix.

Note that ndisc_opt_addr_space() handles the situation described
above correctly.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-12-13 12:58:11 -05:00
YOSHIFUJI Hideaki
fd0ea7dbfa ndisc: Unexport ndisc_{build,send}_skb().
These symbols were exported for bonding device by commit 305d552a
("bonding: send IPv6 neighbor advertisement on failover").

It bacame obsolete by commit 7c899432 ("bonding, ipv4, ipv6, vlan: Handle
NETDEV_BONDING_FAILOVER like NETDEV_NOTIFY_PEERS") and removed by
commit 4f5762ec ("bonding: Remove obsolete source file 'bond_ipv6.c'").

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-12-12 12:42:29 -05:00
Eric Dumazet
0e1efe9d5e ipv6: avoid taking locks at socket dismantle
ipv6_sock_mc_close() is called for ipv6 sockets at close time, and most
of them don't use multicast.

Add a test to avoid contention on a shared spinlock.

Same heuristic applies for ipv6_sock_ac_close(), to avoid contention
on a shared rwlock.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-12-05 16:01:28 -05:00
David S. Miller
b1afce9538 ipv6: Protect ->mc_forwarding access with CONFIG_IPV6_MROUTE
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-12-04 14:46:34 -05:00
Nicolas Dichtel
193c1e478c ip6mr: fix rtm_family of rtnl msg
We talk about IPv6, hence the family is RTNL_FAMILY_IP6MR!
rtnl_register() is already called with RTNL_FAMILY_IP6MR.

The bug is here since the beginning of this function (commit 5b285cac35).

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-12-04 13:27:24 -05:00
Nicolas Dichtel
812e44dd18 ip6mr: advertise new mfc entries via rtnl
This patch allows to monitor mf6c activities via rtnetlink.
To avoid parsing two times the mf6c oifs, we use maxvif to allocate the rtnl
msg, thus we may allocate some superfluous space.

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-12-04 13:08:11 -05:00
Nicolas Dichtel
1eb99af52c ipmr/ip6mr: allow to get unresolved cache via netlink
/proc/net/ip[6]_mr_cache allows to get all mfc entries, even if they are put in
the unresolved list (mfc[6]_unres_queue). But only the table RT_TABLE_DEFAULT is
displayed.
This patch adds the parsing of the unresolved list when the dump is made via
rtnetlink, hence each table can be checked.

In IPv6, we set rtm_type in ip6mr_fill_mroute(), because in case of unresolved
mfc __ip6mr_fill_mroute() will not set it. In IPv4, it is already done.

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-12-04 13:08:11 -05:00
Nicolas Dichtel
9a68ac72a4 ipmr/ip6mr: report origin of mfc entry into rtnl msg
A mfc entry can be static or not (added via the mroute_sk socket). The patch
reports MFC_STATIC flag into rtm_protocol by setting rtm_protocol to
RTPROT_STATIC or RTPROT_MROUTED.

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-12-04 13:08:11 -05:00
Nicolas Dichtel
adfa85e45d ipmr/ip6mr: advertise mfc stats via rtnetlink
These statistics can be checked only via /proc/net/ip_mr_cache or
SIOCGETSGCNT[_IN6] and thus only for the table RT_TABLE_DEFAULT.
Advertising them via rtnetlink allows to get statistics for all cache entries,
whatever the table is.

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-12-04 13:08:10 -05:00
Nicolas Dichtel
70b386a0cc ip6mr: use nla_nest_* helpers
This patch removes the skb manipulations when nested attributes are added by
using standard helpers.

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-12-04 13:08:10 -05:00
Nicolas Dichtel
d67b8c616b netconf: advertise mc_forwarding status
This patch advertise the MC_FORWARDING status for IPv4 and IPv6.
This field is readonly, only multicast engine in the kernel updates it.

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-12-04 13:08:10 -05:00
David S. Miller
e8ad1a8fab Merge branch 'master' of git://1984.lsi.us.es/nf-next
Pablo Neira Ayuso says:

====================
* Remove limitation in the maximum number of supported sets in ipset.
  Now ipset automagically increments the number of slots in the array
  of sets by 64 new spare slots, from Jozsef Kadlecsik.

* Partially remove the generic queue infrastructure now that ip_queue
  is gone. Its only client is nfnetlink_queue now, from Florian
  Westphal.

* Add missing attribute policy checkings in ctnetlink, from Florian
  Westphal.

* Automagically kill conntrack entries that use the wrong output
  interface for the masquerading case in case of routing changes,
  from Jozsef Kadlecsik.

* Two patches two improve ct object traceability. Now ct objects are
  always placed in any of the existing lists. This allows us to dump
  the content of unconfirmed and dying conntracks via ctnetlink as
  a way to provide more instrumentation in case you suspect leaks,
  from myself.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-12-04 13:01:19 -05:00
Paul Marks
a5a81f0b90 ipv6: Fix default route failover when CONFIG_IPV6_ROUTER_PREF=n
I believe this commit from 2008 was incorrect:
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=commitdiff;h=398bcbebb6f721ac308df1e3d658c0029bb74503

When CONFIG_IPV6_ROUTER_PREF is disabled, the kernel should follow
RFC4861 section 6.3.6: if no route is NUD_VALID, then traffic should be
sprayed across all routers (indirectly triggering NUD) until one of them
becomes NUD_VALID.

However, the following experiment demonstrates that this does not work:

1) Connect to an IPv6 network.
2) Change the router's MAC (and link-local) address.

The kernel will lock onto the first router and never try the new one, even
if the first becomes unreachable.  This patch fixes the problem by
allowing rt6_check_neigh() to return 0; if all routers return 0, then
rt6_select() will fall back to round-robin behavior.

This patch should have no effect when CONFIG_IPV6_ROUTER_PREF=y.

Note that rt6_check_neigh() is only used in a boolean context, so I've
changed its return type accordingly.

Signed-off-by: Paul Marks <pmarks@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-12-03 15:34:47 -05:00
Shmulik Ladkani
9ba2add3cf ipv6: Make 'addrconf_rs_timer' send Router Solicitations (and re-arm itself) if Router Advertisements are accepted
As of 026359b [ipv6: Send ICMPv6 RSes only when RAs are accepted],
Router Solicitations are sent whenever kernel accepts Router
Advertisements on the interface.

However, this logic isn't reflected in 'addrconf_rs_timer'.

The timer fails to issue subsequent RS messages (and fails to re-arm
itself) if forwarding is enabled and the special hybrid mode is
enabled (accept_ra=2).

Fix the condition determining whether next RS should be sent, by using
'ipv6_accept_ra()'.

Reported-by: Ami Koren <amikoren@yahoo.com>
Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-12-03 13:59:57 -05:00
Jozsef Kadlecsik
a0ecb85a2c netfilter: nf_nat: Handle routing changes in MASQUERADE target
When the route changes (backup default route, VPNs) which affect a
masqueraded target, the packets were sent out with the outdated source
address. The patch addresses the issue by comparing the outgoing interface
directly with the masqueraded interface in the nat table.

Events are inefficient in this case, because it'd require adding route
events to the network core and then scanning the whole conntrack table
and re-checking the route for all entry.

Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2012-12-03 15:14:20 +01:00
Shmulik Ladkani
aeaf6e9d2f ipv6: unify logic evaluating inet6_dev's accept_ra property
As of 026359b [ipv6: Send ICMPv6 RSes only when RAs are accepted], the
logic determining whether to send Router Solicitations is identical
to the logic determining whether kernel accepts Router Advertisements.

However the condition itself is repeated in several code locations.

Unify it by introducing 'ipv6_accept_ra()' accessor.

Also, simplify the condition expression, making it more readable.
No semantic change.

Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-12-01 11:36:37 -05:00
Eric Dumazet
ce43b03e88 net: move inet_dport/inet_num in sock_common
commit 68835aba4d (net: optimize INET input path further)
moved some fields used for tcp/udp sockets lookup in the first cache
line of struct sock_common.

This patch moves inet_dport/inet_num as well, filling a 32bit hole
on 64 bit arches and reducing number of cache line misses in lookups.

Also change INET_MATCH()/INET_TW_MATCH() to perform the ports match
before addresses match, as this check is more discriminant.

Remove the hash check from MATCH() macros because we dont need to
re validate the hash value after taking a refcount on socket, and
use likely/unlikely compiler hints, as the sk_hash/hash check
makes the following conditional tests 100% predicted by cpu.

Introduce skc_addrpair/skc_portpair pair values to better
document the alignment requirements of the port/addr pairs
used in the various MATCH() macros, and remove some casts.

The namespace check can also be done at last.

This slightly improves TCP/UDP lookup times.

IP/TCP early demux needs inet->rx_dst_ifindex and
TCP needs inet->min_ttl, lets group them together in same cache line.

With help from Ben Hutchings & Joe Perches.

Idea of this patch came after Ling Ma proposal to move skc_hash
to the beginning of struct sock_common, and should allow him
to submit a final version of his patch. My tests show an improvement
doing so.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Ben Hutchings <bhutchings@solarflare.com>
Cc: Joe Perches <joe@perches.com>
Cc: Ling Ma <ling.ma.program@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-11-30 15:02:56 -05:00
David S. Miller
e7165030db Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/jesse/openvswitch
Conflicts:
	net/ipv6/exthdrs_core.c

Jesse Gross says:

====================
This series of improvements for 3.8/net-next contains four components:
 * Support for modifying IPv6 headers
 * Support for matching and setting skb->mark for better integration with
   things like iptables
 * Ability to recognize the EtherType for RARP packets
 * Two small performance enhancements

The movement of ipv6_find_hdr() into exthdrs_core.c causes two small merge
conflicts.  I left it as is but can do the merge if you want.  The conflicts
are:
 * ipv6_find_hdr() and ipv6_find_tlv() were both moved to the bottom of
   exthdrs_core.c.  Both should stay.
 * A new use of ipv6_find_hdr() was added to net/netfilter/ipvs/ip_vs_core.c
   after this patch.  The IPVS user has two instances of the old constant
   name IP6T_FH_F_FRAG which has been renamed to IP6_FH_F_FRAG.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-11-30 12:01:30 -05:00
Nicolas Dichtel
f4e0b4c5e1 ip6tnl/sit: drop packet if ECN present with not-ECT
This patch reports the change made by Stephen Hemminger in ipip and gre[6] in
commit eccc1bb8d4 (tunnel: drop packet if ECN present with not-ECT).

Goal is to handle RFC6040, Section 4.2:

Default Tunnel Egress Behaviour.
 o If the inner ECN field is Not-ECT, the decapsulator MUST NOT
      propagate any other ECN codepoint onwards.  This is because the
      inner Not-ECT marking is set by transports that rely on dropped
      packets as an indication of congestion and would not understand or
      respond to any other ECN codepoint [RFC4774].  Specifically:

      *  If the inner ECN field is Not-ECT and the outer ECN field is
         CE, the decapsulator MUST drop the packet.

      *  If the inner ECN field is Not-ECT and the outer ECN field is
         Not-ECT, ECT(0), or ECT(1), the decapsulator MUST forward the
         outgoing packet with the ECN field cleared to Not-ECT.

The patch takes benefits from common function added in net/inet_ecn.h.

Like it was done for Xin4 tunnels, it adds logging to allow detecting broken
systems that set ECN bits incorrectly when tunneling (or an intermediate
router might be changing the header). Errors are also tracked via
rx_frame_error.

CC: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-11-28 11:37:11 -05:00
Joe Perches
03f52a0a55 ip6mr: Add sizeof verification to MRT6_ASSERT and MT6_PIM
Verify the length of the user-space arguments.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-11-26 17:35:58 -05:00
Joe Perches
53d6841d22 ipv4/ipmr and ipv6/ip6mr: Convert int mroute_do_<foo> to bool
Save a few bytes per table by convert mroute_do_assert and
mroute_do_pim from int to bool.

Remove !! as the compiler does that when assigning int to bool.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-11-25 16:34:17 -05:00
David S. Miller
24bc518a68 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/wireless/iwlwifi/pcie/tx.c

Minor iwlwifi conflict in TX queue disabling between 'net', which
removed a bogus warning, and 'net-next' which added some status
register poking code.

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-11-25 12:49:17 -05:00
Andrey Vagin
2b9164771e ipv6: adapt connect for repair move
This is work the same as for ipv4.

All other hacks about tcp repair are in common code for ipv4 and ipv6,
so this patch is enough for repairing ipv6 connections.

Cc: "David S. Miller" <davem@davemloft.net>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: James Morris <jmorris@namei.org>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: Patrick McHardy <kaber@trash.net>
Cc: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-11-22 15:30:14 -05:00
David S. Miller
242a18d137 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next
Steffen Klassert says:

====================
This pull request is intended for net-next and contains the following changes:

1) Remove a redundant check when initializing the xfrm replay functions,
   from Ulrich Weber.
2) Use a faster per-cpu helper when allocating ipcomt transforms,
   from Shan Wei.
3) Use a static gc threshold value for ipv6, simmilar to what we do
   for ipv4 now.
4) Remove a commented out function call.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-11-22 15:25:55 -05:00
Eric Dumazet
b4dd006760 ipv6: fix inet6_csk_update_pmtu() return value
In case of error, inet6_csk_update_pmtu() should consistently
return NULL.

Bug added in commit 35ad9b9cf7
(ipv6: Add helper inet6_csk_update_pmtu().)

Reported-by: Lluís Batlle i Rossell <viric@viric.name>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-11-20 15:16:15 -05:00
Nicolas Dichtel
e2f1f072db sit: allow to configure 6rd tunnels via netlink
This patch add the support of 6RD tunnels management via netlink.
Note that netdev_state_change() is now called when 6RD parameters are updated.

6RD parameters are updated only if there is at least one 6RD attribute.

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-11-20 13:43:28 -05:00
Eric W. Biederman
3594698a1f net: Make CAP_NET_BIND_SERVICE per user namespace
Allow privileged users in any user namespace to bind to
privileged sockets in network namespaces they control.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-11-18 20:33:37 -05:00
Eric W. Biederman
b51642f6d7 net: Enable a userns root rtnl calls that are safe for unprivilged users
- Only allow moving network devices to network namespaces you have
  CAP_NET_ADMIN privileges over.

- Enable creating/deleting/modifying interfaces
- Enable adding/deleting addresses
- Enable adding/setting/deleting neighbour entries
- Enable adding/removing routes
- Enable adding/removing fib rules
- Enable setting the forwarding state
- Enable adding/removing ipv6 address labels
- Enable setting bridge parameter

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-11-18 20:33:36 -05:00
Eric W. Biederman
c027aab4a6 net: Enable some sysctls that are safe for the userns root
- Enable the per device ipv4 sysctls:
   net/ipv4/conf/<if>/forwarding
   net/ipv4/conf/<if>/mc_forwarding
   net/ipv4/conf/<if>/accept_redirects
   net/ipv4/conf/<if>/secure_redirects
   net/ipv4/conf/<if>/shared_media
   net/ipv4/conf/<if>/rp_filter
   net/ipv4/conf/<if>/send_redirects
   net/ipv4/conf/<if>/accept_source_route
   net/ipv4/conf/<if>/accept_local
   net/ipv4/conf/<if>/src_valid_mark
   net/ipv4/conf/<if>/proxy_arp
   net/ipv4/conf/<if>/medium_id
   net/ipv4/conf/<if>/bootp_relay
   net/ipv4/conf/<if>/log_martians
   net/ipv4/conf/<if>/tag
   net/ipv4/conf/<if>/arp_filter
   net/ipv4/conf/<if>/arp_announce
   net/ipv4/conf/<if>/arp_ignore
   net/ipv4/conf/<if>/arp_accept
   net/ipv4/conf/<if>/arp_notify
   net/ipv4/conf/<if>/proxy_arp_pvlan
   net/ipv4/conf/<if>/disable_xfrm
   net/ipv4/conf/<if>/disable_policy
   net/ipv4/conf/<if>/force_igmp_version
   net/ipv4/conf/<if>/promote_secondaries
   net/ipv4/conf/<if>/route_localnet

- Enable the global ipv4 sysctl:
   net/ipv4/ip_forward

- Enable the per device ipv6 sysctls:
   net/ipv6/conf/<if>/forwarding
   net/ipv6/conf/<if>/hop_limit
   net/ipv6/conf/<if>/mtu
   net/ipv6/conf/<if>/accept_ra
   net/ipv6/conf/<if>/accept_redirects
   net/ipv6/conf/<if>/autoconf
   net/ipv6/conf/<if>/dad_transmits
   net/ipv6/conf/<if>/router_solicitations
   net/ipv6/conf/<if>/router_solicitation_interval
   net/ipv6/conf/<if>/router_solicitation_delay
   net/ipv6/conf/<if>/force_mld_version
   net/ipv6/conf/<if>/use_tempaddr
   net/ipv6/conf/<if>/temp_valid_lft
   net/ipv6/conf/<if>/temp_prefered_lft
   net/ipv6/conf/<if>/regen_max_retry
   net/ipv6/conf/<if>/max_desync_factor
   net/ipv6/conf/<if>/max_addresses
   net/ipv6/conf/<if>/accept_ra_defrtr
   net/ipv6/conf/<if>/accept_ra_pinfo
   net/ipv6/conf/<if>/accept_ra_rtr_pref
   net/ipv6/conf/<if>/router_probe_interval
   net/ipv6/conf/<if>/accept_ra_rt_info_max_plen
   net/ipv6/conf/<if>/proxy_ndp
   net/ipv6/conf/<if>/accept_source_route
   net/ipv6/conf/<if>/optimistic_dad
   net/ipv6/conf/<if>/mc_forwarding
   net/ipv6/conf/<if>/disable_ipv6
   net/ipv6/conf/<if>/accept_dad
   net/ipv6/conf/<if>/force_tllao

- Enable the global ipv6 sysctls:
   net/ipv6/bindv6only
   net/ipv6/icmp/ratelimit

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-11-18 20:33:00 -05:00
Eric W. Biederman
af31f412c7 net: Allow userns root to control ipv6
Allow an unpriviled user who has created a user namespace, and then
created a network namespace to effectively use the new network
namespace, by reducing capable(CAP_NET_ADMIN) and
capable(CAP_NET_RAW) calls to be ns_capable(net->user_ns,
CAP_NET_ADMIN), or capable(net->user_ns, CAP_NET_RAW) calls.

Settings that merely control a single network device are allowed.
Either the network device is a logical network device where
restrictions make no difference or the network device is hardware NIC
that has been explicity moved from the initial network namespace.

In general policy and network stack state changes are allowed while
resource control is left unchanged.

Allow the SIOCSIFADDR ioctl to add ipv6 addresses.
Allow the SIOCDIFADDR ioctl to delete ipv6 addresses.
Allow the SIOCADDRT ioctl to add ipv6 routes.
Allow the SIOCDELRT ioctl to delete ipv6 routes.

Allow creation of ipv6 raw sockets.

Allow setting the IPV6_JOIN_ANYCAST socket option.
Allow setting the IPV6_FL_A_RENEW parameter of the IPV6_FLOWLABEL_MGR
socket option.

Allow setting the IPV6_TRANSPARENT socket option.
Allow setting the IPV6_HOPOPTS socket option.
Allow setting the IPV6_RTHDRDSTOPTS socket option.
Allow setting the IPV6_DSTOPTS socket option.
Allow setting the IPV6_IPSEC_POLICY socket option.
Allow setting the IPV6_XFRM_POLICY socket option.

Allow sending packets with the IPV6_2292HOPOPTS control message.
Allow sending packets with the IPV6_2292DSTOPTS control message.
Allow sending packets with the IPV6_RTHDRDSTOPTS control message.

Allow setting the multicast routing socket options on non multicast
routing sockets.

Allow the SIOCADDTUNNEL, SIOCCHGTUNNEL, and SIOCDELTUNNEL ioctls for
setting up, changing and deleting tunnels over ipv6.

Allow the SIOCADDTUNNEL, SIOCCHGTUNNEL, SIOCDELTUNNEL ioctls for
setting up, changing and deleting ipv6 over ipv4 tunnels.

Allow the SIOCADDPRL, SIOCDELPRL, SIOCCHGPRL ioctls for adding,
deleting, and changing the potential router list for ISATAP tunnels.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-11-18 20:32:45 -05:00
Eric W. Biederman
dfc47ef863 net: Push capable(CAP_NET_ADMIN) into the rtnl methods
- In rtnetlink_rcv_msg convert the capable(CAP_NET_ADMIN) check
  to ns_capable(net->user-ns, CAP_NET_ADMIN).  Allowing unprivileged
  users to make netlink calls to modify their local network
  namespace.

- In the rtnetlink doit methods add capable(CAP_NET_ADMIN) so
  that calls that are not safe for unprivileged users are still
  protected.

Later patches will remove the extra capable calls from methods
that are safe for unprivilged users.

Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-11-18 20:32:44 -05:00
Eric W. Biederman
464dc801c7 net: Don't export sysctls to unprivileged users
In preparation for supporting the creation of network namespaces
by unprivileged users, modify all of the per net sysctl exports
and refuse to allow them to unprivileged users.

This makes it safe for unprivileged users in general to access
per net sysctls, and allows sysctls to be exported to unprivileged
users on an individual basis as they are deemed safe.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-11-18 20:30:55 -05:00
Vlad Yasevich
75fe83c322 ipv6: Preserve ipv6 functionality needed by NET
Some pieces of network use core pieces of IPv6 stack.  Keep
them available while letting new GSO offload pieces depend
on CONFIG_INET.

Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-11-18 02:34:00 -05:00
David S. Miller
67f4efdce7 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Minor line offset auto-merges.

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-11-17 22:00:43 -05:00
David S. Miller
545b29019c Merge branch 'master' of git://1984.lsi.us.es/nf-next
Conflicts:
	net/ipv6/netfilter/nf_conntrack_l3proto_ipv6.c

Minor conflict due to some IS_ENABLED conversions done
in net-next.

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-11-16 12:42:43 -05:00
Steffen Klassert
0afe21fdf6 xfrm6: Remove commented out function call to xfrm6_input_fini
xfrm6_input_fini() is not in the tree since more than 10 years,
so remove the commented out function call.

Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2012-11-16 08:07:56 +01:00
Vlad Yasevich
d4d0d3557b ipv6: Fix build error with udp_offload
Add ip6_checksum.h include.  This should resolve the following issue
that shows up on power:

net/ipv6/udp_offload.c: In function 'udp6_ufo_send_check':
net/ipv6/udp_offload.c:29:2: error: implicit declaration of function
'csum_ipv6_magic' [-Werror=implicit-function-declaration]
cc1: some warnings being treated as errors

Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-11-15 22:48:32 -05:00
Vlad Yasevich
f191a1d17f net: Remove code duplication between offload structures
Move the offload callbacks into its own structure.

Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-11-15 17:39:51 -05:00
Vlad Yasevich
c6b641a4c6 ipv6: Pull IPv6 GSO registration out of the module
Sing GSO support is now separate, pull it out of the module
and make it its own init call.
Remove the cleanup functions as they are no longer called.

Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-11-15 17:39:24 -05:00
Vlad Yasevich
3c73a0368e ipv6: Update ipv6 static library with newly needed functions
UDP offload needs some additional functions to be in the static kernel
for it work correclty.  Move those functions into the core.

Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-11-15 17:39:23 -05:00
Vlad Yasevich
2207afc8bf ipv6: Move exthdr offload support into separate file
Move the exthdr offload functionality into a separeate
file in preparate for moving it out of the module

Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-11-15 17:36:18 -05:00
Vlad Yasevich
5edbb07dc9 ipv6: Separate out UDP offload functionality
Pull UDP GSO code into a separate file in preparation for moving
the code out of the module.

Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-11-15 17:36:18 -05:00
Vlad Yasevich
8663e02aba ipv6: Separate tcp offload functionality
Pull TCPv6 offload functionality into its won file in preparation
for moving it out of the module.

Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-11-15 17:36:18 -05:00
Vlad Yasevich
d1da932ed4 ipv6: Separate ipv6 offload support
Separate IPv6 offload functionality into its own file
in preparation for the move out of the module

Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-11-15 17:36:17 -05:00
Vlad Yasevich
3336288a9f ipv6: Switch to using new offload infrastructure.
Switch IPv6 protocol to using the new GRO/GSO calls and data.

Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-11-15 17:36:17 -05:00
Vlad Yasevich
8ca896cfdd ipv6: Add new offload registration infrastructure.
Create a new data structure for IPv6 protocols that holds GRO/GSO
callbacks and a new array to track the protocols that register GRO/GSO.

Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-11-15 17:36:17 -05:00
Vlad Yasevich
22061d8014 net: Switch to using the new packet offload infrustructure
Convert to using the new GSO/GRO registration mechanism and new
packet offload structure.

Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-11-15 17:36:17 -05:00
Nicolas Dichtel
1ff05fb711 ip6tnl: fix sparse warnings in ip6_tnl_netlink_parms()
This change fixes a sparse warning triggered by casting the flowinfo from
netlink messages in an u32 instead of be32. This change corrects that in order
to resolve the sparse warning.

Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-11-15 13:46:29 -05:00
Nicolas Dichtel
d440b72068 sit: fix sparse warnings
This change fixes several sparse warnings about endianness problem. The wrong
nla_*() functions were used.
It also fix a sparse warning about a flag test (field i_flags). This field is
used in this file like a local flag only, so it is more an u16 (gre uses it as a
be16). This sparse warning was already there before the patch that add netlink
management, the code has just been moved.

Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-11-15 13:46:29 -05:00
Steffen Klassert
c381328659 xfrm: Use a static gc threshold value for ipv6
Unlike ipv4 did, ipv6 does not handle the maximum number of cached
routes dynamically. So no need to try to handle the IPsec gc threshold
value dynamically. This patch sets the IPsec gc threshold value back to
1024 routes, as it is for non-IPsec routes.

Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2012-11-15 09:00:18 +01:00
Nicolas Dichtel
f372341602 sit: add support of link creation via rtnl
This patch add the support of 'ip link .. type sit'.

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-11-14 22:02:39 -05:00
Nicolas Dichtel
e4c94a9cdc sit: rename rtnl functions for consistency
Functions in this file start with ipip6_.

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-11-14 22:02:38 -05:00
Nicolas Dichtel
a12c9a8581 sit/rtnl: add missing parameters on dump
IFLA_IPTUN_FLAGS and IFLA_IPTUN_PMTUDISC were missing.
There is only one possible flag in i_flag: SIT_ISATAP.

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-11-14 22:02:38 -05:00
Nicolas Dichtel
f9cd5a5536 sit: always notify change when params are updated
netdev_state_change() was called only when end points or link was updated. Now
that all parameters are advertised via netlink, we must advertise any change.

This patch also prepares the support of sit tunnels management via rtnl. The
code which update tunnels will be put in a new function.

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-11-14 22:02:38 -05:00
Nicolas Dichtel
0b11245722 ip6tnl: add support of link creation via rtnl
This patch add the support of 'ip link .. type ip6tnl'.

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-11-14 22:02:37 -05:00
Nicolas Dichtel
b58d731acc ip6tnl: rename rtnl functions for consistency
Functions in this file start with ip6_tnl_.

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-11-14 22:02:37 -05:00
Nicolas Dichtel
cfa323b6b9 ip6tnl/rtnl: add IFLA_IPTUN_PROTO on dump
IPv6 tunnels can have three mode: 4in6, 6in6 and xin6.
This information was missing in the netlink message.

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-11-14 22:02:37 -05:00
Li RongQing
c75ea26040 ipv6: remove obsolete comments in route.c
Signed-off-by: Li RongQing <roy.qing.li@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-11-14 18:51:45 -05:00
Amerigo Wang
e086cadc08 net: unify for_each_ip_tunnel_rcu()
The defitions of for_each_ip_tunnel_rcu() are same,
so unify it. Also, don't hide the parameter 't'.

Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-11-14 18:49:50 -05:00
Amerigo Wang
aa0010f880 net: convert __IPTUNNEL_XMIT() to an inline function
__IPTUNNEL_XMIT() is an ugly macro, convert it to a static
inline function, so make it more readable.

IPTUNNEL_XMIT() is unused, just remove it.

Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-11-14 18:49:50 -05:00
Hannes Frederic Sowa
d4596bad2a ipv6: setsockopt(IPIPPROTO_IPV6, IPV6_MINHOPCOUNT) forgot to set return value
Cc: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-11-13 14:38:47 -05:00
Hannes Frederic Sowa
5cb04436ee ipv6: add knob to send unsolicited ND on link-layer address change
This patch introduces a new knob ndisc_notify. If enabled, the kernel
will transmit an unsolicited neighbour advertisement on link-layer address
change to update the neighbour tables of the corresponding hosts more quickly.

This is the equivalent to arp_notify in ipv4 world.

Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-11-13 14:27:45 -05:00
Florian Westphal
d3976a53ce netfilter: ipv6: only provide sk_bound_dev_if for link-local addr
yoshfuji points out that sk_bound_dev_if should only be provided
for link-local addresses.

IPv6 getpeer/sockname also has this test, i.e. we will now
only set sin6_scope_id if the original(!) destination
was a link-local address.

Reported-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2012-11-13 13:42:29 +01:00
Ansis Atteka
9195bb8e38 ipv6: improve ipv6_find_hdr() to skip empty routing headers
This patch prepares ipv6_find_hdr() function so that it could be
able to skip routing headers, where segements_left is 0. This is
required to handle multiple routing header case correctly when
changing IPv6 addresses.

Signed-off-by: Ansis Atteka <aatteka@nicira.com>
Signed-off-by: Jesse Gross <jesse@nicira.com>
2012-11-12 12:31:37 -08:00
YOSHIFUJI Hideaki / 吉藤英明
9fafd65ad4 ipv6 ndisc: Use pre-defined in6addr_linklocal_allnodes.
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-11-12 15:23:21 -05:00
David S. Miller
d4185bbf62 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c

Minor conflict between the BCM_CNIC define removal in net-next
and a bug fix added to net.  Based upon a conflict resolution
patch posted by Stephen Rothwell.

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-11-10 18:32:51 -05:00
Jesse Gross
f8f626754e ipv6: Move ipv6_find_hdr() out of Netfilter code.
Open vSwitch will soon also use ipv6_find_hdr() so this moves it
out of Netfilter-specific code into a more common location.

Signed-off-by: Jesse Gross <jesse@nicira.com>
2012-11-09 17:05:07 -08:00
Nicolas Dichtel
c075b13098 ip6tnl: advertise tunnel param via rtnl
It is usefull for daemons that monitor link event to have the full parameters of
these interfaces when a rtnl message is sent.
It allows also to dump them via rtnetlink.

It is based on what is done for GRE tunnels.

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-11-09 19:36:20 -05:00
Nicolas Dichtel
ba3e3f50a0 sit: advertise tunnel param via rtnl
It is usefull for daemons that monitor link event to have the full parameters of
these interfaces when a rtnl message is sent.
It allows also to dump them via rtnetlink.

It is based on what is done for GRE tunnels.

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-11-09 19:36:20 -05:00
Nicolas Dichtel
a375413311 gre6: fix rtnl dump messages
Spotted after a code review.
Introduced by c12b395a46 (gre: Support GRE over
IPv6).

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-11-09 17:11:17 -05:00
Hannes Frederic Sowa
60713a0ca7 ipv6: send unsolicited neighbour advertisements to all-nodes
As documented in RFC4861 (Neighbor Discovery for IP version 6) 7.2.6.,
unsolicited neighbour advertisements should be sent to the all-nodes
multicast address.

Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-11-09 16:18:52 -05:00
Li RongQing
a4477c4ddb ipv6: remove rt6i_peer_genid from rt6_info and its handler
6431cbc25f(Create a mechanism for upward inetpeer propagation into routes)
introduces these codes, but this mechanism is never enabled since
rt6i_peer_genid always is zero whether it is not assigned or assigned by
rt6_peer_genid(). After 5943634fc5 (ipv4: Maintain redirect and PMTU info
in struct rtable again), the ipv4 related codes of this mechanism has been
removed, I think we maybe able to remove them now.

Signed-off-by: Li RongQing <roy.qing.li@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-11-08 21:16:08 -05:00
Nicolas Dichtel
b20b6d9726 ndisc: fix a typo in a comment in ndisc_recv_na()
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-11-07 19:03:16 -05:00
Amerigo Wang
94e187c015 ipv6: introduce ip6_rt_put()
As suggested by Eric, we could introduce a helper function
for ipv6 too, to avoid checking if rt is NULL before
dst_release().

Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-11-03 14:59:05 -04:00
Amerigo Wang
1a9408355e ipv6: remove a useless NULL check
In dev_forward_change(), it is useless to check if idev->dev
is NULL, it is always non-NULL here.

Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-11-03 14:50:15 -04:00
Eric Dumazet
e6c022a4fa tcp: better retrans tracking for defer-accept
For passive TCP connections using TCP_DEFER_ACCEPT facility,
we incorrectly increment req->retrans each time timeout triggers
while no SYNACK is sent.

SYNACK are not sent for TCP_DEFER_ACCEPT that were established (for
which we received the ACK from client). Only the last SYNACK is sent
so that we can receive again an ACK from client, to move the req into
accept queue. We plan to change this later to avoid the useless
retransmit (and potential problem as this SYNACK could be lost)

TCP_INFO later gives wrong information to user, claiming imaginary
retransmits.

Decouple req->retrans field into two independent fields :

num_retrans : number of retransmit
num_timeout : number of timeouts

num_timeout is the counter that is incremented at each timeout,
regardless of actual SYNACK being sent or not, and used to
compute the exponential timeout.

Introduce inet_rtx_syn_ack() helper to increment num_retrans
only if ->rtx_syn_ack() succeeded.

Use inet_rtx_syn_ack() from tcp_check_req() to increment num_retrans
when we re-send a SYNACK in answer to a (retransmitted) SYN.
Prior to this patch, we were not counting these retransmits.

Change tcp_v[46]_rtx_synack() to increment TCP_MIB_RETRANSSEGS
only if a synack packet was successfully queued.

Reported-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Julian Anastasov <ja@ssi.bg>
Cc: Vijay Subramanian <subramanian.vijay@gmail.com>
Cc: Elliott Hughes <enh@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-11-03 14:45:00 -04:00
Nicolas Dichtel
1a72418bd7 ipv6/multipath: remove flag NLM_F_EXCL after the first nexthop
fib6_add_rt2node() will reject the nexthop if this flag is set, so
we perform the check only for the first nexthop.

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-11-02 21:38:19 -04:00
Florian Westphal
121d1e0941 netfilter: ipv6: add getsockopt to retrieve origdst
userspace can query the original ipv4 destination address of a REDIRECTed
connection via
getsockopt(m_sock, SOL_IP, SO_ORIGINAL_DST, &m_server_addr, &addrsize)

but for ipv6 no such option existed.

This adds getsockopt(..., IPPROTO_IPV6, IP6T_SO_ORIGINAL_DST, ...).

Without this, userspace needs to parse /proc or use ctnetlink, which
appears to be overkill.

This uses option number 80 for IP6T_SO_ORIGINAL_DST, which is spare,
to use the same number we use in the IPv4 socket option SO_ORIGINAL_DST.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2012-11-02 12:26:32 +01:00
Amerigo Wang
07a936260a ipv6: use IS_ENABLED()
#if defined(CONFIG_FOO) || defined(CONFIG_FOO_MODULE)

can be replaced by

#if IS_ENABLED(CONFIG_FOO)

Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-11-01 12:41:35 -04:00
David S. Miller
f8450bbe8c Merge branch 'master' of git://1984.lsi.us.es/nf
Pablo Neira Ayuso says:

====================
The following patchset contains fixes for your net tree, two of them
are due to relatively recent changes, one has been a longstanding bug,
they are:

* Fix incorrect usage of rt_gateway in the H.323 helper, from
  Julian Anastasov.

* Skip re-route in nf_nat code for ICMP traffic. If CONFIG_XFRM is
  enabled, we waste cycles to look up for the route again. This problem
  seems to be there since really long time. From Ulrich Weber.

* Fix mismatching section in nf_conntrack_reasm, from Hein Tibosch.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-10-31 14:54:15 -04:00
Wu Fengguang
6229b75d8d netfilter: nf_nat: use PTR_RET
Use PTR_RET rather than if(IS_ERR(...)) + PTR_ERR

Generated by: coccinelle/api/ptr_ret.cocci

Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2012-10-29 09:59:59 +01:00
Nicolas Dichtel
76f8f6cb76 rtnl/ipv6: add support of RTM_GETNETCONF
This message allows to get the devconf for an interface.

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-10-28 19:05:00 -04:00
Nicolas Dichtel
f3a1bfb11c rtnl/ipv6: use netconf msg to advertise forwarding status
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-10-28 19:05:00 -04:00
Hein Tibosch
f1df1374dc netfilter: nf_defrag_ipv6: solve section mismatch in nf_conntrack_reasm
WARNING: net/ipv6/netfilter/nf_defrag_ipv6.o(.text+0xe0): Section mismatch in
reference from the function nf_ct_net_init() to the function
.init.text:nf_ct_frag6_sysctl_register()
The function nf_ct_net_init() references the function
__init nf_ct_frag6_sysctl_register().

In case nf_conntrack_ipv6 is compiled as a module, nf_ct_net_init could be
called after the init code and data are unloaded. Therefore remove the
"__net_init" annotation from nf_ct_frag6_sysctl_register().

Signed-off-by: Hein Tibosch <hein_tibosch@yahoo.es>
Acked-by: Cong Wang <amwang@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2012-10-28 22:44:15 +01:00
Ulrich Weber
38fe36a248 netfilter: nf_nat: don't check for port change on ICMP tuples
ICMP tuples have id in src and type/code in dst.
So comparing src.u.all with dst.u.all will always fail here
and ip_xfrm_me_harder() is called for every ICMP packet,
even if there was no NAT.

Signed-off-by: Ulrich Weber <ulrich.weber@sophos.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2012-10-28 22:43:34 +01:00
Li RongQing
14edd87dc6 ipv6: Set default hoplimit as zero.
Commit a02e4b7dae4551(Demark default hoplimit as zero) only changes the
hoplimit checking condition and default value in ip6_dst_hoplimit, not
zeros all hoplimit default value.

Keep the zeroing ip6_template_metrics[RTAX_HOPLIMIT - 1] to force it as
const, cause as a37e6e344910(net: force dst_default_metrics to const
section)

Signed-off-by: Li RongQing <roy.qing.li@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-10-24 23:14:17 -04:00
Nicolas Dichtel
b3ce5ae1fb ipv6: fix sparse warnings in rt6_info_hash_nhsfn()
Adding by commit 51ebd31815 which adds the support of ECMP for IPv6.

Spotted-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-10-23 13:03:45 -04:00
Neal Cardwell
f3f121359c ipv6: tcp: clean up tcp_v6_early_demux() icsk variable
Remove an icsk variable, which by convention should refer to an
inet_connection_sock rather than an inet_sock. In the process, make
the tcp_v6_early_demux() code and formatting a bit more like
tcp_v4_early_demux(), to ease comparisons and maintenance.

Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-10-23 13:03:44 -04:00
Nicolas Dichtel
51ebd31815 ipv6: add support of equal cost multipath (ECMP)
Each nexthop is added like a single route in the routing table. All routes
that have the same metric/weight and destination but not the same gateway
are considering as ECMP routes. They are linked together, through a list called
rt6i_siblings.

ECMP routes can be added in one shot, with RTA_MULTIPATH attribute or one after
the other (in both case, the flag NLM_F_EXCL should not be set).

The patch is based on a previous work from
Luc Saillard <luc.saillard@6wind.com>.

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-10-23 02:38:32 -04:00
Eric Dumazet
9f0d3c2781 ipv6: addrconf: fix /proc/net/if_inet6
Commit 1d5783030a (ipv6/addrconf: speedup /proc/net/if_inet6 filling)
added bugs hiding some devices from if_inet6 and breaking applications.

"ip -6 addr" could still display all IPv6 addresses, while "ifconfig -a"
couldnt.

One way to reproduce the bug is by starting in a shell :

unshare -n /bin/bash
ifconfig lo up

And in original net namespace, lo device disappeared from if_inet6

Reported-by: Jan Hinnerk Stosch <janhinnerk.stosch@gmail.com>
Tested-by: Jan Hinnerk Stosch <janhinnerk.stosch@gmail.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Mihai Maruseac <mihai.maruseac@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-10-16 14:41:47 -04:00
Alexey Kuznetsov
4c67525849 tcp: resets are misrouted
After commit e2446eaa ("tcp_v4_send_reset: binding oif to iif in no
sock case").. tcp resets are always lost, when routing is asymmetric.
Yes, backing out that patch will result in misrouting of resets for
dead connections which used interface binding when were alive, but we
actually cannot do anything here.  What's died that's died and correct
handling normal unbound connections is obviously a priority.

Comment to comment:
> This has few benefits:
>   1. tcp_v6_send_reset already did that.

It was done to route resets for IPv6 link local addresses. It was a
mistake to do so for global addresses. The patch fixes this as well.

Actually, the problem appears to be even more serious than guaranteed
loss of resets.  As reported by Sergey Soloviev <sol@eqv.ru>, those
misrouted resets create a lot of arp traffic and huge amount of
unresolved arp entires putting down to knees NAT firewalls which use
asymmetric routing.

Signed-off-by: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
2012-10-12 13:52:40 -04:00
Eric Dumazet
863472454c ipv6: gro: fix PV6_GRO_CB(skb)->proto problem
It seems IPV6_GRO_CB(skb)->proto can be destroyed in skb_gro_receive()
if a new skb is allocated (to serve as an anchor for frag_list)

We copy NAPI_GRO_CB() only (not the IPV6 specific part) in :

*NAPI_GRO_CB(nskb) = *NAPI_GRO_CB(p);

So we leave IPV6_GRO_CB(nskb)->proto to 0 (fresh skb allocation) instead
of IPPROTO_TCP (6)

ipv6_gro_complete() isnt able to call ops->gro_complete()
[ tcp6_gro_complete() ]

Fix this by moving proto in NAPI_GRO_CB() and getting rid of
IPV6_GRO_CB

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-10-08 15:40:43 -04:00
Eric Dumazet
51ec04038c ipv6: GRO should be ECN friendly
IPv4 side of the problem was addressed in commit a9e050f4e7
(net: tcp: GRO should be ECN friendly)

This patch does the same, but for IPv6 : A Traffic Class mismatch
doesnt mean flows are different, but instead should force a flush
of previous packets.

This patch removes artificial packet reordering problem.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-10-07 14:44:36 -04:00
Linus Torvalds
283dbd8205 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking changes from David Miller:
 "The most important bit in here is the fix for input route caching from
  Eric Dumazet, it's a shame we couldn't fully analyze this in time for
  3.6 as it's a 3.6 regression introduced by the routing cache removal.

  Anyways, will send quickly to -stable after you pull this in.

  Other changes of note:

   1) Fix lockdep splats in team and bonding, from Eric Dumazet.

   2) IPV6 adds link local route even when there is no link local
      address, from Nicolas Dichtel.

   3) Fix ixgbe PTP implementation, from Jacob Keller.

   4) Fix excessive stack usage in cxgb4 driver, from Vipul Pandya.

   5) MAC length computed improperly in VLAN demux, from Antonio
      Quartulli."

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (26 commits)
  ipv6: release reference of ip6_null_entry's dst entry in __ip6_del_rt
  Remove noisy printks from llcp_sock_connect
  tipc: prevent dropped connections due to rcvbuf overflow
  silence some noisy printks in irda
  team: set qdisc_tx_busylock to avoid LOCKDEP splat
  bonding: set qdisc_tx_busylock to avoid LOCKDEP splat
  sctp: check src addr when processing SACK to update transport state
  sctp: fix a typo in prototype of __sctp_rcv_lookup()
  ipv4: add a fib_type to fib_info
  can: mpc5xxx_can: fix section type conflict
  can: peak_pcmcia: fix error return code
  can: peak_pci: fix error return code
  cxgb4: Fix build error due to missing linux/vmalloc.h include.
  bnx2x: fix ring size for 10G functions
  cxgb4: Dynamically allocate memory in t4_memory_rw() and get_vpd_params()
  ixgbe: add support for X540-AT1
  ixgbe: fix poll loop for FDIRCTRL.INIT_DONE bit
  ixgbe: fix PTP ethtool timestamping function
  ixgbe: (PTP) Fix PPS interrupt code
  ixgbe: Fix PTP X540 SDP alignment code for PPS signal
  ...
2012-10-06 03:11:59 +09:00
Andi Kleen
04a6f82cf0 sections: fix section conflicts in net
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Cc: David Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-06 03:04:45 +09:00
Gao feng
6825a26c2d ipv6: release reference of ip6_null_entry's dst entry in __ip6_del_rt
as we hold dst_entry before we call __ip6_del_rt,
so we should alse call dst_release not only return
-ENOENT when the rt6_info is ip6_null_entry.

and we already hold the dst entry, so I think it's
safe to call dst_release out of the write-read lock.

Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-10-04 16:00:07 -04:00
Nicolas Dichtel
62b54dd915 ipv6: don't add link local route when there is no link local address
When an address is added on loopback (ip -6 a a 2002::1/128 dev lo), a route
to fe80::/64 is added in the main table:
  unreachable fe80::/64 dev lo  proto kernel  metric 256  error -101

This route does not match any prefix (no fe80:: address on lo). In fact,
addrconf_dev_config() will not add link local address because this function
filters interfaces by type. If the link local address is added manually, the
route to the link local prefix will be automatically added by
addrconf_add_linklocal().
Note also, that this route is not deleted when the address is removed.

After looking at the code, it seems that addrconf_add_lroute() is redundant with
addrconf_add_linklocal(), because this function will add the link local route
when the link local address is configured.

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-10-02 22:36:23 -04:00
Eric Dumazet
861b650101 tcp: gro: add checksuming helpers
skb with CHECKSUM_NONE cant currently be handled by GRO, and
we notice this deep in GRO stack in tcp[46]_gro_receive()

But there are cases where GRO can be a benefit, even with a lack
of checksums.

This preliminary work is needed to add GRO support
to tunnels.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-10-01 17:00:27 -04:00
Nicolas Dichtel
64c6d08e64 ipv6: del unreachable route when an addr is deleted on lo
When an address is added on loopback (ip -6 a a 2002::1/128 dev lo), two routes
are added:
 - one in the local table:
    local 2002::1 via :: dev lo  proto none  metric 0
 - one the in main table (for the prefix):
    unreachable 2002::1 dev lo  proto kernel  metric 256  error -101

When the address is deleted, the route inserted in the main table remains
because we use rt6_lookup(), which returns NULL when dst->error is set, which
is the case here! Thus, it is better to use ip6_route_lookup() to avoid this
kind of filter.

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-10-01 16:49:23 -04:00
Lin Ming
188c517a05 ipv6: return errno pointers consistently for fib6_add_1()
fib6_add_1() should consistently return errno pointers,
rather than a mixture of NULL and errno pointers.

Signed-off-by: Lin Ming <mlin@ss.pku.edu.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-28 18:48:28 -04:00
David S. Miller
6a06e5e1bb Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/team/team.c
	drivers/net/usb/qmi_wwan.c
	net/batman-adv/bat_iv_ogm.c
	net/ipv4/fib_frontend.c
	net/ipv4/route.c
	net/l2tp/l2tp_netlink.c

The team, fib_frontend, route, and l2tp_netlink conflicts were simply
overlapping changes.

qmi_wwan and bat_iv_ogm were of the "use HEAD" variety.

With help from Antonio Quartulli.

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-28 14:40:49 -04:00
Eric Dumazet
bcc452935d ipv6: gre: remove ip6gre_header_parse()
dev_parse_header() callers provide 8 bytes of storage,
so it's not possible to store an IPv6 address.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-27 18:49:22 -04:00
Konstantin Khlebnikov
4b7cc7fc26 nf_defrag_ipv6: fix oops on module unloading
fix copy-paste error introduced in linux-next commit
"ipv6: add a new namespace for nf_conntrack_reasm"

Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: Amerigo Wang <amwang@redhat.com>
Cc: David S. Miller <davem@davemloft.net>
Acked-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-27 18:14:55 -04:00
stephen hemminger
eccc1bb8d4 tunnel: drop packet if ECN present with not-ECT
Linux tunnels were written before RFC6040 and therefore never
implemented the corner case of ECN getting set in the outer header
and the inner header not being ready for it.

Section 4.2.  Default Tunnel Egress Behaviour.
 o If the inner ECN field is Not-ECT, the decapsulator MUST NOT
      propagate any other ECN codepoint onwards.  This is because the
      inner Not-ECT marking is set by transports that rely on dropped
      packets as an indication of congestion and would not understand or
      respond to any other ECN codepoint [RFC4774].  Specifically:

      *  If the inner ECN field is Not-ECT and the outer ECN field is
         CE, the decapsulator MUST drop the packet.

      *  If the inner ECN field is Not-ECT and the outer ECN field is
         Not-ECT, ECT(0), or ECT(1), the decapsulator MUST forward the
         outgoing packet with the ECN field cleared to Not-ECT.

This patch moves the ECN decap logic out of the individual tunnels
into a common place.

It also adds logging to allow detecting broken systems that
set ECN bits incorrectly when tunneling (or an intermediate
router might be changing the header).

Overloads rx_frame_error to keep track of ECN related error.

Thanks to Chris Wright who caught this while reviewing the new VXLAN
tunnel.

This code was tested by injecting faulty logic in other end GRE
to send incorrectly encapsulated packets.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-27 18:12:37 -04:00
stephen hemminger
b0558ef24a xfrm: remove extranous rcu_read_lock
The handlers for xfrm_tunnel are always invoked with rcu read lock
already.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-27 18:12:37 -04:00
stephen hemminger
0c5794a66c gre: remove unnecessary rcu_read_lock/unlock
The gre function pointers for receive and error handling are
always called (from gre.c) with rcu_read_lock already held.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-27 18:12:37 -04:00
Eric Dumazet
96af69ea2a ipv6: mip6: fix mip6_mh_filter()
mip6_mh_filter() should not modify its input, or else its caller
would need to recompute ipv6_hdr() if skb->head is reallocated.

Use skb_header_pointer() instead of pskb_may_pull()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-25 16:04:44 -04:00
Eric Dumazet
1b05c4b50e ipv6: raw: fix icmpv6_filter()
icmpv6_filter() should not modify its input, or else its caller
would need to recompute ipv6_hdr() if skb->head is reallocated.

Use skb_header_pointer() instead of pskb_may_pull() and
change the prototype to make clear both sk and skb are const.

Also, if icmpv6 header cannot be found, do not deliver the packet,
as we do in IPv4.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-25 13:21:49 -04:00
Eric Dumazet
5640f76858 net: use a per task frag allocator
We currently use a per socket order-0 page cache for tcp_sendmsg()
operations.

This page is used to build fragments for skbs.

Its done to increase probability of coalescing small write() into
single segments in skbs still in write queue (not yet sent)

But it wastes a lot of memory for applications handling many mostly
idle sockets, since each socket holds one page in sk->sk_sndmsg_page

Its also quite inefficient to build TSO 64KB packets, because we need
about 16 pages per skb on arches where PAGE_SIZE = 4096, so we hit
page allocator more than wanted.

This patch adds a per task frag allocator and uses bigger pages,
if available. An automatic fallback is done in case of memory pressure.

(up to 32768 bytes per frag, thats order-3 pages on x86)

This increases TCP stream performance by 20% on loopback device,
but also benefits on other network devices, since 8x less frags are
mapped on transmit and unmapped on tx completion. Alexander Duyck
mentioned a probable performance win on systems with IOMMU enabled.

Its possible some SG enabled hardware cant cope with bigger fragments,
but their ndo_start_xmit() should already handle this, splitting a
fragment in sub fragments, since some arches have PAGE_SIZE=65536

Successfully tested on various ethernet devices.
(ixgbe, igb, bnx2x, tg3, mellanox mlx4)

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Ben Hutchings <bhutchings@solarflare.com>
Cc: Vijay Subramanian <subramanian.vijay@gmail.com>
Cc: Alexander Duyck <alexander.h.duyck@intel.com>
Tested-by: Vijay Subramanian <subramanian.vijay@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-24 16:31:37 -04:00
David S. Miller
ae4735166e Merge branch 'master' of git://1984.lsi.us.es/nf-next
Pablo Neira Ayuso says:

====================
This patchset contains updates for your net-next tree, they are:

* Mostly fixes for the recently pushed IPv6 NAT support:

- Fix crash while removing nf_nat modules from Patrick McHardy.
- Fix unbalanced rcu_read_unlock from Ulrich Weber.
- Merge NETMAP and REDIRECT into one single xt_target module, from
  Jan Engelhardt.
- Fix Kconfig for IPv6 NAT, which allows inconsistent configurations,
  from myself.

* Updates for ipset, all of the from Jozsef Kadlecsik:

- Add the new "nomatch" option to obtain reverse set matching.
- Support for /0 CIDR in hash:net,iface set type.
- One non-critical fix for a rare crash due to pass really
  wrong configuration parameters.
- Coding style cleanups.
- Sparse fixes.
- Add set revision supported via modinfo.i

* One extension for the xt_time match, to support matching during
  the transition between two days with one single rule, from
  Florian Westphal.

* Fix maximum packet length supported by nfnetlink_queue and add
  NFQA_CAP_LEN attribute, from myself.

You can notice that this batch contains a couple of fixes that may
go to 3.6-rc but I don't consider them critical to push them:

* The ipset fix for the /0 cidr case, which is triggered with one
  inconsistent command line invocation of ipset.

* The nfnetlink_queue maximum packet length supported since it requires
  the new NFQA_CAP_LEN attribute to provide a full workaround for the
  described problem.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-24 15:42:04 -04:00
Neal Cardwell
016818d076 tcp: TCP Fast Open Server - take SYNACK RTT after completing 3WHS
When taking SYNACK RTT samples for servers using TCP Fast Open, fix
the code to ensure that we only call tcp_valid_rtt_meas() after we
receive the ACK that completes the 3-way handshake.

Previously we were always taking an RTT sample in
tcp_v4_syn_recv_sock(). However, for TCP Fast Open connections
tcp_v4_conn_req_fastopen() calls tcp_v4_syn_recv_sock() at the time we
receive the SYN. So for TFO we must wait until tcp_rcv_state_process()
to take the RTT sample.

To fix this, we wait until after TFO calls tcp_v4_syn_recv_sock()
before we set the snt_synack timestamp, since tcp_synack_rtt_meas()
already ensures that we only take a SYNACK RTT sample if snt_synack is
non-zero. To be careful, we only take a snt_synack timestamp when
a SYNACK transmit or retransmit succeeds.

Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-22 15:47:10 -04:00
Neal Cardwell
623df484a7 tcp: extract code to compute SYNACK RTT
In preparation for adding another spot where we compute the SYNACK
RTT, extract this code so that it can be shared.

Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-22 15:47:10 -04:00
Wei Yongjun
f950c0ecc7 ipv6: fix return value check in fib6_add()
In case of error, the function fib6_add_1() returns ERR_PTR()
or NULL pointer. The ERR_PTR() case check is missing in fib6_add().

dpatch engine is used to generated this patch.
(https://github.com/weiyj/dpatch)

Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-21 13:43:52 -04:00
Jan Engelhardt
2cbc78a29e netfilter: combine ipt_REDIRECT and ip6t_REDIRECT
Combine more modules since the actual code is so small anyway that the
kmod metadata and the module in its loaded state totally outweighs the
combined actual code size.

IP_NF_TARGET_REDIRECT becomes a compat option; IP6_NF_TARGET_REDIRECT
is completely eliminated since it has not see a release yet.

Signed-off-by: Jan Engelhardt <jengelh@inai.de>
Acked-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2012-09-21 12:12:05 +02:00
Jan Engelhardt
b3d54b3e40 netfilter: combine ipt_NETMAP and ip6t_NETMAP
Combine more modules since the actual code is so small anyway that the
kmod metadata and the module in its loaded state totally outweighs the
combined actual code size.

IP_NF_TARGET_NETMAP becomes a compat option; IP6_NF_TARGET_NETMAP
is completely eliminated since it has not see a release yet.

Signed-off-by: Jan Engelhardt <jengelh@inai.de>
Acked-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2012-09-21 12:11:08 +02:00
Pablo Neira Ayuso
b0041d1b8e netfilter: fix IPv6 NAT dependencies in Kconfig
* NF_NAT_IPV6 requires IP6_NF_IPTABLES

* IP6_NF_TARGET_MASQUERADE, IP6_NF_TARGET_NETMAP, IP6_NF_TARGET_REDIRECT
  and IP6_NF_TARGET_NPT require NF_NAT_IPV6.

This change just mirrors what IPv4 does in Kconfig, for consistency.

Reported-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2012-09-21 11:33:19 +02:00
Amerigo Wang
6b102865e7 ipv6: unify fragment thresh handling code
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Michal Kubeček <mkubecek@suse.cz>
Cc: David Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-19 17:23:28 -04:00
Amerigo Wang
d4915c087f ipv6: make ip6_frag_nqueues() and ip6_frag_mem() static inline
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Michal Kubeček <mkubecek@suse.cz>
Cc: David Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-19 17:23:28 -04:00
Amerigo Wang
b836c99fd6 ipv6: unify conntrack reassembly expire code with standard one
Two years ago, Shan Wei tried to fix this:
http://patchwork.ozlabs.org/patch/43905/

The problem is that RFC2460 requires an ICMP Time
Exceeded -- Fragment Reassembly Time Exceeded message should be
sent to the source of that fragment, if the defragmentation
times out.

"
   If insufficient fragments are received to complete reassembly of a
   packet within 60 seconds of the reception of the first-arriving
   fragment of that packet, reassembly of that packet must be
   abandoned and all the fragments that have been received for that
   packet must be discarded.  If the first fragment (i.e., the one
   with a Fragment Offset of zero) has been received, an ICMP Time
   Exceeded -- Fragment Reassembly Time Exceeded message should be
   sent to the source of that fragment.
"

As Herbert suggested, we could actually use the standard IPv6
reassembly code which follows RFC2460.

With this patch applied, I can see ICMP Time Exceeded sent
from the receiver when the sender sent out 3/4 fragmented
IPv6 UDP packet.

Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Michal Kubeček <mkubecek@suse.cz>
Cc: David Miller <davem@davemloft.net>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: Patrick McHardy <kaber@trash.net>
Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Cc: netfilter-devel@vger.kernel.org
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-19 17:23:28 -04:00
Amerigo Wang
c038a767cd ipv6: add a new namespace for nf_conntrack_reasm
As pointed by Michal, it is necessary to add a new
namespace for nf_conntrack_reasm code, this prepares
for the second patch.

Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Michal Kubeček <mkubecek@suse.cz>
Cc: David Miller <davem@davemloft.net>
Cc: Patrick McHardy <kaber@trash.net>
Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Cc: netfilter-devel@vger.kernel.org
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-19 17:23:28 -04:00
Li RongQing
3fd91fb358 ipv6: recursive check rt->dst.from when call rt6_check_expired
If dst cache dst_a copies from dst_b, and dst_b copies from dst_c, check
if dst_a is expired or not, we should not end with dst_a->dst.from, dst_b,
we should check dst_c.

CC: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: Li RongQing <roy.qing.li@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-19 15:35:33 -04:00
Nicolas Dichtel
2c20cbd7e3 ipv6: use DST_* macro to set obselete field
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-18 15:57:04 -04:00
Nicolas Dichtel
6f3118b571 ipv6: use net->rt_genid to check dst validity
IPv6 dst should take care of rt_genid too. When a xfrm policy is inserted or
deleted, all dst should be invalidated.
To force the validation, dst entries should be created with ->obsolete set to
DST_OBSOLETE_FORCE_CHK. This was already the case for all functions calling
ip6_dst_alloc(), except for ip6_rt_copy().

As a consequence, we can remove the specific code in inet6_connection_sock.

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-18 15:57:03 -04:00
David S. Miller
b48b63a1f6 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	net/netfilter/nfnetlink_log.c
	net/netfilter/xt_LOG.c

Rather easy conflict resolution, the 'net' tree had bug fixes to make
sure we checked if a socket is a time-wait one or not and elide the
logging code if so.

Whereas on the 'net-next' side we are calculating the UID and GID from
the creds using different interfaces due to the user namespace changes
from Eric Biederman.

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-15 11:43:53 -04:00
Li RongQing
5744dd9b71 ipv6: replace write lock with read lock when get route info
geting route info does not write rt->rt6i_table, so replace
write lock with read lock

Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Li RongQing <roy.qing.li@gmail.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-13 16:53:46 -04:00
Eric Dumazet
fb0af4c74f ipv6: route templates can be const
We kmemdup() templates, so they can be const.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-13 16:52:55 -04:00
YOSHIFUJI Hideaki / 吉藤英明
91b4b04ff8 ipv6: Compare addresses only bits up to the prefix length (RFC6724).
Compare bits up to the source address's prefix length only to
allows DNS load balancing to continue to be used as a tie breaker.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-13 16:34:03 -04:00
YOSHIFUJI Hideaki / 吉藤英明
417962a02b ipv6: Add labels for site-local and 6bone testing addresses (RFC6724)
Added labels for site-local addresses (fec0::/10) and 6bone testing
addresses (3ffe::/16) in order to depreference them.

Note that the RFC introduced new rows for Teredo, ULA and 6to4 addresses
in the default policy table.  Some of them have different labels from ours.
For backward compatibility, we do not change the "default" labels.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-13 16:34:03 -04:00
Amerigo Wang
fdd6681d92 ipv6: remove some useless RCU read lock
After this commit:
	commit 97cac0821a
	Author: David S. Miller <davem@davemloft.net>
	Date:   Mon Jul 2 22:43:47 2012 -0700

	    ipv6: Store route neighbour in rt6_info struct.

we no longer use RCU to protect route neighbour.

Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-10 16:31:18 -04:00
Eric W. Biederman
15e473046c netlink: Rename pid to portid to avoid confusion
It is a frequent mistake to confuse the netlink port identifier with a
process identifier.  Try to reduce this confusion by renaming fields
that hold port identifiers portid instead of pid.

I have carefully avoided changing the structures exported to
userspace to avoid changing the userspace API.

I have successfully built an allyesconfig kernel with this change.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-10 15:30:41 -04:00
Nicolas Dichtel
b4949ab269 ipv6: fix handling of throw routes
It's the same problem that previous fix about blackhole and prohibit routes.

When adding a throw route, it was handled like a classic route.
Moreover, it was only possible to add this kind of routes by specifying
an interface.

Before the patch:
  $ ip route add throw 2001::2/128
  RTNETLINK answers: No such device
  $ ip route add throw 2001::2/128 dev eth0
  $ ip -6 route | grep 2001::2
  2001::2 dev eth0  metric 1024

After:
  $ ip route add throw 2001::2/128
  $ ip -6 route | grep 2001::2
  throw 2001::2 dev lo  metric 1024  error -11

Reported-by: Markus Stenberg <markus.stenberg@iki.fi>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-07 14:17:10 -04:00
Eric Dumazet
979402b16c udp: increment UDP_MIB_INERRORS if copy failed
In UDP recvmsg(), we miss an increase of UDP_MIB_INERRORS if the copy
of skb to userspace failed for whatever reason.

Reported-by: Shawn Bohrer <sbohrer@rgmadvisors.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-07 12:56:00 -04:00
Nicolas Dichtel
ef2c7d7b59 ipv6: fix handling of blackhole and prohibit routes
When adding a blackhole or a prohibit route, they were handling like classic
routes. Moreover, it was only possible to add this kind of routes by specifying
an interface.

Bug already reported here:
  http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=498498

Before the patch:
  $ ip route add blackhole 2001::1/128
  RTNETLINK answers: No such device
  $ ip route add blackhole 2001::1/128 dev eth0
  $ ip -6 route | grep 2001
  2001::1 dev eth0  metric 1024

After:
  $ ip route add blackhole 2001::1/128
  $ ip -6 route | grep 2001
  blackhole 2001::1 dev lo  metric 1024  error -22

v2: wrong patch
v3: add a field fc_type in struct fib6_config to store RTN_* type

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-05 17:49:28 -04:00
Stephen Rothwell
2c96932229 netfilter: ipv6: using csum_ipv6_magic requires net/ip6_checksum.h
Fixes this build error:

net/ipv6/netfilter/nf_nat_l3proto_ipv6.c: In function 'nf_nat_ipv6_csum_recalc':
net/ipv6/netfilter/nf_nat_l3proto_ipv6.c:144:4: error: implicit declaration of function 'csum_ipv6_magic' [-Werror=implicit-function-declaration]

Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-05 17:46:06 -04:00
Julian Anastasov
d013ef2aba tcp: fix possible socket refcount problem for ipv6
commit 144d56e910
("tcp: fix possible socket refcount problem") is missing
the IPv6 part. As tcp_release_cb is shared by both protocols
we should hold sock reference for the TCP_MTU_REDUCED_DEFERRED
bit.

Signed-off-by: Julian Anastasov <ja@ssi.bg>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-05 17:16:25 -04:00
Pablo Neira Ayuso
ace1fe1231 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
This merges (3f509c6 netfilter: nf_nat_sip: fix incorrect handling
of EBUSY for RTCP expectation) to Patrick McHardy's IPv6 NAT changes.
2012-09-03 15:34:51 +02:00
Jerry Chu
8336886f78 tcp: TCP Fast Open Server - support TFO listeners
This patch builds on top of the previous patch to add the support
for TFO listeners. This includes -

1. allocating, properly initializing, and managing the per listener
fastopen_queue structure when TFO is enabled

2. changes to the inet_csk_accept code to support TFO. E.g., the
request_sock can no longer be freed upon accept(), not until 3WHS
finishes

3. allowing a TCP_SYN_RECV socket to properly poll() and sendmsg()
if it's a TFO socket

4. properly closing a TFO listener, and a TFO socket before 3WHS
finishes

5. supporting TCP_FASTOPEN socket option

6. modifying tcp_check_req() to use to check a TFO socket as well
as request_sock

7. supporting TCP's TFO cookie option

8. adding a new SYN-ACK retransmit handler to use the timer directly
off the TFO socket rather than the listener socket. Note that TFO
server side will not retransmit anything other than SYN-ACK until
the 3WHS is completed.

The patch also contains an important function
"reqsk_fastopen_remove()" to manage the somewhat complex relation
between a listener, its request_sock, and the corresponding child
socket. See the comment above the function for the detail.

Signed-off-by: H.K. Jerry Chu <hkchu@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-08-31 20:02:19 -04:00
Sorin Dumitru
eb7e057596 ipv6: remove some deadcode
__ipv6_regen_rndid no longer returns anything other than 0
so there's no point in verifying what it returns

Signed-off-by: Sorin Dumitru <sdumitru@ixiacom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-08-31 16:33:32 -04:00
Julia Lawall
48f125ce1c net: ipv6: fix error return code
Initialize return variable before exiting on an error path.

The initial initialization of the return variable is also dropped, because
that value is never used.

A simplified version of the semantic match that finds this problem is as
follows: (http://coccinelle.lip6.fr/)

// <smpl>
(
if@p1 (\(ret < 0\|ret != 0\))
 { ... return ret; }
|
ret@p1 = 0
)
... when != ret = e1
    when != &ret
*if(...)
{
  ... when != ret = e2
      when forall
 return ret;
}

// </smpl>

Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-08-31 16:27:48 -04:00
Patrick McHardy
8a91bb0c30 netfilter: ip6tables: add stateless IPv6-to-IPv6 Network Prefix Translation target
Signed-off-by: Patrick McHardy <kaber@trash.net>
2012-08-30 03:00:25 +02:00
Patrick McHardy
ed72d9e294 netfilter: ip6tables: add NETMAP target
Signed-off-by: Patrick McHardy <kaber@trash.net>
2012-08-30 03:00:19 +02:00
Patrick McHardy
115e23ac78 netfilter: ip6tables: add REDIRECT target
Signed-off-by: Patrick McHardy <kaber@trash.net>
2012-08-30 03:00:19 +02:00
Patrick McHardy
b3f644fc82 netfilter: ip6tables: add MASQUERADE target
Signed-off-by: Patrick McHardy <kaber@trash.net>
2012-08-30 03:00:18 +02:00
Patrick McHardy
58a317f106 netfilter: ipv6: add IPv6 NAT support
Signed-off-by: Patrick McHardy <kaber@trash.net>
2012-08-30 03:00:17 +02:00
Patrick McHardy
0ad352cb43 netfilter: ipv6: expand skb head in ip6_route_me_harder after oif change
Expand the skb headroom if the oif changed due to rerouting similar to
how IPv4 packets are handled.

Signed-off-by: Patrick McHardy <kaber@trash.net>
2012-08-30 03:00:15 +02:00
Patrick McHardy
2b60af0178 netfilter: nf_conntrack_ipv6: fix tracking of ICMPv6 error messages containing fragments
ICMPv6 error messages are tracked by extracting the conntrack tuple of
the inner packet and looking up the corresponding conntrack entry. Tuple
extraction uses the ->get_l4proto() callback, which in case of fragments
returns NEXTHDR_FRAGMENT instead of the upper protocol, even for the
first fragment when the entire next header is present, resulting in a
failure to find the correct connection tracking entry.

This patch changes ipv6_get_l4proto() to use ipv6_skip_exthdr() instead
of nf_ct_ipv6_skip_exthdr() in order to skip fragment headers when the
fragment offset is zero.

Signed-off-by: Patrick McHardy <kaber@trash.net>
2012-08-30 03:00:11 +02:00
Patrick McHardy
4cdd34084d netfilter: nf_conntrack_ipv6: improve fragmentation handling
The IPv6 conntrack fragmentation currently has a couple of shortcomings.
Fragmentes are collected in PREROUTING/OUTPUT, are defragmented, the
defragmented packet is then passed to conntrack, the resulting conntrack
information is attached to each original fragment and the fragments then
continue their way through the stack.

Helper invocation occurs in the POSTROUTING hook, at which point only
the original fragments are available. The result of this is that
fragmented packets are never passed to helpers.

This patch improves the situation in the following way:

- If a reassembled packet belongs to a connection that has a helper
  assigned, the reassembled packet is passed through the stack instead
  of the original fragments.

- During defragmentation, the largest received fragment size is stored.
  On output, the packet is refragmented if required. If the largest
  received fragment size exceeds the outgoing MTU, a "packet too big"
  message is generated, thus behaving as if the original fragments
  were passed through the stack from an outside point of view.

- The ipv6_helper() hook function can't receive fragments anymore for
  connections using a helper, so it is switched to use ipv6_skip_exthdr()
  instead of the netfilter specific nf_ct_ipv6_skip_exthdr() and the
  reassembled packets are passed to connection tracking helpers.

The result of this is that we can properly track fragmented packets, but
still generate ICMPv6 Packet too big messages if we would have before.

This patch is also required as a precondition for IPv6 NAT, where NAT
helpers might enlarge packets up to a point that they require
fragmentation. In that case we can't generate Packet too big messages
since the proper MTU can't be calculated in all cases (f.i. when
changing textual representation of a variable amount of addresses),
so the packet is transparently fragmented iff the original packet or
fragments would have fit the outgoing MTU.

IPVS parts by Jesper Dangaard Brouer <brouer@redhat.com>.

Signed-off-by: Patrick McHardy <kaber@trash.net>
2012-08-30 03:00:10 +02:00
David S. Miller
e6acb38480 Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
This is an initial merge in of Eric Biederman's work to start adding
user namespace support to the networking.

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-08-24 18:54:37 -04:00
Eric Dumazet
748e2d9396 net: reinstate rtnl in call_netdevice_notifiers()
Eric Biederman pointed out that not holding RTNL while calling
call_netdevice_notifiers() was racy.

This patch is a direct transcription his feedback
against commit 0115e8e30d (net: remove delay at device dismantle)

Thanks Eric !

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Tom Herbert <therbert@google.com>
Cc: Mahesh Bandewar <maheshb@google.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Gao feng <gaofeng@cn.fujitsu.com>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-08-23 09:24:42 -07:00
Eric Dumazet
b87fb39e39 ipv6: gre: fix ip6gre_err()
ip6gre_err() miscomputes grehlen (sizeof(ipv6h) is 4 or 8,
not 40 as expected), and should take into account 'offset' parameter.

Also uses pskb_may_pull() to cope with some fragged skbs

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Dmitry Kozlov <xeb@mail.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-08-22 22:48:32 -07:00
Eric Dumazet
0115e8e30d net: remove delay at device dismantle
I noticed extra one second delay in device dismantle, tracked down to
a call to dst_dev_event() while some call_rcu() are still in RCU queues.

These call_rcu() were posted by rt_free(struct rtable *rt) calls.

We then wait a little (but one second) in netdev_wait_allrefs() before
kicking again NETDEV_UNREGISTER.

As the call_rcu() are now completed, dst_dev_event() can do the needed
device swap on busy dst.

To solve this problem, add a new NETDEV_UNREGISTER_FINAL, called
after a rcu_barrier(), but outside of RTNL lock.

Use NETDEV_UNREGISTER_FINAL with care !

Change dst_dev_event() handler to react to NETDEV_UNREGISTER_FINAL

Also remove NETDEV_UNREGISTER_BATCH, as its not used anymore after
IP cache removal.

With help from Gao feng

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Tom Herbert <therbert@google.com>
Cc: Mahesh Bandewar <maheshb@google.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-08-22 21:50:36 -07:00
David S. Miller
bf277b0cce Merge git://1984.lsi.us.es/nf-next
Pablo Neira Ayuso says:

====================
This is the first batch of Netfilter and IPVS updates for your
net-next tree. Mostly cleanups for the Netfilter side. They are:

* Remove unnecessary RTNL locking now that we have support
  for namespace in nf_conntrack, from Patrick McHardy.

* Cleanup to eliminate unnecessary goto in the initialization
  path of several Netfilter tables, from Jean Sacren.

* Another cleanup from Wu Fengguang, this time to PTR_RET instead
  of if IS_ERR then return PTR_ERR.

* Use list_for_each_entry_continue_rcu in nf_iterate, from
  Michael Wang.

* Add pmtu_disc sysctl option to disable PMTU in their tunneling
  transmitter, from Julian Anastasov.

* Generalize application protocol registration in IPVS and modify
  IPVS FTP helper to use it, from Julian Anastasov.

* update Kconfig. The IPVS FTP helper depends on the Netfilter FTP
  helper for NAT support, from Julian Anastasov.

* Add logic to update PMTU for IPIP packets in IPVS, again
  from Julian Anastasov.

* A couple of sparse warning fixes for IPVS and Netfilter from
  Claudiu Ghioc and Patrick McHardy respectively.

Patrick's IPv6 NAT changes will follow after this batch, I need
to flush this batch first before refreshing my tree.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-08-22 18:48:52 -07:00
David S. Miller
1304a7343b Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2012-08-22 14:21:38 -07:00
Neal Cardwell
fae6ef87fa net: tcp: move sk_rx_dst_set call after tcp_create_openreq_child()
This commit removes the sk_rx_dst_set calls from
tcp_create_openreq_child(), because at that point the icsk_af_ops
field of ipv6_mapped TCP sockets has not been set to its proper final
value.

Instead, to make sure we get the right sk_rx_dst_set variant
appropriate for the address family of the new connection, we have
tcp_v{4,6}_syn_recv_sock() directly call the appropriate function
shortly after the call to tcp_create_openreq_child() returns.

This also moves inet6_sk_rx_dst_set() to avoid a forward declaration
with the new approach.

Signed-off-by: Neal Cardwell <ncardwell@google.com>
Reported-by: Artem Savkov <artem.savkov@gmail.com>
Cc: Eric Dumazet <edumazet@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-08-20 03:03:33 -07:00
Patrick McHardy
9d7b0fc1ef net: ipv6: fix oops in inet_putpeer()
Commit 97bab73f (inet: Hide route peer accesses behind helpers.) introduced
a bug in xfrm6_policy_destroy(). The xfrm_dst's _rt6i_peer member is not
initialized, causing a false positive result from inetpeer_ptr_is_peer(),
which in turn causes a NULL pointer dereference in inet_putpeer().

Pid: 314, comm: kworker/0:1 Not tainted 3.6.0-rc1+ #17 To Be Filled By O.E.M. To Be Filled By O.E.M./P4S800D-X
EIP: 0060:[<c03abf93>] EFLAGS: 00010246 CPU: 0
EIP is at inet_putpeer+0xe/0x16
EAX: 00000000 EBX: f3481700 ECX: 00000000 EDX: 000dd641
ESI: f3481700 EDI: c05e949c EBP: f551def4 ESP: f551def4
 DS: 007b ES: 007b FS: 0000 GS: 00e0 SS: 0068
CR0: 8005003b CR2: 00000070 CR3: 3243d000 CR4: 00000750
DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
DR6: ffff0ff0 DR7: 00000400
 f551df04 c0423de1 00000000 f3481700 f551df18 c038d5f7 f254b9f8 f551df28
 f34f85d8 f551df20 c03ef48d f551df3c c0396870 f30697e8 f24e1738 c05e98f4
 f5509540 c05cd2b4 f551df7c c0142d2b c043feb5 f5509540 00000000 c05cd2e8
 [<c0423de1>] xfrm6_dst_destroy+0x42/0xdb
 [<c038d5f7>] dst_destroy+0x1d/0xa4
 [<c03ef48d>] xfrm_bundle_flo_delete+0x2b/0x36
 [<c0396870>] flow_cache_gc_task+0x85/0x9f
 [<c0142d2b>] process_one_work+0x122/0x441
 [<c043feb5>] ? apic_timer_interrupt+0x31/0x38
 [<c03967eb>] ? flow_cache_new_hashrnd+0x2b/0x2b
 [<c0143e2d>] worker_thread+0x113/0x3cc

Fix by adding a init_dst() callback to struct xfrm_policy_afinfo to
properly initialize the dst's peer pointer.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-08-20 02:56:56 -07:00
Dan Carpenter
5ef5d6c569 gre: information leak in ip6_tnl_ioctl()
There is a one byte hole between p->hop_limit and p->flowinfo where
stack memory is leaked to the user.  This was introduced in c12b395a46
"gre: Support GRE over IPv6".

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
2012-08-20 02:21:30 -07:00
Dan Carpenter
898132ae76 ipv6: move dereference after check in fl_free()
There is a dereference before checking for NULL bug here.  Generally
free() functions should accept NULL pointers.  For example, fl_create()
can pass a NULL pointer to fl_free() on the error path.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-08-16 16:04:42 -07:00
Eric W. Biederman
4f82f45730 net ip6 flowlabel: Make owner a union of struct pid * and kuid_t
Correct a long standing omission and use struct pid in the owner
field of struct ip6_flowlabel when the share type is IPV6_FL_S_PROCESS.
This guarantees we don't have issues when pid wraparound occurs.

Use a kuid_t in the owner field of struct ip6_flowlabel when the
share type is IPV6_FL_S_USER to add user namespace support.

In /proc/net/ip6_flowlabel capture the current pid namespace when
opening the file and release the pid namespace when the file is
closed ensuring we print the pid owner value that is meaning to
the reader of the file.  Similarly use from_kuid_munged to print
uid values that are meaningful to the reader of the file.

This requires exporting pid_nr_ns so that ipv6 can continue to built
as a module.  Yoiks what silliness

Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-08-14 21:49:25 -07:00
Eric W. Biederman
a7cb5a49bf userns: Print out socket uids in a user namespace aware fashion.
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: James Morris <jmorris@namei.org>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: Patrick McHardy <kaber@trash.net>
Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Cc: Sridhar Samudrala <sri@us.ibm.com>
Acked-by: Vlad Yasevich <vyasevich@gmail.com>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-08-14 21:48:06 -07:00
Ben Hutchings
4acd4945cd ipv6: addrconf: Avoid calling netdevice notifiers with RCU read-side lock
Cong Wang reports that lockdep detected suspicious RCU usage while
enabling IPV6 forwarding:

 [ 1123.310275] ===============================
 [ 1123.442202] [ INFO: suspicious RCU usage. ]
 [ 1123.558207] 3.6.0-rc1+ #109 Not tainted
 [ 1123.665204] -------------------------------
 [ 1123.768254] include/linux/rcupdate.h:430 Illegal context switch in RCU read-side critical section!
 [ 1123.992320]
 [ 1123.992320] other info that might help us debug this:
 [ 1123.992320]
 [ 1124.307382]
 [ 1124.307382] rcu_scheduler_active = 1, debug_locks = 0
 [ 1124.522220] 2 locks held by sysctl/5710:
 [ 1124.648364]  #0:  (rtnl_mutex){+.+.+.}, at: [<ffffffff81768498>] rtnl_trylock+0x15/0x17
 [ 1124.882211]  #1:  (rcu_read_lock){.+.+.+}, at: [<ffffffff81871df8>] rcu_lock_acquire+0x0/0x29
 [ 1125.085209]
 [ 1125.085209] stack backtrace:
 [ 1125.332213] Pid: 5710, comm: sysctl Not tainted 3.6.0-rc1+ #109
 [ 1125.441291] Call Trace:
 [ 1125.545281]  [<ffffffff8109d915>] lockdep_rcu_suspicious+0x109/0x112
 [ 1125.667212]  [<ffffffff8107c240>] rcu_preempt_sleep_check+0x45/0x47
 [ 1125.781838]  [<ffffffff8107c260>] __might_sleep+0x1e/0x19b
[...]
 [ 1127.445223]  [<ffffffff81757ac5>] call_netdevice_notifiers+0x4a/0x4f
[...]
 [ 1127.772188]  [<ffffffff8175e125>] dev_disable_lro+0x32/0x6b
 [ 1127.885174]  [<ffffffff81872d26>] dev_forward_change+0x30/0xcb
 [ 1128.013214]  [<ffffffff818738c4>] addrconf_forward_change+0x85/0xc5
[...]

addrconf_forward_change() uses RCU iteration over the netdev list,
which is unnecessary since it already holds the RTNL lock.  We also
cannot reasonably require netdevice notifier functions not to sleep.

Reported-by: Cong Wang <amwang@redhat.com>
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-08-14 17:02:12 -07:00