Commit Graph

4445 Commits

Author SHA1 Message Date
Maciej Żenczykowski
6cc7a765c2 net: allow CAP_NET_RAW to set socket options IP{,V6}_TRANSPARENT
Up till now the IP{,V6}_TRANSPARENT socket options (which actually set
the same bit in the socket struct) have required CAP_NET_ADMIN
privileges to set or clear the option.

- we make clearing the bit not require any privileges.
- we allow CAP_NET_ADMIN to set the bit (as before this change)
- we allow CAP_NET_RAW to set this bit, because raw
  sockets already pretty much effectively allow you
  to emulate socket transparency.

Signed-off-by: Maciej Żenczykowski <maze@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-10-20 18:21:36 -04:00
Eric Dumazet
20c4cb792d tcp: remove unused tcp_fin() parameters
tcp_fin() only needs socket pointer, we can remove skb and th params.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-10-20 17:44:03 -04:00
Eric Dumazet
e9266a02b7 tcp: use TCP_DEFAULT_INIT_RCVWND in tcp_fixup_rcvbuf()
Since commit 356f039822 (TCP: increase default initial receive
window.), we allow sender to send 10 (TCP_DEFAULT_INIT_RCVWND) segments.

Change tcp_fixup_rcvbuf() to reflect this change, even if no real change
is expected, since sysctl_tcp_rmem[1] = 87380 and this value
is bigger than tcp_fixup_rcvbuf() computed rcvmem (~23720)

Note: Since commit 356f039822 limited default window to maximum of
10*1460 and 2*MSS, we use same heuristic in this patch.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-10-20 16:54:51 -04:00
Eric Dumazet
113ab386c7 ip_gre: dont increase dev->needed_headroom on a live device
It seems ip_gre is able to change dev->needed_headroom on the fly.

Its is not legal unfortunately and triggers a BUG in raw_sendmsg()

skb = sock_alloc_send_skb(sk, ... + LL_ALLOCATED_SPACE(rt->dst.dev)

< another cpu change dev->needed_headromm (making it bigger)

...
skb_reserve(skb, LL_RESERVED_SPACE(rt->dst.dev));

We end with LL_RESERVED_SPACE() being bigger than LL_ALLOCATED_SPACE()
-> we crash later because skb head is exhausted.

Bug introduced in commit 243aad83 in 2.6.34 (ip_gre: include route
header_len in max_headroom calculation)

Reported-by: Elmar Vonlanthen <evonlanthen@gmail.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Timo Teräs <timo.teras@iki.fi>
CC: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-10-20 16:20:30 -04:00
Gerrit Renker
686dc6b64b ipv4: compat_ioctl is local to af_inet.c, make it static
ipv4: compat_ioctl is local to af_inet.c, make it static

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-10-19 19:24:39 -04:00
Eric Dumazet
06a59ecb92 tcp: use TCP_INIT_CWND in tcp_fixup_sndbuf()
Initial cwnd being 10 (TCP_INIT_CWND) instead of 3, change
tcp_fixup_sndbuf() to get more than 16384 bytes (sysctl_tcp_wmem[1]) in
initial sk_sndbuf

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-10-19 16:53:30 -04:00
KOVACS Krisztian
58af19e387 tproxy: copy transparent flag when creating a time wait
The transparent socket option setting was not copied to the time wait
socket when an inet socket was being replaced by a time wait socket. This
broke the --transparent option of the socket match and may have caused
that FIN packets belonging to sockets in FIN_WAIT2 or TIME_WAIT state
were being dropped by the packet filter.

Signed-off-by: KOVACS Krisztian <hidden@balabit.hu>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-10-19 03:21:35 -04:00
Eric Dumazet
9e903e0852 net: add skb frag size accessors
To ease skb->truesize sanitization, its better to be able to localize
all references to skb frags size.

Define accessors : skb_frag_size() to fetch frag size, and
skb_frag_size_{set|add|sub}() to manipulate it.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-10-19 03:10:46 -04:00
Eric Dumazet
bc416d9768 macvlan: handle fragmented multicast frames
Fragmented multicast frames are delivered to a single macvlan port,
because ip defrag logic considers other samples are redundant.

Implement a defrag step before trying to send the multicast frame.

Reported-by: Ben Greear <greearb@candelatech.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-10-18 23:22:07 -04:00
Eric Dumazet
87fb4b7b53 net: more accurate skb truesize
skb truesize currently accounts for sk_buff struct and part of skb head.
kmalloc() roundings are also ignored.

Considering that skb_shared_info is larger than sk_buff, its time to
take it into account for better memory accounting.

This patch introduces SKB_TRUESIZE(X) macro to centralize various
assumptions into a single place.

At skb alloc phase, we put skb_shared_info struct at the exact end of
skb head, to allow a better use of memory (lowering number of
reallocations), since kmalloc() gives us power-of-two memory blocks.

Unless SLUB/SLUB debug is active, both skb->head and skb_shared_info are
aligned to cache lines, as before.

Note: This patch might trigger performance regressions because of
misconfigured protocol stacks, hitting per socket or global memory
limits that were previously not reached. But its a necessary step for a
more accurate memory accounting.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Andi Kleen <ak@linux.intel.com>
CC: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-10-13 16:05:07 -04:00
Murali Raja
3ceca74966 net-netlink: Add a new attribute to expose TOS values via netlink
This patch exposes the tos value for the TCP sockets when the TOS flag
is requested in the ext_flags for the inet_diag request. This would mainly be
used to expose TOS values for both for TCP and UDP sockets. Currently it is
supported for TCP. When netlink support for UDP would be added the support
to expose the TOS values would alse be done. For IPV4 tos value is exposed
and for IPV6 tclass value is exposed.

Signed-off-by: Murali Raja <muralira@google.com>
Acked-by: Stephen Hemminger <shemminger@vyatta.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-10-12 19:09:18 -04:00
Dan Carpenter
5675592410 cipso: remove an unneeded NULL check in cipso_v4_doi_add()
We dereference doi_def on the line before the NULL check.  It has
been this way since 2008.  I checked all the callers and doi_def is
always non-NULL here.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-10-11 18:43:53 -04:00
David S. Miller
88c5100c28 Merge branch 'master' of github.com:davem330/net
Conflicts:
	net/batman-adv/soft-interface.c
2011-10-07 13:38:43 -04:00
Yan, Zheng
1e5289e121 tcp: properly update lost_cnt_hint during shifting
lost_skb_hint is used by tcp_mark_head_lost() to mark the first unhandled skb.
lost_cnt_hint is the number of packets or sacked packets before the lost_skb_hint;
When shifting a skb that is before the lost_skb_hint, if tcp_is_fack() is ture,
the skb has already been counted in the lost_cnt_hint; if tcp_is_fack() is false,
tcp_sacktag_one() will increase the lost_cnt_hint. So tcp_shifted_skb() does not
need to adjust the lost_cnt_hint by itself. When shifting a skb that is equal to
lost_skb_hint, the shifted packets will not be counted by tcp_mark_head_lost().
So tcp_shifted_skb() should adjust the lost_cnt_hint even tcp_is_fack(tp) is true.

Signed-off-by: Zheng Yan <zheng.z.yan@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-10-04 23:31:24 -04:00
Yan, Zheng
260fcbeb1a tcp: properly handle md5sig_pool references
tcp_v4_clear_md5_list() assumes that multiple tcp md5sig peers
only hold one reference to md5sig_pool. but tcp_v4_md5_do_add()
increases use count of md5sig_pool for each peer. This patch
makes tcp_v4_md5_do_add() only increases use count for the first
tcp md5sig peer.

Signed-off-by: Zheng Yan <zheng.z.yan@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-10-04 23:31:24 -04:00
Vasily Averin
349d2895cc ipv4: NET_IPV4_ROUTE_GC_INTERVAL removal
removing obsoleted sysctl,
ip_rt_gc_interval variable no longer used since 2.6.38

Signed-off-by: Vasily Averin <vvs@sw.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-10-03 14:13:01 -04:00
Eric Dumazet
b5c5693bb7 tcp: report ECN_SEEN in tcp_info
Allows ss command (iproute2) to display "ecnseen" if at least one packet
with ECT(0) or ECT(1) or ECN was received by this socket.

"ecn" means ECN was negotiated at session establishment (TCP level)

"ecnseen" means we received at least one packet with ECT fields set (IP
level)

ss -i
...
ESTAB      0      0   192.168.20.110:22  192.168.20.144:38016
ino:5950 sk:f178e400
	 mem:(r0,w0,f0,t0) ts sack ecn ecnseen bic wscale:7,8 rto:210
rtt:12.5/7.5 cwnd:10 send 9.3Mbps rcv_space:14480

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-10-03 14:01:21 -04:00
Eric Dumazet
4de075e043 tcp: rename tcp_skb_cb flags
Rename struct tcp_skb_cb "flags" to "tcp_flags" to ease code review and
maintenance.

Its content is a combination of FIN/SYN/RST/PSH/ACK/URG/ECE/CWR flags

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-09-27 13:25:05 -04:00
Eric Dumazet
b82d1bb4fd tcp: unalias tcp_skb_cb flags and ip_dsfield
struct tcp_skb_cb contains a "flags" field containing either tcp flags
or IP dsfield depending on context (input or output path)

Introduce ip_dsfield to make the difference clear and ease maintenance.
If later we want to save space, we can union flags/ip_dsfield

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-09-27 02:20:08 -04:00
Eric Dumazet
7a269ffad7 tcp: ECN blackhole should not force quickack mode
While playing with a new ADSL box at home, I discovered that ECN
blackhole can trigger suboptimal quickack mode on linux : We send one
ACK for each incoming data frame, without any delay and eventual
piggyback.

This is because TCP_ECN_check_ce() considers that if no ECT is seen on a
segment, this is because this segment was a retransmit.

Refine this heuristic and apply it only if we seen ECT in a previous
segment, to detect ECN blackhole at IP level.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Jamal Hadi Salim <jhs@mojatatu.com>
CC: Jerry Chu <hkchu@google.com>
CC: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
CC: Jim Gettys <jg@freedesktop.org>
CC: Dave Taht <dave.taht@gmail.com>
Acked-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-09-27 00:58:44 -04:00
David S. Miller
8decf86879 Merge branch 'master' of github.com:davem330/net
Conflicts:
	MAINTAINERS
	drivers/net/Kconfig
	drivers/net/ethernet/broadcom/bnx2x/bnx2x_link.c
	drivers/net/ethernet/broadcom/tg3.c
	drivers/net/wireless/iwlwifi/iwl-pci.c
	drivers/net/wireless/iwlwifi/iwl-trans-tx-pcie.c
	drivers/net/wireless/rt2x00/rt2800usb.c
	drivers/net/wireless/wl12xx/main.c
2011-09-22 03:23:13 -04:00
Zheng Yan
f779b2d60a tcp: fix validation of D-SACK
D-SACK is allowed to reside below snd_una. But the corresponding check
in tcp_is_sackblock_valid() is the exact opposite. It looks like a typo.

Signed-off-by: Zheng Yan <zheng.z.yan@intel.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-09-18 22:37:34 -04:00
Eric Dumazet
765cf9976e tcp: md5: remove one indirection level in tcp_md5sig_pool
tcp_md5sig_pool is currently an 'array' (a percpu object) of pointers to
struct tcp_md5sig_pool. Only the pointers are NUMA aware, but objects
themselves are all allocated on a single node.

Remove this extra indirection to get proper percpu memory (NUMA aware)
and make code simpler.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-09-17 01:15:46 -04:00
Yan, Zheng
19c1ea14c9 ipv4: Fix fib_info->fib_metrics leak
Commit 4670994d(net,rcu: convert call_rcu(fc_rport_free_rcu) to
kfree_rcu()) introduced a memory leak. This patch reverts it.

Signed-off-by: Zheng Yan <zheng.z.yan@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-09-16 17:42:26 -04:00
David S. Miller
52b9aca7ae Merge branch 'master' of ../netdev/ 2011-09-16 01:09:02 -04:00
Eric Dumazet
946cedccbd tcp: Change possible SYN flooding messages
"Possible SYN flooding on port xxxx " messages can fill logs on servers.

Change logic to log the message only once per listener, and add two new
SNMP counters to track :

TCPReqQFullDoCookies : number of times a SYNCOOKIE was replied to client

TCPReqQFullDrop : number of times a SYN request was dropped because
syncookies were not enabled.

Based on a prior patch from Tom Herbert, and suggestions from David.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-09-15 14:49:43 -04:00
Eric Dumazet
29c486df6a net: ipv4: relax AF_INET check in bind()
commit d0733d2e29 (Check for mistakenly passed in non-IPv4 address)
added regression on legacy apps that use bind() with AF_UNSPEC family.

Relax the check, but make sure the bind() is done on INADDR_ANY
addresses, as AF_UNSPEC has probably no sane meaning for other
addresses.

Bugzilla reference : https://bugzilla.kernel.org/show_bug.cgi?id=42012

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Reported-and-bisected-by: Rene Meier <r_meier@freenet.de>
CC: Marcus Meissner <meissner@suse.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-08-30 18:57:00 -04:00
David S. Miller
7858241655 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/kaber/nf-2.6 2011-08-30 17:43:56 -04:00
Florian Westphal
c6675233f9 netfilter: nf_queue: reject NF_STOLEN verdicts from userspace
A userspace listener may send (bogus) NF_STOLEN verdict, which causes skb leak.

This problem was previously fixed via
64507fdbc2 (netfilter:
nf_queue: fix NF_STOLEN skb leak) but this had to be reverted because
NF_STOLEN can also be returned by a netfilter hook when iterating the
rules in nf_reinject.

Reject userspace NF_STOLEN verdict, as suggested by Michal Miroslaw.

This is complementary to commit fad5444043
(netfilter: avoid double free in nf_reinject).

Cc: Julian Anastasov <ja@ssi.bg>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Patrick McHardy <kaber@trash.net>
2011-08-30 15:01:20 +02:00
Nandita Dukkipati
a262f0cdf1 Proportional Rate Reduction for TCP.
This patch implements Proportional Rate Reduction (PRR) for TCP.
PRR is an algorithm that determines TCP's sending rate in fast
recovery. PRR avoids excessive window reductions and aims for
the actual congestion window size at the end of recovery to be as
close as possible to the window determined by the congestion control
algorithm. PRR also improves accuracy of the amount of data sent
during loss recovery.

The patch implements the recommended flavor of PRR called PRR-SSRB
(Proportional rate reduction with slow start reduction bound) and
replaces the existing rate halving algorithm. PRR improves upon the
existing Linux fast recovery under a number of conditions including:
  1) burst losses where the losses implicitly reduce the amount of
outstanding data (pipe) below the ssthresh value selected by the
congestion control algorithm and,
  2) losses near the end of short flows where application runs out of
data to send.

As an example, with the existing rate halving implementation a single
loss event can cause a connection carrying short Web transactions to
go into the slow start mode after the recovery. This is because during
recovery Linux pulls the congestion window down to packets_in_flight+1
on every ACK. A short Web response often runs out of new data to send
and its pipe reduces to zero by the end of recovery when all its packets
are drained from the network. Subsequent HTTP responses using the same
connection will have to slow start to raise cwnd to ssthresh. PRR on
the other hand aims for the cwnd to be as close as possible to ssthresh
by the end of recovery.

A description of PRR and a discussion of its performance can be found at
the following links:
- IETF Draft:
    http://tools.ietf.org/html/draft-mathis-tcpm-proportional-rate-reduction-01
- IETF Slides:
    http://www.ietf.org/proceedings/80/slides/tcpm-6.pdf
    http://tools.ietf.org/agenda/81/slides/tcpm-2.pdf
- Paper to appear in Internet Measurements Conference (IMC) 2011:
    Improving TCP Loss Recovery
    Nandita Dukkipati, Matt Mathis, Yuchung Cheng

Signed-off-by: Nandita Dukkipati <nanditad@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-08-24 19:40:40 -07:00
Ian Campbell
aff65da0f1 net: ipv4: convert to SKB frag APIs
Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: "Pekka Savola (ipv6)" <pekkas@netcore.fi>
Cc: James Morris <jmorris@namei.org>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: Patrick McHardy <kaber@trash.net>
Cc: netdev@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-08-24 17:52:11 -07:00
Yan, Zheng
e05c4ad3ed mcast: Fix source address selection for multicast listener report
Should check use count of include mode filter instead of total number
of include mode filters.

Signed-off-by: Zheng Yan <zheng.z.yan@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-08-24 17:46:15 -07:00
David S. Miller
823dcd2506 Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net 2011-08-20 10:39:12 -07:00
Jiri Pirko
b81693d914 net: remove ndo_set_multicast_list callback
Remove no longer used operation.

Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-08-17 20:22:03 -07:00
Tom Herbert
bdeab99191 rps: Add flag to skb to indicate rxhash is based on L4 tuple
The l4_rxhash flag was added to the skb structure to indicate
that the rxhash value was computed over the 4 tuple for the
packet which includes the port information in the encapsulated
transport packet.  This is used by the stack to preserve the
rxhash value in __skb_rx_tunnel.

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-08-17 20:06:03 -07:00
Eric Dumazet
33d480ce6d net: cleanup some rcu_dereference_raw
RCU api had been completed and rcu_access_pointer() or
rcu_dereference_protected() are better than generic
rcu_dereference_raw()

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-08-12 02:55:28 -07:00
Julian Anastasov
97a8041020 ipv4: some rt_iif -> rt_route_iif conversions
As rt_iif represents input device even for packets
coming from loopback with output route, it is not an unique
key specific to input routes. Now rt_route_iif has such role,
it was fl.iif in 2.6.38, so better to change the checks at
some places to save CPU cycles and to restore 2.6.38 semantics.

compare_keys:
	- input routes: only rt_route_iif matters, rt_iif is same
	- output routes: only rt_oif matters, rt_iif is not
		used for matching in __ip_route_output_key
	- now we are back to 2.6.38 state

ip_route_input_common:
	- matching rt_route_iif implies input route
	- compared to 2.6.38 we eliminated one rth->fl.oif check
	because it was not needed even for 2.6.38

compare_hash_inputs:
	Only the change here is not an optimization, it has
	effect only for output routes. I assume I'm restoring
	the original intention to ignore oif, it was using fl.iif
	- now we are back to 2.6.38 state

Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-08-11 05:58:59 -07:00
Mike Waychison
f0e3d0689d tcp: initialize variable ecn_ok in syncookies path
Using a gcc 4.4.3, warnings are emitted for a possibly uninitialized use
of ecn_ok.

This can happen if cookie_check_timestamp() returns due to not having
seen a timestamp.  Defaulting to ecn off seems like a reasonable thing
to do in this case, so initialized ecn_ok to false.

Signed-off-by: Mike Waychison <mikew@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-08-10 21:59:57 -07:00
David S. Miller
19fd61785a Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net 2011-08-07 23:20:26 -07:00
Julian Anastasov
d52fbfc9e5 ipv4: use dst with ref during bcast/mcast loopback
Make sure skb dst has reference when moving to
another context. Currently, I don't see protocols that can
hit it when sending broadcasts/multicasts to loopback using
noref dsts, so it is just a precaution.

Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-08-07 22:52:32 -07:00
Julian Anastasov
47670b767b ipv4: route non-local sources for raw socket
The raw sockets can provide source address for
routing but their privileges are not considered. We
can provide non-local source address, make sure the
FLOWI_FLAG_ANYSRC flag is set if socket has privileges
for this, i.e. based on hdrincl (IP_HDRINCL) and
transparent flags.

Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-08-07 22:52:32 -07:00
Julian Anastasov
797fd3913a netfilter: TCP and raw fix for ip_route_me_harder
TCP in some cases uses different global (raw) socket
to send RST and ACK. The transparent flag is not set there.
Currently, it is a problem for rerouting after the previous
change.

	Fix it by simplifying the checks in ip_route_me_harder
and use FLOWI_FLAG_ANYSRC even for sockets. It looks safe
because the initial routing allowed this source address to
be used and now we just have to make sure the packet is rerouted.

	As a side effect this also allows rerouting for normal
raw sockets that use spoofed source addresses which was not possible
even before we eliminated the ip_route_input call.

Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-08-07 22:52:32 -07:00
Daniel Baluta
dd23198e58 ipv4: Fix ip_getsockopt for IP_PKTOPTIONS
IP_PKTOPTIONS is broken for 32-bit applications running
in COMPAT mode on 64-bit kernels.

This happens because msghdr's msg_flags field is always
set to zero. When running in COMPAT mode this should be
set to MSG_CMSG_COMPAT instead.

Signed-off-by: Tiberiu Szocs-Mihai <tszocs@ixiacom.com>
Signed-off-by: Daniel Baluta <dbaluta@ixiacom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-08-07 22:31:07 -07:00
Julian Anastasov
d547f727df ipv4: fix the reusing of routing cache entries
compare_keys and ip_route_input_common rely on
rt_oif for distinguishing of input and output routes
with same keys values. But sometimes the input route has
also same hash chain (keyed by iif != 0) with the output
routes (keyed by orig_oif=0). Problem visible if running
with small number of rhash_entries.

	Fix them to use rt_route_iif instead. By this way
input route can not be returned to users that request
output route.

	The patch fixes the ip_rt_bug errors that were
reported in ip_local_out context, mostly for 255.255.255.255
destinations.

Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-08-07 22:20:20 -07:00
David S. Miller
6e5714eaf7 net: Compute protocol sequence numbers and fragment IDs using MD5.
Computers have become a lot faster since we compromised on the
partial MD4 hash which we use currently for performance reasons.

MD5 is a much safer choice, and is inline with both RFC1948 and
other ISS generators (OpenBSD, Solaris, etc.)

Furthermore, only having 24-bits of the sequence number be truly
unpredictable is a very serious limitation.  So the periodic
regeneration and 8-bit counter have been removed.  We compute and
use a full 32-bit sequence number.

For ipv6, DCCP was found to use a 32-bit truncated initial sequence
number (it needs 43-bits) and that is fixed here as well.

Reported-by: Dan Kaminsky <dan@doxpara.com>
Tested-by: Willy Tarreau <w@1wt.eu>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-08-06 18:33:19 -07:00
Eric Dumazet
f2c31e32b3 net: fix NULL dereferences in check_peer_redir()
Gergely Kalman reported crashes in check_peer_redir().

It appears commit f39925dbde (ipv4: Cache learned redirect
information in inetpeer.) added a race, leading to possible NULL ptr
dereference.

Since we can now change dst neighbour, we should make sure a reader can
safely use a neighbour.

Add RCU protection to dst neighbour, and make sure check_peer_redir()
can be called safely by different cpus in parallel.

As neighbours are already freed after one RCU grace period, this patch
should not add typical RCU penalty (cache cold effects)

Many thanks to Gergely for providing a pretty report pointing to the
bug.

Reported-by: Gergely Kalman <synapse@hippy.csoma.elte.hu>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-08-03 03:34:12 -07:00
Stephen Hemminger
a9b3cd7f32 rcu: convert uses of rcu_assign_pointer(x, NULL) to RCU_INIT_POINTER
When assigning a NULL value to an RCU protected pointer, no barrier
is needed. The rcu_assign_pointer, used to handle that but will soon
change to not handle the special case.

Convert all rcu_assign_pointer of NULL value.

//smpl
@@ expression P; @@

- rcu_assign_pointer(P, NULL)
+ RCU_INIT_POINTER(P, NULL)

// </smpl>

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-08-02 04:29:23 -07:00
Julia Lawall
a1889c0d20 net: adjust array index
Convert array index from the loop bound to the loop index.

A simplified version of the semantic patch that fixes this problem is as
follows: (http://coccinelle.lip6.fr/)

// <smpl>
@@
expression e1,e2,ar;
@@

for(e1 = 0; e1 < e2; e1++) { <...
  ar[
- e2
+ e1
  ]
  ...> }
// </smpl>

Signed-off-by: Julia Lawall <julia@diku.dk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-08-01 02:27:21 -07:00
Jesper Juhl
91c66c6893 netfilter: ip_queue: Fix small leak in ipq_build_packet_message()
ipq_build_packet_message() in net/ipv4/netfilter/ip_queue.c and
net/ipv6/netfilter/ip6_queue.c contain a small potential mem leak as
far as I can tell.

We allocate memory for 'skb' with alloc_skb() annd then call
 nlh = NLMSG_PUT(skb, 0, 0, IPQM_PACKET, size - sizeof(*nlh));

NLMSG_PUT is a macro
 NLMSG_PUT(skb, pid, seq, type, len) \
  		NLMSG_NEW(skb, pid, seq, type, len, 0)

that expands to NLMSG_NEW, which is also a macro which expands to:
 NLMSG_NEW(skb, pid, seq, type, len, flags) \
  	({	if (unlikely(skb_tailroom(skb) < (int)NLMSG_SPACE(len))) \
  			goto nlmsg_failure; \
  		__nlmsg_put(skb, pid, seq, type, len, flags); })

If we take the true branch of the 'if' statement and 'goto
nlmsg_failure', then we'll, at that point, return from
ipq_build_packet_message() without having assigned 'skb' to anything
and we'll leak the memory we allocated for it when it goes out of
scope.

Fix this by placing a 'kfree(skb)' at 'nlmsg_failure'.

I admit that I do not know how likely this to actually happen or even
if there's something that guarantees that it will never happen - I'm
not that familiar with this code, but if that is so, I've not been
able to spot it.

Signed-off-by: Jesper Juhl <jj@chaosbits.net>
Signed-off-by: Patrick McHardy <kaber@trash.net>
2011-07-29 16:38:49 +02:00
Linus Torvalds
d5eab9152a Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (32 commits)
  tg3: Remove 5719 jumbo frames and TSO blocks
  tg3: Break larger frags into 4k chunks for 5719
  tg3: Add tx BD budgeting code
  tg3: Consolidate code that calls tg3_tx_set_bd()
  tg3: Add partial fragment unmapping code
  tg3: Generalize tg3_skb_error_unmap()
  tg3: Remove short DMA check for 1st fragment
  tg3: Simplify tx bd assignments
  tg3: Reintroduce tg3_tx_ring_info
  ASIX: Use only 11 bits of header for data size
  ASIX: Simplify condition in rx_fixup()
  Fix cdc-phonet build
  bonding: reduce noise during init
  bonding: fix string comparison errors
  net: Audit drivers to identify those needing IFF_TX_SKB_SHARING cleared
  net: add IFF_SKB_TX_SHARED flag to priv_flags
  net: sock_sendmsg_nosec() is static
  forcedeth: fix vlans
  gianfar: fix bug caused by 87c288c6e9
  gro: Only reset frag0 when skb can be pulled
  ...
2011-07-28 05:58:19 -07:00
Arun Sharma
60063497a9 atomic: use <linux/atomic.h>
This allows us to move duplicated code in <asm/atomic.h>
(atomic_inc_not_zero() for now) to <linux/atomic.h>

Signed-off-by: Arun Sharma <asharma@fb.com>
Reviewed-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: David Miller <davem@davemloft.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Mike Frysinger <vapier@gentoo.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-07-26 16:49:47 -07:00
Zoltan Kiss
b76d0789c9 IPv4: Send gratuitous ARP for secondary IP addresses also
If a device event generates gratuitous ARP messages, only primary
address is used for sending. This patch iterates through the whole
list. Tested with 2 IP addresses configuration on bonding interface.

Signed-off-by: Zoltan Kiss <schaman@sch.bme.hu>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-07-25 16:16:00 -07:00
xeb@mail.ru
559fafb94a gre: fix improper error handling
Fix improper protocol err_handler, current implementation is fully
unapplicable and may cause kernel crash due to double kfree_skb.

Signed-off-by: Dmitry Kozlov <xeb@mail.ru>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-07-23 20:06:00 -07:00
Julian Anastasov
b0fe4a3184 ipv4: use RT_TOS after some rt_tos conversions
rt_tos was changed to iph->tos but it must be filtered by RT_TOS

Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-07-23 20:05:31 -07:00
David S. Miller
415b3334a2 icmp: Fix regression in nexthop resolution during replies.
icmp_route_lookup() uses the wrong flow parameters if the reverse
session route lookup isn't used.

So do not commit to the re-decoded flow until we actually make a
final decision to use a real route saved in 'rt2'.

Reported-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-07-22 06:22:10 -07:00
Bill Sommerfeld
d9be4f7a6f ipv4: Constrain UFO fragment sizes to multiples of 8 bytes
Because the ip fragment offset field counts 8-byte chunks, ip
fragments other than the last must contain a multiple of 8 bytes of
payload.  ip_ufo_append_data wasn't respecting this constraint and,
depending on the MTU and ip option sizes, could create malformed
non-final fragments.

Google-Bug-Id: 5009328
Signed-off-by: Bill Sommerfeld <wsommerfeld@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-07-21 21:31:41 -07:00
Eric Dumazet
87c48fa3b4 ipv6: make fragment identifications less predictable
IPv6 fragment identification generation is way beyond what we use for
IPv4 : It uses a single generator. Its not scalable and allows DOS
attacks.

Now inetpeer is IPv6 aware, we can use it to provide a more secure and
scalable frag ident generator (per destination, instead of system wide)

This patch :
1) defines a new secure_ipv6_id() helper
2) extends inet_getid() to provide 32bit results
3) extends ipv6_select_ident() with a new dest parameter

Reported-by: Fernando Gont <fernando@gont.com.ar>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-07-21 21:25:58 -07:00
Jiri Pirko
9fea03302a lro: do vlan cleanup
- remove useless vlan parameters and pointers

Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-07-21 13:47:54 -07:00
Jiri Pirko
0f7257281d lro: kill lro_vlan_hwaccel_receive_frags
Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-07-21 13:47:54 -07:00
Jiri Pirko
7756a96e19 lro: kill lro_vlan_hwaccel_receive_skb
no longer used

Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-07-21 13:47:54 -07:00
Eric Dumazet
5c74501f76 ipv4: save cpu cycles from check_leaf()
Compiler is not smart enough to avoid double BSWAP instructions in
ntohl(inet_make_mask(plen)).

Lets cache this value in struct leaf_info, (fill a hole on 64bit arches)

With route cache disabled, this saves ~2% of cpu in udpflood bench on
x86_64 machine.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-07-18 10:41:18 -07:00
David S. Miller
d3aaeb38c4 net: Add ->neigh_lookup() operation to dst_ops
In the future dst entries will be neigh-less.  In that environment we
need to have an easy transition point for current users of
dst->neighbour outside of the packet output fast path.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-07-18 00:40:17 -07:00
David S. Miller
69cce1d140 net: Abstract dst->neighbour accesses behind helpers.
dst_{get,set}_neighbour()

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-07-17 23:11:35 -07:00
David S. Miller
8f40b161de neigh: Pass neighbour entry to output ops.
This will get us closer to being able to do "neigh stuff"
completely independent of the underlying dst_entry for
protocols (ipv4/ipv6) that wish to do so.

We will also be able to make dst entries neigh-less.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-07-17 23:11:17 -07:00
David S. Miller
542d4d685f neigh: Kill ndisc_ops->queue_xmit
It is always dev_queue_xmit().

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-07-16 18:30:59 -07:00
David S. Miller
b23b5455b6 neigh: Kill hh_cache->hh_output
It's just taking on one of two possible values, either
neigh_ops->output or dev_queue_xmit().  And this is purely depending
upon whether nud_state has NUD_CONNECTED set or not.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-07-16 17:45:02 -07:00
David S. Miller
47ec132a40 neigh: Kill neigh_ops->hh_output
It's always dev_queue_xmit().

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-07-16 17:39:57 -07:00
David S. Miller
05e3aa0949 net: Create and use new helper, neigh_output().
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-07-16 17:26:00 -07:00
David S. Miller
fec8292d9c ipv4: Use calculated 'neigh' instead of re-evaluating dst->neighbour
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-07-16 14:25:54 -07:00
David S. Miller
6a7ebdf2fd Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
Conflicts:
	net/bluetooth/l2cap_core.c
2011-07-14 07:56:40 -07:00
David S. Miller
f6b72b6217 net: Embed hh_cache inside of struct neighbour.
Now that there is a one-to-one correspondance between neighbour
and hh_cache entries, we no longer need:

1) dynamic allocation
2) attachment to dst->hh
3) refcounting

Initialization of the hh_cache entry is indicated by hh_len
being non-zero, and such initialization is always done with
the neighbour's lock held as a writer.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-07-14 07:53:20 -07:00
David Miller
3769cffb1c ipv4: Inline neigh binding.
Get rid of all of the useless and costly indirection
by doing the neigh hash table lookup directly inside
of the neighbour binding.

Rename from arp_bind_neighbour to rt_bind_neighbour.

Use new helpers {__,}ipv4_neigh_lookup()

In rt_bind_neighbour() get rid of useless tests which
are never true in the context this function is called,
namely dev is never NULL and the dst->neighbour is
always NULL.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-07-13 01:12:28 -07:00
Eric Dumazet
6d1a3e042f inetpeer: kill inet_putpeer race
We currently can free inetpeer entries too early :

[  782.636674] WARNING: kmemcheck: Caught 32-bit read from uninitialized memory (f130f44c)
[  782.636677] 1f7b13c100000000000000000000000002000000000000000000000000000000
[  782.636686]  i i i i u u u u i i i i u u u u i i i i u u u u u u u u u u u u
[  782.636694]                          ^
[  782.636696]
[  782.636698] Pid: 4638, comm: ssh Not tainted 3.0.0-rc5+ #270 Hewlett-Packard HP Compaq 6005 Pro SFF PC/3047h
[  782.636702] EIP: 0060:[<c13fefbb>] EFLAGS: 00010286 CPU: 0
[  782.636707] EIP is at inet_getpeer+0x25b/0x5a0
[  782.636709] EAX: 00000002 EBX: 00010080 ECX: f130f3c0 EDX: f0209d30
[  782.636711] ESI: 0000bc87 EDI: 0000ea60 EBP: f0209ddc ESP: c173134c
[  782.636712]  DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
[  782.636714] CR0: 8005003b CR2: f0beca80 CR3: 30246000 CR4: 000006d0
[  782.636716] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
[  782.636717] DR6: ffff4ff0 DR7: 00000400
[  782.636718]  [<c13fbf76>] rt_set_nexthop.clone.45+0x56/0x220
[  782.636722]  [<c13fc449>] __ip_route_output_key+0x309/0x860
[  782.636724]  [<c141dc54>] tcp_v4_connect+0x124/0x450
[  782.636728]  [<c142ce43>] inet_stream_connect+0xa3/0x270
[  782.636731]  [<c13a8da1>] sys_connect+0xa1/0xb0
[  782.636733]  [<c13a99dd>] sys_socketcall+0x25d/0x2a0
[  782.636736]  [<c149deb8>] sysenter_do_call+0x12/0x28
[  782.636738]  [<ffffffff>] 0xffffffff

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-07-11 20:25:04 -07:00
David S. Miller
f610b74b14 ipv4: Use universal hash for ARP.
We need to make sure the multiplier is odd.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-07-11 01:37:28 -07:00
Eric Dumazet
f03d78db65 net: refine {udp|tcp|sctp}_mem limits
Current tcp/udp/sctp global memory limits are not taking into account
hugepages allocations, and allow 50% of ram to be used by buffers of a
single protocol [ not counting space used by sockets / inodes ...]

Lets use nr_free_buffer_pages() and allow a default of 1/8 of kernel ram
per protocol, and a minimum of 128 pages.
Heavy duty machines sysadmins probably need to tweak limits anyway.


References: https://bugzilla.stlinux.com/show_bug.cgi?id=38032
Reported-by: starlight <starlight@binnacle.cx>
Suggested-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-07-07 00:27:05 -07:00
David S. Miller
e12fe68ce3 Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 2011-07-05 23:23:37 -07:00
David S. Miller
595fc71baa ipv4: Add ip_defrag() agent IP_DEFRAG_AF_PACKET.
Elide the ICMP on frag queue timeouts unconditionally for
this user.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-07-05 22:34:52 -07:00
Marcus Meissner
c349a528cd net: bind() fix error return on wrong address family
Hi,

Reinhard Max also pointed out that the error should EAFNOSUPPORT according
to POSIX.

The Linux manpages have it as EINVAL, some other OSes (Minix, HPUX, perhaps BSD) use
EAFNOSUPPORT. Windows uses WSAEFAULT according to MSDN.

Other protocols error values in their af bind() methods in current mainline git as far
as a brief look shows:
	EAFNOSUPPORT: atm, appletalk, l2tp, llc, phonet, rxrpc
	EINVAL: ax25, bluetooth, decnet, econet, ieee802154, iucv, netlink, netrom, packet, rds, rose, unix, x25,
	No check?: can/raw, ipv6/raw, irda, l2tp/l2tp_ip

Ciao, Marcus

Signed-off-by: Marcus Meissner <meissner@suse.de>
Cc: Reinhard Max <max@suse.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-07-04 21:37:41 -07:00
Steffen Klassert
b00897b881 xfrm4: Don't call icmp_send on local error
Calling icmp_send() on a local message size error leads to
an incorrect update of the path mtu. So use ip_local_error()
instead to notify the socket about the error.

Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-07-01 17:33:19 -07:00
Steffen Klassert
c146066ab8 ipv4: Don't use ufo handling on later transformed packets
We might call ip_ufo_append_data() for packets that will be IPsec
transformed later. This function should be used just for real
udp packets. So we check for rt->dst.header_len which is only
nonzero on IPsec handling and call ip_ufo_append_data() just
if rt->dst.header_len is zero.

Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-07-01 17:33:19 -07:00
Joe Perches
4500ebf8d1 ipv4: Reduce switch/case indent
Make the case labels the same indent as the switch.

git diff -w shows no difference.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-07-01 16:11:16 -07:00
Joe Perches
181b1e9ce1 netfilter: Reduce switch/case indent
Make the case labels the same indent as the switch.

git diff -w shows miscellaneous 80 column wrapping,
comment reflowing and a comment for a useless gcc
warning for an otherwise unused default: case.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-07-01 16:11:15 -07:00
Joe Perches
1d67a51682 ipconfig: Reduce switch/case indent
Make the case labels the same indent as the switch.

git diff -w shows miscellaneous 80 column wrapping.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-07-01 16:11:15 -07:00
Julian Anastasov
ed6e4ef836 netfilter: Fix ip_route_me_harder triggering ip_rt_bug
Avoid creating input routes with ip_route_me_harder.
It does not work for locally generated packets. Instead,
restrict sockets to provide valid saddr for output route (or
unicast saddr for transparent proxy). For other traffic
allow saddr to be unicast or local but if callers forget
to check saddr type use 0 for the output route.

	The resulting handling should be:

- REJECT TCP:
	- in INPUT we can provide addr_type = RTN_LOCAL but
	better allow rejecting traffic delivered with
	local route (no IP address => use RTN_UNSPEC to
	allow also RTN_UNICAST).
	- FORWARD: RTN_UNSPEC => allow RTN_LOCAL/RTN_UNICAST
	saddr, add fix to ignore RTN_BROADCAST and RTN_MULTICAST
	- OUTPUT: RTN_UNSPEC

- NAT, mangle, ip_queue, nf_ip_reroute: RTN_UNSPEC in LOCAL_OUT

- IPVS:
	- use RTN_LOCAL in LOCAL_OUT and FORWARD after SNAT
	to restrict saddr to be local

Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-06-29 05:47:32 -07:00
Steffen Klassert
353e5c9abd ipv4: Fix IPsec slowpath fragmentation problem
ip_append_data() builds packets based on the mtu from dst_mtu(rt->dst.path).
On IPsec the effective mtu is lower because we need to add the protocol
headers and trailers later when we do the IPsec transformations. So after
the IPsec transformations the packet might be too big, which leads to a
slowpath fragmentation then. This patch fixes this by building the packets
based on the lower IPsec mtu from dst_mtu(&rt->dst) and adapts the exthdr
handling to this.

Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-06-27 20:34:26 -07:00
Steffen Klassert
33f99dc7fd ipv4: Fix packet size calculation in __ip_append_data
Git commit 59104f06 (ip: take care of last fragment in ip_append_data)
added a check to see if we exceed the mtu when we add trailer_len.
However, the mtu is already subtracted by the trailer length when the
xfrm transfomation bundles are set up. So IPsec packets with mtu
size get fragmented, or if the DF bit is set the packets will not
be send even though they match the mtu perfectly fine. This patch
actually reverts commit 59104f06.

Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-06-27 20:34:25 -07:00
Xufeng Zhang
9cfaa8def1 udp/recvmsg: Clear MSG_TRUNC flag when starting over for a new packet
Consider this scenario: When the size of the first received udp packet
is bigger than the receive buffer, MSG_TRUNC bit is set in msg->msg_flags.
However, if checksum error happens and this is a blocking socket, it will
goto try_again loop to receive the next packet.  But if the size of the
next udp packet is smaller than receive buffer, MSG_TRUNC flag should not
be set, but because MSG_TRUNC bit is not cleared in msg->msg_flags before
receive the next packet, MSG_TRUNC is still set, which is wrong.

Fix this problem by clearing MSG_TRUNC flag when starting over for a
new packet.

Signed-off-by: Xufeng Zhang <xufeng.zhang@windriver.com>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-06-21 22:34:27 -07:00
Paul Gortmaker
56f8a75c17 ip: introduce ip_is_fragment helper inline function
There are enough instances of this:

    iph->frag_off & htons(IP_MF | IP_OFFSET)

that a helper function is probably warranted.

Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-06-21 20:33:34 -07:00
Satoru Moriya
296f7ea75b udp: add tracepoints for queueing skb to rcvbuf
This patch adds a tracepoint to __udp_queue_rcv_skb to get the
return value of ip_queue_rcv_skb. It indicates why kernel drops
a packet at this point.

ip_queue_rcv_skb returns following values in the packet drop case:

rcvbuf is full                 : -ENOMEM
sk_filter returns error        : -EINVAL, -EACCESS, -ENOMEM, etc.
__sk_mem_schedule returns error: -ENOBUF

Signed-off-by: Satoru Moriya <satoru.moriya@hds.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-06-21 16:06:10 -07:00
Jesper Juhl
dec17b7451 Remove redundant linux/version.h includes from net/
It was suggested by "make versioncheck" that the follwing includes of
linux/version.h are redundant:

  /home/jj/src/linux-2.6/net/caif/caif_dev.c: 14 linux/version.h not needed.
  /home/jj/src/linux-2.6/net/caif/chnl_net.c: 10 linux/version.h not needed.
  /home/jj/src/linux-2.6/net/ipv4/gre.c: 19 linux/version.h not needed.
  /home/jj/src/linux-2.6/net/netfilter/ipset/ip_set_core.c: 20 linux/version.h not needed.
  /home/jj/src/linux-2.6/net/netfilter/xt_set.c: 16 linux/version.h not needed.

and it seems that it is right.

Beyond manually inspecting the source files I also did a few build
tests with various configs to confirm that including the header in
those files is indeed not needed.

Here's a patch to remove the pointless includes.

Signed-off-by: Jesper Juhl <jj@chaosbits.net>
Acked-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-06-21 16:03:17 -07:00
David S. Miller
9f6ec8d697 Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
Conflicts:
	drivers/net/wireless/iwlwifi/iwl-agn-rxon.c
	drivers/net/wireless/rtlwifi/pci.c
	net/netfilter/ipvs/ip_vs_core.c
2011-06-20 22:29:08 -07:00
Jesper Juhl
8ad2475e35 ipv4, ping: Remove duplicate icmp.h include
Remove the duplicate inclusion of net/icmp.h from net/ipv4/ping.c

Signed-off-by: Jesper Juhl <jj@chaosbits.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-06-20 13:04:38 -07:00
Eric Dumazet
9aa3c94ce5 ipv4: fix multicast losses
Knut Tidemann found that first packet of a multicast flow was not
correctly received, and bisected the regression to commit b23dd4fe42
(Make output route lookup return rtable directly.)

Special thanks to Knut, who provided a very nice bug report, including
sample programs to demonstrate the bug.

Reported-and-bisectedby: Knut Tidemann <knut.andre.tidemann@jotron.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-06-18 11:59:18 -07:00
Eric Dumazet
eeb1497277 inet_diag: fix inet_diag_bc_audit()
A malicious user or buggy application can inject code and trigger an
infinite loop in inet_diag_bc_audit()

Also make sure each instruction is aligned on 4 bytes boundary, to avoid
unaligned accesses.

Reported-by: Dan Rosenberg <drosenberg@vsecurity.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-06-17 16:25:39 -04:00
Eric Dumazet
1eddceadb0 net: rfs: enable RFS before first data packet is received
Le jeudi 16 juin 2011 à 23:38 -0400, David Miller a écrit :
> From: Ben Hutchings <bhutchings@solarflare.com>
> Date: Fri, 17 Jun 2011 00:50:46 +0100
>
> > On Wed, 2011-06-15 at 04:15 +0200, Eric Dumazet wrote:
> >> @@ -1594,6 +1594,7 @@ int tcp_v4_do_rcv(struct sock *sk, struct sk_buff *skb)
> >>  			goto discard;
> >>
> >>  		if (nsk != sk) {
> >> +			sock_rps_save_rxhash(nsk, skb->rxhash);
> >>  			if (tcp_child_process(sk, nsk, skb)) {
> >>  				rsk = nsk;
> >>  				goto reset;
> >>
> >
> > I haven't tried this, but it looks reasonable to me.
> >
> > What about IPv6?  The logic in tcp_v6_do_rcv() looks very similar.
>
> Indeed ipv6 side needs the same fix.
>
> Eric please add that part and resubmit.  And in fact I might stick
> this into net-2.6 instead of net-next-2.6
>

OK, here is the net-2.6 based one then, thanks !

[PATCH v2] net: rfs: enable RFS before first data packet is received

First packet received on a passive tcp flow is not correctly RFS
steered.

One sock_rps_record_flow() call is missing in inet_accept()

But before that, we also must record rxhash when child socket is setup.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Tom Herbert <therbert@google.com>
CC: Ben Hutchings <bhutchings@solarflare.com>
CC: Jamal Hadi Salim <hadi@cyberus.ca>
Signed-off-by: David S. Miller <davem@conan.davemloft.net>
2011-06-17 15:27:31 -04:00
David S. Miller
3009adf5ac Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/kaber/nf-2.6 2011-06-16 21:38:01 -04:00
Julian Anastasov
42c1edd345 netfilter: nf_nat: avoid double seq_adjust for loopback
Avoid double seq adjustment for loopback traffic
because it causes silent repetition of TCP data. One
example is passive FTP with DNAT rule and difference in the
length of IP addresses.

	This patch adds check if packet is sent and
received via loopback device. As the same conntrack is
used both for outgoing and incoming direction, we restrict
seq adjustment to happen only in POSTROUTING.

Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Patrick McHardy <kaber@trash.net>
2011-06-16 17:29:22 +02:00
Nicolas Cavallari
2c38de4c1f netfilter: fix looped (broad|multi)cast's MAC handling
By default, when broadcast or multicast packet are sent from a local
application, they are sent to the interface then looped by the kernel
to other local applications, going throught netfilter hooks in the
process.

These looped packet have their MAC header removed from the skb by the
kernel looping code. This confuse various netfilter's netlink queue,
netlink log and the legacy ip_queue, because they try to extract a
hardware address from these packets, but extracts a part of the IP
header instead.

This patch prevent NFQUEUE, NFLOG and ip_QUEUE to include a MAC header
if there is none in the packet.

Signed-off-by: Nicolas Cavallari <cavallar@lri.fr>
Signed-off-by: Patrick McHardy <kaber@trash.net>
2011-06-16 17:27:04 +02:00
Patrick McHardy
db898aa2ef netfilter: ipt_ecn: fix inversion for IP header ECN match
Userspace allows to specify inversion for IP header ECN matches, the
kernel silently accepts it, but doesn't invert the match result.

Signed-off-by: Patrick McHardy <kaber@trash.net>
2011-06-16 17:24:55 +02:00
Patrick McHardy
58d5a0257d netfilter: ipt_ecn: fix protocol check in ecn_mt_check()
Check for protocol inversion in ecn_mt_check() and remove the
unnecessary runtime check for IPPROTO_TCP in ecn_mt().

Signed-off-by: Patrick McHardy <kaber@trash.net>
2011-06-16 17:24:17 +02:00