Al Viro spotted a bogus use of u64 on the input sequence number which
is big-endian. This patch fixes it by giving the input sequence number
its own member in the xfrm_skb_cb structure.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Frank Blaschka provided the bug report and the initial suggested fix
for this bug. He also validated this version of this fix.
The problem is that the access to neigh->arp_queue is inconsistent, we
grab references when dropping the lock lock to call
neigh->ops->solicit() but this does not prevent other threads of
control from trying to send out that packet at the same time causing
corruptions because both code paths believe they have exclusive access
to the skb.
The best option seems to be to hold the write lock on neigh->lock
during the ->solicit() call. I looked at all of the ndisc_ops
implementations and this seems workable. The only case that needs
special care is the IPV4 ARP implementation of arp_solicit(). It
wants to take neigh->lock as a reader to protect the header entry in
neigh->ha during the emission of the soliciation. We can simply
remove the read lock calls to take care of that since holding the lock
as a writer at the caller providers a superset of the protection
afforded by the existing read locking.
The rest of the ->solicit() implementations don't care whether the
neigh is locked or not.
Signed-off-by: David S. Miller <davem@davemloft.net>
Use key/offset caching to change /proc/net/route (use by iputils route)
from O(n^2) to O(n). This improves performance from 30sec with 160,000
routes to 1sec.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This fixes possible problems when trie_firstleaf() returns NULL
to trie_leafindex().
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Various RFCs have all sorts of things to say about the CS field of the
DSCP value. In particular they try to make the distinction between
values that should be used by "user applications" and things like
routing daemons.
This seems to have influenced the CAP_NET_ADMIN check which exists for
IP_TOS socket option settings, but in fact it has an off-by-one error
so it wasn't allowing CS5 which is meant for "user applications" as
well.
Further adding to the inconsistency and brokenness here, IPV6 does not
validate the DSCP values specified for the IPV6_TCLASS socket option.
The real actual uses of these TOS values are system specific in the
final analysis, and these RFC recommendations are just that, "a
recommendation". In fact the standards very purposefully use
"SHOULD" and "SHOULD NOT" when describing how these values can be
used.
In the final analysis the only clean way to provide consistency here
is to remove the CAP_NET_ADMIN check. The alternatives just don't
work out:
1) If we add the CAP_NET_ADMIN check to ipv6, this can break existing
setups.
2) If we just fix the off-by-one error in the class comparison in
IPV4, certain DSCP values can be used in IPV6 but not IPV4 by
default. So people will just ask for a sysctl asking to
override that.
I checked several other freely available kernel trees and they
do not make any privilege checks in this area like we do. For
the BSD stacks, this goes back all the way to Stevens Volume 2
and beyond.
Signed-off-by: David S. Miller <davem@davemloft.net>
ip_route_me_harder() may call ip_route_input() with skbs that don't
have skb->dev set for skbs rerouted in LOCAL_OUT and TCP resets
generated by the REJECT target, resulting in a crash when dereferencing
skb->dev->nd_net. Since ip_route_input() has an input device argument,
it seems correct to use that one anyway.
Bug introduced in b5921910a1 (Routing cache virtualization).
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
The ->move operation has two bugs:
- It is called with the same extension as source and destination,
so it doesn't update the new extension.
- The address of the old extension is calculated incorrectly,
instead of (void *)ct->ext + ct->ext->offset[i] it uses
ct->ext + ct->ext->offset[i].
Fixes a crash on x86_64 reported by Chuck Ebbert <cebbert@redhat.com>
and Thomas Woerner <twoerner@redhat.com>.
Tested-by: Thomas Woerner <twoerner@redhat.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
No available servers is more an error message than something informational. It
should also be rate-limited, else we're going to flood our logs on a busy
director, if all real servers are out of order with a weight of zero.
Signed-off-by: Sven Wegener <sven.wegener@stealer.net>
Acked-by: Simon Horman <horms@verge.net.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a new set of configuration functions to the NetLabel/LSM API so that
LSMs can perform their own configuration of the NetLabel subsystem without
relying on assistance from userspace.
Signed-off-by: Paul Moore <paul.moore@hp.com>
Signed-off-by: Casey Schaufler <casey@schaufler-ca.com>
Reviewed-by: James Morris <jmorris@namei.org>
Cc: Chris Wright <chrisw@sous-sol.org>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Cc: Casey Schaufler <casey@schaufler-ca.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Somewhere along the development of my ICMP relookup patch the header
length check went AWOL on the non-IPsec path. This patch restores the
check.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
The port offset calculations depend on the protocol family, but, as
Adrian noticed, I broke this logic with the commit
5ee31fc1ec
[INET]: Consolidate inet(6)_hash_connect.
Return this logic back, by passing the port offset directly into the
consolidated function.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Noticed-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The line in the /proc/net/fib_trie for route with TOS specified
- has extra \n at the end
- does not have a space after route scope
like below.
|-- 1.1.1.1
/32 universe UNICASTtos =1
Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
A bug every C programmer makes at some point in time...
Signed-off-by: Adrian Bunk <bunk@kernel.org>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
This way we can remove TCP and DCCP specific versions of
sk->sk_prot->get_port: both v4 and v6 use inet_csk_get_port
sk->sk_prot->hash: inet_hash is directly used, only v6 need
a specific version to deal with mapped sockets
sk->sk_prot->unhash: both v4 and v6 use inet_hash directly
struct inet_connection_sock_af_ops also gets a new member, bind_conflict, so
that inet_csk_get_port can find the per family routine.
Now only the lookup routines receive as a parameter a struct inet_hashtable.
With this we further reuse code, reducing the difference among INET transport
protocols.
Eventually work has to be done on UDP and SCTP to make them share this
infrastructure and get as a bonus inet_diag interfaces so that iproute can be
used with these protocols.
net-2.6/net/ipv4/inet_hashtables.c:
struct proto | +8
struct inet_connection_sock_af_ops | +8
2 structs changed
__inet_hash_nolisten | +18
__inet_hash | -210
inet_put_port | +8
inet_bind_bucket_create | +1
__inet_hash_connect | -8
5 functions changed, 27 bytes added, 218 bytes removed, diff: -191
net-2.6/net/core/sock.c:
proto_seq_show | +3
1 function changed, 3 bytes added, diff: +3
net-2.6/net/ipv4/inet_connection_sock.c:
inet_csk_get_port | +15
1 function changed, 15 bytes added, diff: +15
net-2.6/net/ipv4/tcp.c:
tcp_set_state | -7
1 function changed, 7 bytes removed, diff: -7
net-2.6/net/ipv4/tcp_ipv4.c:
tcp_v4_get_port | -31
tcp_v4_hash | -48
tcp_v4_destroy_sock | -7
tcp_v4_syn_recv_sock | -2
tcp_unhash | -179
5 functions changed, 267 bytes removed, diff: -267
net-2.6/net/ipv6/inet6_hashtables.c:
__inet6_hash | +8
1 function changed, 8 bytes added, diff: +8
net-2.6/net/ipv4/inet_hashtables.c:
inet_unhash | +190
inet_hash | +242
2 functions changed, 432 bytes added, diff: +432
vmlinux:
16 functions changed, 485 bytes added, 492 bytes removed, diff: -7
/home/acme/git/net-2.6/net/ipv6/tcp_ipv6.c:
tcp_v6_get_port | -31
tcp_v6_hash | -7
tcp_v6_syn_recv_sock | -9
3 functions changed, 47 bytes removed, diff: -47
/home/acme/git/net-2.6/net/dccp/proto.c:
dccp_destroy_sock | -7
dccp_unhash | -179
dccp_hash | -49
dccp_set_state | -7
dccp_done | +1
5 functions changed, 1 bytes added, 242 bytes removed, diff: -241
/home/acme/git/net-2.6/net/dccp/ipv4.c:
dccp_v4_get_port | -31
dccp_v4_request_recv_sock | -2
2 functions changed, 33 bytes removed, diff: -33
/home/acme/git/net-2.6/net/dccp/ipv6.c:
dccp_v6_get_port | -31
dccp_v6_hash | -7
dccp_v6_request_recv_sock | +5
3 functions changed, 5 bytes added, 38 bytes removed, diff: -33
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The namespace is not available in the fib_sync_down_addr, add it as a
parameter.
Looking up a device by the pointer to it is OK. Looking up using a
result from fib_trie/fib_hash table lookup is also safe. No need to
fix that at all. So, just fix lookup by address and insertion to the
hash table path.
Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is required to make fib_info lookups namespace aware. In the
other case initial namespace devices are marked as dead in the local
routing table during other namespace stop.
Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
fib_sync_down can be called with an address and with a device. In
reality it is called either with address OR with a device. The
codepath inside is completely different, so lets separate it into two
calls for these two cases.
Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The namespace is available when required except rtm_to_ifaddr. Add
namespace argument to it.
Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Remove error code assignment inside brackets on failure. The code
looks better if the error is assigned before condition check. Also,
the compiler treats this better.
Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
net->ipv4.fib_table_hash is not freed when fib4_rules_init failed.
Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch removes the no longer used
EXPORT_SYMBOL(sysctl_tcp_tso_win_divisor).
Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
struct ipv4_devconf can now become static.
Signed-off-by: Adrian Bunk <bunk@kernel.org>
Acked-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Current ip route cache implementation is not suited to large caches.
We can consume a lot of CPU when cache must be invalidated, since we
currently need to evict all cache entries, and this eviction is
sometimes asynchronous. min_delay & max_delay can somewhat control this
asynchronism behavior, but whole thing is a kludge, regularly triggering
infamous soft lockup messages. When entries are still in use, this also
consumes a lot of ram, filling dst_garbage.list.
A better scheme is to use a generation identifier on each entry,
so that cache invalidation can be performed by changing the table
identifier, without having to scan all entries.
No more delayed flushing, no more stalling when secret_interval expires.
Invalidated entries will then be freed at GC time (controled by
ip_rt_gc_timeout or stress), or when an invalidated entry is found
in a chain when an insert is done.
Thus we keep a normal equilibrium.
This patch :
- renames rt_hash_rnd to rt_genid (and makes it an atomic_t)
- Adds a new rt_genid field to 'struct rtable' (filling a hole on 64bit)
- Checks entry->rt_genid at appropriate places :
In strategy_allowed_congestion_control of the 2.6.24 kernel, when
sysctl_string return 1 on success,it should call
tcp_set_allowed_congestion_control to set the allowed congestion
control.But, it don't. the sysctl_string return 1 on success,
otherwise return negative, never return 0.The patch fix the problem.
Signed-off-by: Shan Wei <shanwei@cn.fujitsu.com>
Acked-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Normally during a dump the key of the last dumped entry is used for
continuation, but since lock is dropped it might be lost. In that case
fallback to the old counter based N^2 behaviour. This means the dump
will end up skipping some routes which matches what FIB_HASH does.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add the net parameter to udp_get_port family of calls and
udp_lookup one and use it to filter sockets.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a net argument to inet6_lookup and propagate it further.
Actually, this is tcp-v6 implementation of what was done for
tcp-v4 sockets in a previous patch.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a net argument to inet_lookup and propagate it further
into lookup calls. Plus tune the __inet_check_established.
The dccp and inet_diag, which use that lookup functions
pass the init_net into them.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This tags the inet_bind_bucket struct with net pointer,
initializes it during creation and makes a filtering
during lookup.
A better hashfn, that takes the net into account is to
be done in the future, but currently all bind buckets
with similar port will be in one hash chain.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
These two functions are the same except for what they call
to "check_established" and "hash" for a socket.
This saves half-a-kilo for ipv4 and ipv6.
add/remove: 1/0 grow/shrink: 1/4 up/down: 582/-1128 (-546)
function old new delta
__inet_hash_connect - 577 +577
arp_ignore 108 113 +5
static.hint 8 4 -4
rt_worker_func 376 372 -4
inet6_hash_connect 584 25 -559
inet_hash_connect 586 25 -561
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Reported by Ingo Molnar:
net/built-in.o: In function `ip_queue_init':
ip_queue.c:(.init.text+0x322c): undefined reference to `net_ipv4_ctl_path'
Fix the build error and also handle CONFIG_PROC_FS=n properly.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Jan Engelhardt <jengelh@computergmbh.de>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Constify a few data tables use const qualifiers on variables where
possible in the nf_conntrack_icmp* sources.
Signed-off-by: Jan Engelhardt <jengelh@computergmbh.de>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Jan Engelhardt <jengelh@computergmbh.de>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Jan Engelhardt <jengelh@computergmbh.de>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Constify a few data tables use const qualifiers on variables where
possible in the nf_*_proto_tcp sources.
Signed-off-by: Jan Engelhardt <jengelh@computergmbh.de>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Jan Engelhardt <jengelh@computergmbh.de>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Jan Engelhardt <jengelh@computergmbh.de>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Jan Engelhardt <jengelh@computergmbh.de>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Constify data tables (predominantly in nf_conntrack_h323_types.c, but
also a few in nf_conntrack_h323_asn1.c) and use const qualifiers on
variables where possible in the h323 sources.
Signed-off-by: Jan Engelhardt <jengelh@computergmbh.de>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
It's unused static inline.
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since we're using RCU, all users of nf_nat_lock take a write_lock.
Switch it to a spinlock.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rename all "conntrack" variables to "ct" for more consistency and
avoiding some overly long lines.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use RCU for expectation hash. This doesn't buy much for conntrack
runtime performance, but allows to reduce the use of nf_conntrack_lock
for /proc and nf_netlink_conntrack.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>