This makes nf_conntrack_icmpv6 check that ICMPv6 type isn't < 128
to avoid accessing out of array valid_new[] and invmap[].
Signed-off-by: Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
ip_nat_initialized() takes enum ip_nat_manip_type as it's second argument,
not a hook number.
Noticed and initial patch by Marcus Sundberg <marcus@ingate.com>.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
The elements on rpci->in_upcall are tracked by the filp->private_data,
which will ensure that they get released when the file is closed.
The exception is if rpc_close_pipes() gets called first, since that
sets rpci->ops to NULL.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
[ Modified to match inet_create() bug fix by Herbert Xu -DaveM ]
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is a coding error in inet_create that causes it to always return
ESOCKTNOSUPPORT. It should return EPROTONOSUPPORT when there are
protocols registered for a given socket type but none of them match
the requested protocol.
This is based on a patch by Jayachandran C.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
From: David Stevens <dlstevens@us.ibm.com>
As explained at:
http://www.cs.ucsb.edu/~krishna/igmp_dos/
With IGMP version 1 and 2 it is possible to inject a unicast
report to a client which will make it ignore multicast
reports sent later by the router.
The fix is to only accept the report if is was sent to a
multicast or unicast address.
Signed-off-by: David S. Miller <davem@davemloft.net>
an ipv4 socket.
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: Sridhar Samudrala <sri@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch fixes an issue where it is possible to get valid data after
a ENOTCONN error. It returns socket errors only after data queued on
socket receive queue is consumed.
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: Sridhar Samudrala <sri@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The receive path for fib_lookup netlink messages is lacking sanity
checks for header and payload and is thus vulnerable to malformed
netlink messages causing illegal memory references.
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Around jiffies wrap time (i.e. within first 5 mins after boot), recent
match rules which contain both --seconds and --hitcount arguments
experience false matches.
This is because the last_pkts array is filled with zeros on creation, and
when comparing 'now' to 0 (+ --seconds argument), time_before_eq thinks it
has found a hit.
Below patch adds a break if the packet value is zero. This has the
unfortunate side effect of causing mismatches if a packet was received
when jiffies really was equal to zero. The odds of that happening are
slim compared to the problems caused by not adding the break however.
Plus, the author used this same method just below, so it is "good enough".
This fixes netfilter bugs #383 and #395.
Signed-off-by: Phil Oester <kernel@linuxace.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Mounting NFS file systems after a (warm) reboot could take a long time if
firewalling and connection tracking was enabled.
The reason is that the NFS clients tends to use the same ports (800 and
counting down). Now on reboot, the server would still have a TCB for an
existing TCP connection client:800 -> server:2049. The client sends a
SYN from port 800 to server:2049, which elicits an ACK from the server.
The firewall on the client drops the ACK because (from its point of
view) the connection is still in half-open state, and it expects to see
a SYNACK.
The client will eventually time out after several minutes.
The following patch corrects this, by accepting ACKs on half open
connections as well.
Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch makes two needlessly global functions static.
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch contains the following cleanups:
- make needlessly global code static
- ip_conntrack_core.c: ip_conntrack_flush() -> ip_conntrack_flush(void)
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch makes two needlessly global functions static.
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
the patch below marks various variables const in net/; the goal is to
move them to the .rodata section so that they can't false-share
cachelines with things that get written to, as well as potentially
helping gcc a bit with optimisations. (these were found using a gcc
patch to warn about such variables)
Signed-off-by: Arjan van de Ven <arjan@infradead.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
atm_dev_deregister() removes device from atm_dev list immediately to
prevent operations on a phantom device. Decision to free device based
only on ->refcnt now. Remove shutdown_atm_dev() use atm_dev_deregister()
instead. atm_dev_deregister() also asynchronously releases all vccs
related to device.
Signed-off-by: Stanislaw Gruszka <stf_xl@wp.pl>
Signed-off-by: Chas Williams <chas@cmf.nrl.navy.mil>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use semaphore to protect atm_devs list, as no one need access to it from
interrupt context. Avoid race conditions between atm_dev_register(),
atm_dev_lookup() and atm_dev_deregister(). Fix double spin_unlock() bug.
Signed-off-by: Stanislaw Gruszka <stf_xl@wp.pl>
Signed-off-by: Chas Williams <chas@cmf.nrl.navy.mil>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Mitchell Blank Jr <mitch@sfgoth.com>
Signed-off-by: Chas Williams <chas@cmf.nrl.navy.mil>
Signed-off-by: David S. Miller <davem@davemloft.net>
The tcp_ehash hash table gets too big on systems with really big memory.
It is worse on systems with pages larger than 4KB. It wastes memory that
could be better used. It also makes the netstat command slow because reading
/proc/net/tcp and /proc/net/tcp6 needs to go through the full hash table.
The default value should not be larger for larger page sizes. It seems
that the effect of page size is an unintended error dating back a long
time. I also wonder if the default value really should be a larger
fraction of memory for systems with more memory. While systems with
really big ram can afford more space for hash tables, it is not clear to
me that they benefit from increasing the allocation ratio for this table.
The amount of memory allocated is determined by net/ipv4/tcp.c:tcp_init and
mm/page_alloc.c:alloc_large_system_hash.
tcp_init calls alloc_large_system_hash passing parameters-
bucketsize=sizeof(struct tcp_ehash_bucket)
numentries=thash_entries
scale=(num_physpages >= 128 * 1024) ? (25-PAGE_SHIFT) : (27-PAGE_SHIFT)
limit=0
On i386, PAGE_SHIFT is 12 for a page size of 4K
On ia64, PAGE_SHIFT defaults to 14 for a page size of 16K
The num_physpages test above makes the allocation take a larger fraction
of the total memory on systems with larger memory. The threshold size
for a i386 system is 512MB. For an ia64 system with 16KB pages the
threshold is 2GB.
For smaller memory systems-
On i386, scale = (27 - 12) = 15
On ia64, scale = (27 - 14) = 13
For larger memory systems-
On i386, scale = (25 - 12) = 13
On ia64, scale = (25 - 14) = 11
For the rest of this discussion, I'll just track the larger memory case.
The default behavior has numentries=thash_entries=0, so the allocated
size is determined by either scale or by the default limit of 1/16 of
total memory.
In alloc_large_system_hash-
| numentries = (flags & HASH_HIGHMEM) ? nr_all_pages : nr_kernel_pages;
| numentries += (1UL << (20 - PAGE_SHIFT)) - 1;
| numentries >>= 20 - PAGE_SHIFT;
| numentries <<= 20 - PAGE_SHIFT;
At this point, numentries is pages for all of memory, rounded up to the
nearest megabyte boundary.
| /* limit to 1 bucket per 2^scale bytes of low memory */
| if (scale > PAGE_SHIFT)
| numentries >>= (scale - PAGE_SHIFT);
| else
| numentries <<= (PAGE_SHIFT - scale);
On i386, numentries >>= (13 - 12), so numentries is 1/8196 of
bytes of total memory.
On ia64, numentries <<= (14 - 11), so numentries is 1/2048 of
bytes of total memory.
| log2qty = long_log2(numentries);
|
| do {
| size = bucketsize << log2qty;
bucketsize is 16, so size is 16 times numentries, rounded
down to a power of two.
On i386, size is 1/512 of bytes of total memory.
On ia64, size is 1/128 of bytes of total memory.
For smaller systems the results are
On i386, size is 1/2048 of bytes of total memory.
On ia64, size is 1/512 of bytes of total memory.
The large page effect can be removed by just replacing
the use of PAGE_SHIFT with a constant of 12 in the calls to
alloc_large_system_hash. That makes them more like the other uses of
that function from fs/inode.c and fs/dcache.c
Signed-off-by: David S. Miller <davem@davemloft.net>
Ensure to update hiscore.rule in dummy rule 4 in ipv6_dev_get_saddr().
Pointed out by Yan Zheng <yanzheng@21cn.com>.
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
In __rpc_purge_upcall (net/sunrpc/rpc_pipe.c), the newer code to clean up
the in_upcall list has a typo.
Thanks to Vince Busam <vbusam@google.com> for spotting this!
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
We must recompute bridge features everytime the list of underlying
devices changes, or we might end up with features that are not
supported by all devices (eg. NETIF_F_TSO)
This patch adds the missing recompute when adding a device to the bridge.
Signed-off-by: Olaf Rempel <razzor@kopf-tisch.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
net/ipv4/netfilter/ip_conntrack_netlink.c: In function 'ctnetlink_dump_table':
net/ipv4/netfilter/ip_conntrack_netlink.c:409: warning: implicit declaration of function 'local_bh_disable'
net/ipv4/netfilter/ip_conntrack_netlink.c:427: warning: implicit declaration of function 'local_bh_enable'
Signed-off-by: Benoit Boissinot <benoit.boissinot@ens-lyon.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Remove proto == NULL checking since ip_conntrack_[nat_]proto_find_get
always returns a valid pointer.
Fix missing ip_conntrack_proto_put in some paths.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch fixes the problem with promoting aliases when:
a) a single primary and > 1 secondary addresses
b) multiple primary addresses each with at least one secondary address
Based on earlier efforts from Brian Pomerantz <bapper@piratehaven.org>,
Patrick McHardy <kaber@trash.net> and Thomas Graf <tgraf@suug.ch>
Signed-off-by: Jamal Hadi Salim <hadi@cyberus.ca>
Signed-off-by: David S. Miller <davem@davemloft.net>
- IP_NF_CONNTRACK_MARK is bool and depends on only IP_NF_CONNTRACK
which is tristate. If a variable depends on IP_NF_CONNTRACK_MARK and
doesn't care about IP_NF_CONNTRACK, it can be y. This must be avoided.
- IP_NF_CT_ACCT has same problem.
- IP_NF_TARGET_CLUSTERIP also depends on IP_NF_MANGLE.
Signed-off-by: Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp>
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Don't show local table to behave similar to fib_hash.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
addrconf_verify(...) only traverse address hash table when
addrconf_hash_lock is held for writing, and it may hold
addrconf_hash_lock for a long time. So I think it's better to acquire
addrconf_hash_lock for reading instead of writing
Signed-off-by: Yan Zheng <yanzheng@21cn.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This way we don't have to check it in sk_run_filter().
Signed-off-by: Kris Katterjohn <kjak@users.sourceforge.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
If two packets were queued to be sent at the same time in the future,
their order would be reversed. This would occur because the queue is
traversed back to front, and a position is found by checking whether
the new packet needs to be sent before the packet being examined. If
the new packet is to be sent at the same time of a previous packet, it
would end up before the old packet in the queue. This patch places
packets in the correct order when they are queued to be sent at a same
time in the future.
Signed-off-by: Andrea Bittau <a.bittau@cs.ucl.ac.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
On Thu, 17 Nov 2005, David Gmez wrote:
> I found out that if i select NET_CLS_ROUTE4, save my changes and exit
> menuconfig, execute again make menuconfig and go to QoS options, then the new
> available options are visible. So menuconfig has some problem refreshing
> contents :?
No, they were there before too, but you have to go up one level to see
them.
It's better in 2.6.15-rc1-git5, but the menu structure is still a little
messed up, the patch below properly indents all menu entries.
Signed-off-by: Roman Zippel <zippel@linux-m68k.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Noticed by Olaf Hering.
The comparisons want a u8 here (the data type on the left-hand branch
is a u8 structure member, and the constant on the right-hand branch is
"~((u8) 128)"), but C turns it into an integer so we get:
net/llc/llc_c_ac.c: In function `llc_conn_ac_inc_npta_value':
net/llc/llc_c_ac.c:998: warning: comparison is always true due to limited range of data type
net/llc/llc_c_ac.c:999: warning: large integer implicitly truncated to unsigned type
Fix this up by explicitly recasting the right-hand branch constant
into a "u8" once more.
Signed-off-by: David S. Miller <davem@davemloft.net>
Since we've converted the ftp/irc/tftp helpers to use the new
module_parm_array() some time ago, we ware accidentially using signed data
types - thus preventing those modules from being used on ports >= 32768.
This patch fixes it by using 'ushort' module parameters.
Thanks to Jan Nijs for reporting this bug.
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is a compile error that crept in with the last patch of
TCP patches.
Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
CC [M] net/netfilter/nf_conntrack_core.o
net/netfilter/nf_conntrack_core.c: In function 'nf_ct_unlink_expect':
net/netfilter/nf_conntrack_core.c:390: error: 'exp_timeout' undeclared (first use in this function)
net/netfilter/nf_conntrack_core.c:390: error: (Each undeclared identifier is reported only once
net/netfilter/nf_conntrack_core.c:390: error: for each function it appears in.)
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Both of ipq and frag_queue have *next and **prev, and they can be replaced
with hlist. Thanks Arnaldo Carvalho de Melo for the suggestion.
Signed-off-by: Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp>
Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Although the comment around the allocation code tells us that
the layer-3 specific protocol tables will be freed when cleaning up,
they aren't. And this makes nfsim complain loudly...
Signed-off-by: KOVACS Krisztian <hidden@balabit.hu>
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix nf_conntrack statistics proc file removal. Looks like the old bug
was forward-ported from ip_conntrack. :-]
Signed-off-by: KOVACS Krisztian <hidden@balabit.hu>
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Being kernel-threads, nfsd servers don't get pre-empted (depending on
CONFIG). If there is a steady stream of NFS requests that can be served
from cache, an nfsd thread may hold on to a cpu indefinitely, which isn't
very friendly.
So it is good to have a cond_resched in there (just before looking for a
new request to serve), to make sure we play nice.
Signed-off-by: Neil Brown <neilb@suse.de>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Signed-off-by: Jochen Friedrich <jochen@scram.de>
Acked-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Jochen Friedrich <jochen@scram.de>
Acked-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The patch below fixes the following sparse warning:
net/ipv6/ipv6_sockglue.c:291:13: warning: Using plain integer as NULL pointer
Signed-off-by: Luiz Capitulino <lcapitulino@mandriva.com.br>
Signed-off-by: David S. Miller <davem@davemloft.net>
The "score.rule++" doesn't make any sense for me.
According to codes above, I think it should be "hiscore.rule++;" .
Signed-off-by: Yan Zheng<yanzheng@21cn.com>
Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch makes nf_conntrack_ipv6 free all IPv6 fragment queues at module
unloading time. Also introduce a BUG_ON if we ever again have leaks in
the memory accounting.
Signed-off-by: Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp>
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This synchronizes nf_ct_reasm with ipv6 reassembly, and fixes a possibility
of an infinite loop if CPUs evict and create nf_ct_frag6_queue in parallel.
Signed-off-by: Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp>
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
These variables should be unsigned. This fixes sysctl handler for
nf_ct_frag6_{low,high}_thresh.
Signed-off-by: Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp>
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This removes linux 2.4 configs in comments as TODO lists.
And this also move the entry of nf_conntrack to top like IPv4 Netfilter
Kconfig.
Based on original patch by Krzysztof Piotr Oledzki <ole@ans.pl>.
Signed-off-by: Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp>
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Staticaly linked nf_conntrack_ipv4 requires nf_conntrack. but currently
nf_conntrack is linked after it. This changes the order of ipv4 and netfilter
to fix this.
Signed-off-by: Krzysztof Oledzki <olenf@ans.pl>
Signed-off-by: Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp>
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch unconditionally requires CAP_NET_ADMIN for all nfnetlink
messages. It also removes the per-message cap_required field, since all
existing subsystems use CAP_NET_ADMIN for all their messages anyway.
Patrick McHardy owes me a beer if we ever need to re-introduce this.
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Looks like the nf_conntrack TCP code was slightly mismerged: it does
not contain an else branch present in the IPv4 version. Let's add that
code and make the testsuite happy.
Signed-off-by: KOVACS Krisztian <hidden@balabit.hu>
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add missing size checks. Thanks Patrick McHardy for the hint.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Make gcc-4.x happy. Use size_t instead of int. Thanks to Patrick McHardy
for the hint.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Some devices (e.g. Qlogic iSCSI HBA hardware like QLA4010 up to firmware
3.0.0.4) initiates TCP with SYN and PUSH flags set.
The Linux TCP/IP stack deals fine with that, but the connection tracking
code doesn't.
This patch alters TCP connection tracking to accept SYN+PUSH as a valid
flag combination.
Signed-off-by: Vlad Drukker <vlad@storewiz.com>
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The recent change to netlink dump "done" callback handling broke IPv6
which played dirty tricks with the "done" callback. This causes an
infinite loop during a dump.
The following patch fixes it.
This bug was reported by Jeff Garzik.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Also introduces a sysctl option to configure the receive buffer
accounting policy to be either at socket or association level.
Default is all the associations on the same socket share the
receive buffer.
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: Sridhar Samudrala <sri@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The socket level timeout values are maintained in sctp_sock and
association level timeouts are in sctp_association. So there is
no need for ep->timeouts.
Signed-off-by: Vladislav Yasevich <vladislav.yasevich@hp.com>
Signed-off-by: Sridhar Samudrala <sri@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It is possible to get to sctp_v4_get_saddr() without a valid
association. This happens when processing OOTB packets and
the cached route entry is no longer valid.
However, when responding to OOTB packets we already properly
set the source address based on the information in the OOTB
packet. So, if we we get to sctp_v4_get_saddr() without an
association we can simply return.
Signed-off-by: Vladislav Yasevich <vladislav.yasevich@hp.com>
Signed-off-by: Sridhar Samudrala <sri@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Based mostly upon a patch from Olaf Kirch <okir@suse.de>
When initialization fails in inet6_init(), we should
unregister the PF_INET6 socket ops.
Also, check sock_register()'s return value for errors.
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently recvmsg generates SIGPIPE whereas sendmsg does not; for the
other stacks it seems to be the other way round!
It also fixes the bug where reading from a socket whose peer has shutdown
returned -EINVAL rather than 0.
Signed-off-by: Patrick Caulfield <patrick@tykepenguin.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use "hints" to speed up the SACK processing. Various forms
of this have been used by TCP developers (Web100, STCP, BIC)
to avoid the 2x linear search of outstanding segments.
Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is a patch for discussion addressing some receive buffer growing issues.
This is partially related to the thread "Possible BUG in IPv4 TCP window
handling..." last week.
Specifically it addresses the problem of an interaction between rcvbuf
moderation (receiver autotuning) and rcv_ssthresh. The problem occurs when
sending small packets to a receiver with a larger MTU. (A very common case I
have is a host with a 1500 byte MTU sending to a host with a 9k MTU.) In
such a case, the rcv_ssthresh code is targeting a window size corresponding
to filling up the current rcvbuf, not taking into account that the new rcvbuf
moderation may increase the rcvbuf size.
One hunk makes rcv_ssthresh use tcp_rmem[2] as the size target rather than
rcvbuf. The other changes the behavior when it overflows its memory bounds
with in-order data so that it tries to grow rcvbuf (the same as with
out-of-order data).
These changes should help my problem of mixed MTUs, and should also help the
case from last week's thread I think. (In both cases though you still need
tcp_rmem[2] to be set much larger than the TCP window.) One question is if
this is too aggressive at trying to increase rcvbuf if it's under memory
stress.
Orignally-from: John Heffner <jheffner@psc.edu>
Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is an updated version of the RFC3465 ABC patch originally
for Linux 2.6.11-rc4 by Yee-Ting Li. ABC is a way of counting
bytes ack'd rather than packets when updating congestion control.
The orignal ABC described in the RFC applied to a Reno style
algorithm. For advanced congestion control there is little
change after leaving slow start.
Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Move all the code that does linear TCP slowstart to one
inline function to ease later patch to add ABC support.
Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Simplify the code that comuputes microsecond rtt estimate used
by TCP Vegas. Move the callback out of the RTT sampler and into
the end of the ack cleanup.
Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
TCP peformance with TSO over networks with delay is awful.
On a 100Mbit link with 150ms delay, we get 4Mbits/sec with TSO and
50Mbits/sec without TSO.
The problem is with TSO, we intentionally do not keep the maximum
number of packets in flight to fill the window, we hold out to until
we can send a MSS chunk. But, we also don't update the congestion window
unless we have filled, as per RFC2861.
This patch replaces the check for the congestion window being full
with something smarter that accounts for TSO.
Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Here is the patch that introduces the generic skb_checksum_complete
which also checks for hardware RX checksum faults. If that happens,
it'll call netdev_rx_csum_fault which currently prints out a stack
trace with the device name. In future it can turn off RX checksum.
I've converted every spot under net/ that does RX checksum checks to
use skb_checksum_complete or __skb_checksum_complete with the
exceptions of:
* Those places where checksums are done bit by bit. These will call
netdev_rx_csum_fault directly.
* The following have not been completely checked/converted:
ipmr
ip_vs
netfilter
dccp
This patch is based on patches and suggestions from Stephen Hemminger
and David S. Miller.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
If the call to xprt_transmit() fails due to socket buffer space
exhaustion, we do not need to re-encode the RPC message when we
loop back through call_transmit.
Re-encoding can actually end up triggering the WARN_ON() in
call_decode() if we re-encode something like a read() request and
auth->au_rslack has changed.
It can also cause us to increment the RPCSEC_GSS sequence number
beyond the limits of the allowed window.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The generic netlink family builds on top of netlink and provides
simplifies access for the less demanding netlink users. It solves
the problem of protocol numbers running out by introducing a so
called controller taking care of id management and name resolving.
Generic netlink modules register themself after filling out their
id card (struct genl_family), after successful registration the
modules are able to register callbacks to command numbers by
filling out a struct genl_ops and calling genl_register_op(). The
registered callbacks are invoked with attributes parsed making
life of simple modules a lot easier.
Although generic netlink modules can request static identifiers,
it is recommended to use GENL_ID_GENERATE and to let the controller
assign a unique identifier to the module. Userspace applications
will then ask the controller and lookup the idenfier by the module
name.
Due to the current multicast implementation of netlink, the number
of generic netlink modules is restricted to 1024 to avoid wasting
memory for the per socket multiacst subscription bitmask.
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Introduces netlink_run_queue() to handle the receive queue of
a netlink socket in a generic way. Processes as much as there
was in the queue upon entry and invokes a callback function
for each netlink message found. The callback function may
refuse a message by returning a negative error code but setting
the error pointer to 0 in which case netlink_run_queue() will
return with a qlen != 0.
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Most netlink families make no use of the done() callback, making
it optional gets rid of all unnecessary dummy implementations.
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Introduces a new type-safe interface for netlink message and
attributes handling. The interface is fully binary compatible
with the old interface towards userspace. Besides type safety,
this interface features attribute validation capabilities,
simplified message contstruction, and documentation.
The resulting netlink code should be smaller, less error prone
and easier to understand.
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
The existing connection tracking subsystem in netfilter can only
handle ipv4. There were basically two choices present to add
connection tracking support for ipv6. We could either duplicate all
of the ipv4 connection tracking code into an ipv6 counterpart, or (the
choice taken by these patches) we could design a generic layer that
could handle both ipv4 and ipv6 and thus requiring only one sub-protocol
(TCP, UDP, etc.) connection tracking helper module to be written.
In fact nf_conntrack is capable of working with any layer 3
protocol.
The existing ipv4 specific conntrack code could also not deal
with the pecularities of doing connection tracking on ipv6,
which is also cured here. For example, these issues include:
1) ICMPv6 handling, which is used for neighbour discovery in
ipv6 thus some messages such as these should not participate
in connection tracking since effectively they are like ARP
messages
2) fragmentation must be handled differently in ipv6, because
the simplistic "defrag, connection track and NAT, refrag"
(which the existing ipv4 connection tracking does) approach simply
isn't feasible in ipv6
3) ipv6 extension header parsing must occur at the correct spots
before and after connection tracking decisions, and there were
no provisions for this in the existing connection tracking
design
4) ipv6 has no need for stateful NAT
The ipv4 specific conntrack layer is kept around, until all of
the ipv4 specific conntrack helpers are ported over to nf_conntrack
and it is feature complete. Once that occurs, the old conntrack
stuff will get placed into the feature-removal-schedule and we will
fully kill it off 6 months later.
Signed-off-by: Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp>
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Trying to build today's 2.6.14+git snapshot gives undefined references
to use_tempaddr
Looks like an ifdef got left out.
Signed-off-by: Peter Chubb <peterc@gelato.unsw.edu.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Krzysztof Piotr Oledzki <ole@ans.pl>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch fixes an userspace triggered oops. If there is no ICMP_ID
info the reference to attr will be NULL.
Signed-off-by: Krzysztof Piotr Oledzki <ole@ans.pl>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Propagate the error to userspace instead of returning -EPERM if the get
conntrack operation fails.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Return -EINVAL if the size isn't OK instead of -EPERM.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently connection tracking handles ICMP error like normal packets
if it failed to get related connection. But it fails that after all.
This makes connection tracking stop tracking ICMP error at early point.
Signed-off-by: Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp>
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Without this patch, any user can cause nfnetlink subsystems to be
autoloaded. Those subsystems however could add significant processing
overhead to packet processing, and would refuse any configuration messages
from non-CAP_NET_ADMIN processes anyway.
This patch follows a suggestion from Patrick McHardy.
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The reply tuple of the PNS->PAC expectation was using the wrong call id.
So we had the following situation:
- PNS behind NAT firewall
- PNS call id requires NATing
- PNS->PAC gre packet arrives first
then the PNS->PAC expectation is matched, and the other expectation
is deleted, but the PAC->PNS gre packets do not match the gre conntrack
because the call id is wrong.
We also cannot use ip_nat_follow_master().
Signed-off-by: Philip Craig <philipc@snapgear.com>
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
ctnetlink_get_conntrack is always called from user context, so GFP_KERNEL
is enough.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>