Save namespace context on the fib rule at the rule creation time and
call routing lookup in the correct namespace.
Signed-off-by: Denis V. Lunev <den@openvz.org>
Acked-by: Daniel Lezcano <dlezcano@fr.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The backward link from FIB rules operations to the network namespace
will allow to simplify the API a bit.
Signed-off-by: Denis V. Lunev <den@openvz.org>
Acked-by: Daniel Lezcano <dlezcano@fr.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
During network namespace stop process kernel side netlink sockets
belonging to a namespace should be closed. They should not prevent
namespace to stop, so they do not increment namespace usage
counter. Though this counter will be put during last sock_put.
The raplacement of the correct netns for init_ns solves the problem
only partial as socket to be stoped until proper stop is a valid
netlink kernel socket and can be looked up by the user processes. This
is not a problem until it resides in initial namespace (no processes
inside this net), but this is not true for init_net.
So, hold the referrence for a socket, remove it from lookup tables and
only after that change namespace and perform a last put.
Signed-off-by: Denis V. Lunev <den@openvz.org>
Tested-by: Alexey Dobriyan <adobriyan@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Create a specific helper for netlink kernel socket disposal. This just
let the code look better and provides a ground for proper disposal
inside a namespace.
Signed-off-by: Denis V. Lunev <den@openvz.org>
Tested-by: Alexey Dobriyan <adobriyan@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Network namespace allocates 2 kernel netlink sockets, fibnl &
rtnl. These sockets should be disposed properly, i.e. by
sock_release. Plain sock_put is not enough.
Signed-off-by: Denis V. Lunev <den@openvz.org>
Tested-by: Alexey Dobriyan <adobriyan@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The garbage collection function receive the dst_ops structure as
parameter. This is useful for the next incoming patchset because it
will need the dst_ops (there will be several instances) and the
network namespace pointer (contained in the dst_ops).
The protocols which do not take care of the namespaces will not be
impacted by this change (expect for the function signature), they do
just ignore the parameter.
Signed-off-by: Daniel Lezcano <dlezcano@fr.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, sizeof(struct fib_alias) is 24 or 48 bytes on 32/64 bits
arches.
Because of SLAB_HWCACHE_ALIGN requirement, these are rounded to 32 and
64 bytes respectively.
This patch moves rcu to the end of fib_alias, and conditionally
defines it only for CONFIG_IP_FIB_TRIE.
We also remove SLAB_HWCACHE_ALIGN requirement for fib_alias and
fib_node objects because it is not necessary.
(BTW SLUB currently denies it for objects smaller than
cache_line_size() / 2, but not SLAB)
Finally, sizeof(fib_alias) go back to 16 and 32 bytes.
Then, we can embed one fib_alias on each fib_node, to favor locality.
Most of the time access to the fib_alias will be free because one
cache line contains both the list head (fn_alias) and (one of) the
list element.
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
node_parent() and tnode_get_child() currently use rcu_dereference().
These functions are called from both
- readers only paths (where rcu_dereference() is needed), and
- writer path (where rcu_dereference() is not needed)
To make explicit where rcu_dereference() is really needed, I
introduced new node_parent_rcu() and tnode_get_child_rcu() functions
which use rcu_dereference(), while node_parent() and tnode_get_child()
dont use it.
Then I changed calling sites where rcu_dereference() was really needed
to call the _rcu() variants.
This should have no impact but for alpha architecture, and may help
future sparse checks.
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since there now is generic support for shared sysctl paths, the only
remains are the net/netfilter and net/ipv4/netfilter paths. Move them
to net/netfilter/core.c and net/ipv4/netfilter.c and kill nf_sysctl.c.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
The current TCP RST construction reuses the old packet and can't
deal with IP options as a consequence of that. Construct the
RST from scratch instead.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch removes inlines except those which are used
by packet matching code and thus are performance-critical.
Before:
$ size */*/*/ip*tables*.o
text data bss dec hex filename
6402 500 16 6918 1b06 net/ipv4/netfilter/ip_tables.o
7130 500 16 7646 1dde net/ipv6/netfilter/ip6_tables.o
After:
$ size */*/*/ip*tables*.o
text data bss dec hex filename
6307 500 16 6823 1aa7 net/ipv4/netfilter/ip_tables.o
7010 500 16 7526 1d66 net/ipv6/netfilter/ip6_tables.o
Signed-off-by: Denys Vlasenko <vda.linux@googlemail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch moves ipt_iprange to xt_iprange, in preparation for adding
IPv6 support to xt_iprange.
Signed-off-by: Jan Engelhardt <jengelh@computergmbh.de>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Updates the MODULE_DESCRIPTION() tags for all Netfilter modules,
actually describing what the module does and not just
"netfilter XYZ target".
Signed-off-by: Jan Engelhardt <jengelh@computergmbh.de>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 88c85d81f74f92371745158aebc5cbf490412002 forgot to remove the
old ipt_TOS file (whose code has been merged into xt_DSCP). Remove
it now.
Signed-off-by: Jan Engelhardt <jengelh@computergmbh.de>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Most of the netfilter modules are not considered experimental anymore,
the only ones I want to keep marked as EXPERIMENTAL are:
- TCPOPTSTRIP target, which is brand new.
- SANE helper, which is quite new.
- CLUSTERIP target, which I believe hasn't had much testing despite
being in the kernel for quite a long time.
- SCTP match and conntrack protocol, which are a mess and need to
be reviewed and cleaned up before I would trust them.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Initialization of the slab cache's should be done when IP is
initialized to make sure of available memory, and that code can be
marked __init.
Signed-off-by: Stephen Hemminger <stephen.hemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Show number of entries in trie, the size field was being set but never used,
but it only counted leaves, not all entries. Refactor the two cases in
fib_triestat_seq_show into a single routine.
Note: the stat structure was being malloc'd but the stack usage isn't so
high (288 bytes) that it is worth the additional complexity.
Signed-off-by: Stephen Hemminger <stephen.hemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
fib_trie_seq_show() uses two helper functions, rtn_scope() and
rtn_type() that can write to static storage without locking.
Just pass to them a temporary buffer to avoid potential corruption
(probably not triggerable but still...)
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
inet_confirm_addr can be called with NULL in_dev from arp_ignore iff
scope is RT_SCOPE_LINK.
Lets always pass the device and check for RT_SCOPE_LINK scope inside
inet_confirm_addr. This let us take network namespace from in_device a
need for an additional argument.
Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
arp_ignore has two arguments: dev & in_dev. dev is used for
inet_confirm_addr calling only.
inet_confirm_addr, in turn, either gets in_dev from the device passed
or iterates over all network devices if the device passed is NULL. It
seems logical to directly pass in_dev into inet_confirm_addr.
Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Some calls in the arp.c have network namespace as an argument. Getting
init_net inside these functions is simply inconsistent. Fix this.
Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The neighbour entry will be destroyed in the case of error, so it is
pointless to perform constly routing table lookup in this case.
Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
To do so, just register the proper subsystem and create files in
->init callbacks.
No other special per-namespace handling for raw sockets is required.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Happily, in all the rest places (->bind callbacks only), that require the
struct net, we have a socket, so get the net from it.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull the struct net pointer up to the showing functions
to filter the sockets depending on their namespaces.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This requires just to pass the appropriate struct net pointer
into __raw_v[46]_lookup and skip sockets that do not belong
to a needed namespace.
The proper net is get from skb->dev in all the cases.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
If declared as unsigned short, these fields can overflow, and whole
trie logic is broken. I could not make the machine crash, but some
tnode can never be freed.
Note for 64 bit arches : By reordering t_key and parent in [node,
leaf, tnode] structures, we can use 32 bits hole after t_key so that
sizeof(struct tnode) doesnt change after this patch.
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: Robert Olsson <robert.olsson@its.uu.se>
Signed-off-by: David S. Miller <davem@davemloft.net>
tnode_alloc() already clears allocated memory, using kcalloc() or
alloc_pages(GFP_KERNEL|__GFP_ZERO, ...)
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In struct tnode, we use two fields of 5 bits for 'pos' and 'bits'.
Switching to plain 'unsigned char' (8 bits) take the same space
because of compiler alignments, and reduce text size by 435 bytes
on i386.
On i386 :
$ size net/ipv4/fib_trie.o.before_patch net/ipv4/fib_trie.o
text data bss dec hex filename
13714 4 64 13782 35d6 net/ipv4/fib_trie.o.before
13279 4 64 13347 3423 net/ipv4/fib_trie.o
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Acked-by: Stephen Hemminger <stephen.hemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Make FIB TRIE go through sparse checker without warnings.
Signed-off-by: Stephen Hemminger <stephen.hemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The FIB TRIE code has a bunch of statistics, but the code is hidden
behind an ifdef that was never implemented. Since it was dead code, it
was broken as well.
This patch fixes that by making it a config option.
Signed-off-by: Stephen Hemminger <stephen.hemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
printk related cleanups:
* Get rid of unused printk wrappers.
* Make bug checks into KERN_WARNING because KERN_DEBUG gets ignored
* Turn one cryptic old message into something real
* Make sure all messages have KERN_XXX
Signed-off-by: Stephen Hemminger <stephen.hemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The only error from fib_insert_node is if memory allocation fails, so
instead of passing by reference, just use the convention of returning
NULL.
Signed-off-by: Stephen Hemminger <stephen.hemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use %u instead of %d when printing unsigned values.
Signed-off-by: Stephen Hemminger <stephen.hemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The revision element must of been part of an earlier design, because
currently it is set but never used.
Signed-off-by: Stephen Hemminger <stephen.hemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
trie_init is worthless it is just zeroing stuff that is already zero!
Move the memset() down to make it obvious.
Signed-off-by: Stephen Hemminger <stephen.hemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Otherwise we beat heavily on the global tcp_memory atomics
when all of the sockets in the system are slowly sending
perioding packet clumps.
Noticed and suggested by Eric Dumazet.
Signed-off-by: David S. Miller <davem@davemloft.net>
I.e. remove the net != &init_net checks from the places, that now can
handle other-than-init net namespace.
Acked-by: Benjamin Thery <benjamin.thery@bull.net>
Acked-by: Daniel Lezcano <dlezcano@fr.ibm.com>
Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Acked-by: Benjamin Thery <benjamin.thery@bull.net>
Acked-by: Daniel Lezcano <dlezcano@fr.ibm.com>
Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
... up to rtentry_to_fib_config
Acked-by: Benjamin Thery <benjamin.thery@bull.net>
Acked-by: Daniel Lezcano <dlezcano@fr.ibm.com>
Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Acked-by: Benjamin Thery <benjamin.thery@bull.net>
Acked-by: Daniel Lezcano <dlezcano@fr.ibm.com>
Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch makes the netlink socket to be per namespace. That allows
to have each namespace its own socket for routing queries.
Acked-by: Benjamin Thery <benjamin.thery@bull.net>
Acked-by: Daniel Lezcano <dlezcano@fr.ibm.com>
Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The preparatory work has been done. All we need is to substitute
fib_table_hash with net->ipv4.fib_table_hash. Netns context is
available when required.
Acked-by: Benjamin Thery <benjamin.thery@bull.net>
Acked-by: Daniel Lezcano <dlezcano@fr.ibm.com>
Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The final trick for rules: place fib4_rules_ops into struct net and
modify initialization path for this.
Acked-by: Benjamin Thery <benjamin.thery@bull.net>
Acked-by: Daniel Lezcano <dlezcano@fr.ibm.com>
Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>