Instead of keeping candidate tunnel device from all categories,
keep only one candidate with best score. This optimizes stack
usage and speeds up exit code.
Signed-off-by: Timo Teras <timo.teras@iki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (92 commits)
gianfar: Revive VLAN support
vlan: Export symbols as non GPL symbols.
bnx2x: tx_has_work should not wait for FW
netxen: reduce memory footprint
netxen: fix vlan tso/checksum offload
net: Fix linux/if_frad.h's suitability for userspace.
net: Move config NET_NS to from net/Kconfig to init/Kconfig
isdn: Fix missing ifdef in isdn_ppp
networking: document "nc" in addition to "netcat" in netconsole.txt
e1000e: workaround hw errata
af_key: initialize xfrm encap_oa
virtio_net: Fix MAX_PACKET_LEN to support 802.1Q VLANs
lcs: fix compilation for !CONFIG_IP_MULTICAST
rtl8187: Add termination packet to prevent stall
iwlwifi: fix rs_get_rate WARN_ON()
p54usb: fix packet loss with first generation devices
sctp: Fix another socket race during accept/peeloff
sctp: Properly timestamp outgoing data chunks for rtx purposes
sctp: Correctly start rtx timer on new packet transmissions.
sctp: Fix crc32c calculations on big-endian arhes.
...
In previous kernels, any kernel module could get access to the
'real-device' and the VLAN-ID for a particular VLAN. In more recent
kernels, the code was restructured such that this is hard to do
without accessing private .h files for any module that cannot use
GPL-only symbols.
Attached is a patch to once again allow non-GPL modules the ability to
access the real-device and VLAN id for VLANs.
Signed-off-by: Ben Greear <greearb@candelatech.com>
Acked-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Make NET_NS available underneath the generic Namespaces config option
since all of the other namespace options are there.
Signed-off-by: Matt Helsley <matthltc@us.ibm.com>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently encap_oa is left uninitialized, so it contains garbage data which
is visible to userland via Netlink. Initialize it by zeroing it out.
Signed-off-by: Timo Teras <timo.teras@iki.fi>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is a race between sctp_rcv() and sctp_accept() where we
have moved the association from the listening socket to the
accepted socket, but sctp_rcv() processing cached the old
socket and continues to use it.
The easy solution is to check for the socket mismatch once we've
grabed the socket lock. If we hit a mis-match, that means
that were are currently holding the lock on the listening socket,
but the association is refrencing a newly accepted socket. We need
to drop the lock on the old socket and grab the lock on the new one.
A more proper solution might be to create accepted sockets when
the new association is established, similar to TCP. That would
eliminate the race for 1-to-1 style sockets, but it would still
existing for 1-to-many sockets where a user wished to peeloff an
association. For now, we'll live with this easy solution as
it addresses the problem.
Reported-by: Michal Hocko <mhocko@suse.cz>
Reported-by: Karsten Keil <kkeil@suse.de>
Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Recent changes to the retransmit code exposed a long standing
bug where it was possible for a chunk to be time stamped
after the retransmit timer was reset. This caused a rare
situation where the retrnamist timer has expired, but
nothing was marked for retrnasmission because all of
timesamps on data were less then 1 rto ago. As result,
the timer was never restarted since nothing was retransmitted,
and this resulted in a hung association that did couldn't
complete the data transfer. The solution is to timestamp
the chunk when it's added to the packet for transmission
purposes. After the packet is trsnmitted the rtx timer
is restarted. This guarantees that when the timer expires,
there will be data to retransmit.
Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 62aeaff5cc
(sctp: Start T3-RTX timer when fast retransmitting lowest TSN)
introduced a regression where it was possible to forcibly
restart the sctp retransmit timer at the transmission of any
new chunk. This resulted in much longer timeout times and
sometimes hung sctp connections.
Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This last patch makes the appropriate changes to use and propagate the
network namespace where needed in IPv4 multicast routing code.
This consists mainly in replacing all the remaining init_net occurences
with current netns pointer retrieved from sockets, net devices or
mfc_caches depending on the routines' contexts.
Some routines receive a new 'struct net' parameter to propagate the current
netns:
* vif_add/vif_delete
* ipmr_new_tunnel
* mroute_clean_tables
* ipmr_cache_find
* ipmr_cache_report
* ipmr_cache_unresolved
* ipmr_mfc_add/ipmr_mfc_delete
* ipmr_get_route
* rt_fill_info (in route.c)
Signed-off-by: Benjamin Thery <benjamin.thery@bull.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Declare IPv4 multicast forwarding /proc/net entries per-namespace:
/proc/net/ip_mr_vif
/proc/net/ip_mr_cache
Signed-off-by: Benjamin Thery <benjamin.thery@bull.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Preliminary work to make IPv4 multicast routing netns-aware.
Declare variable 'reg_vif_num' per-namespace, move into struct netns_ipv4.
At the moment, this variable is only referenced in init_net.
Signed-off-by: Benjamin Thery <benjamin.thery@bull.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Preliminary work to make IPv4 multicast routing netns-aware.
Declare IPv multicast routing variables 'mroute_do_assert' and
'mroute_do_pim' per-namespace in struct netns_ipv4.
At the moment, these variables are only referenced in init_net.
Signed-off-by: Benjamin Thery <benjamin.thery@bull.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Preliminary work to make IPv4 multicast routing netns-aware.
Declare variable cache_resolve_queue_len per-namespace: move it into
struct netns_ipv4.
This variable counts the number of unresolved cache entries queued in the
list mfc_unres_queue. This list is kept global to all netns as the number
of entries per namespace is limited to 10 (hardcoded in routine
ipmr_cache_unresolved).
Entries belonging to different namespaces in mfc_unres_queue will be
identified by matching the mfc_net member introduced previously in
struct mfc_cache.
Keeping this list global to all netns, also allows us to keep a single
timer (ipmr_expire_timer) to handle their expiration.
In some places cache_resolve_queue_len value was tested for arming
or deleting the timer. These tests were equivalent to testing
mfc_unres_queue value instead and are replaced in this patch.
At the moment, cache_resolve_queue_len is only referenced in init_net.
Signed-off-by: Benjamin Thery <benjamin.thery@bull.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Preliminary work to make IPv4 multicast routing netns-aware.
Dynamically allocate IPv4 multicast forwarding cache, mfc_cache_array,
and move it to struct netns_ipv4.
At the moment, mfc_cache_array is only referenced in init_net.
Signed-off-by: Benjamin Thery <benjamin.thery@bull.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch stores into struct mfc_cache the network namespace each
mfc_cache belongs to. The new member is mfc_net.
mfc_net is assigned at cache allocation and doesn't change during
the rest of the cache entry life.
A new net parameter is added to ipmr_cache_alloc/ipmr_cache_alloc_unres.
This will help to retrieve the current netns around the IPv4 multicast
routing code.
At the moment, all mfc_cache are allocated in init_net.
Signed-off-by: Benjamin Thery <benjamin.thery@bull.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Preliminary work to make IPv6 multicast routing netns-aware.
Dynamically allocate interface table vif_table and move it to
struct netns_ipv4, and update MIF_EXISTS() macro.
At the moment, vif_table is only referenced in init_net.
Signed-off-by: Benjamin Thery <benjamin.thery@bull.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Preliminary work to make IPv4 multicast routing netns-aware.
Make IPv4 multicast routing mroute_socket per-namespace,
moves it into struct netns_ipv4.
At the moment, mroute_socket is only referenced in init_net.
Signed-off-by: Benjamin Thery <benjamin.thery@bull.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
wlan0: switched to short barker preamble (BSSID=00:01:aa:bb:cc:dd)
wlan0: switched to short slot (BSSID=) <something is missing here>
should be:
wlan0: switched to short barker preamble (BSSID=00:01:aa:bb:cc:dd)
wlan0: switched to short slot (BSSID=00:01:aa:bb:cc:dd)
Signed-off-by: Christian Lamparter <chunkeey@web.de>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
After launching mesh discovery in tx path, reference count was not being
decremented. This was preventing module unload.
Signed-off-by: Brian Cavagnolo <brian@cozybit.com>
Signed-off-by: Andrey Yurovsky <andrey@cozybit.com>
Acked-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Check the device on receive path and allow otherwise identical devices
as long as the physical device differs.
This is useful for NBMA tunnels, where you want to use different gre IP
for each public IP available via different physical devices.
Signed-off-by: Timo Teras <timo.teras@iki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
With simple extension to the binding mechanism, which allows to bind more
than 64k sockets (or smaller amount, depending on sysctl parameters),
we have to traverse the whole bind hash table to find out empty bucket.
And while it is not a problem for example for 32k connections, bind()
completion time grows exponentially (since after each successful binding
we have to traverse one bucket more to find empty one) even if we start
each time from random offset inside the hash table.
So, when hash table is full, and we want to add another socket, we have
to traverse the whole table no matter what, so effectivelly this will be
the worst case performance and it will be constant.
Attached picture shows bind() time depending on number of already bound
sockets.
Green area corresponds to the usual binding to zero port process, which
turns on kernel port selection as described above. Red area is the bind
process, when number of reuse-bound sockets is not limited by 64k (or
sysctl parameters). The same exponential growth (hidden by the green
area) before number of ports reaches sysctl limit.
At this time bind hash table has exactly one reuse-enbaled socket in a
bucket, but it is possible that they have different addresses. Actually
kernel selects the first port to try randomly, so at the beginning bind
will take roughly constant time, but with time number of port to check
after random start will increase. And that will have exponential growth,
but because of above random selection, not every next port selection
will necessary take longer time than previous. So we have to consider
the area below in the graph (if you could zoom it, you could find, that
there are many different times placed there), so area can hide another.
Blue area corresponds to the port selection optimization.
This is rather simple design approach: hashtable now maintains (unprecise
and racely updated) number of currently bound sockets, and when number
of such sockets becomes greater than predefined value (I use maximum
port range defined by sysctls), we stop traversing the whole bind hash
table and just stop at first matching bucket after random start. Above
limit roughly corresponds to the case, when bind hash table is full and
we turned on mechanism of allowing to bind more reuse-enabled sockets,
so it does not change behaviour of other sockets.
Signed-off-by: Evgeniy Polyakov <zbr@ioremap.net>
Tested-by: Denys Fedoryschenko <denys@visp.net.lb>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since all feature-negotiation processing now takes place in feat.c,
functions for producing verbose debugging output are concentrated
there.
New functions to print out values, entry records, and options are
provided, and also a macro is defined to not always have the function
name in the output line.
Thanks a lot to Wei Yongjun and Giuseppe Galeota for help and
discussion with an earlier revision of this patch.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch takes care of initialising and type-checking sysctls
related to feature negotiation. Type checking is important since some
of the sysctls now directly impact the feature-negotiation process.
The sysctls are initialised with the known default values for each
feature. For the type-checking the value constraints from RFC 4340
are used:
* Sequence Window uses the specified Wmin=32, the maximum is ulong (4 bytes),
tested and confirmed that it works up to 4294967295 - for Gbps speed;
* Ack Ratio is between 0 .. 0xffff (2-byte unsigned integer);
* CCIDs are between 0 .. 255;
* request_retries, retries1, retries2 also between 0..255 for good measure;
* tx_qlen is checked to be non-negative;
* sync_ratelimit remains as before.
Notes:
------
1. Die s@sysctl_dccp_feat@sysctl_dccp@g since the sysctls are now in feat.c.
2. As pointed out by Arnaldo, the pattern of type-checking repeats itself in
other places, sometimes with exactly the same kind of definitions (e.g.
"static int zero;"). It may be a good idea (kernel janitors?) to consolidate
type checking. For the sake of keeping the changeset small and in order not
to affect other subsystems, I have not strived to generalise here.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: David S. Miller <davem@davemloft.net>
This adds full support for local/remote Sequence Window feature, from which the
* sequence-number-validity (W) and
* acknowledgment-number-validity (W') windows
derive as specified in RFC 4340, 7.5.3.
Specifically, the following is contained in this patch:
* integrated new socket fields into dccp_sk;
* updated the update_gsr/gss routines with regard to these fields;
* updated handler code: the Sequence Window feature is located at the TX side,
so the local feature is meant if the handler-rx flag is false;
* the initialisation of `rcv_wnd' in reqsk is removed, since
- rcv_wnd is not used by the code anywhere;
- sequence number checks are not done in the LISTEN state (cf. 7.5.3);
- dccp_check_req checks the Ack number validity more rigorously;
* the `struct dccp_minisock' became empty and is now removed.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: David S. Miller <davem@davemloft.net>
This initialises feature negotiation from two tables, which are in
turn are initialised from sysctls.
As a novel feature, specifics of the implementation (e.g. that short
seqnos and ECN are not yet available) are advertised for robustness.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: David S. Miller <davem@davemloft.net>
With net_device_ops if set_mac_address is null, then error
is -EOPNOTSUPPORTED.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Acked-by: Ralf Baechle <ralf@linux-mips.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Acked-by: Ralf Baechle <ralf@linux-mips.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Acked-by: Ralf Baechle <ralf@linux-mips.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Acked-by: Ralf Baechle <ralf@linux-mips.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now that stats are in net_device, use them.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Caused by call to request_module() while holding nf_conntrack_lock.
Reported-and-tested-by: Kövesdi György <kgy@teledigit.hu>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
The previous fix to paged packets broke the merging because it
reset the skb->len before we added it to the merged packet. This
wasn't detected because it simply resulted in the truncation of
the packet while the missing bit is subsequently retransmitted.
The fix is to store skb->len before we clobber it.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
When a frag is shorter than an Ethernet header, we'd return a
zeroed packet instead of aborting. This patch fixes that.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
We need to perform skb_postpull_rcsum after pulling the IPv6
header in order to maintain the correctness of the complete
checksum.
This patch also adds a missing iph reload after pulling.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
register_pernet_gen_subsys omits mutex_unlock in one fail path.
Fix it.
Signed-off-by: Jiri Slaby <jirislaby@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This reverts commit fc8c7dc1b2.
As indicated by Jiri Klimes, this won't work. These numbers are
not only used the size validation, they are also used to locate
attributes sitting after the message.
Signed-off-by: David S. Miller <davem@davemloft.net>
The trick in socket splicing where we try to convert the skb->data
into a page based reference using virt_to_page() does not work so
well.
The idea is to pass the virt_to_page() reference via the pipe
buffer, and refcount the buffer using a SKB reference.
But if we are splicing from a socket to a socket (via sendpage)
this doesn't work.
The from side processing will grab the page (and SKB) references.
The sendpage() calls will grab page references only, return, and
then the from side processing completes and drops the SKB ref.
The page based reference to skb->data is not enough to keep the
kmalloc() buffer backing it from being reused. Yet, that is
all that the socket send side has at this point.
This leads to data corruption if the skb->data buffer is reused
by SLAB before the send side socket actually gets the TX packet
out to the device.
The fix employed here is to simply allocate a page and copy the
skb->data bytes into that page.
This will hurt performance, but there is no clear way to fix this
properly without a copy at the present time, and it is important
to get rid of the data corruption.
With fixes from Herbert Xu.
Tested-by: Willy Tarreau <w@1wt.eu>
Foreseen-by: Changli Gao <xiaosuo@gmail.com>
Diagnosed-by: Willy Tarreau <w@1wt.eu>
Reported-by: Willy Tarreau <w@1wt.eu>
Fixed-by: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
I'm trying to track down why people're hitting the checksum warning
in skb_gso_segment. As the problem seems to be hitting lots of
people and I can't reproduce it or locate the bug, here is a patch
to print out more details which hopefully should help us to track
this down.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>