Since 2005 (c1b4a7e695)
tcp_tso_should_defer has been using tcp_max_burst() as a target limit
for deciding how large to make outgoing TSO packets when not using
sysctl_tcp_tso_win_divisor. But since 2008
(dd9e0dda66) tcp_max_burst() returns the
reordering degree. We should not have tcp_tso_should_defer attempt to
build larger segments just because there is more reordering. This
commit splits the notion of deferral size used in TSO from the notion
of burst size used in cwnd moderation, and returns the TSO deferral
limit to its original value.
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Igor Maravic reported an error caused by jump_label_dec() being called
from IRQ context :
BUG: sleeping function called from invalid context at kernel/mutex.c:271
in_atomic(): 1, irqs_disabled(): 0, pid: 0, name: swapper
1 lock held by swapper/0:
#0: (&n->timer){+.-...}, at: [<ffffffff8107ce90>] call_timer_fn+0x0/0x340
Pid: 0, comm: swapper Not tainted 3.2.0-rc2-net-next-mpls+ #1
Call Trace:
<IRQ> [<ffffffff8104f417>] __might_sleep+0x137/0x1f0
[<ffffffff816b9a2f>] mutex_lock_nested+0x2f/0x370
[<ffffffff810a89fd>] ? trace_hardirqs_off+0xd/0x10
[<ffffffff8109a37f>] ? local_clock+0x6f/0x80
[<ffffffff810a90a5>] ? lock_release_holdtime.part.22+0x15/0x1a0
[<ffffffff81557929>] ? sock_def_write_space+0x59/0x160
[<ffffffff815e936e>] ? arp_error_report+0x3e/0x90
[<ffffffff810969cd>] atomic_dec_and_mutex_lock+0x5d/0x80
[<ffffffff8112fc1d>] jump_label_dec+0x1d/0x50
[<ffffffff81566525>] net_disable_timestamp+0x15/0x20
[<ffffffff81557a75>] sock_disable_timestamp+0x45/0x50
[<ffffffff81557b00>] __sk_free+0x80/0x200
[<ffffffff815578d0>] ? sk_send_sigurg+0x70/0x70
[<ffffffff815e936e>] ? arp_error_report+0x3e/0x90
[<ffffffff81557cba>] sock_wfree+0x3a/0x70
[<ffffffff8155c2b0>] skb_release_head_state+0x70/0x120
[<ffffffff8155c0b6>] __kfree_skb+0x16/0x30
[<ffffffff8155c119>] kfree_skb+0x49/0x170
[<ffffffff815e936e>] arp_error_report+0x3e/0x90
[<ffffffff81575bd9>] neigh_invalidate+0x89/0xc0
[<ffffffff81578dbe>] neigh_timer_handler+0x9e/0x2a0
[<ffffffff81578d20>] ? neigh_update+0x640/0x640
[<ffffffff81073558>] __do_softirq+0xc8/0x3a0
Since jump_label_{inc|dec} must be called from process context only,
we must defer jump_label_dec() if net_disable_timestamp() is called
from interrupt context.
Reported-by: Igor Maravic <igorm@etf.rs>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now sk_route_caps is u64, its dangerous to use an integer to store
result of an AND operator. It wont work if NETIF_F_SG is moved on the
upper part of u64.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>
The problem: Senders were overriding cwnd values picked during an undo
by calling tcp_moderate_cwnd() in tcp_try_to_open().
The fix: Don't moderate cwnd in tcp_try_to_open() if we're in
TCP_CA_Open, since doing so is generally unnecessary and specifically
would override a DSACK-based undo of a cwnd reduction made in fast
recovery.
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Previously, SACK-enabled connections hung around in TCP_CA_Disorder
state while snd_una==high_seq, just waiting to accumulate DSACKs and
hopefully undo a cwnd reduction. This could and did lead to the
following unfortunate scenario: if some incoming ACKs advance snd_una
beyond high_seq then we were setting undo_marker to 0 and moving to
TCP_CA_Open, so if (due to reordering in the ACK return path) we
shortly thereafter received a DSACK then we were no longer able to
undo the cwnd reduction.
The change: Simplify the congestion avoidance state machine by
removing the behavior where SACK-enabled connections hung around in
the TCP_CA_Disorder state just waiting for DSACKs. Instead, when
snd_una advances to high_seq or beyond we typically move to
TCP_CA_Open immediately and allow an undo in either TCP_CA_Open or
TCP_CA_Disorder if we later receive enough DSACKs.
Other patches in this series will provide other changes that are
necessary to fully fix this problem.
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The bug: When the ACK field is below snd_una (which can happen when
ACKs are reordered), senders ignored DSACKs (preventing undo) and did
not call tcp_fastretrans_alert, so they did not increment
prr_delivered to reflect newly-SACKed sequence ranges, and did not
call tcp_xmit_retransmit_queue, thus passing up chances to send out
more retransmitted and new packets based on any newly-SACKed packets.
The change: When the ACK field is below snd_una (the "old_ack" goto
label), call tcp_fastretrans_alert to allow undo based on any
newly-arrived DSACKs and try to send out more packets based on
newly-SACKed packets.
Other patches in this series will provide other changes that are
necessary to fully fix this problem.
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
The bug: Senders ignored DSACKs after recovery when there were no
outstanding packets (a common scenario for HTTP servers).
The change: when there are no outstanding packets (the "no_queue" goto
label), call tcp_fastretrans_alert() in order to use DSACKs to undo
congestion window reductions.
Other patches in this series will provide other changes that are
necessary to fully fix this problem.
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
Allow callers to decide whether an ACK is a duplicate ACK. This is a
prerequisite to allowing fastretrans_alert to be called from new
contexts, such as the no_queue and old_ack code paths, from which we
have extra info that tells us whether an ACK is a dupack.
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
The pmtu informations on the inetpeer are visible for output and
input routes. On packet forwarding, we might propagate a learned
pmtu to the sender. As we update the pmtu informations of the
inetpeer on demand, the original sender of the forwarded packets
might never notice when the pmtu to that inetpeer increases.
So use the mtu of the outgoing device on packet forwarding instead
of the pmtu to the final destination.
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We move all mtu handling from dst_mtu() down to the protocol
layer. So each protocol can implement the mtu handling in
a different manner.
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We plan to invoke the dst_opt->default_mtu() method unconditioally
from dst_mtu(). So rename the method to dst_opt->mtu() to match
the name with the new meaning.
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As it is, we return null as the default mtu of blackhole routes.
This may lead to a propagation of a bogus pmtu if the default_mtu
method of a blackhole route is invoked. So return dst->dev->mtu
as the default mtu instead.
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We can not update iph->daddr in ip_options_rcv_srr(), It is too early.
When some exception ocurred later (eg. in ip_forward() when goto
sr_failed) we need the ip header be identical to the original one as
ICMP need it.
Add a field 'nexthop' in struct ip_options to save nexthop of LSRR
or SSRR option.
Signed-off-by: Li Wei <lw@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When add sources to interface failure, need to roll back the sfcount[MODE]
to before state. We need to match it corresponding.
Acked-by: David L Stevens <dlstevens@us.ibm.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Jun Zhao <mypopydev@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Distributions are using this in their default scripts, so don't hide
them behind the advanced setting.
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
C assignment can handle struct in6_addr copying.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The forcedeth changes had a conflict with the conversion over
to atomic u64 statistics in net-next.
The libertas cfg.c code had a conflict with the bss reference
counting fix by John Linville in net-next.
Conflicts:
drivers/net/ethernet/nvidia/forcedeth.c
drivers/net/wireless/libertas/cfg.c
commit f39925dbde (ipv4: Cache learned redirect information in
inetpeer.) introduced a regression in ICMP redirect handling.
It assumed ipv4_dst_check() would be called because all possible routes
were attached to the inetpeer we modify in ip_rt_redirect(), but thats
not true.
commit 7cc9150ebe (route: fix ICMP redirect validation) tried to fix
this but solution was not complete. (It fixed only one route)
So we must lookup existing routes (including different TOS values) and
call check_peer_redir() on them.
Reported-by: Ivan Zahariev <famzah@icdsoft.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Flavio Leitner <fbl@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ping module incorrectly increments ICMP_MIB_INERRORS if feeded with a
frame not belonging to its own sockets.
RFC 2011 states that ICMP_MIB_INERRORS should count "the number of ICMP
messages which the entiry received but determined as having
ICMP-specific errors (bad ICMP checksums, bad length, etc.)."
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Vasiliy Kulikov <segoon@openwall.com>
Acked-by: Flavio Leitner <fbl@redhat.com>
Acked-by: Vasiliy Kulikov <segoon@openwall.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ip_gre: Set needed_headroom dynamically again
Now that all needed_headroom users have been fixed up so that
we can safely increase needed_headroom, this patch restore the
dynamic update of needed_headroom.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
ipv4: Remove all uses of LL_ALLOCATED_SPACE
The macro LL_ALLOCATED_SPACE was ill-conceived. It applies the
alignment to the sum of needed_headroom and needed_tailroom. As
the amount that is then reserved for head room is needed_headroom
with alignment, this means that the tail room left may be too small.
This patch replaces all uses of LL_ALLOCATED_SPACE in net/ipv4
with the macro LL_RESERVED_SPACE and direct reference to
needed_tailroom.
This also fixes the problem with needed_headroom changing between
allocating the skb and reserving the head room.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
v2: add couple missing conversions in drivers
split unexporting netdev_fix_features()
implemented %pNF
convert sock::sk_route_(no?)caps
Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>
Simon Kirby reported divides by zero errors in __tcp_select_window()
This happens when inet_csk_route_child_sock() returns a NULL pointer :
We free new socket while we eventually armed keepalive timer in
tcp_create_openreq_child()
Fix this by a call to tcp_clear_xmit_timers()
[ This is a followup to commit 918eb39962 (net: add missing
bh_unlock_sock() calls) ]
Reported-by: Simon Kirby <sim@hostway.ca>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Tested-by: Simon Kirby <sim@hostway.ca>
Signed-off-by: David S. Miller <davem@davemloft.net>
when dev_hard_header() failed, the newly allocated skb should be freed.
Signed-off-by: RongQing.Li <roy.qing.li@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit 3ceca74966 added a TOS attribute.
Unfortunately TOS and TCLASS are both present in a dual-stack v6 socket,
furthermore they can have different values. As such one cannot in a
sane way expose both through a single attribute.
Signed-off-by: Maciej Żenczyowski <maze@google.com>
CC: Murali Raja <muralira@google.com>
CC: Stephen Hemminger <shemminger@vyatta.com>
CC: Eric Dumazet <eric.dumazet@gmail.com>
CC: David S. Miller <davem@davemloft.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Le mercredi 09 novembre 2011 à 16:21 -0500, David Miller a écrit :
> From: David Miller <davem@davemloft.net>
> Date: Wed, 09 Nov 2011 16:16:44 -0500 (EST)
>
> > From: Eric Dumazet <eric.dumazet@gmail.com>
> > Date: Wed, 09 Nov 2011 12:14:09 +0100
> >
> >> unres_qlen is the number of frames we are able to queue per unresolved
> >> neighbour. Its default value (3) was never changed and is responsible
> >> for strange drops, especially if IP fragments are used, or multiple
> >> sessions start in parallel. Even a single tcp flow can hit this limit.
> > ...
> >
> > Ok, I've applied this, let's see what happens :-)
>
> Early answer, build fails.
>
> Please test build this patch with DECNET enabled and resubmit. The
> decnet neigh layer still refers to the removed ->queue_len member.
>
> Thanks.
Ouch, this was fixed on one machine yesterday, but not the other one I
used this morning, sorry.
[PATCH V5 net-next] neigh: new unresolved queue limits
unres_qlen is the number of frames we are able to queue per unresolved
neighbour. Its default value (3) was never changed and is responsible
for strange drops, especially if IP fragments are used, or multiple
sessions start in parallel. Even a single tcp flow can hit this limit.
$ arp -d 192.168.20.108 ; ping -c 2 -s 8000 192.168.20.108
PING 192.168.20.108 (192.168.20.108) 8000(8028) bytes of data.
8008 bytes from 192.168.20.108: icmp_seq=2 ttl=64 time=0.322 ms
Signed-off-by: David S. Miller <davem@davemloft.net>
When the ahash driver returns -EBUSY, AH4/6 input functions return
NET_XMIT_DROP, presumably copied from the output code path. But
returning transmit codes on input doesn't make a lot of sense.
Since NET_XMIT_DROP is a positive int, this gets interpreted as
the next header type (i.e., success). As that can only end badly,
remove the check.
Signed-off-by: Nick Bowler <nbowler@elliptictech.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Le lundi 07 novembre 2011 à 15:33 +0100, Eric Dumazet a écrit :
> At least, in recent kernels we dont change dst->refcnt in forwarding
> patch (usinf NOREF skb->dst)
>
> One particular point is the atomic_inc(dst->refcnt) we have to perform
> when queuing an UDP packet if socket asked PKTINFO stuff (for example a
> typical DNS server has to setup this option)
>
> I have one patch somewhere that stores the information in skb->cb[] and
> avoid the atomic_{inc|dec}(dst->refcnt).
>
OK I found it, I did some extra tests and believe its ready.
[PATCH net-next] ipv4: IP_PKTINFO doesnt need dst reference
When a socket uses IP_PKTINFO notifications, we currently force a dst
reference for each received skb. Reader has to access dst to get needed
information (rt_iif & rt_spec_dst) and must release dst reference.
We also forced a dst reference if skb was put in socket backlog, even
without IP_PKTINFO handling. This happens under stress/load.
We can instead store the needed information in skb->cb[], so that only
softirq handler really access dst, improving cache hit ratios.
This removes two atomic operations per packet, and false sharing as
well.
On a benchmark using a mono threaded receiver (doing only recvmsg()
calls), I can reach 720.000 pps instead of 570.000 pps.
IP_PKTINFO is typically used by DNS servers, and any multihomed aware
UDP application.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Reading /proc/net/snmp on a machine with a lot of cpus is very expensive
(can be ~88000 us).
This is because ICMPMSG MIB uses 4096 bytes per cpu, and folding values
for all possible cpus can read 16 Mbytes of memory.
ICMP messages are not considered as fast path on a typical server, and
eventually few cpus handle them anyway. We can afford an atomic
operation instead of using percpu data.
This saves 4096 bytes per cpu and per network namespace.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When opt->srr_is_hit is set skb_rtable(skb) has been updated for
'nexthop' and iph->daddr should always equals to skb_rtable->rt_dst
holds, We need update iph->daddr either.
Signed-off-by: Li Wei <lw@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The AH4/6 ahash input callbacks read out the nexthdr field from the AH
header *after* they overwrite that header. This is obviously not going
to end well. Fix it up.
Signed-off-by: Nick Bowler <nbowler@elliptictech.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The AH4/6 ahash output callbacks pass nexthdr to xfrm_output_resume
instead of the error code. This appears to be a copy+paste error from
the input case, where nexthdr is expected. This causes the driver to
continuously add AH headers to the datagram until either an allocation
fails and the packet is dropped or the ahash driver hits a synchronous
fallback and the resulting monstrosity is transmitted.
Correct this issue by simply passing the error code unadulterated.
Signed-off-by: Nick Bowler <nbowler@elliptictech.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Make clear that sk_clone() and inet_csk_clone() return a locked socket.
Add _lock() prefix and kerneldoc.
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Tunnels can force an alignment of their percpu data to reduce number of
cache lines used in fast path, or read in .ndo_get_stats()
percpu_alloc() is a very fine grained allocator, so any small hole will
be used anyway.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As we update the learned pmtu informations on demand, we might
report a nagative expiration time value to userspace if the
pmtu informations are already expired and we have not send a
packet to that inetpeer after expiration. With this patch we
send a expire time of null to userspace after expiration
until the next packet is send to that inetpeer.
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
TCP_NODELAY is weaker than TCP_CORK, when TCP_CORK was set, small
segments will always pass Nagle test regardless of TCP_NODELAY option.
Signed-off-by: Feng King <kinwin2008@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
* 'modsplit-Oct31_2011' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux: (230 commits)
Revert "tracing: Include module.h in define_trace.h"
irq: don't put module.h into irq.h for tracking irqgen modules.
bluetooth: macroize two small inlines to avoid module.h
ip_vs.h: fix implicit use of module_get/module_put from module.h
nf_conntrack.h: fix up fallout from implicit moduleparam.h presence
include: replace linux/module.h with "struct module" wherever possible
include: convert various register fcns to macros to avoid include chaining
crypto.h: remove unused crypto_tfm_alg_modname() inline
uwb.h: fix implicit use of asm/page.h for PAGE_SIZE
pm_runtime.h: explicitly requires notifier.h
linux/dmaengine.h: fix implicit use of bitmap.h and asm/page.h
miscdevice.h: fix up implicit use of lists and types
stop_machine.h: fix implicit use of smp.h for smp_processor_id
of: fix implicit use of errno.h in include/linux/of.h
of_platform.h: delete needless include <linux/module.h>
acpi: remove module.h include from platform/aclinux.h
miscdevice.h: delete unnecessary inclusion of module.h
device_cgroup.h: delete needless include <linux/module.h>
net: sch_generic remove redundant use of <linux/module.h>
net: inet_timewait_sock doesnt need <linux/module.h>
...
Fix up trivial conflicts (other header files, and removal of the ab3550 mfd driver) in
- drivers/media/dvb/frontends/dibx000_common.c
- drivers/media/video/{mt9m111.c,ov6650.c}
- drivers/mfd/ab3550-core.c
- include/linux/dmaengine.h
Simon Kirby reported lockdep warnings and following messages :
[104661.897577] huh, entered softirq 3 NET_RX ffffffff81613740
preempt_count 00000101, exited with 00000102?
[104661.923653] huh, entered softirq 3 NET_RX ffffffff81613740
preempt_count 00000101, exited with 00000102?
Problem comes from commit 0e734419
(ipv4: Use inet_csk_route_child_sock() in DCCP and TCP.)
If inet_csk_route_child_sock() returns NULL, we should release socket
lock before freeing it.
Another lock imbalance exists if __inet_inherit_port() returns an error
since commit 093d282321 ( tproxy: fix hash locking issue when using
port redirection in __inet_inherit_port()) a backport is also needed for
>= 2.6.37 kernels.
Reported-by: Simon Kirby <sim@hostway.ca>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Tested-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Balazs Scheidler <bazsi@balabit.hu>
CC: KOVACS Krisztian <hidden@balabit.hu>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Simon Kirby <sim@hostway.ca>
Signed-off-by: David S. Miller <davem@davemloft.net>
udp_queue_rcv_skb() has a possible race in encap_rcv handling, since
this pointer can be changed anytime.
We should use ACCESS_ONCE() to close the race.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
the tcp and udp code creates a set of struct file_operations at runtime
while it can also be done at compile time, with the added benefit of then
having these file operations be const.
the trickiest part was to get the "THIS_MODULE" reference right; the naive
method of declaring a struct in the place of registration would not work
for this reason.
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Site specific OOM messages are duplications of a generic MM
out of memory message and aren't really useful, so just
delete them.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
These files are non modular, but need to export symbols using
the macros now living in export.h -- call out the include so
that things won't break when we remove the implicit presence
of module.h from everywhere.
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
With calls to modular infrastructure, these files really
needs the full module.h header. Call it out so some of the
cleanups of implicit and unrequired includes elsewhere can be
cleaned up.
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
commit 66b13d99d9 (ipv4: tcp: fix TOS value in ACK messages sent from
TIME_WAIT) fixed IPv4 only.
This part is for the IPv6 side, adding a tclass param to ip6_xmit()
We alias tw_tclass and tw_tos, if socket family is INET6.
[ if sockets is ipv4-mapped, only IP_TOS socket option is used to fill
TOS field, TCLASS is not taken into account ]
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In func ipv4_dst_check,check_peer_pmtu should be called only when peer is updated.
So,if the peer is not updated in ip_rt_frag_needed,we can not inc __rt_peer_genid.
Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It was enabled by default and the messages guarded
by the define are useful.
Signed-off-by: Flavio Leitner <fbl@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is a long standing bug in linux tcp stack, about ACK messages sent
on behalf of TIME_WAIT sockets.
In the IP header of the ACK message, we choose to reflect TOS field of
incoming message, and this might break some setups.
Example of things that were broken :
- Routing using TOS as a selector
- Firewalls
- Trafic classification / shaping
We now remember in timewait structure the inet tos field and use it in
ACK generation, and route lookup.
Notes :
- We still reflect incoming TOS in RST messages.
- We could extend MuraliRaja Muniraju patch to report TOS value in
netlink messages for TIME_WAIT sockets.
- A patch is needed for IPv6
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>