Adaptative RED AQM for linux, based on paper from Sally FLoyd,
Ramakrishna Gummadi, and Scott Shenker, August 2001 :
http://icir.org/floyd/papers/adaptiveRed.pdf
Goal of Adaptative RED is to make max_p a dynamic value between 1% and
50% to reach the target average queue : (max_th - min_th) / 2
Every 500 ms:
if (avg > target and max_p <= 0.5)
increase max_p : max_p += alpha;
else if (avg < target and max_p >= 0.01)
decrease max_p : max_p *= beta;
target :[min_th + 0.4*(min_th - max_th),
min_th + 0.6*(min_th - max_th)].
alpha : min(0.01, max_p / 4)
beta : 0.9
max_P is a Q0.32 fixed point number (unsigned, with 32 bits mantissa)
Changes against our RED implementation are :
max_p is no longer a negative power of two (1/(2^Plog)), but a Q0.32
fixed point number, to allow full range described in Adatative paper.
To deliver a random number, we now use a reciprocal divide (thats really
a multiply), but this operation is done once per marked/droped packet
when in RED_BETWEEN_TRESH window, so added cost (compared to previous
AND operation) is near zero.
dump operation gives current max_p value in a new TCA_RED_MAX_P
attribute.
Example on a 10Mbit link :
tc qdisc add dev $DEV parent 1:1 handle 10: est 1sec 8sec red \
limit 400000 min 30000 max 90000 avpkt 1000 \
burst 55 ecn adaptative bandwidth 10Mbit
# tc -s -d qdisc show dev eth3
...
qdisc red 10: parent 1:1 limit 400000b min 30000b max 90000b ecn
adaptative ewma 5 max_p=0.113335 Scell_log 15
Sent 50414282 bytes 34504 pkt (dropped 35, overlimits 1392 requeues 0)
rate 9749Kbit 831pps backlog 72056b 16p requeues 0
marked 1357 early 35 pdrop 0 other 0
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
To reflect the fact that a refrence is not obtained to the
resulting neighbour entry.
Signed-off-by: David S. Miller <davem@davemloft.net>
Acked-by: Roland Dreier <roland@purestorage.com>
Le mercredi 30 novembre 2011 à 14:36 -0800, Stephen Hemminger a écrit :
> (Almost) nobody uses RED because they can't figure it out.
> According to Wikipedia, VJ says that:
> "there are not one, but two bugs in classic RED."
RED is useful for high throughput routers, I doubt many linux machines
act as such devices.
I was considering adding Adaptative RED (Sally Floyd, Ramakrishna
Gummadi, Scott Shender), August 2001
In this version, maxp is dynamic (from 1% to 50%), and user only have to
setup min_th (target average queue size)
(max_th and wq (burst in linux RED) are automatically setup)
By the way it seems we have a small bug in red_change()
if (skb_queue_empty(&sch->q))
red_end_of_idle_period(&q->parms);
First, if queue is empty, we should call
red_start_of_idle_period(&q->parms);
Second, since we dont use anymore sch->q, but q->qdisc, the test is
meaningless.
Oh well...
[PATCH] sch_red: fix red_change()
Now RED is classful, we must check q->qdisc->q.qlen, and if queue is empty,
we start an idle period, not end it.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ERROR: "__udivdi3" [net/sched/sch_netem.ko] undefined!
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Hagen Paul Pfeifer <hagen@jauu.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently netem is not in the ability to emulate channel bandwidth. Only static
delay (and optional random jitter) can be configured.
To emulate the channel rate the token bucket filter (sch_tbf) can be used. But
TBF has some major emulation flaws. The buffer (token bucket depth/rate) cannot
be 0. Also the idea behind TBF is that the credit (token in buckets) fills if
no packet is transmitted. So that there is always a "positive" credit for new
packets. In real life this behavior contradicts the law of nature where
nothing can travel faster as speed of light. E.g.: on an emulated 1000 byte/s
link a small IPv4/TCP SYN packet with ~50 byte require ~0.05 seconds - not 0
seconds.
Netem is an excellent place to implement a rate limiting feature: static
delay is already implemented, tfifo already has time information and the
user can skip TBF configuration completely.
This patch implement rate feature which can be configured via tc. e.g:
tc qdisc add dev eth0 root netem rate 10kbit
To emulate a link of 5000byte/s and add an additional static delay of 10ms:
tc qdisc add dev eth0 root netem delay 10ms rate 5KBps
Note: similar to TBF the rate extension is bounded to the kernel timing
system. Depending on the architecture timer granularity, higher rates (e.g.
10mbit/s and higher) tend to transmission bursts. Also note: further queues
living in network adaptors; see ethtool(8).
Signed-off-by: Hagen Paul Pfeifer <hagen@jauu.net>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@drr.davemloft.net>
We need rcu_read_lock() protection before using dst_get_neighbour(), and
we must cache its value (pass it to __teql_resolve())
teql_master_xmit() is called under rcu_read_lock_bh() protection, its
not enough.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Instead of using a custom flow dissector, use skb_flow_dissect() and
benefit from tunnelling support.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Instead of using a custom flow dissector, use skb_flow_dissect() and
benefit from tunnelling support.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Create separate queue state flags so that either the stack or drivers
can turn on XOFF. Added a set of functions used in the stack to determine
if a queue is really stopped (either by stack or driver)
Signed-off-by: Tom Herbert <therbert@google.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Current SFB double hashing is not fulfilling SFB theory, if two flows
share same rxhash value.
Using skb_flow_dissect() permits to really have better hash dispersion,
and get tunnelling support as well.
Double hashing point was mentioned by Florian Westphal
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Instead of using a custom flow dissector, use skb_flow_dissect() and
benefit from tunnelling support.
This lack of tunnelling support was mentioned by Dan Siemon.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This adds the /sys/class/net/DEV/queues/Q/tx_timeout attribute
containing the total number of timeout events on the given queue. It
is always available with CONFIG_SYSFS, independently of
CONFIG_RPS/XPS.
Credits to Stephen Hemminger for a preliminary version of this patch.
Tested:
without CONFIG_SYSFS (compilation only)
with sysfs and without CONFIG_RPS & CONFIG_XPS
with sysfs and without CONFIG_RPS
with sysfs and without CONFIG_XPS
with defaults
Signed-off-by: David Decotigny <david.decotigny@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Remove the assumption that skb_get_rxhash() makes IP header and ports
linear, and use skb_header_pointer() instead in choke_match_flow()
This permits __skb_get_rxhash() to use skb_header_pointer() eventually.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
These files are non modular, but need to export symbols using
the macros now living in export.h -- call out the include so
that things won't break when we remove the implicit presence
of module.h from everywhere.
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
With calls to modular infrastructure, these files really
needs the full module.h header. Call it out so some of the
cleanups of implicit and unrequired includes elsewhere can be
cleaned up.
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Dan Siemon would like to add tunnelling support to cls_flow
This preliminary patch introduces use of skb_header_pointer() to help
this task, while avoiding skb head reallocation because of deep packet
inspection.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
File cls_rsvp.h in /net/sched was outdated. I'm sending you patch for this
file.
[ tb[] array should be indexed by X not X-1 -DaveM ]
Signed-off-by: Igor Maravić <igorm@etf.rs>
Signed-off-by: David S. Miller <davem@davemloft.net>
In case SFB queue is full (hard limit reached), there is no point
spending time to compute hash and maximum qlen/p_mark.
We instead just early drop packet.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When a redirected or mirrored packet is dropped by the target
device we need to record statistics.
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit 07bd8df5df
(sch_sfq: fix peek() implementation) changed sfq to use generic
peek helper.
This makes HFSC complain about a non-work-conserving child qdisc, if
prio with sfq child is used within hfsc:
hfsc peeks into prio qdisc, which will then peek into sfq.
returned skb is stashed in sch->gso_skb.
Next, hfsc tries to dequeue from prio, but prio will call sfq dequeue
directly, which may return NULL instead of previously peeked-at skb.
Have prio call qdisc_dequeue_peeked, so sfq->dequeue() is
not called in this case.
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit 8efa885406 (sch_sfq: avoid giving spurious NET_XMIT_CN signals)
forgot to call qdisc_tree_decrease_qlen() to signal upper levels that a
packet (from another flow) was dropped, leading to various problems.
With help from Michal Soltys and Michal Pokrywka, who did a bisection.
Bugzilla ref: https://bugzilla.kernel.org/show_bug.cgi?id=39372
Debian ref: http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=631945
Reported-by: Lucas Bocchi <lucas.bocchi@gmail.com>
Reported-and-bisected-by: Michal Pokrywka <wolfmoon@o2.pl>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Michal Soltys <soltys@ziu.info>
Acked-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Remove it, as it indirectly exposes netdev features. It's not used in
iproute2 (2.6.38) - is anything else using its interface?
Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>
Results on dummy device can be seen in my netconf 2011
slides. These results are for a 10Gige IXGBE intel
nic - on another i5 machine, very similar specs to
the one used in the netconf2011 results.
It turns out - this is a hell lot worse than dummy
and so this patch is even more beneficial for 10G.
Test setup:
----------
System under test sending packets out.
Additional box connected directly dropping packets.
Installed prio qdisc on the eth device and default
netdev default length of 1000 used as is.
The 3 prio bands each were set to 100 (didnt factor in
the results).
5 packet runs were made and the middle 3 picked.
results
-------
The "cpu" column indicates the which cpu the sample
was taken on,
The "Pkt runx" carries the number of packets a cpu
dequeued when forced to be in the "dequeuer" role.
The "avg" for each run is the number of times each
cpu should be a "dequeuer" if the system was fair.
3.0-rc4 (plain)
cpu Pkt run1 Pkt run2 Pkt run3
================================================
cpu0 21853354 21598183 22199900
cpu1 431058 473476 393159
cpu2 481975 477529 458466
cpu3 23261406 23412299 22894315
avg 11506948 11490372 11486460
3.0-rc4 with patch and default weight 64
cpu Pkt run1 Pkt run2 Pkt run3
================================================
cpu0 13205312 13109359 13132333
cpu1 10189914 10159127 10122270
cpu2 10213871 10124367 10168722
cpu3 13165760 13164767 13096705
avg 11693714 11639405 11630008
As you can see the system is still not perfect but
is a lot better than what it was before...
At the moment we use the old backlog weight, weight_p
which is 64 packets. It seems to be reasonably fine
with that value.
The system could be made more fair if we reduce the
weight_p (as per my presentation), but we are going
to affect the shared backlog weight. Unless deemed
necessary, I think the default value is fine. If not
we could add yet another knob.
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There are enough instances of this:
iph->frag_off & htons(IP_MF | IP_OFFSET)
that a helper function is probably warranted.
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Remove linux/mm.h inclusion from netdevice.h -- it's unused (I've checked manually).
To prevent mm.h inclusion via other channels also extract "enum dma_data_direction"
definition into separate header. This tiny piece is what gluing netdevice.h with mm.h
via "netdevice.h => dmaengine.h => dma-mapping.h => scatterlist.h => mm.h".
Removal of mm.h from scatterlist.h was tried and was found not feasible
on most archs, so the link was cutoff earlier.
Hope people are OK with tiny include file.
Note, that mm_types.h is still dragged in, but it is a separate story.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The message size allocated for rtnl ifinfo dumps was limited to
a single page. This is not enough for additional interface info
available with devices that support SR-IOV and caused a bug in
which VF info would not be displayed if more than approximately
40 VFs were created per interface.
Implement a new function pointer for the rtnl_register service that will
calculate the amount of data required for the ifinfo dump and allocate
enough data to satisfy the request.
Signed-off-by: Greg Rose <gregory.v.rose@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
* remove interrupt.g inclusion from netdevice.h -- not needed
* fixup fallout, add interrupt.h and hardirq.h back where needed.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This interface uses a temporary buffer, but for no real reason.
And now can generate warnings like:
net/sched/sch_generic.c: In function dev_watchdog
net/sched/sch_generic.c:254:10: warning: unused variable drivername
Just return driver->name directly or "".
Reported-by: Connor Hansen <cmdkhh@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since commit eeaeb068f1 (sch_sfq: allow big packets and be fair),
sfq_peek() can return a different skb that would be normally dequeued by
sfq_dequeue() [ if current slot->allot is negative ]
Use generic qdisc_peek_dequeued() instead of custom implementation, to
get consistent result.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Jarek Poplawski <jarkao2@gmail.com>
CC: Patrick McHardy <kaber@trash.net>
CC: Jesper Dangaard Brouer <hawk@diku.dk>
Signed-off-by: David S. Miller <davem@davemloft.net>
While chasing a possible net_sched bug, I found that IP fragments have
litle chance to pass a congestioned SFQ qdisc :
- Say SFQ qdisc is full because one flow is non responsive.
- ip_fragment() wants to send two fragments belonging to an idle flow.
- sfq_enqueue() queues first packet, but see queue limit reached :
- sfq_enqueue() drops one packet from 'big consumer', and returns
NET_XMIT_CN.
- ip_fragment() cancel remaining fragments.
This patch restores fairness, making sure we return NET_XMIT_CN only if
we dropped a packet from the same flow.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Patrick McHardy <kaber@trash.net>
CC: Jarek Poplawski <jarkao2@gmail.com>
CC: Jamal Hadi Salim <hadi@cyberus.ca>
CC: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
dev_deactivate_many() issues one synchronize_rcu() call after qdiscs set
to noop_qdisc.
This call is here to make sure they are no outstanding qdisc-less
dev_queue_xmit calls before returning to caller.
But in dismantle phase, we dont have to wait, because we wont activate
again the device, and we are going to wait one rcu grace period later in
rollback_registered_many().
After this patch, device dismantle uses one synchronize_net() and one
rcu_barrier() call only, so we have a ~30% speedup and a smaller RTNL
latency.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Patrick McHardy <kaber@trash.net>,
CC: Ben Greear <greearb@candelatech.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6: (1446 commits)
macvlan: fix panic if lowerdev in a bond
tg3: Add braces around 5906 workaround.
tg3: Fix NETIF_F_LOOPBACK error
macvlan: remove one synchronize_rcu() call
networking: NET_CLS_ROUTE4 depends on INET
irda: Fix error propagation in ircomm_lmp_connect_response()
irda: Kill set but unused variable 'bytes' in irlan_check_command_param()
irda: Kill set but unused variable 'clen' in ircomm_connect_indication()
rxrpc: Fix set but unused variable 'usage' in rxrpc_get_transport()
be2net: Kill set but unused variable 'req' in lancer_fw_download()
irda: Kill set but unused vars 'saddr' and 'daddr' in irlan_provider_connect_indication()
atl1c: atl1c_resume() is only used when CONFIG_PM_SLEEP is defined.
rxrpc: Fix set but unused variable 'usage' in rxrpc_get_peer().
rxrpc: Kill set but unused variable 'local' in rxrpc_UDP_error_handler()
rxrpc: Kill set but unused variable 'sp' in rxrpc_process_connection()
rxrpc: Kill set but unused variable 'sp' in rxrpc_rotate_tx_window()
pkt_sched: Kill set but unused variable 'protocol' in tc_classify()
isdn: capi: Use pr_debug() instead of ifdefs.
tg3: Update version to 3.119
tg3: Apply rx_discards fix to 5719/5720
...
Fix up trivial conflicts in arch/x86/Kconfig and net/mac80211/agg-tx.c
as per Davem.
IP_ROUTE_CLASSID depends on INET and NET_CLS_ROUTE4 selects
IP_ROUTE_CLASSID, but when INET is not enabled, this kconfig warning
is produced, so fix it by making NET_CLS_ROUTE4 depend on INET.
warning: (NET_CLS_ROUTE4) selects IP_ROUTE_CLASSID which has unmet direct dependencies (NET && INET)
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is no callback of this module maybe queued
since we use kfree_rcu(), we can safely remove the rcu_barrier().
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
[PATCH 05/17] net,rcu: convert call_rcu(tcf_police_free_rcu) to kfree_rcu()
The rcu callback tcf_police_free_rcu() just calls a kfree(),
so we use kfree_rcu() instead of the call_rcu(tcf_police_free_rcu).
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
The rcu callback tcf_common_free_rcu() just calls a kfree(),
so we use kfree_rcu() instead of the call_rcu(tcf_common_free_rcu).
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Add const qualifiers to structs iphdr, ipv6hdr and in6_addr pointers
where possible, to make code intention more obvious.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is an implementation of the Quick Fair Queue scheduler developed
by Fabio Checconi. The same algorithm is already implemented in ipfw
in FreeBSD. Fabio had an earlier version developed on Linux, I just
cleaned it up. Thanks to Eric Dumazet for testing this under load.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The only necessary parts are the src/dst addresses, the
interface indexes, the TOS, and the mark.
The rest is unnecessary bloat, which amounts to nearly
50 bytes on 64-bit.
Signed-off-by: David S. Miller <davem@davemloft.net>
Because of various alignements [SLUB / qdisc], we use 512 bytes of
memory for one {p|b}fifo qdisc, instead of 256 bytes on 64bit arches and
192 bytes on 32bit ones.
Move the "u32 limit" inside "struct Qdisc" (no impact on other qdiscs)
Change qdisc_alloc(), first trying a regular allocation before an
oversized one.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>