With the new EDT model, sch_fq no longer has to special
case TCP pure acks, since their skb->tstamp will allow them
being sent without pacing delay.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
sk_pacing_rate has beed introduced as a u32 field in 2013,
effectively limiting per flow pacing to 34Gbit.
We believe it is time to allow TCP to pace high speed flows
on 64bit hosts, as we now can reach 100Gbit on one TCP flow.
This patch adds no cost for 32bit kernels.
The tcpi_pacing_rate and tcpi_max_pacing_rate were already
exported as 64bit, so iproute2/ss command require no changes.
Unfortunately the SO_MAX_PACING_RATE socket option will stay
32bit and we will need to add a new option to let applications
control high pacing rates.
State Recv-Q Send-Q Local Address:Port Peer Address:Port
ESTAB 0 1787144 10.246.9.76:49992 10.246.9.77:36741
timer:(on,003ms,0) ino:91863 sk:2 <->
skmem:(r0,rb540000,t66440,tb2363904,f605944,w1822984,o0,bl0,d0)
ts sack bbr wscale:8,8 rto:201 rtt:0.057/0.006 mss:1448
rcvmss:536 advmss:1448
cwnd:138 ssthresh:178 bytes_acked:256699822585 segs_out:177279177
segs_in:3916318 data_segs_out:177279175
bbr:(bw:31276.8Mbps,mrtt:0,pacing_gain:1.25,cwnd_gain:2)
send 28045.5Mbps lastrcv:73333
pacing_rate 38705.0Mbps delivery_rate 22997.6Mbps
busy:73333ms unacked:135 retrans:0/157 rcv_space:14480
notsent:2085120 minrtt:0.013
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In the recent TCP/EDT patch series, I switched TCP and sch_fq
clocks from MONOTONIC to TAI, in order to meet the choice done
earlier for sch_etf packet scheduler.
But sure enough, this broke some setups were the TAI clock
jumps forward (by almost 50 year...), as reported
by Leonard Crestez.
If we want to converge later, we'll probably need to add
an skb field to differentiate the clock bases, or a socket option.
In the meantime, an UDP application will need to use CLOCK_MONOTONIC
base for its SCM_TXTIME timestamps if using fq packet scheduler.
Fixes: 72b0094f91 ("tcp: switch tcp_clock_ns() to CLOCK_TAI base")
Fixes: 142537e419 ("net_sched: sch_fq: switch to CLOCK_TAI")
Fixes: fd2bca2aa7 ("tcp: switch internal pacing timer to CLOCK_TAI")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Leonard Crestez <leonard.crestez@nxp.com>
Tested-by: Leonard Crestez <leonard.crestez@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
With the earliest departure time model, we no longer plan
special casing TCP retransmits. We therefore remove dead
code (since most compilers understood skb_is_retransmit()
was false)
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
TCP keeps track of tcp_wstamp_ns by itself, meaning sch_fq
no longer has to do it.
Thanks to this model, TCP can get more accurate RTT samples,
since pacing no longer inflates them.
This has the nice effect of removing some delays caused by FQ
quantum mechanism, causing inflated max/P99 latencies.
Also we might relax TCP Small Queue tight limits in the future,
since this new model allow TCP to build bigger batches, since
sch_fq (or a device with earliest departure time offload) ensure
these packets will be delivered on time.
Note that other protocols are not converted (they will probably
never be) so sch_fq has still support for SO_MAX_PACING_RATE
Tested:
Test showing FQ pacing quantum artifact for low-rate flows,
adding unexpected throttles for RPC flows, inflating max and P99 latencies.
The parameters chosen here are to show what happens typically when
a TCP flow has a reduced pacing rate (this can be caused by a reduced
cwin after few losses, or/and rtt above few ms)
MIBS="MIN_LATENCY,MEAN_LATENCY,MAX_LATENCY,P99_LATENCY,STDDEV_LATENCY"
Before :
$ netperf -H 10.246.7.133 -t TCP_RR -Cc -T6,6 -- -q 2000000 -r 100,100 -o $MIBS
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.246.7.133 () port 0 AF_INET : first burst 0 : cpu bind
Minimum Latency Microseconds,Mean Latency Microseconds,Maximum Latency Microseconds,99th Percentile Latency Microseconds,Stddev Latency Microseconds
19,82.78,5279,3825,482.02
After :
$ netperf -H 10.246.7.133 -t TCP_RR -Cc -T6,6 -- -q 2000000 -r 100,100 -o $MIBS
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.246.7.133 () port 0 AF_INET : first burst 0 : cpu bind
Minimum Latency Microseconds,Mean Latency Microseconds,Maximum Latency Microseconds,99th Percentile Latency Microseconds,Stddev Latency Microseconds
20,49.94,128,63,3.18
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
TCP will soon provide per skb->tstamp with earliest departure time,
so that sch_fq does not have to determine departure time by looking
at socket sk_pacing_rate.
We chose in linux-4.19 CLOCK_TAI as the clock base for transports,
qdiscs, and NIC offloads.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
An SKB is not on a list if skb->next is NULL.
Codify this convention into a helper function and use it
where we are dequeueing an SKB and need to mark it as such.
Signed-off-by: David S. Miller <davem@davemloft.net>
Normally, a socket can not be freed/reused unless all its TX packets
left qdisc and were TX-completed. However connect(AF_UNSPEC) allows
this to happen.
With commit fc59d5bdf1 ("pkt_sched: fq: clear time_next_packet for
reused flows") we cleared f->time_next_packet but took no special
action if the flow was still in the throttled rb-tree.
Since f->time_next_packet is the key used in the rb-tree searches,
blindly clearing it might break rb-tree integrity. We need to make
sure the flow is no longer in the rb-tree to avoid this problem.
Fixes: fc59d5bdf1 ("pkt_sched: fq: clear time_next_packet for reused flows")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds extack support for change callback for qdisc ops
structtur to prepare per-qdisc specific changes for extack.
Cc: David Ahern <dsahern@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Alexander Aring <aring@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds extack support for init callback to prepare per-qdisc
specific changes for extack.
Cc: David Ahern <dsahern@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Alexander Aring <aring@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
__GFP_REPEAT was designed to allow retry-but-eventually-fail semantic to
the page allocator. This has been true but only for allocations
requests larger than PAGE_ALLOC_COSTLY_ORDER. It has been always
ignored for smaller sizes. This is a bit unfortunate because there is
no way to express the same semantic for those requests and they are
considered too important to fail so they might end up looping in the
page allocator for ever, similarly to GFP_NOFAIL requests.
Now that the whole tree has been cleaned up and accidental or misled
usage of __GFP_REPEAT flag has been removed for !costly requests we can
give the original flag a better name and more importantly a more useful
semantic. Let's rename it to __GFP_RETRY_MAYFAIL which tells the user
that the allocator would try really hard but there is no promise of a
success. This will work independent of the order and overrides the
default allocator behavior. Page allocator users have several levels of
guarantee vs. cost options (take GFP_KERNEL as an example)
- GFP_KERNEL & ~__GFP_RECLAIM - optimistic allocation without _any_
attempt to free memory at all. The most light weight mode which even
doesn't kick the background reclaim. Should be used carefully because
it might deplete the memory and the next user might hit the more
aggressive reclaim
- GFP_KERNEL & ~__GFP_DIRECT_RECLAIM (or GFP_NOWAIT)- optimistic
allocation without any attempt to free memory from the current
context but can wake kswapd to reclaim memory if the zone is below
the low watermark. Can be used from either atomic contexts or when
the request is a performance optimization and there is another
fallback for a slow path.
- (GFP_KERNEL|__GFP_HIGH) & ~__GFP_DIRECT_RECLAIM (aka GFP_ATOMIC) -
non sleeping allocation with an expensive fallback so it can access
some portion of memory reserves. Usually used from interrupt/bh
context with an expensive slow path fallback.
- GFP_KERNEL - both background and direct reclaim are allowed and the
_default_ page allocator behavior is used. That means that !costly
allocation requests are basically nofail but there is no guarantee of
that behavior so failures have to be checked properly by callers
(e.g. OOM killer victim is allowed to fail currently).
- GFP_KERNEL | __GFP_NORETRY - overrides the default allocator behavior
and all allocation requests fail early rather than cause disruptive
reclaim (one round of reclaim in this implementation). The OOM killer
is not invoked.
- GFP_KERNEL | __GFP_RETRY_MAYFAIL - overrides the default allocator
behavior and all allocation requests try really hard. The request
will fail if the reclaim cannot make any progress. The OOM killer
won't be triggered.
- GFP_KERNEL | __GFP_NOFAIL - overrides the default allocator behavior
and all allocation requests will loop endlessly until they succeed.
This might be really dangerous especially for larger orders.
Existing users of __GFP_REPEAT are changed to __GFP_RETRY_MAYFAIL
because they already had their semantic. No new users are added.
__alloc_pages_slowpath is changed to bail out for __GFP_RETRY_MAYFAIL if
there is no progress and we have already passed the OOM point.
This means that all the reclaim opportunities have been exhausted except
the most disruptive one (the OOM killer) and a user defined fallback
behavior is more sensible than keep retrying in the page allocator.
[akpm@linux-foundation.org: fix arch/sparc/kernel/mdesc.c]
[mhocko@suse.com: semantic fix]
Link: http://lkml.kernel.org/r/20170626123847.GM11534@dhcp22.suse.cz
[mhocko@kernel.org: address other thing spotted by Vlastimil]
Link: http://lkml.kernel.org/r/20170626124233.GN11534@dhcp22.suse.cz
Link: http://lkml.kernel.org/r/20170623085345.11304-3-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Alex Belits <alex.belits@cavium.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: David Daney <david.daney@cavium.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: NeilBrown <neilb@suse.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
BBR congestion control depends on pacing, and pacing is
currently handled by sch_fq packet scheduler for performance reasons,
and also because implemening pacing with FQ was convenient to truly
avoid bursts.
However there are many cases where this packet scheduler constraint
is not practical.
- Many linux hosts are not focusing on handling thousands of TCP
flows in the most efficient way.
- Some routers use fq_codel or other AQM, but still would like
to use BBR for the few TCP flows they initiate/terminate.
This patch implements an automatic fallback to internal pacing.
Pacing is requested either by BBR or use of SO_MAX_PACING_RATE option.
If sch_fq happens to be in the egress path, pacing is delegated to
the qdisc, otherwise pacing is done by TCP itself.
One advantage of pacing from TCP stack is to get more precise rtt
estimations, and less work done from TX completion, since TCP Small
queue limits are not generally hit. Setups with single TX queue but
many cpus might even benefit from this.
Note that unlike sch_fq, we do not take into account header sizes.
Taking care of these headers would add additional complexity for
no practical differences in behavior.
Some performance numbers using 800 TCP_STREAM flows rate limited to
~48 Mbit per second on 40Gbit NIC.
If MQ+pfifo_fast is used on the NIC :
$ sar -n DEV 1 5 | grep eth
14:48:44 eth0 725743.00 2932134.00 46776.76 4335184.68 0.00 0.00 1.00
14:48:45 eth0 725349.00 2932112.00 46751.86 4335158.90 0.00 0.00 0.00
14:48:46 eth0 725101.00 2931153.00 46735.07 4333748.63 0.00 0.00 0.00
14:48:47 eth0 725099.00 2931161.00 46735.11 4333760.44 0.00 0.00 1.00
14:48:48 eth0 725160.00 2931731.00 46738.88 4334606.07 0.00 0.00 0.00
Average: eth0 725290.40 2931658.20 46747.54 4334491.74 0.00 0.00 0.40
$ vmstat 1 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
4 0 0 259825920 45644 2708324 0 0 21 2 247 98 0 0 100 0 0
4 0 0 259823744 45644 2708356 0 0 0 0 2400825 159843 0 19 81 0 0
0 0 0 259824208 45644 2708072 0 0 0 0 2407351 159929 0 19 81 0 0
1 0 0 259824592 45644 2708128 0 0 0 0 2405183 160386 0 19 80 0 0
1 0 0 259824272 45644 2707868 0 0 0 32 2396361 158037 0 19 81 0 0
Now use MQ+FQ :
lpaa23:~# echo fq >/proc/sys/net/core/default_qdisc
lpaa23:~# tc qdisc replace dev eth0 root mq
$ sar -n DEV 1 5 | grep eth
14:49:57 eth0 678614.00 2727930.00 43739.13 4033279.14 0.00 0.00 0.00
14:49:58 eth0 677620.00 2723971.00 43674.69 4027429.62 0.00 0.00 1.00
14:49:59 eth0 676396.00 2719050.00 43596.83 4020125.02 0.00 0.00 0.00
14:50:00 eth0 675197.00 2714173.00 43518.62 4012938.90 0.00 0.00 1.00
14:50:01 eth0 676388.00 2719063.00 43595.47 4020171.64 0.00 0.00 0.00
Average: eth0 676843.00 2720837.40 43624.95 4022788.86 0.00 0.00 0.40
$ vmstat 1 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
2 0 0 259832240 46008 2710912 0 0 21 2 223 192 0 1 99 0 0
1 0 0 259832896 46008 2710744 0 0 0 0 1702206 198078 0 17 82 0 0
0 0 0 259830272 46008 2710596 0 0 0 0 1696340 197756 1 17 83 0 0
4 0 0 259829168 46024 2710584 0 0 16 0 1688472 197158 1 17 82 0 0
3 0 0 259830224 46024 2710408 0 0 0 0 1692450 197212 0 18 82 0 0
As expected, number of interrupts per second is very different.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Van Jacobson <vanj@google.com>
Cc: Jerry Chu <hkchu@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
fq_alloc_node, alloc_netdev_mqs and netif_alloc* open code kmalloc with
vmalloc fallback. Use the kvmalloc variant instead. Keep the
__GFP_REPEAT flag based on explanation from Eric:
"At the time, tests on the hardware I had in my labs showed that
vmalloc() could deliver pages spread all over the memory and that was
a small penalty (once memory is fragmented enough, not at boot time)"
The way how the code is constructed means, however, that we prefer to go
and hit the OOM killer before we fall back to the vmalloc for requests
<=32kB (with 4kB pages) in the current code. This is rather disruptive
for something that can be achived with the fallback. On the other hand
__GFP_REPEAT doesn't have any useful semantic for these requests. So
the effect of this patch is that requests which fit into 32kB will fall
back to vmalloc easier now.
Link: http://lkml.kernel.org/r/20170306103327.2766-3-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Eric Dumazet <edumazet@google.com>
Cc: David Miller <davem@davemloft.net>
Cc: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pass the new extended ACK reporting struct to all of the generic
netlink parsing functions. For now, pass NULL in almost all callers
(except for some in the core.)
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
To make the code clearer, use rb_entry() instead of container_of() to
deal with rbtree.
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When I wrote sch_fq.c, hash_ptr() on 64bit arches was awful,
and I chose hash_32().
Linus Torvalds and George Spelvin fixed this issue, so we can
use hash_ptr() to get more entropy on 64bit arches with Terabytes
of memory, and avoid the cast games.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It looks like the following patch can make FQ very precise, even in VM
or stressed hosts. It matters at high pacing rates.
We take into account the difference between the time that was programmed
when last packet was sent, and current time (a drift of tens of usecs is
often observed)
Add an EWMA of the unthrottle latency to help diagnostics.
This latency is the difference between current time and oldest packet in
delayed RB-tree. This accounts for the high resolution timer latency,
but can be different under stress, as fq_check_throttled() can be
opportunistically be called from a dequeue() called after an enqueue()
for a different flow.
Tested:
// Start a 10Gbit flow
$ netperf --google-pacing-rate 1250000000 -H lpaa24 -l 10000 -- -K bbr &
Before patch :
$ sar -n DEV 10 5 | grep eth0 | grep Average
Average: eth0 17106.04 756876.84 1102.75 1119049.02 0.00 0.00 0.52
After patch :
$ sar -n DEV 10 5 | grep eth0 | grep Average
Average: eth0 17867.00 800245.90 1151.77 1183172.12 0.00 0.00 0.52
A new iproute2 tc can output the 'unthrottle latency' :
$ tc -s qd sh dev eth0 | grep latency
0 gc, 0 highprio, 32490767 throttled, 2382 ns latency
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This commit adds to the fq module a low_rate_threshold parameter to
insert a delay after all packets if the socket requests a pacing rate
below the threshold.
This helps achieve more precise control of the sending rate with
low-rate paths, especially policers. The basic issue is that if a
congestion control module detects a policer at a certain rate, it may
want fq to be able to shape to that policed rate. That way the sender
can avoid policer drops by having the packets arrive at the policer at
or just under the policed rate.
The default threshold of 550Kbps was chosen analytically so that for
policers or links at 500Kbps or 512Kbps fq would very likely invoke
this mechanism, even if the pacing rate was briefly slightly above the
available bandwidth. This value was then empirically validated with
two years of production testing on YouTube video servers.
Signed-off-by: Van Jacobson <vanj@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Nandita Dukkipati <nanditad@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When fq is used on 32bit kernels, we need to lock the qdisc before
copying 64bit fields.
Otherwise "tc -s qdisc ..." might report bogus values.
Fixes: afe4fd0624 ("pkt_sched: fq: Fair Queue packet scheduler")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Qdisc performance suffers when packets are dropped at enqueue()
time because drops (kfree_skb()) are done while qdisc lock is held,
delaying a dequeue() draining the queue.
Nominal throughput can be reduced by 50 % when this happens,
at a time we would like the dequeue() to proceed as fast as possible.
Even FQ is vulnerable to this problem, while one of FQ goals was
to provide some flow isolation.
This patch adds a 'struct sk_buff **to_free' parameter to all
qdisc->enqueue(), and in qdisc_drop() helper.
I measured a performance increase of up to 12 %, but this patch
is a prereq so that future batches in enqueue() can fly.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Both fq_change() and fq_reset() can use rtnl_kfree_skbs()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
__QDISC_STATE_THROTTLED bit manipulation is rather expensive
for HTB and few others.
I already removed it for sch_fq in commit f2600cf02b
("net: sched: avoid costly atomic operation in fq_dequeue()")
and so far nobody complained.
When one ore more packets are stuck in one or more throttled
HTB class, a htb dequeue() performs two atomic operations
to clear/set __QDISC_STATE_THROTTLED bit, while root qdisc
lock is held.
Removing this pair of atomic operations bring me a 8 % performance
increase on 200 TCP_RR tests, in presence of throttled classes.
This patch has no side effect, since nothing actually uses
disc_is_throttled() anymore.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When the bottom qdisc decides to, for example, drop some packet,
it calls qdisc_tree_decrease_qlen() to update the queue length
for all its ancestors, we need to update the backlog too to
keep the stats on root qdisc accurate.
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
selinux needs few changes to accommodate fact that SYNACK messages
can be attached to a request socket, lacking sk_security pointer
(Only syncookies are still attached to a TCP_LISTEN socket)
Adds a new sk_listener() helper, and use it in selinux and sch_fq
Fixes: ca6fb06518 ("tcp: attach SYNACK messages to request sockets instead of listener")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported by: kernel test robot <ying.huang@linux.intel.com>
Cc: Paul Moore <paul@paul-moore.com>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Cc: Eric Paris <eparis@parisplace.org>
Acked-by: Paul Moore <paul@paul-moore.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If a listen backlog is very big (to avoid syncookies), then
the listener sk->sk_wmem_alloc is the main source of false
sharing, as we need to touch it twice per SYNACK re-transmit
and TX completion.
(One SYN packet takes listener lock once, but up to 6 SYNACK
are generated)
By attaching the skb to the request socket, we remove this
source of contention.
Tested:
listen(fd, 10485760); // single listener (no SO_REUSEPORT)
16 RX/TX queue NIC
Sustain a SYNFLOOD attack of ~320,000 SYN per second,
Sending ~1,400,000 SYNACK per second.
Perf profiles now show listener spinlock being next bottleneck.
20.29% [kernel] [k] queued_spin_lock_slowpath
10.06% [kernel] [k] __inet_lookup_established
5.12% [kernel] [k] reqsk_timer_handler
3.22% [kernel] [k] get_next_timer_interrupt
3.00% [kernel] [k] tcp_make_synack
2.77% [kernel] [k] ipt_do_table
2.70% [kernel] [k] run_timer_softirq
2.50% [kernel] [k] ip_finish_output
2.04% [kernel] [k] cascade
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Correct spelling of locally.
Also remove extra space before tab character in struct fq_flow.
Signed-off-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/vxlan.c
drivers/vhost/net.c
include/linux/if_vlan.h
net/core/dev.c
The net/core/dev.c conflict was the overlap of one commit marking an
existing function static whilst another was adding a new function.
In the include/linux/if_vlan.h case, the type used for a local
variable was changed in 'net', whereas the function got rewritten
to fix a stacked vlan bug in 'net-next'.
In drivers/vhost/net.c, Al Viro's iov_iter conversions in 'net-next'
overlapped with an endainness fix for VHOST 1.0 in 'net'.
In drivers/net/vxlan.c, vxlan_find_vni() added a 'flags' parameter
in 'net-next' whereas in 'net' there was a bug fix to pass in the
correct network namespace pointer in calls to this function.
Signed-off-by: David S. Miller <davem@davemloft.net>
FQ has a fast path for skb attached to a socket, as it does not
have to compute a flow hash. But for other packets, FQ being non
stochastic means that hosts exposed to random Internet traffic
can allocate million of flows structure (104 bytes each) pretty
easily. Not only host can OOM, but lookup in RB trees can take
too much cpu and memory resources.
This patch adds a new attribute, orphan_mask, that is adding
possibility of having a stochastic hash for orphaned skb.
Its default value is 1024 slots, to mimic SFQ behavior.
Note: This does not apply to locally generated TCP traffic,
and no locally generated traffic will share a flow structure
with another perfect or stochastic flow.
This patch also handles the specific case of SYNACK messages:
They are attached to the listener socket, and therefore all map
to a single hash bucket. If listener have set SO_MAX_PACING_RATE,
hoping to have new accepted socket inherit this rate, SYNACK
might be paced and even dropped.
This is very similar to an internal patch Google have used more
than one year.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When we added pacing to TCP, we decided to let sch_fq take care
of actual pacing.
All TCP had to do was to compute sk->pacing_rate using simple formula:
sk->pacing_rate = 2 * cwnd * mss / rtt
It works well for senders (bulk flows), but not very well for receivers
or even RPC :
cwnd on the receiver can be less than 10, rtt can be around 100ms, so we
can end up pacing ACK packets, slowing down the sender.
Really, only the sender should pace, according to its own logic.
Instead of adding a new bit in skb, or call yet another flow
dissection, we tweak skb->truesize to a small value (2), and
we instruct sch_fq to use new helper and not pace pure ack.
Note this also helps TCP small queue, as ack packets present
in qdisc/NIC do not prevent sending a data packet (RPC workload)
This helps to reduce tx completion overhead, ack packets can use regular
sock_wfree() instead of tcp_wfree() which is a bit more expensive.
This has no impact in the case packets are sent to loopback interface,
as we do not coalesce ack packets (were we would detect skb->truesize
lie)
In case netem (with a delay) is used, skb_orphan_partial() also sets
skb->truesize to 1.
This patch is a combination of two patches we used for about one year at
Google.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Configuring fq with quantum 0 hangs the system, presumably because of a
non-interruptible infinite loop. Either way quantum 0 does not make sense.
Reproduce with:
sudo tc qdisc add dev lo root fq quantum 0 initial_quantum 0
ping 127.0.0.1
Signed-off-by: Kenneth Klette Jonassen <kennetkl@ifi.uio.no>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
TIME_WAIT sockets are not owning any skb.
ip_send_unicast_reply() and tcp_v6_send_response() both use
regular sockets.
We can safely remove a test in sch_fq and save one cache line miss,
as sk_state is far away from sk_pacing_rate.
Tested at Google for about one year.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
FQ/pacing has a clamp of delay of 125 ms, to avoid some possible harm.
It turns out this delay is too small to allow pacing low rates :
Some ISP setup very aggressive policers as low as 16kbit.
Now TCP stack has spurious rtx prevention, it seems safe to increase
this fixed parameter, without adding a qdisc attribute.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Standard qdisc API to setup a timer implies an atomic operation on every
packet dequeue : qdisc_unthrottled()
It turns out this is not really needed for FQ, as FQ has no concept of
global qdisc throttling, being a qdisc handling many different flows,
some of them can be throttled, while others are not.
Fix is straightforward : add a 'bool throttle' to
qdisc_watchdog_schedule_ns(), and remove calls to qdisc_unthrottled()
in sch_fq.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This adds helpers to manipulate qstats logic and replaces locations
that touch the counters directly. This simplifies future patches
to push qstats onto per cpu counters.
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ktime_get_ns() replaces ktime_to_ns(ktime_get())
ktime_get_real_ns() replaces ktime_to_ns(ktime_get_real())
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It is available since v3.15-rc5.
Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
nla_nest_end() already has return skb->len, so replace
return skb->len with return nla_nest_end instead().
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Resizing fq hash table allocates memory while holding qdisc spinlock,
with BH disabled.
This is definitely not good, as allocation might sleep.
We can drop the lock and get it when needed, we hold RTNL so no other
changes can happen at the same time.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Fixes: afe4fd0624 ("pkt_sched: fq: Fair Queue packet scheduler")
Signed-off-by: David S. Miller <davem@davemloft.net>
Changing name of function as part of making the hash in skbuff to be
generic property, not just for receive path.
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch brings NUMA support and automatic fallback to vmalloc()
in case kmalloc() failed to allocate FQ hash table.
NUMA support depends on XPS being setup for the device before
qdisc allocation. After a XPS change, it might be worth creating
qdisc hierarchy again.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
For performance reasons, sch_fq tried hard to not setup timers for every
sent packet, using a quantum based heuristic : A delay is setup only if
the flow exhausted its credit.
Problem is that application limited flows can refill their credit
for every queued packet, and they can evade pacing.
This problem can also be triggered when TCP flows use small MSS values,
as TSO auto sizing builds packets that are smaller than the default fq
quantum (3028 bytes)
This patch adds a 40 ms delay to guard flow credit refill.
Fixes: afe4fd0624 ("pkt_sched: fq: Fair Queue packet scheduler")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Maciej Żenczykowski <maze@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 7eec4174ff ("pkt_sched: fq: fix non TCP flows pacing")
obsoleted TCA_FQ_FLOW_DEFAULT_RATE without notice for the users.
Suggested by David Miller
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Initial sch_fq implementation copied code from pfifo_fast to classify
a packet as a high prio packet.
This clashes with setups using PRIO with say 7 bands, as one of the
band could be incorrectly (mis)classified by FQ.
Packets would be queued in the 'internal' queue, and no pacing ever
happen for this special queue.
Fixes: afe4fd0624 ("pkt_sched: fq: Fair Queue packet scheduler")
Signed-off-by: Maciej Żenczykowski <maze@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Stephen Hemminger <stephen@networkplumber.org>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When a socket is freed/reallocated, we need to clear time_next_packet
or else we can inherit a prior value and delay first packets of the
new flow.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Steinar reported FQ pacing was not working for UDP flows.
It looks like the initial sk->sk_pacing_rate value of 0 was
a wrong choice. We should init it to ~0U (unlimited)
Then, TCA_FQ_FLOW_DEFAULT_RATE should be removed because it makes
no real sense. The default rate is really unlimited, and we
need to avoid a zero divide.
Reported-by: Steinar H. Gunderson <sesse@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
TCA_FQ_INITIAL_QUANTUM should set q->initial_quantum
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
fq_reset() should drops all packets in queue, including
throttled flows.
This patch moves code from fq_destroy() to fq_reset()
to do the cleaning.
fq_change() must stop calling fq_dequeue() if all remaining
packets are from throttled flows.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
kbuild bot reported following m68k build error :
net/sched/sch_fq.c: In function 'fq_dequeue':
>> net/sched/sch_fq.c:491:2: error: implicit declaration of function
'prefetch' [-Werror=implicit-function-declaration]
cc1: some warnings being treated as errors
While we are fixing this, move this prefetch() call a bit earlier.
Reported-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
- Uses perfect flow match (not stochastic hash like SFQ/FQ_codel)
- Uses the new_flow/old_flow separation from FQ_codel
- New flows get an initial credit allowing IW10 without added delay.
- Special FIFO queue for high prio packets (no need for PRIO + FQ)
- Uses a hash table of RB trees to locate the flows at enqueue() time
- Smart on demand gc (at enqueue() time, RB tree lookup evicts old
unused flows)
- Dynamic memory allocations.
- Designed to allow millions of concurrent flows per Qdisc.
- Small memory footprint : ~8K per Qdisc, and 104 bytes per flow.
- Single high resolution timer for throttled flows (if any).
- One RB tree to link throttled flows.
- Ability to have a max rate per flow. We might add a socket option
to add per socket limitation.
Attempts have been made to add TCP pacing in TCP stack, but this
seems to add complex code to an already complex stack.
TCP pacing is welcomed for flows having idle times, as the cwnd
permits TCP stack to queue a possibly large number of packets.
This removes the 'slow start after idle' choice, hitting badly
large BDP flows, and applications delivering chunks of data
as video streams.
Nicely spaced packets :
Here interface is 10Gbit, but flow bottleneck is ~20Mbit
cwin is big, yet FQ avoids the typical bursts generated by TCP
(as in netperf TCP_RR -- -r 100000,100000)
15:01:23.545279 IP A > B: . 78193:81089(2896) ack 65248 win 3125 <nop,nop,timestamp 1115 11597805>
15:01:23.545394 IP B > A: . ack 81089 win 3668 <nop,nop,timestamp 11597985 1115>
15:01:23.546488 IP A > B: . 81089:83985(2896) ack 65248 win 3125 <nop,nop,timestamp 1115 11597805>
15:01:23.546565 IP B > A: . ack 83985 win 3668 <nop,nop,timestamp 11597986 1115>
15:01:23.547713 IP A > B: . 83985:86881(2896) ack 65248 win 3125 <nop,nop,timestamp 1115 11597805>
15:01:23.547778 IP B > A: . ack 86881 win 3668 <nop,nop,timestamp 11597987 1115>
15:01:23.548911 IP A > B: . 86881:89777(2896) ack 65248 win 3125 <nop,nop,timestamp 1115 11597805>
15:01:23.548949 IP B > A: . ack 89777 win 3668 <nop,nop,timestamp 11597988 1115>
15:01:23.550116 IP A > B: . 89777:92673(2896) ack 65248 win 3125 <nop,nop,timestamp 1115 11597805>
15:01:23.550182 IP B > A: . ack 92673 win 3668 <nop,nop,timestamp 11597989 1115>
15:01:23.551333 IP A > B: . 92673:95569(2896) ack 65248 win 3125 <nop,nop,timestamp 1115 11597805>
15:01:23.551406 IP B > A: . ack 95569 win 3668 <nop,nop,timestamp 11597991 1115>
15:01:23.552539 IP A > B: . 95569:98465(2896) ack 65248 win 3125 <nop,nop,timestamp 1115 11597805>
15:01:23.552576 IP B > A: . ack 98465 win 3668 <nop,nop,timestamp 11597992 1115>
15:01:23.553756 IP A > B: . 98465:99913(1448) ack 65248 win 3125 <nop,nop,timestamp 1115 11597805>
15:01:23.554138 IP A > B: P 99913:100001(88) ack 65248 win 3125 <nop,nop,timestamp 1115 11597805>
15:01:23.554204 IP B > A: . ack 100001 win 3668 <nop,nop,timestamp 11597993 1115>
15:01:23.554234 IP B > A: . 65248:68144(2896) ack 100001 win 3668 <nop,nop,timestamp 11597993 1115>
15:01:23.555620 IP B > A: . 68144:71040(2896) ack 100001 win 3668 <nop,nop,timestamp 11597993 1115>
15:01:23.557005 IP B > A: . 71040:73936(2896) ack 100001 win 3668 <nop,nop,timestamp 11597993 1115>
15:01:23.558390 IP B > A: . 73936:76832(2896) ack 100001 win 3668 <nop,nop,timestamp 11597993 1115>
15:01:23.559773 IP B > A: . 76832:79728(2896) ack 100001 win 3668 <nop,nop,timestamp 11597993 1115>
15:01:23.561158 IP B > A: . 79728:82624(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115>
15:01:23.562543 IP B > A: . 82624:85520(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115>
15:01:23.563928 IP B > A: . 85520:88416(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115>
15:01:23.565313 IP B > A: . 88416:91312(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115>
15:01:23.566698 IP B > A: . 91312:94208(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115>
15:01:23.568083 IP B > A: . 94208:97104(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115>
15:01:23.569467 IP B > A: . 97104:100000(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115>
15:01:23.570852 IP B > A: . 100000:102896(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115>
15:01:23.572237 IP B > A: . 102896:105792(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115>
15:01:23.573639 IP B > A: . 105792:108688(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115>
15:01:23.575024 IP B > A: . 108688:111584(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115>
15:01:23.576408 IP B > A: . 111584:114480(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115>
15:01:23.577793 IP B > A: . 114480:117376(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115>
TCP timestamps show that most packets from B were queued in the same ms
timeframe (TSval 1159799{3,4}), but FQ managed to send them right
in time to avoid a big burst.
In slow start or steady state, very few packets are throttled [1]
FQ gets a bunch of tunables as :
limit : max number of packets on whole Qdisc (default 10000)
flow_limit : max number of packets per flow (default 100)
quantum : the credit per RR round (default is 2 MTU)
initial_quantum : initial credit for new flows (default is 10 MTU)
maxrate : max per flow rate (default : unlimited)
buckets : number of RB trees (default : 1024) in hash table.
(consumes 8 bytes per bucket)
[no]pacing : disable/enable pacing (default is enable)
All of them can be changed on a live qdisc.
$ tc qd add dev eth0 root fq help
Usage: ... fq [ limit PACKETS ] [ flow_limit PACKETS ]
[ quantum BYTES ] [ initial_quantum BYTES ]
[ maxrate RATE ] [ buckets NUMBER ]
[ [no]pacing ]
$ tc -s -d qd
qdisc fq 8002: dev eth0 root refcnt 32 limit 10000p flow_limit 100p buckets 256 quantum 3028 initial_quantum 15140
Sent 216532416 bytes 148395 pkt (dropped 0, overlimits 0 requeues 14)
backlog 0b 0p requeues 14
511 flows, 511 inactive, 0 throttled
110 gc, 0 highprio, 0 retrans, 1143 throttled, 0 flows_plimit
[1] Except if initial srtt is overestimated, as if using
cached srtt in tcp metrics. We'll provide a fix for this issue.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>