e3118e8359
This work adds the DataCenter TCP (DCTCP) congestion control algorithm [1], which was first published at SIGCOMM 2010 [2], with a follow-up analysis at SIGMETRICS 2011 [3] (and, more recently, an informational IETF draft available at [4]). DCTCP is an enhancement to the TCP congestion control algorithm for data center networks.

Typical data center workloads are, for example:

  i)   partition/aggregate (queries; bursty, delay sensitive),
  ii)  short messages, e.g. 50KB-1MB (coordination and control state;
       delay sensitive), and
  iii) large flows, e.g. 1MB-100MB (data updates; throughput sensitive).

DCTCP has therefore been designed for such environments to meet the following three requirements:

  * High burst tolerance (incast due to partition/aggregate)
  * Low latency (short flows, queries)
  * High throughput (continuous data updates, large file transfers)
    with commodity, shallow-buffered switches

The basic idea of its design rests on two fundamentals: i) on the switch side, packets are marked when the internal queue length exceeds a threshold K (K is chosen so that enough headroom for marked traffic remains available in the switch queue); ii) the sender/host side maintains a moving average of the fraction of marked packets, so each RTT, F is updated as follows:

  F := X / Y, where X is # of marked ACKs, Y is total # of ACKs
  alpha := (1 - g) * alpha + g * F, where g is a smoothing constant

The resulting alpha (in other words, an estimate of the probability that the switch queue is congested) is then used to adaptively decrease the congestion window W:

  W := (1 - (alpha / 2)) * W

The means of marking packets on the switch side, and of receiving those marks on the sender side, is ECN. RFC 3168 describes a mechanism for using Explicit Congestion Notification from the switch for early detection of congestion, rather than waiting for segment loss to occur. However, this method only detects the *presence* of congestion, not its *extent*. In the presence of mild congestion, it reduces the TCP congestion window too aggressively and unnecessarily affects the throughput of long flows [4].

DCTCP, as mentioned, enhances ECN processing to estimate the fraction of bytes that encounter congestion, rather than simply detecting that some congestion has occurred. DCTCP then scales the TCP congestion window based on this estimate [4]; it thus derives multibit feedback from the information present in the single-bit sequence of marks in its control law, and acts in *proportion* to the extent of congestion, not merely its presence. Switches therefore set the Congestion Experienced (CE) codepoint in packets when internal queue lengths exceed the threshold K.
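To make the control law concrete, here is a rough, self-contained C sketch of the per-RTT update; this is not the kernel module itself, and the fixed-point scale, names, and per-RTT hook are illustrative simplifications (g = 1/16 as in the paper):

  #include <stdint.h>
  #include <stdio.h>

  #define ALPHA_SCALE   1024U  /* alpha is kept in fixed point */
  #define DCTCP_SHIFT_G 4      /* g = 1/16 */

  struct dctcp_state {
          uint32_t alpha;      /* scaled by ALPHA_SCALE */
  };

  /* Called once per RTT with the ACK counts observed in that window:
   * alpha := (1 - g) * alpha + g * F, with F = X / Y.
   */
  static void dctcp_update_alpha(struct dctcp_state *s,
                                 uint32_t marked_acks, uint32_t total_acks)
  {
          uint32_t f = total_acks ?
                       (marked_acks * ALPHA_SCALE) / total_acks : 0;

          s->alpha = s->alpha - (s->alpha >> DCTCP_SHIFT_G) +
                     (f >> DCTCP_SHIFT_G);
  }

  /* W := (1 - alpha / 2) * W on a congestion signal, floored at two
   * segments so the connection keeps making progress.
   */
  static uint32_t dctcp_cwnd_reduce(const struct dctcp_state *s,
                                    uint32_t cwnd)
  {
          uint32_t w = cwnd - (cwnd * (s->alpha >> 1)) / ALPHA_SCALE;

          return w < 2 ? 2 : w;
  }

  int main(void)
  {
          struct dctcp_state s = { .alpha = 0 };
          uint32_t cwnd = 100;

          /* e.g. 30 of 100 ACKs in this RTT carried CE echoes */
          dctcp_update_alpha(&s, 30, 100);
          cwnd = dctcp_cwnd_reduce(&s, cwnd);
          printf("alpha = %u/%u, cwnd = %u\n", s.alpha, ALPHA_SCALE, cwnd);
          return 0;
  }

Note how a fully marked window (alpha near ALPHA_SCALE) approaches the classic halving of the window, while a lightly marked window barely reduces it; this is the proportional reaction described above.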
As a result, DCTCP delivers the same or better throughput than normal TCP, while using 90% less buffer space. It was found in [2] that DCTCP enables applications to handle 10x the current background traffic without impacting foreground traffic. Moreover, a 10x increase in foreground traffic did not cause any timeouts, thus largely eliminating TCP incast collapse problems. The algorithm itself has already seen deployments in large production data centers since then.

We did a long-term stress-test and analysis in a data center; a short summary of our TCP incast tests with iperf, compared to CUBIC, follows.

This test measured DCTCP throughput and latency and compared them with CUBIC throughput and latency for an incast scenario. In this test, 19 senders sent at maximum rate to a single receiver. The receiver simply ran iperf -s. The senders ran iperf -c <receiver> -t 30. All senders started simultaneously (using local clocks synchronized by ntp). The test was repeated multiple times. Below are the results from a single test; other tests are similar. (DCTCP results were extremely consistent, CUBIC results show some variance induced by the TCP timeouts that CUBIC encountered.) For this test, we report statistics on the number of TCP timeouts, flow throughput, and traffic latency.

1) Timeouts (total over all flows, and per-flow summaries):

             CUBIC      DCTCP
   Total     3227         25
   Mean       169.842      1.316
   Median     183          1
   Max        207          5
   Min        123          0
   Stddev      28.991      1.600

Timeout data was taken by measuring the net change in the "other TCP timeouts" counter reported by netstat -s. As a result, the timeout measurements above are not restricted to the test traffic, and we believe it is likely that all of the "DCTCP timeouts" are actually timeouts for non-test traffic. We report them nevertheless. CUBIC will also include some non-test timeouts, but they are dwarfed by bona fide test-traffic timeouts for CUBIC. Clearly DCTCP does an excellent job of preventing TCP timeouts: it reduces timeouts by at least two orders of magnitude and may well have eliminated them in this scenario.

2) Throughput (per flow in Mbps):

              CUBIC      DCTCP
   Mean      521.684    521.895
   Median    464        523
   Max       776        527
   Min       403        519
   Stddev    105.891      2.601
   Fairness    0.962      0.999

Throughput data was simply the average throughput for each flow as reported by iperf. By avoiding TCP timeouts, DCTCP achieves much better per-flow results. With CUBIC, many flows experience TCP timeouts, which makes flow throughput unpredictable and unfair; DCTCP, on the other hand, provides very clean, predictable throughput without incurring TCP timeouts. Thus, the standard deviation of CUBIC throughput is dramatically higher than that of DCTCP. Mean throughput is nearly identical because even though CUBIC flows suffer TCP timeouts, other flows step in and fill the unused bandwidth. Note that this test is something of a best-case scenario for incast under CUBIC: it allows other flows to fill in for flows experiencing a timeout. In situations where the receiver issues requests and then waits for all flows to complete, flows cannot fill in for timed-out flows and throughput will drop dramatically.
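The fairness line in the table above is consistent with Jain's fairness index, (sum x_i)^2 / (n * sum x_i^2), which is 1.0 when all n flows get equal throughput; the exact metric is an assumption here, as it is not named explicitly. A minimal sketch with hypothetical per-flow numbers:

  #include <stddef.h>
  #include <stdio.h>

  /* Jain's fairness index: (sum x)^2 / (n * sum x^2).
   * 1.0 = perfectly fair, 1/n = one flow takes everything.
   */
  static double jain_fairness(const double *tput, size_t n)
  {
          double sum = 0.0, sum_sq = 0.0;

          for (size_t i = 0; i < n; i++) {
                  sum += tput[i];
                  sum_sq += tput[i] * tput[i];
          }
          return (sum * sum) / ((double)n * sum_sq);
  }

  int main(void)
  {
          /* hypothetical per-flow throughputs in Mbps */
          double spread[] = { 403, 464, 776, 555, 410 };
          double tight[]  = { 519, 523, 527, 521, 520 };

          printf("spread-out flows: %.3f\n", jain_fairness(spread, 5));
          printf("tight flows:      %.3f\n", jain_fairness(tight, 5));
          return 0;
  }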
3) Latency (in ms):

             CUBIC     DCTCP
   Mean      4.0088    0.04219
   Median    4.055     0.0395
   Max       4.2       0.085
   Min       3.32      0.028
   Stddev    0.1666    0.01064

Latency for each protocol was computed by running "ping -i 0.2 <receiver>" from a single sender to the receiver during the incast test. For DCTCP, "ping -Q 0x6 -i 0.2 <receiver>" was used to ensure that the ping traffic traversed the DCTCP queue and was not dropped when the queue size exceeded the marking threshold. The summary statistics above are over all ping measurements between the single sender/receiver pair.

The latency results for this test show a dramatic difference between CUBIC and DCTCP. CUBIC intentionally overflows the switch buffer, which incurs the maximum queue latency (more buffer memory leads to higher latency). DCTCP, on the other hand, deliberately attempts to keep queue occupancy low. The result is a two orders of magnitude reduction in latency with DCTCP, even on a switch with relatively little RAM. Switches with larger amounts of RAM will incur increasing amounts of latency for CUBIC, but not for DCTCP.

4) Convergence and stability test:

This test measured the time that DCTCP took to fairly redistribute bandwidth when a new flow commences. It also measured DCTCP's ability to remain stable at a fair bandwidth distribution. DCTCP is compared with CUBIC for this test.

At the commencement of this test, a single flow is sending at maximum rate (near 10 Gbps) to a single receiver. One second after that first flow commences, a new flow from a distinct server begins sending to the same receiver as the first flow. After the second flow has sent data for 10 seconds, the second flow is terminated. The first flow sends for an additional second. Ideally, the bandwidth would be evenly shared as soon as the second flow starts, and recover as soon as it stops.

The results of this test are shown below. Note that the flow bandwidths for the two flows were measured near the same time, but not simultaneously.

DCTCP performs nearly perfectly within the measurement limitations of this test: bandwidth is quickly distributed fairly between the two flows, remains stable throughout the duration of the test, and recovers quickly. CUBIC, in contrast, is slow to divide the bandwidth fairly and has trouble remaining stable.

              CUBIC                        DCTCP
   Seconds  Flow 1  Flow 2      Seconds  Flow 1  Flow 2
     0       9.93    0             0       9.92    0
     0.5     9.87    0             0.5     9.86    0
     1       8.73    2.25          1       6.46    4.88
     1.5     7.29    2.8           1.5     4.9     4.99
     2       6.96    3.1           2       4.92    4.94
     2.5     6.67    3.34          2.5     4.93    5
     3       6.39    3.57          3       4.92    4.99
     3.5     6.24    3.75          3.5     4.94    4.74
     4       6       3.94          4       5.34    4.71
     4.5     5.88    4.09          4.5     4.99    4.97
     5       5.27    4.98          5       4.83    5.01
     5.5     4.93    5.04          5.5     4.89    4.99
     6       4.9     4.99          6       4.92    5.04
     6.5     4.93    5.1           6.5     4.91    4.97
     7       4.28    5.8           7       4.97    4.97
     7.5     4.62    4.91          7.5     4.99    4.82
     8       5.05    4.45          8       5.16    4.76
     8.5     5.93    4.09          8.5     4.94    4.98
     9       5.73    4.2           9       4.92    5.02
     9.5     5.62    4.32          9.5     4.87    5.03
    10       6.12    3.2          10       4.91    5.01
    10.5     6.91    3.11         10.5     4.87    5.04
    11       8.48    0            11       8.49    4.94
    11.5     9.87    0            11.5     9.9     0

SYN/ACK ECT test:

This test demonstrates the importance of ECT on SYN and SYN-ACK packets by measuring the connection probability in the presence of competing flows for a DCTCP connection attempt *without* ECT in the SYN packet. The test was repeated five times for each number of competing flows.

   Competing Flows                 1 |    2 |    4 |    8 |   16
   -------------------------------------------------------------
   Mean Connection Probability     1 | 0.67 | 0.45 | 0.28 |    0
   Median Connection Probability   1 | 0.65 | 0.45 | 0.25 |    0

As the number of competing flows moves beyond one, the connection probability drops rapidly.

Enabling DCTCP with this patch requires the following steps. DCTCP must be running on both the sender and receiver side in your data center, i.e.:

  sysctl -w net.ipv4.tcp_congestion_control=dctcp

Also, ECN functionality must be enabled on all switches in your data center for DCTCP to work. The default ECN marking threshold (K) heuristic on the switch for DCTCP is, e.g., 20 packets (30KB) at 1Gbps and 65 packets (~100KB) at 10Gbps (K > 1/7 * C * RTT, [4]).

In the above tests, for each switch port, traffic was segregated into two queues. Any packet with a DSCP of 0x01 (or, equivalently, a TOS of 0x04) was placed into the DCTCP queue; all other packets were placed into the default drop-tail queue. For the DCTCP queue, RED/ECN marking was enabled with a marking threshold of 75 KB. For more details, we refer you to the paper [2], section 3.

There are no code changes required to applications running in user space. DCTCP has been implemented in full *isolation* of the rest of the TCP code as its own congestion control module, so that it can run without a need to expose code to the core of the TCP stack, and thus nothing changes for non-DCTCP users.
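While no application changes are required, an individual application can also opt in per socket via the standard TCP_CONGESTION socket option instead of the system-wide sysctl. A minimal sketch (error handling trimmed; the setsockopt call fails with ENOENT if the dctcp module is unavailable on the running kernel):

  #include <netinet/in.h>
  #include <netinet/tcp.h>
  #include <stdio.h>
  #include <string.h>
  #include <sys/socket.h>
  #include <unistd.h>

  int main(void)
  {
          int fd = socket(AF_INET, SOCK_STREAM, 0);
          if (fd < 0) {
                  perror("socket");
                  return 1;
          }

          /* select DCTCP for this socket only */
          const char ca[] = "dctcp";
          if (setsockopt(fd, IPPROTO_TCP, TCP_CONGESTION, ca, strlen(ca)) < 0)
                  perror("setsockopt(TCP_CONGESTION)");

          /* read back the congestion control algorithm actually in use */
          char buf[16];
          socklen_t len = sizeof(buf);
          if (getsockopt(fd, IPPROTO_TCP, TCP_CONGESTION, buf, &len) == 0)
                  printf("congestion control: %s\n", buf);

          close(fd);
          return 0;
  }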
Changes in the CA framework code are minimal, and the DCTCP algorithm operates on mechanisms that are already available in most silicon. The gain (dctcp_shift_g) is currently the fixed constant 1/16 from the paper, but we leave the user the option of carefully choosing a different value.

In case DCTCP is being used and ECN support on the peer side is off, DCTCP falls back after the 3WHS to operating in normal TCP Reno mode.

ss {-4,-6} -t -i diag interface:

  ... dctcp wscale:7,7 rto:203 rtt:2.349/0.026 mss:1448 cwnd:2054
  ssthresh:1102 ce_state 0 alpha 15 ab_ecn 0 ab_tot 735584
  send 10129.2Mbps pacing_rate 20254.1Mbps unacked:1822 retrans:0/15
  reordering:101 rcv_space:29200

  ... dctcp-reno wscale:7,7 rto:201 rtt:0.711/1.327 ato:40 mss:1448
  cwnd:10 ssthresh:1102 fallback_mode send 162.9Mbps
  pacing_rate 325.5Mbps rcv_rtt:1.5 rcv_space:29200

More information about DCTCP can be found in [1-4].

  [1] http://simula.stanford.edu/~alizade/Site/DCTCP.html
  [2] http://simula.stanford.edu/~alizade/Site/DCTCP_files/dctcp-final.pdf
  [3] http://simula.stanford.edu/~alizade/Site/DCTCP_files/dctcp_analysis-full.pdf
  [4] http://tools.ietf.org/html/draft-bensley-tcpm-dctcp-00

Joint work with Florian Westphal and Glenn Judd.

Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Glenn Judd <glenn.judd@morganstanley.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
sb1000 is a module network device driver for the General Instrument (also known as NextLevel) SURFboard1000 internal cable modem board. This is an ISA card used by a number of cable TV companies to provide cable modem access. It's a one-way, downstream-only cable modem, meaning that your upstream net link is provided by your regular phone modem.

This driver was written by Franco Venturi <fventuri@mediaone.net>. He deserves a great deal of thanks for this wonderful piece of code!

-----------------------------------------------------------------------------

Support for this device is now a part of the standard Linux kernel. The driver source code file is drivers/net/sb1000.c. In addition to this you will need:

1.) The "cmconfig" program. This is a utility which supplements "ifconfig" to configure the cable modem and network interface (usually called "cm0"); and

2.) Several PPP scripts which live in /etc/ppp to make connecting via your cable modem easy.

   These utilities can be obtained from:

      http://www.jacksonville.net/~fventuri/

   in Franco's original source code distribution .tar.gz file. Support for the sb1000 driver can be found at:

      http://web.archive.org/web/*/http://home.adelphia.net/~siglercm/sb1000.html
      http://web.archive.org/web/*/http://linuxpower.cx/~cable/

   along with these utilities.

3.) The standard isapnp tools. These are necessary to configure your SB1000 card at boot time (or afterwards by hand) since it's a PnP card. If you don't have these installed as a standard part of your Linux distribution, you can find them at:

      http://www.roestock.demon.co.uk/isapnptools/

   or check your Linux distribution binary CD or their web site. For help with isapnp, pnpdump, or /etc/isapnp.conf, go to:

      http://www.roestock.demon.co.uk/isapnptools/isapnpfaq.html

-----------------------------------------------------------------------------

To make the SB1000 card work, follow these steps:

1.) Run `make config', `make menuconfig', or `make xconfig', whichever you prefer, in the top kernel tree directory to set up your kernel configuration. Make sure to say "Y" to "Prompt for development drivers" and to say "M" to the sb1000 driver. Also say "Y" or "M" to all the standard networking questions to get TCP/IP and PPP networking support.

2.) *BEFORE* you build the kernel, edit drivers/net/sb1000.c. Make sure to redefine the value of READ_DATA_PORT to match the I/O address used by isapnp to access your PnP cards. This is the value of READPORT in /etc/isapnp.conf or given by the output of pnpdump.

3.) Build and install the kernel and modules as usual.

4.) Boot your new kernel following the usual procedures.

5.) Set up to configure the new SB1000 PnP card by capturing the output of "pnpdump" to a file and editing this file to set the correct I/O ports, IRQ, and DMA settings for all your PnP cards. Make sure none of the settings conflict with one another. Then test this configuration by running the "isapnp" command with your new config file as the input. Check for errors and fix as necessary. (As an aside, I use I/O ports 0x110 and 0x310 and IRQ 11 for my SB1000 card and these work well for me. YMMV.) Then save the finished config file as /etc/isapnp.conf for proper configuration on subsequent reboots.

6.) Download the original file sb1000-1.1.2.tar.gz from Franco's site or one of the others referenced above. As root, unpack it into a temporary directory and do a `make cmconfig' and then `install -c cmconfig /usr/local/sbin'. Don't do `make install' because it expects to find all the utilities built and ready for installation, not just cmconfig.
7.) As root, copy all the files under the ppp/ subdirectory in Franco's tar file into /etc/ppp, being careful not to overwrite any files that are already in there. Then modify ppp@gi-on to set the correct login name, phone number, and frequency for the cable modem. Also edit pap-secrets to specify your login name and password and any site-specific information you need.

8.) Be sure to modify /etc/ppp/firewall to use ipchains instead of the older ipfwadm commands from the 2.0.x kernels. There's a neat utility to convert ipfwadm commands to ipchains commands:

      http://users.dhp.com/~whisper/ipfwadm2ipchains/

   You may also wish to modify the firewall script to implement a different firewalling scheme.

9.) Start the PPP connection via the script /etc/ppp/ppp@gi-on. You must be root to do this. It's better to use a utility like sudo to execute frequently used commands like this with root permissions if possible. If you connect successfully, the cable modem interface will come up and you'll see a driver message like this at the console:

      cm0: sb1000 at (0x110,0x310), csn 1, S/N 0x2a0d16d8, IRQ 11.
      sb1000.c:v1.1.2 6/01/98 (fventuri@mediaone.net)

   The "ifconfig" command should show two new interfaces, ppp0 and cm0. The command "cmconfig cm0" will give you information about the cable modem interface.

10.) Try pinging a site via `ping -c 5 www.yahoo.com', for example. You should see packets received.

11.) If you can't get site names (like www.yahoo.com) to resolve into IP addresses (like 204.71.200.67), be sure your /etc/resolv.conf file has no syntax errors and has the right nameserver IP addresses in it. If this doesn't help, try something like `ping -c 5 204.71.200.67' to see if the networking is running but the DNS resolution is where the problem lies.

12.) If you still have problems, go to the support web sites mentioned above and read the information and documentation there.

-----------------------------------------------------------------------------

Common problems:

1.) Packets go out on the ppp0 interface but don't come back on the cm0 interface. It looks like I'm connected but I can't even ping any numerical IP addresses. (This happens predominantly on Debian systems due to a default boot-time configuration script.)

   Solution -- As root, `echo 0 > /proc/sys/net/ipv4/conf/cm0/rp_filter' so cm0 can share the same IP address as the ppp0 interface. Note that this command should probably be added to the /etc/ppp/cablemodem script *right between* the "/sbin/ifconfig" and "/sbin/cmconfig" commands. You may need to do this to /proc/sys/net/ipv4/conf/ppp0/rp_filter as well. If you do this to /proc/sys/net/ipv4/conf/default/rp_filter on each reboot (in rc.local or some such) then any interfaces can share the same IP addresses.
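   For illustration, the relevant part of /etc/ppp/cablemodem would then look something like the fragment below; the surrounding ifconfig/cmconfig lines are placeholders, since the exact arguments depend on your own setup:

      #!/bin/sh
      # ... existing setup ...
      /sbin/ifconfig cm0 up
      # let cm0 share the ppp0 IP address (disable reverse-path filtering)
      echo 0 > /proc/sys/net/ipv4/conf/cm0/rp_filter
      /sbin/cmconfig cm0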
2.) I get "unresolved symbol" error messages on executing `insmod sb1000.o'.

   Solution -- You probably have a non-matching kernel source tree and /usr/include/linux and /usr/include/asm header files. Make sure you install the correct versions of the header files in these two directories. Then rebuild and reinstall the kernel.

3.) When isapnp runs it reports an error, and my SB1000 card isn't working.

   Solution -- There's a problem with later versions of isapnp using the "(CHECK)" option in the lines that allocate the two I/O addresses for the SB1000 card. This first popped up on RH 6.0. Delete "(CHECK)" for the SB1000 I/O addresses. Make sure they don't conflict with any other pieces of hardware first! Then rerun isapnp and go from there.

4.) I can't execute the /etc/ppp/ppp@gi-on file.

   Solution -- As root, do `chmod ug+x /etc/ppp/ppp@gi-on'.

5.) The firewall script isn't working (with 2.2.x and higher kernels).

   Solution -- Use the ipfwadm2ipchains script referenced above to convert the /etc/ppp/firewall script from the deprecated ipfwadm commands to ipchains.

6.) I'm getting *tons* of firewall deny messages in the /var/kern.log, /var/messages, and/or /var/syslog files, and they're filling up my /var partition!!!

   Solution -- First, tell your ISP that you're receiving DoS (Denial of Service) and/or portscanning (UDP connection attempts) attacks! Look over the deny messages to figure out what the attack is and where it's coming from. Next, edit /etc/ppp/cablemodem and make sure the ",nobroadcast" option is turned on in the "cmconfig" command (uncomment that line). If you're not receiving these denied packets on your broadcast interface (IP address xxx.yyy.zzz.255 typically), then someone is attacking your machine in particular. Be careful out there....

7.) Everything seems to work fine but my computer locks up after a while (and typically during a lengthy download through the cable modem)!

   Solution -- You may need to add a short delay in the driver to 'slow down' the SURFboard because your PC might not be able to keep up with the transfer rate of the SB1000. To do this, it's probably best to download Franco's sb1000-1.1.2.tar.gz archive and build and install sb1000.o manually. You'll want to edit the 'Makefile' and look for the 'SB1000_DELAY' define. Uncomment those 'CFLAGS' lines (and comment out the default ones) and try setting the delay to something like 60 microseconds with: '-DSB1000_DELAY=60'. Then do `make' and as root `make install' and try it out. If it still doesn't work or you like playing with the driver, you may try other numbers. Remember though that the higher the delay, the slower the driver (which slows down the rest of the PC too when it is actively used). Thanks to Ed Daiga for this tip!

-----------------------------------------------------------------------------

Credits: This README came from Franco Venturi's original README file which is still supplied with his driver .tar.gz archive. I and all other sb1000 users owe Franco a tremendous "Thank you!" Additional thanks goes to Carl Patten and Ralph Bonnell who are now managing the Linux SB1000 web site, and to the SB1000 users who reported and helped debug the common problems listed above.

						Clemmitt Sigler
						csigler@vt.edu