This patch is a requirement for enabling ECN support later on. With that change
in mind, the following preparations are done:
* renamed handle_loss() into congestion_event() since it returns true when a
congestion event happens (it will eventually also take care of ECN packets);
* lets tfrc_rx_congestion_event() always update the RX history records, since
this routine needs to be called for each non-duplicate packet anyway;
* made all involved boolean-type functions to have return type `bool';
Updating the RX history records is now only necessary for the packets received
up to sending the first feedback. The receiver code becomes again simpler.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
This updates the computation of X_recv with regard to Errata 610/611 for
RFC 4342 and draft rfc3448bis-06, ensuring that at least an interval of 1
RTT is used to compute X_recv. The change is wrapped into a new function
ccid3_hc_rx_x_recv().
Further changes:
----------------
* feedback is not sent when no data packets arrived (bytes_recv == 0), as per
rfc3448bis-06, 6.2;
* take the timestamp for the feedback /after/ dccp_send_ack() returns, to avoid
taking the transmission time into account (in case layer-2 is busy);
* clearer handling of failure in ccid3_first_li().
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
This improves the receiver RTT sampling algorithm so that it tries harder to get
as many RTT samples as possible.
The algorithm is based the concepts presented in RFC 4340, 8.1, using timestamps
and the CCVal window counter. There exist 4 cases for the CCVal difference:
* == 0: less than RTT/4 passed since last packet -- unusable;
* > 4: (much) more than 1 RTT has passed since last packet -- also unusable;
* == 4: perfect sample (exactly one RTT has passed since last packet);
* 1..3: sub-optimal sample (between RTT/4 and 3*RTT/4 has passed).
In the last case the algorithm tried to optimise by storing away the candidate
and then re-trying next time. The problem is that
* a large number of samples is needed to smooth out the inaccuracies of the
algorithm;
* the sender may not be sending enough packets to warrant a "next time";
* hence it is better to use suboptimal samples whenever possible.
The algorithm now stores away the current sample only if the difference is 0.
Applicability and background
----------------------------
A realistic example is MP3 streaming where packets are sent at a rate of less
than one packet per RTT, which means that suitable samples are absent for a
very long time.
The effectiveness of using suboptimal samples (with a delta between 1 and 4) was
confirmed by instrumenting the algorithm with counters. The results of two 20
second test runs were:
* With the old algorithm and a total of 38442 function calls, only 394 of these
calls resulted in usable RTT samples (about 1%), and 378 out of these were
"perfect" samples and 28013 (unused) samples had a delta of 1..3.
* With the new algorithm and a total of 37057 function calls, 1702 usable RTT
samples were retrieved (about 4.6%), 5 out of these were "perfect" samples.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
This updates the CCID-3 receiver in part with regard to errata 610 and 611
(http://www.rfc-editor.org/errata_list.php), which change RFC 4342 to use the
Receive Rate as specified in rfc3448bis, requiring to constantly sample the
RTT (or use a sender RTT).
Doing this requires reusing the RX history structure after dealing with a loss.
The patch does not resolve how to compute X_recv if the interval is less
than 1 RTT. A FIXME has been added (and is resolved in subsequent patch).
Furthermore, since this is all TFRC-based functionality, the RTT estimation
is now also performed by the dccp_tfrc_lib module. This further simplifies
the CCID-3 code.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
The only state information that the CCID-3 receiver keeps is whether initial
feedback has been sent or not. Further, this overlaps with use of feedback:
* state == TFRC_RSTATE_NO_DATA as long as no feedback has been sent;
* state == TFRC_RSTATE_DATA as soon as the first feedback has been sent.
This patch reduces the duplication, by memorising the type of the last feedback.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
This migrates more TFRC-related code into the dccp_tfrc_lib:
* sampling of the packet size `s' (which is only needed until the first
loss interval is computed (ccid3_first_li));
* updating the byte-counter `bytes_recvd' in between sending feedbacks.
The result is a better separation of CCID-3 specific and TFRC specific
code, which aids future integration with ECN and e.g. CCID-4.
Further changes:
----------------
* replaced magic number of 536 with equivalent constant TCP_MIN_RCVMSS;
(this constant is also used when no estimate for `s' is available).
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
This changes the return type of tfrc_lh_update_i_mean() to void, since that
function returns always `false'. This is due to
len = dccp_delta_seqno(cur->li_seqno, DCCP_SKB_CB(skb)->dccpd_seq) + 1;
if (len - (s64)cur->li_length <= 0) /* duplicate or reordered */
return 0;
which means that update_i_mean can only increase the length of the open loss
interval I_0, and hence the value of I_tot0 (RFC 3448, 5.4). Consequently the
test `i_mean < old_i_mean' at the end of the function always evaluates to false.
There is no known way by which a loss interval can suddenly become shorter,
therefore the return type of the function is changed to void. (That is, under
the given circumstances step (3) in RFC 3448, 6.1 will not occur.)
Further changes:
----------------
* the function is now called from tfrc_rx_handle_loss, which is equivalent
to the previous way of calling from rx_packet_recv (it was called whenever
there was no new or pending loss, now it is also updated when there is
a pending loss - this increases the accuracy a bit);
* added a FIXME to possibly consider NDP counting as per RFC 4342 (this is
not implemented yet).
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
This enables the TFRC code to begin loss detection (as soon as the module
is loaded), using the latest updates from rfc3448bis-06, 6.3.1:
* when the first data packet(s) are lost or marked, set
* X_target = s/(2*R) => f(p) = s/(R * X_target) = 2,
* corresponding to a loss rate of ~ 20.64%.
The handle_loss() function is now called right at the begin of rx_packet_recv()
and thus no longer protected against duplicates: hence a call to rx_duplicate()
has been added. Such a call makes sense now, as the previous patch initialises
the first entry with a sequence number of GSR.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
This patch
1) separates history allocation and initialisation, to facilitate early
loss detection (implemented by a subsequent patch);
2) removes duplication by using the existing tfrc_rx_hist_purge() if the
allocation fails. This is now possible, since the initialisation routine
3) zeroes out the entire history before using it.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
In the congestion-avoidance phase a decay of p towards 0 is natural once fewer
losses are encountered. Hence the warning message "p is below resolution" is
not necessary, and thus turned into a debug message by this patch.
The TFRC_SMALLEST_P is needed since in theory p never actually reaches 0. When
no further losses are encountered, the loss interval I_0 grows in length,
causing p to decrease towards 0, causing X_calc = s/(RTT * f(p)) to increase.
With the given minimum-resolution this congestion avoidance phase stops at some
fixed value, an approximation formula has been added to the documentation.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Since CCIDs are only used during the established phase of a connection,
they have very little internal state; this specifically reduces to:
* "no packet sent" if and only if s == 0, for the TX packet size s;
* when the first packet has been sent (i.e. `s' > 0), the question is whether
or not feedback has been received:
- if a feedback packet is received, "feedback = yes" is set,
- if the nofeedback timer expires, "feedback = no" is set.
Thus the CCID only needs to remember state about whether or not feedback
has been received. This is now implemented using a boolean flag, which is
toggled when a feedback packet arrives or the nofeedback timer expires.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
The DCCP base time resolution is 10 microseconds (RFC 4340, 13.1 ... 13.3).
Using a timer with a lower resolution was found to trigger the following
bug warnings/problems on high-speed networks (e.g. local loopback):
* RTT samples are rounded down to 0 if below resolution;
* in some cases, negative RTT samples were observed;
* the CCID-3 feedback timer complains that the feedback interval is 0,
since the feedback interval is in the order of 1 RTT or less and RTT
measurement rounded this down to 0;
On an Intel computer this will for instance happen when using a
boot-time parameter of "clocksource=jiffies".
The following system log messages were observed:
11:24:00 kernel: BUG: delta (0) <= 0 at ccid3_hc_rx_send_feedback()
11:26:12 kernel: BUG: delta (0) <= 0 at ccid3_hc_rx_send_feedback()
11:26:30 kernel: dccp_sample_rtt: unusable RTT sample 0, using min
11:26:30 last message repeated 5 times
This patch defines a global constant for the time resolution, adds this in
timer.c, and checks the available clock resolution at CCID-3 module load time.
When the resolution is worse than 10 microseconds, module loading exits with
a message "socket type not supported".
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
This patch consolidates the code common to TCP and CCID-2:
* TCP uses RFC 3390 in a packet-oriented manner (tcp_input.c) and
* CCID-2 uses RFC 3390 in packet-oriented manner (RFC 4341).
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
This removes the wrappers around the sk timer functions as it makes the code
clearer and not much is gained from using wrappers: the BUG_ON in
start_rto_timer will never trigger since that function was called only when
* the RTO timer expired (rto_expire, and then timer_pending() is false);
* in tx_packet_sent only if !timer_pending() (BUG_ON is redundant here);
* previously in new_ack, after stopping the timer (timer_pending() false).
One further motive behind this patch is to replace the RTO timer with the
icsk retransmission timer, as it is already part of the DCCP socket.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
The current CCID-2 RTT estimator code is in parts broken and lags behind the
suggestions in RFC2988 of using scaled variants for SRTT/RTTVAR.
That code is replaced by the present patch, which reuses the Linux TCP RTT
estimator code - reasons for this code duplication are given below.
Further details:
----------------
1. The minimum RTO of previously one second has been replaced with TCP's, since
RFC4341, sec. 5 says that the minimum of 1 sec. (suggested in RFC2988, 2.4)
is not necessary. Instead, the TCP_RTO_MIN is used, which agrees with DCCP's
concept of a default RTT (RFC 4340, 3.4).
2. The maximum RTO has been set to DCCP_RTO_MAX (64 sec), which agrees with
RFC2988, (2.5).
3. De-inlined the function ccid2_new_ack().
4. Added a FIXME: the RTT is sampled several times per Ack Vector, which will
give the wrong estimate. It should be replaced with one sample per Ack.
However, at the moment this can not be resolved easily, since
- it depends on TX history code (which also needs some work),
- the cleanest solution is not to use the `sent' time at all (saves 4 bytes
per entry) and use DCCP timestamps / elapsed time to estimated the RTT,
which however is non-trivial to get right (but needs to be done).
Reasons for reusing the Linux TCP estimator algorithm:
------------------------------------------------------
Some time was spent to find a better alternative, using basic RFC2988 as a first
step. Further analysis and experimentation showed that the Linux TCP RTO
estimator is superior to a basic RFC2988 implementation. A summary is on
http://www.erg.abdn.ac.uk/users/gerrit/dccp/notes/ccid2/rto_estimator/
In addition, this estimator fared well in a recent empirical evaluation:
Rewaskar, Sushant, Jasleen Kaur and F. Donelson Smith.
A Performance Study of Loss Detection/Recovery in Real-world TCP
Implementations. Proceedings of 15th IEEE International
Conference on Network Protocols (ICNP-07). 2007.
Thus there is significant benefit in reusing the existing TCP code.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
This removes the dec_pipe function and improves the way the RTO timer is rearmed
when a new acknowledgment comes in.
Details and justification for removal:
--------------------------------------
1) The BUG_ON in dec_pipe is never triggered: pipe is only decremented for TX
history entries between tail and head, for which it had previously been
incremented in tx_packet_sent; and it is not decremented twice for the same
entry, since it is
- either decremented when a corresponding Ack Vector cell in state 0 or 1
was received (and then ccid2s_acked==1),
- or it is decremented when ccid2s_acked==0, as part of the loss detection
in tx_packet_recv (and hence it can not have been decremented earlier).
2) Restarting the RTO timer happens for every single entry in each Ack Vector
parsed by tx_packet_recv (according to RFC 4340, 11.4 this can happen up to
16192 times per Ack Vector).
3) The RTO timer should not be restarted when all outstanding data has been
acknowledged. This is currently done similar to (2), in dec_pipe, when
pipe has reached 0.
The patch onsolidates the code which rearms the RTO timer, combining the
segments from new_ack and dec_pipe. As a result, the code becomes clearer
(compare with tcp_rearm_rto()).
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
This removes the ccid2_hc_tx_check_sanity function: it is redundant.
Details:
========
The tx_check_sanity function performs three tests:
1) it checks that the circular TX list is sorted
- in ascending order of sequence number (ccid2s_seq)
- and time (ccid2s_sent),
- in the direction from `tail' (hctx_seqt) to `head' (hctx_seqh);
2) it ensures that the entire list has the length seqbufc * CCID2_SEQBUF_LEN;
3) it ensures that pipe equals the number of packets that were not
marked `acked' (ccid2s_acked) between `tail' and `head'.
The following argues that each of these tests is redundant, this can be verified
by going through the code.
(1) is not necessary, since both time and GSS increase from one packet to the
next, so that subsequent insertions in tx_packet_sent (which advance the `head'
pointer) will be in ascending order of time and sequence number.
In (2), the length of the list is always equal to seqbufc times CCID2_SEQBUF_LEN
(set to 1024) unless allocation caused an earlier failure, because:
* at initialisation (tx_init), there is one chunk of size 1024 and seqbufc=1;
* subsequent calls to tx_alloc_seq take place whenever head->next == tail in
tx_packet_sent; then a new chunk of size 1024 is inserted between head and
tail, and seqbufc is incremented by one.
To show that (3) is redundant requires looking at two cases.
The `pipe' variable of the TX socket is incremented only in tx_packet_sent, and
decremented in tx_packet_recv. When head == tail (TX history empty) then pipe
should be 0, which is the case directly after initialisation and after a
retransmission timeout has occurred (ccid2_hc_tx_rto_expire).
The first case involves parsing Ack Vectors for packets recorded in the live
portion of the buffer, between tail and head. For each packet marked by the
receiver as received (state 0) or ECN-marked (state 1), pipe is decremented by
one, so for all such packets the BUG_ON in tx_check_sanity will not trigger.
The second case is the loss detection in the second half of tx_packet_recv,
below the comment "Check for NUMDUPACK".
The first while-loop here ensures that the sequence number of `seqp' is either
above or equal to `high_ack', or otherwise equal to the highest sequence number
sent so far (of the entry head->prev, as head points to the next unsent entry).
The next while-loop ("while (1)") counts the number of acked packets starting
from that position of seqp, going backwards in the direction from head->prev to
tail. If NUMDUPACK=3 such packets were counted within this loop, `seqp' points
to the last acknowledged packet of these, and the "if (done == NUMDUPACK)" block
is entered next.
The while-loop contained within that block in turn traverses the list backwards,
from head to tail; the position of `seqp' is saved in the variable `last_acked'.
For each packet not marked as `acked', a congestion event is triggered within
the loop, and pipe is decremented. The loop terminates when `seqp' has reached
`tail', whereupon tail is set to the position previously stored in `last_acked'.
Thus, between `last_acked' and the previous position of `tail',
- pipe has been decremented earlier if the packet was marked as state 0 or 1;
- pipe was decremented if the packet was not marked as acked.
That is, pipe has been decremented by the number of packets between `last_acked'
and the previous position of `tail'. As a consequence, pipe now again reflects
the number of packets which have not (yet) been acked between the new position
of tail (at `last_acked') and head->prev, or 0 if head==tail. The result is that
the BUG_ON condition in check_sanity will also not be triggered, hence the test
(3) is also redundant.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
This updates CCID2 to use the CCID dequeuing mechanism, converting from
previous constant-polling to a now event-driven mechanism.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
This patch reorganises the return value convention of the CCID TX sending
function, to permit more flexible schemes, as required by subsequent patches.
Currently the convention is
* values < 0 mean error,
* a value == 0 means "send now", and
* a value x > 0 means "send in x milliseconds".
The patch provides symbolic constants and a function to interpret return values.
In addition, it caps the maximum positive return value to 0xFFFF milliseconds,
corresponding to 65.535 seconds.
This is possible since in CCID-3 the maximum inter-packet gap is t_mbi = 64 sec.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
This patch replaces an almost identical replication of code: large parts
of dccp_parse_options() re-appeared as ccid2_ackvector() in ccid2.c.
Apart from the duplication, this caused two more problems:
1. CCIDs should not need to be concerned with parsing header options;
2. one can not assume that Ack Vectors appear as a contiguous area within an
skb, it is legal to insert other options and/or padding in between. The
current code would throw an error and stop reading in such a case.
The patch provides a new data structure and associated list housekeeping.
Only small changes were necessary to integrate with CCID-2: data structure
initialisation, adapt list traversal routine, and add call to the provided
cleanup routine.
The latter also lead to fixing the following BUG: CCID-2 so far ignored
Ack Vectors on all packets other than Ack/DataAck, which is incorrect,
since Ack Vectors can be present on any packet that has an Ack field.
Details:
--------
* received Ack Vectors are parsed by dccp_parse_options() alone, which passes
the result on to the CCID-specific routine ccid_hc_tx_parse_options();
* CCIDs interested in using/decoding Ack Vector information will add code
to fetch parsed Ack Vectors via this interface;
* a data structure, `struct dccp_ackvec_parsed' is provided as interface;
* this structure arranges Ack Vectors of the same skb into a FIFO order;
* a doubly-linked list is used to keep the required FIFO code small.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
This removes
* functions for which updates have been provided in the preceding patches and
* the @av_vec_len field - it is no longer necessary since the buffer length is
now always computed dynamically;
* conditional debugging code (CONFIG_IP_DCCP_ACKVEC).
The reason for removing the conditional debugging code is that Ack Vectors are
an almost inevitable necessity - RFC 4341 says that for CCID-2, Ack Vectors must
be used. Furthermore, the code would be only interesting for coding - after some
extensive testing with this patch set, having the debug code around is no longer
of real help.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
This patch brings the Ack Vector interface up to date. Its main purpose is
to lay the basis for the subsequent patches of this set, which will use the
new data structure fields and routines.
There are no real algorithmic changes, rather an adaptation:
(1) Replaced the static Ack Vector size (2) with a #define so that it can
be adapted (with low loss / Ack Ratio, a value of 1 works, so 2 seems
to be sufficient for the moment) and added a solution so that computing
the ECN nonce will continue to work - even with larger Ack Vectors.
(2) Replaced the #defines for Ack Vector states with a complete enum.
(3) Replaced #defines to compute Ack Vector length and state with general
purpose routines (inlines), and updated code to use these.
(4) Added a `tail' field (conversion to circular buffer in subsequent patch).
(5) Updated the (outdated) documentation for Ack Vector struct.
(6) All sequence number containers now trimmed to 48 bits.
(7) Removal of unused bits:
* removed dccpav_ack_nonce from struct dccp_ackvec, since this is already
redundantly stored in the `dccpavr_ack_nonce' (of Ack Vector record);
* removed Elapsed Time for Ack Vectors (it was nowhere used);
* replaced semantics of dccpavr_sent_len with dccpavr_ack_runlen, since
the code needs to be able to remember the old run length;
* reduced the de-/allocation routines (redundant / duplicate tests).
Justification for removing Elapsed Time information [can be removed]:
---------------------------------------------------------------------
1. The Elapsed Time information for Ack Vectors was nowhere used in the code.
2. DCCP does not implement rate-based pacing of acknowledgments. The only
recommendation for always including Elapsed Time is in section 11.3 of
RFC 4340: "Receivers that rate-pace acknowledgements SHOULD [...]
include Elapsed Time options". But such is not the case here.
3. It does not really improve estimation accuracy. The Elapsed Time field only
records the time between the arrival of the last acknowledgeable packet and
the time the Ack Vector is sent out. Since Linux does not (yet) implement
delayed Acks, the time difference will typically be small, since often the
arrival of a data packet triggers sending feedback at the HC-receiver.
Justification for changes in de-/allocation routines [can be removed]:
----------------------------------------------------------------------
* INIT_LIST_HEAD in dccp_ackvec_record_new was redundant, since the list
pointers were later overwritten when the node was added via list_add();
* dccp_ackvec_record_new() was called in a single place only;
* calls to list_del_init() before calling dccp_ackvec_record_delete() were
redundant, since subsequently the entire element was k-freed;
* since all calls to dccp_ackvec_record_delete() were preceded to a call to
list_del_init(), the WARN_ON test would never evaluate to true;
* since all calls to dccp_ackvec_record_delete() were made from within
list_for_each_entry_safe(), the test for avr == NULL was redundant;
* list_empty() in ackvec_free was redundant, since the same condition is
embedded in the loop condition of the subsequent list_for_each_entry_safe().
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
This patch is thanks to an investigation by Leandro Sales de Melo and his
colleagues. They worked out two state diagrams which highlight the fact that
the xxx_TERM states in CCID-3/4 are in fact not necessary.
And this can be confirmed by in turn looking at the code: the xxx_TERM states
are only ever set in ccid3_hc_{rx,tx}_exit(). These two functions are part
of the following call chain:
* ccid_hc_{tx,rx}_exit() are called from ccid_delete() only;
* ccid_delete() invokes ccid_hc_{tx,rx}_exit() in the way of a destructor:
after calling ccid_hc_{tx,rx}_exit(), the CCID is released from memory;
* ccid_delete() is in turn called only by ccid_hc_{tx,rx}_delete();
* ccid_hc_{tx,rx}_delete() is called only if
- feature negotiation failed (dccp_feat_activate_values()),
- when changing the RX/TX CCID (to eject the current CCID),
- when destroying the socket (in dccp_destroy_sock()).
In other words, when CCID-3 sets the state to xxx_TERM, it is at a time where
no more processing should be going on, hence it is not necessary to introduce
a dedicated exit state - this is implicit when unloading the CCID.
The patch removes this state, one switch-statement collapses as a result.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
This removes the argument `more' from ccid_hc_tx_packet_sent, since it was
nowhere used in the entire code.
(Anecdotally, this argument was not even used in the original KAME code where
the function originally came from; compare the variable moreToSend in the
freebsd61-dccp-kame-28.08.2006.patch now maintained by Emmanuel Lochin.)
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
The `options_received' struct is redundant, since it re-duplicates the existing
`p' and `x_recv' fields. This patch removes the sub-struct and migrates the
format conversion operations (cf. below) to ccid3_hc_tx_parse_options().
Why the fields are redundant
----------------------------
The Loss Event Rate p and the Receive Rate x_recv are initially 0 when first
loading CCID-3, as ccid_new() zeroes out the entire ccid3_hc_tx_sock.
When Loss Event Rate or Receive Rate options are received, they are stored by
ccid3_hc_tx_parse_options() into the fields `ccid3or_loss_event_rate' and
`ccid3or_receive_rate' of the sub-struct `options_received' in ccid3_hc_tx_sock.
After parsing (considering only the established state - dccp_rcv_established()),
the packet is passed on to ccid_hc_tx_packet_recv(). This calls the CCID-3
specific routine ccid3_hc_tx_packet_recv(), which performs the following copy
operations between fields of ccid3_hc_tx_sock:
* hctx->options_received.ccid3or_receive_rate is copied into hctx->x_recv,
after scaling it for fixpoint arithmetic, by 2^64;
* hctx->options_received.ccid3or_loss_event_rate is copied into hctx->p,
considering the above special cases; in addition, a value of 0 here needs to
be mapped into p=0 (when no Loss Event Rate option has been received yet).
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
This adds a function to take care of the following cases occurring in the
computation of the Loss Rate p:
* 1/(2^32-1) is mapped into 0% as per RFC 4342, 8.5;
* 1/0 is mapped into the maximum of 100%;
* we want to avoid that p = 1/x is rounded down to 0 when x is very large,
since this means accidentally re-entering slow-start (indicated by p==0).
In the last case, the minimum-resolution value of p is returned.
Furthermore, a bug in ccid3_hc_rx_getsockopt is fixed (1/0 was mapped into ~0U),
which now allows to consistently print the scaled p-values as
printf("Loss Event Rate = %u.%04u %%\n", rx_info.tfrcrx_p / 10000,
rx_info.tfrcrx_p % 10000);
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
This patch ...
1. adds packet type information to ccid_hc_{rx,tx}_parse_options(). This is
necessary, since table 3 in RFC 4340, 5.8 leaves it to the CCIDs to state
which options may (not) appear on what packet type.
2. adds such a check for CCID-3's {Loss Event, Receive} Rate as specified in
RFC 4340 8.3 ("Receive Rate options MUST NOT be sent on DCCP-Data packets")
and 8.5 ("Loss Event Rate options MUST NOT be sent on DCCP-Data packets").
3. removes an unused argument `idx' from ccid_hc_{rx,tx}_parse_options(). This
is also no longer necessary, since the CCID-specific option-parsing routines
are passed every single parameter of the type-length-value option encoding.
Also added documentation and made argument naming scheme consistent.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
This simplifies and consolidates the TX option-parsing code:
1. The Loss Intervals option is not currently used, so dead code related to
this option is removed. I am aware of no plans to support the option, but
if someone wants to implement it (e.g. for inter-op tests), it is better
to start afresh than having to also update currently unused code.
2. The Loss Event and Receive Rate options have a lot of code in common (both
are 32 bit, both have same length etc.), so this is consolidated.
3. The test against GSR is not necessary, because
- on first loading CCID3, ccid_new() zeroes out all fields in the socket;
- ccid3_hc_tx_packet_recv() treats 0 and ~0U equivalently, due to
pinv = opt_recv->ccid3or_loss_event_rate;
if (pinv == ~0U || pinv == 0)
hctx->p = 0;
- as a result, the sequence number field is removed from opt_recv.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
This removes the RTT-sampling function tfrc_tx_hist_rtt(), since
1. it suffered from complex passing of return values (the return value both
indicated successful lookup while the value doubled as RTT sample);
2. when for some odd reason the sample value equalled 0, this triggered a bug
warning about "bogus Ack", due to the ambiguity of the return value;
3. on a passive host which has not sent anything the TX history is empty and
thus will lead to unwanted "bogus Ack" warnings such as
ccid3_hc_tx_packet_recv: server(e7b7d518): DATAACK with bogus ACK-28197148
ccid3_hc_tx_packet_recv: server(e7b7d518): DATAACK with bogus ACK-26641606.
The fix is to replace the implicit encoding by performing the steps manually.
Furthermore, the "bogus Ack" warning has been removed, since it can actually be
triggered due to several reasons (network reordering, old packet, (3) above),
hence it is not very useful.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
This fixes a subtle bug in the calculation of the inter-packet gap and shows
that t_delta, as it is currently used, is not needed. And hence replaced.
The algorithm from RFC 3448, 4.6 below continually computes a send time t_nom,
which is initialised with the current time t_now; t_gran = 1E6 / HZ specifies
the scheduling granularity, s the packet size, and X the sending rate:
t_distance = t_nom - t_now; // in microseconds
t_delta = min(t_ipi, t_gran) / 2; // `delta' parameter in microseconds
if (t_distance >= t_delta) {
reschedule after (t_distance / 1000) milliseconds;
} else {
t_ipi = s / X; // inter-packet interval in usec
t_nom += t_ipi; // compute the next send time
send packet now;
}
1) Description of the bug
-------------------------
Rescheduling requires a conversion into milliseconds, due to this call chain:
* ccid3_hc_tx_send_packet() returns a timeout in milliseconds,
* this value is converted by msecs_to_jiffies() in dccp_write_xmit(),
* and finally used as jiffy-expires-value for sk_reset_timer().
The highest jiffy resolution with HZ=1000 is 1 millisecond, so using a higher
granularity does not make much sense here.
As a consequence, values of t_distance < 1000 are truncated to 0. This issue
has so far been resolved by using instead
if (t_distance >= t_delta + 1000)
reschedule after (t_distance / 1000) milliseconds;
The bug is in artificially inflating t_delta to t_delta' = t_delta + 1000. This
is unnecessarily large, a more adequate value is t_delta' = max(t_delta, 1000).
2) Consequences of using the corrected t_delta'
-----------------------------------------------
Since t_delta <= t_gran/2 = 10^6/(2*HZ), we have t_delta <= 1000 as long as
HZ >= 500. This means that t_delta' = max(1000, t_delta) is constant at 1000.
On the other hand, when using a coarse HZ value of HZ < 500, we have three
sub-cases that can all be reduced to using another constant of t_gran/2.
(a) The first case arises when t_ipi > t_gran. Here t_delta' is the constant
t_delta' = max(1000, t_gran/2) = t_gran/2.
(b) If t_ipi <= 2000 < t_gran = 10^6/HZ usec, then t_delta = t_ipi/2 <= 1000,
so that t_delta' = max(1000, t_delta) = 1000 < t_gran/2.
(c) If 2000 < t_ipi <= t_gran, we have t_delta' = max(t_delta, 1000) = t_ipi/2.
In the second and third cases we have delay values less than t_gran/2, which is
in the order of less than or equal to half a jiffy.
How these are treated depends on how fractions of a jiffy are handled: they
are either always rounded down to 0, or always rounded up to 1 jiffy (assuming
non-zero values). In both cases the error is on average in the order of 50%.
Thus we are not increasing the error when in the second/third case we replace
a value less than t_gran/2 with 0, by setting t_delta' to the constant t_gran/2.
3) Summary
----------
Fixing (1) and considering (2), the patch replaces t_delta with a constant,
whose value depends on CONFIG_HZ, changing the above algorithm to:
if (t_distance >= t_delta')
reschedule after (t_distance / 1000) milliseconds;
where t_delta' = 10^6/(2*HZ) if HZ < 500, and t_delta' = 1000 otherwise.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
The CCIDs are activated as last of the features, at the end of the handshake,
were the LISTEN state of the master socket is inherited into the server
state of the child socket. Thus, the only states visible to CCIDs now are
OPEN/PARTOPEN, and the closing states.
This allows to remove tests which were previously necessary to protect
against referencing a socket in the listening state (in CCID3), but which
now have become redundant.
As a further byproduct of enabling the CCIDs only after the connection has been
fully established, several typecast-initialisations of ccid3_hc_{rx,tx}_sock
can now be eliminated:
* the CCID is loaded, so it is not necessary to test if it is NULL,
* if it is possible to load a CCID and leave the private area NULL, then this
is a bug, which should crash loudly - and earlier,
* the test for state==OPEN || state==PARTOPEN now reduces only to the closing
phase (e.g. when the node has received an unexpected Reset).
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
This patch does the same for CCID-3 as the previous patch for CCID-2:
s#ccid3hctx_##g;
s#ccid3hcrx_##g;
plus manual editing to retain consistency.
Please note: expanded the fields of the `struct tfrc_tx_info' in the hc_tx_sock,
since using short #define identifiers is not a good idea. The only place where
this embedded struct was used is ccid3_hc_tx_getsockopt().
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
This patch fixes two problems caused by the ubiquitous long "hctx->ccid2htx_"
and "hcrx->ccid2hcrx_" prefixes:
* code becomes hard to read;
* multiple-line statements are almost inevitable even for simple expressions;
The prefixes are not really necessary (compare with "struct tcp_sock").
There had been previous discussion of this on dccp@vger, but so far this was
not followed up (most people agreed that the prefixes are too long).
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Leandro Melo de Sales <leandroal@gmail.com>
Two registration routines, for SP and NN features, are provided by this patch,
replacing a previous routine which was used for both feature types.
These are internal-only routines and therefore start with `__feat_register'.
It further exports the known limits of Sequence Window and Ack Ratio as symbolic
constants.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
The BUG_ON(w_tot == 0) only holds if there is no more than 1 loss interval in
the loss history. If there is only a single loss interval, the calc_i_mean()
routine need in fact not be called (RFC 3448, 6.3.1).
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
This sets the sysfs permissions so that root can toggle the `debug'
parameter available for nearly every DCCP module. This is useful
since there are various module inter-dependencies. The debug flag
can now be toggled at runtime using
echo 1 > /sys/module/dccp/parameters/dccp_debug
echo 1 > /sys/module/dccp_ccid2/parameters/ccid2_debug
echo 1 > /sys/module/dccp_ccid3/parameters/ccid3_debug
echo 1 > /sys/module/dccp_tfrc_lib/parameters/tfrc_debug
The last is not very useful yet, since no code at the moment calls
the tfrc_debug() macro.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
This corrects an error in the computation of the open loss interval I_0:
* the interval length is (highest_seqno - start_seqno) + 1
* and not (highest_seqno - start_seqno).
This condition was not fully clear in RFC 3448, but reflects the current
revision state of rfc3448bis and is also consistent with RFC 4340, 6.1.1.
Further changes:
----------------
* variable renamed due to line length constraints;
* explicit typecast to `s64' to avoid implicit signed/unsigned casting.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
This fixes a bug in the logic of the TFRC loss detection:
* new_loss_indicated() should not be called while a loss is pending;
* but the code allows this;
* thus, for two subsequent gaps in the sequence space, when loss_count
has not yet reached NDUPACK=3, the loss_count is falsely reduced to 1.
To avoid further and similar problems, all loss handling and loss detection is
now done inside tfrc_rx_hist_handle_loss(), using an appropriate routine to
track new losses.
Further changes:
----------------
* added a reminder that no RX history operations should be performed when
rx_handle_loss() has identified a (new) loss, since the function takes
care of packet reordering during loss detection;
* made tfrc_rx_hist_loss_pending() bool (thanks to an earlier suggestion
by Arnaldo);
* removed unused functions.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
RFC 4340, 7.7 specifies up to 6 bytes for the NDP Count option, whereas the code
is currently limited to up to 3 bytes. This seems to be a relict of an earlier
draft version and is brought up to date by the patch.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
The TFRC loss detection code used the wrong loss condition (RFC 4340, 7.7.1):
* the difference between sequence numbers s1 and s2 instead of
* the number of packets missing between s1 and s2 (one less than the distance).
Since this condition appears in many places of the code, it has been put into a
separate function, dccp_loss_free().
Further changes:
----------------
* tidied up incorrect typing (it was using `int' for u64/s64 types);
* optimised conditional statements for common case of non-reordered packets;
* rewrote comments/documentation to match the changes.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
This fixes a bug in computing the inter-packet-interval t_ipi = s/X:
scaled_div32(a, b) uses u32 for b, but in "scaled_div32(s, X)" the type of the
sending rate `X' is u64. Since X is scaled by 2^6, this truncates rates greater
than 2^26 Bps (~537 Mbps).
Using full 64-bit division now.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
This fixes a bug in the reverse lookup of p: given a value f(p), instead of p,
the function returned the smallest tabulated value f(p).
The smallest tabulated value of
10^6 * f(p) = sqrt(2*p/3) + 12 * sqrt(3*p/8) * (32 * p^3 + p)
for p=0.0001 is 8172.
Since this value is scaled by 10^6, the outcome of this bug is that a loss
of 8172/10^6 = 0.8172% was reported whenever the input was below the table
resolution of 0.01%.
This means that the value was over 80 times too high, resulting in large spikes
of the initial loss interval, thus unnecessarily reducing the throughput.
Also corrected the printk format (%u for u32).
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
This patch fixes the following sparse warnings:
* nested min(max()) expression:
net/dccp/ccids/ccid3.c:91:21: warning: symbol '__x' shadows an earlier one
net/dccp/ccids/ccid3.c:91:21: warning: symbol '__y' shadows an earlier one
* Declaration of function prototypes in .c instead of .h file, resulting in
"should it be static?" warnings.
* Declared "struct dccpw" static (local to dccp_probe).
* Disabled dccp_delayed_ack() - not fully removed due to RFC 4340, 11.3
("Receivers SHOULD implement delayed acknowledgement timers ...").
* Used a different local variable name to avoid
net/dccp/ackvec.c:293:13: warning: symbol 'state' shadows an earlier one
net/dccp/ackvec.c:238:33: originally declared here
* Removed unused functions `dccp_ackvector_print' and `dccp_ackvec_print'.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
In commit $(825de27d9e) (from 27th May, commit
message `dccp ccid-3: Fix "t_ipi explosion" bug'), the CCID-3 window counter
computation was fixed to cope with RTTs < 4 microseconds.
Such RTTs can be found e.g. when running CCID-3 over loopback. The fix removed
a check against RTT < 4, but introduced a divide-by-zero bug.
All steady-state RTTs in DCCP are filtered using dccp_sample_rtt(), which
ensures non-zero samples. However, a zero RTT is possible on initialisation,
when there is no RTT sample from the Request/Response exchange.
The fix is to use the fallback-RTT from RFC 4340, 3.4.
This is also better than just fixing update_win_count() since it allows other
parts of the code to always assume that the RTT is non-zero during the time
that the CCID is used.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
The identification of this bug is thanks to Cheng Wei and Tomasz
Grobelny.
To avoid divide-by-zero, the implementation previously ignored RTTs
smaller than 4 microseconds when performing integer division RTT/4.
When the RTT reached a value less than 4 microseconds (as observed on
loopback), this prevented the Window Counter CCVal value from
advancing. As a result, the receiver stopped sending feedback. This in
turn caused non-ending expiries of the nofeedback timer at the sender,
so that the sending rate was progressively reduced until reaching the
minimum of one packet per 64 seconds.
The patch fixes this bug by handling integer division more
intelligently. Due to consistent use of dccp_sample_rtt(),
divide-by-zero-RTT is avoided.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
Makes the intention of the nested min/max clear.
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This implements the changes to the nofeedback timer handling suggested
in draft rfc3448bis00, section 4.4. In particular, these changes mean:
* better handling of the lossless case (p == 0)
* the timestamp for computing t_ld becomes obsolete
* much more recent document (RFC 3448 is almost 5 years old)
* concepts in rfc3448bis arose from a real, working implementation
(cf. sec. 12)
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This implements the algorithm to update the allowed sending rate X upon
receiving feedback packets, as described in draft rfc3448bis, 4.2/4.3.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
* the NO_SENT state is only triggered in bidirectional mode,
costing unnecessary processing.
* the TERM (terminating) state is irrelevant.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch
1) concentrates previously scattered computation of p_inv into one function;
2) removes the `p' element of the CCID3 RX sock (it is redundant);
3) makes the tfrc_rx_info structure standalone, only used on demand.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The patch makes the registration messages of CCID 2/3 a bit more
informative: instead of repeating the CCID number as currently done,
"CCID: Registered CCID 2 (ccid2)" or
"CCID: Registered CCID 3 (ccid3)",
the descriptive names of the CCID's (from RFCs) are now used:
"CCID: Registered CCID 2 (TCP-like)" and
"CCID: Registered CCID 3 (TCP-Friendly Rate Control)".
To allow spaces in the name, the slab name string has been changed to
refer to the numeric CCID identifier, using the same format as before.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This hooks up the TFRC Loss Interval database with CCID 3 packet reception.
In addition, it makes the CCID-specific computation of the first loss
interval (which requires access to all the guts of CCID3) local to ccid3.c.
The patch also fixes an omission in the DCCP code, that of a default /
fallback RTT value (defined in section 3.4 of RFC 4340 as 0.2 sec); while
at it, the upper bound of 4 seconds for an RTT sample has been reduced to
match the initial TCP RTO value of 3 seconds from[RFC 1122, 4.2.3.1].
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This moves two inlines back to packet_history.h: these are not private
to packet_history.c, but are needed by CCID3/4 to detect whether a new
loss is indicated, or whether a loss is already pending.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Each time feedback is sent two lines are printed:
ccid3_hc_rx_send_feedback: client ... - entry
ccid3_hc_rx_send_feedback: Interval ...usec, X_recv=..., 1/p=...
The first line is redundant and thus removed.
Further, documentation of ccid3_hc_rx_sock (capitalisation) is made consistent.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
A ringbuffer-based implementation of loss interval history is easier to
maintain, allocate, and update.
The `swap' routine to keep the RX history sorted is due to and was written
by Arnaldo Carvalho de Melo, simplifying an earlier macro-based variant.
Details:
* access to the Loss Interval Records via macro wrappers (with safety checks);
* simplified, on-demand allocation of entries (no extra memory consumption on
lossless links); cache allocation is local to the module / exported as service;
* provision of RFC-compliant algorithm to re-compute average loss interval;
* provision of comprehensive, new loss detection algorithm
- support for all cases of loss, including re-ordered/duplicate packets;
- waiting for NDUPACK=3 packets to fill the hole;
- updating loss records when a late-arriving packet fills a hole.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This moves the inlines (which were previously declared as macros) back into
packet_history.h since the loss detection code needs to be able to read entries
from the RX history in order to create the relevant loss entries: it needs at
least tfrc_rx_hist_loss_prev() and tfrc_rx_hist_last_rcv(), which in turn
require the definition of the other inlines (macros).
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This separates RX/TX initialisation and puts all packet history / loss intervals
initialisation into tfrc.c.
The organisation is uniform: slab declaration -> {rx,tx}_init() -> {rx,tx}_exit()
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Moved up the comment "Receiver routines" above the first occurrence of
RX history routines.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Credit here goes to Gerrit Renker, that provided the initial implementation for
this new codebase.
I modified it just to try to make it closer to the existing API, renaming some
functions, add namespacing and fix one bug where the tfrc_rx_hist_alloc was not
freeing the allocated ring entries on the error path.
Original changeset comment from Gerrit:
-----------
This provides a new, self-contained and generic RX history service for TFRC
based protocols.
Details:
* new data structure, initialisation and cleanup routines;
* allocation of dccp_rx_hist entries local to packet_history.c,
as a service exported by the dccp_tfrc_lib module.
* interface to automatically track highest-received seqno;
* receiver-based RTT estimation (needed for instance by RFC 3448, 6.3.1);
* a generic function to test for `data packets' as per RFC 4340, sec. 7.7.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Only the sender sets window counters [RFC 4342, sections 5 and 8.1].
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is in preparation for merging the new rx history code written by Gerrit Renker.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is in preparation for merging the new rx history code written by Gerrit Renker.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch changes the tfrc_lib module in the following manner:
(1) a dedicated tfrc source file to call the packet history &
loss interval init/exit functions.
(2) a dedicated tfrc_pr_debug macro with toggle switch `tfrc_debug'.
Commiter note: renamed tfrc_module.c to tfrc.c, and made CONFIG_IP_DCCP_CCID3
select IP_DCCP_TFRC_LIB.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Based on a previous patch by Gerrit Renker.
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch was based on another made by Gerrit Renker, his changelog was:
------------------------------------------------------
The patch set migrates TFRC TX history to a singly-linked list.
The details are:
* use of a consistent naming scheme (all TFRC functions now begin with `tfrc_');
* allocation and cleanup are taken care of internally;
* provision of a lookup function, which is used by the CCID TX infrastructure
to determine the time a packet was sent (in turn used for RTT sampling);
* integration of the new interface with the present use in CCID3.
------------------------------------------------------
Simplifications I did:
. removing the tfrc_tx_hist_head that had a pointer to the list head and
another for the slabcache.
. No need for creating a slabcache for each CCID that wants to use the TFRC
tx history routines, create a single slabcache when the dccp_tfrc_lib module
init routine is called.
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This removes a comment which identifies an `issue' with dccp_write_xmit() where there is none.
The comment assumes it is possible that a packet is sent between the calls to
ccid_hc_tx_send_packet(),
dccp_transmit_skb(),
ccid_hc_tx_packet_sent()
(in the above order) in dccp_write_xmit().
I think that this is impossible, since dccp_write_xmit() is always called under lock:
* when called as dccp_write_xmit(sk, 1) from dccp_send_close(), the socket is locked
(see code comment above dccp_send_close());
* when called as dccp_write_xmit(sk, 0) from dccp_send_msg(), it is after lock_sock() has been called;
* when called as dccp_write_xmit(sk, 0) from dccp_write_xmit_timer(), bh_lock_sock() has been called
and the if/else statement has made sure that sk_lock.owner is not set;
* there are no other places where dccp_write_xmit() is called.
Furthermore, the debug statement for printing the sequence number of the packet just sent has been
removed, since the entire list is being printed anyway and so the entry of that number appears last.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The code used two different variables to count Acks, one of them redundant.
This patch reduces the number of Ack counters to one.
The type of the Ack counter has also been changed to u32 (twice the range of int);
and the variable has been renamed into `packets_acked' - for consistency with
RFC 3465 (and similarly named variables are used by TCP and SCTP).
Lastly, a slightly less aggressive `maxincr' increment is used (for even Ack Ratios,
maxincr was Ack Ratio/2 + 1 instead of Ack Ratio/2).
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This removes the synchronisation variable `ccid2hctx_sendwait', which is set to 1
when the CCID2 sender may send a new packet, and which is set to 0 otherwise
The variable is redundant, since it is only used in combination with the hc_tx_send_packet/
hc_tx_packet_sent function pair. Both functions are called under socket lock, so the
following happens when the CCID2 may send a new packet:
* it sets sendwait = 1 in tx_send_packet and returns 0;
* the subsequent call to tx_packet_sent clears the sendwait flag;
* since tx_send_packet returns 0 if and only if sendwait == 1, the BUG_ON condition
in tx_packet_sent is never satisfied, since that function is never called when
tx_send_packet returns a value different from 0 (cf. dccp_write_xmit);
* the call to tx_packet_sent clears the flag so that the condition "!sendwait" is
true the next time tx_packet_sent is called.
In other words, it is sufficient to just return 0 / not-0 to synchronise tx_send_packet
and tx_packet_sent -- which is what the patch does.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This reduces the amount of redundant debugging messages:
* pipe/cwnd are printed in both tx_send_packet() and tx_packet_sent().
Both functions are called immediately after one another, so one occurrence is sufficient.
* Since tx_packet_sent() prints pipe/cwnd already, the second printk for pipe is redundant.
* In tx_packet_sent() the check_sanity function is called twice (at the begin and at the end).
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The function ccid2_change_pipe only does an assignment. This patch simplifies the code by
replacing the function with the assignment it performs.
Furthermore, the type of pipe is promoted from `signed' to unsigned (increasing the range).
As a result, a BUG_ON test for negative values now becomes obsolete (for safety not removed,
but replaced with a less annoying `DCCP_BUG').
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The current function ccid2_change_cwnd in effect makes only an assignment, as
the test whether cwnd has reached 0 is only required when cwnd is halved.
This patch simplifies the code by replacing the function with the assignment
it performs.
Furthermore, since ssthresh derives from cwnd and appears in many assignments and
comparisons, the type of ssthresh has also been changed to match that of cwnd.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This replaces the field member `numdupack', which was used as a read-only
constant in the code, with a #define.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This removes a variable `ccid2hctx_sent' which is incremented but
never referenced/read (i.e., dead code).
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This comments out a problematic section comprising a half-finished algorithm:
- The variable `ccid2hctx_ackloss' is never initialised to a value different from 0 and
hence in fact is a read-only constant.
- The `arsent' variable counts packets other than Acks (it is incremented for every packet),
and there is no test for Ack Loss.
- The concept of counting Acks as such leads to a complex calculation, and the calculation
at the moment is inconsistent with this concept.
The problem is that the number of Acks - rather than the number of windows - is counted,
which leads to a complex (cubic/quadratic) expression - this is not even implemented.
In its current state, the commented-out algorithm interfers with normal processing by
changing Ack Ratio incorrectly, and at the wrong times.
A new algorithm is necessary, which will not necessarily use the same variables as used by
the unfinished one; hence the old variables have been removed.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
RFC 4341, sec. 5 states that "The cwnd parameter is initialized to at most
four packets for new connections, following the rules from [RFC3390]", which
is implemented by this patch.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch removes a bug in the current code. I agree with Andrea's comment
that there is a problem here but the way it is treated does not fix it.
The problem is that whenever Ack Ratio > cwnd, starvation/deadlock occurs:
* the receiver will not send an Ack until (Ack Ratio - cwnd) data packets
have arrived;
* the sender will not send any data packet before the receipt of an Ack
advances the send window.
The only way that the connection then progresses was via RTO timeout. In one
extreme case (bulk transfer), it was observed that this happened for every single
packet; i.e. hundreds of packets, each a RTO timeout of 1..3 seconds apart:
a transfer which normally would take a fraction of a second thus grew to
several minutes.
The solution taken by this approach is to observe the relation
"Ack Ratio <= cwnd"
by using the constraint (1) from RFC 4341, 6.1.2; i.e. set
Ack Ratio = ceil(cwnd / 2)
and update it whenever either Ack Ratio or cwnd change. This ensures that
the deadlock problem can not arise.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since it makes not sense to assign negative values to Ack Ratio, this
patch disallows this possibility.
As a consequence, a Bug test for negative Ack Ratio values becomes obsolete.
Furthermore, a check against overflow (as Ack Ratio may not exceed 2 bytes,
due to RFC 4340, 11.3) has been added.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This replaces use of normal subtraction with modulo-48 subtraction.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In CCID2 the receiver-history is sorted in ascending order of sequence number,
but the processing of received Ack Vectors requires the list traversal in the
opposite direction.
The current code has a bug in this regard: the list traversal is upwards. As a
consequence, only Ack Vectors with a run length of 1 will pass, in all other
Ack Vectors the remaining (acked) sequence numbers are missed, and may later
falsely be identified as lost.
Note: This bug is only visible when Ack Ratio > 1, since otherwise the run
lengths of Ack Vectors are 0.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This cleans up the consequences of an earlier patch which
introduced the `if IP_DCCP' clause into net/dccp/Kconfig.
The CCID Kconfig menu is sourced within this clause; as a
consequence, all tests of type `depends on IP_DCCP' are now
redundant.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch addresses the following problems:
1. DCCP relies for its proper functioning on having at least one CCID module
enabled (as in TCP plugable congestion control). Currently it is possible to
disable both CCIDs and thus leave the DCCP module in a compiled, but entirely
non-functional state: no sockets can be created when no CCID is available.
Furthermore, the protocol is (again like TCP) not intended to be used without
CCIDs. Last, a non-empty CCID list is needed for doing CCID feature negotiation.
2. Internally the default CCID that is advertised by the Linux host is set to CCID2
(DCCPF_INITIAL_CCID in include/linux/dccp.h). Disabling CCID2 in the Kconfig
menu without changing the defaults leads to a failure `module not found' when
trying to load the dccp module (which internally tries to load the default CCID).
3. The specification (RFC 4340, sec. 10) treats CCID2 somewhat like a
`minimum common denominator'; the specification says that:
* "New connections start with CCID 2 for both endpoints"
* "A DCCP implementation intended for general use, such as an implementation in a
general-purpose operating system kernel, SHOULD implement at least CCID 2.
The intent is to make CCID 2 broadly available for interoperability [...]"
Providing CCID2 as minimum-required CCID (like Reno/Cubic in TCP) thus seems reasonable.
Hence this patch automatically selects CCID2 when DCCP is enabled. Documentation also added.
Discussions with Ian McDonald on this subject are gratefully acknowledged.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The moving average computation occurs so frequently in the CCID 3 code that
it merits an inline function of its own. This is uses a suggestion by
Arnaldo as per http://www.mail-archive.com/dccp@vger.kernel.org/msg01662.html
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This fixes/updates the handling of idle and application-limited periods in CCID3,
which currently is broken: there is no detection as to how long a sender has been
idle - there is only one flag which is toggled in between function calls.
Being obsolete now, the `idle' flag is removed.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch fixes a previously undiscovered bug; the problem is in computing
the elapsed time as the time between `receiving' the packet (i.e. skb enters
CCID module) and sending feedback:
- there is no layer-processing, queueing, or delay involved,
- hence the elapsed time is in the order of 1 function call
- this is in the dimension of maximally 50..100usec
- which renders the use of elapsed time almost entirely useless.
The fix is simply to ignore such trivial amounts of elapsed time.
As a further advantage, the now useless elapsed_time field can be removed from
the socket, which reduces the socket structure by another four bytes.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This updates the CCID3 code with regard to two instances of using `MSS' in place of `s':
1. The RFC3390-based initial rate: both rfc3448bis as well as the Faster Restart
draft now consistently use `s' instead of MSS.
2. Now agrees with section 4.2 of rfc3448bis: "If the sender is ready to send data when
it does not yet have a round trip sample, the value of X is set to s bytes per
second, for segment size s [...]"
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Many-many code in the kernel initialized the timer->function
and timer->data together with calling init_timer(timer). There
is already a helper for this. Use it for networking code.
The patch is HUGE, but makes the code 130 lines shorter
(98 insertions(+), 228 deletions(-)).
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Assigning initial values of `0' is redundant when loading a new CCID structure,
since in net/dccp/ccid.c the entire CCID structure is zeroed out prior to
initialisation in ccid_new():
struct ccid {
struct ccid_operations *ccid_ops;
char ccid_priv[0];
};
// ...
if (rx) {
memset(ccid + 1, 0, ccid_ops->ccid_hc_rx_obj_size);
if (ccid->ccid_ops->ccid_hc_rx_init != NULL &&
ccid->ccid_ops->ccid_hc_rx_init(ccid, sk) != 0)
goto out_free_ccid;
} else {
memset(ccid + 1, 0, ccid_ops->ccid_hc_tx_obj_size);
/* analogous to the rx case */
}
This patch therefore removes the redundant assignments. Thanks to Arnaldo for
the inspiration.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
This fixes `unaligned (read) access' errors of the type
Kernel unaligned access at TPC[100f970c] dccp_parse_options+0x4f4/0x7e0 [dccp]
Kernel unaligned access at TPC[1011f2e4] ccid3_hc_tx_parse_options+0x1ac/0x380 [dccp_ccid3]
Kernel unaligned access at TPC[100f9898] dccp_parse_options+0x680/0x880 [dccp]
by using the get_unaligned macro for parsing options.
Commiter note: Preserved the sparse __be{16,32} annotations.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
This replaces several uses of standard arithmetic with the DCCP
sequence number arithmetic functions. The problem here is that the
sequence number wrap-around was not taken into consideration.
* Condition "seqp->ccid2s_seq <= prev->ccid2s_seq" has been replaced
by
dccp_delta_seqno(seqp->ccid2s_seq, prev->ccid2s_seq) >= 0
since if seqp is `before' prev, then the delta_seqno() is positive.
* The test whether sequence numbers `a' and `b' are consecutive has
the form
dccp_delta_seqno(a, b) == 1
* Increment of ccid2hctx_rpseq could be done using dccp_inc_seqno(),
but since here the incremented ccid2hctx_rpseq == seqno, used
assignment instead.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
skb's passed to ccid2_hc_tx_send_packet() are headerless, the packet
type is decided later, in dccp_write_xmit(). Therefore the first test
of the switch/case block is always true, the others are never reached.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This removes a test for `val < 1' which would only have been triggered
when val < 0, due to a preceding test for 0. Fixed by using an
unsigned type for cwnd (as in TCP) instead.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This removes an ugly BUG_ON which has been pointed out by Arnaldo.
Instead of freezing up the machine, a `critical' message is now issued
to the system log.
There is potential of doing this more gracefully (eg. there are a few
internal variables which could be updated despite the lack of memory),
but that requires more complicated changes to the algorithm; thus a
`FIXME' has been added.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>