linux

Author	SHA1	Message	Date
Gerrit Renker	ddab05568e	dccp: Clean up slow-path input processing This patch rearranges the order of statements of the slow-path input processing (i.e. any other state than OPEN), to resolve the following issues. 1. Dependencies: the order of statements now better matches RFC 4340, 8.5, i.e. step 7 is before step 9 (previously 9 was before 7), and parsing options in step 8 (which can consume resources) now comes after step 7. 2. Bug-fix: in state CLOSED, there should not be any sequence number checking or option processing. This is why the test for CLOSED has been moved after the test for LISTEN. 3. As before sequence number checks are omitted if in state LISTEN/REQUEST, due to the note underneath the table in RFC 4340, 7.5.3. 4. Packets are now passed on to Ack Vector / CCID processing only after - step 7 (receive unexpected packets), - step 9 (receive Reset), - step 13 (receive CloseReq), - step 14 (receive Close) and only if the state is PARTOPEN. This simplifies CCID processing: - in LISTEN/CLOSED the CCIDs are non-existent; - in RESPOND/REQUEST the CCIDs have not yet been negotiated; - in CLOSEREQ and active-CLOSING the node has already closed this socket; - in passive-CLOSING the client is waiting for its Reset. In the last case, RFC 4340, 8.3 leaves it open to ignore further incoming data, which is the approach taken here. As a result of (3), CCID processing is now indeed confined to OPEN/PARTOPEN states, i.e. congestion control is performed only on the flow of data packets. This avoids pathological cases of doing congestion control on those messages which set up and terminate the connection. I have done a few checks to see if this creates a problem in other parts of the code. This seems not to be the case; even if there were one, it would be better to fix it than to perform congestion control on Close/Request/Response messages. Similarly for Ack Vectors (as they depend on the negotiated CCID). Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>	2008-09-04 07:45:39 +02:00
Gerrit Renker	6224877b2c	tcp/dccp: Consolidate common code for RFC 3390 conversion This patch consolidates the code common to TCP and CCID-2: * TCP uses RFC 3390 in a packet-oriented manner (tcp_input.c) and * CCID-2 uses RFC 3390 in packet-oriented manner (RFC 4341). Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>	2008-09-04 07:45:39 +02:00
Gerrit Renker	b25b0c60b0	dccp: Combine the functionality of enqeueing and cloning Realising the following call pattern, * first dccp_entail() is called to enqueue a new skb and * then skb_clone() is called to transmit a clone of that skb, this patch integrates both interrelated steps into dccp_entail(). Note: the return value of skb_clone is not checked. It may be an idea to add a warning if this occurs. In both instances, however, a timer is set for retransmission, so that cloning is re-tried via dccp_retransmit_skb(). Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>	2008-09-04 07:45:39 +02:00
Gerrit Renker	20bbd0f75e	dccp ccid-2: Remove wrappers around sk_{reset,stop}_timer() This removes the wrappers around the sk timer functions as it makes the code clearer and not much is gained from using wrappers: the BUG_ON in start_rto_timer will never trigger since that function was called only when * the RTO timer expired (rto_expire, and then timer_pending() is false); * in tx_packet_sent only if !timer_pending() (BUG_ON is redundant here); * previously in new_ack, after stopping the timer (timer_pending() false). One further motive behind this patch is to replace the RTO timer with the icsk retransmission timer, as it is already part of the DCCP socket. Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>	2008-09-04 07:45:39 +02:00
Gerrit Renker	1435562d7e	dccp ccid-2: Replace broken RTT estimator with better algorithm The current CCID-2 RTT estimator code is in parts broken and lags behind the suggestions in RFC2988 of using scaled variants for SRTT/RTTVAR. That code is replaced by the present patch, which reuses the Linux TCP RTT estimator code - reasons for this code duplication are given below. Further details: ---------------- 1. The minimum RTO of previously one second has been replaced with TCP's, since RFC4341, sec. 5 says that the minimum of 1 sec. (suggested in RFC2988, 2.4) is not necessary. Instead, the TCP_RTO_MIN is used, which agrees with DCCP's concept of a default RTT (RFC 4340, 3.4). 2. The maximum RTO has been set to DCCP_RTO_MAX (64 sec), which agrees with RFC2988, (2.5). 3. De-inlined the function ccid2_new_ack(). 4. Added a FIXME: the RTT is sampled several times per Ack Vector, which will give the wrong estimate. It should be replaced with one sample per Ack. However, at the moment this can not be resolved easily, since - it depends on TX history code (which also needs some work), - the cleanest solution is not to use the `sent' time at all (saves 4 bytes per entry) and use DCCP timestamps / elapsed time to estimated the RTT, which however is non-trivial to get right (but needs to be done). Reasons for reusing the Linux TCP estimator algorithm: ------------------------------------------------------ Some time was spent to find a better alternative, using basic RFC2988 as a first step. Further analysis and experimentation showed that the Linux TCP RTO estimator is superior to a basic RFC2988 implementation. A summary is on http://www.erg.abdn.ac.uk/users/gerrit/dccp/notes/ccid2/rto_estimator/ In addition, this estimator fared well in a recent empirical evaluation: Rewaskar, Sushant, Jasleen Kaur and F. Donelson Smith. A Performance Study of Loss Detection/Recovery in Real-world TCP Implementations. Proceedings of 15th IEEE International Conference on Network Protocols (ICNP-07). 2007. Thus there is significant benefit in reusing the existing TCP code. Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>	2008-09-04 07:45:39 +02:00
Gerrit Renker	e9803c0104	dccp ccid-2: Simplify dec_pipe and rearming of RTO timer This removes the dec_pipe function and improves the way the RTO timer is rearmed when a new acknowledgment comes in. Details and justification for removal: -------------------------------------- 1) The BUG_ON in dec_pipe is never triggered: pipe is only decremented for TX history entries between tail and head, for which it had previously been incremented in tx_packet_sent; and it is not decremented twice for the same entry, since it is - either decremented when a corresponding Ack Vector cell in state 0 or 1 was received (and then ccid2s_acked==1), - or it is decremented when ccid2s_acked==0, as part of the loss detection in tx_packet_recv (and hence it can not have been decremented earlier). 2) Restarting the RTO timer happens for every single entry in each Ack Vector parsed by tx_packet_recv (according to RFC 4340, 11.4 this can happen up to 16192 times per Ack Vector). 3) The RTO timer should not be restarted when all outstanding data has been acknowledged. This is currently done similar to (2), in dec_pipe, when pipe has reached 0. The patch onsolidates the code which rearms the RTO timer, combining the segments from new_ack and dec_pipe. As a result, the code becomes clearer (compare with tcp_rearm_rto()). Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>	2008-09-04 07:45:38 +02:00
Gerrit Renker	c6f0f2e71f	dccp ccid-2: Remove redundant sanity tests This removes the ccid2_hc_tx_check_sanity function: it is redundant. Details: ======== The tx_check_sanity function performs three tests: 1) it checks that the circular TX list is sorted - in ascending order of sequence number (ccid2s_seq) - and time (ccid2s_sent), - in the direction from `tail' (hctx_seqt) to `head' (hctx_seqh); 2) it ensures that the entire list has the length seqbufc * CCID2_SEQBUF_LEN; 3) it ensures that pipe equals the number of packets that were not marked `acked' (ccid2s_acked) between `tail' and `head'. The following argues that each of these tests is redundant, this can be verified by going through the code. (1) is not necessary, since both time and GSS increase from one packet to the next, so that subsequent insertions in tx_packet_sent (which advance the `head' pointer) will be in ascending order of time and sequence number. In (2), the length of the list is always equal to seqbufc times CCID2_SEQBUF_LEN (set to 1024) unless allocation caused an earlier failure, because: * at initialisation (tx_init), there is one chunk of size 1024 and seqbufc=1; * subsequent calls to tx_alloc_seq take place whenever head->next == tail in tx_packet_sent; then a new chunk of size 1024 is inserted between head and tail, and seqbufc is incremented by one. To show that (3) is redundant requires looking at two cases. The `pipe' variable of the TX socket is incremented only in tx_packet_sent, and decremented in tx_packet_recv. When head == tail (TX history empty) then pipe should be 0, which is the case directly after initialisation and after a retransmission timeout has occurred (ccid2_hc_tx_rto_expire). The first case involves parsing Ack Vectors for packets recorded in the live portion of the buffer, between tail and head. For each packet marked by the receiver as received (state 0) or ECN-marked (state 1), pipe is decremented by one, so for all such packets the BUG_ON in tx_check_sanity will not trigger. The second case is the loss detection in the second half of tx_packet_recv, below the comment "Check for NUMDUPACK". The first while-loop here ensures that the sequence number of `seqp' is either above or equal to `high_ack', or otherwise equal to the highest sequence number sent so far (of the entry head->prev, as head points to the next unsent entry). The next while-loop ("while (1)") counts the number of acked packets starting from that position of seqp, going backwards in the direction from head->prev to tail. If NUMDUPACK=3 such packets were counted within this loop, `seqp' points to the last acknowledged packet of these, and the "if (done == NUMDUPACK)" block is entered next. The while-loop contained within that block in turn traverses the list backwards, from head to tail; the position of `seqp' is saved in the variable `last_acked'. For each packet not marked as `acked', a congestion event is triggered within the loop, and pipe is decremented. The loop terminates when `seqp' has reached `tail', whereupon tail is set to the position previously stored in `last_acked'. Thus, between `last_acked' and the previous position of `tail', - pipe has been decremented earlier if the packet was marked as state 0 or 1; - pipe was decremented if the packet was not marked as acked. That is, pipe has been decremented by the number of packets between `last_acked' and the previous position of `tail'. As a consequence, pipe now again reflects the number of packets which have not (yet) been acked between the new position of tail (at `last_acked') and head->prev, or 0 if head==tail. The result is that the BUG_ON condition in check_sanity will also not be triggered, hence the test (3) is also redundant. Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>	2008-09-04 07:45:38 +02:00
Gerrit Renker	83337dae6c	dccp ccid-2: Stop polling This updates CCID2 to use the CCID dequeuing mechanism, converting from previous constant-polling to a now event-driven mechanism. Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>	2008-09-04 07:45:38 +02:00
Gerrit Renker	146993cf51	dccp: Refine the wait-for-ccid mechanism This extends the existing wait-for-ccid routine so that it may be used with different types of CCID. It further addresses the problems listed below. The code looks if the write queue is non-empty and grants the TX CCID up to `timeout' jiffies to drain the queue. It will instead purge that queue if * the delay suggested by the CCID exceeds the time budget; * a socket error occurred while waiting for the CCID; * there is a signal pending (eg. annoyed user pressed Control-C); * the CCID does not support delays (we don't know how long it will take). D e t a i l s [can be removed] ------------------------------- DCCP's sending mechanism functions a bit like non-blocking I/O: dccp_sendmsg() will enqueue up to net.dccp.default.tx_qlen packets (default=5), without waiting for them to be released to the network. Rate-based CCIDs, such as CCID3/4, can impose sending delays of up to maximally 64 seconds (t_mbi in RFC 3448). Hence the write queue may still contain packets when the application closes. Since the write queue is congestion-controlled by the CCID, draining the queue is also under control of the CCID. There are several problems that needed to be addressed: 1) The queue-drain mechanism only works with rate-based CCIDs. If CCID2 for example has a full TX queue and becomes network-limited just as the application wants to close, then waiting for CCID2 to become unblocked could lead to an indefinite delay (i.e., application "hangs"). 2) Since each TX CCID in turn uses a feedback mechanism, there may be changes in its sending policy while the queue is being drained. This can lead to further delays during which the application will not be able to terminate. 3) The minimum wait time for CCID3/4 can be expected to be the queue length times the current inter-packet delay. For example if tx_qlen=100 and a delay of 15 ms is used for each packet, then the application would have to wait for a minimum of 1.5 seconds before being allowed to exit. 4) There is no way for the user/application to control this behaviour. It would be good to use the timeout argument of dccp_close() as an upper bound. Then the maximum time that an application is willing to wait for its CCIDs to can be set via the SO_LINGER option. These problems are addressed by giving the CCID a grace period of up to the `timeout' value. The wait-for-ccid function is, as before, used when the application (a) has read all the data in its receive buffer and (b) if SO_LINGER was set with a non-zero linger time, or (c) the socket is either in the OPEN (active close) or in the PASSIVE_CLOSEREQ state (client application closes after receiving CloseReq). In addition, there is a catch-all case by calling __skb_queue_purge() after waiting for the CCID. This is necessary since the write queue may still have data when (a) the host has been passively-closed, (b) abnormal termination (unread data, zero linger time), (c) wait-for-ccid could not finish within the given time limit. Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>	2008-09-04 07:45:38 +02:00
Gerrit Renker	e7937772d7	dccp: Extend CCID packet dequeueing interface This extends the packet dequeuing interface of dccp_write_xmit() to allow 1. CCIDs to take care of timing when the next packet may be sent; 2. delayed sending (as before, with an inter-packet gap up to 65.535 seconds). The main purpose is to take CCID2 out of its polling mode (when it is network- limited, it tries every millisecond to send, without interruption). The interface can also be used to support other CCIDs. The mode of operation for (2) is as follows: * new packet is enqueued via dccp_sendmsg() => dccp_write_xmit(), * ccid_hc_tx_send_packet() detects that it may not send (e.g. window full), * it signals this condition via `CCID_PACKET_WILL_DEQUEUE_LATER', * dccp_write_xmit() returns without further action; * after some time the wait-condition for CCID becomes true, * that CCID schedules the tasklet, * tasklet function calls ccid_hc_tx_send_packet() via dccp_write_xmit(), * since the wait-condition is now true, ccid_hc_tx_packet() returns "send now", * packet is sent, and possibly more (since dccp_write_xmit() loops). Code reuse: the taskled function calls dccp_write_xmit(), the timer function reduces to a wrapper around the same code. If the tasklet finds that the socket is locked, it re-schedules the tasklet function (not the tasklet) after one jiffy. Changed DCCP_BUG to dccp_pr_debug when transmit_skb returns an error (e.g. when a local qdisc is used, NET_XMIT_DROP=1 can be returned for many packets). Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>	2008-09-04 07:45:38 +02:00
Gerrit Renker	f4a66ca4d2	dccp: Return-value convention of hc_tx_send_packet() This patch reorganises the return value convention of the CCID TX sending function, to permit more flexible schemes, as required by subsequent patches. Currently the convention is * values < 0 mean error, * a value == 0 means "send now", and * a value x > 0 means "send in x milliseconds". The patch provides symbolic constants and a function to interpret return values. In addition, it caps the maximum positive return value to 0xFFFF milliseconds, corresponding to 65.535 seconds. This is possible since in CCID-3 the maximum inter-packet gap is t_mbi = 64 sec. Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>	2008-09-04 07:45:38 +02:00
Gerrit Renker	c8bf462bc5	dccp ccid-2: Separate option parsing from CCID processing This patch replaces an almost identical replication of code: large parts of dccp_parse_options() re-appeared as ccid2_ackvector() in ccid2.c. Apart from the duplication, this caused two more problems: 1. CCIDs should not need to be concerned with parsing header options; 2. one can not assume that Ack Vectors appear as a contiguous area within an skb, it is legal to insert other options and/or padding in between. The current code would throw an error and stop reading in such a case. The patch provides a new data structure and associated list housekeeping. Only small changes were necessary to integrate with CCID-2: data structure initialisation, adapt list traversal routine, and add call to the provided cleanup routine. The latter also lead to fixing the following BUG: CCID-2 so far ignored Ack Vectors on all packets other than Ack/DataAck, which is incorrect, since Ack Vectors can be present on any packet that has an Ack field. Details: -------- * received Ack Vectors are parsed by dccp_parse_options() alone, which passes the result on to the CCID-specific routine ccid_hc_tx_parse_options(); * CCIDs interested in using/decoding Ack Vector information will add code to fetch parsed Ack Vectors via this interface; * a data structure, `struct dccp_ackvec_parsed' is provided as interface; * this structure arranges Ack Vectors of the same skb into a FIFO order; * a doubly-linked list is used to keep the required FIFO code small. Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>	2008-09-04 07:45:37 +02:00
Gerrit Renker	5a577b488f	dccp ccid-2: Remove old infrastructure This removes * functions for which updates have been provided in the preceding patches and * the @av_vec_len field - it is no longer necessary since the buffer length is now always computed dynamically; * conditional debugging code (CONFIG_IP_DCCP_ACKVEC). The reason for removing the conditional debugging code is that Ack Vectors are an almost inevitable necessity - RFC 4341 says that for CCID-2, Ack Vectors must be used. Furthermore, the code would be only interesting for coding - after some extensive testing with this patch set, having the debug code around is no longer of real help. Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>	2008-09-04 07:45:37 +02:00
Gerrit Renker	c2f42077bd	dccp ccid-2: Schedule Sync as out-of-band mechanism The problem with Ack Vectors is that i) their length is variable and can in principle grow quite large, ii) it is hard to predict exactly how large they will be. Due to the second point it seems not a good idea to reduce the MPS; in particular when on average there is enough room for the Ack Vector and an increase in length is momentarily due to some burst loss, after which the Ack Vector returns to its normal/average length. The solution taken by this patch is to subtract a minimum-expected Ack Vector length from the MPS (previous patch), and to defer any larger Ack Vectors onto a separate Sync - but only if indeed there is no space left on the skb. This patch provides the infrastructure to schedule Sync-packets for transporting (urgent) out-of-band data. Its signalling is quicker than scheduling an Ack, since it does not need to wait for new application data. It can thus serve other parts of the DCCP code as well. Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>	2008-09-04 07:45:37 +02:00
Gerrit Renker	283fb4a5f3	dccp ccid-2: Consolidate Ack-Vector processing within main DCCP module This aggregates Ack Vector processing (handling input and clearing old state) into one function, for the following reasons and benefits: * all Ack Vector-specific processing is now in one place; * duplicated code is removed; * ensuring sanity: from an Ack Vector point of view, it is better to clear the old state first before entering new state; * Ack Event handling happens mostly within the CCIDs, not the main DCCP module. Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>	2008-09-04 07:45:37 +02:00
Gerrit Renker	e28fe59f9c	dccp ccid-2: Update code for the Ack Vector input/registration routine This patch uupdates the code which registers new packets as received, using the new circular buffer interface. It contributes a new algorithm which * supports both tail/head pointers and buffer wrap-around and * deals with overflow (head/tail move in lock-step). The updated code is also partioned differently, into 1. dealing with the empty buffer, 2. adding new packets into non-empty buffer, 3. reserving space when encountering a `hole' in the sequence space, 4. updating old state and deciding when old state is irrelevant. Protection against large burst losses: With regard to (3), it is too costly to reserve space when there are large bursts of losses. When bursts get too large, the code does no longer reserve space and just fills in cells normally. This measure reduces space consumption by a factor of 63. The code reuses in part the previous implementation by Arnaldo de Melo. Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>	2008-09-04 07:45:37 +02:00
Gerrit Renker	68b1de1576	dccp ccid-2: Algorithm to update buffer state This provides a routine to consistently update the buffer state when the peer acknowledges receipt of Ack Vectors; updating state in the list of Ack Vectors as well as in the circular buffer. While based on RFC 4340, several additional (and necessary) precautions were added to protect the consistency of the buffer state. These additions are essential, since analysis and experience showed that the basic algorithm was insufficient for this task (which lead to problems that were hard to debug). The algorithm now * deals with HC-sender acknowledging to HC-receiver and vice versa, * keeps track of the last unacknowledged but received seqno in tail_ackno, * has special cases to reset the overflow condition when appropriate, * is protected against receiving older information (would mess up buffer state). Note: The older code performed an unnecessary step, where the sender cleared Ack Vector state by parsing the Ack Vector received by the HC-receiver. Doing this was entirely redundant, since * the receiver always puts the full acknowledgment window (groups 2,3 in 11.4.2) into the Ack Vectors it sends; hence the HC-receiver is only interested in the highest state that the HC-sender received; * this means that the acknowledgment number on the (Data)Ack from the HC-sender is sufficient; and work done in parsing earlier state is not necessary, since the later state subsumes the earlier one (see also RFC 4340, A.4). This older interface (dccp_ackvec_parse()) is therefore removed. Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>	2008-09-04 07:45:37 +02:00
Gerrit Renker	d7dc7e5f49	dccp ccid-2: Implementation of circular Ack Vector buffer with overflow handling This completes the implementation of a circular buffer for Ack Vectors, by extending the current (linear array-based) implementation. The changes are: (a) An `overflow' flag to deal with the case of overflow. As before, dynamic growth of the buffer will not be supported; but code will be added to deal robustly with overflowing Ack Vector buffers. (b) A `tail_seqno' field. When naively implementing the algorithm of Appendix A in RFC 4340, problems arise whenever subsequent Ack Vector records overlap, which can bring the entire run length calculation completely out of synch. (This is documented on http://www.erg.abdn.ac.uk/users/gerrit/dccp/notes/\ ack_vectors/tracking_tail_ackno/ .) (c) The buffer lengthi is now computed dynamically (i.e. current fill level), as the span between head to tail. As a result, dccp_ackvec_pending() is now simpler - the #ifdef is no longer necessary since buf_empty is always true when IP_DCCP_ACKVEC is not configured. Note on overflow handling: ------------------------- The Ack Vector code previously simply started to drop packets when the Ack Vector buffer overflowed. This means that the userspace application will not be able to receive, only because of an Ack Vector storage problem. Furthermore, overflow may be transient, so that applications may later recover from the overflow. Recovering from dropped packets is more difficult (e.g. video key frames). Hence the patch uses a different policy: when the buffer overflows, the oldest entries are subsequently overwritten. This has a higher chance of recovery. Details are on http://www.erg.abdn.ac.uk/users/gerrit/dccp/notes/ack_vectors/ Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>	2008-09-04 07:45:36 +02:00
Gerrit Renker	4829007c7b	dccp ccid-2: Separate internals of Ack Vectors from option-parsing code This patch * separates Ack Vector housekeeping code from option-insertion code; * shifts option-specific code from ackvec.c into options.c; * introduces a dedicated routine to take care of the Ack Vector records; * simplifies the dccp_ackvec_insert_avr() routine: the BUG_ON was redundant, since the list is automatically arranged in descending order of ack_seqno. Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>	2008-09-04 07:45:36 +02:00
Gerrit Renker	ff49e27089	dccp ccid-2: Ack Vector interface clean-up This patch brings the Ack Vector interface up to date. Its main purpose is to lay the basis for the subsequent patches of this set, which will use the new data structure fields and routines. There are no real algorithmic changes, rather an adaptation: (1) Replaced the static Ack Vector size (2) with a #define so that it can be adapted (with low loss / Ack Ratio, a value of 1 works, so 2 seems to be sufficient for the moment) and added a solution so that computing the ECN nonce will continue to work - even with larger Ack Vectors. (2) Replaced the #defines for Ack Vector states with a complete enum. (3) Replaced #defines to compute Ack Vector length and state with general purpose routines (inlines), and updated code to use these. (4) Added a `tail' field (conversion to circular buffer in subsequent patch). (5) Updated the (outdated) documentation for Ack Vector struct. (6) All sequence number containers now trimmed to 48 bits. (7) Removal of unused bits: * removed dccpav_ack_nonce from struct dccp_ackvec, since this is already redundantly stored in the `dccpavr_ack_nonce' (of Ack Vector record); * removed Elapsed Time for Ack Vectors (it was nowhere used); * replaced semantics of dccpavr_sent_len with dccpavr_ack_runlen, since the code needs to be able to remember the old run length; * reduced the de-/allocation routines (redundant / duplicate tests). Justification for removing Elapsed Time information [can be removed]: --------------------------------------------------------------------- 1. The Elapsed Time information for Ack Vectors was nowhere used in the code. 2. DCCP does not implement rate-based pacing of acknowledgments. The only recommendation for always including Elapsed Time is in section 11.3 of RFC 4340: "Receivers that rate-pace acknowledgements SHOULD [...] include Elapsed Time options". But such is not the case here. 3. It does not really improve estimation accuracy. The Elapsed Time field only records the time between the arrival of the last acknowledgeable packet and the time the Ack Vector is sent out. Since Linux does not (yet) implement delayed Acks, the time difference will typically be small, since often the arrival of a data packet triggers sending feedback at the HC-receiver. Justification for changes in de-/allocation routines [can be removed]: ---------------------------------------------------------------------- * INIT_LIST_HEAD in dccp_ackvec_record_new was redundant, since the list pointers were later overwritten when the node was added via list_add(); * dccp_ackvec_record_new() was called in a single place only; * calls to list_del_init() before calling dccp_ackvec_record_delete() were redundant, since subsequently the entire element was k-freed; * since all calls to dccp_ackvec_record_delete() were preceded to a call to list_del_init(), the WARN_ON test would never evaluate to true; * since all calls to dccp_ackvec_record_delete() were made from within list_for_each_entry_safe(), the test for avr == NULL was redundant; * list_empty() in ackvec_free was redundant, since the same condition is embedded in the loop condition of the subsequent list_for_each_entry_safe(). Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>	2008-09-04 07:45:36 +02:00
Gerrit Renker	b8c6bcee1d	dccp: Reduce noise in output and convert to ktime_t This fixes the problem that dccp_probe output can grow quite large without apparent benefit (many identical data points), creating huge files (up to over one Gigabyte for a few minutes' test run) which are very hard to post-process (in one instance it got so bad that gnuplot ate up all memory plus swap). The cause for the problem is that the kprobe is inserted into dccp_sendmsg(), which can be called in a polling-mode (whenever the TX queue is full due to congestion-control issues, EAGAIN is returned). This creates many very similar data points, i.e. the increase of processing time does not increase the quality/information of the probe output. The fix is to attach the probe to a different function -- write_xmit was chosen since it gets called continually (both via userspace and timer); an input-path function would stop sampling as soon as the other end stops sending feedback. For comparison the output file sizes for the same 20 second test run over a lossy link: * before / without patch: 118 Megabytes * after / with patch: 1.2 Megabytes and there was much less noise in the output. To allow backward compatibility with scripts that people use, the now-unused `size' field in the output has been replaced with the CCID identifier. This also serves for future compatibility - support for CCID2 is work in progress (depends on the still unfinished SRTT/RTTVAR updates). While at it, the update to ktime_t was also performed. Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk> Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>	2008-09-04 07:45:36 +02:00
Gerrit Renker	a9c1656ab1	dccp: Merge now-reduced connect_init() function After moving the assignment of GAR/ISS from dccp_connect_init() to dccp_transmit_skb(), the former function becomes very small, so that a merger with dccp_connect() suggests itself. Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>	2008-09-04 07:45:35 +02:00
Gerrit Renker	bfbddd085a	dccp: Fix the adjustments to AWL and SWL This fixes a problem and a potential loophole with regard to seqno/ackno validity: the problem is that the initial adjustments to AWL/SWL were only performed at the begin of the connection, during the handshake. Since the Sequence Window feature is always greater than Wmin=32 (7.5.2), it is however necessary to perform these adjustments at least for the first W/W' (variables as per 7.5.1) packets in the lifetime of a connection. This requirement is complicated by the fact that W/W' can change at any time during the lifetime of a connection. Therefore the consequence is to perform this safety check each time SWL/AWL are updated. A second problem solved by this patch is that the remote/local Sequence Window feature values (which set the bounds for AWL/SWL/SWH) are undefined until the feature negotiation has completed. During the initial handshake we have more stringent sequence number protection, the changes added by this patch effect that {A,S}W{L,H} are within the correct bounds at the instant that feature negotiation completes (since the SeqWin feature activation handlers call dccp_update_gsr/gss()). A detailed rationale is below -- can be removed from the commit message. 1. Server sequence number checks during initial handshake --------------------------------------------------------- The server can not use the fields of the listening socket for seqno/ackno checks and thus needs to store all relevant information on a per-connection basis on the dccp_request socket. This is a size-constrained structure and has currently only ISS (dreq_iss) and ISR (dreq_isr) defined. Adding further fields (SW{L,H}, AW{L,H}) would increase the size of the struct and it is questionable whether this will have any practical gain. The currently implemented solution is as follows. * receiving first Request: dccp_v{4,6}_conn_request sets ISR := P.seqno, ISS := dccp_v{4,6}_init_sequence() * sending first Response: dccp_v{4,6}_send_response via dccp_make_response() sets P.seqno := ISS, sets P.ackno := ISR * receiving retransmitted Request: dccp_check_req() overrides ISR := P.seqno * answering retransmitted Request: dccp_make_response() sets ISS += 1, otherwise as per first Response * completing the handshake: succeeds in dccp_check_req() for the first Ack where P.ackno == ISS (P.seqno is not tested) * creating child socket: ISS, ISR are copied from the request_sock This solution will succeed whenever the server can receive the Request and the subsequent Ack in succession, without retransmissions. If there is packet loss, the client needs to retransmit until this condition succeeds; it will otherwise eventually give up. Adding further fields to the request_sock could increase the robustness a bit, in that it would make possible to let a reordered Ack (from a retransmitted Response) pass. The argument against such a solution is that if the packet loss is not persistent and an Ack gets through, why not wait for the one answering the original response: if the loss is persistent, it is probably better to not start the connection in the first place. Long story short: the present design (by Arnaldo) is simple and will likely work just as well as a more complicated solution. As a consequence, {A,S}W{L,H} are not needed until the moment the request_sock is cloned into the accept queue. At that stage feature negotiation has completed, so that the values for the local and remote Sequence Window feature (7.5.2) are known, i.e. we are now in a better position to compute {A,S}W{L,H}. 2. Client sequence number checks during initial handshake --------------------------------------------------------- Until entering PARTOPEN the client does not need the adjustments, since it constrains the Ack window to the packet it sent. * sending first Request: dccp_v{4,6}_connect() choose ISS, dccp_connect() then sets GAR := ISS (as per 8.5), dccp_transmit_skb() (with the previous bug fix) sets GSS := ISS, AWL := ISS, AWH := GSS * n-th retransmitted Request (with previous patch): dccp_retransmit_skb() via timer calls dccp_transmit_skb(), which sets GSS := ISS+n and then AWL := ISS, AWH := ISS+n * receiving any Response: dccp_rcv_request_sent_state_process() -- accepts packet if AWL <= P.ackno <= AWH; -- sets GSR = ISR = P.seqno * sending the Ack completing the handshake: dccp_send_ack() calls dccp_transmit_skb(), which sets GSS += 1 and AWL := ISS, AWH := GSS Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>	2008-09-04 07:45:35 +02:00
Gerrit Renker	2975abd251	dccp: Schedule an Ack when receiving timestamps This schedules an Ack when receiving a timestamp, exploiting the existing inet_csk_schedule_ack() function, saving one case in the `dccp_ack_pending()' function. Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>	2008-09-04 07:45:35 +02:00
Gerrit Renker	d0995e6a9e	dccp ccid-3: Remove dead states This patch is thanks to an investigation by Leandro Sales de Melo and his colleagues. They worked out two state diagrams which highlight the fact that the xxx_TERM states in CCID-3/4 are in fact not necessary. And this can be confirmed by in turn looking at the code: the xxx_TERM states are only ever set in ccid3_hc_{rx,tx}_exit(). These two functions are part of the following call chain: * ccid_hc_{tx,rx}_exit() are called from ccid_delete() only; * ccid_delete() invokes ccid_hc_{tx,rx}_exit() in the way of a destructor: after calling ccid_hc_{tx,rx}_exit(), the CCID is released from memory; * ccid_delete() is in turn called only by ccid_hc_{tx,rx}_delete(); * ccid_hc_{tx,rx}_delete() is called only if - feature negotiation failed (dccp_feat_activate_values()), - when changing the RX/TX CCID (to eject the current CCID), - when destroying the socket (in dccp_destroy_sock()). In other words, when CCID-3 sets the state to xxx_TERM, it is at a time where no more processing should be going on, hence it is not necessary to introduce a dedicated exit state - this is implicit when unloading the CCID. The patch removes this state, one switch-statement collapses as a result. Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>	2008-09-04 07:45:35 +02:00
Gerrit Renker	5fe94963a1	dccp ccid-3: Remove duplicate documentation This removes RX-socket documentation which is either duplicate or non-existent. Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>	2008-09-04 07:45:35 +02:00
Gerrit Renker	c506d91d9a	dccp: Unused argument in CCID tx function This removes the argument `more' from ccid_hc_tx_packet_sent, since it was nowhere used in the entire code. (Anecdotally, this argument was not even used in the original KAME code where the function originally came from; compare the variable moreToSend in the freebsd61-dccp-kame-28.08.2006.patch now maintained by Emmanuel Lochin.) Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>	2008-09-04 07:45:35 +02:00
Gerrit Renker	f10ecaee6d	dccp: Replace magic CCID-specific numbers by symbolic constants The constants DCCPO_{MIN,MAX}_CCID_SPECIFIC are nowhere used in the code, but instead for the CCID-specific options numbers are used. This patch unifies the use of CCID-specific option numbers, by adding symbolic names reflecting the definitions in RFC 4340, 10.3. Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>	2008-09-04 07:45:34 +02:00
Gerrit Renker	ce177ae2e6	dccp ccid-3: Remove redundant 'options_received' struct The `options_received' struct is redundant, since it re-duplicates the existing `p' and `x_recv' fields. This patch removes the sub-struct and migrates the format conversion operations (cf. below) to ccid3_hc_tx_parse_options(). Why the fields are redundant ---------------------------- The Loss Event Rate p and the Receive Rate x_recv are initially 0 when first loading CCID-3, as ccid_new() zeroes out the entire ccid3_hc_tx_sock. When Loss Event Rate or Receive Rate options are received, they are stored by ccid3_hc_tx_parse_options() into the fields `ccid3or_loss_event_rate' and `ccid3or_receive_rate' of the sub-struct `options_received' in ccid3_hc_tx_sock. After parsing (considering only the established state - dccp_rcv_established()), the packet is passed on to ccid_hc_tx_packet_recv(). This calls the CCID-3 specific routine ccid3_hc_tx_packet_recv(), which performs the following copy operations between fields of ccid3_hc_tx_sock: * hctx->options_received.ccid3or_receive_rate is copied into hctx->x_recv, after scaling it for fixpoint arithmetic, by 2^64; * hctx->options_received.ccid3or_loss_event_rate is copied into hctx->p, considering the above special cases; in addition, a value of 0 here needs to be mapped into p=0 (when no Loss Event Rate option has been received yet). Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>	2008-09-04 07:45:34 +02:00
Gerrit Renker	535c55df13	dccp tfrc/ccid-3: Computing Loss Rate from Loss Event Rate This adds a function to take care of the following cases occurring in the computation of the Loss Rate p: * 1/(2^32-1) is mapped into 0% as per RFC 4342, 8.5; * 1/0 is mapped into the maximum of 100%; * we want to avoid that p = 1/x is rounded down to 0 when x is very large, since this means accidentally re-entering slow-start (indicated by p==0). In the last case, the minimum-resolution value of p is returned. Furthermore, a bug in ccid3_hc_rx_getsockopt is fixed (1/0 was mapped into ~0U), which now allows to consistently print the scaled p-values as printf("Loss Event Rate = %u.%04u %%\n", rx_info.tfrcrx_p / 10000, rx_info.tfrcrx_p % 10000); Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>	2008-09-04 07:45:34 +02:00
Gerrit Renker	3306c781ff	dccp: Add packet type information to CCID-specific option parsing This patch ... 1. adds packet type information to ccid_hc_{rx,tx}_parse_options(). This is necessary, since table 3 in RFC 4340, 5.8 leaves it to the CCIDs to state which options may (not) appear on what packet type. 2. adds such a check for CCID-3's {Loss Event, Receive} Rate as specified in RFC 4340 8.3 ("Receive Rate options MUST NOT be sent on DCCP-Data packets") and 8.5 ("Loss Event Rate options MUST NOT be sent on DCCP-Data packets"). 3. removes an unused argument `idx' from ccid_hc_{rx,tx}_parse_options(). This is also no longer necessary, since the CCID-specific option-parsing routines are passed every single parameter of the type-length-value option encoding. Also added documentation and made argument naming scheme consistent. Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>	2008-09-04 07:45:34 +02:00
Gerrit Renker	47a61e7b43	dccp ccid-3: Simplify and consolidate tx_parse_options This simplifies and consolidates the TX option-parsing code: 1. The Loss Intervals option is not currently used, so dead code related to this option is removed. I am aware of no plans to support the option, but if someone wants to implement it (e.g. for inter-op tests), it is better to start afresh than having to also update currently unused code. 2. The Loss Event and Receive Rate options have a lot of code in common (both are 32 bit, both have same length etc.), so this is consolidated. 3. The test against GSR is not necessary, because - on first loading CCID3, ccid_new() zeroes out all fields in the socket; - ccid3_hc_tx_packet_recv() treats 0 and ~0U equivalently, due to pinv = opt_recv->ccid3or_loss_event_rate; if (pinv == ~0U \|\| pinv == 0) hctx->p = 0; - as a result, the sequence number field is removed from opt_recv. Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>	2008-09-04 07:45:34 +02:00
Gerrit Renker	63b3a73bb8	dccp ccid-3: Remove ugly RTT-sampling history lookup This removes the RTT-sampling function tfrc_tx_hist_rtt(), since 1. it suffered from complex passing of return values (the return value both indicated successful lookup while the value doubled as RTT sample); 2. when for some odd reason the sample value equalled 0, this triggered a bug warning about "bogus Ack", due to the ambiguity of the return value; 3. on a passive host which has not sent anything the TX history is empty and thus will lead to unwanted "bogus Ack" warnings such as ccid3_hc_tx_packet_recv: server(e7b7d518): DATAACK with bogus ACK-28197148 ccid3_hc_tx_packet_recv: server(e7b7d518): DATAACK with bogus ACK-26641606. The fix is to replace the implicit encoding by performing the steps manually. Furthermore, the "bogus Ack" warning has been removed, since it can actually be triggered due to several reasons (network reordering, old packet, (3) above), hence it is not very useful. Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>	2008-09-04 07:45:34 +02:00
Gerrit Renker	de6f2b59e5	dccp ccid-3: Bug fix for the inter-packet scheduling algorithm This fixes a subtle bug in the calculation of the inter-packet gap and shows that t_delta, as it is currently used, is not needed. And hence replaced. The algorithm from RFC 3448, 4.6 below continually computes a send time t_nom, which is initialised with the current time t_now; t_gran = 1E6 / HZ specifies the scheduling granularity, s the packet size, and X the sending rate: t_distance = t_nom - t_now; // in microseconds t_delta = min(t_ipi, t_gran) / 2; // `delta' parameter in microseconds if (t_distance >= t_delta) { reschedule after (t_distance / 1000) milliseconds; } else { t_ipi = s / X; // inter-packet interval in usec t_nom += t_ipi; // compute the next send time send packet now; } 1) Description of the bug ------------------------- Rescheduling requires a conversion into milliseconds, due to this call chain: * ccid3_hc_tx_send_packet() returns a timeout in milliseconds, * this value is converted by msecs_to_jiffies() in dccp_write_xmit(), * and finally used as jiffy-expires-value for sk_reset_timer(). The highest jiffy resolution with HZ=1000 is 1 millisecond, so using a higher granularity does not make much sense here. As a consequence, values of t_distance < 1000 are truncated to 0. This issue has so far been resolved by using instead if (t_distance >= t_delta + 1000) reschedule after (t_distance / 1000) milliseconds; The bug is in artificially inflating t_delta to t_delta' = t_delta + 1000. This is unnecessarily large, a more adequate value is t_delta' = max(t_delta, 1000). 2) Consequences of using the corrected t_delta' ----------------------------------------------- Since t_delta <= t_gran/2 = 10^6/(2HZ), we have t_delta <= 1000 as long as HZ >= 500. This means that t_delta' = max(1000, t_delta) is constant at 1000. On the other hand, when using a coarse HZ value of HZ < 500, we have three sub-cases that can all be reduced to using another constant of t_gran/2. (a) The first case arises when t_ipi > t_gran. Here t_delta' is the constant t_delta' = max(1000, t_gran/2) = t_gran/2. (b) If t_ipi <= 2000 < t_gran = 10^6/HZ usec, then t_delta = t_ipi/2 <= 1000, so that t_delta' = max(1000, t_delta) = 1000 < t_gran/2. (c) If 2000 < t_ipi <= t_gran, we have t_delta' = max(t_delta, 1000) = t_ipi/2. In the second and third cases we have delay values less than t_gran/2, which is in the order of less than or equal to half a jiffy. How these are treated depends on how fractions of a jiffy are handled: they are either always rounded down to 0, or always rounded up to 1 jiffy (assuming non-zero values). In both cases the error is on average in the order of 50%. Thus we are not increasing the error when in the second/third case we replace a value less than t_gran/2 with 0, by setting t_delta' to the constant t_gran/2. 3) Summary ---------- Fixing (1) and considering (2), the patch replaces t_delta with a constant, whose value depends on CONFIG_HZ, changing the above algorithm to: if (t_distance >= t_delta') reschedule after (t_distance / 1000) milliseconds; where t_delta' = 10^6/(2HZ) if HZ < 500, and t_delta' = 1000 otherwise. Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>	2008-09-04 07:45:33 +02:00
Gerrit Renker	b2e317f4b5	dccp ccid-3: No more CCID control blocks in LISTEN state The CCIDs are activated as last of the features, at the end of the handshake, were the LISTEN state of the master socket is inherited into the server state of the child socket. Thus, the only states visible to CCIDs now are OPEN/PARTOPEN, and the closing states. This allows to remove tests which were previously necessary to protect against referencing a socket in the listening state (in CCID3), but which now have become redundant. As a further byproduct of enabling the CCIDs only after the connection has been fully established, several typecast-initialisations of ccid3_hc_{rx,tx}_sock can now be eliminated: * the CCID is loaded, so it is not necessary to test if it is NULL, * if it is possible to load a CCID and leave the private area NULL, then this is a bug, which should crash loudly - and earlier, * the test for state==OPEN \|\| state==PARTOPEN now reduces only to the closing phase (e.g. when the node has received an unexpected Reset). Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk> Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>	2008-09-04 07:45:33 +02:00
Gerrit Renker	842d1ef14f	dccp ccid-3: Remove ccid3hc{tx,rx}_ prefixes This patch does the same for CCID-3 as the previous patch for CCID-2: s#ccid3hctx_##g; s#ccid3hcrx_##g; plus manual editing to retain consistency. Please note: expanded the fields of the `struct tfrc_tx_info' in the hc_tx_sock, since using short #define identifiers is not a good idea. The only place where this embedded struct was used is ccid3_hc_tx_getsockopt(). Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>	2008-09-04 07:45:33 +02:00
Gerrit Renker	1fb8750960	dccp ccid-2: Remove ccid2hc{tx,rx}_ prefixes This patch fixes two problems caused by the ubiquitous long "hctx->ccid2htx_" and "hcrx->ccid2hcrx_" prefixes: * code becomes hard to read; * multiple-line statements are almost inevitable even for simple expressions; The prefixes are not really necessary (compare with "struct tcp_sock"). There had been previous discussion of this on dccp@vger, but so far this was not followed up (most people agreed that the prefixes are too long). Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk> Signed-off-by: Leandro Melo de Sales <leandroal@gmail.com>	2008-09-04 07:45:33 +02:00
Gerrit Renker	88ddac513a	dccp: Special case of the MPS for client-PARTOPEN with DataAcks To increase robustness, it is necessary to resend Confirm feature-negotiation options, even though the RFC does not mandate it. But feature negotiation options can take (much) more room than the options on common DataAck packets. Instead of reducing the MPS always for a case which only applies to the three messages send during initial handshake, this patch devises a special case: if the payload length of the DataAck in PARTOPEN is too large, an Ack is sent to carry the options, and the feature-negotiation list is then flushed. This means that the server gets two Acks for one Response. If both Acks get lost, it is probably better to restart the connection anyway and devising yet another special-case does not seem worth the extra complexity. The patch (over-)estimates the expected overhead to be 32*4 bytes -- commonly seen values were 20-90 bytes for initial feature-negotiation options. It uses sizeof(u32) to mean "aligned units of 4 bytes". For consistency, another use of sizeof is modified. Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>	2008-09-04 07:45:33 +02:00
Gerrit Renker	55ebe3ab2d	dccp: Leave headroom for options when calculating the MPS The Maximum Packet Size (MPS) is of interest for applications which want to transfer data, so it is only relevant to the data transfer phase of a connection (unless one wants to send data on the DCCP-Request, but that is not considered here). The strategy chosen to deal with this requirement is to leave room for only such options that may appear on data packets. A special consideration applies to Ack Vectors: this is purely guesswork, since these can have any length between 3 and 1020 bytes. The strategy chosen here is to subtract a configurable minimum, the value of 16 bytes (2 bytes for type/length plus 14 Ack Vector cells) has been found by experimentatation. If people experience this as too much or too little, this could later be turned into a Kconfig option. There are currently no CCID-specific header options which may appear on data packets, hence it is not necessary to define a corresponding CCID field. Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk> Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>	2008-09-04 07:45:33 +02:00
Gerrit Renker	2faae5587f	dccp ccid-2: Use feature-negotiation to report Ack Ratio changes This uses the new feature-negotiation framework to signal Ack Ratio changes, as required by RFC 4341, sec. 6.1.2. This raises some problems for CCID-2 since it can at the moment not cope gracefully with Ack Ratio of e.g. 2. A FIXME has thus been added which reverts to the existing policy of bypassing the Ack Ratio sysctl. Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk> Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>	2008-09-04 07:45:32 +02:00
Gerrit Renker	4861a35443	dccp: Support for exchanging of NN options in established state This patch provides support for the reception of NN options in (PART)OPEN state. It is a combination of change_recv() and confirm_recv(), specifically geared towards receiving the `fast-path' NN options. Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk> Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>	2008-09-04 07:45:32 +02:00
Gerrit Renker	624a965a93	dccp: Support for the exchange of NN options in established state In contrast to static feature negotiation at the begin of a connection, which establishes the capabilities of both endpoints, this patch introduces support for dynamic exchange of feature negotiation options. Such a dynamic exchange is necessary in at least two cases: * CCID-2's Ack Ratio (RFC 4341, 6.1.2) which changes during the connection; * Sequence Window values that, as per RFC 4340, 7.5.2, should be sent "as as the connection progresses". Both are NN (non-negotiable) features. Hence dynamic feature "negotiation" is distinguished from static/pre-connection negotiation by the following: * no new capabilities are negotiated (those that matter for the connection are negotiated prior to setting up the connection, comparable to SIP); * features must be understood by each endpoint: as per RFC 4340, 6.4, Sequence Window is "Req'd" and Ack Ratio must be understood when CCID-2 is used as per the note underneath Table 4. These characteristics are reflected in the implementation: * only NN options can be exchanged after connection setup; * NN options are activated directly after validating them. The rationale is that a peer must accept every valid NN value (RFC 4340, 6.3.2), hence it will either accept the value and send a "Confirm R", or it will send an empty Confirm (which will reset the connection according to FN rules). * An Ack is scheduled directly after activation to accelerate communicating the update to the peer. Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk> Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>	2008-09-04 07:45:32 +02:00
Gerrit Renker	76f738a795	dccp: Debugging functions for feature negotiation Since all feature-negotiation processing now takes place in feat.c, functions for producing verbose debugging output are concentrated there. New functions to print out values, entry records, and options are provided, and also a macro is defined to not always have the function name in the output line. Thanks a lot to Wei Yongjun and Giuseppe Galeota for help with errors in an earlier revision of this patch. Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk> Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>	2008-09-04 07:45:32 +02:00
Gerrit Renker	0a4822679d	dccp: Initialisation and type-checking of feature sysctls This patch takes care of initialising and type-checking sysctls related to feature negotiation. Type checking is important since some of the sysctls now directly act on the feature-negotiation process. The sysctls are initialised with the known default values for each feature. For the type-checking the value constraints from RFC 4340 are used: * Sequence Window uses the specified Wmin=32, the maximum is ulong (4 bytes), tested and confirmed that it works up to 4294967295 - for Gbps speed; * Ack Ratio is between 0 .. 0xffff (2-byte unsigned integer); * CCIDs are between 0 .. 255; * request_retries, retries1, retries2 also between 0..255 for good measure; * tx_qlen is checked to be non-negative; * sync_ratelimit remains as before. Further changes: ---------------- Performed s@sysctl_dccp_feat@sysctl_dccp@g since the sysctls are now in feat.c. Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk> Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>	2008-09-04 07:45:32 +02:00
Gerrit Renker	51c7d4fa26	dccp: Implement both feature-local and feature-remote Sequence Window feature This adds full support for local/remote Sequence Window feature, from which the * sequence-number-validity (W) and * acknowledgment-number-validity (W') windows derive as specified in RFC 4340, 7.5.3. Specifically, the following changes are introduced: * integrated new socket fields into dccp_sk; * updated the update_gsr/gss routines with regard to these fields; * updated handler code: the Sequence Window feature is located at the TX side, so the local feature is meant if the handler-rx flag is false; * the initialisation of `rcv_wnd' in reqsk is removed, since - rcv_wnd is not used by the code anywhere; - sequence number checks are not done in the LISTEN state (cf. 7.5.3); - dccp_check_req checks the Ack number validity more rigorously; * the `struct dccp_minisock' became empty and is now removed. Until the handshake completes with activating negotiated values, the local/remote Sequence-Window values are undefined and thus can not reliably be estimated. This issue is addressed in a separate patch. Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk> Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>	2008-09-04 07:45:32 +02:00
Gerrit Renker	09856c1089	dccp: Auto-load (when supported) CCID plugins for negotiation This adds auto-loading of CCIDs (when module loading is enabled) for the purpose of feature negotiation. The problem with loading the CCIDs at the end of feature negotiation is that this would happen in software interrupt context. Besides, if the host advertises CCIDs during negotiation, it should have them ready to use, in case an agreeing peer wants to use it for the connection. Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk> Signed-off-by: Ian McDonald <ian.mcdonald@jandi.co.nz>	2008-09-04 07:45:31 +02:00
Gerrit Renker	5d3dac267a	dccp: Initialisation framework for feature negotiation This initialises feature negotiation from two tables, which are initialised from sysctls. As a novel feature, specifics of the implementation (e.g. currently short seqnos and ECN are not supported) are advertised for robustness. Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk> Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>	2008-09-04 07:45:31 +02:00
Gerrit Renker	b235dc4abb	dccp ccid-2: Phase out the use of boolean Ack Vector sysctl This removes the use of the sysctl and the minisock variable for the Send Ack Vector feature, which is now handled fully dynamically via feature negotiation; i.e. when CCID2 is enabled, Ack Vectors are automatically enabled (as per RFC 4341, 4.). Using a sysctl in parallel to this implementation would open the door to crashes, since much of the code relies on tests of the boolean minisock / sysctl variable. Thus, this patch replaces all tests of type if (dccp_msk(sk)->dccpms_send_ack_vector) /* ... / with if (dp->dccps_hc_rx_ackvec != NULL) / ... / The dccps_hc_rx_ackvec is allocated by the dccp_hdlr_ackvec() when feature negotiation concluded that Ack Vectors are to be used on the half-connection. Otherwise, it is NULL (due to dccp_init_sock/dccp_create_openreq_child), so that the test is a valid one. The activation handler for Ack Vectors is called as soon as the feature negotiation has concluded at the server when the Ack marking the transition RESPOND => OPEN arrives; * client after it has sent its ACK, marking the transition REQUEST => PARTOPEN. Adding the sequence number of the Response packet to the Ack Vector has been removed, since (a) connection establishment implies that the Response has been received; (b) the CCIDs only look at packets received in the (PART)OPEN state, i.e. this entry will always be ignored; (c) it can not be used for anything useful - to detect loss for instance, only packets received after the loss can serve as pseudo-dupacks. Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk> Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>	2008-09-04 07:45:31 +02:00
Gerrit Renker	68e074bfce	dccp: Remove manual influence on NDP Count feature Updating the NDP count feature is handled automatically now: * for CCID-2 it is disabled, since the code does not use NDP counts; * for CCID-3 it is enabled, as NDP counts are used to determine loss lengths. Allowing the user to change NDP values leads to unpredictable and failing behaviour, since it is then possible to disable NDP counts even when they are needed (e.g. in CCID-3). This means that only those user settings are sensible that agree with the values for Send NDP Count implied by the choice of CCID. But those settings are already activated by the feature negotiation (CCID dependency tracking), hence this form of support is redundant. At startup the initialisation of the NDP count feature is with the default value of 0, which is done implicitly by the zeroing-out of the socket when it is allocated. If the choice of CCID or feature negotiation enables NDP count, this will then be updated via the NDP activation handler. Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk> Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>	2008-09-04 07:45:31 +02:00
Gerrit Renker	78673e24df	dccp: Remove obsolete parts of the old CCID interface The TX/RX CCIDs of the minisock are now redundant: similar to the Ack Vector case, their value equals initially that of the sysctl, but at the end of feature negotiation may be something different. The old interface removed by this patch thus has been replaced by the newer interface to dynamically query the currently loaded CCIDs earlier in this patch set. Also removed the constructors for the TX CCID and the RX CCID, since the switch rx/non-rx is done by the handler in minisocks.c (and the handler is the only place in the code where CCIDs are loaded). Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk> Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>	2008-09-04 07:45:31 +02:00
Gerrit Renker	23479cbfd3	dccp: Clean up old feature-negotiation infrastructure The code removed by this patch is no longer referenced or used, the added lines update documentation and copyrights. Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk> Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>	2008-09-04 07:45:30 +02:00
Gerrit Renker	c49b22729f	dccp: Integration of dynamic feature activation - part 3 (client side) This integrates feature-activation in the client, with these details: 1. When dccp_parse_options() fails, the reset code is already set, request_sent _state_process() currently overrides this with `Packet Error', which is not intended - so changed to use the reset code set in dccp_parse_options(); 2. There was a FIXME to change the error code when dccp_ackvec_add() fails. I have looked this up and found that: * the check whether ackno < ISN is already made earlier, * this Response is likely the 1st packet with an Ackno that the client gets, * so when dccp_ackvec_add() fails, the reason is likely not a packet error. 3. When feature negotiation fails, the socket should be marked as not usable, so that the application is notified that an error occurs. This is achieved by a new label, which uses an error code of `Aborted' and which sets the socket state to CLOSED, as well as sk_err. 4. Avoids parsing the Ack twice in Respond state by not doing option processing again in dccp_rcv_respond_partopen_state_process (as option processing has already been done on the request_sock in dccp_check_req). Since this addresses congestion-control initialisation, a corresponding FIXME has been removed. Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk> Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>	2008-09-04 07:45:30 +02:00
Gerrit Renker	e70cacb90d	dccp: Integration of dynamic feature activation - part 2 (server side) This patch integrates the activation of features at the end of negotiation into the server-side code. Note: In dccp_create_openreq_child the request_sock argument is no longer constant, since dccp_activate_values() uses the feature-negotiation list on dreq to sort out the initialisation values for the different features of the child socket; and purges this queue after use (but the `req' argument to openreq_child can and does still remain constant). Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk> Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>	2008-09-04 07:45:30 +02:00
Gerrit Renker	3a53a9adfa	dccp: Integration of dynamic feature activation - part 1 (socket setup) This first patch out of three replaces the hardcoded default settings with initialisation code for the dynamic feature negotiation. Note on retransmitting Confirm options: --------------------------------------- This patch also defers flushing the client feature-negotiation queue, due to the following considerations. As long as the client is in PARTOPEN, it needs to retransmit the Confirm options for the Change options received on the DCCP-Response from the server. Otherwise, if the packet containing the Confirm options gets dropped in the network, the connection aborts due to undefined feature negotiation state. Thanks to Leandro Melo de Sales who reported a bug in an earlier revision of the patch set, resulting from not retransmitting the Confirm options. The patch now ensures that the client feature-negotiation queue is flushed only when entering the OPEN state. Since confirmed Change options are removed as soon as they are confirmed (in the DCCP-Response), this ensures that Confirm options are retransmitted. Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk> Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>	2008-09-04 07:45:30 +02:00
Gerrit Renker	c926c6aed3	dccp: Feature activation handlers This patch provides the post-processing of feature negotiation state, after the negotiation has completed. To this purpose, handlers are used and added to the dccp_feat_table. Each handler is passed a boolean flag whether the RX or TX side of the feature is meant. Several handlers are provided already, new handlers can easily be added. The initialisation is now fully dynamic, i.e. CCIDs are activated only after the feature negotiation. The integration of this dynamic activation is done in the subsequent patches. Thanks to Wei Yongjun for pointing out the necessity of skipping over empty Confirm options while copying the negotiated feature values. Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk> Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>	2008-09-04 07:45:30 +02:00
Gerrit Renker	d2150b7bff	dccp: Processing Confirm options Analogous to the previous patch, this adds code to interpret incoming Confirm feature-negotiation options. Both functions operate on the feature-negotiation list of either the request_sock (server) or the dccp_sock (client). Thanks to Wei Yongjun for pointing out that it is overly restrictive to check the entire list of confirmed SP values. Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk> Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>	2008-09-04 07:45:29 +02:00
Gerrit Renker	5a146b97d5	dccp: Process incoming Change feature-negotiation options This adds/replaces code for processing incoming ChangeL/R options. The main difference is that: * mandatory FN options are now interpreted inside the function (there are too many individual cases to do this externally); * the function returns an appropriate Reset code or 0, which is then used to fill in the data for the Reset packet. Old code, which is no longer used or referenced, has been removed. Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>	2008-09-04 07:45:29 +02:00
Gerrit Renker	c664d4f4e2	dccp: Preference list reconciliation This provides two functions to * reconcile preference lists (with appropriate return codes) and * reorder the preference list if successful reconciliation changed the preferred value. The patch also removes the old code for processing SP/NN Change options, since new code to process these is mostly there already; related references have been commented out. The code for processing Change options follows in the next patch. Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk> Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>	2008-09-04 07:45:29 +02:00
Gerrit Renker	f8a644c07e	dccp: Integrate feature-negotiation insertion code The patch implements insertion of feature negotiation at the server (listening and request socket) and the client (connecting socket). In dccp_insert_options(), several statements have been grouped together now to achieve (I hope) better efficiency by reducing the number of tests each packet has to go through: - Ack Vectors are sent if the packet is neither a Data or a Request packet; - a previous issue is corrected - feature negotiation options are allowed on DataAck packets (5.8). Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk> Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>	2008-09-04 07:45:29 +02:00
Gerrit Renker	0ef118a017	dccp: Insert feature-negotiation options into skb This patch replaces the earlier insertion routine from options.c, so that code specific to feature negotiation can remain in feat.c. This is possible by calling a function already existing in options.c. Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk> Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>	2008-09-04 07:45:29 +02:00
Gerrit Renker	cf9ddf73b9	dccp: Header option insertion routine for feature-negotiation The patch extends existing code: * Confirm options divide into the confirmed value plus an optional preference list for SP values. Previously only the preference list was echoed for SP values, now the confirmed value is added as per RFC 4340, 6.1; * length and sanity checks are added to avoid illegal memory (or NULL) access. Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk> Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>	2008-09-04 07:45:29 +02:00
Gerrit Renker	d0440ee6f6	dccp: Support for Mandatory options Support for Mandatory options is provided by this patch, which will be used by subsequent feature-negotiation patches. Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk> Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz> Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>	2008-09-04 07:45:28 +02:00
Gerrit Renker	b9aaac1c53	dccp: Increase the scope of variable-length htonl/ntohl functions This extends the scope of two available functions, encode\|decode_value_var, to work up to 6 (8) bytes, to match maximum requirements in the RFC. These functions are going to be used both by general option processing and feature negotiation code, hence declarations have been put into feat.h. Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk> Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz> Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>	2008-09-04 07:45:28 +02:00
Gerrit Renker	c8041e264b	dccp: API to query the current TX/RX CCID This provides function to query the current TX/RX CCID dynamically, without reliance on the minisock value, using dynamic information available in the currently loaded CCID module. This query function is then used to (a) provide the getsockopt part for getting/setting CCIDs via sockopts; (b) replace the current test for "which CCID is in use" in probe.c. Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk> Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>	2008-09-04 07:45:28 +02:00
Gerrit Renker	fade756f18	dccp: Set per-connection CCIDs via socket options With this patch, TX/RX CCIDs can now be changed on a per-connection basis, which overrides the defaults set by the global sysctl variables for TX/RX CCIDs. To make full use of this facility, the remaining patches of this patch set are needed, which track dependencies and activate negotiated feature values. Note on the maximum number of CCIDs that can be registered: ----------------------------------------------------------- The maximum number of CCIDs that can be registered on the socket is constrained by the space in a Confirm/Change feature negotiation option. The space in these in turn depends on the size of header options as defined in RFC 4340, 5.8. Since this is a recurring constant, it has been moved from ackvec.h into linux/dccp.h, clarifying its purpose. Relative to this size, the maximum number of CCID identifiers that can be present in a Confirm option (which always consumes 1 byte more than a Change option, cf. 6.1) is 2 bytes less than the maximum TLV size: one for the CCID-feature-type and one for the selected value. Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>	2008-09-04 07:45:28 +02:00
Gerrit Renker	73bbe095bb	dccp: Tidy up setsockopt calls This splits the setsockopt calls into two groups, depending on whether an integer argument (val) is required and whether routines being called do their own locking. Some options (such as setting the CCID) use u8 rather than int, so that for these the test with regard to integer-sizeof can not be used. The second switch-case statement now only has those statements which need locking and which make use of `val'. Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk> Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz> Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com> Reviewed-by: Eugene Teo <eugeneteo@kernel.sg>	2008-09-04 07:45:28 +02:00
Gerrit Renker	17c30b40ed	dccp: Deprecate Ack Ratio sysctl This patch deprecates the Ack Ratio sysctl, since * Ack Ratio is entirely ignored by CCID-3 and CCID-4, * Ack Ratio currently doesn't work in CCID-2 (i.e. is always set to 1); * even if it would work in CCID-2, there is no point for a user to change it: - Ack Ratio is constrained by cwnd (RFC 4341, 6.1.2), - if Ack Ratio > cwnd, the system resorts to spurious RTO timeouts (since waiting for Acks which will never arrive in this window), - cwnd is not a user-configurable value. The only reasonable place for Ack Ratio is to print it for debugging. It is planned to do this later on, as part of e.g. dccp_probe. With this patch Ack Ratio is now under full control of feature negotiation: * Ack Ratio is resolved as a dependency of the selected CCID; * if the chosen CCID supports it (i.e. CCID == CCID-2), Ack Ratio is set to the default of 2, following RFC 4340, 11.3 - "New connections start with Ack Ratio 2 for both endpoints"; * what happens then is part of another patch set, since it concerns the dynamic update of Ack Ratio while the connection is in full flight. Thanks to Tomasz Grobelny for discussion leading up to this patch. Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk> Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>	2008-09-04 07:45:28 +02:00
Gerrit Renker	20f41eee82	dccp: Feature negotiation for minimum-checksum-coverage This provides feature negotiation for server minimum checksum coverage which so far has been missing. Since sender/receiver coverage values range only from 0...15, their type has also been reduced in size from u16 to u4. Feature-negotiation options are now generated for both sender and receiver coverage, i.e. when the peer has `forgotten' to enable partial coverage then feature negotiation will automatically enable (negotiate) the partial coverage value for this connection. Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk> Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>	2008-09-04 07:45:27 +02:00
Gerrit Renker	668144f7b4	dccp: Deprecate old setsockopt framework The previous setsockopt interface, which passed socket options via struct dccp_so_feat, is complicated/difficult to use. Continuing to support it leads to ugly code since the old approach did not distinguish between NN and SP values. This patch removes the old setsockopt interface and replaces it with two new functions to register NN/SP values for feature negotiation. These are essentially wrappers around the internal __feat_register functions, with checking added to avoid * wrong usage (type); * changing values while the connection is in progress. Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>	2008-09-04 07:45:27 +02:00
Gerrit Renker	d4c8741c43	dccp: Mechanism to resolve CCID dependencies This adds a hook to resolve features whose value depends on the choice of CCID. It is done at the server since it can only be done after the CCID values have been negotiated; i.e. the client will add its CCID preference list on the Change options sent in the Request, which will be reconciled with the local preference list of the server. The concept is documented on http://www.erg.abdn.ac.uk/users/gerrit/dccp/notes/feature_negotiation/\ implementation_notes.html#ccid_dependencies Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk> Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>	2008-09-04 07:45:27 +02:00
Gerrit Renker	093e1f46cf	dccp: Resolve dependencies of features on choice of CCID This provides a missing link in the code chain, as several features implicitly depend and/or rely on the choice of CCID. Most notably, this is the Send Ack Vector feature, but also Ack Ratio and Send Loss Event Rate (also taken care of). For Send Ack Vector, the situation is as follows: * since CCID2 mandates the use of Ack Vectors, there is no point in allowing endpoints which use CCID2 to disable Ack Vector features such a connection; * a peer with a TX CCID of CCID2 will always expect Ack Vectors, and a peer with a RX CCID of CCID2 must always send Ack Vectors (RFC 4341, sec. 4); * for all other CCIDs, the use of (Send) Ack Vector is optional and thus negotiable. However, this implies that the code negotiating the use of Ack Vectors also supports it (i.e. is able to supply and to either parse or ignore received Ack Vectors). Since this is not the case (CCID-3 has no Ack Vector support), the use of Ack Vectors is here disabled, with a comment in the source code. An analogous consideration arises for the Send Loss Event Rate feature, since the CCID-3 implementation does not support the loss interval options of RFC 4342. To make such use explicit, corresponding feature-negotiation options are inserted which signal the use of the loss event rate option, as it is used by the CCID3 code. Lastly, the values of the Ack Ratio feature are matched to the choice of CCID. The patch implements this as a function which is called after the user has made all other registrations for changing default values of features. The table is variable-length, the reserved (and hence for feature-negotiation invalid, confirmed by considering section 19.4 of RFC 4340) feature number `0' is used to mark the end of the table. Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk> Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>	2008-09-04 07:45:27 +02:00
Gerrit Renker	71bb49596b	dccp: Query supported CCIDs This provides a data structure to record which CCIDs are locally supported and three accessor functions: - a test function for internal use which is used to validate CCID requests made by the user; - a copy function so that the list can be used for feature-negotiation; - documented getsockopt() support so that the user can query capabilities. The data structure is a table which is filled in at compile-time with the list of available CCIDs (which in turn depends on the Kconfig choices). Using the copy function for cloning the list of supported CCIDs is useful for feature negotiation, since the negotiation is now with the full list of available CCIDs (e.g. {2, 3}) instead of the default value {2}. This means negotiation will not fail if the peer requests to use CCID3 instead of CCID2. Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk> Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>	2008-09-04 07:45:27 +02:00
Gerrit Renker	86349c8d9c	dccp: Registration routines for changing feature values Two registration routines, for SP and NN features, are provided by this patch, replacing a previous routine which was used for both feature types. These are internal-only routines and therefore start with `__feat_register'. It further exports the known limits of Sequence Window and Ack Ratio as symbolic constants. Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk> Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>	2008-09-04 07:45:27 +02:00
Gerrit Renker	5591d28628	dccp: Limit feature negotiation to connection setup phase This patch starts the new implementation of feature negotiation: 1. Although it is theoretically possible to perform feature negotiation at any time (and RFC 4340 supports this), in practice this is prohibitively complex, as it requires to put traffic on hold for each new negotiation. 2. As a byproduct of restricting feature negotiation to connection setup, the feature-negotiation retransmit timer is no longer required. This part is now mapped onto the protocol-level retransmission. Details indicating why timers are no longer needed can be found on http://www.erg.abdn.ac.uk/users/gerrit/dccp/notes/feature_negotiation/\ implementation_notes.html This patch disables anytime negotiation, subsequent patches work out full feature negotiation support for connection setup. Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>	2008-09-04 07:45:27 +02:00
Gerrit Renker	702083839b	dccp: Cleanup routines for feature negotiation This inserts the required de-allocation routines for memory allocated by feature negotiation in the socket destructors, replacing dccp_feat_clean() in one instance. Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk> Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>	2008-09-04 07:45:26 +02:00
Gerrit Renker	828755cee0	dccp: Per-socket initialisation of feature negotiation This provides feature-negotiation initialisation for both DCCP sockets and DCCP request_sockets, to support feature negotiation during connection setup. It also resolves a FIXME regarding the congestion control initialisation. Thanks to Wei Yongjun for help with the IPv6 side of this patch. Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk> Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>	2008-09-04 07:45:26 +02:00
Gerrit Renker	3001fc0569	dccp: List management for new feature negotiation This adds list fields and list management functions for the new feature negotiation implementation. The new code is kept in parallel to the old code, until removed at the end of the patch set. Thanks to Arnaldo for suggestions to improve the code. Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk> Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>	2008-09-04 07:45:26 +02:00
Gerrit Renker	b4eec20637	dccp: Implement lookup table for feature-negotiation information A lookup table for feature-negotiation information, extracted from RFC 4340/42, is provided by this patch. All currently known features can be found in this table, along with their feature location, their default value, and type. Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk> Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>	2008-09-04 07:45:26 +02:00
Gerrit Renker	5c7c9451f1	dccp: Basic data structure for feature negotiation This patch prepares for the new and extended feature-negotiation routines. The following feature-negotiation data structures are provided: * a container for the various (SP or NN) values, * symbolic state names to track feature states, * an entry struct which holds all current information together, * elementary functions to fill in and process these structures. Entry structs are arranged as FIFO for the following reason: RFC 4340 specifies that if multiple options of the same type are present, they are processed in the order of their appearance in the packet; which means that this order needs to be preserved in the local data structure (the later insertion code also respects this order). The struct list_head has been chosen for the following reasons: the most frequent operations are * add new entry at tail (when receiving Change or setting socket options); * delete entry (when Confirm has been received); * deep copy of entire list (cloning from listening socket onto request socket). The NN value has been set to 64 bit, which is a currently sufficient upper limit (Sequence Window feature has 48 bit). Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk> Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>	2008-09-04 07:45:26 +02:00
Gerrit Renker	959fd992f0	dccp ccid-3: Replace lazy BUG_ON with condition The BUG_ON(w_tot == 0) only holds if there is no more than 1 loss interval in the loss history. If there is only a single loss interval, the calc_i_mean() routine need in fact not be called (RFC 3448, 6.3.1). Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>	2008-09-04 07:45:25 +02:00
Gerrit Renker	432649916b	dccp: Toggle debug output without module unloading This sets the sysfs permissions so that root can toggle the `debug' parameter available for nearly every DCCP module. This is useful since there are various module inter-dependencies. The debug flag can now be toggled at runtime using echo 1 > /sys/module/dccp/parameters/dccp_debug echo 1 > /sys/module/dccp_ccid2/parameters/ccid2_debug echo 1 > /sys/module/dccp_ccid3/parameters/ccid3_debug echo 1 > /sys/module/dccp_tfrc_lib/parameters/tfrc_debug The last is not very useful yet, since no code at the moment calls the tfrc_debug() macro. Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>	2008-09-04 07:45:25 +02:00
Gerrit Renker	48816322ad	dccp: Empty the write queue when disconnecting dccp_disconnect() can be called due to several reasons: 1. when the connection setup failed (inet_stream_connect()); 2. when shutting down (inet_shutdown(), inet_csk_listen_stop()); 3. when aborting the connection (dccp_close() with 0 linger time). In case (1) the write queue is empty. This patch empties the write queue, if in case (2) or (3) it was not yet empty. This avoids triggering the write-queue BUG_TRAP in sk_stream_kill_queues() later on. It also seems natural to do: when breaking an association, to delete all packets that were originally intended for the soon-disconnected end (compare with call to tcp_write_queue_purge in tcp_disconnect()). Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>	2008-09-04 07:45:25 +02:00
Gerrit Renker	eac7726bf5	dccp: Fill in the Data fields for "Option Error" Resets This updates the use of the `out_invalid_option' label, which produces a Reset (code 5, "Option Error"), to fill in the Data1...Data3 fields as specified in RFC 4340, 5.6. Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>	2008-09-04 07:45:25 +02:00
Gerrit Renker	faf61c3319	dccp: Silently ignore options with nonsensical lengths This updates the option-parsing code with regard to RFC 4340, 5.8: "[..] options with nonsensical lengths (length byte less than two or more than the remaining space in the options portion of the header) MUST be ignored, and any option space following an option with nonsensical length MUST likewise be ignored." Hence in the following cases erratic options will be ignored: 1. The type byte of a multi-byte option is the last byte of the header options (i.e. effective option length of 1). 2. The value of the length byte is less than the minimum 2. This has been changed from previously 3: although no multi-byte option with a length less than 3 yet exists (cf. table 3 in 5.8), a length of 2 is valid. (The switch-statement in dccp_parse has further per-option length checks.) 3. The option length exceeds the length of the remaining option space. Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>	2008-09-04 07:45:24 +02:00
Wei Yongjun	ba1a6c7bc0	dccp: Always generate a Reset in response to option errors RFC4340 states that if a packet is received with an option error (such as a Mandatory Option as the last byte of the option list), the endpoint should repond with a Reset. In the LISTEN and RESPOND states, the endpoint correctly reponds with Reset, while in the REQUEST/OPEN states, packets with option errors are just ignored. The packet sequence is as follows: Case 1: Endpoint A Endpoint B (CLOSED) (CLOSED) <---------------- REQUEST RESPONSE -----------------> (1) (with invalid option) <---------------- RESET (with Reset Code 5, "Option Error") (1) currently just ignored, no Reset is sent Case 2: Endpoint A Endpoint B (OPEN) (OPEN) DATA-ACK -----------------> (2) (with invalid option) <---------------- RESET (with Reset Code 5, "Option Error") (2) currently just ignored, no Reset is sent This patch fixes the problem, by generating a Reset instead of silently ignoring option errors. Signed-off-by: Wei Yongjun <yjwei@cn.fujitsu.com> Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com> Acked-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>	2008-09-04 07:45:24 +02:00
Linus Torvalds	316343e2cf	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6 * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: bnx2x: Accessing un-mapped page ath9k: Fix TX control flag use for no ACK and RTS/CTS ath9k: Fix TX status reporting iwlwifi: fix STATUS_EXIT_PENDING is not set on pci_remove iwlwifi: call apm stop on exit iwlwifi: fix Tx cmd memory allocation failure handling iwlwifi: fix rx_chain computation iwlwifi: fix station mimo power save values iwlwifi: remove false rxon if rx chain changes iwlwifi: fix hidden ssid discovery in passive channels iwlwifi: W/A for the TSF correction in IBSS netxen: Remove workaround for chipset quirk pcnet-cs, axnet_cs: add new IDs, remove dup ID with less info ixgbe: initialize interrupt throttle rate net/usb/pegasus: avoid hundreds of diagnostics tipc: Don't use structure names which easily globally conflict.	2008-09-03 16:21:02 -07:00
David S. Miller	6c00055a81	tipc: Don't use structure names which easily globally conflict. Andrew Morton reported a build failure on sparc32, because TIPC uses names like "struct node" and there is a like named data structure defined in linux/node.h This just regexp replaces "struct node" to "struct tipc_node" to avoid this and any future similar problems. Signed-off-by: David S. Miller <davem@davemloft.net>	2008-09-02 23:38:32 -07:00
Linus Torvalds	d26acd92fa	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6 * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: ipsec: Fix deadlock in xfrm_state management. ipv: Re-enable IP when MTU > 68 net/xfrm: Use an IS_ERR test rather than a NULL test ath9: Fix ath_rx_flush_tid() for IRQs disabled kernel warning message. ath9k: Incorrect key used when group and pairwise ciphers are different. rt2x00: Compiler warning unmasked by fix of BUILD_BUG_ON mac80211: Fix debugfs union misuse and pointer corruption wireless/libertas/if_cs.c: fix memory leaks orinoco: Multicast to the specified addresses iwlwifi: fix 64bit platform firmware loading iwlwifi: fix apm_stop (wrong bit polarity for FLAG_INIT_DONE) iwlwifi: workaround interrupt handling no some platforms iwlwifi: do not use GFP_DMA in iwl_tx_queue_init net/wireless/Kconfig: clarify the description for CONFIG_WIRELESS_EXT_SYSFS net: Unbreak userspace usage of linux/mroute.h pkt_sched: Fix locking of qdisc_root with qdisc_root_sleeping_lock() ipv6: When we droped a packet, we should return NET_RX_DROP instead of 0	2008-09-02 21:02:14 -07:00
David S. Miller	37b08e34a9	ipsec: Fix deadlock in xfrm_state management. Ever since commit `4c563f7669` ("[XFRM]: Speed up xfrm_policy and xfrm_state walking") it is illegal to call __xfrm_state_destroy (and thus xfrm_state_put()) with xfrm_state_lock held. If we do, we'll deadlock since we have the lock already and __xfrm_state_destroy() tries to take it again. Fix this by pushing the xfrm_state_put() calls after the lock is dropped. Signed-off-by: David S. Miller <davem@davemloft.net>	2008-09-02 20:14:15 -07:00
Thomas Graf	2c10b32bf5	netlink: Remove compat API for nested attributes Removes all _nested_compat() functions from the API. The prio qdisc no longer requires them and netem has its own format anyway. Their existance is only confusing. Resend: Also remove the wrapper macro. Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-09-02 17:30:27 -07:00
Breno Leitao	06770843c2	ipv: Re-enable IP when MTU > 68 Re-enable IP when the MTU gets back to a valid size. This patch just checks if the in_dev is NULL on a NETDEV_CHANGEMTU event and if MTU is valid (bigger than 68), then re-enable in_dev. Also a function that checks valid MTU size was created. Signed-off-by: Breno Leitao <leitao@linux.vnet.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-09-02 17:28:58 -07:00
Julien Brunel	9d7d74029e	net/xfrm: Use an IS_ERR test rather than a NULL test In case of error, the function xfrm_bundle_create returns an ERR pointer, but never returns a NULL pointer. So a NULL test that comes after an IS_ERR test should be deleted. The semantic match that finds this problem is as follows: (http://www.emn.fr/x-info/coccinelle/) // <smpl> @match_bad_null_test@ expression x, E; statement S1,S2; @@ x = xfrm_bundle_create(...) ... when != x = E * if (x != NULL) S1 else S2 // </smpl> Signed-off-by: Julien Brunel <brunel@diku.dk> Signed-off-by: Julia Lawall <julia@diku.dk> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-09-02 17:24:28 -07:00
Jouni Malinen	2b58b20939	mac80211: Fix debugfs union misuse and pointer corruption debugfs union in struct ieee80211_sub_if_data is misused by including a common default_key dentry as a union member. This ends occupying the same memory area with the first dentry in other union members (structures; usually drop_unencrypted). Consequently, debugfs operations on default_key symlinks and drop_unencrypted entry are using the same dentry pointer even though they are supposed to be separate ones. This can lead to removing entries incorrectly or potentially leaving something behind since one of the dentry pointers gets lost. Fix this by moving the default_key dentry to a new struct (common_debugfs) that contains dentries (more to be added in future) that are shared by all vif types. The debugfs union must only be used for vif type-specific entries to avoid this type of pointer corruption. Signed-off-by: Jouni Malinen <jouni.malinen@atheros.com> Acked-by: Johannes Berg <johannes@sipsolutions.net> Signed-off-by: John W. Linville <linville@tuxdriver.com>	2008-09-02 17:39:50 -04:00
Florian Mickler	d9664741e0	net/wireless/Kconfig: clarify the description for CONFIG_WIRELESS_EXT_SYSFS Current setup with hal and NetworkManager will fail to work without newest hal version with this config option disabled. Although this will solve itself by time, at the moment it is dishonest to say that we don't know any software that uses it, if there are many many people relying on old hal versions. Signed-off-by: Florian Mickler <florian@mickler.org> Signed-off-by: John W. Linville <linville@tuxdriver.com>	2008-09-02 15:03:19 -04:00
Linus Torvalds	e77295dc9e	Merge branch 'for-2.6.27' of git://linux-nfs.org/~bfields/linux * 'for-2.6.27' of git://linux-nfs.org/~bfields/linux: nfsd: fix buffer overrun decoding NFSv4 acl sunrpc: fix possible overrun on read of /proc/sys/sunrpc/transports nfsd: fix compound state allocation error handling svcrdma: Fix race between svc_rdma_recvfrom thread and the dto_tasklet	2008-09-02 10:58:11 -07:00
Cyrill Gorcunov	27df6f25ff	sunrpc: fix possible overrun on read of /proc/sys/sunrpc/transports Vegard Nossum reported ---------------------- > I noticed that something weird is going on with /proc/sys/sunrpc/transports. > This file is generated in net/sunrpc/sysctl.c, function proc_do_xprt(). When > I "cat" this file, I get the expected output: > $ cat /proc/sys/sunrpc/transports > tcp 1048576 > udp 32768 > But I think that it does not check the length of the buffer supplied by > userspace to read(). With my original program, I found that the stack was > being overwritten by the characters above, even when the length given to > read() was just 1. David Wagner added (among other things) that copy_to_user could be probably used here. Ingo Oeser suggested to use simple_read_from_buffer() here. The conclusion is that proc_do_xprt doesn't check for userside buffer size indeed so fix this by using Ingo's suggestion. Reported-by: Vegard Nossum <vegard.nossum@gmail.com> Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com> CC: Ingo Oeser <ioe-lkml@rameria.de> Cc: Neil Brown <neilb@suse.de> Cc: Chuck Lever <chuck.lever@oracle.com> Cc: Greg Banks <gnb@sgi.com> Cc: Tom Tucker <tom@opengridcomputing.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2008-09-01 14:24:24 -04:00
David S. Miller	b171e19ed0	Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 Conflicts: net/mac80211/mlme.c	2008-08-29 23:06:00 -07:00
Jarek Poplawski	102396ae65	pkt_sched: Fix locking of qdisc_root with qdisc_root_sleeping_lock() Use qdisc_root_sleeping_lock() instead of qdisc_root_lock() where appropriate. The only difference is while dev is deactivated, when currently we can use a sleeping qdisc with the lock of noop_qdisc. This shouldn't be dangerous since after deactivation root lock could be used only by gen_estimator code, but looks wrong anyway. Signed-off-by: Jarek Poplawski <jarkao2@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-08-29 14:27:52 -07:00
Yang Hongyang	3cc76caa98	ipv6: When we droped a packet, we should return NET_RX_DROP instead of 0 Signed-off-by: Yang Hongyang <yanghy@cn.fujitsu.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-08-29 14:27:51 -07:00
David S. Miller	143b11c03c	Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-next-2.6	2008-08-29 14:02:13 -07:00
Henrique de Moraes Holschuh	1563574448	rfkill: rename rfkill_mutex to rfkill_global_mutex rfkill_mutex and rfkill->mutex are too easy to confuse with each other. Rename rfkill_mutex to rfkill_global_mutex, so that they are easier to tell apart with just one glance. Signed-off-by: Henrique de Moraes Holschuh <hmh@hmh.eng.br> Acked-by: Ivo van Doorn <IvDoorn@gmail.com> Cc: Michael Buesch <mb@bu3sch.de> Signed-off-by: John W. Linville <linville@tuxdriver.com>	2008-08-29 16:24:11 -04:00
Henrique de Moraes Holschuh	f745ba03a1	rfkill: add WARN and BUG_ON paranoia (v2) BUG_ON() and WARN() the heck out of buggy drivers calling into the rfkill subsystem. Also switch from WARN_ON(1) to the new descriptive WARN(). Signed-off-by: Henrique de Moraes Holschuh <hmh@hmh.eng.br> Acked-by: Ivo van Doorn <IvDoorn@gmail.com> Cc: Johannes Berg <johannes@sipsolutions.net> Signed-off-by: John W. Linville <linville@tuxdriver.com>	2008-08-29 16:24:10 -04:00
Felipe Balbi	01b510b9c2	rfkill: add missing line break Trivial patch adding a missing line break on rfkill_claim_show(). Signed-off-by: Felipe Balbi <felipe.balbi@nokia.com> Acked-by: Henrique de Moraes Holschuh <hmh@hmh.eng.br> Acked-by: Ivo van Doorn <IvDoorn@gmail.co> Signed-off-by: John W. Linville <linville@tuxdriver.com>	2008-08-29 16:24:10 -04:00
Henrique de Moraes Holschuh	849e0576a7	rfkill: use strict_strtoul (v2) Switch sysfs parsing to something that actually works properly. Signed-off-by: Henrique de Moraes Holschuh <hmh@hmh.eng.br> Acked-by: Ivo van Doorn <IvDoorn@gmail.com> Signed-off-by: John W. Linville <linville@tuxdriver.com>	2008-08-29 16:24:10 -04:00
Jouni Malinen	36aedc903e	mac80211/cfg80211: HT capabilities for NEW_STA Allow userspace (e.g., hostapd) to set HT capabilities for associated STAs. This is based on a patch from Zhu Yi <yi.zhu@intel.com> (only the NL80211_ATTR_HT_CAPABILITY for NEW_STA part is included here). Signed-off-by: Jouni Malinen <jouni.malinen@atheros.com> Signed-off-by: John W. Linville <linville@tuxdriver.com>	2008-08-29 16:24:09 -04:00
Daniel Wagner	2f58bbf27f	mac80211: Use only precedence level of DSCP field for frame classification Bit 4-5 of DSCP should not be considered by classify_d1. The 802.11 QoS Priority field is only depending on the precedence level. Signed-off-by: Daniel Wagner <wagi@monom.org> Signed-off-by: John W. Linville <linville@tuxdriver.com>	2008-08-29 16:24:05 -04:00
Jouni Malinen	43ac2ca384	mac80211: Handle scan result IEs in one block Clean up and extend scan result processing by storing all the IEs from Beacon/Probe Response frames in a single block instead of allocating memory for each specific IE separately. This removes lot of unnecessary code and automatically supports reporting of new IEs (e.g., IEEE 802.11r) into user space without need to manually extend mac80211 scanning code whenever a new protocol adds IE(s). Signed-off-by: Jouni Malinen <jouni.malinen@atheros.com> Signed-off-by: John W. Linville <linville@tuxdriver.com>	2008-08-29 16:24:05 -04:00
Jouni Malinen	9f1ba9062e	mac80211/cfg80211: Add BSS configuration options for AP mode This change adds a new cfg80211 command, NL80211_CMD_SET_BSS, to allow AP mode BSS parameters to be changed from user space (e.g., hostapd). The drivers using mac80211 are expected to be modified with separate changes to use the new BSS info parameter for short slot time in the bss_info_changed() handler. Signed-off-by: Jouni Malinen <jouni.malinen@atheros.com> Acked-by: Johannes Berg <johannes@sipsolutions.net> Signed-off-by: John W. Linville <linville@tuxdriver.com>	2008-08-29 16:23:55 -04:00
Johannes Berg	7f93ea3e24	mac80211: fill start-sequence-number for BA session start Otherwise, drivers are required to keep track of the sequence numbers themselves, and they really shouldn't be since we already do it for them. I'll fix the race once we figure out how this code should work at all, it's currently disabled. Signed-off-by: Johannes Berg <johannes@sipsolutions.net> Signed-off-by: John W. Linville <linville@tuxdriver.com>	2008-08-29 16:23:55 -04:00
Eric Dumazet	a627266570	ip: speedup /proc/net/rt_cache handling When scanning route cache hash table, we can avoid taking locks for empty buckets. Both /proc/net/rt_cache and NETLINK RTM_GETROUTE interface are taken into account. Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-08-28 01:11:25 -07:00
Andi Kleen	6be547a61d	inet_diag: Add empty bucket optimization to inet_diag too Skip quickly over empty buckets in inet_diag. Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-08-28 01:09:54 -07:00
Andi Kleen	6eac560407	tcp: Skip empty hash buckets faster in /proc/net/tcp On most systems most of the TCP established/time-wait hash buckets are empty. When walking the hash table for /proc/net/tcp their read locks would always be aquired just to find out they're empty. This patch changes the code to check first if the buckets have any entries before taking the lock, which is much cheaper than taking a lock. Since the hash tables are large this makes a measurable difference on processing /proc/net/tcp, especially on architectures with slow read_lock (e.g. PPC) On a 2GB Core2 system time cat /proc/net/tcp > /dev/null (with a mostly empty hash table) goes from 0.046s to 0.005s. On systems with slower atomics (like P4 or POWER4) or larger hash tables (more RAM) the difference is much higher. This can be noticeable because there are some daemons around who regularly scan /proc/net/tcp. Original idea for this patch from Marcus Meissner, but redone by me. Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-08-28 01:08:02 -07:00
Vlad Yasevich	d97240552c	sctp: fix random memory dereference with SCTP_HMAC_IDENT option. The number of identifiers needs to be checked against the option length. Also, the identifier index provided needs to be verified to make sure that it doesn't exceed the bounds of the array. Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-08-27 16:09:49 -07:00
Vlad Yasevich	328fc47ea0	sctp: correct bounds check in sctp_setsockopt_auth_key The bonds check to prevent buffer overlflow was not exactly right. It still allowed overflow of up to 8 bytes which is sizeof(struct sctp_authkey). Since optlen is already checked against the size of that struct, we are guaranteed not to cause interger overflow either. Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-08-27 16:08:54 -07:00
David S. Miller	4d40555250	Merge branch 'lvs-next-2.6' of git://git.kernel.org/pub/scm/linux/kernel/git/horms/lvs-2.6	2008-08-27 05:11:26 -07:00
David S. Miller	6c36810a73	Merge branch 'no-iwlwifi' of git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-2.6	2008-08-27 04:29:50 -07:00
Hugh Dickins	d994af0d50	ipv4: mode 0555 in ipv4_skeleton vpnc on today's kernel says Cannot open "/proc/sys/net/ipv4/route/flush": d--------- 0 root root 0 2008-08-26 11:32 /proc/sys/net/ipv4/route d--------- 0 root root 0 2008-08-26 19:16 /proc/sys/net/ipv4/neigh Signed-off-by: Hugh Dickins <hugh@veritas.com> Acked-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-08-27 02:35:18 -07:00
Philip Love	7982d5e1b3	tcp: fix tcp header size miscalculation when window scale is unused The size of the TCP header is miscalculated when the window scale ends up being 0. Additionally, this can be induced by sending a SYN to a passive open port with a window scale option with value 0. Signed-off-by: Philip Love <love_phil@emc.com> Signed-off-by: Adam Langley <agl@imperialviolet.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-08-27 02:33:50 -07:00
Jarek Poplawski	f6f9b93f16	pkt_sched: Fix gen_estimator locks While passing a qdisc root lock to gen_new_estimator() and gen_replace_estimator() dev could be deactivated or even before grafting proper root qdisc as qdisc_sleeping (e.g. qdisc_create), so using qdisc_root_lock() is not enough. This patch adds qdisc_root_sleeping_lock() for this, plus additional checks, where necessary. Signed-off-by: Jarek Poplawski <jarkao2@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-08-27 02:25:17 -07:00
Jarek Poplawski	f7a54c13c7	pkt_sched: Use rcu_assign_pointer() to change dev_queue->qdisc These pointers are RCU protected, so proper primitives should be used. Signed-off-by: Jarek Poplawski <jarkao2@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-08-27 02:22:07 -07:00
Jarek Poplawski	666d9bbedf	pkt_sched: Fix dev_graft_qdisc() locking During dev_graft_qdisc() dev is deactivated, so qdisc_root_lock() returns wrong lock of noop_qdisc instead of qdisc_sleeping. Signed-off-by: Jarek Poplawski <jarkao2@gmail.com> Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-08-27 02:15:20 -07:00
Gerrit Renker	eff253c427	dccp ccid-3: Replace lazy BUG_ON with condition The BUG_ON(w_tot == 0) only holds if there is no more than 1 loss interval in the loss history. If there is only a single loss interval, the calc_i_mean() routine need in fact not be called (RFC 3448, 6.3.1). Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>	2008-08-27 07:22:00 +02:00
Gerrit Renker	157439fa4a	dccp: Toggle debug output without module unloading This sets the sysfs permissions so that root can toggle the `debug' parameter available for nearly every DCCP module. This is useful since there are various module inter-dependencies. The debug flag can now be toggled at runtime using echo 1 > /sys/module/dccp/parameters/dccp_debug echo 1 > /sys/module/dccp_ccid2/parameters/ccid2_debug echo 1 > /sys/module/dccp_ccid3/parameters/ccid3_debug echo 1 > /sys/module/dccp_tfrc_lib/parameters/tfrc_debug The last is not very useful yet, since no code at the moment calls the tfrc_debug() macro. Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>	2008-08-27 07:22:00 +02:00
Gerrit Renker	b569d5a134	dccp: Empty the write queue when disconnecting dccp_disconnect() can be called due to several reasons: 1. when the connection setup failed (inet_stream_connect()); 2. when shutting down (inet_shutdown(), inet_csk_listen_stop()); 3. when aborting the connection (dccp_close() with 0 linger time). In case (1) the write queue is empty. This patch empties the write queue, if in case (2) or (3) it was not yet empty. This avoids triggering the write-queue BUG_TRAP in sk_stream_kill_queues() later on. It also seems natural to do: when breaking an association, to delete all packets that were originally intended for the soon-disconnected end (compare with call to tcp_write_queue_purge in tcp_disconnect()). Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>	2008-08-27 07:22:00 +02:00
Gerrit Renker	5a056417e6	dccp: Fill in the Data fields for "Option Error" Resets This updates the use of the `out_invalid_option' label, which produces a Reset (code 5, "Option Error"), to fill in the Data1...Data3 fields as specified in RFC 4340, 5.6. Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>	2008-08-27 07:22:00 +02:00
Gerrit Renker	1efa6bbac8	dccp: Silently ignore options with nonsensical lengths This updates the option-parsing code with regard to RFC 4340, 5.8: "[..] options with nonsensical lengths (length byte less than two or more than the remaining space in the options portion of the header) MUST be ignored, and any option space following an option with nonsensical length MUST likewise be ignored." Hence in the following cases erratic options will be ignored: 1. The type byte of a multi-byte option is the last byte of the header options (i.e. effective option length of 1). 2. The value of the length byte is less than the minimum 2. This has been changed from previously 3: although no multi-byte option with a length less than 3 yet exists (cf. table 3 in 5.8), a length of 2 is valid. (The switch-statement in dccp_parse has further per-option length checks.) 3. The option length exceeds the length of the remaining option space. Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>	2008-08-27 07:21:59 +02:00
Wei Yongjun	33c449675c	dccp: Always generate a Reset in response to option errors RFC4340 states that if a packet is received with an option error (such as a Mandatory Option as the last byte of the option list), the endpoint should repond with a Reset. In the LISTEN and RESPOND states, the endpoint correctly reponds with Reset, while in the REQUEST/OPEN states, packets with option errors are just ignored. The packet sequence is as follows: Case 1: Endpoint A Endpoint B (CLOSED) (CLOSED) <---------------- REQUEST RESPONSE -----------------> (1) (with invalid option) <---------------- RESET (with Reset Code 5, "Option Error") (1) currently just ignored, no Reset is sent Case 2: Endpoint A Endpoint B (OPEN) (OPEN) DATA-ACK -----------------> (2) (with invalid option) <---------------- RESET (with Reset Code 5, "Option Error") (2) currently just ignored, no Reset is sent This patch fixes the problem, by generating a Reset instead of silently ignoring option errors. Signed-off-by: Wei Yongjun <yjwei@cn.fujitsu.com> Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com> Acked-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>	2008-08-27 07:21:59 +02:00
Simon Horman	7fd1067851	Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/horms/lvs-2.6 into lvs-next-2.6	2008-08-27 15:11:37 +10:00
Julius Volz	e3c2ced8d2	IPVS: Rename ip_vs_proto_ah.c to ip_vs_proto_ah_esp.c After integrating ESP into ip_vs_proto_ah, rename it (and the references to it) to ip_vs_proto_ah_esp.c and delete the old ip_vs_proto_esp.c. Signed-off-by: Julius Volz <juliusv@google.com> Signed-off-by: Simon Horman <horms@verge.net.au>	2008-08-27 13:50:37 +10:00
Julius Volz	409a19669e	IPVS: Integrate ESP protocol into ip_vs_proto_ah.c Rename all ah_* functions to ah_esp_* (and adjust comments). Move ESP protocol definition into ip_vs_proto_ah.c and remove all usage of ip_vs_proto_esp.c. Make the compilation of ip_vs_proto_ah.c dependent on a new config variable, IP_VS_PROTO_AH_ESP, which is selected either by IP_VS_PROTO_ESP or IP_VS_PROTO_AH. Only compile the selected protocols' structures within this file. Signed-off-by: Julius Volz <juliusv@google.com> Signed-off-by: Simon Horman <horms@verge.net.au>	2008-08-27 13:50:35 +10:00
John W. Linville	576fdeaef6	mac80211: quiet chatty IBSS merge message It seems obvious that this #ifndef should be the opposite polarity... Signed-off-by: John W. Linville <linville@tuxdriver.com>	2008-08-26 20:33:34 -04:00
Jan-Espen Pettersen	8ab65b03b7	mac80211: don't send empty extended rates IE The association request includes a list of supported data rates. 802.11b: 4 supported rates. 802.11g: 12 (8 + 4) supported rates. 802.11a: 8 supported rates. The rates tag of the assoc request has room for only 8 rates. In case of 802.11g an extended rate tag is appended. However in net/wireless/mlme.c an extended (empty) rate tag is also appended if the number of rates is exact 8. This empty (length=0) extended rates tag causes some APs to deny association with code 18 (unsupported rates). These APs include my ZyXEL G-570U, and according to Tomas Winkler som Cisco APs. 'If count == 8' has been used to check for the need for an extended rates tag. But count would also be equal to 8 if the for loop exited because of no more supported rates. Therefore a check for count being less than rates_len would seem more correct. Thanks to: * Dan Williams for newbie guidance * Tomas Winkler for confirming the problem Signed-off-by: Jan-Espen Pettersen <sigsegv@radiotube.org> Signed-off-by: John W. Linville <linville@tuxdriver.com>	2008-08-26 20:06:33 -04:00
Jouni Malinen	93015f0f34	mac80211: Fix debugfs file add/del for netdev Previous version was using incorrect union structures for non-AP interfaces when adding and removing max_ratectrl_rateidx and force_unicast_rateidx entries. Depending on the vif type, this ended up in corrupting debugfs entries since the dentries inside different union structures ended up going being on top of eachother.. As the end result, debugfs files were being left behind with references to freed data (instant kernel oops on access) and directories were not removed properly when unloading mac80211 drivers. This patch fixes those issues by using only a single union structure based on the vif type. Signed-off-by: Jouni Malinen <jouni.malinen@atheros.com> Signed-off-by: John W. Linville <linville@tuxdriver.com>	2008-08-26 20:06:33 -04:00
Julia Lawall	667d8af9af	net/mac80211/mesh.c: correct the argument to __mesh_table_free In the function mesh_table_grow, it is the new table not the argument table that should be freed if the function fails (cf commit `bd9b448f4c`) The semantic match that detects this problem is as follows: (http://www.emn.fr/x-info/coccinelle/) // <smpl> @r exists@ local idexpression x; expression E,f; position p1,p2,p3; identifier l; statement S; @@ x = mesh_table_alloc@p1(...) ... if (x == NULL) S ... when != E = x when != mesh_table_free(x) goto@p2 l; ... when != E = x when != f(...,x,...) when any ( return $0\\|x$; \| return@p3 ...; ) @script:python@ p1 << r.p1; p2 << r.p2; p3 << r.p3; @@ print "%s: call on line %s not freed or saved before return on line %s via line %s" % (p1[0].file,p1[0].line,p3[0].line,p2[0].line) // </smpl> Signed-off-by: Julia Lawall <julia@diku.dk> Signed-off-by: John W. Linville <linville@tuxdriver.com>	2008-08-26 20:06:32 -04:00
Jouni Malinen	087d833e5a	mac80211: Use IWEVASSOCREQIE instead of IWEVCUSTOM The previous code was using IWEVCUSTOM to report IEs from AssocReq and AssocResp frames into user space. This can easily hit the 256 byte limit (IW_CUSTOM_MAX) with APs that include number of vendor IEs in AssocResp. This results in the event message not being sent and dmesg showing "wlan0 (WE) : Wireless Event too big (366)" type of errors. Convert mac80211 to use IWEVASSOCREQIE/IWEVASSOCRESPIE to avoid the issue of being unable to send association IEs as wireless events. These newer event types use binary encoding and larger maximum size (IW_GENERIC_IE_MAX = 1024), so the likelyhood of not being able to send the IEs is much smaller than with IWEVCUSTOM. As an extra benefit, the code is also quite a bit simpler since there is no need to allocate an extra buffer for hex encoding. Signed-off-by: Jouni Malinen <jouni.malinen@atheros.com> Signed-off-by: John W. Linville <linville@tuxdriver.com>	2008-08-26 20:06:32 -04:00
Felipe Balbi	988b02f1bf	net: rfkill: add missing line break Trivial patch adding a missing line break on rfkill_claim_show(). Signed-off-by: Felipe Balbi <felipe.balbi@nokia.com> Acked-by: Ivo van Doorn <IvDoorn@gmail.co> Signed-off-by: John W. Linville <linville@tuxdriver.com>	2008-08-26 20:06:31 -04:00
Al Viro	ce3113ec57	ipv6: sysctl fixes Braino: net.ipv6 in ipv6 skeleton has no business in rotable class Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-08-25 15:18:15 -07:00
Al Viro	2f4520d35d	ipv4: sysctl fixes net.ipv4.neigh should be a part of skeleton to avoid ordering problems Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-08-25 15:17:44 -07:00
Vlad Yasevich	30c2235cbc	sctp: add verification checks to SCTP_AUTH_KEY option The structure used for SCTP_AUTH_KEY option contains a length that needs to be verfied to prevent buffer overflow conditions. Spoted by Eugene Teo <eteo@redhat.com>. Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-08-25 15:16:19 -07:00
Stephen Hemminger	f410a1fba7	ipv6: protocol for address routes This fixes a problem spotted with zebra, but not sure if it is necessary a kernel problem. With IPV6 when an address is added to an interface, Zebra creates a duplicate RIB entry, one as a connected route, and other as a kernel route. When an address is added to an interface the RTN_NEWADDR message causes Zebra to create a connected route. In IPV4 when an address is added to an interface a RTN_NEWROUTE message is set to user space with the protocol RTPROT_KERNEL. Zebra ignores these messages, because it already has the connected route. The problem is that route created in IPV6 has route protocol == RTPROT_BOOT. Was this a design decision or a bug? This fixes it. Same patch applies to both net-2.6 and stable. Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-08-23 05:16:46 -07:00
Ilpo Järvinen	a4356b2920	tcp: Add tcp_parse_aligned_timestamp Some duplicated code lying around. Located with my suffix tree tool. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-08-23 05:12:29 -07:00
Ilpo Järvinen	2cf46637b5	tcp: Add tcp_collapse_one to eliminate duplicated code Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-08-23 05:11:41 -07:00
Ilpo Järvinen	cbe2d128a0	tcp: Add tcp_validate_incoming & put duplicated code there Large block of code duplication removed. Sadly, the return value thing is a bit tricky here but it seems the most sensible way to return positive from validator on success rather than negative. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-08-23 05:10:12 -07:00
Denis V. Lunev	fdc0bde90a	icmp: icmp_sk() should not use smp_processor_id() in preemptible code Pass namespace into icmp_xmit_lock, obtain socket inside and return it as a result for caller. Thanks Alexey Dobryan for this report: Steps to reproduce: CONFIG_PREEMPT=y CONFIG_DEBUG_PREEMPT=y tracepath <something> BUG: using smp_processor_id() in preemptible [00000000] code: tracepath/3205 caller is icmp_sk+0x15/0x30 Pid: 3205, comm: tracepath Not tainted 2.6.27-rc4 #1 Call Trace: [<ffffffff8031af14>] debug_smp_processor_id+0xe4/0xf0 [<ffffffff80409405>] icmp_sk+0x15/0x30 [<ffffffff8040a17b>] icmp_send+0x4b/0x3f0 [<ffffffff8025a415>] ? trace_hardirqs_on_caller+0xd5/0x160 [<ffffffff8025a4ad>] ? trace_hardirqs_on+0xd/0x10 [<ffffffff8023a475>] ? local_bh_enable_ip+0x95/0x110 [<ffffffff804285b9>] ? _spin_unlock_bh+0x39/0x40 [<ffffffff8025a26c>] ? mark_held_locks+0x4c/0x90 [<ffffffff8025a4ad>] ? trace_hardirqs_on+0xd/0x10 [<ffffffff8025a415>] ? trace_hardirqs_on_caller+0xd5/0x160 [<ffffffff803e91b4>] ip_fragment+0x8d4/0x900 [<ffffffff803e7030>] ? ip_finish_output2+0x0/0x290 [<ffffffff803e91e0>] ? ip_finish_output+0x0/0x60 [<ffffffff803e6650>] ? dst_output+0x0/0x10 [<ffffffff803e922c>] ip_finish_output+0x4c/0x60 [<ffffffff803e92e3>] ip_output+0xa3/0xf0 [<ffffffff803e68d0>] ip_local_out+0x20/0x30 [<ffffffff803e753f>] ip_push_pending_frames+0x27f/0x400 [<ffffffff80406313>] udp_push_pending_frames+0x233/0x3d0 [<ffffffff804067d1>] udp_sendmsg+0x321/0x6f0 [<ffffffff8040d155>] inet_sendmsg+0x45/0x80 [<ffffffff803b967f>] sock_sendmsg+0xdf/0x110 [<ffffffff8024a100>] ? autoremove_wake_function+0x0/0x40 [<ffffffff80257ce5>] ? validate_chain+0x415/0x1010 [<ffffffff8027dc10>] ? __do_fault+0x140/0x450 [<ffffffff802597d0>] ? __lock_acquire+0x260/0x590 [<ffffffff803b9e55>] ? sockfd_lookup_light+0x45/0x80 [<ffffffff803ba50a>] sys_sendto+0xea/0x120 [<ffffffff80428e42>] ? _spin_unlock_irqrestore+0x42/0x80 [<ffffffff803134bc>] ? __up_read+0x4c/0xb0 [<ffffffff8024e0c6>] ? up_read+0x26/0x30 [<ffffffff8020b8bb>] system_call_fastpath+0x16/0x1b icmp6_sk() is similar. Signed-off-by: Denis V. Lunev <den@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-08-23 04:43:33 -07:00
Ron Rindjunsky	9859b81eae	mac80211: add direct probe before association This patch adds a direct probe request as first step in the association flow if data we have is not up to date. Motivation of this step is to make sure that the bss information we have is correct, since last scan could have been done a while ago, and beacons do not fully answer this need as there are potential differences between them and probe responses (e.g. WMM parameter element) Signed-off-by: Ron Rindjunsky <ron.rindjunsky@intel.com> Signed-off-by: Tomas Winkler <tomas.winkler@intel.com> Signed-off-by: John W. Linville <linville@tuxdriver.com>	2008-08-22 16:30:00 -04:00
Ron Rindjunsky	6042a3e3ff	mac80211: change number of pre-assoc scans This patch fixes noticed problem in noisy environments of 50+ APs that scan fails to find the requested AP on first try, which leads to connection refusal. second scan has empirically proven to fix this problem in almost all cases. Signed-off-by: Ron Rindjunsky <ron.rindjunsky@intel.com> Signed-off-by: Esti Kummer <ester.kummer@intel.com> Signed-off-by: Tomas Winkler <tomas.winkler@intel.com> Signed-off-by: John W. Linville <linville@tuxdriver.com>	2008-08-22 16:29:59 -04:00
Tomas Winkler	48c2fc59aa	mac80211: cleanup mlme state namespace This patch move add STA_MLME to station mlme state defines. Signed-off-by: Tomas Winkler <tomas.winkler@intel.com> Acked-by: Johannes Berg <johannes@sipsolutions.net> Signed-off-by: John W. Linville <linville@tuxdriver.com>	2008-08-22 16:29:59 -04:00
Tomas Winkler	8e7cdbb633	mac80211: filter probes in ieee80211_rx_mgmt_probe_resp This patch moves filtering statement from ieee80211_rx_bss_info which is called for both beacon and probe to ieee80211_rx_mgmt_probe_resp and save few cycles in beacon parsing. Signed-off-by: Tomas Winkler <tomas.winkler@intel.com> Signed-off-by: John W. Linville <linville@tuxdriver.com>	2008-08-22 16:29:58 -04:00
Jasper Bryant-Greene	f698d856f6	replace net_device arguments with ieee80211_{local,sub_if_data} as appropriate This patch replaces net_device arguments to mac80211 internal functions with ieee80211_{local,sub_if_data} as appropriate. It also does the same for many 802.11s mesh functions, and changes the mesh path table to be indexed on sub_if_data rather than net_device. If the mesh part needs to be a separate patch let me know, but since mesh uses a lot of mac80211 functions which were being converted anyway, the changes go hand-in-hand somewhat. This patch probably does not convert all the functions which could be converted, but it is a large chunk and followup patches will be provided. Signed-off-by: Jasper Bryant-Greene <jasper@amiton.co.nz> Signed-off-by: John W. Linville <linville@tuxdriver.com>	2008-08-22 16:29:58 -04:00
Jasper Bryant-Greene	fef1643bf0	move ETH_P_PAE from ieee80211_i.h to if_ether.h ETH_P_PAE belongs in if_ether.h with the other ETH_P_* definitions. This patch moves it there. Signed-off-by: Jasper Bryant-Greene <jasper@amiton.co.nz> Acked-by: Johannes Berg <johannes@sipsolutions.net> Signed-off-by: John W. Linville <linville@tuxdriver.com>	2008-08-22 16:29:57 -04:00
Henrique de Moraes Holschuh	96c87607ac	rfkill: introduce RFKILL_STATE_MAX While it is interesting to not add last-enum-markers because it allows gcc to warn us of switch() statements missing a valid state, we really should be handling memory corruption on a rfkill state with default clauses, anyway. So add RFKILL_STATE_MAX and use it where applicable. It makes for safer code in the long run. Signed-off-by: Henrique de Moraes Holschuh <hmh@hmh.eng.br> Acked-by: Ivo van Doorn <IvDoorn@gmail.com> Signed-off-by: John W. Linville <linville@tuxdriver.com>	2008-08-22 16:29:57 -04:00
Henrique de Moraes Holschuh	77fba13ccc	rfkill: add __must_check annotations rfkill is not a small, mere detail in wireless support. Once it starts supporting rfkill and users start counting on that support, a wireless device is at risk of operating in dangerous conditions should rfkill support fail to properly activate. Therefore, add the required __must_check annotations on some key functions of the rfkill API, for which the wireless drivers absolutely MUST handle the failure mode safely in order to avoid a potentially dangerous situation where the wireless transmitter is left enabled when the user don't want it to. Signed-off-by: Henrique de Moraes Holschuh <hmh@hmh.eng.br> Acked-by: Ivo van Doorn <IvDoorn@gmail.com> Cc: Matthew Garrett <mjg@redhat.com> Signed-off-by: John W. Linville <linville@tuxdriver.com>	2008-08-22 16:29:57 -04:00
Henrique de Moraes Holschuh	9961920199	rfkill: add default global states (v2) Add a second set of global states, "rfkill_default_states", to track the state that will be used when the first rfkill class of a given type is registered, and also to save "undo" information when rfkill_epo is called. Add a new exported function, rfkill_set_default(), which can be used by platform drivers to restore radio state saved by the platform across reboots or shutdown. Also, fix rfkill_epo to properly update rfkill_states, but still preserve a copy of the state so that we can undo the effect of rfkill_epo later if we want to. Add rfkill_restore_states() to restore rfkill_states from the copy. Signed-off-by: Henrique de Moraes Holschuh <hmh@hmh.eng.br> Acked-by: Ivo van Doorn <IvDoorn@gmail.com> Signed-off-by: John W. Linville <linville@tuxdriver.com>	2008-08-22 16:29:56 -04:00
Henrique de Moraes Holschuh	02589f6051	rfkill: detect bogus double-registering (v2) Detect and abort with -EEXIST if rfkill_register is called twice on the same rfkill struct. And WARN_ON(it) for good measure. While at it, flag when we are adding the first switch of a type, we will need that information later. Signed-off-by: Henrique de Moraes Holschuh <hmh@hmh.eng.br> Acked-by: Ivo van Doorn <IvDoorn@gmail.com> Cc: Johannes Berg <johannes@sipsolutions.net> Signed-off-by: John W. Linville <linville@tuxdriver.com>	2008-08-22 16:29:56 -04:00
Luis Carlos Cobo	bdbe819540	mac80211: allow no mac address until firmware load Originally by Johannes Berg. This patch adds support for devices that do not report their MAC address until the firmware is loaded. While the address is not known, a multicast on is used. Signed-off-by: Luis Carlos Cobo <luisca@cozybit.com> Signed-off-by: John W. Linville <linville@tuxdriver.com>	2008-08-22 16:29:55 -04:00
Harvey Harrison	4eb2ae9a42	mac80211: remove WLAN_FC_DATA_PRESENT All users are gone now. Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Signed-off-by: John W. Linville <linville@tuxdriver.com>	2008-08-22 16:29:55 -04:00
Harvey Harrison	a4b7d7bda5	mac80211: remove rx/tx_data->fc member Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Signed-off-by: John W. Linville <linville@tuxdriver.com>	2008-08-22 16:29:54 -04:00
Harvey Harrison	358c8d9d33	mac80211: use ieee80211 frame control directly Remove the last users of the rx/tx_data->fc data members and use the le16 frame_control from the header directly. Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Signed-off-by: John W. Linville <linville@tuxdriver.com>	2008-08-22 16:29:54 -04:00
Harvey Harrison	e7827a7031	mac80211: remove IEEE80211_FC helper Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Signed-off-by: John W. Linville <linville@tuxdriver.com>	2008-08-22 16:29:54 -04:00
Harvey Harrison	6b644e524b	mac80211: remove ieee80211_get_hdrlen All users have been moved over to the version taking a le16 frame control rather than a cpu-endian value. Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Signed-off-by: John W. Linville <linville@tuxdriver.com>	2008-08-22 16:29:54 -04:00
Harvey Harrison	b73d70ad86	mac80211: rx.c/tx.c remove more users of tx/rx_data->fc Those functions that still use ieee80211_get_hdrlen are moved over to use the little endian frame control. Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Signed-off-by: John W. Linville <linville@tuxdriver.com>	2008-08-22 16:29:53 -04:00
Harvey Harrison	d298487260	mac80211: wep.c replace magic numbers in IV/ICV removal Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Signed-off-by: John W. Linville <linville@tuxdriver.com>	2008-08-22 16:29:53 -04:00
Harvey Harrison	c44d040e18	mac80211: wme.h remove unused QOS_CONTROL_LEN linux/ieee80211.h now has IEEE80211_QOS_CTL_LEN for this purpose. Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Signed-off-by: John W. Linville <linville@tuxdriver.com>	2008-08-22 16:29:53 -04:00
Harvey Harrison	62bf1d762e	mac80211: explicitly check skb->len ieee80211_get_hdrlen_from_skb internally checks the skb is long enough to hold the full ieee80211_hdr, else it returns zero. Use ieee80211_hdrlen which always returns the hdrlen and check the remaining room in the skb explicitly when removing encryption headers or the qos control field. Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Signed-off-by: John W. Linville <linville@tuxdriver.com>	2008-08-22 16:29:53 -04:00
Bruno Randolf	9deb1ae572	mac80211: radiotap: assume modulation from rates use the rates ERP flag to derive CCK or OFDM modulation for the radiotap header. (it might be more correct to get this information from the hardware itself, but it seems safe to assume this in most practical cases.) Signed-off-by: Bruno Randolf <br1@einfach.org> Signed-off-by: John W. Linville <linville@tuxdriver.com>	2008-08-22 16:29:50 -04:00
Bruno Randolf	b4f28bbb9b	mac80211: add rx status flag for short preamble and use it for the radiotap header Signed-off-by: Bruno Randolf <br1@einfach.org> Signed-off-by: John W. Linville <linville@tuxdriver.com>	2008-08-22 16:29:50 -04:00
Tomas Winkler	92ab853549	mac80211: add ieee80211_queue_stopped) This patch adds ieee80211_queue_stopped that let drivers to query queue status Signed-off-by: Tomas Winkler <tomas.winkler@intel.com> Signed-off-by: John W. Linville <linville@tuxdriver.com>	2008-08-22 16:29:50 -04:00
Robert P. J. Day	5442060c08	WIRELESS: Make wireless one-click selectable. Use "menuconfig" to make wireless support one-click selectable. Signed-off-by: Robert P. J. Day <rpjday@crashcourse.ca> Signed-off-by: John W. Linville <linville@tuxdriver.com>	2008-08-22 16:29:50 -04:00
Julia Lawall	d92a8e81e0	net/ieee80211: adjust error handling Converts a test in error handling code to a sequence of labels. The semantic match that found the problem is: (http://www.emn.fr/x-info/coccinelle/) // <smpl> @@ expression E,E1,E2; @@ E = alloc_etherdev(...) ... when != E = E1 if (...) { ... free_netdev(E); ... return ...; } ... when != E = E2 ( if (...) { ... when != free_netdev(E); return dev; } \| * if (...) { ... when != free_netdev(E); return ...; } \| register_netdev(E) ) // </smpl> Signed-off-by: Julia Lawall <julia@diku.dk> Signed-off-by: John W. Linville <linville@tuxdriver.com>	2008-08-22 16:29:49 -04:00
Jarek Poplawski	f6e0b239a2	pkt_sched: Fix qdisc list locking Since some qdiscs call qdisc_tree_decrease_qlen() (so qdisc_lookup()) without rtnl_lock(), adding and deleting from a qdisc list needs additional locking. This patch adds global spinlock qdisc_list_lock and wrapper functions for modifying the list. It is considered as a temporary solution until hfsc_dequeue(), netem_dequeue() and tbf_dequeue() (or qdisc_tree_decrease_qlen()) are redone. With feedback from Herbert Xu and David S. Miller. Signed-off-by: Jarek Poplawski <jarkao2@gmail.com> Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-08-22 03:31:39 -07:00
Jarek Poplawski	2540e0511e	pkt_sched: Fix qdisc_watchdog() vs. dev_deactivate() race dev_deactivate() can skip rescheduling of a qdisc by qdisc_watchdog() or other timer calling netif_schedule() after dev_queue_deactivate(). We prevent this checking aliveness before scheduling the timer. Since during deactivation the root qdisc is available only as qdisc_sleeping additional accessor qdisc_root_sleeping() is created. With feedback from Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Jarek Poplawski <jarkao2@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-08-21 05:11:14 -07:00
Vlad Yasevich	5e739d1752	sctp: fix potential panics in the SCTP-AUTH API. All of the SCTP-AUTH socket options could cause a panic if the extension is disabled and the API is envoked. Additionally, there were some additional assumptions that certain pointers would always be valid which may not always be the case. This patch hardens the API and address all of the crash scenarios. Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-08-21 03:34:25 -07:00
David S. Miller	195648bbc5	pkt_sched: Prevent livelock in TX queue running. If dev_deactivate() is trying to quiesce the queue, it is theoretically possible for another cpu to livelock trying to process that queue. This happens because dev_deactivate() grabs the queue spinlock as it checks the queue state, whereas net_tx_action() does a trylock and reschedules the qdisc if it hits the lock. This breaks the livelock by adding a check on __QDISC_STATE_DEACTIVATED to net_tx_action() when the trylock fails. Based upon feedback from Herbert Xu and Jarek Poplawski. Signed-off-by: David S. Miller <davem@davemloft.net>	2008-08-19 04:00:36 -07:00
David S. Miller	d2805395aa	Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/holtmann/bluetooth-2.6	2008-08-19 01:33:25 -07:00
Sven Wegener	f728bafb56	ipvs: Fix race conditions in lblcr scheduler We can't access the cache entry outside of our critical read-locked region, because someone may free that entry. Also getting an entry under read lock, then locking for write and trying to delete that entry looks fishy, but should be no problem here, because we're only comparing a pointer. Also there is no need for our own rwlock, there is already one in the service structure for use in the schedulers. Signed-off-by: Sven Wegener <sven.wegener@stealer.net> Signed-off-by: Simon Horman <horms@verge.net.au>	2008-08-19 17:37:08 +10:00
Sven Wegener	39ac50d0c7	ipvs: Fix race conditions in lblc scheduler We can't access the cache entry outside of our critical read-locked region, because someone may free that entry. And we also need to check in the critical region wether the destination is still available, i.e. it's not in the trash. If we drop our reference counter, the destination can be purged from the trash at any time. Our caller only guarantees that no destination is moved to the trash, while we are scheduling. Also there is no need for our own rwlock, there is already one in the service structure for use in the schedulers. Signed-off-by: Sven Wegener <sven.wegener@stealer.net> Signed-off-by: Simon Horman <horms@verge.net.au>	2008-08-19 17:37:04 +10:00
Simon Horman	3f087668c4	Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6	2008-08-19 17:36:22 +10:00
David S. Miller	f3b9605d74	Revert "pkt_sched: Add BH protection for qdisc_stab_lock." This reverts commit `1cfa26661a`. qdisc_destroy() runs fully under RTNL again and not from softint any longer, so this change is no longer needed. Signed-off-by: David S. Miller <davem@davemloft.net>	2008-08-18 22:33:05 -07:00
David S. Miller	deb3abf15f	Revert "pkt_sched: Protect gen estimators under est_lock." This reverts commit `d4766692e7`. qdisc_destroy() now runs in RTNL fully again, so this change is no longer needed. Signed-off-by: David S. Miller <davem@davemloft.net>	2008-08-18 22:32:10 -07:00
Ilpo Järvinen	e5befbd952	pkt_sched: remove bogus block (cleanup) ...Last block local var got just deleted. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-08-18 22:30:01 -07:00
Stephen Hemminger	9f59365374	nf_nat: use secure_ipv4_port_ephemeral() for NAT port randomization Use incoming network tuple as seed for NAT port randomization. This avoids concerns of leaking net_random() bits, and also gives better port distribution. Don't have NAT server, compile tested only. Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> [ added missing EXPORT_SYMBOL_GPL ] Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-08-18 21:32:32 -07:00
Pablo Neira Ayuso	fab00c5d15	netfilter: ctnetlink: sleepable allocation with spin lock bh This patch removes a GFP_KERNEL allocation while holding a spin lock with bottom halves disabled in ctnetlink_change_helper(). This problem was introduced in 2.6.23 with the netfilter extension infrastructure. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-08-18 21:31:46 -07:00
Pablo Neira Ayuso	cb1cb5c474	netfilter: ctnetlink: fix sleep in read-side lock section Fix allocation with GFP_KERNEL in ctnetlink_create_conntrack() under read-side lock sections. This problem was introduced in 2.6.25. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-08-18 21:31:24 -07:00
Pablo Neira Ayuso	1575e7ea01	netfilter: ctnetlink: fix double helper assignation for NAT'ed conntracks If we create a conntrack that has NAT handlings and a helper, the helper is assigned twice. This happens because nf_nat_setup_info() - via nf_conntrack_alter_reply() - sets the helper before ctnetlink, which indeed does not check if the conntrack already has a helper as it thinks that it is a brand new conntrack. The fix moves the helper assignation before the set of the status flags. This avoids a bogus assertion in __nf_ct_ext_add (if netfilter assertions are enabled) which checks that the conntrack must not be confirmed. This problem was introduced in 2.6.23 with the netfilter extension infrastructure. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Patrick McHardy <kaber@trash.net>	2008-08-18 21:30:55 -07:00
Anders Grafström	46faec9858	netfilter: ipt_addrtype: Fix matching of inverted destination address type This patch fixes matching of inverted destination address type. Signed-off-by: Anders Grafström <grfstrm@users.sourceforge.net> Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-08-18 21:29:57 -07:00
David S. Miller	8e0f36ec37	Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-2.6	2008-08-18 21:15:44 -07:00
Gerrit Renker	d28934ad8a	dccp: Fix panic caused by too early termination of retransmission mechanism Thanks is due to Wei Yongjun for the detailed analysis and description of this bug at http://marc.info/?l=dccp&m=121739364909199&w=2 The problem is that invalid packets received by a client in state REQUEST cause the retransmission timer for the DCCP-Request to be reset. This includes freeing the Request-skb ( in dccp_rcv_request_sent_state_process() ). As a consequence, * the arrival of further packets cause a double-free, triggering a panic(), * the connection then may hang, since further retransmissions are blocked. This patch changes the order of statements so that the retransmission timer is reset, and the pending Request freed, only if a valid Response has arrived (or the number of sysctl-retries has been exhausted). Further changes: ---------------- To be on the safe side, replaced __kfree_skb with kfree_skb so that if due to unexpected circumstances the sk_send_head is NULL the WARN_ON is used instead. Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-08-18 21:14:20 -07:00
David S. Miller	4d8863a29c	pkt_sched: Don't hold qdisc lock over qdisc_destroy(). Based upon reports by Denys Fedoryshchenko, and feedback and help from Jarek Poplawski and Herbert Xu. We always either: 1) Never made an external reference to this qdisc. or 2) Did a dev_deactivate() which purged all asynchronous references. So do not lock the qdisc when we call qdisc_destroy(), it's illegal anyways as when we drop the lock this is free'd memory. Signed-off-by: David S. Miller <davem@davemloft.net>	2008-08-18 21:06:19 -07:00
Jarek Poplawski	25bfcd5a78	pkt_sched: Add lockdep annotation for qdisc locks Qdisc locks are initialized in the same function, qdisc_alloc(), so lockdep can't distinguish tx qdisc lock from rx and reports "possible recursive locking detected" when both these locks are taken eg. while using act_mirred with ifb. This looks like a false positive. Anyway, after this patch these locks will be reported more exactly. Reported-by: Denys Fedoryshchenko <denys@visp.net.lb> Signed-off-by: Jarek Poplawski <jarkao2@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-08-18 21:06:09 -07:00
David S. Miller	8608db031b	pkt_sched: Never schedule non-root qdiscs. Based upon initial discovery and patch by Jarek Poplawski. The qdisc watchdogs can be attached to any qdisc, not just the root, so make sure we schedule the correct one. CBQ has a similar bug. Signed-off-by: David S. Miller <davem@davemloft.net>	2008-08-18 21:05:56 -07:00
Ron Rindjunsky	a61dae1f78	mac80211: update new sta's rx timestamp This patch fixes needless probe request caused by zero value in sta->last_rx inside ieee80211_associated flow Signed-off-by: Ron Rindjunsky <ron.rindjunsky@intel.com> Signed-off-by: Tomas Winkler <tomas.winkler@intel.com> Signed-off-by: John W. Linville <linville@tuxdriver.com>	2008-08-18 11:05:13 -04:00
Henrique de Moraes Holschuh	e10e0dfe3b	rfkill: protect suspended rfkill controllers Guard rfkill controllers attached to a rfkill class against state changes after class suspend has been issued. Signed-off-by: Henrique de Moraes Holschuh <hmh@hmh.eng.br> Acked-by: Ivo van Doorn <IvDoorn@gmail.com> Signed-off-by: John W. Linville <linville@tuxdriver.com>	2008-08-18 11:05:12 -04:00
Marcel Holtmann	63fbd24e51	[Bluetooth] Consolidate maintainers information The Bluetooth entries for the MAINTAINERS file are a little bit too much. Consolidate them into two entries. One for Bluetooth drivers and another one for the Bluetooth subsystem. Also the MODULE_AUTHOR should indicate the current maintainer of the module and actually not the original author. Fix all Bluetooth modules to provide current maintainer information. Signed-off-by: Marcel Holtmann <marcel@holtmann.org>	2008-08-18 13:23:53 +02:00
Marcel Holtmann	90855d7b72	[Bluetooth] Fix userspace breakage due missing class links The Bluetooth adapters and connections are best presented via a class in sysfs. The removal of the links inside the Bluetooth class broke assumptions by userspace programs on how to find attached adapters. This patch creates adapters and connections as part of the Bluetooth class, but it uses different device types to distinguish them. The userspace programs can now easily navigate in the sysfs device tree. The unused platform device and bus have been removed to keep the code simple and clean. Signed-off-by: Marcel Holtmann <marcel@holtmann.org>	2008-08-18 13:23:53 +02:00
David S. Miller	69747650c8	pkt_sched: Fix return value corruption in HTB and TBF. Based upon a bug report by Josip Rodin. Packet schedulers should only return NET_XMIT_DROP iff the packet really was dropped. If the packet does reach the device after we return NET_XMIT_DROP then TCP can crash because it depends upon the enqueue path return values being accurate. Signed-off-by: David S. Miller <davem@davemloft.net>	2008-08-18 00:39:41 -07:00
David S. Miller	96d203169d	pkt_sched: Fix missed RCU unlock in dev_queue_xmit() Noticed by Jarek Poplawski. Signed-off-by: David S. Miller <davem@davemloft.net>	2008-08-17 23:37:16 -07:00
Yang Hongyang	13601cd8e4	ipv6: Fix the return interface index when get it while no message is received. When get receiving interface index while no message is received, the bounded device's index of the socket should be returned. RFC 3542: Issuing getsockopt() for the above options will return the sticky option value i.e., the value set with setsockopt(). If no sticky option value has been set getsockopt() will return the following values: - For the IPV6_PKTINFO option, it will return an in6_pktinfo structure with ipi6_addr being in6addr_any and ipi6_ifindex being zero. Signed-off-by: Yang Hongyang <yanghy@cn.fujitsu.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-08-17 23:21:52 -07:00
David S. Miller	4cf7cb280e	sch_prio: Use NET_XMIT_SUCCESS instead of "0" constant. Signed-off-by: David S. Miller <davem@davemloft.net>	2008-08-17 22:45:17 -07:00
Jussi Kivilinna	0d40b6e564	sch_prio: Use return value from inner qdisc requeue Use return value from inner qdisc requeue when value returned isn't NET_XMIT_SUCCESS, instead of always returning NET_XMIT_DROP. Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-08-17 22:43:56 -07:00
David S. Miller	1e0d5a5747	pkt_sched: No longer destroy qdiscs from RCU. We can now kill them synchronously with all of the previous dev_deactivate() cures. This makes netdev destruction and shutdown saner as the qdiscs hold references to the device. Signed-off-by: David S. Miller <davem@davemloft.net>	2008-08-17 22:31:26 -07:00
Jarek Poplawski	3a76e3716b	pkt_sched: Grab correct lock in notify_and_destroy(). From: Jarek Poplawski <jarkao2@gmail.com> When we are destroying non-root qdiscs, we need to lock the root of the qdisc tree not the the qdisc itself. Signed-off-by: David S. Miller <davem@davemloft.net>	2008-08-17 22:02:11 -07:00
David S. Miller	4335cd2da1	pkt_sched: Simplify dev_deactivate() polling loop. The condition under which the previous qdisc has no more references after we've attached &noop_qdisc is that both RUNNING and SCHED are both seen clear while holding the root lock. So just make specifically that check in the polling loop, instead of this overly complex "check without then check with lock held" sequence. Signed-off-by: David S. Miller <davem@davemloft.net>	2008-08-17 21:58:07 -07:00
Jarek Poplawski	def82a1db1	net: Change handling of the __QDISC_STATE_SCHED flag in net_tx_action(). Change handling of the __QDISC_STATE_SCHED flag in net_tx_action() to enable proper control in dev_deactivate(). Now, if this flag is seen as unset under root_lock means a qdisc can't be netif_scheduled. Signed-off-by: Jarek Poplawski <jarkao2@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-08-17 21:54:43 -07:00
David S. Miller	a9312ae893	pkt_sched: Add 'deactivated' state. This new state lets dev_deactivate() mark a qdisc as having been deactivated. dev_queue_xmit() and ing_filter() check for this bit and do not try to process the qdisc if the bit is set. dev_deactivate() polls the qdisc after setting the bit, waiting for both __QDISC_STATE_RUNNING and __QDISC_STATE_SCHED to clear. This isn't perfect yet, but subsequent changesets will make it so. This part is just one piece of the puzzle. Signed-off-by: David S. Miller <davem@davemloft.net>	2008-08-17 21:51:03 -07:00
Simon Horman	51df190139	Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6	2008-08-16 14:44:17 +10:00
Rusty Russell	db543c1f97	net: skb_copy_datagram_from_iovec() There's an skb_copy_datagram_iovec() to copy out of a paged skb, but nothing the other way around (because we don't do that). We want to allocate big skbs in tun.c, so let's add the function. It's a carbon copy of skb_copy_datagram_iovec() with enough changes to be annoying. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-08-15 19:52:30 -07:00
Herbert Xu	6f85a124d8	net: Preserve netfilter attributes in skb_gso_segment using __copy_skb_header skb_gso_segment didn't preserve some attributes in the original skb such as the netfilter fields. This was harmless until they were used which is the case for packets going through lo. This patch makes it call __copy_skb_header which also picks up some other missing attributes. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-08-15 19:51:36 -07:00
Stephen Hemminger	e4119a4318	bridge: show offload settings Add more ethtool generic operations to dump the bridge offload settings. Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-08-15 19:51:07 -07:00
Herbert Xu	c6153b5b77	ipv4: Disable route secret interval on zero interval Let me first state that disabling the route cache hash rebuild should not be done without extensive analysis on the risk profile and careful deliberation. However, there are times when this can be done safely or for testing. For example, when you have mechanisms for ensuring that offending parties do not exist in your network. This patch lets the user disable the rebuild if the interval is set to zero. This also incidentally fixes a divide-by-zero error with name-spaces. In addition, this patch makes the effect of an interval change immediate rather than it taking effect at the next rebuild as is currently the case. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-08-15 13:44:31 -07:00
Jarek Poplawski	323c048836	pkt_sched: Fix unlocking in tc_ctl_tfilter() Fix a bug with spin_lock_bh() inserted instead of spin_unlock_bh() by some recent patch. Reported-by: Denys Fedoryshchenko <denys@visp.net.lb> Signed-off-by: Jarek Poplawski <jarkao2@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-08-14 17:01:10 -07:00
Simon Horman	4a031b0e6a	ipvs: rename __ip_vs_wlc_schedule in lblc and lblcr schedulers For the sake of clarity, rename __ip_vs_wlc_schedule() in lblc.c to __ip_vs_lblc_schedule() and the version in lblcr.c to __ip_vs_lblc_schedule(). I guess the original name stuck from a copy and paste. Cc: Sven Wegener <sven.wegener@stealer.net> Signed-off-by: Simon Horman <horms@verge.net.au>	2008-08-15 09:26:15 +10:00
Sven Wegener	a919cf4b6b	ipvs: Create init functions for estimator code Commit `8ab19ea36c` ("ipvs: Fix possible deadlock in estimator code") fixed a deadlock condition, but that condition can only happen during unload of IPVS, because during normal operation there is at least our global stats structure in the estimator list. The mod_timer() and del_timer_sync() calls are actually initialization and cleanup code in disguise. Let's make it explicit and move them to their own init and cleanup function. Signed-off-by: Sven Wegener <sven.wegener@stealer.net> Signed-off-by: Simon Horman <horms@verge.net.au>	2008-08-15 09:26:15 +10:00
Sven Wegener	82dfb6f322	ipvs: Only call init_service, update_service and done_service for schedulers if defined There are schedulers that only schedule based on data available in the service or destination structures and they don't need any persistent storage or initialization routine. These schedulers currently provide dummy functions for the init_service, update_service and/or done_service functions. For the init_service and done_service cases we already have code that only calls these functions, if the scheduler provides them. Do the same for the update_service case and remove the dummy functions from all schedulers. Signed-off-by: Sven Wegener <sven.wegener@stealer.net> Signed-off-by: Simon Horman <horms@verge.net.au>	2008-08-15 09:26:14 +10:00
Julius Volz	9a812198ae	IPVS: Add genetlink interface implementation Add the implementation of the new Generic Netlink interface to IPVS and keep the old set/getsockopt interface for userspace backwards compatibility. Signed-off-by: Julius Volz <juliusv@google.com> Acked-by: Sven Wegener <sven.wegener@stealer.net> Signed-off-by: Simon Horman <horms@verge.net.au>	2008-08-15 09:26:14 +10:00
Brian Haley	191cd58250	netns: Add network namespace argument to rt6_fill_node() and ipv6_dev_get_saddr() ipv6_dev_get_saddr() blindly de-references dst_dev to get the network namespace, but some callers might pass NULL. Change callers to pass a namespace pointer instead. Signed-off-by: Brian Haley <brian.haley@hp.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-08-14 15:33:21 -07:00
Daniel Lezcano	877acedc0d	netns: Fix crash by making igmp per namespace This patch makes the multicast socket to be per namespace. When a network namespace is created, other than the init_net and a multicast packet is received, the kernel goes to a hang or a kernel panic. How to reproduce ? * create a child network namespace * create a pair virtual device veth * ip link add type veth * move one side to the pair network device to the child namespace * ip link set netns <childpid> dev veth1 * ping -I veth0 224.0.0.1 The bug appears because the function ip_mc_init_dev does not initialize the different multicast fields as it exits because it is not the init_net. BUG: soft lockup - CPU#0 stuck for 61s! [avahi-daemon:2695] Modules linked in: irq event stamp: 50350 hardirqs last enabled at (50349): [<c03ee949>] _spin_unlock_irqrestore+0x34/0x39 hardirqs last disabled at (50350): [<c03ec639>] schedule+0x9f/0x5ff softirqs last enabled at (45712): [<c0374d4b>] ip_setsockopt+0x8e7/0x909 softirqs last disabled at (45710): [<c03ee682>] _spin_lock_bh+0x8/0x27 Pid: 2695, comm: avahi-daemon Not tainted (2.6.27-rc2-00029-g0872073 #3) EIP: 0060:[<c03ee47c>] EFLAGS: 00000297 CPU: 0 EIP is at __read_lock_failed+0x8/0x10 EAX: c4f38810 EBX: c4f38810 ECX: 00000000 EDX: c04cc22e ESI: fb0000e0 EDI: 00000011 EBP: 0f02000a ESP: c4e3faa0 DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068 CR0: 8005003b CR2: 44618a40 CR3: 04e37000 CR4: 000006d0 DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000 DR6: ffff0ff0 DR7: 00000400 [<c02311f8>] ? _raw_read_lock+0x23/0x25 [<c0390666>] ? ip_check_mc+0x1c/0x83 [<c036d478>] ? ip_route_input+0x229/0xe92 [<c022e2e4>] ? trace_hardirqs_on_thunk+0xc/0x10 [<c0104c9c>] ? do_IRQ+0x69/0x7d [<c0102e64>] ? restore_nocheck_notrace+0x0/0xe [<c036fdba>] ? ip_rcv+0x227/0x505 [<c0358764>] ? netif_receive_skb+0xfe/0x2b3 [<c03588d2>] ? netif_receive_skb+0x26c/0x2b3 [<c035af31>] ? process_backlog+0x73/0xbd [<c035a8cd>] ? net_rx_action+0xc1/0x1ae [<c01218a8>] ? __do_softirq+0x7b/0xef [<c0121953>] ? do_softirq+0x37/0x4d [<c035b50d>] ? dev_queue_xmit+0x3d4/0x40b [<c0122037>] ? local_bh_enable+0x96/0xab [<c035b50d>] ? dev_queue_xmit+0x3d4/0x40b [<c012181e>] ? _local_bh_enable+0x79/0x88 [<c035fcb8>] ? neigh_resolve_output+0x20f/0x239 [<c0373118>] ? ip_finish_output+0x1df/0x209 [<c0373364>] ? ip_dev_loopback_xmit+0x62/0x66 [<c0371db5>] ? ip_local_out+0x15/0x17 [<c0372013>] ? ip_push_pending_frames+0x25c/0x2bb [<c03891b8>] ? udp_push_pending_frames+0x2bb/0x30e [<c038a189>] ? udp_sendmsg+0x413/0x51d [<c038a1a9>] ? udp_sendmsg+0x433/0x51d [<c038f927>] ? inet_sendmsg+0x35/0x3f [<c034f092>] ? sock_sendmsg+0xb8/0xd1 [<c012d554>] ? autoremove_wake_function+0x0/0x2b [<c022e6de>] ? copy_from_user+0x32/0x5e [<c022e6de>] ? copy_from_user+0x32/0x5e [<c034f238>] ? sys_sendmsg+0x18d/0x1f0 [<c0175e90>] ? pipe_write+0x3cb/0x3d7 [<c0170347>] ? do_sync_write+0xbe/0x105 [<c012d554>] ? autoremove_wake_function+0x0/0x2b [<c03503b2>] ? sys_socketcall+0x176/0x1b0 [<c01085ea>] ? syscall_trace_enter+0x6c/0x7b [<c0102e1a>] ? syscall_call+0x7/0xb Signed-off-by: Daniel Lezcano <dlezcano@fr.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-08-13 16:15:57 -07:00
Jarek Poplawski	d4766692e7	pkt_sched: Protect gen estimators under est_lock. gen_kill_estimator() required rtnl_lock() protection, but since it is moved to an RCU callback __qdisc_destroy() let's use est_lock instead. Signed-off-by: Jarek Poplawski <jarkao2@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-08-13 15:20:24 -07:00
David S. Miller	b9a3b1102b	pkt_sched: Fix queue quiescence testing in dev_deactivate(). Based upon discussions with Jarek P. and Herbert Xu. First, we're testing the wrong qdisc. We just reset the device queue qdiscs to &noop_qdisc and checking it's state is completely pointless here. We want to wait until the previous qdisc that was sitting at the ->qdisc pointer is not busy any more. And that would be ->qdisc_sleeping. Because of how we propagate the samples qdisc pointer down into qdisc_run and friends via per-cpu ->output_queue and netif_schedule, we have to wait also for the __QDISC_STATE_SCHED bit to clear as well. Signed-off-by: David S. Miller <davem@davemloft.net>	2008-08-13 15:18:38 -07:00
Jarek Poplawski	26b284de54	pkt_sched: Fix oops in htb_delete. Recent changes introduced a bug in htb_delete(): cl->parent->children counter update misses checking cl->parent for NULL, which is used for root classes, so deleting them causes an oops. Signed-off-by: Jarek Poplawski <jarkao2@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-08-13 15:16:43 -07:00
Andrew Gallatin	64c00d81b5	pktgen: prevent pktgen from using bad tx queue With the new multi-queue transmit code, it is possible to accidentally make pktgen pick a non-existing tx queue simply by using a stale script to drive pktgen. Access to this non-existing tx queue will then trigger a bad memory access and kill the machine. For example, setting "queue_map_max 2" will cause my machine to die when accessing a garbage spinlock in the non-existing tx queue: BUG: spinlock bad magic on CPU#0, kpktgend_0/564 lock: ffff88001ddf6718, .magic: ffffffff, .owner: /-1, .owner_cpu: 0 Pid: 564, comm: kpktgend_0 Not tainted 2.6.27-rc3 #35 Call Trace: [<ffffffff803a1228>] spin_bug+0xa4/0xac [<ffffffff803a1253>] _raw_spin_lock+0x23/0x123 [<ffffffff8055b06f>] _spin_lock_bh+0x17/0x1b [<ffffffff804cb57d>] pktgen_thread_worker+0xa97/0x1002 [<ffffffff8022874d>] ? finish_task_switch+0x38/0x97 [<ffffffff80242077>] ? autoremove_wake_function+0x0/0x36 [<ffffffff80242077>] ? autoremove_wake_function+0x0/0x36 [<ffffffff804caae6>] ? pktgen_thread_worker+0x0/0x1002 [<ffffffff80241a40>] kthread+0x44/0x6d [<ffffffff8020c399>] child_rip+0xa/0x11 [<ffffffff802419fc>] ? kthread+0x0/0x6d [<ffffffff8020c38f>] ? child_rip+0x0/0x11 The attached patch adds some sanity checking to prevent these sorts of configuration errors. Signed-off-by: Andrew Gallatin <gallatin@myri.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-08-13 15:16:00 -07:00
Tom Tucker	24b8b44780	svcrdma: Fix race between svc_rdma_recvfrom thread and the dto_tasklet RDMA_READ completions are kept on a separate queue from the general I/O request queue. Since a separate lock is used to protect the RDMA_READ completion queue, a race exists between the dto_tasklet and the svc_rdma_recvfrom thread where the dto_tasklet sets the XPT_DATA bit and adds I/O to the read-completion queue. Concurrently, the recvfrom thread checks the generic queue, finds it empty and resets the XPT_DATA bit. A subsequent svc_xprt_enqueue will fail to enqueue the transport for I/O and cause the transport to "stall". The fix is to protect both lists with the same lock and set the XPT_DATA bit with this lock held. Signed-off-by: Tom Tucker <tom@opengridcomputing.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2008-08-13 16:57:31 -04:00
Arnaldo Carvalho de Melo	3e8a0a559c	dccp: change L/R must have at least one byte in the dccpsf_val field Thanks to Eugene Teo for reporting this problem. Signed-off-by: Eugene Teo <eugenete@kernel.sg> Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-08-13 13:48:39 -07:00
Jean-Christophe DUBOIS	c1e24df27f	xfrm: remove unnecessary variable in xfrm_output_resume() 2nd try Small fix removing an unnecessary intermediate variable. Signed-off-by: Jean-Christophe DUBOIS <jcd@tribudubois.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-08-13 13:35:37 -07:00
Jamal Hadi Salim	36723873b6	net-sched: fix Action flushing return code Flushing must consistently return ENOMEM on failure of any allocation Signed-off-by: Jamal Hadi Salim <hadi@cyberus.ca> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-08-13 02:41:45 -07:00
Jamal Hadi Salim	f97017cdef	net-sched: Fix actions flushing Flushing of actions has been broken since we changed the semantics of netlink parsed tb[X] to mean X is an attribute type. This makes the flushing work. Signed-off-by: Jamal Hadi Salim <hadi@cyberus.ca> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-08-13 02:41:22 -07:00
Julien Brunel	34093d055e	net/rxrpc: Use an IS_ERR test rather than a NULL test In case of error, the function rxrpc_get_transport returns an ERR pointer, but never returns a NULL pointer. So after a call to this function, a NULL test should be replaced by an IS_ERR test. A simplified version of the semantic patch that makes this change is as follows: (http://www.emn.fr/x-info/coccinelle/) // <smpl> @correct_null_test@ expression x,E; statement S1, S2; @@ x = rxrpc_get_transport(...) <... when != x = E if ( ( - x@p2 != NULL + ! IS_ERR ( x ) \| - x@p2 == NULL + IS_ERR( x ) ) ) S1 else S2 ...> ? x = E; // </smpl> Signed-off-by: Julien Brunel <brunel@diku.dk> Signed-off-by: Julia Lawall <julia@diku.dk> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-08-13 02:40:48 -07:00
Jamal Hadi Salim	317900cb01	wext: Send name on events In the minimal the wireless extensions oughta send at least the name in addition to the ifindex. Signed-off-by: Jamal Hadi Salim <hadi@cyberus.ca> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-08-13 02:39:56 -07:00
Andrew Morton	6ced0b3f1e	net/tipc/subscr.c: don't use ___constant_swab32 It's an internal implementation detail which we _should_ be free to change. So we did, and it promptly broke. The compiler shold be able to work out when to use the __constant version anyway. Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-08-13 02:32:06 -07:00
Brian Haley	5e0115e500	ipv6: Fix OOPS, ip -f inet6 route get fec0::1, linux-2.6.26, ip6_route_output, rt6_fill_node+0x175 Alexey Dobriyan wrote: > On Thu, Aug 07, 2008 at 07:00:56PM +0200, John Gumb wrote: >> Scenario: no ipv6 default route set. > >> # ip -f inet6 route get fec0::1 >> >> BUG: unable to handle kernel NULL pointer dereference at 00000000 >> IP: [<c0369b85>] rt6_fill_node+0x175/0x3b0 >> EIP is at rt6_fill_node+0x175/0x3b0 > > 0xffffffff80424dd3 is in rt6_fill_node (net/ipv6/route.c:2191). > 2186 } else > 2187 #endif > 2188 NLA_PUT_U32(skb, RTA_IIF, iif); > 2189 } else if (dst) { > 2190 struct in6_addr saddr_buf; > 2191 ====> if (ipv6_dev_get_saddr(ip6_dst_idev(&rt->u.dst)->dev, > ^^^^^^^^^^^^^^^^^^^^^^^^ > NULL > > 2192 dst, 0, &saddr_buf) == 0) > 2193 NLA_PUT(skb, RTA_PREFSRC, 16, &saddr_buf); > 2194 } The commit that changed this can't be reverted easily, but the patch below works for me. Fix NULL de-reference in rt6_fill_node() when there's no IPv6 input device present in the dst entry. Signed-off-by: Brian Haley <brian.haley@hp.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-08-13 01:58:57 -07:00
Jarek Poplawski	1cfa26661a	pkt_sched: Add BH protection for qdisc_stab_lock. Since qdisc_stab_lock is used in qdisc_put_stab(), which is called in BH context from __qdisc_destroy() RCU callback, softirq safe locking is needed. Signed-off-by: Jarek Poplawski <jarkao2@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-08-11 18:11:06 -07:00
David S. Miller	0a37c10ed4	Merge branch 'stealer/ipvs/for-davem' of git://git.stealer.net/linux-2.6	2008-08-11 18:04:35 -07:00
Simon Horman	e93615d086	ipvs: Explictly clear ip_vs_stats members In order to align the coding styles of ip_vs_zero_stats() and its child-function ip_vs_zero_estimator(), clear ip_vs_stats members explicitlty rather than doing a limited memset(). This was chosen over modifying ip_vs_zero_estimator() to use memset() as it is more robust against changes in members in the relevant structures. memset() would be prefered if all members of the structure were to be cleared. Cc: Sven Wegener <sven.wegener@stealer.net> Signed-off-by: Simon Horman <horms@verge.net.au> Signed-off-by: Sven Wegener <sven.wegener@stealer.net>	2008-08-11 14:00:55 +02:00
Sven Wegener	519e49e888	ipvs: No need to zero out ip_vs_stats during initialization It's a global variable and automatically initialized to zero. And now we can also initialize the lock at compile time. Signed-off-by: Sven Wegener <sven.wegener@stealer.net> Acked-by: Simon Horman <horms@verge.net.au>	2008-08-11 14:00:46 +02:00
Sven Wegener	3a14a313f9	ipvs: Embed estimator object into stats object There's no reason for dynamically allocating an estimator object for every stats object. Directly embed an estimator object into every stats object and switch to using the kernel-provided list implementation. This makes the code much simpler and faster, as we do not need to traverse the list of all estimators to find the one belonging to a stats object. There's no need to use an rwlock, as we only have one reader. Also reorder the members of the estimator structure slightly to avoid padding overhead. This can't be done with the stats object as the members are currently copied to our user space object via memcpy() and changing it would break ABI. Signed-off-by: Sven Wegener <sven.wegener@stealer.net> Acked-by: Simon Horman <horms@verge.net.au>	2008-08-11 14:00:43 +02:00
Sven Wegener	5587da55fb	ipvs: Mark net_vs_ctl_path const Signed-off-by: Sven Wegener <sven.wegener@stealer.net> Acked-by: Simon Horman <horms@verge.net.au>	2008-08-11 11:46:27 +02:00
Sven Wegener	048cf48b89	ipvs: Annotate init functions with __init Being able to discard these functions saves a couple of bytes at runtime. The cleanup functions can't be annotated with __exit as they are also called from init functions. Signed-off-by: Sven Wegener <sven.wegener@stealer.net> Acked-by: Simon Horman <horms@verge.net.au>	2008-08-11 11:46:18 +02:00
Sven Wegener	d149ccc9cf	ipvs: Initialize schedulers' struct list_head at compile time No need to do it at runtime and this saves a couple of bytes in the text section. Signed-off-by: Sven Wegener <sven.wegener@stealer.net> Acked-by: Simon Horman <horms@verge.net.au>	2008-08-11 11:46:06 +02:00
Sven Wegener	66a0be4720	ipvs: Use list_empty() instead of open-coding the same functionality Signed-off-by: Sven Wegener <sven.wegener@stealer.net> Acked-by: Simon Horman <horms@verge.net.au>	2008-08-11 11:45:57 +02:00
Sven Wegener	8ab19ea36c	ipvs: Fix possible deadlock in estimator code There is a slight chance for a deadlock in the estimator code. We can't call del_timer_sync() while holding our lock, as the timer might be active and spinning for the lock on another cpu. Work around this issue by using try_to_del_timer_sync() and releasing the lock. We could actually delete the timer outside of our lock, as the add and kill functions are only every called from userspace via [gs]etsockopt() and are serialized by a mutex, but better make this explicit. Signed-off-by: Sven Wegener <sven.wegener@stealer.net> Cc: stable <stable@kernel.org> Acked-by: Simon Horman <horms@verge.net.au>	2008-08-11 11:45:40 +02:00
Sven Wegener	bc0fde2fad	ipvs: Fix possible deadlock in sync code Commit `998e7a7680` ("ipvs: Use kthread_run() instead of doing a double-fork via kernel_thread()") introduced a possible deadlock in the sync code. We need to use the _bh versions for the lock, as the lock is also accessed from a bottom half. Signed-off-by: Sven Wegener <sven.wegener@stealer.net> Acked-by: Simon Horman <horms@verge.net.au>	2008-08-11 11:44:38 +02:00
Herbert Xu	d97106ea52	udp: Drop socket lock for encapsulated packets The socket lock is there to protect the normal UDP receive path. Encapsulation UDP sockets don't need that protection. In fact the locking is deadly for them as they may contain another UDP packet within, possibly with the same addresses. Also the nested bit was copied from TCP. TCP needs it because of accept(2) spawning sockets. This simply doesn't apply to UDP so I've removed it. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-08-09 00:35:05 -07:00
David S. Miller	8123b421e8	pkt_sched: Fix ingress deletion and filter attachment. Based upon bug reports by Stephen Hemminger. We still had some cases using ->qdisc instead of ->qdisc_sleeping. Also, qdisc_lookup() should return ingress qdiscs. Signed-off-by: David S. Miller <davem@davemloft.net>	2008-08-08 23:23:39 -07:00
Jamal Hadi Salim	76aab2c1ea	pkt_sched: Fix actions referencing When an action is added several times with the same exact index it gets deleted on every even-numbered attempt. This fixes that issue. Signed-off-by: Jamal Hadi Salim <hadi@cyberus.ca> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-08-07 20:37:22 -07:00
David S. Miller	c1c6c11a68	Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/holtmann/bluetooth-2.6	2008-08-07 20:28:46 -07:00
Adam Langley	2aaab9a0cc	tcp: (whitespace only) fix confusing indentation The indentation in part of tcp_minisocks makes it look like one of the if statements is much more important than it actually is. Signed-off-by: Adam Langley <agl@imperialviolet.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-08-07 20:27:45 -07:00
David S. Miller	827ebd6410	pkt_sched: Fix qdisc config when link is down. Bug reported by Stephen Hemminger. We need to fetch the root from ->qdisc_sleeping not ->qdisc. Signed-off-by: David S. Miller <davem@davemloft.net>	2008-08-07 20:26:40 -07:00
Marcel Holtmann	28111eb2f5	[Bluetooth] Add parameters to control BNEP header compression The Bluetooth qualification for PAN demands testing with BNEP header compression disabled. This is actually pretty stupid and the Linux implementation outsmarts the test system since it compresses whenever possible. So to pass qualification two need parameters have been added to control the compression of source and destination headers. Signed-off-by: Marcel Holtmann <marcel@holtmann.org>	2008-08-07 22:26:54 +02:00
Luis Carlos Cobo	8dbc1722a7	mac80211: keep mesh ifaces in allmulti mode Currently a mesh node will not forward a multicast frame if it is not subscribed to the specific multicast address. This patch addresses the issue and fixes mesh multicast forwarding. Signed-off-by: Luis Carlos Cobo <luisca@cozybit.com> Acked-by: Johannes Berg <johannes@sipsolutions.net> Signed-off-by: John W. Linville <linville@tuxdriver.com>	2008-08-07 09:49:04 -04:00
Luis Carlos Cobo	e32f85f7b9	mac80211: fix use of skb->cb for mesh forwarding Now we deal with mesh forwarding before the 802.11->802.3 conversion, thus eliminating a few unnecessary steps. The next hop lookup is called from ieee80211_master_start_xmit() instead of subif_start_xmit(). Until the next hop is found, RA in the frame will be all zeroes for frames originating from the device. For forwarded frames, RA will contain the TA of the received frame, which will be necessary to send a path error if a next hop is not found. Signed-off-by: Luis Carlos Cobo <luisca@cozybit.com> Acked-by: Johannes Berg <johannes@sipsolutions.net> Signed-off-by: John W. Linville <linville@tuxdriver.com>	2008-08-07 09:49:04 -04:00
Robert Olsson	e6fce5b916	pktgen: multiqueue etc. Sofar far pktgen have had a restriction to only use one device per kernel thread. With the new multiqueue architecture this is no longer adequate. The patch below is an effort to remove this by in pktgen configuration adding a tag to the device name a la eth0@0 etc. The tag is used for usual device config just as before. Also a new flag is introduced to mirror queue_map with sending threads smp_processor_id() QUEUE_MAP_CPU. An example: We use 4 CPU's to send to one 10g interface (eth0) and we use the new tagging to send a mix of packet sizes, 64, 576 and 1500 bytes. Also we use TX queues according to smp_processor_id() PGDEV=/proc/net/pktgen/kpktgend_0 pgset "add_device eth0@0" PGDEV=/proc/net/pktgen/kpktgend_1 pgset "add_device eth0@1" PGDEV=/proc/net/pktgen/kpktgend_2 pgset "add_device eth0@2" PGDEV=/proc/net/pktgen/kpktgend_3 pgset "add_device eth0@3" .... PGDEV=/proc/net/pktgen/eth0@0 pgset "pkt_size 64" pgset "flag QUEUE_MAP_CPU" PGDEV=/proc/net/pktgen/eth0@1 pgset "pkt_size 572" pgset "flag QUEUE_MAP_CPU" PGDEV=/proc/net/pktgen/eth0@2 pgset "pkt_size 1496" PGDEV=/proc/net/pktgen/eth0@3 pgset "pkt_size 1496" pgset "flag QUEUE_MAP_CPU" Signed-off-by: Robert Olsson <robert.olsson@its.uu.se> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-08-07 02:23:01 -07:00
David S. Miller	32bb93b02d	Merge branch 'upstream-davem' of master.kernel.org:/pub/scm/linux/kernel/git/jgarzik/netdev-2.6	2008-08-07 02:10:27 -07:00
Jeff Garzik	3859069bc3	Merge branch 'for-jeff' of git://git.kernel.org/pub/scm/linux/kernel/git/chris/linux-2.6 into tmp	2008-08-07 04:05:46 -04:00
Joe Eykholt	f982307f22	net/core: Allow receive on active slaves. If a packet_type specifies an active slave to bonding and not just any interface, allow it to receive frames that came in on that interface. Signed-off-by: Joe Eykholt <jre@nuovasystems.com> Signed-off-by: Jay Vosburgh <fubar@us.ibm.com> Signed-off-by: Jeff Garzik <jgarzik@redhat.com>	2008-08-07 04:00:01 -04:00
Joe Eykholt	0d7a368123	net/core: Allow certain receives on inactive slave. Allow a packet_type that specifies the exact device to receive even on an inactive bonding slave devices. This is important for some L2 protocols such as LLDP and FCoE. This can eventually be used for the bonding special cases as well. Signed-off-by: Joe Eykholt <jre@nuovasystems.com> Signed-off-by: Jay Vosburgh <fubar@us.ibm.com> Signed-off-by: Jeff Garzik <jgarzik@redhat.com>	2008-08-07 03:59:59 -04:00
Joe Eykholt	cc9bd5cebc	net/core: Uninline skb_bond(). Otherwise subsequent changes need multiple return values. Signed-off-by: Joe Eykholt <jre@nuovasystems.com> Signed-off-by: Jay Vosburgh <fubar@us.ibm.com> Signed-off-by: Jeff Garzik <jgarzik@redhat.com>	2008-08-07 03:59:58 -04:00
Gui Jianfeng	6edafaaf6f	tcp: Fix kernel panic when calling tcp_v(4/6)_md5_do_lookup If the following packet flow happen, kernel will panic. MathineA MathineB SYN ----------------------> SYN+ACK <---------------------- ACK(bad seq) ----------------------> When a bad seq ACK is received, tcp_v4_md5_do_lookup(skb->sk, ip_hdr(skb)->daddr)) is finally called by tcp_v4_reqsk_send_ack(), but the first parameter(skb->sk) is NULL at that moment, so kernel panic happens. This patch fixes this bug. OOPS output is as following: [ 302.812793] IP: [<c05cfaa6>] tcp_v4_md5_do_lookup+0x12/0x42 [ 302.817075] Oops: 0000 [#1] SMP [ 302.819815] Modules linked in: ipv6 loop dm_multipath rtc_cmos rtc_core rtc_lib pcspkr pcnet32 mii i2c_piix4 parport_pc i2c_core parport ac button ata_piix libata dm_mod mptspi mptscsih mptbase scsi_transport_spi sd_mod scsi_mod crc_t10dif ext3 jbd mbcache uhci_hcd ohci_hcd ehci_hcd [last unloaded: scsi_wait_scan] [ 302.849946] [ 302.851198] Pid: 0, comm: swapper Not tainted (2.6.27-rc1-guijf #5) [ 302.855184] EIP: 0060:[<c05cfaa6>] EFLAGS: 00010296 CPU: 0 [ 302.858296] EIP is at tcp_v4_md5_do_lookup+0x12/0x42 [ 302.861027] EAX: 0000001e EBX: 00000000 ECX: 00000046 EDX: 00000046 [ 302.864867] ESI: ceb69e00 EDI: 1467a8c0 EBP: cf75f180 ESP: c0792e54 [ 302.868333] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 [ 302.871287] Process swapper (pid: 0, ti=c0792000 task=c0712340 task.ti=c0746000) [ 302.875592] Stack: c06f413a 00000000 cf75f180 ceb69e00 00000000 c05d0d86 000016d0 ceac5400 [ 302.883275] c05d28f8 000016d0 ceb69e00 ceb69e20 681bf6e3 00001000 00000000 0a67a8c0 [ 302.890971] ceac5400 c04250a3 c06f413a c0792eb0 c0792edc cf59a620 cf59a620 cf59a634 [ 302.900140] Call Trace: [ 302.902392] [<c05d0d86>] tcp_v4_reqsk_send_ack+0x17/0x35 [ 302.907060] [<c05d28f8>] tcp_check_req+0x156/0x372 [ 302.910082] [<c04250a3>] printk+0x14/0x18 [ 302.912868] [<c05d0aa1>] tcp_v4_do_rcv+0x1d3/0x2bf [ 302.917423] [<c05d26be>] tcp_v4_rcv+0x563/0x5b9 [ 302.920453] [<c05bb20f>] ip_local_deliver_finish+0xe8/0x183 [ 302.923865] [<c05bb10a>] ip_rcv_finish+0x286/0x2a3 [ 302.928569] [<c059e438>] dev_alloc_skb+0x11/0x25 [ 302.931563] [<c05a211f>] netif_receive_skb+0x2d6/0x33a [ 302.934914] [<d0917941>] pcnet32_poll+0x333/0x680 [pcnet32] [ 302.938735] [<c05a3b48>] net_rx_action+0x5c/0xfe [ 302.941792] [<c042856b>] __do_softirq+0x5d/0xc1 [ 302.944788] [<c042850e>] __do_softirq+0x0/0xc1 [ 302.948999] [<c040564b>] do_softirq+0x55/0x88 [ 302.951870] [<c04501b1>] handle_fasteoi_irq+0x0/0xa4 [ 302.954986] [<c04284da>] irq_exit+0x35/0x69 [ 302.959081] [<c0405717>] do_IRQ+0x99/0xae [ 302.961896] [<c040422b>] common_interrupt+0x23/0x28 [ 302.966279] [<c040819d>] default_idle+0x2a/0x3d [ 302.969212] [<c0402552>] cpu_idle+0xb2/0xd2 [ 302.972169] ======================= [ 302.974274] Code: fc ff 84 d2 0f 84 df fd ff ff e9 34 fe ff ff 83 c4 0c 5b 5e 5f 5d c3 90 90 57 89 d7 56 53 89 c3 50 68 3a 41 6f c0 e8 e9 55 e5 ff <8b> 93 9c 04 00 00 58 85 d2 59 74 1e 8b 72 10 31 db 31 c9 85 f6 [ 303.011610] EIP: [<c05cfaa6>] tcp_v4_md5_do_lookup+0x12/0x42 SS:ESP 0068:c0792e54 [ 303.018360] Kernel panic - not syncing: Fatal exception in interrupt Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-08-06 23:50:04 -07:00
David S. Miller	ee7af8264d	pkt_sched: Fix "parent is root" test in qdisc_create(). As noticed by Stephen Hemminger, the root qdisc is denoted by TC_H_ROOT, not zero. Signed-off-by: David S. Miller <davem@davemloft.net>	2008-08-06 23:35:59 -07:00
David S. Miller	11d46123bf	ipv4: Fix over-ifdeffing of ip_static_sysctl_init. Noticed by Paulius Zaleckas. Signed-off-by: David S. Miller <davem@davemloft.net>	2008-08-06 18:30:43 -07:00
Joakim Koskela	abf5cdb89d	ipsec: Interfamily IPSec BEET, ipv4-inner ipv6-outer Here's a revised version, based on Herbert's comments, of a fix for the ipv4-inner, ipv6-outer interfamily ipsec beet mode. It fixes the network header adjustment during interfamily, as well as makes sure that we reserve enough room for the new ipv6 header if we might have something else as the inner family. Also, the ipv4 pseudo header construction was added. Signed-off-by: Joakim Koskela <jookos@gmail.com> Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-08-06 02:40:25 -07:00
Joakim Koskela	eb49e63093	ipsec: Interfamily IPSec BEET Here's a revised version, based on Herbert's comments, of a fix for the ipv6-inner, ipv4-outer interfamily ipsec beet mode. It fixes the network header adjustment in interfamily, and doesn't reserve space for the pseudo header anymore when we have ipv6 as the inner family. Signed-off-by: Joakim Koskela <jookos@gmail.com> Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-08-06 02:39:30 -07:00
Krzysztof Piotr Oledzki	9714be7da8	netfilter: fix two recent sysctl problems Starting with `9043476f72` ("[PATCH] sanitize proc_sysctl") we have two netfilter releated problems: - WARNING: at kernel/sysctl.c:1966 unregister_sysctl_table+0xcc/0x103(), caused by wrong order of ini/fini calls - net.netfilter is duplicated and has truncated set of records Thanks to very useful guidelines from Al Viro, this patch fixes both of them. Signed-off-by: Krzysztof Piotr Oledzki <ole@ans.pl> Acked-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-08-06 02:35:44 -07:00
Rami Rosen	1ca615fb81	ipv6: replace dst_metric() with dst_mtu() in net/ipv6/route.c. This patch replaces dst_metric() with dst_mtu() in net/ipv6/route.c. Signed-off-by: Rami Rosen <ramirose@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-08-06 02:34:21 -07:00
Rami Rosen	6d273f8d01	ipv4: replace dst_metric() with dst_mtu() in net/ipv4/route.c. This patch replaces dst_metric() with dst_mtu() in net/ipv4/route.c. Signed-off-by: Rami Rosen <ramirose@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-08-06 02:33:49 -07:00
Ralf Baechle	ffb208479b	AX.25: Fix sysctl registration if !CONFIG_AX25_DAMA_SLAVE Since `49ffcf8f99` ("sysctl: update sysctl_check_table") setting struct ctl_table.procname = NULL does no longer work as it used to the way the AX.25 code is expecting it to resulting in the AX.25 sysctl registration code to break if CONFIG_AX25_DAMA_SLAVE was not set as in some distribution kernels. Kernel releases from 2.6.24 are affected. Signed-off-by: Ralf Baechle <ralf@linux-mips.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-08-05 18:46:57 -07:00
Robert Olsson	ff2a79a5a9	pktgen: mac count dst_mac_count and src_mac_count patch from Eneas Hunguana We have sent one mac address to much. Signed-off-by: Robert Olsson <robert.olsson@its.uu.se> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-08-05 18:45:05 -07:00
Robert Olsson	1211a64554	pktgen: random flow Random flow generation has not worked. This fixes it. Signed-off-by: Robert Olsson <robert.olsson@its.uu.se> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-08-05 18:44:26 -07:00
Stephen Hemminger	ef647f1300	bridge: Eliminate unnecessary forward delay From: Stephen Hemminger <shemminger@vyatta.com> Based upon original patch by Herbert Xu, which contained the following problem description: -------------------- When the forward delay is set to zero, we still delay the setting of the forwarding state by one or possibly two timers depending on whether STP is enabled. This could either turn out to be instantaneous, or horribly slow depending on the load of the machine. As there is nothing preventing us from enabling forwarding straight away, this patch eliminates this potential delay by executing the code directly if the forward delay is zero. The effect of this problem is that immediately after the carrier comes on a port, the bridge will drop all packets received from that port until it enters forwarding mode, thus causing unnecessary packet loss. Note that this patch doesn't fully remove the delay due to the link watcher. We should also check the carrier state when we are about to drop an incoming packet because the port is disabled. But that's for another patch. -------------------- This version of the fix takes a different approach, in that it just does the state change directly. Signed-off-by: David S. Miller <davem@davemloft.net>	2008-08-05 18:42:51 -07:00
David S. Miller	33e334950a	Merge branch 'no-ath9k' of git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-2.6	2008-08-05 01:28:35 -07:00
Rami Rosen	ad619800e4	bridge: fix compile warning in net/bridge/br_netfilter.c This patch fixes the following warning due to incompatible pointer assignment: net/bridge/br_netfilter.c: In function 'br_netfilter_rtable_init': net/bridge/br_netfilter.c:116: warning: assignment from incompatible pointer type This warning is due to commit `4adf0af681` from July 30 (send correct MTU value in PMTU (revised)). Signed-off-by: Rami Rosen <ramirose@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-08-05 01:21:22 -07:00
Jarek Poplawski	c27f339af9	net_sched: Add qdisc __NET_XMIT_BYPASS flag Patrick McHardy <kaber@trash.net> noticed that it would be nice to handle NET_XMIT_BYPASS by NET_XMIT_SUCCESS with an internal qdisc flag __NET_XMIT_BYPASS and to remove the mapping from dev_queue_xmit(). David Miller <davem@davemloft.net> spotted a serious bug in the first version of this patch. Signed-off-by: Jarek Poplawski <jarkao2@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-08-04 22:39:11 -07:00
Jarek Poplawski	378a2f090f	net_sched: Add qdisc __NET_XMIT_STOLEN flag Patrick McHardy <kaber@trash.net> noticed: "The other problem that affects all qdiscs supporting actions is TC_ACT_QUEUED/TC_ACT_STOLEN getting mapped to NET_XMIT_SUCCESS even though the packet is not queued, corrupting upper qdiscs' qlen counters." and later explained: "The reason why it translates it at all seems to be to not increase the drops counter. Within a single qdisc this could be avoided by other means easily, upper qdiscs would still increase the counter when we return anything besides NET_XMIT_SUCCESS though. This means we need a new NET_XMIT return value to indicate this to the upper qdiscs. So I'd suggest to introduce NET_XMIT_STOLEN, return that to upper qdiscs and translate it to NET_XMIT_SUCCESS in dev_queue_xmit, similar to NET_XMIT_BYPASS." David Miller <davem@davemloft.net> noticed: "Maybe these NET_XMIT_* values being passed around should be a set of bits. They could be composed of base meanings, combined with specific attributes. So you could say "NET_XMIT_DROP \| __NET_XMIT_NO_DROP_COUNT" The attributes get masked out by the top-level ->enqueue() caller, such that the base meanings are the only thing that make their way up into the stack. If it's only about communication within the qdisc tree, let's simply code it that way." This patch is trying to realize these ideas. Signed-off-by: Jarek Poplawski <jarkao2@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-08-04 22:31:03 -07:00
Tomas Winkler	4c43e0d0ec	iwlwifi: HW bug fixes This patch adds few HW bug fixes. Signed-off-by: Tomas Winkler <tomas.winkler@intel.com> Signed-off-by: Zhu Yi <yi.zhu@intel.com> Signed-off-by: John W. Linville <linville@tuxdriver.com>	2008-08-04 15:09:12 -04:00
Daniel Drake	80693ceb78	mac80211: automatic IBSS channel selection When joining an ad-hoc network, the user is currently required to specify the channel. The network will not be joined otherwise, unless it happens to be sitting on the currently active channel. This patch implements automatic channel selection when the user has not locked the interface onto a specific channel. Signed-off-by: Daniel Drake <dsd@gentoo.org> Acked-by: Johannes Berg <johannes@sipsolutions.net> Signed-off-by: John W. Linville <linville@tuxdriver.com>	2008-08-04 15:09:10 -04:00
Tomas Winkler	ea95bba41e	mac80211: make listen_interval be limited by low level driver This patch makes possible for a driver to specify maximal listen interval The possibility for user to configure listen interval is not implemented yet, currently the maximum provided by the driver or 1 is used. Mac80211 uses config handler to set listen interval for to the driver. Signed-off-by: Tomas Winkler <tomas.winkler@intel.com> Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com> Signed-off-by: Zhu Yi <yi.zhu@intel.com> Signed-off-by: John W. Linville <linville@tuxdriver.com>	2008-08-04 15:09:07 -04:00
Emmanuel Grumbach	98f7dfd86c	mac80211: pass dtim_period to low level driver This patch adds the dtim_period in ieee80211_bss_conf, this allows the low level driver to know the dtim_period, and to plan power save accordingly. Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com> Signed-off-by: Tomas Winkler <tomas.winkler@intel.com> Signed-off-by: Zhu Yi <yi.zhu@intel.com> Acked-by: Johannes Berg <johannes@sipsolutions.net> Signed-off-by: John W. Linville <linville@tuxdriver.com>	2008-08-04 15:09:07 -04:00
Stephen Hemminger	6e583ce524	net: eliminate refcounting in backlog queue Avoid the overhead of atomic increment/decrement on each received packet. This helps performance of non-NAPI devices (like loopback). Use cleanup function to walk queue on each cpu and clean out any left over packets. Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-08-03 21:29:57 -07:00
Wei Yongjun	283d07ac20	ipv6: Do not drop packet if skb->local_df is set to true The old code will drop IPv6 packet if ipfragok is not set, since ipfragok is obsoleted, will be instead by used skb->local_df, so this check must be changed to skb->local_df. This patch fix this problem and not drop packet if skb->local_df is set to true. Signed-off-by: Wei Yongjun <yjwei@cn.fujitsu.com> Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-08-03 21:15:59 -07:00
Herbert Xu	f880374c2f	sctp: Drop ipfargok in sctp_xmit function The ipfragok flag controls whether the packet may be fragmented either on the local host on beyond. The latter is only valid on IPv4. In fact, we never want to do the latter even on IPv4 when PMTU is enabled. This is because even though we can't fragment packets within SCTP due to the prtocol's inherent faults, we can still fragment it at IP layer. By setting the DF bit we will improve the PMTU process. RFC 2960 only says that we SHOULD clear the DF bit in this case, so we're compliant even if we set the DF bit. In fact RFC 4960 no longer has this statement. Once we make this change, we only need to control the local fragmentation. There is already a bit in the skb which controls that, local_df. So this patch sets that instead of using the ipfragok argument. The only complication is that there isn't a struct sock object per transport, so for IPv4 we have to resort to changing the pmtudisc field for every packet. This should be safe though as the protocol is single-threaded. Note that after this patch we can remove ipfragok from the rest of the stack too. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-08-03 21:15:08 -07:00
Yang Hongyang	cfb266c0ee	ipv6: Fix the return value of Set Hop-by-Hop options header with NULL data pointer When Set Hop-by-Hop options header with NULL data pointer and optlen is not zero use setsockopt(), the kernel successfully return 0 instead of return error EINVAL or EFAULT. This patch fix the problem. Signed-off-by: Yang Hongyang <yanghy@cn.fujitsu.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-08-03 18:16:15 -07:00
Florian Westphal	1730554f25	ipv6: syncookies: free reqsk on xfrm_lookup error cookie_v6_check() did not call reqsk_free() if xfrm_lookup() fails, leaking the request sock. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-08-03 18:13:44 -07:00
Sven Wegener	adf044c877	net: Add missing extra2 parameter for ip_default_ttl sysctl Commit `76e6ebfb40` ("netns: add namespace parameter to rt_cache_flush") acceses the extra2 parameter of the ip_default_ttl ctl_table, but it is never set to a meaningful value. When `e84f84f276` ("netns: place rt_genid into struct net") is applied, we'll oops in rt_cache_invalidate(). Set extra2 to init_net, to avoid that. Reported-by: Marcin Slusarz <marcin.slusarz@gmail.com> Signed-off-by: Sven Wegener <sven.wegener@stealer.net> Tested-by: Marcin Slusarz <marcin.slusarz@gmail.com> Acked-by: Denis V. Lunev <den@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-08-03 14:06:44 -07:00
Lennert Buytenhek	e5a4a72d4f	net: use software GSO for SG+CSUM capable netdevices If a netdevice does not support hardware GSO, allowing the stack to use GSO anyway and then splitting the GSO skb into MSS-sized pieces as it is handed to the netdevice for transmitting is likely still a win as far as throughput and/or CPU usage are concerned, since it reduces the number of trips through the output path. This patch enables the use of GSO on any netdevice that supports SG. If a GSO skb is then sent to a netdevice that supports SG but does not support hardware GSO, net/core/dev.c:dev_hard_start_xmit() will take care of doing the necessary GSO segmentation in software. Signed-off-by: Lennert Buytenhek <buytenh@marvell.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-08-03 01:23:10 -07:00
Chris Larson	745e203164	net: fix missing pneigh entries in the neighbor seq_file code When pneigh entries exist, but the user's read buffer isn't sufficient to hold them all, one of the pneigh entries will be missing from the results. In neigh_get_idx_any, the number of elements which neigh_get_idx encountered is not correctly subtracted from the position number before the call to pneigh_get_idx. neigh_get_idx reduces the position by 1 for each call to neigh_get_next, but it does not reduce it by one for the first element (neigh_get_first). The patch alters the neigh_get_idx and pneigh_get_idx functions to subtract one from pos, for the first element, when pos is non-zero. Signed-off-by: Chris Larson <clarson@mvista.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-08-03 01:10:55 -07:00
Chris Larson	bff69732c9	net: in the first call to neigh_seq_next, call neigh_get_first, not neigh_get_idx. neigh_seq_next won't be called both with *pos > 0 && v == SEQ_START_TOKEN, so there's no point calling neigh_get_idx when we're on the start token, just call neigh_get_first directly. Signed-off-by: Chris Larson <clarson@mvista.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-08-03 01:02:41 -07:00
David S. Miller	35ed4e7598	mac80211: Use queue_lock() in ieee80211_ht_agg_queue_remove(). qdisc_root_lock() is only %100 safe to use when the RTNL semaphore is held. Signed-off-by: David S. Miller <davem@davemloft.net>	2008-08-02 23:25:50 -07:00
David S. Miller	5fb662297b	pkt_sched: Use qdisc_lock() on already sampled root qdisc. Based upon a bug report by Jeff Kirsher. Don't use qdisc_root_lock() in these cases as the root qdisc could have been changed, and we'd thus lock the wrong object. Tested by Emil S Tantilov who confirms that this seems to fix the problem. Signed-off-by: David S. Miller <davem@davemloft.net>	2008-08-02 20:02:43 -07:00
David S. Miller	e9e80ea5f2	Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-2.6	2008-08-01 22:08:51 -07:00
Tomas Winkler	f8e79ddd31	mac80211: fix fragmentation kludge This patch make mac80211 transmit correctly fragmented packet after queue was stopped Signed-off-by: Tomas Winkler <tomas.winkler@intel.com> Signed-off-by: John W. Linville <linville@tuxdriver.com>	2008-08-01 15:31:33 -04:00
Dmitry Baryshkov	96185664f1	RFKILL: set the status of the leds on activation. Provide default activate function to set the state of the led when the led becomes bound to the trigger Signed-off-by: Dmitry Baryshkov <dbaryshkov@gmail.com> Acked-by: Ivo van Doorn <IvDoorn@gmail.com> Acked-by: Henrique de Moraes Holschuh <hmh@hmh.eng.br> Signed-off-by: John W. Linville <linville@tuxdriver.com>	2008-08-01 15:31:33 -04:00
Dmitry Baryshkov	7c4f4578fc	RFKILL: allow one to specify led trigger name Allow the rfkill driver to specify led trigger name. By default it still defaults to the name of rfkill switch. Signed-off-by: Dmitry Baryshkov <dbaryshkov@gmail.com> Acked-by: Henrique de Moraes Holschuh <hmh@hmh.eng.br> Acked-by: Ivo van Doorn <IvDoorn@gmail.com> Signed-off-by: John W. Linville <linville@tuxdriver.com>	2008-08-01 15:31:33 -04:00
Henrique de Moraes Holschuh	6e28fbef0f	rfkill: query EV_SW states when rfkill-input (re)?connects to a input device Every time a new input device that is capable of one of the rfkill EV_SW events (currently only SW_RFKILL_ALL) is connected to rfkill-input, we must check the states of the input EV_SW switches and take action. Otherwise, we will ignore the initial switch state. We also need to re-check the states of the EV_SW switches after a device that was under an exclusive grab is released back to us, since we got no input events from that device while it was grabbed. Signed-off-by: Henrique de Moraes Holschuh <hmh@hmh.eng.br> Acked-by: Ivo van Doorn <IvDoorn@gmail.com> Cc: Dmitry Torokhov <dtor@mail.ru> Signed-off-by: John W. Linville <linville@tuxdriver.com>	2008-08-01 15:31:32 -04:00
Linus Torvalds	9a5467fd60	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6 * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (46 commits) tcp: MD5: Fix IPv6 signatures skbuff: add missing kernel-doc for do_not_encrypt net/ipv4/route.c: fix build error tcp: MD5: Fix MD5 signatures on certain ACK packets ipv6: Fix ip6_xmit to send fragments if ipfragok is true ipvs: Move userspace definitions to include/linux/ip_vs.h netdev: Fix lockdep warnings in multiqueue configurations. netfilter: xt_hashlimit: fix race between htable_destroy and htable_gc netfilter: ipt_recent: fix race between recent_mt_destroy and proc manipulations netfilter: nf_conntrack_tcp: decrease timeouts while data in unacknowledged irda: replace __FUNCTION__ with __func__ nsc-ircc: default to dongle type 9 on IBM hardware bluetooth: add quirks for a few hci_usb devices hysdn: remove the packed attribute from PofTimStamp_tag isdn: use the common ascii hex helpers tg3: adapt tg3 to use reworked PCI PM code atm: fix direct casts of pointers to u32 in the InterPhase driver atm: fix const assignment/discard warnings in the ATM networking driver net: use the common ascii hex helpers random32: seeding improvement ...	2008-08-01 11:35:16 -07:00
Al Viro	a1bc6eb4b4	[PATCH] ipv4_static_sysctl_init() should be under CONFIG_SYSCTL Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2008-08-01 11:25:22 -04:00
Adam Langley	00b1304c4c	tcp: MD5: Fix IPv6 signatures Reported by Stefanos Harhalakis; although 2.6.27-rc1 talks to itself using IPv6 TCP MD5 packets just fine, Stefanos noted that tcpdump claimed that the signatures were invalid. I broke this in `49a72dfb88` ("tcp: Fix MD5 signatures for non-linear skbs"), it was just a typo. Note that tcpdump will still sometimes claim that the signatures are incorrect. A patch to tcpdump has been submitted for this[1]. [1] http://tinyurl.com/6a4fl2 Signed-off-by: Adam Langley <agl@imperialviolet.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-31 21:36:07 -07:00
Ingo Molnar	8a9204db66	net/ipv4/route.c: fix build error fix: net/ipv4/route.c: In function 'ip_static_sysctl_init': net/ipv4/route.c:3225: error: 'ipv4_route_path' undeclared (first use in this function) net/ipv4/route.c:3225: error: (Each undeclared identifier is reported only once net/ipv4/route.c:3225: error: for each function it appears in.) net/ipv4/route.c:3225: error: 'ipv4_route_table' undeclared (first use in this function) Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-31 20:51:22 -07:00
Adam Langley	90b7e1120b	tcp: MD5: Fix MD5 signatures on certain ACK packets I noticed, looking at tcpdumps, that timewait ACKs were getting sent with an incorrect MD5 signature when signatures were enabled. I broke this in `49a72dfb88` ("tcp: Fix MD5 signatures for non-linear skbs"). I didn't take into account that the skb passed to tcp_*_send_ack was the inbound packet, thus the source and dest addresses need to be swapped when calculating the MD5 pseudoheader. Signed-off-by: Adam Langley <agl@imperialviolet.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-31 20:49:48 -07:00
Wei Yongjun	77e2f14f71	ipv6: Fix ip6_xmit to send fragments if ipfragok is true SCTP used ip6_xmit() to send fragments after received ICMP packet too big message. But while send packet used ip6_xmit, the skb->local_df is not initialized. So when skb if enter ip6_fragment(), the following code will discard the skb. ip6_fragment(...) { if (!skb->local_df) { ... return -EMSGSIZE; } ... } SCTP do the following step: 1. send packet ip6_xmit(skb, ipfragok=0) 2. received ICMP packet too big message 3. if PMTUD_ENABLE: ip6_xmit(skb, ipfragok=1) This patch fixed the problem by set local_df if ipfragok is true. Signed-off-by: Wei Yongjun <yjwei@cn.fujitsu.com> Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-31 20:46:47 -07:00
David S. Miller	c3f26a269c	netdev: Fix lockdep warnings in multiqueue configurations. When support for multiple TX queues were added, the netif_tx_lock() routines we converted to iterate over all TX queues and grab each queue's spinlock. This causes heartburn for lockdep and it's not a healthy thing to do with lots of TX queues anyways. So modify this to use a top-level lock and a "frozen" state for the individual TX queues. Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-31 16:58:50 -07:00
Pavel Emelyanov	967ab999a0	netfilter: xt_hashlimit: fix race between htable_destroy and htable_gc Deleting a timer with del_timer doesn't guarantee, that the timer function is not running at the moment of deletion. Thus in the xt_hashlimit case we can get into a ticklish situation when the htable_gc rearms the timer back and we'll actually delete an entry with a pending timer. Fix it with using del_timer_sync(). AFAIK del_timer_sync checks for the timer to be pending by itself, so I remove the check. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-31 00:38:52 -07:00
Pavel Emelyanov	a8ddc9163c	netfilter: ipt_recent: fix race between recent_mt_destroy and proc manipulations The thing is that recent_mt_destroy first flushes the entries from table with the recent_table_flush and only after this removes the proc file, corresponding to that table. Thus, if we manage to write to this file the '+XXX' command we will leak some entries. If we manage to write there a 'clean' command we'll race in two recent_table_flush flows, since the recent_mt_destroy calls this outside the recent_lock. The proper solution as I see it is to remove the proc file first and then go on with flushing the table. This flushing becomes safe w/o the lock, since the table is already inaccessible from the outside. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-31 00:38:31 -07:00
Patrick McHardy	ae375044d3	netfilter: nf_conntrack_tcp: decrease timeouts while data in unacknowledged In order to time out dead connections quicker, keep track of outstanding data and cap the timeout. Suggested by Herbert Xu. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-31 00:38:01 -07:00
David Howells	cba5cbd155	atm: fix const assignment/discard warnings in the ATM networking driver Fix const assignment/discard warnings in the ATM networking driver. The lane2_assoc_ind() function needed its arguments changing to match changes in the lane2_ops struct (patch `61c33e0129` "atm: use const where reasonable"). Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Chas Williams <chas@cmf.nrl.navy.mil> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-30 16:31:46 -07:00
Harvey Harrison	6a8341b68b	net: use the common ascii hex helpers Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-30 16:30:15 -07:00
Simon Wunderlich	4adf0af681	bridge: send correct MTU value in PMTU (revised) When bridging interfaces with different MTUs, the bridge correctly chooses the minimum of the MTUs of the physical devices as the bridges MTU. But when a frame is passed which fits through the incoming, but not through the outgoing interface, a "Fragmentation Needed" packet is generated. However, the propagated MTU is hardcoded to 1500, which is wrong in this situation. The sender will repeat the packet again with the same frame size, and the same problem will occur again. Instead of sending 1500, the (correct) MTU value of the bridge is now sent via PMTU. To achieve this, the corresponding rtable structure is stored in its net_bridge structure. Modified to get rid of fake_net_device as well. Signed-off-by: Simon Wunderlich <siwu@hrz.tu-chemnitz.de> Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-30 16:27:55 -07:00
Robert P. J. Day	031cf19e6f	net: Make "networking" one-click deselectable. Use a menuconfig directive to make all of networking support one-click deselectable from the top-level menu. Signed-off-by: Robert P. J. Day <rpjday@crashcourse.ca> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-30 03:27:53 -07:00
Daniel Lezcano	17ef51fce0	ipv6: Fix useless proc net sockstat6 removal This call is no longer needed, sockstat6 is per namespace so it is removed at the namespace subsystem destruction. Signed-off-by: Daniel Lezcano <dlezcano@fr.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-30 03:27:52 -07:00
David S. Miller	785957d3e8	tcp: MD5: Use MIB counter instead of warning for MD5 mismatch. From a report by Matti Aarnio, and preliminary patch by Adam Langley. Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-30 03:27:25 -07:00
David S. Miller	8d50b53d66	pkt_sched: Fix OOPS on ingress qdisc add. Bug report from Steven Jan Springl: Issuing the following command causes a kernel oops: tc qdisc add dev eth0 handle ffff: ingress The problem mostly stems from all of the special case handling of ingress qdiscs. So, to fix this, do the grafting operation the same way we do for TX qdiscs. Which means that dev_activate() and dev_deactivate() now do the "qdisc_sleeping <--> qdisc" transitions on dev->rx_queue too. Future simplifications are possible now, mainly because it is impossible for dev_queue->{qdisc,qdisc_sleeping} to be NULL. There are NULL checks all over to handle the ingress qdisc special case that used to exist before this commit. Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-30 02:44:25 -07:00
Miao Xie	4a36702e01	IPv6: datagram_send_ctl() should exit immediately when an error occured When an error occured, datagram_send_ctl() should exit immediately rather than continue to run the for loop. Otherwise, the variable err might be changed and the error might be hidden. Fix this bug by using "goto" instead of "break". Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-29 23:57:58 -07:00
David S. Miller	e93dc4891d	Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-2.6	2008-07-29 21:51:00 -07:00
Luis Carlos Cobo	56a6d13dfd	mac80211: fix mesh beaconing This patch fixes mesh beaconing, which was broken by "mac80211: revamp beacon configuration". Signed-off-by: Luis Carlos Cobo <luisca@cozybit.com> Signed-off-by: John W. Linville <linville@tuxdriver.com>	2008-07-29 16:55:09 -04:00
Johannes Berg	14db74bcc3	mac80211: fix cfg80211 hooks for master interface The master interface is a virtual interface that is registered to mac80211, changing that does not seem like a good idea at the moment. However, since it has no sdata, we cannot accept any configuration for it. This patch makes the cfg80211 hooks reject any such attempt. Signed-off-by: Johannes Berg <johannes@sipsolutions.net> Signed-off-by: John W. Linville <linville@tuxdriver.com>	2008-07-29 16:55:08 -04:00
Johannes Berg	bba95fefb8	nl80211: fix dump callbacks Julius Volz pointed out that the dump callbacks in nl80211 were broken and fixed one of them. This patch fixes the other three and also addresses the TODOs there. Signed-off-by: Johannes Berg <johannes@sipsolutions.net> Cc: Julius Volz <juliusv@google.com> Cc: Thomas Graf <tgraf@suug.ch> Signed-off-by: John W. Linville <linville@tuxdriver.com>	2008-07-29 16:55:08 -04:00
Johannes Berg	d0f0980414	mac80211: partially fix skb->cb use This patch fixes mac80211 to not use the skb->cb over the queue step from virtual interfaces to the master. The patch also, for now, disables aggregation because that would still require requeuing, will fix that in a separate patch. There are two other places (software requeue and powersaving stations) where requeue can happen, but that is not currently used by any drivers/not possible to use respectively. Signed-off-by: Johannes Berg <johannes@sipsolutions.net> Signed-off-by: John W. Linville <linville@tuxdriver.com>	2008-07-29 16:55:08 -04:00
Rami Rosen	5422399518	mac80211: append CONFIG_ to MAC80211_VERBOSE_PS_DEBUG in net/mac80211/tx.c. In net/mac80211/tx.c, there are some #ifdef which checks MAC80211_VERBOSE_PS_DEBUG (which in fact is never set) instead of CONFIG_MAC80211_VERBOSE_PS_DEBUG, as should be. This patch replaces MAC80211_VERBOSE_PS_DEBUG with CONFIG_MAC80211_VERBOSE_PS_DEBUG in these #ifdef commands in net/mac80211/tx.c. Signed-off-by: Rami Rosen <ramirose@gmail.com> Signed-off-by: John W. Linville <linville@tuxdriver.com>	2008-07-29 16:55:07 -04:00
Jeremy Fitzhardinge	023a04bebe	mac80211: return correct error return from ieee80211_wep_init Return the proper error code rather than a hard-coded ENOMEM from ieee80211_wep_init. Also, print the error code on failure. Signed-off-by: Jeremy Fitzhardinge <jeremy@goop.org> Signed-off-by: John W. Linville <linville@tuxdriver.com>	2008-07-29 16:55:07 -04:00
Jiri Slaby	1b0241656b	mac80211: tx, use dev_kfree_skb_any for beacon_get Use dev_kfree_skb_any(); instead of dev_kfree_skb();, since ieee80211_beacon_get function might be called from atomic. (It's in a fail path.) Signed-off-by: Jiri Slaby <jirislaby@gmail.com> Cc: Johannes Berg <johannes@sipsolutions.net> Cc: Michael Wu <flamingice@sourmilk.net> Signed-off-by: John W. Linville <linville@tuxdriver.com>	2008-07-29 16:55:07 -04:00
Henrique de Moraes Holschuh	435307a365	rfkill: yet more minor kernel-doc fixes For some stupid reason, I sent and old version of the patch minor kernel doc-fix patch, and it got merged before I noticed the problem. This is an incremental fix on top. Signed-off-by: Henrique de Moraes Holschuh <hmh@hmh.eng.br> Acked-by: Ivo van Doorn <IvDoorn@gmail.com> Signed-off-by: John W. Linville <linville@tuxdriver.com>	2008-07-29 16:55:03 -04:00
Henrique de Moraes Holschuh	064af1117b	rfkill: mutex fixes There are two mutexes in rfkill: rfkill->mutex, which protects some of the fields of a rfkill struct, and is also used for callback serialization. rfkill_mutex, which protects the global state, the list of registered rfkill structs and rfkill->claim. Make sure to use the correct mutex, and to not miss locking rfkill->mutex even when we already took rfkill_mutex. Signed-off-by: Henrique de Moraes Holschuh <hmh@hmh.eng.br> Acked-by: Ivo van Doorn <IvDoorn@gmail.com> Signed-off-by: John W. Linville <linville@tuxdriver.com>	2008-07-29 16:36:35 -04:00
Henrique de Moraes Holschuh	37f55e9d78	rfkill: fix led-trigger unregister order in error unwind rfkill needs to unregister the led trigger AFTER a call to rfkill_remove_switch(), otherwise it will not update the LED state, possibly leaving it ON when it should be OFF. To make led-trigger unregistering safer, guard against unregistering a trigger twice, and also against issuing trigger events to a led trigger that was unregistered. This makes the error unwind paths more resilient. Refer to "rfkill: Register LED triggers before registering switch". Signed-off-by: Henrique de Moraes Holschuh <hmh@hmh.eng.br> Acked-by: Ivo van Doorn <IvDoorn@gmail.com> Cc: Michael Buesch <mb@bu3sch.de> Cc: Dmitry Baryshkov <dbaryshkov@gmail.com> Signed-off-by: John W. Linville <linville@tuxdriver.com>	2008-07-29 16:36:32 -04:00
Henrique de Moraes Holschuh	2fd9b2212e	rfkill: document rfkill_force_state as required (v2) While the rfkill class does work with just get_state(), it doesn't work well on devices that are subject to external events that cause rfkill state changes. Document that rfkill_force_state() is required in those cases. Signed-off-by: Henrique de Moraes Holschuh <hmh@hmh.eng.br> Acked-by: Ivo van Doorn <IvDoorn@gmail.com> Signed-off-by: John W. Linville <linville@tuxdriver.com>	2008-07-29 16:36:32 -04:00
Ingo Molnar	9e3ee1c39c	Merge branch 'linus' into cpus4096 Conflicts: kernel/stop_machine.c Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-07-28 23:32:00 +02:00
Ingo Molnar	414f746d23	Merge branch 'linus' into cpus4096	2008-07-28 21:14:43 +02:00
Al Viro	eeb61f719c	missing bits of net-namespace / sysctl Piss-poor sysctl registration API strikes again, film at 11... What we really need is _pathname_ required to be present in already registered table, so that kernel could warn about bad order. That's the next target for sysctl stuff (and generally saner and more explicit order of initialization of ipv[46] internals wouldn't hurt either). For the time being, here are full fixups required by ..._rotable() stuff; we make per-net sysctl sets descendents of "ro" one and make sure that sufficient skeleton is there before we start registering per-net sysctls. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-27 09:45:34 -07:00
David S. Miller	2ab61b0111	Merge branch 'master' of git://eden-feed.erg.abdn.ac.uk/net-2.6	2008-07-27 05:00:25 -07:00
Al Viro	6f9f489a4e	net: missing bits of net-namespace / sysctl Piss-poor sysctl registration API strikes again, film at 11... What we really need is _pathname_ required to be present in already registered table, so that kernel could warn about bad order. That's the next target for sysctl stuff (and generally saner and more explicit order of initialization of ipv[46] internals wouldn't hurt either). For the time being, here are full fixups required by ..._rotable() stuff; we make per-net sysctl sets descendents of "ro" one and make sure that sufficient skeleton is there before we start registering per-net sysctls. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-27 04:40:51 -07:00
David S. Miller	15d3b4a262	Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/torvalds/linux-2.6	2008-07-27 04:40:08 -07:00
David S. Miller	2c3abab7c9	ipcomp: Fix warnings after ipcomp consolidation. net/ipv4/ipcomp.c: In function ‘ipcomp4_init_state’: net/ipv4/ipcomp.c:109: warning: unused variable ‘calg_desc’ net/ipv4/ipcomp.c:108: warning: unused variable ‘ipcd’ net/ipv4/ipcomp.c:107: warning: ‘err’ may be used uninitialized in this function net/ipv6/ipcomp6.c: In function ‘ipcomp6_init_state’: net/ipv6/ipcomp6.c:139: warning: unused variable ‘calg_desc’ net/ipv6/ipcomp6.c:138: warning: unused variable ‘ipcd’ net/ipv6/ipcomp6.c:137: warning: ‘err’ may be used uninitialized in this function Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-27 03:59:24 -07:00
Linus Torvalds	4836e30078	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (39 commits) [PATCH] fix RLIM_NOFILE handling [PATCH] get rid of corner case in dup3() entirely [PATCH] remove remaining namei_{32,64}.h crap [PATCH] get rid of indirect users of namei.h [PATCH] get rid of __user_path_lookup_open [PATCH] f_count may wrap around [PATCH] dup3 fix [PATCH] don't pass nameidata to __ncp_lookup_validate() [PATCH] don't pass nameidata to gfs2_lookupi() [PATCH] new (local) helper: user_path_parent() [PATCH] sanitize __user_walk_fd() et.al. [PATCH] preparation to __user_walk_fd cleanup [PATCH] kill nameidata passing to permission(), rename to inode_permission() [PATCH] take noexec checks to very few callers that care Re: [PATCH 3/6] vfs: open_exec cleanup [patch 4/4] vfs: immutable inode checking cleanup [patch 3/4] fat: dont call notify_change [patch 2/4] vfs: utimes cleanup [patch 1/4] vfs: utimes: move owner check into inode_change_ok() [PATCH] vfs: use kstrdup() and check failing allocation ...	2008-07-26 20:23:44 -07:00
Linus Torvalds	2284284281	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6 * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: netns: fix ip_rt_frag_needed rt_is_expired netfilter: nf_conntrack_extend: avoid unnecessary "ct->ext" dereferences netfilter: fix double-free and use-after free netfilter: arptables in netns for real netfilter: ip{,6}tables_security: fix future section mismatch selinux: use nf_register_hooks() netfilter: ebtables: use nf_register_hooks() Revert "pkt_sched: sch_sfq: dump a real number of flows" qeth: use dev->ml_priv instead of dev->priv syncookies: Make sure ECN is disabled net: drop unused BUG_TRAP() net: convert BUG_TRAP to generic WARN_ON drivers/net: convert BUG_TRAP to generic WARN_ON	2008-07-26 20:17:56 -07:00
Al Viro	516e0cc564	[PATCH] f_count may wrap around make it atomic_long_t; while we are at it, get rid of useless checks in affs, hfs and hpfs - ->open() always has it equal to 1, ->release() - to 0. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2008-07-26 20:53:40 -04:00
Al Viro	bd7b1533cd	[PATCH] sysctl: make sure that /proc/sys/net/ipv4 appears before per-ns ones Massage ipv4 initialization - make sure that net.ipv4 appears as non-per-net-namespace before it shows up in per-net-namespace sysctls. That's the only change outside of sysctl.c needed to get sane ordering rules and data structures for sysctls (esp. for procfs side of that mess). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2008-07-26 20:53:10 -04:00
Al Viro	734550921e	[PATCH] beginning of sysctl cleanup - ctl_table_set New object: set of sysctls [currently - root and per-net-ns]. Contains: pointer to parent set, list of tables and "should I see this set?" method (->is_seen(set)). Current lists of tables are subsumed by that; net-ns contains such a beast. ->lookup() for ctl_table_root returns pointer to ctl_table_set instead of that to ->list of that ctl_table_set. [folded compile fixes by rdd for configs without sysctl] Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2008-07-26 20:53:08 -04:00
Hugh Dickins	6c3b8fc618	netns: fix ip_rt_frag_needed rt_is_expired Running recent kernels, and using a particular vpn gateway, I've been having to edit my mails down to get them accepted by the smtp server. Git bisect led to commit `e84f84f276` - netns: place rt_genid into struct net. The conversion from a != test to rt_is_expired() put one negative too many: and now my mail works. Signed-off-by: Hugh Dickins <hugh@veritas.com> Acked-by: Denis V. Lunev <den@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-26 17:51:06 -07:00
Patrick McHardy	6c64825bf4	netfilter: nf_conntrack_extend: avoid unnecessary "ct->ext" dereferences As Linus points out, "ct->ext" and "new" are always equal, avoid unnecessary dereferences and use "new" directly. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-26 17:50:05 -07:00
Pekka Enberg	93bc4e89c2	netfilter: fix double-free and use-after free As suggested by Patrick McHardy, introduce a __krealloc() that doesn't free the original buffer to fix a double-free and use-after-free bug introduced by me in netfilter that uses RCU. Reported-by: Patrick McHardy <kaber@trash.net> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi> Tested-by: Dieter Ries <clip2@gmx.de> Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-26 17:49:33 -07:00
Alexey Dobriyan	3918fed5f3	netfilter: arptables in netns for real IN, FORWARD -- grab netns from in device, OUT -- from out device. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-26 17:48:59 -07:00
Alexey Dobriyan	f858b4869a	netfilter: ip{,6}tables_security: fix future section mismatch Currently not visible, because NET_NS is mutually exclusive with SYSFS which is required by SECURITY. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-26 17:48:38 -07:00
Alexey Dobriyan	e40f51a36a	netfilter: ebtables: use nf_register_hooks() Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-26 17:47:53 -07:00
Alexey Dobriyan	51cc50685a	SL*B: drop kmem cache argument from constructor Kmem cache passed to constructor is only needed for constructors that are themselves multiplexeres. Nobody uses this "feature", nor does anybody uses passed kmem cache in non-trivial way, so pass only pointer to object. Non-trivial places are: arch/powerpc/mm/init_64.c arch/powerpc/mm/hugetlbpage.c This is flag day, yes. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Acked-by: Pekka Enberg <penberg@cs.helsinki.fi> Acked-by: Christoph Lameter <cl@linux-foundation.org> Cc: Jon Tollefson <kniht@linux.vnet.ibm.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Matt Mackall <mpm@selenic.com> [akpm@linux-foundation.org: fix arch/powerpc/mm/hugetlbpage.c] [akpm@linux-foundation.org: fix mm/slab.c] [akpm@linux-foundation.org: fix ubifs] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-26 12:00:07 -07:00
FUJITA Tomonori	8d8bb39b9e	dma-mapping: add the device argument to dma_mapping_error() Add per-device dma_mapping_ops support for CONFIG_X86_64 as POWER architecture does: This enables us to cleanly fix the Calgary IOMMU issue that some devices are not behind the IOMMU (http://lkml.org/lkml/2008/5/8/423). I think that per-device dma_mapping_ops support would be also helpful for KVM people to support PCI passthrough but Andi thinks that this makes it difficult to support the PCI passthrough (see the above thread). So I CC'ed this to KVM camp. Comments are appreciated. A pointer to dma_mapping_ops to struct dev_archdata is added. If the pointer is non NULL, DMA operations in asm/dma-mapping.h use it. If it's NULL, the system-wide dma_ops pointer is used as before. If it's useful for KVM people, I plan to implement a mechanism to register a hook called when a new pci (or dma capable) device is created (it works with hot plugging). It enables IOMMUs to set up an appropriate dma_mapping_ops per device. The major obstacle is that dma_mapping_error doesn't take a pointer to the device unlike other DMA operations. So x86 can't have dma_mapping_ops per device. Note all the POWER IOMMUs use the same dma_mapping_error function so this is not a problem for POWER but x86 IOMMUs use different dma_mapping_error functions. The first patch adds the device argument to dma_mapping_error. The patch is trivial but large since it touches lots of drivers and dma-mapping.h in all the architecture. This patch: dma_mapping_error() doesn't take a pointer to the device unlike other DMA operations. So we can't have dma_mapping_ops per device. Note that POWER already has dma_mapping_ops per device but all the POWER IOMMUs use the same dma_mapping_error function. x86 IOMMUs use device argument. [akpm@linux-foundation.org: fix sge] [akpm@linux-foundation.org: fix svc_rdma] [akpm@linux-foundation.org: build fix] [akpm@linux-foundation.org: fix bnx2x] [akpm@linux-foundation.org: fix s2io] [akpm@linux-foundation.org: fix pasemi_mac] [akpm@linux-foundation.org: fix sdhci] [akpm@linux-foundation.org: build fix] [akpm@linux-foundation.org: fix sparc] [akpm@linux-foundation.org: fix ibmvscsi] Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp> Cc: Muli Ben-Yehuda <muli@il.ibm.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@elte.hu> Cc: Avi Kivity <avi@qumranet.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-26 12:00:03 -07:00
Mike Travis	0bc3cc03fa	cpumask: change cpumask_of_cpu_ptr to use new cpumask_of_cpu * Replace previous instances of the cpumask_of_cpu_ptr* macros with a the new (lvalue capable) generic cpumask_of_cpu(). Signed-off-by: Mike Travis <travis@sgi.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Jack Steiner <steiner@sgi.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-07-26 16:40:33 +02:00
Wei Yongjun	860239c56b	dccp: Add check for truncated ICMPv6 DCCP error packets This patch adds a minimum-length check for ICMPv6 packets, as per the previous patch for ICMPv4 payloads. Signed-off-by: Wei Yongjun <yjwei@cn.fujitsu.com> Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>	2008-07-26 11:59:11 +01:00
Wei Yongjun	18e1d83600	dccp: Fix incorrect length check for ICMPv4 packets Unlike TCP, which only needs 8 octets of original packet data, DCCP requires minimally 12 or 16 bytes for ICMP-payload sequence number checks. This patch replaces the insufficient length constant of 8 with a two-stage test, making sure that 12 bytes are available, before computing the basic header length required for sequence number checks. Signed-off-by: Wei Yongjun <yjwei@cn.fujitsu.com> Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>	2008-07-26 11:59:10 +01:00
Wei Yongjun	e0bcfb0c6a	dccp: Add check for sequence number in ICMPv6 message This adds a sequence number check for ICMPv6 DCCP error packets, in the same manner as it has been done for ICMPv4 in the previous patch. Signed-off-by: Wei Yongjun <yjwei@cn.fujitsu.com> Acked-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>	2008-07-26 11:59:10 +01:00
Wei Yongjun	d68f0866f7	dccp: Fix sequence number check for ICMPv4 packets The payload of ICMP message is a part of the packet sent by ourself, so the sequence number check must use AWL and AWH, not SWL and SWH. For example: Endpoint A Endpoint B DATA-ACK --------> (SEQ=X) <-------- ICMP (Fragmentation Needed) (SEQ=X) Signed-off-by: Wei Yongjun <yjwei@cn.fujitsu.com> Acked-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>	2008-07-26 11:59:10 +01:00
Gerrit Renker	73f18fdbca	dccp: Bug-Fix - AWL was never updated The AWL lower Ack validity window advances in proportion to GSS, the greatest sequence number sent. Updating AWL other than at connection setup (in the DCCP-Request sent by dccp_v{4,6}_connect()) was missing in the DCCP code. This bug lead to syslog messages such as "kernel: dccp_check_seqno: DCCP: Step 6 failed for DATAACK packet, [...] P.ackno exists or LAWL(82947089) <= P.ackno(82948208) <= S.AWH(82948728), sending SYNC..." The difference between AWL/AWH here is 1639 packets, while the expected value (the Sequence Window) would have been 100 (the default). A closer look showed that LAWL = AWL = 82947089 equalled the ISS on the Response. The patch now updates AWL with each increase of GSS. Further changes: ---------------- The patch also enforces more stringent checks on the ISS sequence number: * AWL is initialised to ISS at connection setup and remains at this value; * AWH is then always set to GSS (via dccp_update_gss()); * so on the first Request: AWL = AWH = ISS, and on the n-th Request: AWL = ISS, AWH = ISS + n. As a consequence, only Response packets that refer to Requests sent by this host will pass, all others are discarded. This is the intention and in effect implements the initial adjustments for AWL as specified in RFC 4340, 7.5.1. Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk> Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>	2008-07-26 11:59:10 +01:00
Gerrit Renker	59435444a1	dccp: Allow to distinguish original and retransmitted packets This patch allows the sender to distinguish original and retransmitted packets, which is in particular needed for the retransmission of DCCP-Requests: * the first Request uses ISS (generated in net/dccp/ip.c), and sets GSS = ISS; all retransmitted Requests use GSS' = GSS + 1, so that the n-th retransmitted Request has sequence number ISS + n (mod 48). To add generic support, the patch reorganises existing code so that: * icsk_retransmits == 0 for the original packet and * icsk_retransmits = n > 0 for the n-th retransmitted packet at the time dccp_transmit_skb() is called, via dccp_retransmit_skb(). Thanks to Wei Yongjun for pointing this problem out. Further changes: ---------------- * removed the `skb' argument from dccp_retransmit_skb(), since sk_send_head is used for all retransmissions (the exception is client-Acks in PARTOPEN state, but these do not use sk_send_head); * since sk_send_head always contains the original skb (via dccp_entail()), skb_cloned() never evaluated to true and thus pskb_copy() was never used. Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>	2008-07-26 11:59:09 +01:00
David S. Miller	cdec7e50a4	Revert "pkt_sched: sch_sfq: dump a real number of flows" This reverts commit `f867e6af94`. Based upon discussions between Jarek and Patrick McHardy this is field being set is more a config parameter than a statistic. And we should add a true statistic to provide this information if we really want it. Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-26 02:28:09 -07:00
Florian Westphal	16df845f45	syncookies: Make sure ECN is disabled ecn_ok is not initialized when a connection is established by cookies. The cookie syn-ack never sets ECN, so ecn_ok must be set to 0. Spotted using ns-3/network simulation cradle simulator and valgrind. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-26 02:21:54 -07:00
Ilpo Järvinen	547b792cac	net: convert BUG_TRAP to generic WARN_ON Removes legacy reinvent-the-wheel type thing. The generic machinery integrates much better to automated debugging aids such as kerneloops.org (and others), and is unambiguous due to better naming. Non-intuively BUG_TRAP() is actually equal to WARN_ON() rather than BUG_ON() though some might actually be promoted to BUG_ON() but I left that to future. I could make at least one BUILD_BUG_ON conversion. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-25 21:43:18 -07:00
Linus Torvalds	1ff8419871	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6 * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: ipsec: ipcomp - Decompress into frags if necessary ipsec: ipcomp - Merge IPComp implementations pkt_sched: Fix locking in shutdown_scheduler_queue()	2008-07-25 17:40:16 -07:00
Stephen Hemminger	4ecb90090c	sysctl: allow override of /proc/sys/net with CAP_NET_ADMIN Extend the permission check for networking sysctl's to allow modification when current process has CAP_NET_ADMIN capability and is not root. This version uses the until now unused permissions hook to override the mode value for /proc/sys/net if accessed by a user with capabilities. Found while working with Quagga. It is impossible to turn forwarding on/off through the command interface because Quagga uses secure coding practice of dropping privledges during initialization and only raising via capabilities when necessary. Since the dameon has reset real/effective uid after initialization, all attempts to access /proc/sys/net variables will fail. Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Chris Wright <chrisw@sous-sol.org> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Andrew Morgan <morgan@kernel.org> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-25 10:53:45 -07:00
Dave Young	717115e1a5	printk ratelimiting rewrite All ratelimit user use same jiffies and burst params, so some messages (callbacks) will be lost. For example: a call printk_ratelimit(5 * HZ, 1) b call printk_ratelimit(5 * HZ, 1) before the 5*HZ timeout of a, then b will will be supressed. - rewrite __ratelimit, and use a ratelimit_state as parameter. Thanks for hints from andrew. - Add WARN_ON_RATELIMIT, update rcupreempt.h - remove __printk_ratelimit - use __ratelimit in net_ratelimit Signed-off-by: Dave Young <hidave.darkstar@gmail.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: "Paul E. McKenney" <paulmck@us.ibm.com> Cc: Dave Young <hidave.darkstar@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-25 10:53:29 -07:00
Paul E. McKenney	696adfe84c	list_for_each_rcu must die: networking All uses of list_for_each_rcu() can be profitably replaced by the easier-to-use list_for_each_entry_rcu(). This patch makes this change for networking, in preparation for removing the list_for_each_rcu() API entirely. Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-25 10:53:27 -07:00
Herbert Xu	7d7e5a60c6	ipsec: ipcomp - Decompress into frags if necessary When decompressing extremely large packets allocating them through kmalloc is prone to failure. Therefore it's better to use page frags instead. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-25 02:55:33 -07:00
Herbert Xu	6fccab671f	ipsec: ipcomp - Merge IPComp implementations This patch merges the IPv4/IPv6 IPComp implementations since most of the code is identical. As a result future enhancements will no longer need to be duplicated. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-25 02:54:40 -07:00
David S. Miller	cffe1c5d7a	pkt_sched: Fix locking in shutdown_scheduler_queue() Qdisc locks need to be held with BH disabled. Tested-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-25 01:25:04 -07:00
Linus Torvalds	c3c2233d84	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6 * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: pkt_sched: sch_sfq: dump a real number of flows atm: [fore200e] use MODULE_FIRMWARE() and other suggested cleanups netfilter: make security table depend on NETFILTER_ADVANCED tcp: Clear probes_out more aggressively in tcp_ack(). e1000e: fix e1000_netpoll(), remove extraneous e1000_clean_tx_irq() call net: Update entry in af_family_clock_key_strings netdev: Remove warning from __netif_schedule(). sky2: don't stop queue on shutdown	2008-07-24 12:14:58 -07:00
Ulrich Drepper	e38b36f325	flag parameters: check magic constants This patch adds test that ensure the boundary conditions for the various constants introduced in the previous patches is met. No code is generated. [akpm@linux-foundation.org: fix alpha] Signed-off-by: Ulrich Drepper <drepper@redhat.com> Acked-by: Davide Libenzi <davidel@xmailserver.org> Cc: Michael Kerrisk <mtk.manpages@googlemail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-24 10:47:29 -07:00
Ulrich Drepper	77d2720059	flag parameters: NONBLOCK in socket and socketpair This patch introduces support for the SOCK_NONBLOCK flag in socket, socketpair, and paccept. To do this the internal function sock_attach_fd gets an additional parameter which it uses to set the appropriate flag for the file descriptor. Given that in modern, scalable programs almost all socket connections are non-blocking and the minimal additional cost for the new functionality I see no reason not to add this code. The following test must be adjusted for architectures other than x86 and x86-64 and in case the syscall numbers changed. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ #include <fcntl.h> #include <pthread.h> #include <stdio.h> #include <unistd.h> #include <netinet/in.h> #include <sys/socket.h> #include <sys/syscall.h> #ifndef __NR_paccept # ifdef __x86_64__ # define __NR_paccept 288 # elif defined __i386__ # define SYS_PACCEPT 18 # define USE_SOCKETCALL 1 # else # error "need __NR_paccept" # endif #endif #ifdef USE_SOCKETCALL # define paccept(fd, addr, addrlen, mask, flags) \ ({ long args[6] = { \ (long) fd, (long) addr, (long) addrlen, (long) mask, 8, (long) flags }; \ syscall (__NR_socketcall, SYS_PACCEPT, args); }) #else # define paccept(fd, addr, addrlen, mask, flags) \ syscall (__NR_paccept, fd, addr, addrlen, mask, 8, flags) #endif #define PORT 57392 #define SOCK_NONBLOCK O_NONBLOCK static pthread_barrier_t b; static void * tf (void arg) { pthread_barrier_wait (&b); int s = socket (AF_INET, SOCK_STREAM, 0); struct sockaddr_in sin; sin.sin_family = AF_INET; sin.sin_addr.s_addr = htonl (INADDR_LOOPBACK); sin.sin_port = htons (PORT); connect (s, (const struct sockaddr ) &sin, sizeof (sin)); close (s); pthread_barrier_wait (&b); pthread_barrier_wait (&b); s = socket (AF_INET, SOCK_STREAM, 0); sin.sin_port = htons (PORT); connect (s, (const struct sockaddr ) &sin, sizeof (sin)); close (s); pthread_barrier_wait (&b); return NULL; } int main (void) { int fd; fd = socket (PF_INET, SOCK_STREAM, 0); if (fd == -1) { puts ("socket(0) failed"); return 1; } int fl = fcntl (fd, F_GETFL); if (fl == -1) { puts ("fcntl failed"); return 1; } if (fl & O_NONBLOCK) { puts ("socket(0) set non-blocking mode"); return 1; } close (fd); fd = socket (PF_INET, SOCK_STREAM\|SOCK_NONBLOCK, 0); if (fd == -1) { puts ("socket(SOCK_NONBLOCK) failed"); return 1; } fl = fcntl (fd, F_GETFL); if (fl == -1) { puts ("fcntl failed"); return 1; } if ((fl & O_NONBLOCK) == 0) { puts ("socket(SOCK_NONBLOCK) does not set non-blocking mode"); return 1; } close (fd); int fds[2]; if (socketpair (PF_UNIX, SOCK_STREAM, 0, fds) == -1) { puts ("socketpair(0) failed"); return 1; } for (int i = 0; i < 2; ++i) { fl = fcntl (fds[i], F_GETFL); if (fl == -1) { puts ("fcntl failed"); return 1; } if (fl & O_NONBLOCK) { printf ("socketpair(0) set non-blocking mode for fds[%d]\n", i); return 1; } close (fds[i]); } if (socketpair (PF_UNIX, SOCK_STREAM\|SOCK_NONBLOCK, 0, fds) == -1) { puts ("socketpair(SOCK_NONBLOCK) failed"); return 1; } for (int i = 0; i < 2; ++i) { fl = fcntl (fds[i], F_GETFL); if (fl == -1) { puts ("fcntl failed"); return 1; } if ((fl & O_NONBLOCK) == 0) { printf ("socketpair(SOCK_NONBLOCK) does not set non-blocking mode for fds[%d]\n", i); return 1; } close (fds[i]); } pthread_barrier_init (&b, NULL, 2); struct sockaddr_in sin; pthread_t th; if (pthread_create (&th, NULL, tf, NULL) != 0) { puts ("pthread_create failed"); return 1; } int s = socket (AF_INET, SOCK_STREAM, 0); int reuse = 1; setsockopt (s, SOL_SOCKET, SO_REUSEADDR, &reuse, sizeof (reuse)); sin.sin_family = AF_INET; sin.sin_addr.s_addr = htonl (INADDR_LOOPBACK); sin.sin_port = htons (PORT); bind (s, (struct sockaddr ) &sin, sizeof (sin)); listen (s, SOMAXCONN); pthread_barrier_wait (&b); int s2 = paccept (s, NULL, 0, NULL, 0); if (s2 < 0) { puts ("paccept(0) failed"); return 1; } fl = fcntl (s2, F_GETFL); if (fl & O_NONBLOCK) { puts ("paccept(0) set non-blocking mode"); return 1; } close (s2); close (s); pthread_barrier_wait (&b); s = socket (AF_INET, SOCK_STREAM, 0); sin.sin_port = htons (PORT); setsockopt (s, SOL_SOCKET, SO_REUSEADDR, &reuse, sizeof (reuse)); bind (s, (struct sockaddr *) &sin, sizeof (sin)); listen (s, SOMAXCONN); pthread_barrier_wait (&b); s2 = paccept (s, NULL, 0, NULL, SOCK_NONBLOCK); if (s2 < 0) { puts ("paccept(SOCK_NONBLOCK) failed"); return 1; } fl = fcntl (s2, F_GETFL); if ((fl & O_NONBLOCK) == 0) { puts ("paccept(SOCK_NONBLOCK) does not set non-blocking mode"); return 1; } close (s2); close (s); pthread_barrier_wait (&b); puts ("OK"); return 0; } ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Signed-off-by: Ulrich Drepper <drepper@redhat.com> Acked-by: Davide Libenzi <davidel@xmailserver.org> Cc: Michael Kerrisk <mtk.manpages@googlemail.com> Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-24 10:47:29 -07:00
Ulrich Drepper	c019bbc612	flag parameters: paccept w/out set_restore_sigmask Some platforms do not have support to restore the signal mask in the return path from a syscall. For those platforms syscalls like pselect are not defined at all. This is, I think, not a good choice for paccept() since paccept() adds more value on top of accept() than just the signal mask handling. Therefore this patch defines a scaled down version of the sys_paccept function for those platforms. It returns -EINVAL in case the signal mask is non-NULL but behaves the same otherwise. Note that I explicitly included <linux/thread_info.h>. I saw that it is currently included but indirectly two levels down. There is too much risk in relying on this. The header might change and then suddenly the function definition would change without anyone immediately noticing. Signed-off-by: Ulrich Drepper <drepper@redhat.com> Cc: Davide Libenzi <davidel@xmailserver.org> Cc: Michael Kerrisk <mtk.manpages@googlemail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-24 10:47:27 -07:00
Ulrich Drepper	aaca0bdca5	flag parameters: paccept This patch is by far the most complex in the series. It adds a new syscall paccept. This syscall differs from accept in that it adds (at the userlevel) two additional parameters: - a signal mask - a flags value The flags parameter can be used to set flag like SOCK_CLOEXEC. This is imlpemented here as well. Some people argued that this is a property which should be inherited from the file desriptor for the server but this is against POSIX. Additionally, we really want the signal mask parameter as well (similar to pselect, ppoll, etc). So an interface change in inevitable. The flag value is the same as for socket and socketpair. I think diverging here will only create confusion. Similar to the filesystem interfaces where the use of the O_* constants differs, it is acceptable here. The signal mask is handled as for pselect etc. The mask is temporarily installed for the thread and removed before the call returns. I modeled the code after pselect. If there is a problem it's likely also in pselect. For architectures which use socketcall I maintained this interface instead of adding a system call. The symmetry shouldn't be broken. The following test must be adjusted for architectures other than x86 and x86-64 and in case the syscall numbers changed. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ #include <errno.h> #include <fcntl.h> #include <pthread.h> #include <signal.h> #include <stdio.h> #include <unistd.h> #include <netinet/in.h> #include <sys/socket.h> #include <sys/syscall.h> #ifndef __NR_paccept # ifdef __x86_64__ # define __NR_paccept 288 # elif defined __i386__ # define SYS_PACCEPT 18 # define USE_SOCKETCALL 1 # else # error "need __NR_paccept" # endif #endif #ifdef USE_SOCKETCALL # define paccept(fd, addr, addrlen, mask, flags) \ ({ long args[6] = { \ (long) fd, (long) addr, (long) addrlen, (long) mask, 8, (long) flags }; \ syscall (__NR_socketcall, SYS_PACCEPT, args); }) #else # define paccept(fd, addr, addrlen, mask, flags) \ syscall (__NR_paccept, fd, addr, addrlen, mask, 8, flags) #endif #define PORT 57392 #define SOCK_CLOEXEC O_CLOEXEC static pthread_barrier_t b; static void * tf (void arg) { pthread_barrier_wait (&b); int s = socket (AF_INET, SOCK_STREAM, 0); struct sockaddr_in sin; sin.sin_family = AF_INET; sin.sin_addr.s_addr = htonl (INADDR_LOOPBACK); sin.sin_port = htons (PORT); connect (s, (const struct sockaddr ) &sin, sizeof (sin)); close (s); pthread_barrier_wait (&b); s = socket (AF_INET, SOCK_STREAM, 0); sin.sin_port = htons (PORT); connect (s, (const struct sockaddr ) &sin, sizeof (sin)); close (s); pthread_barrier_wait (&b); pthread_barrier_wait (&b); sleep (2); pthread_kill ((pthread_t) arg, SIGUSR1); return NULL; } static void handler (int s) { } int main (void) { pthread_barrier_init (&b, NULL, 2); struct sockaddr_in sin; pthread_t th; if (pthread_create (&th, NULL, tf, (void ) pthread_self ()) != 0) { puts ("pthread_create failed"); return 1; } int s = socket (AF_INET, SOCK_STREAM, 0); int reuse = 1; setsockopt (s, SOL_SOCKET, SO_REUSEADDR, &reuse, sizeof (reuse)); sin.sin_family = AF_INET; sin.sin_addr.s_addr = htonl (INADDR_LOOPBACK); sin.sin_port = htons (PORT); bind (s, (struct sockaddr *) &sin, sizeof (sin)); listen (s, SOMAXCONN); pthread_barrier_wait (&b); int s2 = paccept (s, NULL, 0, NULL, 0); if (s2 < 0) { puts ("paccept(0) failed"); return 1; } int coe = fcntl (s2, F_GETFD); if (coe & FD_CLOEXEC) { puts ("paccept(0) set close-on-exec-flag"); return 1; } close (s2); pthread_barrier_wait (&b); s2 = paccept (s, NULL, 0, NULL, SOCK_CLOEXEC); if (s2 < 0) { puts ("paccept(SOCK_CLOEXEC) failed"); return 1; } coe = fcntl (s2, F_GETFD); if ((coe & FD_CLOEXEC) == 0) { puts ("paccept(SOCK_CLOEXEC) does not set close-on-exec flag"); return 1; } close (s2); pthread_barrier_wait (&b); struct sigaction sa; sa.sa_handler = handler; sa.sa_flags = 0; sigemptyset (&sa.sa_mask); sigaction (SIGUSR1, &sa, NULL); sigset_t ss; pthread_sigmask (SIG_SETMASK, NULL, &ss); sigaddset (&ss, SIGUSR1); pthread_sigmask (SIG_SETMASK, &ss, NULL); sigdelset (&ss, SIGUSR1); alarm (4); pthread_barrier_wait (&b); errno = 0 ; s2 = paccept (s, NULL, 0, &ss, 0); if (s2 != -1 \|\| errno != EINTR) { puts ("paccept did not fail with EINTR"); return 1; } close (s); puts ("OK"); return 0; } ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ [akpm@linux-foundation.org: make it compile] [akpm@linux-foundation.org: add sys_ni stub] Signed-off-by: Ulrich Drepper <drepper@redhat.com> Acked-by: Davide Libenzi <davidel@xmailserver.org> Cc: Michael Kerrisk <mtk.manpages@googlemail.com> Cc: <linux-arch@vger.kernel.org> Cc: "David S. Miller" <davem@davemloft.net> Cc: Roland McGrath <roland@redhat.com> Cc: Kyle McMartin <kyle@mcmartin.ca> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-24 10:47:27 -07:00
Ulrich Drepper	a677a039be	flag parameters: socket and socketpair This patch adds support for flag values which are ORed to the type passwd to socket and socketpair. The additional code is minimal. The flag values in this implementation can and must match the O_* flags. This avoids overhead in the conversion. The internal functions sock_alloc_fd and sock_map_fd get a new parameters and all callers are changed. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ #include <fcntl.h> #include <stdio.h> #include <unistd.h> #include <netinet/in.h> #include <sys/socket.h> #define PORT 57392 /* For Linux these must be the same. */ #define SOCK_CLOEXEC O_CLOEXEC int main (void) { int fd; fd = socket (PF_INET, SOCK_STREAM, 0); if (fd == -1) { puts ("socket(0) failed"); return 1; } int coe = fcntl (fd, F_GETFD); if (coe == -1) { puts ("fcntl failed"); return 1; } if (coe & FD_CLOEXEC) { puts ("socket(0) set close-on-exec flag"); return 1; } close (fd); fd = socket (PF_INET, SOCK_STREAM\|SOCK_CLOEXEC, 0); if (fd == -1) { puts ("socket(SOCK_CLOEXEC) failed"); return 1; } coe = fcntl (fd, F_GETFD); if (coe == -1) { puts ("fcntl failed"); return 1; } if ((coe & FD_CLOEXEC) == 0) { puts ("socket(SOCK_CLOEXEC) does not set close-on-exec flag"); return 1; } close (fd); int fds[2]; if (socketpair (PF_UNIX, SOCK_STREAM, 0, fds) == -1) { puts ("socketpair(0) failed"); return 1; } for (int i = 0; i < 2; ++i) { coe = fcntl (fds[i], F_GETFD); if (coe == -1) { puts ("fcntl failed"); return 1; } if (coe & FD_CLOEXEC) { printf ("socketpair(0) set close-on-exec flag for fds[%d]\n", i); return 1; } close (fds[i]); } if (socketpair (PF_UNIX, SOCK_STREAM\|SOCK_CLOEXEC, 0, fds) == -1) { puts ("socketpair(SOCK_CLOEXEC) failed"); return 1; } for (int i = 0; i < 2; ++i) { coe = fcntl (fds[i], F_GETFD); if (coe == -1) { puts ("fcntl failed"); return 1; } if ((coe & FD_CLOEXEC) == 0) { printf ("socketpair(SOCK_CLOEXEC) does not set close-on-exec flag for fds[%d]\n", i); return 1; } close (fds[i]); } puts ("OK"); return 0; } ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Signed-off-by: Ulrich Drepper <drepper@redhat.com> Acked-by: Davide Libenzi <davidel@xmailserver.org> Cc: Michael Kerrisk <mtk.manpages@googlemail.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Ralf Baechle <ralf@linux-mips.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-24 10:47:27 -07:00
Jarek Poplawski	f867e6af94	pkt_sched: sch_sfq: dump a real number of flows Dump the "flows" number according to the number of active flows instead of repeating the "limit". Reported-by: Denys Fedoryshchenko <denys@visp.net.lb> Signed-off-by: Jarek Poplawski <jarkao2@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-23 21:34:27 -07:00
Linus Torvalds	26dcce0fab	Merge branch 'cpus4096-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'cpus4096-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (31 commits) NR_CPUS: Replace NR_CPUS in speedstep-centrino.c cpumask: Provide a generic set of CPUMASK_ALLOC macros, FIXUP NR_CPUS: Replace NR_CPUS in cpufreq userspace routines NR_CPUS: Replace per_cpu(..., smp_processor_id()) with __get_cpu_var NR_CPUS: Replace NR_CPUS in arch/x86/kernel/genapic_flat_64.c NR_CPUS: Replace NR_CPUS in arch/x86/kernel/genx2apic_uv_x.c NR_CPUS: Replace NR_CPUS in arch/x86/kernel/cpu/proc.c NR_CPUS: Replace NR_CPUS in arch/x86/kernel/cpu/mcheck/mce_64.c cpumask: Optimize cpumask_of_cpu in lib/smp_processor_id.c, fix cpumask: Use optimized CPUMASK_ALLOC macros in the centrino_target cpumask: Provide a generic set of CPUMASK_ALLOC macros cpumask: Optimize cpumask_of_cpu in lib/smp_processor_id.c cpumask: Optimize cpumask_of_cpu in kernel/time/tick-common.c cpumask: Optimize cpumask_of_cpu in drivers/misc/sgi-xp/xpc_main.c cpumask: Optimize cpumask_of_cpu in arch/x86/kernel/ldt.c cpumask: Optimize cpumask_of_cpu in arch/x86/kernel/io_apic_64.c cpumask: Replace cpumask_of_cpu with cpumask_of_cpu_ptr Revert "cpumask: introduce new APIs" cpumask: make for_each_cpu_mask a bit smaller net: Pass reference to cpumask variable in net/sunrpc/svc.c ... Fix up trivial conflicts in drivers/cpufreq/cpufreq.c manually	2008-07-23 18:37:44 -07:00
Patrick McHardy	70eed75d76	netfilter: make security table depend on NETFILTER_ADVANCED Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-23 16:42:42 -07:00
David S. Miller	4b53fb67e3	tcp: Clear probes_out more aggressively in tcp_ack(). This is based upon an excellent bug report from Eric Dumazet. tcp_ack() should clear ->icsk_probes_out even if there are packets outstanding. Otherwise if we get a sequence of ACKs while we do have packets outstanding over and over again, we'll never clear the probes_out value and eventually think the connection is too sick and we'll reset it. This appears to be some "optimization" added to tcp_ack() in the 2.4.x timeframe. In 2.2.x, probes_out is pretty much always cleared by tcp_ack(). Here is Eric's original report: ---------------------------------------- Apparently, we can in some situations reset TCP connections in a couple of seconds when some frames are lost. In order to reproduce the problem, please try the following program on linux-2.6.25.* Setup some iptables rules to allow two frames per second sent on loopback interface to tcp destination port 12000 iptables -N SLOWLO iptables -A SLOWLO -m hashlimit --hashlimit 2 --hashlimit-burst 1 --hashlimit-mode dstip --hashlimit-name slow2 -j ACCEPT iptables -A SLOWLO -j DROP iptables -A OUTPUT -o lo -p tcp --dport 12000 -j SLOWLO Then run the attached program and see the output : # ./loop State Recv-Q Send-Q Local Address:Port Peer Address:Port ESTAB 0 40 127.0.0.1:54455 127.0.0.1:12000 timer:(persist,200ms,1) State Recv-Q Send-Q Local Address:Port Peer Address:Port ESTAB 0 40 127.0.0.1:54455 127.0.0.1:12000 timer:(persist,200ms,3) State Recv-Q Send-Q Local Address:Port Peer Address:Port ESTAB 0 40 127.0.0.1:54455 127.0.0.1:12000 timer:(persist,200ms,5) State Recv-Q Send-Q Local Address:Port Peer Address:Port ESTAB 0 40 127.0.0.1:54455 127.0.0.1:12000 timer:(persist,200ms,7) State Recv-Q Send-Q Local Address:Port Peer Address:Port ESTAB 0 40 127.0.0.1:54455 127.0.0.1:12000 timer:(persist,200ms,9) State Recv-Q Send-Q Local Address:Port Peer Address:Port ESTAB 0 40 127.0.0.1:54455 127.0.0.1:12000 timer:(persist,200ms,11) State Recv-Q Send-Q Local Address:Port Peer Address:Port ESTAB 0 40 127.0.0.1:54455 127.0.0.1:12000 timer:(persist,201ms,13) State Recv-Q Send-Q Local Address:Port Peer Address:Port ESTAB 0 40 127.0.0.1:54455 127.0.0.1:12000 timer:(persist,188ms,15) write(): Connection timed out wrote 890 bytes but was interrupted after 9 seconds ESTAB 0 0 127.0.0.1:12000 127.0.0.1:54455 Exiting read() because no data available (4000 ms timeout). read 860 bytes While this tcp session makes progress (sending frames with 50 bytes of payload, every 500ms), linux tcp stack decides to reset it, when tcp_retries 2 is reached (default value : 15) tcpdump : 15:30:28.856695 IP 127.0.0.1.56554 > 127.0.0.1.12000: S 33788768:33788768(0) win 32792 <mss 16396,nop,nop,sackOK,nop,wscale 7> 15:30:28.856711 IP 127.0.0.1.12000 > 127.0.0.1.56554: S 33899253:33899253(0) ack 33788769 win 32792 <mss 16396,nop,nop,sackOK,nop,wscale 7> 15:30:29.356947 IP 127.0.0.1.56554 > 127.0.0.1.12000: P 1:61(60) ack 1 win 257 15:30:29.356966 IP 127.0.0.1.12000 > 127.0.0.1.56554: . ack 61 win 257 15:30:29.866415 IP 127.0.0.1.56554 > 127.0.0.1.12000: P 61:111(50) ack 1 win 257 15:30:29.866427 IP 127.0.0.1.12000 > 127.0.0.1.56554: . ack 111 win 257 15:30:30.366516 IP 127.0.0.1.56554 > 127.0.0.1.12000: P 111:161(50) ack 1 win 257 15:30:30.366527 IP 127.0.0.1.12000 > 127.0.0.1.56554: . ack 161 win 257 15:30:30.876196 IP 127.0.0.1.56554 > 127.0.0.1.12000: P 161:211(50) ack 1 win 257 15:30:30.876207 IP 127.0.0.1.12000 > 127.0.0.1.56554: . ack 211 win 257 15:30:31.376282 IP 127.0.0.1.56554 > 127.0.0.1.12000: P 211:261(50) ack 1 win 257 15:30:31.376290 IP 127.0.0.1.12000 > 127.0.0.1.56554: . ack 261 win 257 15:30:31.885619 IP 127.0.0.1.56554 > 127.0.0.1.12000: P 261:311(50) ack 1 win 257 15:30:31.885631 IP 127.0.0.1.12000 > 127.0.0.1.56554: . ack 311 win 257 15:30:32.385705 IP 127.0.0.1.56554 > 127.0.0.1.12000: P 311:361(50) ack 1 win 257 15:30:32.385715 IP 127.0.0.1.12000 > 127.0.0.1.56554: . ack 361 win 257 15:30:32.895249 IP 127.0.0.1.56554 > 127.0.0.1.12000: P 361:411(50) ack 1 win 257 15:30:32.895266 IP 127.0.0.1.12000 > 127.0.0.1.56554: . ack 411 win 257 15:30:33.395341 IP 127.0.0.1.56554 > 127.0.0.1.12000: P 411:461(50) ack 1 win 257 15:30:33.395351 IP 127.0.0.1.12000 > 127.0.0.1.56554: . ack 461 win 257 15:30:33.918085 IP 127.0.0.1.56554 > 127.0.0.1.12000: P 461:511(50) ack 1 win 257 15:30:33.918096 IP 127.0.0.1.12000 > 127.0.0.1.56554: . ack 511 win 257 15:30:34.418163 IP 127.0.0.1.56554 > 127.0.0.1.12000: P 511:561(50) ack 1 win 257 15:30:34.418172 IP 127.0.0.1.12000 > 127.0.0.1.56554: . ack 561 win 257 15:30:34.927685 IP 127.0.0.1.56554 > 127.0.0.1.12000: P 561:611(50) ack 1 win 257 15:30:34.927698 IP 127.0.0.1.12000 > 127.0.0.1.56554: . ack 611 win 257 15:30:35.427757 IP 127.0.0.1.56554 > 127.0.0.1.12000: P 611:661(50) ack 1 win 257 15:30:35.427766 IP 127.0.0.1.12000 > 127.0.0.1.56554: . ack 661 win 257 15:30:35.937359 IP 127.0.0.1.56554 > 127.0.0.1.12000: P 661:711(50) ack 1 win 257 15:30:35.937376 IP 127.0.0.1.12000 > 127.0.0.1.56554: . ack 711 win 257 15:30:36.437451 IP 127.0.0.1.56554 > 127.0.0.1.12000: P 711:761(50) ack 1 win 257 15:30:36.437464 IP 127.0.0.1.12000 > 127.0.0.1.56554: . ack 761 win 257 15:30:36.947022 IP 127.0.0.1.56554 > 127.0.0.1.12000: P 761:811(50) ack 1 win 257 15:30:36.947039 IP 127.0.0.1.12000 > 127.0.0.1.56554: . ack 811 win 257 15:30:37.447135 IP 127.0.0.1.56554 > 127.0.0.1.12000: P 811:861(50) ack 1 win 257 15:30:37.447203 IP 127.0.0.1.12000 > 127.0.0.1.56554: . ack 861 win 257 15:30:41.448171 IP 127.0.0.1.12000 > 127.0.0.1.56554: F 1:1(0) ack 861 win 257 15:30:41.448189 IP 127.0.0.1.56554 > 127.0.0.1.12000: R 33789629:33789629(0) win 0 Source of program : /* * small producer/consumer program. * setup a listener on 127.0.0.1:12000 * Forks a child * child connect to 127.0.0.1, and sends 10 bytes on this tcp socket every 100 ms * Father accepts connection, and read all data / #include <sys/types.h> #include <sys/socket.h> #include <netinet/in.h> #include <unistd.h> #include <stdio.h> #include <time.h> #include <sys/poll.h> int port = 12000; char buffer[4096]; int main(int argc, char argv[]) { int lfd = socket(AF_INET, SOCK_STREAM, 0); struct sockaddr_in socket_address; time_t t0, t1; int on = 1, sfd, res; unsigned long total = 0; socklen_t alen = sizeof(socket_address); pid_t pid; time(&t0); socket_address.sin_family = AF_INET; socket_address.sin_port = htons(port); socket_address.sin_addr.s_addr = htonl(INADDR_LOOPBACK); if (lfd == -1) { perror("socket()"); return 1; } setsockopt(lfd, SOL_SOCKET, SO_REUSEADDR, &on, sizeof(int)); if (bind(lfd, (struct sockaddr )&socket_address, sizeof(socket_address)) == -1) { perror("bind"); close(lfd); return 1; } if (listen(lfd, 1) == -1) { perror("listen()"); close(lfd); return 1; } pid = fork(); if (pid == 0) { int i, cfd = socket(AF_INET, SOCK_STREAM, 0); close(lfd); if (connect(cfd, (struct sockaddr )&socket_address, sizeof(socket_address)) == -1) { perror("connect()"); return 1; } for (i = 0 ; ;) { res = write(cfd, "blablabla\n", 10); if (res > 0) total += res; else if (res == -1) { perror("write()"); break; } else break; usleep(100000); if (++i == 10) { system("ss -on dst 127.0.0.1:12000"); i = 0; } } time(&t1); fprintf(stderr, "wrote %lu bytes but was interrupted after %g seconds\n", total, difftime(t1, t0)); system("ss -on \| grep 127.0.0.1:12000"); close(cfd); return 0; } sfd = accept(lfd, (struct sockaddr *)&socket_address, &alen); if (sfd == -1) { perror("accept"); return 1; } close(lfd); while (1) { struct pollfd pfd[1]; pfd[0].fd = sfd; pfd[0].events = POLLIN; if (poll(pfd, 1, 4000) == 0) { fprintf(stderr, "Exiting read() because no data available (4000 ms timeout).\n"); break; } res = read(sfd, buffer, sizeof(buffer)); if (res > 0) total += res; else if (res == 0) break; else perror("read()"); } fprintf(stderr, "read %lu bytes\n", total); close(sfd); return 0; } ---------------------------------------- Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-23 16:38:45 -07:00
Oliver Hartkopp	b4942af650	net: Update entry in af_family_clock_key_strings In the merge phase of the CAN subsystem the af_family_clock_key_strings[] have been added to sock.c in commit `443aef0edd` (lockdep: fixup sk_callback_lock annotation). This trivial patch adds the missing name for address family 29 (AF_CAN). Signed-off-by: Oliver Hartkopp <oliver@hartkopp.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-23 14:06:04 -07:00
David S. Miller	5b3ab1dbd4	netdev: Remove warning from __netif_schedule(). It isn't helping anything and we aren't going to be able to change all the drivers that do queue wakeups in strange situations. Just letting a noop_qdisc get scheduled will work because when qdisc_run() executes via net_tx_work() it will simply find no packets pending when it makes the ->dequeue() call in qdisc_restart. Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-23 14:01:29 -07:00
Krzysztof Hałasa	a8817d2f6d	wanmain.c doesn't need syncppp.h Signed-off-by: Krzysztof Hałasa <khc@pm.waw.pl>	2008-07-23 23:00:36 +02:00
Krzysztof Hałasa	a24e202e3f	Remove dead code from wanmain.c, CONFIG_WANPIPE_MULTPPP doesn't exist Signed-off-by: Krzysztof Hałasa <khc@pm.waw.pl>	2008-07-23 23:00:36 +02:00
Linus Torvalds	5554b35933	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/djbw/async_tx * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/djbw/async_tx: (24 commits) I/OAT: I/OAT version 3.0 support I/OAT: tcp_dma_copybreak default value dependent on I/OAT version I/OAT: Add watchdog/reset functionality to ioatdma iop_adma: cleanup iop_chan_xor_slot_count iop_adma: document how to calculate the minimum descriptor pool size iop_adma: directly reclaim descriptors on allocation failure async_tx: make async_tx_test_ack a boolean routine async_tx: remove depend_tx from async_tx_sync_epilog async_tx: export async_tx_quiesce async_tx: fix handling of the "out of descriptor" condition in async_xor async_tx: ensure the xor destination buffer remains dma-mapped async_tx: list_for_each_entry_rcu() cleanup dmaengine: Driver for the Synopsys DesignWare DMA controller dmaengine: Add slave DMA interface dmaengine: add DMA_COMPL_SKIP_{SRC,DEST}_UNMAP flags to control dma unmap dmaengine: Add dma_client parameter to device_alloc_chan_resources dmatest: Simple DMA memcpy test client dmaengine: DMA engine driver for Marvell XOR engine iop-adma: fix platform driver hotplug/coldplug dmaengine: track the number of clients using a channel ... Fixed up conflict in drivers/dca/dca-sysfs.c manually	2008-07-23 12:03:18 -07:00
Linus Torvalds	c010b2f76c	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6 * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (82 commits) ipw2200: Call netif_*_queue() interfaces properly. netxen: Needs to include linux/vmalloc.h [netdrvr] atl1d: fix !CONFIG_PM build r6040: rework init_one error handling r6040: bump release number to 0.18 r6040: handle RX fifo full and no descriptor interrupts r6040: change the default waiting time r6040: use definitions for magic values in descriptor status r6040: completely rework the RX path r6040: call napi_disable when puting down the interface and set lp->dev accordingly. mv643xx_eth: fix NETPOLL build r6040: rework the RX buffers allocation routine r6040: fix scheduling while atomic in r6040_tx_timeout r6040: fix null pointer access and tx timeouts r6040: prefix all functions with r6040 rndis_host: support WM6 devices as modems at91_ether: use netstats in net_device structure sfc: Create one RX queue and interrupt per CPU package by default sfc: Use a separate workqueue for resets sfc: I2C adapter initialisation fixes ...	2008-07-22 19:09:51 -07:00
Maciej Sosnowski	16a37acaaf	I/OAT: tcp_dma_copybreak default value dependent on I/OAT version I/OAT DMA performance tuning showed different optimal values of tcp_dma_copybreak for different I/OAT versions (4096 for 1.2 and 2048 for 2.0). This patch lets ioatdma driver set tcp_dma_copybreak value according to these results. [dan.j.williams@intel.com: remove some ifdefs] Signed-off-by: Maciej Sosnowski <maciej.sosnowski@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2008-07-22 17:30:57 -07:00
Stephen Hemminger	3d0f24a74e	ipv6: icmp6_dst_gc return change Change icmp6_dst_gc to return the one value the caller cares about rather than using call by reference. Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-22 14:35:50 -07:00
Stephen Hemminger	75307c0fe7	ipv6: use kcalloc Th fib_table_hash is an array, so use kcalloc. Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-22 14:35:07 -07:00
Stephen Hemminger	a76d7345a3	ipv6: use spin_trylock_bh Now there is spin_trylock_bh, use it rather than open coding. Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-22 14:34:35 -07:00
Stephen Hemminger	c8a4522245	ipv6: use round_jiffies This timer normally happens once a minute, there is no need to cause an early wakeup for it, so align it to next second boundary to safe power. It can't be deferred because then it could take too long on cleanup or DoS. Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-22 14:34:09 -07:00
Stephen Hemminger	417f28bb34	netns: dont alloc ipv6 fib timer list FIB timer list is a trivial size structure, avoid indirection and just put it in existing ns. Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-22 14:33:45 -07:00
Adrian Bunk	888c848ed3	ipv6: make struct ipv6_devconf static struct ipv6_devconf can now become static. Signed-off-by: Adrian Bunk <bunk@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-22 14:21:58 -07:00
Adrian Bunk	4d6971e909	sctp: remove sctp_assoc_proc_exit() Commit `20c2c1fd6c` (sctp: add sctp/remaddr table to complete RFC remote address table OID) added an unused sctp_assoc_proc_exit() function that seems to have been unintentionally created when copying the assocs code. Signed-off-by: Adrian Bunk <bunk@kernel.org> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-22 14:21:30 -07:00
Adrian Bunk	abd0b198ea	sctp: make sctp_outq_flush() static sctp_outq_flush() can now become static. Signed-off-by: Adrian Bunk <bunk@kernel.org> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-22 14:20:45 -07:00
Adrian Bunk	a94f779f9d	pkt_sched: make qdisc_class_hash_alloc() static This patch makes the needlessly global qdisc_class_hash_alloc() static. Signed-off-by: Adrian Bunk <bunk@kernel.org> Acked-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-22 14:20:11 -07:00
David S. Miller	cf508b1211	netdev: Handle ->addr_list_lock just like ->_xmit_lock for lockdep. The new address list lock needs to handle the same device layering issues that the _xmit_lock one does. This integrates work done by Patrick McHardy. Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-22 14:16:42 -07:00
Dave Jones	d29f749e25	net: Fix build failure with 'make mandocs'. The function header comments have to go with the functions they are documenting, or things go horribly wrong when we try to process them with the docbook tools. Warning(include/linux/netdevice.h:1006): No description found for parameter 'dev_queue' Warning(include/linux/netdevice.h:1033): No description found for parameter 'dev_queue' Warning(include/linux/netdevice.h:1067): No description found for parameter 'dev_queue' Warning(include/linux/netdevice.h:1093): No description found for parameter 'dev_queue' Warning(include/linux/netdevice.h:1474): No description found for parameter 'txq' Error(net/core/dev.c:1674): cannot understand prototype: 'u32 simple_tx_hashrnd; ' Signed-off-by: Dave Jones <davej@redhat.com> Acked-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-22 14:09:06 -07:00
Greg Kroah-Hartman	16be63fd16	bluetooth: remove improper bluetooth class symlinks. Don't create symlinks in a class to a device that is not owned by the class. If the bluetooth subsystem really wants to point to all of the devices it controls, it needs to create real devices, not fake symlinks. Cc: Maxim Krasnyansky <maxk@qualcomm.com> Cc: Kay Sievers <kay.sievers@vrfy.org> Acked-by: Marcel Holtmann <marcel@holtmann.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>	2008-07-21 21:54:51 -07:00
David S. Miller	b32d13102d	tcp: Fix bitmask test in tcp_syn_options() As reported by Alexey Dobriyan: CHECK net/ipv4/tcp_output.c net/ipv4/tcp_output.c:475:7: warning: dubious: !x & y And sparse is damn right! if (unlikely(!OPTION_TS & opts->options)) ^^^ size += TCPOLEN_SACKPERM_ALIGNED; OPTION_TS is (1 << 1), so condition will never trigger. Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-21 18:45:34 -07:00
Gerrit Renker	47112e25da	udplite: Protection against coverage value wrap-around This patch clamps the cscov setsockopt values to a maximum of 0xFFFF. Setsockopt values greater than 0xffff can cause an unwanted wrap-around. Further, IPv6 jumbograms are not supported (RFC 3838, 3.5), so that values greater than 0xffff are not even useful. Further changes: fixed a typo in the documentation. Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-21 13:35:08 -07:00
Arjan van de Ven	6579e57b31	net: Print the module name as part of the watchdog message As suggested by Dave: This patch adds a function to get the driver name from a struct net_device, and consequently uses this in the watchdog timeout handler to print as part of the message. Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-21 13:31:48 -07:00
Stephen Hemminger	7943986ca1	net: use kcalloc in netdev_queue alloc Minor nit, use size_t for allocation size and kcalloc to allocate an array. Probably makes no actual code difference. Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-21 13:28:44 -07:00
Stephen Hemminger	847499ce71	ipv6: use timer pending This fixes the bridge reference count problem and cleanups ipv6 FIB timer management. Don't use expires field, because it is not a proper way to test, instead use timer_pending(). Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-21 13:21:35 -07:00
Patrick McHardy	5547cd0ae8	netfilter: nf_conntrack_sctp: fix sparse warnings Introduced by `a258860e` (netfilter: ctnetlink: add full support for SCTP to ctnetlink): net/netfilter/nf_conntrack_proto_sctp.c:483:2: warning: cast from restricted type net/netfilter/nf_conntrack_proto_sctp.c:483:2: warning: incorrect type in argument 1 (different base types) net/netfilter/nf_conntrack_proto_sctp.c:483:2: expected unsigned int [unsigned] [usertype] x net/netfilter/nf_conntrack_proto_sctp.c:483:2: got restricted unsigned int const <noident> net/netfilter/nf_conntrack_proto_sctp.c:483:2: warning: cast from restricted type net/netfilter/nf_conntrack_proto_sctp.c:483:2: warning: cast from restricted type net/netfilter/nf_conntrack_proto_sctp.c:483:2: warning: cast from restricted type net/netfilter/nf_conntrack_proto_sctp.c:483:2: warning: cast from restricted type net/netfilter/nf_conntrack_proto_sctp.c:487:2: warning: cast from restricted type net/netfilter/nf_conntrack_proto_sctp.c:487:2: warning: incorrect type in argument 1 (different base types) net/netfilter/nf_conntrack_proto_sctp.c:487:2: expected unsigned int [unsigned] [usertype] x net/netfilter/nf_conntrack_proto_sctp.c:487:2: got restricted unsigned int const <noident> net/netfilter/nf_conntrack_proto_sctp.c:487:2: warning: cast from restricted type net/netfilter/nf_conntrack_proto_sctp.c:487:2: warning: cast from restricted type net/netfilter/nf_conntrack_proto_sctp.c:487:2: warning: cast from restricted type net/netfilter/nf_conntrack_proto_sctp.c:487:2: warning: cast from restricted type net/netfilter/nf_conntrack_proto_sctp.c:532:42: warning: incorrect type in assignment (different base types) net/netfilter/nf_conntrack_proto_sctp.c:532:42: expected restricted unsigned int <noident> net/netfilter/nf_conntrack_proto_sctp.c:532:42: got unsigned int net/netfilter/nf_conntrack_proto_sctp.c:534:39: warning: incorrect type in assignment (different base types) net/netfilter/nf_conntrack_proto_sctp.c:534:39: expected restricted unsigned int <noident> net/netfilter/nf_conntrack_proto_sctp.c:534:39: got unsigned int Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-21 10:11:02 -07:00
Herbert Xu	c71529e42c	netfilter: nf_nat_sip: c= is optional for session According to RFC2327, the connection information is optional in the session description since it can be specified in the media description instead. My provider does exactly that and does not provide any connection information in the session description. As a result the new kernel drops all invite responses. This patch makes it optional as documented. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-21 10:11:02 -07:00
Jan Engelhardt	db1a75bdcc	netfilter: xt_TCPMSS: collapse tcpmss_reverse_mtu{4,6} into one function Signed-off-by: Jan Engelhardt <jengelh@medozas.de> Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-21 10:11:01 -07:00
Eric Leblond	72961ecf84	netfilter: nfnetlink_log: send complete hardware header This patch adds some fields to NFLOG to be able to send the complete hardware header with all necessary informations. It sends to userspace: * the type of hardware link * the lenght of hardware header * the hardware header Signed-off-by: Eric Leblond <eric@inl.fr> Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-21 10:11:00 -07:00
David Howells	280763c053	netfilter: xt_time: fix time's time_mt()'s use of do_div() Fix netfilter xt_time's time_mt()'s use of do_div() on an s64 by using div_s64() instead. This was introduced by patch `ee4411a1b1` ("[NETFILTER]: x_tables: add xt_time match"). Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-21 10:10:59 -07:00
Krzysztof Piotr Oledzki	584015727a	netfilter: accounting rework: ct_extend + 64bit counters (v4) Initially netfilter has had 64bit counters for conntrack-based accounting, but it was changed in 2.6.14 to save memory. Unfortunately in-kernel 64bit counters are still required, for example for "connbytes" extension. However, 64bit counters waste a lot of memory and it was not possible to enable/disable it runtime. This patch: - reimplements accounting with respect to the extension infrastructure, - makes one global version of seq_print_acct() instead of two seq_print_counters(), - makes it possible to enable it at boot time (for CONFIG_SYSCTL/CONFIG_SYSFS=n), - makes it possible to enable/disable it at runtime by sysctl or sysfs, - extends counters from 32bit to 64bit, - renames ip_conntrack_counter -> nf_conn_counter, - enables accounting code unconditionally (no longer depends on CONFIG_NF_CT_ACCT), - set initial accounting enable state based on CONFIG_NF_CT_ACCT - removes buggy IPCT_COUNTER_FILLING event handling. If accounting is enabled newly created connections get additional acct extend. Old connections are not changed as it is not possible to add a ct_extend area to confirmed conntrack. Accounting is performed for all connections with acct extend regardless of a current state of "net.netfilter.nf_conntrack_acct". Signed-off-by: Krzysztof Piotr Oledzki <ole@ans.pl> Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-21 10:10:58 -07:00
Changli Gao	0dbff689c2	netfilter: nf_nat_core: eliminate useless find_appropriate_src for IP_NAT_RANGE_PROTO_RANDOM Signed-off-by: Changli Gao <xiaosuo@gmail.com> Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-21 10:10:57 -07:00
David S. Miller	d3678b463d	Revert "pkt_sched: Make default qdisc nonshared-multiqueue safe." This reverts commit `a0c80b80e0`. After discussions with Jamal and Herbert on netdev, we should provide at least minimal prioritization at the qdisc level even in multiqueue situations. Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-21 10:10:50 -07:00
Linus Torvalds	867d79fb9a	net: In __netif_schedule() use WARN_ON instead of BUG_ON Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-21 10:10:49 -07:00
David S. Miller	b6b2fed1f4	net: Improve simple_tx_hash(). Based upon feedback from Eric Dumazet and Andi Kleen. Cure several deficiencies in simple_tx_hash() by using jhash + reciprocol multiply. 1) Eliminates expensive modulus operation. 2) Makes hash less attackable by using random seed. 3) Eliminates endianness hash distribution issues. Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-21 10:10:48 -07:00
Daniel Lezcano	c3ee84163e	pkt_sched: Remove unused variable skb in dev_deactivate_queue function. Removed unused variable 'skb' in the dev_deactivate_queue function Signed-off-by: Daniel Lezcano <dlezcano@fr.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-21 09:18:07 -07:00
Ingo Molnar	eb6a12c242	Merge branch 'linus' into cpus4096-for-linus Conflicts: net/sunrpc/svc.c Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-07-21 17:19:50 +02:00
Linus Torvalds	14b395e35d	Merge branch 'for-2.6.27' of git://linux-nfs.org/~bfields/linux * 'for-2.6.27' of git://linux-nfs.org/~bfields/linux: (51 commits) nfsd: nfs4xdr.c do-while is not a compound statement nfsd: Use C99 initializers in fs/nfsd/nfs4xdr.c lockd: Pass "struct sockaddr *" to new failover-by-IP function lockd: get host reference in nlmsvc_create_block() instead of callers lockd: minor svclock.c style fixes lockd: eliminate duplicate nlmsvc_lookup_host call from nlmsvc_lock lockd: eliminate duplicate nlmsvc_lookup_host call from nlmsvc_testlock lockd: nlm_release_host() checks for NULL, caller needn't file lock: reorder struct file_lock to save space on 64 bit builds nfsd: take file and mnt write in nfs4_upgrade_open nfsd: document open share bit tracking nfsd: tabulate nfs4 xdr encoding functions nfsd: dprint operation names svcrdma: Change WR context get/put to use the kmem cache svcrdma: Create a kmem cache for the WR contexts svcrdma: Add flush_scheduled_work to module exit function svcrdma: Limit ORD based on client's advertised IRD svcrdma: Remove unused wait q from svcrdma_xprt structure svcrdma: Remove unneeded spin locks from __svc_rdma_free svcrdma: Add dma map count and WARN_ON ...	2008-07-20 21:21:46 -07:00
David Miller	702beb87d6	ipv6: Fix warning in addrconf code. Reported by Linus. Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-20 21:18:26 -07:00
Linus Torvalds	d1671a9c15	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6 * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: pkt_sched: Fix build with NET_SCHED disabled.	2008-07-20 21:17:20 -07:00
David S. Miller	3a682fbd73	pkt_sched: Fix build with NET_SCHED disabled. The stab bits can't be referenced uniless the full packet scheduler layer is enabled. Reported by Stephen Rothwell. Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-20 18:13:01 -07:00
Linus Torvalds	db6d8c7a40	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6 * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (1232 commits) iucv: Fix bad merging. net_sched: Add size table for qdiscs net_sched: Add accessor function for packet length for qdiscs net_sched: Add qdisc_enqueue wrapper highmem: Export totalhigh_pages. ipv6 mcast: Omit redundant address family checks in ip6_mc_source(). net: Use standard structures for generic socket address structures. ipv6 netns: Make several "global" sysctl variables namespace aware. netns: Use net_eq() to compare net-namespaces for optimization. ipv6: remove unused macros from net/ipv6.h ipv6: remove unused parameter from ip6_ra_control tcp: fix kernel panic with listening_get_next tcp: Remove redundant checks when setting eff_sacks tcp: options clean up tcp: Fix MD5 signatures for non-linear skbs sctp: Update sctp global memory limit allocations. sctp: remove unnecessary byteshifting, calculate directly in big-endian sctp: Allow only 1 listening socket with SO_REUSEADDR sctp: Do not leak memory on multiple listen() calls sctp: Support ipv6only AF_INET6 sockets. ...	2008-07-20 17:43:29 -07:00
Alan Cox	a352def21a	tty: Ldisc revamp Move the line disciplines towards a conventional ->ops arrangement. For the moment the actual 'tty_ldisc' struct in the tty is kept as part of the tty struct but this can then be changed if it turns out that when it all settles down we want to refcount ldiscs separately to the tty. Pull the ldisc code out of /proc and put it with our ldisc code. Signed-off-by: Alan Cox <alan@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-20 17:12:34 -07:00
David S. Miller	fb65a7c091	iucv: Fix bad merging. Noticed by Stephen Rothwell. Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-20 10:18:44 -07:00
Jussi Kivilinna	175f9c1bba	net_sched: Add size table for qdiscs Add size table functions for qdiscs and calculate packet size in qdisc_enqueue(). Based on patch by Patrick McHardy http://marc.info/?l=linux-netdev&m=115201979221729&w=2 Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-20 00:08:47 -07:00
Jussi Kivilinna	0abf77e55a	net_sched: Add accessor function for packet length for qdiscs Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-20 00:08:27 -07:00
Jussi Kivilinna	5f86173bdf	net_sched: Add qdisc_enqueue wrapper Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-20 00:08:04 -07:00
YOSHIFUJI Hideaki	a6ffb404dc	ipv6 mcast: Omit redundant address family checks in ip6_mc_source(). The caller has alredy checked for them. Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-19 22:36:07 -07:00
YOSHIFUJI Hideaki	230b183921	net: Use standard structures for generic socket address structures. Use sockaddr_storage{} for generic socket address storage and ensures proper alignment. Use sockaddr{} for pointers to omit several casts. Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-19 22:35:47 -07:00
YOSHIFUJI Hideaki	53b7997fd5	ipv6 netns: Make several "global" sysctl variables namespace aware. Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-19 22:35:03 -07:00
YOSHIFUJI Hideaki	721499e893	netns: Use net_eq() to compare net-namespaces for optimization. Without CONFIG_NET_NS, namespace is always &init_net. Compiler will be able to omit namespace comparisons with this patch. Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-19 22:34:43 -07:00
David S. Miller	407d819cf0	Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/holtmann/bluetooth-2.6	2008-07-19 00:30:39 -07:00
Denis V. Lunev	725a8ff04a	ipv6: remove unused parameter from ip6_ra_control Signed-off-by: Denis V. Lunev <den@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-19 00:28:58 -07:00
Daniel Lezcano	bdccc4ca13	tcp: fix kernel panic with listening_get_next # BUG: unable to handle kernel NULL pointer dereference at 0000000000000038 IP: [<ffffffff821ed01e>] listening_get_next+0x50/0x1b3 PGD 11e4b9067 PUD 11d16c067 PMD 0 Oops: 0000 [1] SMP last sysfs file: /sys/devices/system/cpu/cpu3/cache/index2/shared_cpu_map CPU 3 Modules linked in: bridge ipv6 button battery ac loop dm_mod tg3 ext3 jbd edd fan thermal processor thermal_sys hwmon sg sata_svw libata dock serverworks sd_mod scsi_mod ide_disk ide_core [last unloaded: freq_table] Pid: 3368, comm: slpd Not tainted 2.6.26-rc2-mm1-lxc4 #1 RIP: 0010:[<ffffffff821ed01e>] [<ffffffff821ed01e>] listening_get_next+0x50/0x1b3 RSP: 0018:ffff81011e1fbe18 EFLAGS: 00010246 RAX: 0000000000000000 RBX: ffff8100be0ad3c0 RCX: ffff8100619f50c0 RDX: ffffffff82475be0 RSI: ffff81011d9ae6c0 RDI: ffff8100be0ad508 RBP: ffff81011f4f1240 R08: 00000000ffffffff R09: ffff8101185b6780 R10: 000000000000002d R11: ffffffff820fdbfa R12: ffff8100be0ad3c8 R13: ffff8100be0ad6a0 R14: ffff8100be0ad3c0 R15: ffffffff825b8ce0 FS: 00007f6a0ebd16d0(0000) GS:ffff81011f424540(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 0000000000000038 CR3: 000000011dc20000 CR4: 00000000000006e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process slpd (pid: 3368, threadinfo ffff81011e1fa000, task ffff81011f4b8660) Stack: 00000000000002ee ffff81011f5a57c0 ffff81011f4f1240 ffff81011e1fbe90 0000000000001000 0000000000000000 00007fff16bf2590 ffffffff821ed9c8 ffff81011f5a57c0 ffff81011d9ae6c0 000000000000041a ffffffff820b0abd Call Trace: [<ffffffff821ed9c8>] ? tcp_seq_next+0x34/0x7e [<ffffffff820b0abd>] ? seq_read+0x1aa/0x29d [<ffffffff820d21b4>] ? proc_reg_read+0x73/0x8e [<ffffffff8209769c>] ? vfs_read+0xaa/0x152 [<ffffffff82097a7d>] ? sys_read+0x45/0x6e [<ffffffff8200bd2b>] ? system_call_after_swapgs+0x7b/0x80 Code: 31 a9 25 00 e9 b5 00 00 00 ff 45 20 83 7d 0c 01 75 79 4c 8b 75 10 48 8b 0e eb 1d 48 8b 51 20 0f b7 45 08 39 02 75 0e 48 8b 41 28 <4c> 39 78 38 0f 84 93 00 00 00 48 8b 09 48 85 c9 75 de 8b 55 1c RIP [<ffffffff821ed01e>] listening_get_next+0x50/0x1b3 RSP <ffff81011e1fbe18> CR2: 0000000000000038 This kernel panic appears with CONFIG_NET_NS=y. How to reproduce ? On the buggy host (host A) * ip addr add 1.2.3.4/24 dev eth0 On a remote host (host B) * ip addr add 1.2.3.5/24 dev eth0 * iptables -A INPUT -p tcp -s 1.2.3.4 -j DROP * ssh 1.2.3.4 On host A: * netstat -ta or cat /proc/net/tcp This bug happens when reading /proc/net/tcp[6] when there is a req_sock at the SYN_RECV state. When a SYN is received the minisock is created and the sk field is set to NULL. In the listening_get_next function, we try to look at the field req->sk->sk_net. When looking at how to fix this bug, I noticed that is useless to do the check for the minisock belonging to the namespace. A minisock belongs to a listen point and this one is per namespace, so when browsing the minisock they are always per namespace. Signed-off-by: Daniel Lezcano <dlezcano@fr.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-19 00:15:13 -07:00
Adam Langley	4389dded77	tcp: Remove redundant checks when setting eff_sacks Remove redundant checks when setting eff_sacks and make the number of SACKs a compile time constant. Now that the options code knows how many SACK blocks can fit in the header, we don't need to have the SACK code guessing at it. Signed-off-by: Adam Langley <agl@imperialviolet.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-19 00:07:02 -07:00
Adam Langley	33ad798c92	tcp: options clean up This should fix the following bugs: * Connections with MD5 signatures produce invalid packets whenever SACK options are included * MD5 signatures are counted twice in the MSS calculations Behaviour changes: * A SYN with MD5 + SACK + TS elicits a SYNACK with MD5 + SACK This is because we can't fit any SACK blocks in a packet with MD5 + TS options. There was discussion about disabling SACK rather than TS in order to fit in better with old, buggy kernels, but that was deemed to be unnecessary. * SYNs with MD5 don't include a TS option See above. Additionally, it removes a bunch of duplicated logic for calculating options, which should help avoid these sort of issues in the future. Signed-off-by: Adam Langley <agl@imperialviolet.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-19 00:04:31 -07:00
Adam Langley	49a72dfb88	tcp: Fix MD5 signatures for non-linear skbs Currently, the MD5 code assumes that the SKBs are linear and, in the case that they aren't, happily goes off and hashes off the end of the SKB and into random memory. Reported by Stephen Hemminger in [1]. Advice thanks to Stephen and Evgeniy Polyakov. Also includes a couple of missed route_caps from Stephen's patch in [2]. [1] http://marc.info/?l=linux-netdev&m=121445989106145&w=2 [2] http://marc.info/?l=linux-netdev&m=121459157816964&w=2 Signed-off-by: Adam Langley <agl@imperialviolet.org> Acked-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-19 00:01:42 -07:00
Vlad Yasevich	845525a642	sctp: Update sctp global memory limit allocations. Update sctp global memory limit allocations to be the same as TCP. Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-18 23:08:21 -07:00
Harvey Harrison	336d3262df	sctp: remove unnecessary byteshifting, calculate directly in big-endian Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-18 23:07:09 -07:00
Vlad Yasevich	4e54064e0a	sctp: Allow only 1 listening socket with SO_REUSEADDR When multiple socket bind to the same port with SO_REUSEADDR, only 1 can be listining. Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-18 23:06:32 -07:00
Vlad Yasevich	23b29ed80b	sctp: Do not leak memory on multiple listen() calls SCTP permits multiple listen call and on subsequent calls we leak he memory allocated for the crypto transforms. Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-18 23:06:07 -07:00
Vlad Yasevich	7dab83de50	sctp: Support ipv6only AF_INET6 sockets. Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-18 23:05:40 -07:00
Florian Westphal	6d0ccbac68	sctp: Prevent uninitialized memory access valgrind reports uninizialized memory accesses when running sctp inside the network simulation cradle simulator: Conditional jump or move depends on uninitialised value(s) at 0x570E34A: sctp_assoc_sync_pmtu (associola.c:1324) by 0x57427DA: sctp_packet_transmit (output.c:403) by 0x5710EFF: sctp_outq_flush (outqueue.c:824) by 0x5710B88: sctp_outq_uncork (outqueue.c:701) by 0x5745262: sctp_cmd_interpreter (sm_sideeffect.c:1548) by 0x57444B7: sctp_side_effects (sm_sideeffect.c:976) by 0x5744460: sctp_do_sm (sm_sideeffect.c:945) by 0x572157D: sctp_primitive_ASSOCIATE (primitive.c:94) by 0x5725C04: __sctp_connect (socket.c:1094) by 0x57297DC: sctp_connect (socket.c:3297) Conditional jump or move depends on uninitialised value(s) at 0x575D3A5: mod_timer (timer.c:630) by 0x5752B78: sctp_cmd_hb_timers_start (sm_sideeffect.c:555) by 0x5754133: sctp_cmd_interpreter (sm_sideeffect.c:1448) by 0x5753607: sctp_side_effects (sm_sideeffect.c:976) by 0x57535B0: sctp_do_sm (sm_sideeffect.c:945) by 0x571E9AE: sctp_endpoint_bh_rcv (endpointola.c:474) by 0x573347F: sctp_inq_push (inqueue.c:104) by 0x572EF93: sctp_rcv (input.c:256) by 0x5689623: ip_local_deliver_finish (ip_input.c:230) by 0x5689759: ip_local_deliver (ip_input.c:268) by 0x5689CAC: ip_rcv_finish (dst.h:246) #1 is due to "if (t->pmtu_pending)". `8a4794914f` "[SCTP] Flag a pmtu change request" suggests it should be initialized to 0. #2 is the heartbeat timer 'expires' value, which is uninizialised, but test by mod_timer(). T3_rtx_timer seems to be affected by the same problem, so initialize it, too. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-18 23:04:39 -07:00
Florian Westphal	c4e85f82ed	sctp: Don't abort initialization when CONFIG_PROC_FS=n This puts CONFIG_PROC_FS defines around the proc init/exit functions and also avoids compiling proc.c if procfs is not supported. Also make SCTP_DBG_OBJCNT depend on procfs. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-18 23:03:44 -07:00
Stephen Hemminger	c1e20f7c8b	tcp: RTT metrics scaling Some of the metrics (RTT, RTTVAR and RTAX_RTO_MIN) are stored in kernel units (jiffies) and this leaks out through the netlink API to user space where the units for jiffies are unknown. This patches changes the kernel to convert to/from milliseconds. This changes the ABI, but milliseconds seemed like the most natural unit for these parameters. Values available via syscall in /proc/net/rt_cache and netlink will be in milliseconds. Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-18 23:02:15 -07:00
David S. Miller	30ee42be00	pkt_sched: Fix noqueue_qdisc initialization. Like noop_qdisc, it needs a dummy backpointer and explicit qdisc->q.lock initialization. Based upon a report by Stephen Hemminger. Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-18 23:00:11 -07:00
David S. Miller	3072367300	pkt_sched: Manage qdisc list inside of root qdisc. Idea is from Patrick McHardy. Instead of managing the list of qdiscs on the device level, manage it in the root qdisc of a netdev_queue. This solves all kinds of visibility issues during qdisc destruction. The way to iterate over all qdiscs of a netdev_queue is to visit the netdev_queue->qdisc, and then traverse it's list. The only special case is to ignore builting qdiscs at the root when dumping or doing a qdisc_lookup(). That was not needed previously because builtin qdiscs were not added to the device's qdisc_list. Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-18 22:50:15 -07:00
David S. Miller	72b25a913e	pkt_sched: Get rid of u32_list. The u32_list is just an indirect way of maintaining a reference to a U32 node on a per-qdisc basis. Just add an explicit node pointer for u32 to struct Qdisc an do away with this global list. Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-18 20:54:17 -07:00
Patrick McHardy	8913336a7e	packet: add PACKET_RESERVE sockopt Add new sockopt to reserve some headroom in the mmaped ring frames in front of the packet payload. This can be used f.i. when the VLAN header needs to be (re)constructed to avoid moving the entire payload. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-18 18:05:19 -07:00
Mike Travis	65c0118453	cpumask: Replace cpumask_of_cpu with cpumask_of_cpu_ptr * This patch replaces the dangerous lvalue version of cpumask_of_cpu with new cpumask_of_cpu_ptr macros. These are patterned after the node_to_cpumask_ptr macros. In general terms, if there is a cpumask_of_cpu_map[] then a pointer to the cpumask_of_cpu_map[cpu] entry is used. The cpumask_of_cpu_map is provided when there is a large NR_CPUS count, reducing greatly the amount of code generated and stack space used for cpumask_of_cpu(). The pointer to the cpumask_t value is needed for calling set_cpus_allowed_ptr() to reduce the amount of stack space needed to pass the cpumask_t value. If there isn't a cpumask_of_cpu_map[], then a temporary variable is declared and filled in with value from cpumask_of_cpu(cpu) as well as a pointer variable pointing to this temporary variable. Afterwards, the pointer is used to reference the cpumask value. The compiler will optimize out the extra dereference through the pointer as well as the stack space used for the pointer, resulting in identical code. A good example of the orthogonal usages is in net/sunrpc/svc.c: case SVC_POOL_PERCPU: { unsigned int cpu = m->pool_to[pidx]; cpumask_of_cpu_ptr(cpumask, cpu); oldmask = current->cpus_allowed; set_cpus_allowed_ptr(current, cpumask); return 1; } case SVC_POOL_PERNODE: { unsigned int node = m->pool_to[pidx]; node_to_cpumask_ptr(nodecpumask, node); oldmask = current->cpus_allowed; set_cpus_allowed_ptr(current, nodecpumask); return 1; } Signed-off-by: Mike Travis <travis@sgi.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-07-18 22:02:57 +02:00
Ingo Molnar	bb2c018b09	Merge branch 'linus' into cpus4096 Conflicts: drivers/acpi/processor_throttling.c Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-07-18 22:00:54 +02:00
Pavel Emelyanov	b6fcbdb4f2	proc: consolidate per-net single-release callers They are symmetrical to single_open ones :) Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-18 04:07:44 -07:00
Pavel Emelyanov	de05c557b2	proc: consolidate per-net single_open callers There are already 7 of them - time to kill some duplicate code. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-18 04:07:21 -07:00
Pavel Emelyanov	60bdde9580	proc: clean the ip_misc_proc_init and ip_proc_init_net error paths After all this stuff is moved outside, this function can look better. Besides, I tuned the error path in ip_proc_init_net to make it have only 2 exit points, not 3. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-18 04:06:50 -07:00
Pavel Emelyanov	8e3461d01b	proc: show per-net ip_devconf.forwarding in /proc/net/snmp This one has become per-net long ago, but the appropriate file is per-net only now. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-18 04:06:26 -07:00
Pavel Emelyanov	229bf0cbaa	proc: create /proc/net/snmp file in each net All the statistics shown in this file have been made per-net already. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-18 04:06:04 -07:00
Pavel Emelyanov	7b7a9dfdf6	proc: create /proc/net/netstat file in each net Now all the shown in it statistics is netnsizated, time to show it in appropriate net. The appropriate net init/exit ops already exist - they make the sockstat file per net - so just extend them. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-18 04:05:17 -07:00
Pavel Emelyanov	d89cbbb1e6	ipv4: clean the init_ipv4_mibs error paths After moving all the stuff outside this function it looks a bit ugly - make it look better. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-18 04:04:51 -07:00
Pavel Emelyanov	923c6586b0	mib: put icmpmsg statistics on struct net Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-18 04:04:22 -07:00
Pavel Emelyanov	b60538a0d7	mib: put icmp statistics on struct net Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-18 04:04:02 -07:00
Pavel Emelyanov	386019d351	mib: put udplite statistics on struct net Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-18 04:03:45 -07:00
Pavel Emelyanov	2f275f91a4	mib: put udp statistics on struct net Similar to... ouch, I repeat myself. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-18 04:03:27 -07:00
Pavel Emelyanov	61a7e26028	mib: put net statistics on struct net Similar to ip and tcp ones :) Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-18 04:03:08 -07:00
Pavel Emelyanov	a20f5799ca	mib: put ip statistics on struct net Similar to tcp one. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-18 04:02:42 -07:00
Pavel Emelyanov	57ef42d59d	mib: put tcp statistics on struct net Proc temporary uses stats from init_net. BTW, TCP_XXX_STATS are beautiful (w/o do { } while (0) facing) again :) Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-18 04:02:08 -07:00
Pavel Emelyanov	9b4661bd6e	ipv4: add pernet mib operations These ones are currently empty, but stuff from init_ipv4_mibs will sequentially migrate there. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-18 04:01:44 -07:00
David S. Miller	49997d7515	Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/torvalds/linux-2.6 Conflicts: Documentation/powerpc/booting-without-of.txt drivers/atm/Makefile drivers/net/fs_enet/fs_enet-main.c drivers/pci/pci-acpi.c net/8021q/vlan.c net/iucv/iucv.c	2008-07-18 02:39:39 -07:00
David S. Miller	a0c80b80e0	pkt_sched: Make default qdisc nonshared-multiqueue safe. Instead of 'pfifo_fast' we have just plain 'fifo_fast'. No priority queues, just a straight FIFO. This is necessary in order to legally have a seperate qdisc per queue in multi-TX-queue setups, and thus get full parallelization. Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-17 19:21:33 -07:00
David S. Miller	99194cff39	pkt_sched: Add multiqueue handling to qdisc_graft(). Move the destruction of the old queue into qdisc_graft(). When operating on a root qdisc (ie. "parent == NULL"), apply the operation to all queues. The caller has grabbed a single implicit reference for this graft, therefore when we apply the change to more than one queue we must grab additional qdisc references. Otherwise, we are operating on a class of a specific parent qdisc, and therefore no multiqueue handling is necessary. Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-17 19:21:30 -07:00
David S. Miller	8387400092	pkt_sched: Kill netdev_queue lock. We can simply use the qdisc->q.lock for all of the qdisc tree synchronization. Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-17 19:21:30 -07:00
David S. Miller	c7e4f3bbb4	pkt_sched: Kill qdisc_lock_tree and qdisc_unlock_tree. No longer used. Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-17 19:21:29 -07:00
David S. Miller	53049978df	pkt_sched: Make qdisc grafting locking more specific. Lock the root of the qdisc being operated upon. All explicit references to qdisc_tree_lock() are now gone. The only remaining uses are via the sch_tree_{lock,unlock}() and tcf_tree_{lock,unlock}() macros. Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-17 19:21:27 -07:00
David S. Miller	ead81cc5fc	netdevice: Move qdisc_list back into net_device proper. And give it it's own lock. Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-17 19:21:26 -07:00
David S. Miller	15b458fa65	pkt_sched: Kill qdisc_lock_tree usage in cls_route.c It just wants the qdisc tree to be synchronized, so grabbing qdisc_root_lock() is sufficient. Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-17 19:21:25 -07:00
David S. Miller	55dbc640c3	pkt_sched: Remove qdisc_lock_tree usage in cls_api.c It just wants the qdisc tree for the filter to be synchronized. So just BH lock qdisc_root_lock(q) instead. Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-17 19:21:24 -07:00
David S. Miller	17715e62a5	pkt_sched: Use per-queue locking in shutdown_scheduler_queue. This eliminates another qdisc_lock_tree user. Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-17 19:21:23 -07:00
David S. Miller	8a34c5dc3a	pkt_sched: Perform bulk of qdisc destruction in RCU. This allows less strict control of access to the qdisc attached to a netdev_queue. It is even allowed to enqueue into a qdisc which is in the process of being destroyed. The RCU handler will toss out those packets. We will need this to handle sharing of a qdisc amongst multiple TX queues. In such a setup the lock has to be shared, so will be inside of the qdisc itself. At which point the netdev_queue lock cannot be used to hard synchronize access to the ->qdisc pointer. One operation we have to keep inside of qdisc_destroy() is the list deletion. It is the only piece of state visible after the RCU quiesce period, so we have to undo it early and under the appropriate locking. The operations in the RCU handler do not need any looking because the qdisc tree is no longer visible to anything at that point. Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-17 19:21:22 -07:00
David S. Miller	16361127eb	pkt_sched: dev_init_scheduler() does not need to lock qdisc tree. We are registering the device, there is no way anyone can get at this object's qdiscs yet in any meaningful way. Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-17 19:21:21 -07:00
David S. Miller	37437bb2e1	pkt_sched: Schedule qdiscs instead of netdev_queue. When we have shared qdiscs, packets come out of the qdiscs for multiple transmit queues. Therefore it doesn't make any sense to schedule the transmit queue when logically we cannot know ahead of time the TX queue of the SKB that the qdisc->dequeue() will give us. Just for sanity I added a BUG check to make sure we never get into a state where the noop_qdisc is scheduled. Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-17 19:21:20 -07:00
David S. Miller	7698b4fcab	pkt_sched: Add and use qdisc_root() and qdisc_root_lock(). When code wants to lock the qdisc tree state, the logic operation it's doing is locking the top-level qdisc that sits of the root of the netdev_queue. Add qdisc_root_lock() to represent this and convert the easiest cases. In order for this to work out in all cases, we have to hook up the noop_qdisc to a dummy netdev_queue. Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-17 19:21:19 -07:00
David S. Miller	e2627c8c22	pkt_sched: Make QDISC_RUNNING a qdisc state. Currently it is associated with a netdev_queue, but when we have qdisc sharing that no longer makes any sense. Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-17 19:21:18 -07:00
David S. Miller	d3b753db7c	pkt_sched: Move gso_skb into Qdisc. We liberate any dangling gso_skb during qdisc destruction. It really only matters for the root qdisc. But when qdiscs can be shared by multiple netdev_queue objects, we can't have the gso_skb in the netdev_queue any more. Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-17 19:21:18 -07:00
David S. Miller	8f0f2223cc	net: Implement simple sw TX hashing. It just xor hashes over IPv4/IPv6 addresses and ports of transport. The only assumption it makes is that skb_network_header() is set correctly. With bug fixes from Eric Dumazet. Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-17 19:21:13 -07:00
David S. Miller	51cb6db0f5	mac80211: Reimplement WME using ->select_queue(). The only behavior change is that we do not drop packets under any circumstances. If that is absolutely needed, we could easily add it back. With cleanups and help from Johannes Berg. Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-17 19:21:12 -07:00
David S. Miller	eae792b722	netdev: Add netdev->select_queue() method. Devices or device layers can set this to control the queue selection performed by dev_pick_tx(). This function runs under RCU protection, which allows overriding functions to have some way of synchronizing with things like dynamic ->real_num_tx_queues adjustments. This makes the spinlock prefetch in dev_queue_xmit() a little bit less effective, but that's the price right now for correctness. Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-17 19:21:10 -07:00
David S. Miller	fd2ea0a79f	net: Use queue aware tests throughout. This effectively "flips the switch" by making the core networking and multiqueue-aware drivers use the new TX multiqueue structures. Non-multiqueue drivers need no changes. The interfaces they use such as netif_stop_queue() degenerate into an operation on TX queue zero. So everything "just works" for them. Code that really wants to do "X" to all TX queues now invokes a routine that does so, such as netif_tx_wake_all_queues(), netif_tx_stop_all_queues(), etc. pktgen and netpoll required a little bit more surgery than the others. In particular the pktgen changes, whilst functional, could be largely improved. The initial check in pktgen_xmit() will sometimes check the wrong queue, which is mostly harmless. The thing to do is probably to invoke fill_packet() earlier. The bulk of the netpoll changes is to make the code operate solely on the TX queue indicated by by the SKB queue mapping. Setting of the SKB queue mapping is entirely confined inside of net/core/dev.c:dev_pick_tx(). If we end up needing any kind of special semantics (drops, for example) it will be implemented here. Finally, we now have a "real_num_tx_queues" which is where the driver indicates how many TX queues are actually active. With IGB changes from Jeff Kirsher. Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-17 19:21:07 -07:00
David S. Miller	24344d2600	mac80211: Temporarily mark QoS support BROKEN. We will undo this after a few changsets. Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-17 19:21:05 -07:00
David S. Miller	1d8ae3fdeb	pkt_sched: Remove RR scheduler. This actually fixes a bug added by the RR scheduler changes. The ->bands and ->prio2band parameters were being set outside of the sch_tree_lock() and thus could result in strange behavior and inconsistencies. It might be possible, in the new design (where there will be one qdisc per device TX queue) to allow similar functionality via a TX hash algorithm for RR but I really see no reason to export this aspect of how these multiqueue cards actually implement the scheduling of the the individual DMA TX rings and the single physical MAC/PHY port. Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-17 19:21:04 -07:00
David S. Miller	09e83b5d7d	netdev: Kill NETIF_F_MULTI_QUEUE. There is no need for a feature bit for something that can be tested by simply checking the TX queue count. Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-17 19:21:03 -07:00
David S. Miller	e8a0464cc9	netdev: Allocate multiple queues for TX. alloc_netdev_mq() now allocates an array of netdev_queue structures for TX, based upon the queue_count argument. Furthermore, all accesses to the TX queues are now vectored through the netdev_get_tx_queue() and netdev_for_each_tx_queue() interfaces. This makes it easy to grep the tree for all things that want to get to a TX queue of a net device. Problem spots which are not really multiqueue aware yet, and only work with one queue, can easily be spotted by grepping for all netdev_get_tx_queue() calls that pass in a zero index. Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-17 19:21:00 -07:00
Patrick McHardy	51ce7ec921	garp: retry sending JoinIn messages after allocation failures Increase reliability by retrying to send JoinIn messages after memory allocation failures on each TRANSMIT_PDU event until it succeeds. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-16 20:51:47 -07:00
Neil Horman	9a6d276e85	core: add stat to track unresolved discards in neighbor cache in __neigh_event_send, if we have a neighbour entry which is in NUD_INCOMPLETE state, we enqueue any outbound frames to that neighbour to the neighbours arp_queue, which is default capped to a length of 3 skbs. If that queue exceeds its set length, it will drop an skb on the queue to enqueue the newly arrived skb. This results in a drop for which we have no statistics incremented. This patch adds an unresolved_discards stat to /proc/net/stat/ndisc_cache to track these lost frames. Signed-off-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-16 20:50:49 -07:00
Pavel Emelyanov	ed88098e25	mib: add net to NET_ADD_STATS_USER Done with NET_XXX_STATS macros :) To be continued... Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-16 20:32:45 -07:00
Pavel Emelyanov	f2bf415cfe	mib: add net to NET_ADD_STATS_BH This one is tricky. The thing is that this macro is only used when killing tw buckets, but since this killer is promiscuous wrt to which net each particular tw belongs to, I have to use it only when NET_NS is off. When the net namespaces are on, I use the INET_INC_STATS_BH for each bucket. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-16 20:32:25 -07:00
Pavel Emelyanov	6f67c817fc	mib: add net to NET_INC_STATS_USER Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-16 20:31:39 -07:00
Pavel Emelyanov	de0744af1f	mib: add net to NET_INC_STATS_BH Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-16 20:31:16 -07:00
Pavel Emelyanov	4e6734447d	mib: add net to NET_INC_STATS Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-16 20:30:14 -07:00
Pavel Emelyanov	1ed834655a	tcp: replace tcp_sock argument with sock in some places These places have a tcp_sock, but we'd prefer the sock itself to get net from it. Fortunately, tcp_sk macro is just a type cast, so this replace is really cheap. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-16 20:29:51 -07:00
Pavel Emelyanov	ca12a1a443	inet: prepare net on the stack for NET accounting macros Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-16 20:28:42 -07:00
Pavel Emelyanov	5c52ba170f	sock: add net to prot->enter_memory_pressure callback The tcp_enter_memory_pressure calls NET_INC_STATS, but doesn't have where to get the net from. I decided to add a sk argument, not the net itself, only to factor all the required sock_net(sk) calls inside the enter_memory_pressure callback itself. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-16 20:28:10 -07:00
Pavel Emelyanov	74688e487a	mib: add net to TCP_DEC_STATS Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-16 20:22:46 -07:00
Pavel Emelyanov	63231bddf6	mib: add net to TCP_INC_STATS_BH Same as before - the sock is always there to get the net from, but there are also some places with the net already saved on the stack. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-16 20:22:25 -07:00
Pavel Emelyanov	81cc8a75d9	mib: add net to TCP_INC_STATS Fortunately (almost) all the TCP code has a sock to get the net from :) Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-16 20:22:04 -07:00
Pavel Emelyanov	a9c19329ec	tcp: add net to tcp_mib_init This one sets TCP MIBs after zeroing them, and thus requires the net. The existing single caller can use init_net (temporarily). Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-16 20:21:42 -07:00
Pavel Emelyanov	a86b1e3019	inet: prepare struct net for TCP MIB accounting This is the same as the first patch in the set, but preparing the net for TCP_XXX_STATS - save the struct net on the stack where required and possible. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-16 20:20:58 -07:00
Pavel Emelyanov	c5346fe396	mib: add net to IP_ADD_STATS_BH Very simple - only ip_evictor (fragments) requires such. This patch ends up the IP_XXX_STATS patching. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-16 20:20:33 -07:00
Pavel Emelyanov	7c73a6faff	mib: add net to IP_INC_STATS_BH Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-16 20:20:11 -07:00
Pavel Emelyanov	5e38e27044	mib: add net to IP_INC_STATS All the callers already have either the net itself, or the place where to get it from. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-16 20:19:49 -07:00
Pavel Emelyanov	84a3aa000e	ipv4: prepare net initialization for IP accounting Some places, that deal with IP statistics already have where to get a struct net from, but use it directly, without declaring a separate variable on the stack. So, save this net on the stack for future IP_XXX_STATS macros. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-16 20:19:08 -07:00
Will Newton	70efce27fc	net/ipv4/tcp.c: Fix use of PULLHUP instead of POLLHUP in comments. Change PULLHUP to POLLHUP in tcp_poll comments and clean up another comment for grammar and coding style. Signed-off-by: Will Newton <will.newton@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-16 20:13:43 -07:00
Harvey Harrison	7b1c65faa2	net: make __skb_splice_bits static net/core/skbuff.c:1335:5: warning: symbol '__skb_splice_bits' was not declared. Should it be static? Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-16 20:12:30 -07:00
David S. Miller	885a4c966b	Merge branch 'stealer/ipvs/sync-daemon-cleanup-for-next' of git://git.stealer.net/linux-2.6	2008-07-16 20:07:06 -07:00
Rumen G. Bogdanovski	9d3a0de7dc	ipvs: More reliable synchronization on connection close This patch enhances the synchronization of the closing connections between the master and the backup director. It prevents the closed connections to expire with the 15 min timeout of the ESTABLISHED state on the backup and makes them expire as they would do on the master with much shorter timeouts. Signed-off-by: Rumen G. Bogdanovski <rumen@voicecho.com> Acked-by: Simon Horman <horms@verge.net.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-16 20:04:23 -07:00

... 8 9 10 11 12 ...

10389 Commits