Commit Graph

43586 Commits

Author SHA1 Message Date
David Howells
a7056c5ba6 rxrpc: Send an immediate ACK if we fill in a hole
Send an immediate ACK if we fill in a hole in the buffer left by an
out-of-sequence packet.  This may allow the congestion management in the peer
to avoid a retransmission if packets got reordered on the wire.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-24 23:49:46 +01:00
David Howells
805b21b929 rxrpc: Send an ACK after every few DATA packets we receive
Send an ACK if we haven't sent one for the last two packets we've received.
This keeps the other end apprised of where we've got to - which is
important if they're doing slow-start.

We do this in recvmsg so that we can dispatch a packet directly without the
need to wake up the background thread.

This should possibly be made configurable in future.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-24 18:05:26 +01:00
David S. Miller
2a9aa41fd2 RxRPC rewrite
-----BEGIN PGP SIGNATURE-----
 
 iQIVAwUAV+VA9/Sw1s6N8H32AQJufQ//bP1Aq6Ej+3km0o+S1w2lEl7kc1Gnown+
 PLKecg73rtMLEXDCJPKrh/OjGBX1GFaPC5HXWF/4TRav3/qjsi2/P7MSIYbt8ZAk
 0kVcdpIU1RWrPFt0tdhxTDSgPMhRWaKJTFV0CwQgnCqeyrVX44syx42B941Wsgcs
 N/6ANRr1qFLTwT1LKD0EhRWxnds1YI9WPgF5huaAn5RHpymCynwxsxyTDdU+NL0j
 YE75o5rjUzAZzMk5xrTjO8yd7xjUfjO7xT/CzjjVfXB4TOBmYExfOrLNZX0vOPPr
 LBOj3LTLwklWtx5/RJGe3A4y1+yfiUYGTu90ArARHUxP6AMwnxztvjuy2MQvBOgP
 xTlHcEOPoHrxxfIGdJH/0NRl1zgn6IF2lFOH3EgRcw8hzMdFp46AXV74ckst1vOQ
 LsqTbhndG2vkVFV8uZCZO7om0HRbpxbvy6+hEakdZ6k9kJA3amezS6Q8wiXxGhAE
 0wBns3XI/75Zd7QyXRkcHl3iUqS5uW0dKmAqitIlgf3CUsr4cZFlHO8IDD3rkTFD
 WbnvFhaL9Ee5IQjbsHgyzqPSI7sUJUYG1xBAz8wIWnjD0bGu5b/QzgdBSCVIFJ3e
 bNvpe1B6HEHj0EyIj1wK2H8311Y4lPNeJlOj7BDUqqRIZW8vviG26MTGML6GnJwU
 L35wa/s4ebo=
 =+8HJ
 -----END PGP SIGNATURE-----

Merge tag 'rxrpc-rewrite-20160923' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs

David Howells says:

====================
rxrpc: Bug fixes and tracepoints

Here are a bunch of bug fixes:

 (1) Need to set the timestamp on a Tx packet before queueing it to avoid
     trouble with the retransmission function.

 (2) Don't send an ACK at the end of the service reply transmission; it's
     the responsibility of the client to send an ACK to close the call.
     The service can resend the last DATA packet or send a PING ACK.

 (3) Wake sendmsg() on abnormal call termination.

 (4) Use ktime_add_ms() not ktime_add_ns() to add millisecond offsets.

 (5) Use before_eq() & co. to compare serial numbers (which may wrap).

 (6) Start the resend timer on DATA packet transmission.

 (7) Don't accidentally cancel a retransmission upon receiving a NACK.

 (8) Fix the call timer setting function to deal with timeouts that are now
     or past.

 (9) Don't use a flag to communicate the presence of the last packet in the
     Tx buffer from sendmsg to the input routines where ACK and DATA
     reception is handled.  The problem is that there's a window between
     queueing the last packet for transmission and setting the flag in
     which ACKs or reply DATA packets can arrive, causing apparent state
     machine violation issues.

     Instead use the annotation buffer to mark the last packet and pick up
     and set the flag in the input routines.

(10) Don't call the tx_ack tracepoint and don't allocate a serial number if
     someone else nicked the ACK we were about to transmit.

There are also new tracepoints and one altered tracepoint used to track
down the above bugs:

(11) Call timer tracepoint.

(12) Data Tx tracepoint (and adjustments to ACK tracepoint).

(13) Injected Rx packet loss tracepoint.

(14) Ack proposal tracepoint.

(15) Retransmission selection tracepoint.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-24 08:24:19 -04:00
David S. Miller
1678c1134f Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next
Steffen Klassert says:

====================
pull request (net-next): ipsec-next 2016-09-23

Only two patches this time:

1) Fix a comment reference to struct xfrm_replay_state_esn.
   From Richard Guy Briggs.

2) Convert xfrm_state_lookup to rcu, we don't need the
   xfrm_state_lock anymore in the input path.
   From Florian Westphal.

Please pull or let me know if there are problems.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-24 08:18:19 -04:00
Moshe Shemesh
79aab093a0 net: Update API for VF vlan protocol 802.1ad support
Introduce new rtnl UAPI that exposes a list of vlans per VF, giving
the ability for user-space application to specify it for the VF, as an
option to support 802.1ad.
We adjusted IP Link tool to support this option.

For future use cases, the new UAPI supports multiple vlans. For now we
limit the list size to a single vlan in kernel.
Add IFLA_VF_VLAN_LIST in addition to IFLA_VF_VLAN to keep backward
compatibility with older versions of IP Link tool.

Add a vlan protocol parameter to the ndo_set_vf_vlan callback.
We kept 802.1Q as the drivers' default vlan protocol.
Suitable ip link tool command examples:
  Set vf vlan protocol 802.1ad:
    ip link set eth0 vf 1 vlan 100 proto 802.1ad
  Set vf to VST (802.1Q) mode:
    ip link set eth0 vf 1 vlan 100 proto 802.1Q
  Or by omitting the new parameter
    ip link set eth0 vf 1 vlan 100

Signed-off-by: Moshe Shemesh <moshe@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-24 08:01:26 -04:00
David Howells
c6672e3fe4 rxrpc: Add a tracepoint to log which packets will be retransmitted
Add a tracepoint to log in rxrpc_resend() which packets will be
retransmitted.  Note that if a positive ACK comes in whilst we have dropped
the lock to retransmit another packet, the actual retransmission may not
happen, though some of the effects will (such as altering the congestion
management).

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-23 15:49:19 +01:00
David Howells
9c7ad43444 rxrpc: Add tracepoint for ACK proposal
Add a tracepoint to log proposed ACKs, including whether the proposal is
used to update a pending ACK or is discarded in favour of an easlier,
higher priority ACK.

Whilst we're at it, get rid of the rxrpc_acks() function and access the
name array directly.  We do, however, need to validate the ACK reason
number given to trace_rxrpc_rx_ack() to make sure we don't overrun the
array.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-23 15:49:19 +01:00
David Howells
89b475abdb rxrpc: Add a tracepoint to log injected Rx packet loss
Add a tracepoint to log received packets that get discarded due to Rx
packet loss.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-23 15:49:19 +01:00
David Howells
be832aecc5 rxrpc: Add data Tx tracepoint and adjust Tx ACK tracepoint
Add a tracepoint to log transmission of DATA packets (including loss
injection).

Adjust the ACK transmission tracepoint to include the packet serial number
and to line this up with the DATA transmission display.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-23 15:49:19 +01:00
David Howells
fc7ab6d29a rxrpc: Add a tracepoint for the call timer
Add a tracepoint to log call timer initiation, setting and expiry.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-23 15:49:19 +01:00
David Howells
b86e218e0d rxrpc: Don't call the tx_ack tracepoint if don't generate an ACK
rxrpc_send_call_packet() is invoking the tx_ack tracepoint before it checks
whether there's an ACK to transmit (another thread may jump in and transmit
it).

Fix this by only invoking the tracepoint if we get a valid ACK to transmit.

Further, only allocate a serial number if we're going to actually transmit
something.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-23 15:49:19 +01:00
David Howells
70790dbe3f rxrpc: Pass the last Tx packet marker in the annotation buffer
When the last packet of data to be transmitted on a call is queued, tx_top
is set and then the RXRPC_CALL_TX_LAST flag is set.  Unfortunately, this
leaves a race in the ACK processing side of things because the flag affects
the interpretation of tx_top and also allows us to start receiving reply
data before we've finished transmitting.

To fix this, make the following changes:

 (1) rxrpc_queue_packet() now sets a marker in the annotation buffer
     instead of setting the RXRPC_CALL_TX_LAST flag.

 (2) rxrpc_rotate_tx_window() detects the marker and sets the flag in the
     same context as the routines that use it.

 (3) rxrpc_end_tx_phase() is simplified to just shift the call state.
     The Tx window must have been rotated before calling to discard the
     last packet.

 (4) rxrpc_receiving_reply() is added to handle the arrival of the first
     DATA packet of a reply to a client call (which is an implicit ACK of
     the Tx phase).

 (5) The last part of rxrpc_input_ack() is reordered to perform Tx
     rotation, then soft-ACK application and then to end the phase if we've
     rotated the last packet.  In the event of a terminal ACK, the soft-ACK
     application will be skipped as nAcks should be 0.

 (6) rxrpc_input_ackall() now has to rotate as well as ending the phase.

In addition:

 (7) Alter the transmit tracepoint to log the rotation of the last packet.

 (8) Remove the no-longer relevant queue_reqack tracepoint note.  The
     ACK-REQUESTED packet header flag is now set as needed when we actually
     transmit the packet and may vary by retransmission.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-23 15:49:19 +01:00
David Howells
01a88f7f6b rxrpc: Fix call timer
Fix the call timer in the following ways:

 (1) If call->resend_at or call->ack_at are before or equal to the current
     time, then ignore that timeout.

 (2) If call->expire_at is before or equal to the current time, then don't
     set the timer at all (possibly we should queue the call).

 (3) Don't skip modifying the timer if timer_pending() is true.  This
     indicates that the timer is working, not that it has expired and is
     running/waiting to run its expiry handler.

Also call rxrpc_set_timer() to start the call timer going rather than
calling add_timer().

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-23 15:49:19 +01:00
David Howells
be8aa33806 rxrpc: Fix accidental cancellation of scheduled resend by ACK parser
When rxrpc_input_soft_acks() is parsing the soft-ACKs from an ACK packet,
it updates the Tx packet annotations in the annotation buffer.  If a
soft-ACK is an ACK, then we overwrite unack'd, nak'd or to-be-retransmitted
states and that is fine; but if the soft-ACK is an NACK, we overwrite the
to-be-retransmitted with a nak - which isn't.

Instead, we need to let any scheduled retransmission stand if the packet
was NAK'd.

Note that we don't reissue a resend if the annotation is in the
to-be-retransmitted state because someone else must've scheduled the
resend already.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-23 15:35:45 +01:00
David Howells
dfc3da4404 rxrpc: Need to start the resend timer on initial transmission
When a DATA packet has its initial transmission, we may need to start or
adjust the resend timer.  Without this we end up relying on being sent a
NACK to initiate the resend.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-23 14:05:12 +01:00
David Howells
98dafac569 rxrpc: Use before_eq() and friends to compare serial numbers
before_eq() and friends should be used to compare serial numbers (when not
checking for (non)equality) rather than casting to int, subtracting and
checking the result.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-23 14:05:08 +01:00
Daniel Borkmann
7a4b28c6cc bpf: add helper to invalidate hash
Add a small helper that complements 36bbef52c7 ("bpf: direct packet
write and access for helpers for clsact progs") for invalidating the
current skb->hash after mangling on headers via direct packet write.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-23 08:40:28 -04:00
Daniel Borkmann
669dc4d76d bpf: use bpf_get_smp_processor_id_proto instead of raw one
Same motivation as in commit 80b48c4457 ("bpf: don't use raw processor
id in generic helper"), but this time for XDP typed programs. Thus, allow
for preemption checks when we have DEBUG_PREEMPT enabled, and otherwise
use the raw variant.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-23 08:40:28 -04:00
Daniel Borkmann
2d48c5f933 bpf: use skb_to_full_sk helper in bpf_skb_under_cgroup
We need to use skb_to_full_sk() helper introduced in commit bd5eb35f16
("xfrm: take care of request sockets") as otherwise we miss tcp synack
messages, since ownership is on request socket and therefore it would
miss the sk_fullsock() check. Use skb_to_full_sk() as also done similarly
in the bpf_get_cgroup_classid() helper via 2309236c13 ("cls_cgroup:
get sk_classid only from full sockets") fix to not let this fall through.

Fixes: 4a482f34af ("cgroup: bpf: Add bpf_skb_in_cgroup_proto")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-23 08:40:27 -04:00
Vivien Didelot
732f794c1b net: dsa: add port fast ageing
Today the DSA drivers are in charge of flushing the MAC addresses
associated to a port when its STP state changes from Learning or
Forwarding, to Disabled or Blocking or Listening.

This makes the drivers more complex and hides the generic switch logic.
Introduce a new optional port_fast_age operation to dsa_switch_ops, to
move this logic to the DSA layer and keep drivers simple.

Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-23 08:38:50 -04:00
Vivien Didelot
4acfee8143 net: dsa: add port STP state helper
Add a void helper to set the STP state of a port, checking first if the
required routine is provided by the driver.

Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-23 08:38:50 -04:00
David Howells
90bd684ded rxrpc: Should be using ktime_add_ms() not ktime_add_ns()
ktime_add_ms() should be used to add the resend time (in ms) rather than
ktime_add_ns().

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-23 13:23:09 +01:00
David Howells
c0d058c21c rxrpc: Make sure sendmsg() is woken on call completion
Make sure that sendmsg() gets woken up if the call it is waiting for
completes abnormally.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-23 13:23:09 +01:00
David Howells
9aff212bd6 rxrpc: Don't send an ACK at the end of service call response transmission
Don't send an IDLE ACK at the end of the transmission of the response to a
service call.  The service end resends DATA packets until the client sends an
ACK that hard-acks all the send data.  At that point, the call is complete.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-23 13:23:09 +01:00
David Howells
b24d2891cf rxrpc: Preset timestamp on Tx sk_buffs
Set the timestamp on sk_buffs holding packets to be transmitted before
queueing them because the moment the packet is on the queue it can be seen
by the retransmission algorithm - which may see a completely random
timestamp.

If the retransmission algorithm sees such a timestamp, it may retransmit
the packet and, in future, tell the congestion management algorithm that
the retransmit timer expired.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-23 13:17:52 +01:00
Eric Dumazet
fefa569a9d net_sched: sch_fq: account for schedule/timers drifts
It looks like the following patch can make FQ very precise, even in VM
or stressed hosts. It matters at high pacing rates.

We take into account the difference between the time that was programmed
when last packet was sent, and current time (a drift of tens of usecs is
often observed)

Add an EWMA of the unthrottle latency to help diagnostics.

This latency is the difference between current time and oldest packet in
delayed RB-tree. This accounts for the high resolution timer latency,
but can be different under stress, as fq_check_throttled() can be
opportunistically be called from a dequeue() called after an enqueue()
for a different flow.

Tested:
// Start a 10Gbit flow
$ netperf --google-pacing-rate 1250000000 -H lpaa24 -l 10000 -- -K bbr &

Before patch :
$ sar -n DEV 10 5 | grep eth0 | grep Average
Average:         eth0  17106.04 756876.84   1102.75 1119049.02      0.00      0.00      0.52

After patch :
$ sar -n DEV 10 5 | grep eth0 | grep Average
Average:         eth0  17867.00 800245.90   1151.77 1183172.12      0.00      0.00      0.52

A new iproute2 tc can output the 'unthrottle latency' :

$ tc -s qd sh dev eth0 | grep latency
  0 gc, 0 highprio, 32490767 throttled, 2382 ns latency

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-23 07:19:06 -04:00
Marcelo Ricardo Leitner
a3007446e5 sctp: fix the handling of SACK Gap Ack blocks
sctp_acked() is using 32bit arithmetics on 16bits vars, via TSN_lte()
macros, which is weird and confusing.

Once the offset to ctsn is calculated, all wrapping is already handled
and thus to verify the Gap Ack blocks we can just use pure
less/big-or-equal than checks.

Also, rename gap variable to tsn_offset, so it's more meaningful, as
it doesn't point to any gap at all.

Even so, I don't think this discrepancy resulted in any practical bug.

This patch is a preparation for the next one, which will introduce
typecheck() for TSN_lte() macros and would cause a compile error here.

Suggested-by: David Laight <David.Laight@ACULAB.COM>
Reported-by: David Laight <David.Laight@ACULAB.COM>
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-23 06:54:58 -04:00
WANG Cong
21641c2e1f net_sched: check NULL on error path in route4_change()
On error path in route4_change(), 'f' could be NULL,
so we should check NULL before calling tcf_exts_destroy().

Fixes: b9a24bb76b ("net_sched: properly handle failure case of tcf_exts_init()")
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-23 06:51:49 -04:00
David S. Miller
d6989d4bbe Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2016-09-23 06:46:57 -04:00
Linus Torvalds
f887c21e21 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:
 "Mostly small bits scattered all over the place, which is usually how
  things go this late in the -rc series.

   1) Proper driver init device resets in bnx2, from Baoquan He.

   2) Fix accounting overflow in __tcp_retransmit_skb(),
      sk_forward_alloc, and ip_idents_reserve, from Eric Dumazet.

   3) Fix crash in bna driver ethtool stats handling, from Ivan Vecera.

   4) Missing check of skb_linearize() return value in mac80211, from
      Johannes Berg.

   5) Endianness fix in nf_table_trace dumps, from Liping Zhang.

   6) SSN comparison fix in SCTP, from Marcelo Ricardo Leitner.

   7) Update DSA and b44 MAINTAINERS entries.

   8) Make input path of vti6 driver work again, from Nicolas Dichtel.

   9) Off-by-one in mlx4, from Sebastian Ott.

  10) Fix fallback route lookup handling in ipv6, from Vincent Bernat.

  11) Fix stack corruption on probe in qed driver, from Yuval Mintz.

  12) PHY init fixes in r8152 from Hayes Wang.

  13) Missing SKB free in irda_accept error path, from Phil Turnbull"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (61 commits)
  tcp: properly account Fast Open SYN-ACK retrans
  tcp: fix under-accounting retransmit SNMP counters
  MAINTAINERS: Update b44 maintainer.
  net: get rid of an signed integer overflow in ip_idents_reserve()
  net/mlx4_core: Fix to clean devlink resources
  net: can: ifi: Configure transmitter delay
  vti6: fix input path
  ipmr, ip6mr: return lastuse relative to now
  r8152: disable ALDPS and EEE before setting PHY
  r8152: remove r8153_enable_eee
  r8152: move PHY settings to hw_phy_cfg
  r8152: move enabling PHY
  r8152: move some functions
  cxgb4/cxgb4vf: Allocate more queues for 25G and 100G adapter
  qed: Fix stack corruption on probe
  MAINTAINERS: Add an entry for the core network DSA code
  net: ipv6: fallback to full lookup if table lookup is unsuitable
  net/mlx5: E-Switch, Handle mode change failures
  net/mlx5: E-Switch, Fix error flow in the SRIOV e-switch init code
  net/mlx5: Fix flow counter bulk command out mailbox allocation
  ...
2016-09-22 08:49:25 -07:00
David S. Miller
60cd6e63ec RxRPC rewrite
-----BEGIN PGP SIGNATURE-----
 
 iQIVAwUAV+OQV/Sw1s6N8H32AQK/Gw//TF7n19v+gqUenh5m6xPYkVlZl6d/TRi+
 3JoG5pdNORxTDU7UgzkeuCywDTk5XUYsJs3TOzInRAdDedwfgIiwF3ZKw3Bo30vR
 cVUfG7GK4o+CLWifL3BILYMTJfkOnUS4sllylSqX/EOlPDEEspSRWTxXq+DCOGNZ
 1APBRD8XfA+IIC3fleMh+zSpKZ3ffc2c7djelzo2nCG3ku78U57B23TCyzp2tQNQ
 6ClvhOAwL2nMXF5vebtIU7ou6LUV/TdC4qTkQuz3du/+k+LOG/c8/6s6k70MgXQU
 L3DW3rcnrWxkyzDb5oQoGYSWG5x4gp/TazHbJE2kuUVhQma8eDbOAGRWJoxlSzoC
 LqHE+6q3KnwwXpbYd3DJ+jUI7pu7pUvub1cvJr0uxPcjRb4CzhHT/1OBUb9p4CJX
 /n8NFNXk+5qWsvLaPuNNBPs4pc2xgz/cotjKBYUznqObiq2xgeivZpbsEBOpSIT1
 2hl0EuyAi1Gwpi6qfW8oM6EGrlAzuG77cLcLnxrDz+GsHcgqUvdSuTh0P26eOh7D
 1V03kkfX9dIqkOc5xA9xAckopfG5BhQDiFsMV+5McZ2x8GtUdnMw8E7dsG8xIeY5
 yDzk9m6tD79PlqS7HJ7Fzj6owzqLUeJOI08y9EUSacBFKzNak1NVmcYcXd10rDFj
 duNM4rDi6zA=
 =3zfm
 -----END PGP SIGNATURE-----

Merge tag 'rxrpc-rewrite-20160922-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs

David Howells says:

====================
rxrpc: Preparation for slow-start algorithm [ver #2]

Here are some patches that prepare for improvements in ACK generation and
for the implementation of the slow-start part of the protocol:

 (1) Stop storing the protocol header in the Tx socket buffers, but rather
     generate it on the fly.  This potentially saves a little space and
     makes it easier to alter the header just before transmission (the
     flags may get altered and the serial number has to be changed).

 (2) Mask off the Tx buffer annotations and add a flag to record which ones
     have already been resent.

 (3) Track RTT on a per-peer basis for use in future changes.  Tracepoints
     are added to log this.

 (4) Send PING ACKs in response to incoming calls to elicit a PING-RESPONSE
     ACK from which RTT data can be calculated.  The response also carries
     other useful information.

 (5) Expedite PING-RESPONSE ACK generation from sendmsg.  If we're actively
     using sendmsg, this allows us, under some circumstances, to avoid
     having to rely on the background work item to run to generate this
     ACK.

     This requires ktime_sub_ms() to be added.

 (6) Set the REQUEST-ACK flag on some DATA packets to elicit ACK-REQUESTED
     ACKs from which RTT data can be calculated.

 (7) Limit the use of pings and ACK requests for RTT determination.

Changes:

 (V2) Don't use the C division operator for 64-bit division.  One instance
      should use do_div() and the other should be using nsecs_to_jiffies().

      The last two patches got transposed, leading to an undefined symbol
      in one of them.

      Reported-by: kbuild test robot <lkp@intel.com>
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-22 08:14:59 -04:00
David Howells
fc943f6777 rxrpc: Reduce the number of PING ACKs sent
We don't want to send a PING ACK for every new incoming call as that just
adds to the network traffic.  Instead, we send a PING ACK to the first
three that we receive and then once per second thereafter.

This could probably be made adjustable in future.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-22 08:49:22 +01:00
David Howells
0d4b103c00 rxrpc: Reduce the number of ACK-Requests sent
Reduce the number of ACK-Requests we set on DATA packets that we're sending
to reduce network traffic.  We set the flag on odd-numbered DATA packets to
start off the RTT cache until we have at least three entries in it and then
probe once per second thereafter to keep it topped up.

This could be made tunable in future.

Note that from this point, the RXRPC_REQUEST_ACK flag is set on DATA
packets as we transmit them and not stored statically in the sk_buff.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-22 08:49:20 +01:00
Yuchung Cheng
7e32b44361 tcp: properly account Fast Open SYN-ACK retrans
Since the TFO socket is accepted right off SYN-data, the socket
owner can call getsockopt(TCP_INFO) to collect ongoing SYN-ACK
retransmission or timeout stats (i.e., tcpi_total_retrans,
tcpi_retransmits). Currently those stats are only updated
upon handshake completes. This patch fixes it.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-22 03:33:01 -04:00
Yuchung Cheng
de1d657816 tcp: fix under-accounting retransmit SNMP counters
This patch fixes these under-accounting SNMP rtx stats
LINUX_MIB_TCPFORWARDRETRANS
LINUX_MIB_TCPFASTRETRANS
LINUX_MIB_TCPSLOWSTARTRETRANS
when retransmitting TSO packets

Fixes: 10d3be5692 ("tcp-tso: do not split TSO packets at retransmit time")
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-22 03:33:01 -04:00
David Howells
50235c4b5a rxrpc: Obtain RTT data by requesting ACKs on DATA packets
In addition to sending a PING ACK to gain RTT data, we can set the
RXRPC_REQUEST_ACK flag on a DATA packet and get a REQUESTED-ACK ACK.  The
ACK packet contains the serial number of the packet it is in response to,
so we can look through the Tx buffer for a matching DATA packet.

This requires that the data packets be stamped with the time of
transmission as a ktime rather than having the resend_at time in jiffies.

This further requires the resend code to do the resend determination in
ktimes and convert to jiffies to set the timer.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-22 08:21:24 +01:00
David Howells
7aa51da7c8 rxrpc: Expedite ping response transmission
Expedite the transmission of a response to a PING ACK by sending it from
sendmsg if one is pending.  We're most likely to see a PING ACK during the
client call Tx phase as the other side may use it to determine a number of
parameters, such as the client's receive window size, the RTT and whether
the client is doing slow start (similar to RFC5681).

If we don't expedite it, it's left to the background processing thread to
transmit.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-22 08:21:24 +01:00
David Howells
8e83134db4 rxrpc: Send pings to get RTT data
Send a PING ACK packet to the peer when we get a new incoming call from a
peer we don't have a record for.  The PING RESPONSE ACK packet will tell us
the following about the peer:

 (1) its receive window size

 (2) its MTU sizes

 (3) its support for jumbo DATA packets

 (4) if it supports slow start (similar to RFC 5681)

 (5) an estimate of the RTT

This is necessary because the peer won't normally send us an ACK until it
gets to the Rx phase and we send it a packet, but we would like to know
some of this information before we start sending packets.

A pair of tracepoints are added so that RTT determination can be observed.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-22 08:21:24 +01:00
Marcelo Ricardo Leitner
4a225ce395 sctp: make use of SCTP_TRUNC4 macro
And avoid the usage of '&~3'. This is the last place still not using
the macro.
Also break the line to make it easier to read.

Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-22 03:13:26 -04:00
Marcelo Ricardo Leitner
e2f036a972 sctp: rename WORD_TRUNC/ROUND macros
To something more meaningful these days, specially because this is
working on packet headers or lengths and which are not tied to any CPU
arch but to the protocol itself.

So, WORD_TRUNC becomes SCTP_TRUNC4 and WORD_ROUND becomes SCTP_PAD4.

Reported-by: David Laight <David.Laight@ACULAB.COM>
Reported-by: David Miller <davem@davemloft.net>
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-22 03:13:26 -04:00
David S. Miller
ba1ba25d31 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec
Steffen Klassert says:

====================
pull request (net): ipsec 2016-09-21

1) Propagate errors on security context allocation.
   From Mathias Krause.

2) Fix inbound policy checks for inter address family tunnels.
   From Thomas Zeitlhofer.

3) Fix an old memory leak on aead algorithm usage.
   From Ilan Tayari.

4) A recent patch fixed a possible NULL pointer dereference
   but broke the vti6 input path.
   Fix from Nicolas Dichtel.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-22 02:56:23 -04:00
Eric Dumazet
f9616c35a0 tcp: implement TSQ for retransmits
We saw sch_fq drops caused by the per flow limit of 100 packets and TCP
when dealing with large cwnd and bursts of retransmits.

Even after increasing the limit to 1000, and even after commit
10d3be5692 ("tcp-tso: do not split TSO packets at retransmit time"),
we can still have these drops.

Under certain conditions, TCP can spend a considerable amount of
time queuing thousands of skbs in a single tcp_xmit_retransmit_queue()
invocation, incurring latency spikes and stalls of other softirq
handlers.

This patch implements TSQ for retransmits, limiting number of packets
and giving more chance for scheduling packets in both ways.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-22 02:44:16 -04:00
Eric Dumazet
adb03115f4 net: get rid of an signed integer overflow in ip_idents_reserve()
Jiri Pirko reported an UBSAN warning happening in ip_idents_reserve()

[] UBSAN: Undefined behaviour in ./arch/x86/include/asm/atomic.h:156:11
[] signed integer overflow:
[] -2117905507 + -695755206 cannot be represented in type 'int'

Since we do not have uatomic_add_return() yet, use atomic_cmpxchg()
so that the arithmetics can be done using unsigned int.

Fixes: 04ca6973f7 ("ip: make IP identifiers less predictable")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-22 02:41:17 -04:00
Shmulik Ladkani
ecf4ee41d2 net: skbuff: Coding: Use eth_type_vlan() instead of open coding it
Fix 'skb_vlan_pop' to use eth_type_vlan instead of directly comparing
skb->protocol to ETH_P_8021Q or ETH_P_8021AD.

Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
Reviewed-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-22 01:35:57 -04:00
Shmulik Ladkani
636c262808 net: skbuff: Remove errornous length validation in skb_vlan_pop()
In 93515d53b1
  "net: move vlan pop/push functions into common code"
skb_vlan_pop was moved from its private location in openvswitch to
skbuff common code.

In case skb has non hw-accel vlan tag, the original 'pop_vlan()' assured
that skb->len is sufficient (if skb->len < VLAN_ETH_HLEN then pop was
considered a no-op).

This validation was moved as is into the new common 'skb_vlan_pop'.

Alas, in its original location (openvswitch), there was a guarantee that
'data' points to the mac_header, therefore the 'skb->len < VLAN_ETH_HLEN'
condition made sense.
However there's no such guarantee in the generic 'skb_vlan_pop'.

For short packets received in rx path going through 'skb_vlan_pop',
this causes 'skb_vlan_pop' to fail pop-ing a valid vlan hdr (in the non
hw-accel case) or to fail moving next tag into hw-accel tag.

Remove the 'skb->len < VLAN_ETH_HLEN' condition entirely:
It is superfluous since inner '__skb_vlan_pop' already verifies there
are VLAN_ETH_HLEN writable bytes at the mac_header.

Note this presents a slight change to skb_vlan_pop() users:
In case total length is smaller than VLAN_ETH_HLEN, skb_vlan_pop() now
returns an error, as opposed to previous "no-op" behavior.
Existing callers (e.g. tc act vlan, ovs) usually drop the packet if
'skb_vlan_pop' fails.

Fixes: 93515d53b1 ("net: move vlan pop/push functions into common code")
Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
Cc: Pravin Shelar <pshelar@ovn.org>
Reviewed-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-22 01:35:57 -04:00
Shmulik Ladkani
45a497f2d1 net/sched: act_vlan: Introduce TCA_VLAN_ACT_MODIFY vlan action
TCA_VLAN_ACT_MODIFY allows one to change an existing tag.

It accepts same attributes as TCA_VLAN_ACT_PUSH (protocol, id,
priority).
If packet is vlan tagged, then the tag gets overwritten according to
user specified attributes.

For example, this allows user to replace a tag's vid while preserving
its priority bits (as opposed to "action vlan pop pipe action vlan push").

Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-22 01:34:20 -04:00
Shmulik Ladkani
bfca4c520f net: skbuff: Export __skb_vlan_pop
This exports the functionality of extracting the tag from the payload,
without moving next vlan tag into hw accel tag.

Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-22 01:34:20 -04:00
David Howells
cf1a6474f8 rxrpc: Add per-peer RTT tracker
Add a function to track the average RTT for a peer.  Sources of RTT data
will be added in subsequent patches.

The RTT data will be useful in the future for determining resend timeouts
and for handling the slow-start part of the Rx protocol.

Also add a pair of tracepoints, one to log transmissions to elicit a
response for RTT purposes and one to log responses that contribute RTT
data.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-22 01:26:25 +01:00
David Howells
f07373ead4 rxrpc: Add re-sent Tx annotation
Add a Tx-phase annotation for packet buffers to indicate that a buffer has
already been retransmitted.  This will be used by future congestion
management.  Re-retransmissions of a packet don't affect the congestion
window managment in the same way as initial retransmissions.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-22 01:23:50 +01:00
David Howells
5a924b8951 rxrpc: Don't store the rxrpc header in the Tx queue sk_buffs
Don't store the rxrpc protocol header in sk_buffs on the transmit queue,
but rather generate it on the fly and pass it to kernel_sendmsg() as a
separate iov.  This reduces the amount of storage required.

Note that the security header is still stored in the sk_buff as it may get
encrypted along with the data (and doesn't change with each transmission).

Signed-off-by: David Howells <dhowells@redhat.com>
2016-09-22 01:23:50 +01:00