linux/include/net
Tom Herbert fec5e652e5 rfs: Receive Flow Steering
This patch implements receive flow steering (RFS).  RFS steers
received packets for layer 3 and 4 processing to the CPU where
the application for the corresponding flow is running.  RFS is an
extension of Receive Packet Steering (RPS).

The basic idea of RFS is that when an application calls recvmsg
(or sendmsg) the application's running CPU is stored in a hash
table that is indexed by the connection's rxhash, which is stored in
the socket structure.  The rxhash is passed in skbs received on
the connection from netif_receive_skb.  For each received packet,
the associated rxhash is used to look up the CPU in the hash table;
if a valid CPU is set, the packet is steered to that CPU using
the RPS mechanisms.
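
As a rough illustration of the record-and-look-up idea above, here is a
minimal userspace sketch.  It is not the kernel code: the type
sock_flow_table, the helpers table_init/record_flow/lookup_flow, the
NO_CPU marker and TABLE_SIZE are all hypothetical names chosen for the
example.

```c
#include <stdint.h>

#define NO_CPU     0xffffu  /* hypothetical "no CPU recorded" marker */
#define TABLE_SIZE 256u     /* hypothetical size; must be a power of two */

/* Hypothetical global table: one "desired CPU" per hash bucket. */
struct sock_flow_table {
    uint16_t cpu[TABLE_SIZE];
};

/* Mark every bucket as unset. */
void table_init(struct sock_flow_table *t)
{
    for (uint32_t i = 0; i < TABLE_SIZE; i++)
        t->cpu[i] = NO_CPU;
}

/* recvmsg/sendmsg side: remember the CPU the application is running on. */
void record_flow(struct sock_flow_table *t, uint32_t rxhash, uint16_t cpu)
{
    t->cpu[rxhash & (TABLE_SIZE - 1)] = cpu;
}

/* Receive side: return the recorded CPU for this hash, or NO_CPU. */
uint16_t lookup_flow(const struct sock_flow_table *t, uint32_t rxhash)
{
    return t->cpu[rxhash & (TABLE_SIZE - 1)];
}
```

Two flows whose hashes land in the same bucket share an entry in this
sketch, so the lookup is a hint rather than an exact match.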

The complication of the simple approach is that it would potentially
allow out-of-order (OOO) packets.  If threads are thrashing around
CPUs or multiple threads are trying to read from the same sockets, a
quickly changing CPU value in the hash table could cause rampant OOO
packets; we consider this a non-starter.

To avoid OOO packets, this solution implements two types of hash
tables: rps_sock_flow_table and rps_dev_flow_table.

rps_sock_flow_table is a global hash table.  Each entry is just a CPU
number and it is populated in recvmsg and sendmsg as described above.
This table contains the "desired" CPUs for flows.

rps_dev_flow_table is specific to each device queue.  Each entry
contains a CPU and a tail queue counter.  The CPU is the "current"
CPU for a matching flow.  The tail queue counter holds the value
of a tail queue counter for the associated CPU's backlog queue at
the time of last enqueue for a flow matching the entry.

Each backlog queue has a queue head counter which is incremented
on dequeue, and so a queue tail counter is computed as queue head
count + queue length.  When a packet is enqueued on a backlog queue,
the current value of the queue tail counter is saved in the hash
entry of the rps_dev_flow_table.
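
The counter bookkeeping described above can be sketched in userspace C
as follows.  The struct and function names (backlog, dev_flow,
backlog_tail, backlog_enqueue, backlog_dequeue) are hypothetical, and
the sketch assumes the tail counter is sampled after the packet has
been added to the queue.

```c
#include <stdint.h>

/* Hypothetical per-CPU backlog queue, reduced to its counters. */
struct backlog {
    uint32_t head;  /* incremented on every dequeue */
    uint32_t qlen;  /* packets currently queued */
};

/* Hypothetical per-flow entry in a device flow table. */
struct dev_flow {
    uint16_t cpu;         /* "current" CPU for the flow */
    uint32_t last_qtail;  /* tail counter at the last enqueue */
};

/* The tail counter is not stored; it is derived from head + length. */
uint32_t backlog_tail(const struct backlog *q)
{
    return q->head + q->qlen;
}

/* Enqueue one packet and save the resulting tail counter in the flow entry. */
void backlog_enqueue(struct backlog *q, struct dev_flow *f)
{
    q->qlen++;
    f->last_qtail = backlog_tail(q);
}

/* Dequeue one packet: the head counter advances, the length shrinks. */
void backlog_dequeue(struct backlog *q)
{
    if (q->qlen > 0) {
        q->qlen--;
        q->head++;
    }
}
```

Because a dequeue raises head and lowers the length by the same amount,
the tail counter only moves forward when a packet is enqueued.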

And now the trick: when selecting the CPU for RPS (get_rps_cpu)
the rps_sock_flow table and the rps_dev_flow table for the RX queue
are consulted.  When the desired CPU for the flow (found in the
rps_sock_flow table) does not match the current CPU (found in the
rps_dev_flow table), the current CPU is changed to the desired CPU
if one of the following is true:

- The current CPU is unset (equal to RPS_NO_CPU)
- The current CPU is offline
- The current CPU's queue head counter >= queue tail counter in the
  rps_dev_flow table.  This checks whether the queue head has advanced
  past the last packet that was enqueued using this table entry.
  This guarantees that all packets queued using this entry have been
  dequeued, thus preserving in-order delivery.
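
The three conditions above can be written as a small predicate.  This
is a sketch under stated assumptions, not the get_rps_cpu
implementation: the names may_switch_cpu and select_cpu, the NO_CPU
value and the parameter layout are hypothetical, and the desired CPU is
assumed to be valid.  The signed comparison is one common way to keep
the ">=" test meaningful if the 32-bit counters wrap.

```c
#include <stdbool.h>
#include <stdint.h>

#define NO_CPU 0xffffu  /* hypothetical stand-in for RPS_NO_CPU */

/* True if the flow may move away from its current CPU without reordering. */
bool may_switch_cpu(uint16_t current_cpu, bool current_online,
                    uint32_t queue_head, uint32_t last_qtail)
{
    if (current_cpu == NO_CPU)
        return true;            /* no current CPU yet */
    if (!current_online)
        return true;            /* current CPU is offline */
    /* Head has reached the saved tail: everything queued was dequeued. */
    return (int32_t)(queue_head - last_qtail) >= 0;
}

/* Pick the CPU for the next packet of a flow (desired assumed valid). */
uint16_t select_cpu(uint16_t desired_cpu, uint16_t current_cpu,
                    bool current_online,
                    uint32_t queue_head, uint32_t last_qtail)
{
    if (desired_cpu != current_cpu &&
        may_switch_cpu(current_cpu, current_online, queue_head, last_qtail))
        return desired_cpu;
    return current_cpu;
}
```

While packets for the flow are still sitting in the old CPU's backlog,
select_cpu keeps returning the old CPU, which is what preserves
ordering.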

Making each queue have its own rps_dev_flow table has two advantages:
1) the tail queue counters will be written on each receive, so
keeping the table local to the interrupting CPU is good for locality.
2) this allows lockless access to the table.  The CPU number and
queue tail counter need to be accessed together under mutual
exclusion from netif_receive_skb; we assume that this is only called
from device napi_poll, which is non-reentrant.

This patch implements RFS for TCP and connected UDP sockets.
It should be usable for other flow oriented protocols.

There are two configuration parameters for RFS.  The
"rps_flow_entries" kernel init parameter sets the number of
entries in the rps_sock_flow_table; the per rxqueue sysfs entry
"rps_flow_cnt" contains the number of entries in the rps_dev_flow
table for the rxqueue.  Both are rounded to a power of two.
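
A power-of-two table size lets the hash be reduced to an index with a
single mask.  The sketch below assumes the rounding is upward; the
helper names round_up_pow2 and flow_index are hypothetical and are not
the kernel's own helpers.

```c
#include <stdint.h>

/* Round n up to the next power of two (returns 1 for n == 0 or 1). */
uint32_t round_up_pow2(uint32_t n)
{
    uint32_t p = 1;

    /* Stop at 2^31 so the shift cannot overflow. */
    while (p < n && p < 0x80000000u)
        p <<= 1;
    return p;
}

/* Map a flow hash to a table index; entries must be a power of two. */
uint32_t flow_index(uint32_t rxhash, uint32_t entries)
{
    return rxhash & (entries - 1);
}
```

For example, a requested count of 1000 becomes 1024 entries, and the
index is then the low 10 bits of the hash.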

The obvious benefit of RFS (over just RPS) is that it achieves
CPU locality between the receive processing for a flow and the
application's processing; this can result in increased performance
(higher pps, lower latency).

The benefits of RFS are dependent on cache hierarchy, application
load, and other factors.  On simple benchmarks, we don't necessarily
see improvement and sometimes see degradation.  However, for more
complex benchmarks and for applications where cache pressure is
much higher this technique seems to perform very well.

Below are some benchmark results which show the potential benefit of
this patch.  The netperf test has 500 instances of netperf TCP_RR
test with 1 byte req. and resp.  The RPC test is a request/response
test similar in structure to netperf RR test with 100 threads on
each host, but does more work in userspace than netperf.

e1000e on 8 core Intel
   No RFS or RPS:             104K tps at 30% CPU
   No RFS (best RPS config):  290K tps at 63% CPU
   RFS:                       303K tps at 61% CPU

RPC test      tps    CPU%   50/90/99% usec latency   Latency StdDev
  No RFS/RPS  103K   48%    757/900/3185             4472.35
  RPS only    174K   73%    415/993/2468             491.66
  RFS         223K   73%    379/651/1382             315.61

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-16 16:01:27 -07:00
..
9p 9p: Make sure we are able to clunk the cached fid on umount 2010-04-05 10:37:36 -05:00
bluetooth Bluetooth: Convert debug files to actually use debugfs instead of sysfs 2010-03-21 05:49:35 +01:00
caif net-caif: add CAIF Link layer device header files 2010-03-30 19:08:45 -07:00
irda tree-wide: fix typos "ass?o[sc]iac?te" -> "associate" in comments 2010-02-15 15:38:10 +01:00
iucv include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h 2010-03-30 22:02:32 +09:00
netfilter include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h 2010-03-30 22:02:32 +09:00
netns ipv4: ipmr: support multiple tables 2010-04-13 14:49:34 -07:00
phonet Phonet: zero-copy GPRS TX 2010-01-07 00:24:55 -08:00
sctp net: snmp mib cleanup 2010-03-21 18:34:16 -07:00
tc_act pkt_sched: skbedit add support for setting mark 2009-10-22 21:56:42 -07:00
tipc
act_api.h net: restore gnet_stats_basic to previous definition 2009-08-17 21:33:49 -07:00
addrconf.h net: Add checking to rcu_dereference() primitives 2010-02-25 09:41:03 +01:00
af_ieee802154.h af_ieee802154: add support for WANT_ACK socket option 2009-08-12 21:54:50 -07:00
af_rxrpc.h
af_unix.h net: Fix soft lockups/OOM issues w/ unix garbage collector 2008-11-26 15:32:27 -08:00
ah.h net: cleanup include/net 2009-11-04 05:06:25 -08:00
arp.h net: make neigh_ops constant 2009-09-01 17:40:57 -07:00
atmclip.h clip: convert to internal network_device_stats 2009-01-21 14:01:59 -08:00
ax25.h include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h 2010-03-30 22:02:32 +09:00
ax88796.h ax88796: Add method to take MAC from platform data 2009-03-24 23:32:03 -07:00
cfg80211.h cfg80211: Add local-state-change-only auth/deauth/disassoc 2010-04-07 14:37:56 -04:00
checksum.h include/net net/ - csum_partial - remove unnecessary casts 2008-11-19 15:44:53 -08:00
cipso_ipv4.h netlabel: Label incoming TCP connections correctly in SELinux 2009-03-28 15:01:36 +11:00
compat.h net: fix compat_sys_recvmmsg parameter type 2009-12-11 15:07:56 -08:00
datalink.h
dcbnl.h dcbnl: Add support for setapp/getapp to netdev dcbnl_rtnl_ops 2009-09-01 01:24:30 -07:00
dn_dev.h Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6 2009-12-08 07:55:01 -08:00
dn_fib.h decnet: Remove unused FIB metric macros. 2010-03-27 19:23:46 -07:00
dn_neigh.h
dn_nsp.h net: cleanup include/net 2009-11-04 05:06:25 -08:00
dn_route.h
dn.h decnet: compile fix for removal of byteorder wrapper 2008-11-27 23:04:13 -08:00
dsa.h dsa: add switch chip cascading support 2009-03-21 19:06:54 -07:00
dsfield.h
dst_ops.h netns: embed ip6_dst_ops directly 2009-09-01 17:40:31 -07:00
dst.h net: sk_dst_cache RCUification 2010-04-13 01:41:33 -07:00
esp.h
ethoc.h net: Add support for the OpenCores 10/100 Mbps Ethernet MAC. 2009-03-27 00:16:21 -07:00
fib_rules.h net: fib_rules: consolidate IPv4 and DECnet ->default_pref() functions. 2010-04-13 14:49:30 -07:00
flow.h flow: virtualize flow cache entry methods 2010-04-07 03:43:18 -07:00
garp.h
gen_stats.h net: cleanup include/net 2009-11-04 05:06:25 -08:00
genetlink.h net: cleanup include/net 2009-11-04 05:06:25 -08:00
icmp.h ipv4: raw: move struct raw_sock and raw_sk() to include/net/raw.h 2010-04-13 14:49:31 -07:00
ieee80211_radiotap.h wireless: update radiotap parser 2010-02-08 16:50:53 -05:00
ieee802154_netdev.h ieee802154: add an mlme_ops call to retrieve PHY object 2009-11-06 14:32:18 +03:00
ieee802154.h ieee802154: move headers out of extra directory 2009-07-23 17:08:51 +04:00
if_inet6.h ipv6: convert idev_list to list macros 2010-03-20 15:45:09 -07:00
inet_common.h
inet_connection_sock.h net: replace ipfragok with skb->local_df 2010-04-15 23:36:37 -07:00
inet_ecn.h net: replace __constant_{endian} uses in net headers 2009-02-14 22:58:35 -08:00
inet_frag.h inet fragments: fix sparse warning: context imbalance 2009-02-26 23:13:35 -08:00
inet_hashtables.h tcp: Fix a connect() race with timewait sockets 2009-12-08 20:17:51 -08:00
inet_sock.h rfs: Receive Flow Steering 2010-04-16 16:01:27 -07:00
inet_timewait_sock.h tcp: Fix a connect() race with timewait sockets 2009-12-08 20:17:51 -08:00
inet6_connection_sock.h net: replace ipfragok with skb->local_df 2010-04-15 23:36:37 -07:00
inet6_hashtables.h tcp: Fix a connect() race with timewait sockets 2009-12-08 20:17:51 -08:00
inetpeer.h inetpeer: Optimize inet_getid() 2009-11-13 20:46:58 -08:00
ip_fib.h Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 2009-11-06 00:55:55 -08:00
ip_vs.h ipvs: SCTP Trasport Loadbalancing Support 2010-02-18 12:31:05 +01:00
ip.h net: replace ipfragok with skb->local_df 2010-04-15 23:36:37 -07:00
ip6_checksum.h
ip6_fib.h ipv6 fib: Make rt6_info{} more cache-line aware. 2010-04-01 18:41:41 -07:00
ip6_route.h net: sk_dst_cache RCUification 2010-04-13 01:41:33 -07:00
ip6_tunnel.h ipv6 ip6_tunnel: eliminate unused recursion field from ip6_tnl{}. 2010-03-10 07:32:29 -08:00
ipcomp.h percpu: add __percpu sparse annotations to net 2010-02-16 23:05:38 -08:00
ipconfig.h
ipip.h net: cleanup include/net 2009-11-04 05:06:25 -08:00
ipv6.h net: replace ipfragok with skb->local_df 2010-04-15 23:36:37 -07:00
ipx.h include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h 2010-03-30 22:02:32 +09:00
iw_handler.h include/net/iw_handler.h: Use SIOCIWFIRST not SIOCSIWCOMMIT in comment 2010-03-31 14:49:12 -04:00
lapb.h
lib80211.h wireless: missing include in lib80211.h 2008-11-21 11:42:55 -05:00
llc_c_ac.h
llc_c_ev.h
llc_c_st.h
llc_conn.h llc: use a device based hash table to speed up multicast delivery 2009-12-26 20:43:57 -08:00
llc_if.h
llc_pdu.h
llc_s_ac.h
llc_s_ev.h
llc_s_st.h
llc_sap.h
llc.h llc: convert llc_sap_list to RCU 2009-12-26 20:46:28 -08:00
mac80211.h Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-next-2.6 into for-davem 2010-04-15 16:21:34 -04:00
mip6.h
ndisc.h sysctl: remove "struct file *" argument of ->proc_handler 2009-09-24 07:21:04 -07:00
neighbour.h percpu: add __percpu sparse annotations to net 2010-02-16 23:05:38 -08:00
net_namespace.h nsproxy: remove INIT_NSPROXY() 2010-03-12 15:52:40 -08:00
netdma.h net_dma: convert to dma_find_channel 2009-01-06 11:38:15 -07:00
netevent.h
netlabel.h include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h 2010-03-30 22:02:32 +09:00
netlink.h netlink: fix unaligned access in nla_get_be64() 2010-03-19 22:47:23 -07:00
netrom.h include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h 2010-03-30 22:02:32 +09:00
nexthop.h
nl802154.h ieee802154: add support for channel pages from IEEE 802.15.4-2006 2009-08-19 23:08:22 +04:00
p8022.h
pkt_cls.h net: rename skb->iif to skb->skb_iif 2009-11-20 15:35:04 -08:00
pkt_sched.h gen_estimator: deadlock fix 2010-04-01 18:38:48 -07:00
protocol.h net: drop capability from protocol definitions 2009-11-05 21:40:17 -08:00
psnap.h snap: use const for descriptor 2009-03-21 19:06:50 -07:00
raw.h ipv4: ipmr: support multiple tables 2010-04-13 14:49:34 -07:00
rawv6.h ipv6: Use correct data types for ICMPv6 type and code 2009-06-23 04:31:07 -07:00
red.h net: cleanup include/net 2009-11-04 05:06:25 -08:00
regulatory.h cfg80211: add regulatory hint disconnect support 2010-02-01 15:40:06 -05:00
request_sock.h tcp: account SYN-ACK timeouts & retransmissions 2010-01-17 19:09:39 -08:00
rose.h NET: ROSE: Don't use static buffer. 2009-07-26 19:11:14 -07:00
route.h percpu: add __percpu sparse annotations to net 2010-02-16 23:05:38 -08:00
rtnetlink.h rtnetlink: support specifying device flags on device creation 2010-02-27 02:43:40 -08:00
sch_generic.h gen_estimator: deadlock fix 2010-04-01 18:38:48 -07:00
scm.h net: cleanup include/net 2009-11-04 05:06:25 -08:00
slhc_vj.h
snmp.h net: snmp mib cleanup 2010-03-21 18:34:16 -07:00
sock.h net: sk_dst_cache RCUification 2010-04-13 01:41:33 -07:00
stp.h
tcp_states.h
tcp.h inet: Remove unused send_check length argument 2010-04-11 15:29:09 -07:00
timewait_sock.h net: Fix memory leak in the proto_register function 2008-11-21 16:45:22 -08:00
transp_v6.h inet: inet_connection_sock_af_ops const 2009-09-02 01:03:49 -07:00
udp.h udp: bind() optimisation 2009-11-10 20:54:38 -08:00
udplite.h udp: introduce struct udp_table and multiple spinlocks 2008-10-29 01:41:45 -07:00
wext.h wext: refactor 2009-10-07 16:39:43 -04:00
wimax.h Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial 2009-12-09 19:43:33 -08:00
wpan-phy.h ieee802154: add support for creation/removal of logic interfaces 2009-11-06 14:32:24 +03:00
x25.h Merge branch 'master' of /home/davem/src/GIT/linux-2.6/ 2010-04-11 02:44:30 -07:00
x25device.h
xfrm.h Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 2010-04-11 14:53:53 -07:00