linux/net
Willem de Bruijn 2ccdbaa6d5 packet: rollover lock contention avoidance
Rollover has to call packet_rcv_has_room on sockets in the fanout
group to find a socket to migrate to. This operation is expensive
especially if the packet sockets use rings, when a lock has to be
acquired.

Avoid pounding on the lock by all sockets by temporarily marking a
socket as "under memory pressure" when such pressure is detected.
While set, only the socket owner may call packet_rcv_has_room on the
socket. Once it detects normal conditions, it clears the flag. The
socket is not used as a victim by any other socket in the meantime.

Under reasonably balanced load, each socket writer frequently calls
packet_rcv_has_room and clears its own pressure field. As a backup
for when the socket is rarely written to, also clear the flag on
reading (packet_recvmsg, packet_poll) if this can be done cheaply
(i.e., without calling packet_rcv_has_room). This is only for
edge cases.

Tested:
  Ran bench_rollover: a process with 8 sockets in a single fanout
  group, each pinned to a single cpu that receives one nic recv
  interrupt. RPS and RFS are disabled. The benchmark uses packet
  rx_ring, which has to take a lock when determining whether a
  socket has room.

  Sent 3.5 Mpps of UDP traffic with sufficient entropy to spread
  uniformly across the packet sockets (and inserted an iptables
  rule to drop in PREROUTING to avoid protocol stack processing).

  Without this patch, all sockets try to migrate traffic to
  neighbors, causing lock contention when searching for a non-
  empty neighbor. The lock is the top 9 entries.

    perf record -a -g sleep 5

    -  17.82%   bench_rollover  [kernel.kallsyms]    [k] _raw_spin_lock
       - _raw_spin_lock
          - 99.00% spin_lock
    	 + 81.77% packet_rcv_has_room.isra.41
    	 + 18.23% tpacket_rcv
          + 0.84% packet_rcv_has_room.isra.41
    +   5.20%      ksoftirqd/6  [kernel.kallsyms]    [k] _raw_spin_lock
    +   5.15%      ksoftirqd/1  [kernel.kallsyms]    [k] _raw_spin_lock
    +   5.14%      ksoftirqd/2  [kernel.kallsyms]    [k] _raw_spin_lock
    +   5.12%      ksoftirqd/7  [kernel.kallsyms]    [k] _raw_spin_lock
    +   5.12%      ksoftirqd/5  [kernel.kallsyms]    [k] _raw_spin_lock
    +   5.10%      ksoftirqd/4  [kernel.kallsyms]    [k] _raw_spin_lock
    +   4.66%      ksoftirqd/0  [kernel.kallsyms]    [k] _raw_spin_lock
    +   4.45%      ksoftirqd/3  [kernel.kallsyms]    [k] _raw_spin_lock
    +   1.55%   bench_rollover  [kernel.kallsyms]    [k] packet_rcv_has_room.isra.41

  On net-next with this patch, this lock contention is no longer a
  top entry. Most time is spent in the actual read function. Next up
  are other locks:

    +  15.52%  bench_rollover  bench_rollover     [.] reader
    +   4.68%         swapper  [kernel.kallsyms]  [k] memcpy_erms
    +   2.77%         swapper  [kernel.kallsyms]  [k] packet_lookup_frame.isra.51
    +   2.56%     ksoftirqd/1  [kernel.kallsyms]  [k] memcpy_erms
    +   2.16%         swapper  [kernel.kallsyms]  [k] tpacket_rcv
    +   1.93%         swapper  [kernel.kallsyms]  [k] mlx4_en_process_rx_cq

  Looking closer at the remaining _raw_spin_lock, the cost of probing
  in rollover is now comparable to the cost of taking the lock later
  in tpacket_rcv.

    -   1.51%         swapper  [kernel.kallsyms]  [k] _raw_spin_lock
       - _raw_spin_lock
          + 33.41% packet_rcv_has_room
          + 28.15% tpacket_rcv
          + 19.54% enqueue_to_backlog
          + 6.45% __free_pages_ok
          + 2.78% packet_rcv_fanout
          + 2.13% fanout_demux_rollover
          + 2.01% netif_receive_skb_internal

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-13 15:43:00 -04:00
..
6lowpan 6lowpan: nhc: add other known rfc6282 compressions 2015-02-14 23:08:44 +01:00
9p 9p: patches for 4.1 merge window 2015-04-18 17:45:30 -04:00
802 net: Kill dev_rebuild_header 2015-03-02 16:43:41 -05:00
8021q vlan: implement ndo_get_iflink 2015-04-02 14:05:00 -04:00
appletalk net: Pass kern from net_proto_family.create to sk_alloc 2015-05-11 10:50:17 -04:00
atm net: Pass kern from net_proto_family.create to sk_alloc 2015-05-11 10:50:17 -04:00
ax25 net: Pass kern from net_proto_family.create to sk_alloc 2015-05-11 10:50:17 -04:00
batman-adv dev: introduce dev_get_iflink() 2015-04-02 14:04:59 -04:00
bluetooth Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2015-05-13 14:31:43 -04:00
bridge switchdev: don't use anonymous union on switchdev attr/obj structs 2015-05-13 14:20:59 -04:00
caif net: Pass kern from net_proto_family.create to sk_alloc 2015-05-11 10:50:17 -04:00
can net: Pass kern from net_proto_family.create to sk_alloc 2015-05-11 10:50:17 -04:00
ceph net: Add a struct net parameter to sock_create_kern 2015-05-11 10:50:17 -04:00
core flow_dissector: change port array into src, dst tuple 2015-05-13 15:19:47 -04:00
dcb net/dcb: Add IEEE QCN attribute 2015-03-06 21:50:02 -05:00
dccp inet: fix possible panic in reqsk_queue_unlink() 2015-04-24 11:39:15 -04:00
decnet net: Pass kern from net_proto_family.create to sk_alloc 2015-05-11 10:50:17 -04:00
dns_resolver
dsa switchdev: don't use anonymous union on switchdev attr/obj structs 2015-05-13 14:20:59 -04:00
ethernet flow_dissect: use programable dissector in skb_flow_dissect and friends 2015-05-13 15:19:47 -04:00
hsr net/hsr: Fix NULL pointer dereference and refcnt bugs when deleting a HSR interface. 2015-03-01 13:40:23 -05:00
ieee802154 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2015-05-13 14:31:43 -04:00
ipv4 ipv4: __ip_local_out_sk() is static 2015-05-13 15:21:33 -04:00
ipv6 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2015-05-13 14:31:43 -04:00
ipx net: Pass kern from net_proto_family.create to sk_alloc 2015-05-11 10:50:17 -04:00
irda net: Pass kern from net_proto_family.create to sk_alloc 2015-05-11 10:50:17 -04:00
iucv net: Pass kern from net_proto_family.create to sk_alloc 2015-05-11 10:50:17 -04:00
key net: Pass kern from net_proto_family.create to sk_alloc 2015-05-11 10:50:17 -04:00
l2tp net: Modify sk_alloc to not reference count the netns of kernel sockets. 2015-05-11 10:50:18 -04:00
lapb lapb: move EXPORT_SYMBOL after functions. 2014-10-24 15:51:42 -04:00
llc net: Pass kern from net_proto_family.create to sk_alloc 2015-05-11 10:50:17 -04:00
mac80211 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2015-05-13 14:31:43 -04:00
mac802154 Merge branch 'for-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth 2015-05-09 15:51:00 -04:00
mpls mpls: Change reserved label names to be consistent with netbsd 2015-05-09 22:29:50 -04:00
netfilter net: Modify sk_alloc to not reference count the netns of kernel sockets. 2015-05-11 10:50:18 -04:00
netlabel netlink: implement nla_put_in_addr and nla_put_in6_addr 2015-03-31 13:58:35 -04:00
netlink Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2015-05-13 14:31:43 -04:00
netrom net: Pass kern from net_proto_family.create to sk_alloc 2015-05-11 10:50:17 -04:00
nfc net: Pass kern from net_proto_family.create to sk_alloc 2015-05-11 10:50:17 -04:00
openvswitch openvswitch: Use eth_proto_is_802_3 2015-05-05 19:24:42 -04:00
packet packet: rollover lock contention avoidance 2015-05-13 15:43:00 -04:00
phonet net: Pass kern from net_proto_family.create to sk_alloc 2015-05-11 10:50:17 -04:00
rds Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2015-05-13 14:31:43 -04:00
rfkill Last round of updates for net-next: 2015-02-04 14:57:45 -08:00
rose net: Pass kern from net_proto_family.create to sk_alloc 2015-05-11 10:50:17 -04:00
rxrpc net: Pass kern from net_proto_family.create to sk_alloc 2015-05-11 10:50:17 -04:00
sched tc: introduce Flower classifier 2015-05-13 15:19:48 -04:00
sctp net: Pass kern from net_proto_family.create to sk_alloc 2015-05-11 10:50:17 -04:00
sunrpc svcrpc: fix potential GSSX_ACCEPT_SEC_CONTEXT decoding failures 2015-05-04 12:02:40 -04:00
switchdev switchdev: don't use anonymous union on switchdev attr/obj structs 2015-05-13 14:20:59 -04:00
tipc net: Pass kern from net_proto_family.create to sk_alloc 2015-05-11 10:50:17 -04:00
unix net: Pass kern from net_proto_family.create to sk_alloc 2015-05-11 10:50:17 -04:00
vmw_vsock net: Pass kern from net_proto_family.create to sk_alloc 2015-05-11 10:50:17 -04:00
wimax
wireless cfg80211: change GO_CONCURRENT to IR_CONCURRENT for STA 2015-05-06 15:50:02 +02:00
x25 net: Pass kern from net_proto_family.create to sk_alloc 2015-05-11 10:50:17 -04:00
xfrm net: make skb_dst_pop routine static 2015-05-12 23:19:49 -04:00
compat.c net: switch importing msghdr from userland to {compat_,}import_iovec() 2015-04-09 00:02:26 -04:00
Kconfig kconfig: use bool instead of boolean for type definition attributes 2015-01-07 13:08:04 +01:00
Makefile mpls: Refactor how the mpls module is built 2015-03-04 00:26:06 -05:00
socket.c net: Add a struct net parameter to sock_create_kern 2015-05-11 10:50:17 -04:00
sysctl_net.c