linux/net
Daniel Borkmann 1f211a1b92 net, sched: add clsact qdisc
This work adds a generalization of the ingress qdisc as a qdisc holding
only classifiers. The clsact qdisc works on ingress, but also on egress.
In both cases, it's execution happens without taking the qdisc lock, and
the main difference for the egress part compared to prior version of [1]
is that this can be applied with _any_ underlying real egress qdisc (also
classless ones).

Besides solving the use-case of [1], that is, allowing for more programmability
on assigning skb->priority for the mqprio case that is supported by most
popular 10G+ NICs, it also opens up a lot more flexibility for other tc
applications. The main work on classification can already be done at clsact
egress time if the use-case allows and state stored for later retrieval
f.e. again in skb->priority with major/minors (which is checked by most
classful qdiscs before consulting tc_classify()) and/or in other skb fields
like skb->tc_index for some light-weight post-processing to get to the
eventual classid in case of a classful qdisc. Another use case is that
the clsact egress part allows to have a central egress counterpart to
the ingress classifiers, so that classifiers can easily share state (e.g.
in cls_bpf via eBPF maps) for ingress and egress.

Currently, default setups like mq + pfifo_fast would require for this to
use, for example, prio qdisc instead (to get a tc_classify() run) and to
duplicate the egress classifier for each queue. With clsact, it allows
for leaving the setup as is, it can additionally assign skb->priority to
put the skb in one of pfifo_fast's bands and it can share state with maps.
Moreover, we can access the skb's dst entry (f.e. to retrieve tclassid)
w/o the need to perform a skb_dst_force() to hold on to it any longer. In
lwt case, we can also use this facility to setup dst metadata via cls_bpf
(bpf_skb_set_tunnel_key()) without needing a real egress qdisc just for
that (case of IFF_NO_QUEUE devices, for example).

The realization can be done without any changes to the scheduler core
framework. All it takes is that we have two a-priori defined minors/child
classes, where we can mux between ingress and egress classifier list
(dev->ingress_cl_list and dev->egress_cl_list, latter stored close to
dev->_tx to avoid extra cacheline miss for moderate loads). The egress
part is a bit similar modelled to handle_ing() and patched to a noop in
case the functionality is not used. Both handlers are now called
sch_handle_ingress() and sch_handle_egress(), code sharing among the two
doesn't seem practical as there are various minor differences in both
paths, so that making them conditional in a single handler would rather
slow things down.

Full compatibility to ingress qdisc is provided as well. Since both
piggyback on TC_H_CLSACT, only one of them (ingress/clsact) can exist
per netdevice, and thus ingress qdisc specific behaviour can be retained
for user space. This means, either a user does 'tc qdisc add dev foo ingress'
and configures ingress qdisc as usual, or the 'tc qdisc add dev foo clsact'
alternative, where both, ingress and egress classifier can be configured
as in the below example. ingress qdisc supports attaching classifier to any
minor number whereas clsact has two fixed minors for muxing between the
lists, therefore to not break user space setups, they are better done as
two separate qdiscs.

I decided to extend the sch_ingress module with clsact functionality so
that commonly used code can be reused, the module is being aliased with
sch_clsact so that it can be auto-loaded properly. Alternative would have been
to add a flag when initializing ingress to alter its behaviour plus aliasing
to a different name (as it's more than just ingress). However, the first would
end up, based on the flag, choosing the new/old behaviour by calling different
function implementations to handle each anyway, the latter would require to
register ingress qdisc once again under different alias. So, this really begs
to provide a minimal, cleaner approach to have Qdisc_ops and Qdisc_class_ops
by its own that share callbacks used by both.

Example, adding qdisc:

   # tc qdisc add dev foo clsact
   # tc qdisc show dev foo
   qdisc mq 0: root
   qdisc pfifo_fast 0: parent :1 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
   qdisc pfifo_fast 0: parent :2 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
   qdisc pfifo_fast 0: parent :3 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
   qdisc pfifo_fast 0: parent :4 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
   qdisc clsact ffff: parent ffff:fff1

Adding filters (deleting, etc works analogous by specifying ingress/egress):

   # tc filter add dev foo ingress bpf da obj bar.o sec ingress
   # tc filter add dev foo egress  bpf da obj bar.o sec egress
   # tc filter show dev foo ingress
   filter protocol all pref 49152 bpf
   filter protocol all pref 49152 bpf handle 0x1 bar.o:[ingress] direct-action
   # tc filter show dev foo egress
   filter protocol all pref 49152 bpf
   filter protocol all pref 49152 bpf handle 0x1 bar.o:[egress] direct-action

A 'tc filter show dev foo' or 'tc filter show dev foo parent ffff:' will
show an empty list for clsact. Either using the parent names (ingress/egress)
or specifying the full major/minor will then show the related filter lists.

Prior work on a mqprio prequeue() facility [1] was done mainly by John Fastabend.

  [1] http://patchwork.ozlabs.org/patch/512949/

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-10 22:13:15 -05:00
..
6lowpan 6lowpan: fix debugfs interface entry name 2015-12-20 08:21:00 +01:00
9p IB/cma: Add support for network namespaces 2015-10-28 12:32:48 -04:00
802
8021q net: Rename NETIF_F_ALL_CSUM to NETIF_F_CSUM_MASK 2015-12-15 16:50:08 -05:00
appletalk
atm net: Generalise wq_has_sleeper helper 2015-11-30 14:47:33 -05:00
ax25 net: add validation for the socket syscall protocol argument 2015-12-14 16:09:30 -05:00
batman-adv batman-adv: Add kerneldoc for batadv_neigh_node::refcount 2016-01-09 20:56:00 +08:00
bluetooth Bluetooth: avoid rebuilding hci_sock all the time 2016-01-06 16:36:44 +01:00
bridge bridge: Reflect MDB entries to hardware 2016-01-10 16:50:21 -05:00
caif net: rename SOCK_ASYNC_NOSPACE and SOCK_ASYNC_WAITDATA 2015-12-01 15:45:05 -05:00
can can: avoid using timeval for uapi 2015-10-13 17:42:34 +02:00
ceph Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client 2015-11-13 09:24:40 -08:00
core net, sched: add clsact qdisc 2016-01-10 22:13:15 -05:00
dcb net/dcb: make dcbnl.c explicitly non-modular 2015-10-09 07:52:27 -07:00
dccp Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2015-12-03 21:09:12 -05:00
decnet net: add validation for the socket syscall protocol argument 2015-12-14 16:09:30 -05:00
dns_resolver net: dns_resolver: convert time_t to time64_t 2015-11-18 16:27:46 -05:00
dsa dsa: Register netdev before phy 2016-01-07 14:31:26 -05:00
ethernet net: Add eth_platform_get_mac_address() helper. 2016-01-06 16:31:56 -05:00
hsr net/hsr: fix a warning message 2015-11-23 14:56:15 -05:00
ieee802154 inet: kill unused skb_free op 2016-01-05 22:25:57 -05:00
ipv4 ipv4: Namespecify the tcp_keepalive_intvl sysctl knob 2016-01-10 17:32:09 -05:00
ipv6 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next 2016-01-08 20:53:16 -05:00
ipx
irda net: add validation for the socket syscall protocol argument 2015-12-14 16:09:30 -05:00
iucv iucv: call skb_linearize() when needed 2015-12-14 16:16:44 -05:00
key af_key: fix two typos 2015-10-23 03:05:19 -07:00
l2tp l2tp: rely on ppp layer for skb scrubbing 2016-01-04 16:45:24 -05:00
l3mdev net: Add netif_is_l3_slave 2015-10-07 04:27:43 -07:00
lapb
llc tcp: fix recv with flags MSG_WAITALL | MSG_PEEK 2015-07-27 01:06:53 -07:00
mac80211 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2015-12-17 22:08:28 -05:00
mac802154 mac802154: constify ieee802154_llsec_ops structure 2016-01-04 20:40:41 +01:00
mpls Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2015-12-17 22:08:28 -05:00
netfilter Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next 2016-01-08 20:53:16 -05:00
netlabel
netlink netlink: add a start callback for starting a netlink dump 2015-12-15 23:25:20 -05:00
netrom
nfc NFC 4.5 pull request 2016-01-04 21:48:15 -05:00
openvswitch Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2015-12-31 18:20:10 -05:00
packet packet: Allow packets with only a header (but no payload) 2015-11-29 22:17:17 -05:00
phonet
rds RDS: don't pretend to use cpu notifiers 2015-12-22 15:23:05 -05:00
rfkill Bluetooth: hci_bcm: move all Broadcom ACPI IDs to BCM HCI driver 2016-01-04 19:22:05 +01:00
rose
rxrpc Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2015-12-03 21:09:12 -05:00
sched net, sched: add clsact qdisc 2016-01-10 22:13:15 -05:00
sctp sctp: remove the local_bh_disable/enable in sctp_endpoint_lookup_assoc 2016-01-05 12:24:02 -05:00
sunrpc sched/wait: Fix the signal handling fix 2015-12-13 14:30:59 -08:00
switchdev switchdev: Adding MDB entry offload 2016-01-10 16:50:20 -05:00
tipc ip_tunnel: Move stats update to iptunnel_xmit() 2015-12-25 23:32:23 -05:00
unix Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2016-01-06 22:54:18 -05:00
vmw_vsock Revert "Merge branch 'vsock-virtio'" 2015-12-08 21:55:49 -05:00
wimax net:wimax: Fix doucble word "the the" in networking.xml 2015-08-09 22:43:52 -07:00
wireless Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2015-12-17 22:08:28 -05:00
x25
xfrm Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec 2015-12-22 16:26:31 -05:00
compat.c
Kconfig net, sched: add clsact qdisc 2016-01-10 22:13:15 -05:00
Makefile net: Introduce L3 Master device abstraction 2015-09-29 20:40:32 -07:00
socket.c net, socket, socket_wq: fix missing initialization of flags 2015-12-30 16:38:01 -05:00
sysctl_net.c net: sysctl: fix a kmemleak warning 2015-10-23 06:22:08 -07:00