linux/net
Stefano Brivio 3c4287f620 nf_tables: Add set type for arbitrary concatenation of ranges
This new set type allows for intervals in concatenated fields,
which are expressed in the usual way, that is, simple byte
concatenation with padding to 32 bits for single fields, and
given as ranges by specifying start and end elements containing,
each, the full concatenation of start and end values for the
single fields.

Ranges are expanded to composing netmasks, for each field: these
are inserted as rules in per-field lookup tables. Bits to be
classified are divided in 4-bit groups, and for each group, the
lookup table contains 4^2 buckets, representing all the possible
values of a bit group. This approach was inspired by the Grouper
algorithm:
	http://www.cse.usf.edu/~ligatti/projects/grouper/

Matching is performed by a sequence of AND operations between
bucket values, with buckets selected according to the value of
packet bits, for each group. The result of this sequence tells
us which rules matched for a given field.

In order to concatenate several ranged fields, per-field rules
are mapped using mapping arrays, one per field, that specify
which rules should be considered while matching the next field.
The mapping array for the last field contains a reference to
the element originally inserted.

The notes in nft_set_pipapo.c cover the algorithm in deeper
detail.

A pure hash-based approach is of no use here, as ranges need
to be classified. An implementation based on "proxying" the
existing red-black tree set type, creating a tree for each
field, was considered, but deemed impractical due to the fact
that elements would need to be shared between trees, at least
as long as we want to keep UAPI changes to a minimum.

A stand-alone implementation of this algorithm is available at:
	https://pipapo.lameexcu.se
together with notes about possible future optimisations
(in pipapo.c).

This algorithm was designed with data locality in mind, and can
be highly optimised for SIMD instruction sets, as the bulk of
the matching work is done with repetitive, simple bitwise
operations.

At this point, without further optimisations, nft_concat_range.sh
reports, for one AMD Epyc 7351 thread (2.9GHz, 512 KiB L1D$, 8 MiB
L2$):

TEST: performance
  net,port                                                      [ OK ]
    baseline (drop from netdev hook):              10190076pps
    baseline hash (non-ranged entries):             6179564pps
    baseline rbtree (match on first field only):    2950341pps
    set with  1000 full, ranged entries:            2304165pps
  port,net                                                      [ OK ]
    baseline (drop from netdev hook):              10143615pps
    baseline hash (non-ranged entries):             6135776pps
    baseline rbtree (match on first field only):    4311934pps
    set with   100 full, ranged entries:            4131471pps
  net6,port                                                     [ OK ]
    baseline (drop from netdev hook):               9730404pps
    baseline hash (non-ranged entries):             4809557pps
    baseline rbtree (match on first field only):    1501699pps
    set with  1000 full, ranged entries:            1092557pps
  port,proto                                                    [ OK ]
    baseline (drop from netdev hook):              10812426pps
    baseline hash (non-ranged entries):             6929353pps
    baseline rbtree (match on first field only):    3027105pps
    set with 30000 full, ranged entries:             284147pps
  net6,port,mac                                                 [ OK ]
    baseline (drop from netdev hook):               9660114pps
    baseline hash (non-ranged entries):             3778877pps
    baseline rbtree (match on first field only):    3179379pps
    set with    10 full, ranged entries:            2082880pps
  net6,port,mac,proto                                           [ OK ]
    baseline (drop from netdev hook):               9718324pps
    baseline hash (non-ranged entries):             3799021pps
    baseline rbtree (match on first field only):    1506689pps
    set with  1000 full, ranged entries:             783810pps
  net,mac                                                       [ OK ]
    baseline (drop from netdev hook):              10190029pps
    baseline hash (non-ranged entries):             5172218pps
    baseline rbtree (match on first field only):    2946863pps
    set with  1000 full, ranged entries:            1279122pps

v4:
 - fix build for 32-bit architectures: 64-bit division needs
   div_u64() (kbuild test robot <lkp@intel.com>)
v3:
 - rework interface for field length specification,
   NFT_SET_SUBKEY disappears and information is stored in
   description
 - remove scratch area to store closing element of ranges,
   as elements now come with an actual attribute to specify
   the upper range limit (Pablo Neira Ayuso)
 - also remove pointer to 'start' element from mapping table,
   closing key is now accessible via extension data
 - use bytes right away instead of bits for field lengths,
   this way we can also double the inner loop of the lookup
   function to take care of upper and lower bits in a single
   iteration (minor performance improvement)
 - make it clearer that set operations are actually atomic
   API-wise, but we can't e.g. implement flush() as one-shot
   action
 - fix type for 'dup' in nft_pipapo_insert(), check for
   duplicates only in the next generation, and in general take
   care of differentiating generation mask cases depending on
   the operation (Pablo Neira Ayuso)
 - report C implementation matching rate in commit message, so
   that AVX2 implementation can be compared (Pablo Neira Ayuso)
v2:
 - protect access to scratch maps in nft_pipapo_lookup() with
   local_bh_disable/enable() (Florian Westphal)
 - drop rcu_read_lock/unlock() from nft_pipapo_lookup(), it's
   already implied (Florian Westphal)
 - explain why partial allocation failures don't need handling
   in pipapo_realloc_scratch(), rename 'm' to clone and update
   related kerneldoc to make it clear we're not operating on
   the live copy (Florian Westphal)
 - add expicit check for priv->start_elem in
   nft_pipapo_insert() to avoid ending up in nft_pipapo_walk()
   with a NULL start element, and also zero it out in every
   operation that might make it invalid, so that insertion
   doesn't proceed with an invalid element (Florian Westphal)

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-01-27 08:54:30 +01:00
..
6lowpan
9p 9p pull request for inclusion in 5.4 2019-09-27 15:10:34 -07:00
802 treewide: Use sizeof_field() macro 2019-12-09 10:36:44 -08:00
8021q Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2020-01-09 12:13:43 -08:00
appletalk appletalk: enforce CAP_NET_RAW for raw sockets 2019-09-24 16:37:18 +02:00
atm Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2020-01-26 10:40:21 +01:00
ax25 net: Make sock protocol value checks more specific 2020-01-09 18:41:40 -08:00
batman-adv Merge ra.kernel.org:/pub/scm/linux/kernel/git/netdev/net 2020-01-19 22:10:04 +01:00
bluetooth netdev: pass the stuck queue to the timeout handler 2019-12-12 21:38:57 -08:00
bpf bpf: Allow to change skb mark in test_run 2019-12-18 17:05:58 -08:00
bpfilter
bridge net: bridge: vlan: add per-vlan state 2020-01-24 12:58:14 +01:00
caif caif_usb: fix spelling mistake "to" -> "too" 2020-01-24 08:12:06 +01:00
can can: j1939: j1939_sk_bind(): take priv after lock is held 2019-12-08 11:52:02 +01:00
ceph libceph, rbd, ceph: convert to use the new mount API 2019-11-27 22:28:37 +01:00
core Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2020-01-26 10:40:21 +01:00
dcb
dccp treewide: Use sizeof_field() macro 2019-12-09 10:36:44 -08:00
decnet net: Make sock protocol value checks more specific 2020-01-09 18:41:40 -08:00
dns_resolver
dsa Merge ra.kernel.org:/pub/scm/linux/kernel/git/netdev/net 2020-01-19 22:10:04 +01:00
ethernet net: add annotations on hh->hh_len lockless accesses 2019-11-07 20:07:30 -08:00
ethtool ethtool: potential NULL dereference in strset_prepare_data() 2020-01-08 16:03:35 -08:00
hsr net/hsr: remove seq_nr_after_or_eq 2020-01-21 12:03:21 +01:00
ieee802154 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2019-11-02 13:54:56 -07:00
ife net: Fix Kconfig indentation 2019-09-26 08:56:17 +02:00
ipv4 tcp: export count for rehash attempts 2020-01-26 15:28:47 +01:00
ipv6 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2020-01-26 10:40:21 +01:00
iucv treewide: Use sizeof_field() macro 2019-12-09 10:36:44 -08:00
kcm kcm: disable preemption in kcm_parse_func_strparser() 2019-09-27 10:27:14 +02:00
key
l2tp l2tp: Remove redundant BUG_ON() check in l2tp_pernet 2020-01-03 12:26:07 -08:00
l3mdev
lapb
llc llc2: Fix return statement of llc_stat_ev_rx_null_dsap_xid_c (and _test_c) 2019-12-20 21:19:36 -08:00
mac80211 Merge ra.kernel.org:/pub/scm/linux/kernel/git/netdev/net 2020-01-19 22:10:04 +01:00
mac802154
mpls net: ipv6_stub: use ip6_dst_lookup_flow instead of ip6_dst_lookup 2019-12-04 12:27:13 -08:00
mptcp mptcp: Fix code formatting 2020-01-25 08:14:39 +01:00
ncsi net/ncsi: Support for multi host mellanox card 2020-01-09 18:36:22 -08:00
netfilter nf_tables: Add set type for arbitrary concatenation of ranges 2020-01-27 08:54:30 +01:00
netlabel
netlink treewide: Use sizeof_field() macro 2019-12-09 10:36:44 -08:00
netrom net: core: add generic lockdep keys 2019-10-24 14:53:48 -07:00
nfc net: nfc: nci: fix a possible sleep-in-atomic-context bug in nci_uart_tty_receive() 2019-12-18 11:57:33 -08:00
nsh
openvswitch net: openvswitch: use skb_list_walk_safe helper for gso segments 2020-01-14 11:48:41 -08:00
packet af_packet: refactoring code for prb_calc_retire_blk_tmo 2019-12-26 15:19:59 -08:00
phonet net: Remove redundant BUG_ON() check in phonet_pernet 2020-01-03 12:25:50 -08:00
psample net: psample: fix skb_over_panic 2019-11-26 14:40:13 -08:00
qrtr net: qrtr: Remove receive worker 2020-01-14 18:36:42 -08:00
rds net/rds: Use prefetch for On-Demand-Paging MR 2020-01-18 11:48:19 +02:00
rfkill rfkill: Fix incorrect check to avoid NULL pointer dereference 2019-12-16 10:15:49 +01:00
rose Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2020-01-26 10:40:21 +01:00
rxrpc RxRPC fixes 2019-12-24 16:12:47 -08:00
sched Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2020-01-26 10:40:21 +01:00
sctp Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2020-01-09 12:13:43 -08:00
smc net/smc: allow unprivileged users to read pnet table 2020-01-21 11:39:56 +01:00
strparser
sunrpc xprtrdma: Fix oops in Receive handler after device removal 2020-01-14 13:30:24 -05:00
switchdev
tipc Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2020-01-09 12:13:43 -08:00
tls Merge ra.kernel.org:/pub/scm/linux/kernel/git/netdev/net 2020-01-19 22:10:04 +01:00
unix Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next 2020-01-21 12:18:20 +01:00
vmw_vsock Merge ra.kernel.org:/pub/scm/linux/kernel/git/netdev/net 2020-01-19 22:10:04 +01:00
wimax
wireless Merge ra.kernel.org:/pub/scm/linux/kernel/git/netdev/net 2020-01-19 22:10:04 +01:00
x25 net/x25: fix nonblocking connect 2020-01-09 18:39:33 -08:00
xdp xsk, net: Make sock_def_readable() have external linkage 2020-01-22 00:08:52 +01:00
xfrm Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2020-01-26 10:40:21 +01:00
compat.c y2038: socket: use __kernel_old_timespec instead of timespec 2019-11-15 14:38:29 +01:00
Kconfig mptcp: Add MPTCP socket stubs 2020-01-24 13:44:07 +01:00
Makefile mptcp: Add MPTCP socket stubs 2020-01-24 13:44:07 +01:00
socket.c socket: fix unused-function warning 2020-01-08 15:02:21 -08:00
sysctl_net.c