.. SPDX-License-Identifier: GPL-2.0

============================
BPF_PROG_TYPE_FLOW_DISSECTOR
============================

Overview
========

The flow dissector is a routine that parses metadata out of packets. It is
used in various places in the networking subsystem (RFS, flow hash, etc.).

The BPF flow dissector is an attempt to reimplement the C-based flow dissector
logic in BPF in order to gain the benefits of the BPF verifier (namely, limits
on the number of instructions and tail calls).

API
===

BPF flow dissector programs operate on an ``__sk_buff``. However, only a
limited set of fields is allowed: ``data``, ``data_end`` and ``flow_keys``.
``flow_keys`` is a ``struct bpf_flow_keys`` and contains the flow dissector
input and output arguments.

The inputs are:

* ``nhoff`` - initial offset of the networking header
* ``thoff`` - initial offset of the transport header, initialized to nhoff
* ``n_proto`` - L3 protocol type, parsed out of L2 header
* ``flags`` - optional flags

The flow dissector BPF program should fill out the rest of the
``struct bpf_flow_keys`` fields. The input arguments ``nhoff``, ``thoff`` and
``n_proto`` should also be adjusted accordingly.

The return code of the BPF program is either ``BPF_OK`` to indicate successful
dissection, or ``BPF_DROP`` to indicate a parsing error.

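The input/output contract can be illustrated with a minimal userspace sketch.
It is not BPF code and not the reference implementation:
``struct sketch_flow_keys`` is a hypothetical, trimmed-down stand-in for
``struct bpf_flow_keys``, and the ``SKETCH_*`` names are made up for this
example (the return values are assumed to match ``BPF_OK`` and ``BPF_DROP``).
The function reads the inputs, bounds-checks every access against
``data_end``, fills in the outputs and advances ``thoff`` past the IPv4 header.

.. code:: c

  #include <stddef.h>
  #include <stdint.h>
  #include <string.h>
  #include <arpa/inet.h>  /* htons */

  /* Assumed to match BPF_OK and BPF_DROP. */
  enum { SKETCH_BPF_OK = 0, SKETCH_BPF_DROP = 2 };

  #define SKETCH_ETH_P_IP 0x0800

  /* Hypothetical, trimmed-down stand-in for struct bpf_flow_keys. */
  struct sketch_flow_keys {
      uint16_t nhoff;      /* in/out: offset of the network header */
      uint16_t thoff;      /* in/out: offset of the transport header */
      uint16_t addr_proto; /* out: address family as an EtherType */
      uint8_t  ip_proto;   /* out: L4 protocol number */
      uint16_t n_proto;    /* in/out: L3 protocol, network byte order */
      uint32_t ipv4_src;   /* out: network byte order */
      uint32_t ipv4_dst;   /* out: network byte order */
  };

  /* Dissect the IPv4 header located at data + keys->nhoff. */
  static int sketch_dissect_ipv4(const uint8_t *data, const uint8_t *data_end,
                                 struct sketch_flow_keys *keys)
  {
      size_t len = (size_t)(data_end - data);
      const uint8_t *iph;
      size_t ihl;

      if (keys->n_proto != htons(SKETCH_ETH_P_IP))
          return SKETCH_BPF_DROP;
      /* The fixed part of the header must lie inside the packet. */
      if (keys->nhoff + 20u > len)
          return SKETCH_BPF_DROP;

      iph = data + keys->nhoff;
      if ((iph[0] >> 4) != 4)
          return SKETCH_BPF_DROP;
      ihl = (size_t)(iph[0] & 0x0f) * 4;
      if (ihl < 20 || keys->nhoff + ihl > len)
          return SKETCH_BPF_DROP;

      keys->addr_proto = SKETCH_ETH_P_IP;
      keys->ip_proto = iph[9];
      memcpy(&keys->ipv4_src, iph + 12, 4);
      memcpy(&keys->ipv4_dst, iph + 16, 4);
      /* The transport header starts right after the IPv4 header. */
      keys->thoff = (uint16_t)(keys->nhoff + ihl);
      return SKETCH_BPF_OK;
  }

A real program would go on to parse the transport header at ``thoff``. In BPF
the same bounds checks are not optional, since the verifier rejects any packet
access it cannot prove to be within ``data_end``.
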
__sk_buff->data
===============

In the VLAN-less case, this is what the initial state of the BPF flow
dissector looks like::

  +------+------+------------+-----------+
  | DMAC | SMAC | ETHER_TYPE | L3_HEADER |
  +------+------+------------+-----------+
                              ^
                              |
                              +-- flow dissector starts here


.. code:: c

  skb->data + flow_keys->nhoff point to the first byte of L3_HEADER
  flow_keys->thoff = nhoff
  flow_keys->n_proto = ETHER_TYPE

In the VLAN case, the flow dissector can be called in two different states.

Pre-VLAN parsing::

  +------+------+------+-----+-----------+-----------+
  | DMAC | SMAC | TPID | TCI |ETHER_TYPE | L3_HEADER |
  +------+------+------+-----+-----------+-----------+
                        ^
                        |
                        +-- flow dissector starts here

.. code:: c

  skb->data + flow_keys->nhoff point to the first byte of TCI
  flow_keys->thoff = nhoff
  flow_keys->n_proto = TPID

Please note that TPID can be 802.1AD and, hence, the BPF program would
have to parse VLAN information twice for double-tagged packets.


Post-VLAN parsing::

  +------+------+------+-----+-----------+-----------+
  | DMAC | SMAC | TPID | TCI |ETHER_TYPE | L3_HEADER |
  +------+------+------+-----+-----------+-----------+
                                          ^
                                          |
                                          +-- flow dissector starts here

.. code:: c

  skb->data + flow_keys->nhoff point to the first byte of L3_HEADER
  flow_keys->thoff = nhoff
  flow_keys->n_proto = ETHER_TYPE

In this case the VLAN information has been processed before the flow dissector
is called, and the BPF flow dissector is not required to handle it.


The takeaway here is as follows: a BPF flow dissector program can be called
with or without a VLAN header and should handle both cases gracefully: when a
single or double VLAN tag is present and when none is present. The same
program can be called in either case, so it has to be written carefully to
handle both.


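One way to cover all of these entry states with a single code path is to skip
VLAN tags only while ``n_proto`` holds a VLAN TPID. The following is a
userspace sketch under that assumption, not the reference implementation;
``sketch_skip_vlan`` and the ``SKETCH_*`` constants are hypothetical names.
It relies on the layout shown above: in the pre-VLAN state ``nhoff`` points at
the 2-byte TCI, which is followed by the 2-byte encapsulated EtherType.

.. code:: c

  #include <stddef.h>
  #include <stdint.h>
  #include <arpa/inet.h>  /* htons, ntohs */

  #define SKETCH_ETH_P_8021Q  0x8100
  #define SKETCH_ETH_P_8021AD 0x88A8
  #define SKETCH_VLAN_HLEN    4  /* TCI + encapsulated EtherType */

  static int sketch_is_vlan(uint16_t proto)
  {
      return proto == SKETCH_ETH_P_8021Q || proto == SKETCH_ETH_P_8021AD;
  }

  /*
   * Skip up to two VLAN tags. *n_proto is in network byte order.
   * Untagged (or post-VLAN) input is left untouched.
   * Returns 0 on success, -1 if the packet is too short or a third
   * tag follows.
   */
  static int sketch_skip_vlan(const uint8_t *data, size_t len,
                              uint16_t *nhoff, uint16_t *n_proto)
  {
      int i;

      for (i = 0; i < 2; i++) {
          uint16_t inner;

          if (!sketch_is_vlan(ntohs(*n_proto)))
              return 0;
          if ((size_t)*nhoff + SKETCH_VLAN_HLEN > len)
              return -1;
          /* The encapsulated EtherType follows the 2-byte TCI. */
          inner = (uint16_t)((data[*nhoff + 2] << 8) | data[*nhoff + 3]);
          *n_proto = htons(inner);
          *nhoff = (uint16_t)(*nhoff + SKETCH_VLAN_HLEN);
      }
      return sketch_is_vlan(ntohs(*n_proto)) ? -1 : 0;
  }

After the call ``nhoff`` points at the L3 header in every case, so the rest of
the program does not need to know which state it was entered in. A caller
would also reset ``thoff`` to the new ``nhoff``. This sketch accepts any
combination of two tags, which is a simplification; a stricter program might
only accept 802.1AD followed by 802.1Q.
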
Flags
=====

``flow_keys->flags`` might contain optional input flags that work as follows:

* ``BPF_FLOW_DISSECTOR_F_PARSE_1ST_FRAG`` - tells BPF flow dissector to
  continue parsing first fragment; the default expected behavior is that
  flow dissector returns as soon as it finds out that the packet is fragmented;
  used by ``eth_get_headlen`` to estimate length of all headers for GRO.
* ``BPF_FLOW_DISSECTOR_F_STOP_AT_FLOW_LABEL`` - tells BPF flow dissector to
  stop parsing as soon as it reaches IPv6 flow label; used by
  ``___skb_get_hash`` and ``__skb_get_hash_symmetric`` to get flow hash.
* ``BPF_FLOW_DISSECTOR_F_STOP_AT_ENCAP`` - tells BPF flow dissector to stop
  parsing as soon as it reaches encapsulated headers; used by routing
  infrastructure.


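As an illustration of the first flag, the fragment decision for IPv4 can be
written as a small pure function. This is a userspace sketch of one plausible
reading of the description above; ``sketch_ipv4_continue`` and
``struct sketch_frag_state`` are hypothetical, and the flag value ``1U << 0``
is an assumption that should be checked against ``linux/bpf.h``.

.. code:: c

  #include <stdbool.h>
  #include <stdint.h>

  /* Assumed value of BPF_FLOW_DISSECTOR_F_PARSE_1ST_FRAG. */
  #define SKETCH_F_PARSE_1ST_FRAG (1U << 0)

  #define SKETCH_IP_MF     0x2000  /* "more fragments" bit */
  #define SKETCH_IP_OFFSET 0x1fff  /* fragment offset mask */

  struct sketch_frag_state {
      bool is_frag;
      bool is_first_frag;
  };

  /*
   * Decide whether dissection should continue into the transport header.
   * frag_off is the IPv4 flags/fragment-offset field in host byte order.
   */
  static bool sketch_ipv4_continue(uint16_t frag_off, uint32_t flags,
                                   struct sketch_frag_state *st)
  {
      st->is_frag = false;
      st->is_first_frag = false;

      if (!(frag_off & (SKETCH_IP_MF | SKETCH_IP_OFFSET)))
          return true;  /* not fragmented */

      st->is_frag = true;
      if (frag_off & SKETCH_IP_OFFSET)
          return false; /* later fragment: no transport header here */

      st->is_first_frag = true;
      /* First fragment: continue only if the caller asked for it. */
      return (flags & SKETCH_F_PARSE_1ST_FRAG) != 0;
  }

With ``flags == 0`` the function stops at any fragment, which matches the
default behavior described above. Only a first fragment combined with the
flag lets parsing continue.
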
Reference Implementation
========================

See ``tools/testing/selftests/bpf/progs/bpf_flow.c`` for the reference
implementation and ``tools/testing/selftests/bpf/flow_dissector_load.[hc]``
for the loader. bpftool can be used to load a BPF flow dissector program as
well.

The reference implementation is organized as follows:

* ``jmp_table`` map that contains sub-programs for each supported L3 protocol
* ``_dissect`` routine - entry point; it parses the input ``n_proto`` and
  does ``bpf_tail_call`` to the appropriate L3 handler

Since BPF at this point doesn't support looping (or any jumping back),
``jmp_table`` is used instead to handle multiple levels of encapsulation (and
IPv6 options).


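The dispatch structure can be approximated in plain C with a table of function
pointers. This is only a userspace analogy: a real program uses a BPF program
array map and ``bpf_tail_call``, and none of the names below come from the
reference implementation. The packet is simulated as a chain of EtherTypes so
that the re-dispatch on encapsulation is visible.

.. code:: c

  #include <stddef.h>
  #include <stdint.h>

  enum { SKETCH_BPF_OK = 0, SKETCH_BPF_DROP = 2 };
  enum { SLOT_IP, SLOT_IPV6, SLOT_VLAN, SLOT_COUNT };  /* hypothetical slots */

  /* Simulated packet: EtherTypes in host order, outermost first. */
  struct sketch_ctx {
      const uint16_t *protos;
      size_t count;
      size_t pos;
      int l3_slot;  /* slot of the handler that finished, or -1 */
  };

  typedef int (*sketch_handler)(struct sketch_ctx *ctx);

  static int sketch_dispatch(struct sketch_ctx *ctx);

  static int handle_ip(struct sketch_ctx *ctx)
  {
      ctx->l3_slot = SLOT_IP;
      return SKETCH_BPF_OK;
  }

  static int handle_ipv6(struct sketch_ctx *ctx)
  {
      ctx->l3_slot = SLOT_IPV6;
      return SKETCH_BPF_OK;
  }

  /* Unwrap one level, then jump through the table again. */
  static int handle_vlan(struct sketch_ctx *ctx)
  {
      ctx->pos++;
      return sketch_dispatch(ctx);
  }

  static const sketch_handler sketch_jmp_table[SLOT_COUNT] = {
      [SLOT_IP]   = handle_ip,
      [SLOT_IPV6] = handle_ipv6,
      [SLOT_VLAN] = handle_vlan,
  };

  /* Entry point: map the current protocol to a slot and call its handler. */
  static int sketch_dispatch(struct sketch_ctx *ctx)
  {
      int slot;

      if (ctx->pos >= ctx->count)
          return SKETCH_BPF_DROP;

      switch (ctx->protos[ctx->pos]) {
      case 0x0800: slot = SLOT_IP;   break;
      case 0x86DD: slot = SLOT_IPV6; break;
      case 0x8100:
      case 0x88A8: slot = SLOT_VLAN; break;
      default:     return SKETCH_BPF_DROP;
      }
      return sketch_jmp_table[slot](ctx);
  }

Each handler either finishes or jumps through the table again, so nested
headers are handled without a backward jump. In this sketch the depth is
bounded by the length of the simulated chain; in BPF the bound comes from the
kernel's limit on tail calls.
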
Current Limitations
===================

The BPF flow dissector doesn't support exporting all the metadata that the
in-kernel C-based implementation can export. Notable examples are single VLAN
(802.1Q) and double VLAN (802.1AD) tags. Please refer to
``struct bpf_flow_keys`` for the set of information that can currently be
exported from the BPF context.

When a BPF flow dissector is attached to the root network namespace
(machine-wide policy), users can't override it in their child network
namespaces.