.. contents::
.. sectnum::

======================================
BPF Instruction Set Architecture (ISA)
======================================

eBPF, also commonly referred to as BPF, is a technology with origins in
the Linux kernel that can run untrusted programs in a privileged context
such as an operating system kernel. This document specifies the BPF
instruction set architecture (ISA).

As a historical note, BPF originally stood for Berkeley Packet Filter,
but now that it can do so much more than packet filtering, the acronym
no longer makes sense. BPF is now considered a standalone term that
does not stand for anything. The original BPF is sometimes referred to
as cBPF (classic BPF) to distinguish it from the now widely deployed
eBPF (extended BPF).


Documentation conventions
=========================

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
"OPTIONAL" in this document are to be interpreted as described in
BCP 14 `<https://www.rfc-editor.org/info/rfc2119>`_
`<https://www.rfc-editor.org/info/rfc8174>`_
when, and only when, they appear in all capitals, as shown here.

For brevity and consistency, this document refers to families
of types using a shorthand syntax and refers to several expository,
mnemonic functions when describing the semantics of instructions.
The range of valid values for those types and the semantics of those
functions are defined in the following subsections.


Types
-----

This document refers to integer types with the notation `SN` to specify
a type's signedness (`S`) and bit width (`N`), respectively.

.. table:: Meaning of signedness notation

  ====  =========
  S     Meaning
  ====  =========
  u     unsigned
  s     signed
  ====  =========

.. table:: Meaning of bit-width notation

  =====  =========
  N      Bit width
  =====  =========
  8      8 bits
  16     16 bits
  32     32 bits
  64     64 bits
  128    128 bits
  =====  =========

For example, `u32` is a type whose valid values are all the 32-bit unsigned
numbers and `s16` is a type whose valid values are all the 16-bit signed
numbers.


Functions
---------

The following byteswap functions are direction-agnostic. That is,
the same function is used for conversion in either direction discussed
below.

* be16: Takes an unsigned 16-bit number and converts it between
  host byte order and big-endian
  (`IEN137 <https://www.rfc-editor.org/ien/ien137.txt>`_) byte order.
* be32: Takes an unsigned 32-bit number and converts it between
  host byte order and big-endian byte order.
* be64: Takes an unsigned 64-bit number and converts it between
  host byte order and big-endian byte order.
* bswap16: Takes an unsigned 16-bit number in either big- or little-endian
  format and returns the equivalent number with the same bit width but
  opposite endianness.
* bswap32: Takes an unsigned 32-bit number in either big- or little-endian
  format and returns the equivalent number with the same bit width but
  opposite endianness.
* bswap64: Takes an unsigned 64-bit number in either big- or little-endian
  format and returns the equivalent number with the same bit width but
  opposite endianness.
* le16: Takes an unsigned 16-bit number and converts it between
  host byte order and little-endian byte order.
* le32: Takes an unsigned 32-bit number and converts it between
  host byte order and little-endian byte order.
* le64: Takes an unsigned 64-bit number and converts it between
  host byte order and little-endian byte order.

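
The byteswap functions listed above could be sketched in C as follows. This is only an illustration and not part of the specification: ``host_is_little_endian`` is a hypothetical helper introduced here, and only the 16-bit variants of ``be`` and ``le`` are shown, since the wider ones would follow the same pattern.

```c
#include <stdint.h>

/* Unconditionally reverse the byte order of a 16-bit value. */
static inline uint16_t bswap16(uint16_t v)
{
    return (uint16_t)((v << 8) | (v >> 8));
}

/* Swap each half, then exchange the halves. */
static inline uint32_t bswap32(uint32_t v)
{
    return ((uint32_t)bswap16((uint16_t)v) << 16) | bswap16((uint16_t)(v >> 16));
}

static inline uint64_t bswap64(uint64_t v)
{
    return ((uint64_t)bswap32((uint32_t)v) << 32) | bswap32((uint32_t)(v >> 32));
}

/* Hypothetical helper: returns 1 on a little-endian host, 0 otherwise. */
static inline int host_is_little_endian(void)
{
    const uint16_t probe = 1;
    return *(const uint8_t *)&probe == 1;
}

/* be16 swaps only when the host byte order is not big-endian. */
static inline uint16_t be16(uint16_t v)
{
    return host_is_little_endian() ? bswap16(v) : v;
}

/* le16 swaps only when the host byte order is not little-endian. */
static inline uint16_t le16(uint16_t v)
{
    return host_is_little_endian() ? v : bswap16(v);
}
```

Because each conversion is its own inverse, applying ``be16`` twice returns the original value, which is why a single direction-agnostic function suffices.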

Definitions
-----------

.. glossary::

  Sign Extend
    To sign extend an ``X``-bit number, ``A``, to a ``Y``-bit number, ``B``, means to:

    #. Copy all ``X`` bits from ``A`` to the lower ``X`` bits of ``B``.
    #. Set the value of the remaining ``Y`` - ``X`` bits of ``B`` to the value of
       the most-significant bit of ``A``.

.. admonition:: Example

  Sign extend an 8-bit number ``A`` to a 16-bit number ``B`` on a big-endian platform:
  ::

    A:          10000110
    B: 11111111 10000110

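
The two steps of the sign-extend definition could be sketched in C as follows. The function name ``sign_extend`` and its parameter order are assumptions made for this illustration; values are carried in a 64-bit container, with bits above ``Y`` left at zero.

```c
#include <stdint.h>

/*
 * Sign extend the low `x` bits of `a` to `y` bits (1 <= x <= y <= 64).
 * Bits above `y` in the result are zero.
 */
static uint64_t sign_extend(uint64_t a, unsigned x, unsigned y)
{
    uint64_t x_mask = (x == 64) ? UINT64_MAX : ((UINT64_C(1) << x) - 1);
    uint64_t y_mask = (y == 64) ? UINT64_MAX : ((UINT64_C(1) << y) - 1);
    uint64_t b = a & x_mask;        /* step 1: copy the low X bits of A */

    if ((b >> (x - 1)) & 1)         /* most-significant bit of A is set */
        b |= y_mask & ~x_mask;      /* step 2: fill the remaining Y - X bits */
    return b;
}
```

With the values from the example, ``sign_extend(0x86, 8, 16)`` yields ``0xFF86``.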

Conformance groups
------------------

An implementation does not need to support all instructions specified in this
document (e.g., deprecated instructions). Instead, a number of conformance
groups are specified. An implementation MUST support the base32 conformance
group and MAY support additional conformance groups, where supporting a
conformance group means it MUST support all instructions in that conformance
group.

The use of named conformance groups enables interoperability between a runtime
that executes instructions, and tools such as compilers that generate
instructions for the runtime. Thus, capability discovery in terms of
conformance groups might be done manually by users or automatically by tools.

Each conformance group has a short ASCII label (e.g., "base32") that
corresponds to a set of instructions that are mandatory. That is, each
instruction has one or more conformance groups of which it is a member.

This document defines the following conformance groups:

* base32: includes all instructions defined in this
  specification unless otherwise noted.
* base64: includes base32, plus instructions explicitly noted
  as being in the base64 conformance group.
* atomic32: includes 32-bit atomic operation instructions (see `Atomic operations`_).
* atomic64: includes atomic32, plus 64-bit atomic operation instructions.
* divmul32: includes 32-bit division, multiplication, and modulo instructions.
* divmul64: includes divmul32, plus 64-bit division, multiplication,
  and modulo instructions.
* packet: deprecated packet access instructions.


Instruction encoding
====================

BPF has two instruction encodings:

* the basic instruction encoding, which uses 64 bits to encode an instruction
* the wide instruction encoding, which appends a second 64 bits
  after the basic instruction for a total of 128 bits.

Basic instruction encoding
--------------------------

A basic instruction is encoded as follows::

  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
  |    opcode     |     regs      |            offset             |
  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
  |                              imm                              |
  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

**opcode**
  operation to perform, encoded as follows::

    +-+-+-+-+-+-+-+-+
    |specific |class|
    +-+-+-+-+-+-+-+-+

  **specific**
    The format of these bits varies by instruction class

  **class**
    The instruction class (see `Instruction classes`_)

**regs**
  The source and destination register numbers, encoded as follows
  on a little-endian host::

    +-+-+-+-+-+-+-+-+
    |src_reg|dst_reg|
    +-+-+-+-+-+-+-+-+

  and as follows on a big-endian host::

    +-+-+-+-+-+-+-+-+
    |dst_reg|src_reg|
    +-+-+-+-+-+-+-+-+

  **src_reg**
    the source register number (0-10), except where otherwise specified
    (`64-bit immediate instructions`_ reuse this field for other purposes)

  **dst_reg**
    destination register number (0-10), unless otherwise specified
    (future instructions might reuse this field for other purposes)

**offset**
  signed integer offset used with pointer arithmetic, except where
  otherwise specified (some arithmetic instructions reuse this field
  for other purposes)

**imm**
  signed integer immediate value

Note that the contents of multi-byte fields ('offset' and 'imm') are
stored using big-endian byte ordering on big-endian hosts and
little-endian byte ordering on little-endian hosts.

For example::

  opcode                  offset imm          assembly
         src_reg dst_reg
  07     0       1        00 00  44 33 22 11  r1 += 0x11223344 // little
         dst_reg src_reg
  07     1       0        00 00  11 22 33 44  r1 += 0x11223344 // big

Note that most instructions do not use all of the fields.
Unused fields SHALL be cleared to zero.

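
As an illustration of the basic encoding, the following C sketch decodes the eight bytes of an instruction that was encoded on a little-endian host. The ``struct insn`` layout and the ``decode_le`` name are hypothetical and chosen only for this example; the conversions to signed types assume the usual two's complement behavior.

```c
#include <stdint.h>

/* Hypothetical in-memory form of one decoded basic instruction. */
struct insn {
    uint8_t opcode;
    uint8_t dst_reg;
    uint8_t src_reg;
    int16_t offset;
    int32_t imm;
};

/* Decode 8 bytes that were encoded on a little-endian host. */
static struct insn decode_le(const uint8_t b[8])
{
    struct insn in;

    in.opcode  = b[0];
    in.dst_reg = b[1] & 0x0f;   /* low 4 bits of 'regs' on little-endian */
    in.src_reg = b[1] >> 4;     /* high 4 bits of 'regs' on little-endian */
    in.offset  = (int16_t)(uint16_t)(b[2] | (b[3] << 8));
    in.imm     = (int32_t)((uint32_t)b[4] | ((uint32_t)b[5] << 8) |
                           ((uint32_t)b[6] << 16) | ((uint32_t)b[7] << 24));
    return in;
}
```

Feeding it the little-endian bytes from the example (``07 01 00 00 44 33 22 11``) gives opcode ``0x07``, ``dst_reg`` 1, ``src_reg`` 0, offset 0, and an immediate of ``0x11223344``.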

Wide instruction encoding
-------------------------

Some instructions are defined to use the wide instruction encoding,
which uses two 32-bit immediate values. The 64 bits following
the basic instruction format contain a pseudo instruction
with 'opcode', 'dst_reg', 'src_reg', and 'offset' all set to zero.

This is depicted in the following figure::

  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
  |    opcode     |     regs      |            offset             |
  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
  |                              imm                              |
  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
  |                           reserved                            |
  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
  |                           next_imm                            |
  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

**opcode**
  operation to perform, encoded as explained above

**regs**
  The source and destination register numbers (unless otherwise
  specified), encoded as explained above

**offset**
  signed integer offset used with pointer arithmetic, unless
  otherwise specified

**imm**
  signed integer immediate value

**reserved**
  unused, set to zero

**next_imm**
  second signed integer immediate value


Instruction classes
-------------------

The three least significant bits of the 'opcode' field store the instruction class:

.. table:: Instruction class

  =====  =====  ===============================  ===================================
  class  value  description                      reference
  =====  =====  ===============================  ===================================
  LD     0x0    non-standard load operations     `Load and store instructions`_
  LDX    0x1    load into register operations    `Load and store instructions`_
  ST     0x2    store from immediate operations  `Load and store instructions`_
  STX    0x3    store from register operations   `Load and store instructions`_
  ALU    0x4    32-bit arithmetic operations     `Arithmetic and jump instructions`_
  JMP    0x5    64-bit jump operations           `Arithmetic and jump instructions`_
  JMP32  0x6    32-bit jump operations           `Arithmetic and jump instructions`_
  ALU64  0x7    64-bit arithmetic operations     `Arithmetic and jump instructions`_
  =====  =====  ===============================  ===================================


Arithmetic and jump instructions
================================

For arithmetic and jump instructions (``ALU``, ``ALU64``, ``JMP`` and
``JMP32``), the 8-bit 'opcode' field is divided into three parts::

  +-+-+-+-+-+-+-+-+
  |  code |s|class|
  +-+-+-+-+-+-+-+-+

**code**
  the operation code, whose meaning varies by instruction class

**s (source)**
  the source operand location, which unless otherwise specified is one of:

  .. table:: Source operand location

    ======  =====  ==============================================
    source  value  description
    ======  =====  ==============================================
    K       0      use 32-bit 'imm' value as source operand
    X       1      use 'src_reg' register value as source operand
    ======  =====  ==============================================

**instruction class**
  the instruction class (see `Instruction classes`_)

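
The three-part split of the 'opcode' field could be expressed in C as shown below. The accessor names (``op_class``, ``op_source``, ``op_code``) are hypothetical and exist only for this sketch.

```c
#include <stdint.h>

/* Field accessors for the 8-bit 'opcode' of arithmetic and jump instructions. */

/* Three least significant bits: the instruction class. */
static inline unsigned op_class(uint8_t opcode)  { return opcode & 0x07; }

/* Bit 3: the source operand location (0 = K, 1 = X). */
static inline unsigned op_source(uint8_t opcode) { return (opcode >> 3) & 0x01; }

/* Four most significant bits: the operation code. */
static inline unsigned op_code(uint8_t opcode)   { return opcode >> 4; }
```

For instance, the opcode byte ``0x07`` splits into code ``0x0``, source ``K`` (0), and class ``0x7`` (``ALU64``).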

Arithmetic instructions
-----------------------

``ALU`` uses 32-bit wide operands while ``ALU64`` uses 64-bit wide operands for
otherwise identical operations. ``ALU64`` instructions belong to the
base64 conformance group unless noted otherwise.
The 'code' field encodes the operation as below, where 'src' refers to
the source operand and 'dst' refers to the value of the destination
register.

.. table:: Arithmetic instructions

  =====  =====  =======  ==========================================================
  name   code   offset   description
  =====  =====  =======  ==========================================================
  ADD    0x0    0        dst += src
  SUB    0x1    0        dst -= src
  MUL    0x2    0        dst \*= src
  DIV    0x3    0        dst = (src != 0) ? (dst / src) : 0
  SDIV   0x3    1        dst = (src != 0) ? (dst s/ src) : 0
  OR     0x4    0        dst \|= src
  AND    0x5    0        dst &= src
  LSH    0x6    0        dst <<= (src & mask)
  RSH    0x7    0        dst >>= (src & mask)
  NEG    0x8    0        dst = -dst
  MOD    0x9    0        dst = (src != 0) ? (dst % src) : dst
  SMOD   0x9    1        dst = (src != 0) ? (dst s% src) : dst
  XOR    0xa    0        dst ^= src
  MOV    0xb    0        dst = src
  MOVSX  0xb    8/16/32  dst = (s8,s16,s32)src
  ARSH   0xc    0        :term:`sign extending<Sign Extend>` dst >>= (src & mask)
  END    0xd    0        byte swap operations (see `Byte swap instructions`_ below)
  =====  =====  =======  ==========================================================

Underflow and overflow are allowed during arithmetic operations, meaning
the 64-bit or 32-bit value will wrap. If BPF program execution would
result in division by zero, the destination register is instead set to zero.
If execution would result in modulo by zero, for ``ALU64`` the value of
the destination register is unchanged whereas for ``ALU`` the upper
32 bits of the destination register are zeroed.

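
The wrapping and divide-by-zero rules above could be modeled in C roughly as follows. The function names are hypothetical, registers are modeled as plain ``uint64_t`` values, and only the unsigned operations are shown.

```c
#include <stdint.h>

/* {ADD, X, ALU}: 32-bit add that wraps; the upper 32 bits of dst become zero. */
static uint64_t alu32_add(uint64_t dst, uint64_t src)
{
    return (uint32_t)((uint32_t)dst + (uint32_t)src);
}

/* {DIV, X, ALU64}: unsigned divide; a zero divisor yields zero. */
static uint64_t alu64_div(uint64_t dst, uint64_t src)
{
    return (src != 0) ? dst / src : 0;
}

/* {MOD, X, ALU64}: a zero divisor leaves dst unchanged. */
static uint64_t alu64_mod(uint64_t dst, uint64_t src)
{
    return (src != 0) ? dst % src : dst;
}

/* {MOD, X, ALU}: a zero divisor keeps the low 32 bits and zeroes the upper 32. */
static uint64_t alu32_mod(uint64_t dst, uint64_t src)
{
    uint32_t d = (uint32_t)dst, s = (uint32_t)src;

    return (s != 0) ? d % s : d;
}
```

Unsigned arithmetic in C wraps by definition, so no extra handling is needed for overflow in this sketch.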

``{ADD, X, ALU}``, where 'code' = ``ADD``, 'source' = ``X``, and 'class' = ``ALU``, means::

  dst = (u32) ((u32) dst + (u32) src)

where '(u32)' indicates that the upper 32 bits are zeroed.

``{ADD, X, ALU64}`` means::

  dst = dst + src

``{XOR, K, ALU}`` means::

  dst = (u32) dst ^ (u32) imm

``{XOR, K, ALU64}`` means::

  dst = dst ^ imm

Note that most arithmetic instructions have 'offset' set to 0. Only three instructions
(``SDIV``, ``SMOD``, ``MOVSX``) have a non-zero 'offset'.

Division, multiplication, and modulo operations for ``ALU`` are part
of the "divmul32" conformance group, and division, multiplication, and
modulo operations for ``ALU64`` are part of the "divmul64" conformance
group.
The division and modulo operations support both unsigned and signed flavors.

For unsigned operations (``DIV`` and ``MOD``), for ``ALU``,
'imm' is interpreted as a 32-bit unsigned value. For ``ALU64``,
'imm' is first :term:`sign extended<Sign Extend>` from 32 to 64 bits, and then
interpreted as a 64-bit unsigned value.

For signed operations (``SDIV`` and ``SMOD``), for ``ALU``,
'imm' is interpreted as a 32-bit signed value. For ``ALU64``, 'imm'
is first :term:`sign extended<Sign Extend>` from 32 to 64 bits, and then
interpreted as a 64-bit signed value.

Note that there are varying definitions of the signed modulo operation
when the dividend or divisor are negative, where implementations often
vary by language such that Python, Ruby, etc. differ from C, Go, Java,
etc. This specification requires that signed modulo MUST use truncated division
(where -13 % 3 == -1) as implemented in C, Go, etc.::

  a % n = a - n * trunc(a / n)

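
The truncated-division rule could be sketched in C as below. The name ``smod64`` is hypothetical. One assumption is labeled in the code: ``INT64_MIN % -1`` is undefined behavior in C, so this sketch returns the mathematical result of 0 for any division by -1; the text above does not address that case.

```c
#include <stdint.h>

/*
 * Signed 64-bit modulo following a % n = a - n * trunc(a / n).
 * C99 and later define '/' as truncating toward zero, so the built-in
 * '%' operator already matches the required definition.
 */
static int64_t smod64(int64_t a, int64_t n)
{
    if (n == 0)
        return a;   /* modulo by zero leaves the value unchanged */
    if (n == -1)
        return 0;   /* assumption: avoids C undefined behavior for INT64_MIN */
    return a % n;
}
```

Under truncated division the result takes the sign of the dividend, so ``smod64(-13, 3)`` is -1 and ``smod64(13, -3)`` is 1.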

The ``MOVSX`` instruction does a move operation with sign extension.
``{MOVSX, X, ALU}`` :term:`sign extends<Sign Extend>` 8-bit and 16-bit operands into
32-bit operands, and zeroes the remaining upper 32 bits.
``{MOVSX, X, ALU64}`` :term:`sign extends<Sign Extend>` 8-bit, 16-bit, and 32-bit
operands into 64-bit operands. Unlike other arithmetic instructions,
``MOVSX`` is only defined for register source operands (``X``).

``{MOV, K, ALU64}`` means::

  dst = (s64)imm

``{MOV, X, ALU}`` means::

  dst = (u32)src

``{MOVSX, X, ALU}`` with 'offset' 8 means::

  dst = (u32)(s32)(s8)src

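
The cast chains used for ``MOVSX`` map almost directly onto C. The sketch below uses hypothetical function names and assumes that narrowing conversions to signed types wrap in two's complement fashion, as they do on mainstream compilers.

```c
#include <stdint.h>

/* {MOVSX, X, ALU} with 'offset' 8: dst = (u32)(s32)(s8)src */
static uint64_t movsx32_8(uint64_t src)
{
    return (uint32_t)(int32_t)(int8_t)src;
}

/* {MOVSX, X, ALU64} with 'offset' 16: sign extend the low 16 bits to 64 bits. */
static uint64_t movsx64_16(uint64_t src)
{
    return (uint64_t)(int64_t)(int16_t)src;
}
```

Here ``movsx32_8(0x80)`` gives ``0xFFFFFF80`` with the upper 32 bits zero, while ``movsx64_16(0x8000)`` fills all 48 upper bits with ones.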

The ``NEG`` instruction is only defined when the source bit is clear
(``K``).

Shift operations use a mask of 0x3F (63) for 64-bit operations and 0x1F (31)
for 32-bit operations.

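
The effect of the shift masks could be illustrated in C as follows. Function names are hypothetical. The arithmetic right shift is written with explicit fill bits because right-shifting a negative signed value is implementation-defined in C.

```c
#include <stdint.h>

/* {LSH, X, ALU64}: the shift count is masked with 0x3F. */
static uint64_t alu64_lsh(uint64_t dst, uint64_t src)
{
    return dst << (src & 0x3F);
}

/* {RSH, X, ALU}: 32-bit operand, shift count masked with 0x1F. */
static uint64_t alu32_rsh(uint64_t dst, uint64_t src)
{
    return (uint32_t)dst >> (src & 0x1F);
}

/* {ARSH, X, ALU64}: vacated upper bits are copies of the sign bit. */
static uint64_t alu64_arsh(uint64_t dst, uint64_t src)
{
    unsigned n = src & 0x3F;
    uint64_t fill = (dst >> 63) ? ~(UINT64_MAX >> n) : 0;

    return (dst >> n) | fill;
}
```

Because of the mask, a 64-bit shift by 64 behaves like a shift by 0, so ``alu64_lsh(1, 64)`` returns 1.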

Byte swap instructions
----------------------

The byte swap instructions use instruction classes of ``ALU`` and ``ALU64``
and a 4-bit 'code' field of ``END``.

The byte swap instructions operate on the destination register
only and do not use a separate source register or immediate value.

For ``ALU``, the 1-bit source operand field in the opcode is used to
select what byte order the operation converts from or to. For
``ALU64``, the 1-bit source operand field in the opcode is reserved
and MUST be set to 0.

.. table:: Byte swap instructions

  =====  ========  =====  =================================================
  class  source    value  description
  =====  ========  =====  =================================================
  ALU    LE        0      convert between host byte order and little endian
  ALU    BE        1      convert between host byte order and big endian
  ALU64  Reserved  0      do byte swap unconditionally
  =====  ========  =====  =================================================

The 'imm' field encodes the width of the swap operations. The following widths
are supported: 16, 32 and 64. Width 64 operations belong to the base64
conformance group and other swap operations belong to the base32
conformance group.

Examples:

``{END, LE, ALU}`` with 'imm' = 16/32/64 means::

  dst = le16(dst)
  dst = le32(dst)
  dst = le64(dst)

``{END, BE, ALU}`` with 'imm' = 16/32/64 means::

  dst = be16(dst)
  dst = be32(dst)
  dst = be64(dst)

``{END, TO, ALU64}`` with 'imm' = 16/32/64 means::

  dst = bswap16(dst)
  dst = bswap32(dst)
  dst = bswap64(dst)


Jump instructions
-----------------

``JMP32`` uses 32-bit wide operands and indicates the base32
conformance group, while ``JMP`` uses 64-bit wide operands for
otherwise identical operations, and indicates the base64 conformance
group unless otherwise specified.
The 'code' field encodes the operation as below:

.. table:: Jump instructions

  ========  =====  =======  =================================  ===================================================
  code      value  src_reg  description                        notes
  ========  =====  =======  =================================  ===================================================
  JA        0x0    0x0      PC += offset                       {JA, K, JMP} only
  JA        0x0    0x0      PC += imm                          {JA, K, JMP32} only
  JEQ       0x1    any      PC += offset if dst == src
  JGT       0x2    any      PC += offset if dst > src          unsigned
  JGE       0x3    any      PC += offset if dst >= src         unsigned
  JSET      0x4    any      PC += offset if dst & src
  JNE       0x5    any      PC += offset if dst != src
  JSGT      0x6    any      PC += offset if dst > src          signed
  JSGE      0x7    any      PC += offset if dst >= src         signed
  CALL      0x8    0x0      call helper function by static ID  {CALL, K, JMP} only, see `Helper functions`_
  CALL      0x8    0x1      call PC += imm                     {CALL, K, JMP} only, see `Program-local functions`_
  CALL      0x8    0x2      call helper function by BTF ID     {CALL, K, JMP} only, see `Helper functions`_
  EXIT      0x9    0x0      return                             {EXIT, K, JMP} only
  JLT       0xa    any      PC += offset if dst < src          unsigned
  JLE       0xb    any      PC += offset if dst <= src         unsigned
  JSLT      0xc    any      PC += offset if dst < src          signed
  JSLE      0xd    any      PC += offset if dst <= src         signed
  ========  =====  =======  =================================  ===================================================

where 'PC' denotes the program counter, and the offset to increment by
is in units of 64-bit instructions relative to the instruction following
the jump instruction. Thus 'PC += 1' skips execution of the next
instruction if it's a basic instruction or results in undefined behavior
if the next instruction is a 128-bit wide instruction.

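
The offset rule can be made concrete with a small C sketch. The ``jump_target`` name is hypothetical, and instruction positions are counted in 64-bit slots, so a wide instruction occupies two slots.

```c
#include <stdint.h>

/*
 * Slot index reached by a taken jump located at slot `pc`.
 * The offset is relative to the slot that follows the jump.
 */
static int64_t jump_target(int64_t pc, int32_t offset)
{
    return pc + 1 + offset;
}
```

An offset of 0 therefore falls through to the next instruction, and an offset of -1 jumps back to the jump instruction itself.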

Example:

``{JSGE, X, JMP32}`` means::

  if (s32)dst s>= (s32)src goto +offset

where 's>=' indicates a signed '>=' comparison.

``{JLE, K, JMP}`` means::

  if dst <= (u64)(s64)imm goto +offset

``{JA, K, JMP32}`` means::

  gotol +imm

where 'imm' means the branch offset comes from the 'imm' field.

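
The two conditional examples above could be written as C predicates that report whether the branch is taken. The function names are hypothetical, and the conversions to signed types assume two's complement wrapping.

```c
#include <stdbool.h>
#include <stdint.h>

/* {JSGE, X, JMP32}: compare the low 32 bits of each register as signed values. */
static bool jsge32_taken(uint64_t dst, uint64_t src)
{
    return (int32_t)(uint32_t)dst >= (int32_t)(uint32_t)src;
}

/* {JLE, K, JMP}: 'imm' is sign extended to 64 bits, then compared unsigned. */
static bool jle64_imm_taken(uint64_t dst, int32_t imm)
{
    return dst <= (uint64_t)(int64_t)imm;
}
```

Note the consequence of the sign extension in the second case: an 'imm' of -1 becomes the largest unsigned 64-bit value, so the comparison holds for every ``dst``.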

Note that there are two flavors of ``JA`` instructions. The
``JMP`` class permits a 16-bit jump offset specified by the 'offset'
field, whereas the ``JMP32`` class permits a 32-bit jump offset
specified by the 'imm' field. A conditional jump whose offset does not
fit in 16 bits can be converted to a conditional jump with an offset
that fits in 16 bits plus a 32-bit unconditional jump.

All ``CALL`` and ``JA`` instructions belong to the
base32 conformance group.


Helper functions
~~~~~~~~~~~~~~~~

Helper functions are a concept whereby BPF programs can call into a
set of functions exposed by the underlying platform.

Historically, each helper function was identified by a static ID
encoded in the 'imm' field. Further documentation of helper functions
is outside the scope of this document and standardization is left for
future work, but use is widely deployed and more information can be
found in platform-specific documentation (e.g., Linux kernel documentation).

Platforms that support the BPF Type Format (BTF) support identifying
a helper function by a BTF ID encoded in the 'imm' field, where the BTF ID
identifies the helper name and type. Further documentation of BTF
is outside the scope of this document and standardization is left for
future work, but use is widely deployed and more information can be
found in platform-specific documentation (e.g., Linux kernel documentation).

Program-local functions
~~~~~~~~~~~~~~~~~~~~~~~

Program-local functions are functions exposed by the same BPF program as the
caller, and are referenced by offset from the instruction following the call
instruction, similar to ``JA``. The offset is encoded in the 'imm' field of
the call instruction. An ``EXIT`` within the program-local function will
return to the caller.


Load and store instructions
===========================

For load and store instructions (``LD``, ``LDX``, ``ST``, and ``STX``), the
8-bit 'opcode' field is divided as follows::

  +-+-+-+-+-+-+-+-+
  |mode |sz |class|
  +-+-+-+-+-+-+-+-+

**mode**
  The mode modifier is one of:

  .. table:: Mode modifier

    =============  =====  ====================================  =============
    mode modifier  value  description                           reference
    =============  =====  ====================================  =============
    IMM            0      64-bit immediate instructions         `64-bit immediate instructions`_
    ABS            1      legacy BPF packet access (absolute)   `Legacy BPF Packet access instructions`_
    IND            2      legacy BPF packet access (indirect)   `Legacy BPF Packet access instructions`_
    MEM            3      regular load and store operations     `Regular load and store operations`_
    MEMSX          4      sign-extension load operations        `Sign-extension load operations`_
    ATOMIC         6      atomic operations                     `Atomic operations`_
    =============  =====  ====================================  =============

**sz (size)**
  The size modifier is one of:

  .. table:: Size modifier

    ====  =====  =====================
    size  value  description
    ====  =====  =====================
    W     0      word        (4 bytes)
    H     1      half word   (2 bytes)
    B     2      byte
    DW    3      double word (8 bytes)
    ====  =====  =====================

  Instructions using ``DW`` belong to the base64 conformance group.

**class**
  The instruction class (see `Instruction classes`_)

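
The load and store 'opcode' layout could be decoded in C as sketched below. The accessor names and the ``size_in_bytes`` helper are hypothetical; the byte counts come from the size modifier table.

```c
#include <stdint.h>

/* Three least significant bits: the instruction class. */
static inline unsigned ldst_class(uint8_t opcode) { return opcode & 0x07; }

/* Bits 3-4: the size modifier (W = 0, H = 1, B = 2, DW = 3). */
static inline unsigned ldst_size(uint8_t opcode)  { return (opcode >> 3) & 0x03; }

/* Three most significant bits: the mode modifier. */
static inline unsigned ldst_mode(uint8_t opcode)  { return opcode >> 5; }

/* Number of bytes accessed for a given 'sz' value. */
static unsigned size_in_bytes(unsigned sz)
{
    static const unsigned bytes[4] = { 4, 2, 1, 8 };

    return bytes[sz & 0x03];
}
```

As a worked case, the byte ``0x79`` is ``(3 << 5) | (3 << 3) | 1``, which decodes to mode ``MEM``, size ``DW`` (8 bytes), and class ``LDX``.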

Regular load and store operations
---------------------------------

The ``MEM`` mode modifier is used to encode regular load and store
instructions that transfer data between a register and memory.

``{MEM, <size>, STX}`` means::

  *(size *) (dst + offset) = src

``{MEM, <size>, ST}`` means::

  *(size *) (dst + offset) = imm

``{MEM, <size>, LDX}`` means::

  dst = *(unsigned size *) (src + offset)

Where '<size>' is one of: ``B``, ``H``, ``W``, or ``DW``, and
'unsigned size' is one of: u8, u16, u32, or u64.

Sign-extension load operations
------------------------------

The ``MEMSX`` mode modifier is used to encode :term:`sign-extension<Sign Extend>` load
instructions that transfer data between a register and memory.

``{MEMSX, <size>, LDX}`` means::

  dst = *(signed size *) (src + offset)

Where '<size>' is one of: ``B``, ``H``, or ``W``, and
'signed size' is one of: s8, s16, or s32.

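
The difference between ``MEM`` and ``MEMSX`` loads could be illustrated with the half-word case in C. The function names are hypothetical, the ``src`` register is modeled as a byte pointer, and values are read in host byte order.

```c
#include <stdint.h>
#include <string.h>

/* {MEM, H, LDX}: dst = *(u16 *)(src + offset), zero extended to 64 bits. */
static uint64_t ldx_mem_h(const uint8_t *src, int16_t offset)
{
    uint16_t v;

    memcpy(&v, src + offset, sizeof(v));    /* avoids unaligned access */
    return v;
}

/* {MEMSX, H, LDX}: dst = *(s16 *)(src + offset), sign extended to 64 bits. */
static uint64_t ldx_memsx_h(const uint8_t *src, int16_t offset)
{
    int16_t v;

    memcpy(&v, src + offset, sizeof(v));
    return (uint64_t)(int64_t)v;
}
```

Loading two ``0xFF`` bytes yields ``0xFFFF`` with the regular load but all 64 bits set with the sign-extension load.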

Atomic operations
-----------------

Atomic operations are operations that operate on memory and cannot be
interrupted or corrupted by other access to the same memory region
by other BPF programs or means outside of this specification.

All atomic operations supported by BPF are encoded as store operations
that use the ``ATOMIC`` mode modifier as follows:

* ``{ATOMIC, W, STX}`` for 32-bit operations, which are
  part of the "atomic32" conformance group.
* ``{ATOMIC, DW, STX}`` for 64-bit operations, which are
  part of the "atomic64" conformance group.
* 8-bit and 16-bit wide atomic operations are not supported.

The 'imm' field is used to encode the actual atomic operation.
Simple atomic operations use a subset of the values defined to encode
arithmetic operations in the 'imm' field to encode the atomic operation:

.. table:: Simple atomic operations

  ========  =====  ===========
  imm       value  description
  ========  =====  ===========
  ADD       0x00   atomic add
  OR        0x40   atomic or
  AND       0x50   atomic and
  XOR       0xa0   atomic xor
  ========  =====  ===========

``{ATOMIC, W, STX}`` with 'imm' = ADD means::

  *(u32 *)(dst + offset) += src

``{ATOMIC, DW, STX}`` with 'imm' = ADD means::

  *(u64 *)(dst + offset) += src

In addition to the simple atomic operations, there is also a modifier and
two complex atomic operations:

.. table:: Complex atomic operations

  ===========  ================  ===========================
  imm          value             description
  ===========  ================  ===========================
  FETCH        0x01              modifier: return old value
  XCHG         0xe0 | FETCH      atomic exchange
  CMPXCHG      0xf0 | FETCH      atomic compare and exchange
  ===========  ================  ===========================

The ``FETCH`` modifier is optional for simple atomic operations, and
always set for the complex atomic operations. If the ``FETCH`` flag
is set, then the operation also overwrites ``src`` with the value that
was in memory before it was modified.

The ``XCHG`` operation atomically exchanges ``src`` with the value
addressed by ``dst + offset``.

The ``CMPXCHG`` operation atomically compares the value addressed by
``dst + offset`` with ``R0``. If they match, the value addressed by
``dst + offset`` is replaced with ``src``. In either case, the
value that was at ``dst + offset`` before the operation is zero-extended
and loaded back to ``R0``.

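
The 64-bit ``FETCH``, ``XCHG``, and ``CMPXCHG`` semantics could be approximated with C11 atomics as sketched below. The function names are hypothetical, and the use of the default sequentially consistent ordering is an assumption of this sketch, since the text above does not specify a memory ordering. Each function returns the value that the instruction would write back to ``src`` (or to ``R0`` for ``CMPXCHG``).

```c
#include <stdatomic.h>
#include <stdint.h>

/* {ATOMIC, DW, STX} with 'imm' = ADD | FETCH: returns the new value of src. */
static uint64_t atomic64_fetch_add(_Atomic uint64_t *mem, uint64_t src)
{
    return atomic_fetch_add(mem, src);      /* value in memory before the add */
}

/* {ATOMIC, DW, STX} with 'imm' = XCHG: returns the new value of src. */
static uint64_t atomic64_xchg(_Atomic uint64_t *mem, uint64_t src)
{
    return atomic_exchange(mem, src);
}

/* {ATOMIC, DW, STX} with 'imm' = CMPXCHG: returns the new value of R0. */
static uint64_t atomic64_cmpxchg(_Atomic uint64_t *mem, uint64_t r0, uint64_t src)
{
    /*
     * On failure C11 stores the current memory value into r0; on success
     * r0 already equals the old value. Either way r0 ends up holding the
     * value that was in memory before the operation.
     */
    atomic_compare_exchange_strong(mem, &r0, src);
    return r0;
}
```

The compare-and-exchange case relies on a property of ``atomic_compare_exchange_strong``: its "expected" argument is updated on mismatch, which matches the rule that ``R0`` receives the old memory value in either case.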

64-bit immediate instructions
-----------------------------

Instructions with the ``IMM`` 'mode' modifier use the wide instruction
encoding defined in `Instruction encoding`_, and use the 'src_reg' field of the
basic instruction to hold an opcode subtype.

The following table defines a set of ``{IMM, DW, LD}`` instructions
with opcode subtypes in the 'src_reg' field, using new terms such as "map"
defined further below:

.. table:: 64-bit immediate instructions

  =======  =========================================  ===========  ==============
  src_reg  pseudocode                                 imm type     dst type
  =======  =========================================  ===========  ==============
  0x0      dst = (next_imm << 32) | imm               integer      integer
  0x1      dst = map_by_fd(imm)                       map fd       map
  0x2      dst = map_val(map_by_fd(imm)) + next_imm   map fd       data address
  0x3      dst = var_addr(imm)                        variable id  data address
  0x4      dst = code_addr(imm)                       integer      code address
  0x5      dst = map_by_idx(imm)                      map index    map
  0x6      dst = map_val(map_by_idx(imm)) + next_imm  map index    data address
  =======  =========================================  ===========  ==============

where

* map_by_fd(imm) means to convert a 32-bit file descriptor into an address of a map (see `Maps`_)
* map_by_idx(imm) means to convert a 32-bit index into an address of a map
* map_val(map) gets the address of the first value in a given map
* var_addr(imm) gets the address of a platform variable (see `Platform Variables`_) with a given id
* code_addr(imm) gets the address of the instruction at a specified relative offset in number of (64-bit) instructions
* the 'imm type' can be used by disassemblers for display
* the 'dst type' can be used for verification and JIT compilation purposes

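
For the ``src_reg`` 0x0 subtype, the pseudocode ``dst = (next_imm << 32) | imm`` could be sketched in C as follows. The ``ld_imm64`` name is hypothetical. This sketch assumes that 'imm' supplies the low 32 bits without sign extension, which is how the Linux kernel interpreter treats it; the pseudocode itself does not spell this out.

```c
#include <stdint.h>

/* src_reg 0x0: dst = (next_imm << 32) | imm */
static uint64_t ld_imm64(int32_t imm, int32_t next_imm)
{
    /* Assumption: both halves are treated as raw 32-bit patterns. */
    return ((uint64_t)(uint32_t)next_imm << 32) | (uint32_t)imm;
}
```

The intermediate casts to ``uint32_t`` matter: without them, a negative 'imm' would be sign extended by C's integer conversions and overwrite the upper half supplied by 'next_imm'.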

Maps
~~~~

Maps are shared memory regions accessible by BPF programs on some platforms.
A map can have various semantics as defined in a separate document, and may or
may not have a single contiguous memory region, but the 'map_val(map)' is
currently only defined for maps that do have a single contiguous memory region.

Each map can have a file descriptor (fd) if supported by the platform, where
'map_by_fd(imm)' means to get the map with the specified file descriptor. Each
BPF program can also be defined to use a set of maps associated with the
program at load time, and 'map_by_idx(imm)' means to get the map with the given
index in the set associated with the BPF program containing the instruction.

Platform Variables
~~~~~~~~~~~~~~~~~~

Platform variables are memory regions, identified by integer ids, exposed by
the runtime and accessible by BPF programs on some platforms. The
'var_addr(imm)' operation means to get the address of the memory region
identified by the given id.

Legacy BPF Packet access instructions
-------------------------------------

BPF previously introduced special instructions for access to packet data that were
carried over from classic BPF. These instructions used an instruction
class of ``LD``, a size modifier of ``W``, ``H``, or ``B``, and a
mode modifier of ``ABS`` or ``IND``. The 'dst_reg' and 'offset' fields were
set to zero, and 'src_reg' was set to zero for ``ABS``. However, these
instructions are deprecated and SHOULD no longer be used. All legacy packet
access instructions belong to the "packet" conformance group.