diff --git a/Documentation/bpf/bpf_design_QA.rst b/Documentation/bpf/bpf_design_QA.rst index 7cc9e368c1e9..10453c627135 100644 --- a/Documentation/bpf/bpf_design_QA.rst +++ b/Documentation/bpf/bpf_design_QA.rst @@ -36,27 +36,27 @@ consideration important quirks of other architectures) and defines calling convention that is compatible with C calling convention of the linux kernel on those architectures. -Q: can multiple return values be supported in the future? +Q: Can multiple return values be supported in the future? ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ A: NO. BPF allows only register R0 to be used as return value. -Q: can more than 5 function arguments be supported in the future? +Q: Can more than 5 function arguments be supported in the future? ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ A: NO. BPF calling convention only allows registers R1-R5 to be used as arguments. BPF is not a standalone instruction set. (unlike x64 ISA that allows msft, cdecl and other conventions) -Q: can BPF programs access instruction pointer or return address? +Q: Can BPF programs access instruction pointer or return address? ----------------------------------------------------------------- A: NO. -Q: can BPF programs access stack pointer ? +Q: Can BPF programs access stack pointer ? ------------------------------------------ A: NO. Only frame pointer (register R10) is accessible. From compiler point of view it's necessary to have stack pointer. -For example LLVM defines register R11 as stack pointer in its +For example, LLVM defines register R11 as stack pointer in its BPF backend, but it makes sure that generated code never uses it. Q: Does C-calling convention diminishes possible use cases? @@ -66,8 +66,8 @@ A: YES. BPF design forces addition of major functionality in the form of kernel helper functions and kernel objects like BPF maps with seamless interoperability between them. It lets kernel call into -BPF programs and programs call kernel helpers with zero overhead. -As all of them were native C code. That is particularly the case +BPF programs and programs call kernel helpers with zero overhead, +as if they were all native C code. That is particularly the case for JITed BPF programs that are indistinguishable from native kernel C code. @@ -75,9 +75,9 @@ Q: Does it mean that 'innovative' extensions to BPF code are disallowed? ------------------------------------------------------------------------ A: Soft yes. -At least for now until BPF core has support for +At least for now, until BPF core has support for bpf-to-bpf calls, indirect calls, loops, global variables, -jump tables, read only sections and all other normal constructs +jump tables, read-only sections, and all other normal constructs that C code can produce. Q: Can loops be supported in a safe way? @@ -109,16 +109,16 @@ For example why BPF_JNE and other compare and jumps are not cpu-like? A: This was necessary to avoid introducing flags into ISA which are impossible to make generic and efficient across CPU architectures. -Q: why BPF_DIV instruction doesn't map to x64 div? +Q: Why BPF_DIV instruction doesn't map to x64 div? ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ A: Because if we picked one-to-one relationship to x64 it would have made it more complicated to support on arm64 and other archs. Also it needs div-by-zero runtime check. -Q: why there is no BPF_SDIV for signed divide operation? +Q: Why there is no BPF_SDIV for signed divide operation?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ A: Because it would be rarely used. llvm errors in such case and -prints a suggestion to use unsigned divide instead +prints a suggestion to use unsigned divide instead. Q: Why BPF has implicit prologue and epilogue? ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ diff --git a/Documentation/bpf/btf.rst b/Documentation/bpf/btf.rst index 1d434c3a268d..9a60a5d60e38 100644 --- a/Documentation/bpf/btf.rst +++ b/Documentation/bpf/btf.rst @@ -5,43 +5,35 @@ BPF Type Format (BTF) 1. Introduction *************** -BTF (BPF Type Format) is the meta data format which -encodes the debug info related to BPF program/map. -The name BTF was used initially to describe -data types. The BTF was later extended to include -function info for defined subroutines, and line info -for source/line information. +BTF (BPF Type Format) is the metadata format which encodes the debug info +related to BPF programs/maps. The name BTF was used initially to describe data +types. The BTF was later extended to include function info for defined +subroutines, and line info for source/line information. -The debug info is used for map pretty print, function -signature, etc. The function signature enables better -bpf program/function kernel symbol. -The line info helps generate -source annotated translated byte code, jited code -and verifier log. +The debug info is used for map pretty print, function signature, etc. The +function signature enables better bpf program/function kernel symbols. The line +info helps generate source-annotated translated byte code, jited code, and +verifier log. The BTF specification contains two parts, * BTF kernel API * BTF ELF file format -The kernel API is the contract between -user space and kernel. The kernel verifies -the BTF info before using it. -The ELF file format is a user space contract -between ELF file and libbpf loader. +The kernel API is the contract between user space and the kernel. The kernel +verifies the BTF info before using it. The ELF file format is a user space +contract between the ELF file and the libbpf loader. -The type and string sections are part of the -BTF kernel API, describing the debug info -(mostly types related) referenced by the bpf program. -These two sections are discussed in -details in :ref:`BTF_Type_String`. +The type and string sections are part of the BTF kernel API, describing the +debug info (mostly type-related) referenced by the bpf program. These two +sections are discussed in detail in :ref:`BTF_Type_String`. .. _BTF_Type_String: 2. BTF Type and String Encoding ******************************* -The file ``include/uapi/linux/btf.h`` provides high -level definition on how types/strings are encoded. +The file ``include/uapi/linux/btf.h`` provides a high-level definition of how +types/strings are encoded. The beginning of data blob must be:: @@ -59,25 +51,23 @@ The beginning of data blob must be:: }; The magic is ``0xeB9F``, which has different encoding for big and little -endian system, and can be used to test whether BTF is generated for -big or little endian target. -The btf_header is designed to be extensible with hdr_len equal to -``sizeof(struct btf_header)`` when the data blob is generated. +endian systems, and can be used to test whether BTF is generated for a big- or +little-endian target. The ``btf_header`` is designed to be extensible with +``hdr_len`` equal to ``sizeof(struct btf_header)`` when a data blob is +generated.
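To make the magic and ``hdr_len`` checks above concrete, a user-space BTF consumer might validate a blob along the following lines. This is a minimal sketch, not part of this patch; ``struct btf_header``, ``BTF_MAGIC``, and ``BTF_VERSION`` come from ``include/uapi/linux/btf.h``::

    #include <linux/btf.h>
    #include <byteswap.h>
    #include <stddef.h>

    /* Return 1 if the blob looks like valid native-endian BTF, 0 otherwise. */
    static int btf_blob_usable(const void *data, size_t size)
    {
            const struct btf_header *hdr = data;

            if (size < sizeof(*hdr))
                    return 0;
            if (hdr->magic == bswap_16(BTF_MAGIC))
                    return 0; /* valid BTF, but generated for the other endianness */
            if (hdr->magic != BTF_MAGIC || hdr->version != BTF_VERSION)
                    return 0;
            /* hdr_len may be larger than sizeof(*hdr) if the blob came from a
             * newer tool; use hdr_len, not sizeof(), to locate the sections. */
            if (hdr->hdr_len < sizeof(*hdr) || hdr->hdr_len > size)
                    return 0;
            return 1;
    }

The byte-swapped magic check distinguishes a blob built for the opposite endianness from data that is simply not BTF.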
2.1 String Encoding =================== -The first string in the string section must be a null string. -The rest of string table is a concatenation of other null-treminated -strings. +The first string in the string section must be a null string. The rest of +the string table is a concatenation of other null-terminated strings. 2.2 Type Encoding ================= -The type id ``0`` is reserved for ``void`` type. -The type section is parsed sequentially and the type id is assigned to -each recognized type starting from id ``1``. -Currently, the following types are supported:: +The type id ``0`` is reserved for the ``void`` type. The type section is parsed +sequentially and a type id is assigned to each recognized type starting from id +``1``. Currently, the following types are supported:: #define BTF_KIND_INT 1 /* Integer */ #define BTF_KIND_PTR 2 /* Pointer */ @@ -122,9 +112,9 @@ Each type contains the following common data:: }; }; -For certain kinds, the common data are followed by kind specific data. -The ``name_off`` in ``struct btf_type`` specifies the offset in the string table. -The following details encoding of each kind. +For certain kinds, the common data are followed by kind-specific data. The +``name_off`` in ``struct btf_type`` specifies the offset in the string table. +The following sections detail the encoding of each kind. 2.2.1 BTF_KIND_INT ~~~~~~~~~~~~~~~~~~ @@ -136,7 +126,7 @@ The following details encoding of each kind. * ``info.vlen``: 0 * ``size``: the size of the int type in bytes. -``btf_type`` is followed by a ``u32`` with following bits arrangement:: +``btf_type`` is followed by a ``u32`` with the following bit arrangement:: #define BTF_INT_ENCODING(VAL) (((VAL) & 0x0f000000) >> 24) #define BTF_INT_OFFSET(VAL) (((VAL & 0x00ff0000)) >> 16) @@ -148,39 +138,33 @@ The ``BTF_INT_ENCODING`` has the following attributes:: #define BTF_INT_CHAR (1 << 1) #define BTF_INT_BOOL (1 << 2) -The ``BTF_INT_ENCODING()`` provides extra information, signness, -char, or bool, for the int type. The char and bool encoding -are mostly useful for pretty print. At most one encoding can -be specified for the int type. +The ``BTF_INT_ENCODING()`` provides extra information: signedness, char, or +bool, for the int type. The char and bool encodings are mostly useful for +pretty print. At most one encoding can be specified for the int type. -The ``BTF_INT_BITS()`` specifies the number of actual bits held by -this int type. For example, a 4-bit bitfield encodes -``BTF_INT_BITS()`` equals to 4. The ``btf_type.size * 8`` -must be equal to or greater than ``BTF_INT_BITS()`` for the type. -The maximum value of ``BTF_INT_BITS()`` is 128. +The ``BTF_INT_BITS()`` specifies the number of actual bits held by this int +type. For example, a 4-bit bitfield has ``BTF_INT_BITS()`` equal to 4. +The ``btf_type.size * 8`` must be equal to or greater than ``BTF_INT_BITS()`` +for the type. The maximum value of ``BTF_INT_BITS()`` is 128. -The ``BTF_INT_OFFSET()`` specifies the starting bit offset to -calculate values for this int. For example, a bitfield struct -member has +The ``BTF_INT_OFFSET()`` specifies the starting bit offset to calculate values +for this int.
For example, a bitfield struct member has: * btf member bit +offset 100 from the start of the structure, * btf member pointing to an int +type, * the int type has ``BTF_INT_OFFSET() = 2`` and ``BTF_INT_BITS() = 4`` - * btf member bit offset 100 from the start of the structure, - * btf member pointing to an int type, - * the int type has ``BTF_INT_OFFSET() = 2`` and ``BTF_INT_BITS() = 4`` +Then in the struct memory layout, this member will occupy ``4`` bits starting +from bits ``100 + 2 = 102``. -Then in the struct memory layout, this member will occupy -``4`` bits starting from bits ``100 + 2 = 102``. -Alternatively, the bitfield struct member can be the following to -access the same bits as the above: +Alternatively, the bitfield struct member can be the following to access the +same bits as the above: * btf member bit offset 102, * btf member pointing to an int type, * the int type has ``BTF_INT_OFFSET() = 0`` and ``BTF_INT_BITS() = 4`` -The original intention of ``BTF_INT_OFFSET()`` is to provide -flexibility of bitfield encoding. -Currently, both llvm and pahole generates ``BTF_INT_OFFSET() = 0`` -for all int types. +The original intention of ``BTF_INT_OFFSET()`` is to provide flexibility of +bitfield encoding. Currently, both llvm and pahole generate +``BTF_INT_OFFSET() = 0`` for all int types. 2.2.2 BTF_KIND_PTR ~~~~~~~~~~~~~~~~~~ @@ -204,7 +188,7 @@ No additional type data follow ``btf_type``. * ``info.vlen``: 0 * ``size/type``: 0, not used -btf_type is followed by one "struct btf_array":: +``btf_type`` is followed by one ``struct btf_array``:: struct btf_array { __u32 type; @@ -217,27 +201,25 @@ The ``struct btf_array`` encoding: * ``index_type``: the index type * ``nelems``: the number of elements for this array (``0`` is also allowed). -The ``index_type`` can be any regular int types -(u8, u16, u32, u64, unsigned __int128). -The original design of including ``index_type`` follows dwarf -which has a ``index_type`` for its array type. +The ``index_type`` can be any regular int type (``u8``, ``u16``, ``u32``, +``u64``, ``unsigned __int128``). The original design of including +``index_type`` follows DWARF, which has an ``index_type`` for its array type. Currently in BTF, beyond type verification, the ``index_type`` is not used. The ``struct btf_array`` allows chaining through element type to represent -multiple dimensional arrays. For example, ``int a[5][6]``, the following -type system illustrates the chaining: +multidimensional arrays. For example, for ``int a[5][6]``, the following type +information illustrates the chaining: * [1]: int * [2]: array, ``btf_array.type = [1]``, ``btf_array.nelems = 6`` * [3]: array, ``btf_array.type = [2]``, ``btf_array.nelems = 5`` -Currently, both pahole and llvm collapse multiple dimensional array -into one dimensional array, e.g., ``a[5][6]``, the btf_array.nelems -equal to ``30``. This is because the original use case is map pretty -print where the whole array is dumped out so one dimensional array -is enough. As more BTF usage is explored, pahole and llvm can be -changed to generate proper chained representation for -multiple dimensional arrays. +Currently, both pahole and llvm collapse a multidimensional array into +a one-dimensional array, e.g., for ``a[5][6]``, the ``btf_array.nelems`` is +equal to ``30``. This is because the original use case is map pretty print +where the whole array is dumped out, so a one-dimensional array is enough.
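Since both the chained and the collapsed encodings are legal, a BTF consumer can normalize them by multiplying ``nelems`` along the chain of BTF_KIND_ARRAY types. A minimal sketch, not part of this patch, assuming a hypothetical ``btf_type_by_id()`` lookup helper into the parsed type section::

    #include <linux/btf.h>

    /* Assumed helper: map a type id to its parsed struct btf_type. */
    extern const struct btf_type *btf_type_by_id(__u32 type_id);

    static __u64 array_total_elems(__u32 type_id)
    {
            const struct btf_type *t = btf_type_by_id(type_id);
            __u64 n = 1;

            while (t && BTF_INFO_KIND(t->info) == BTF_KIND_ARRAY) {
                    /* the kind-specific btf_array follows the common data */
                    const struct btf_array *a = (const struct btf_array *)(t + 1);

                    n *= a->nelems;
                    t = btf_type_by_id(a->type);
            }
            return n; /* 30 for either encoding of int a[5][6] */
    }

Both the chained encoding ([3] -> [2] -> [1] above) and the collapsed one yield the same total.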
As +more BTF usage is explored, pahole and llvm can be changed to generate a proper +chained representation for multidimensional arrays. 2.2.4 BTF_KIND_STRUCT ~~~~~~~~~~~~~~~~~~~~~ @@ -264,28 +246,26 @@ multiple dimensional arrays. * ``type``: the member type * ``offset``: -If the type info ``kind_flag`` is not set, the offset contains -only bit offset of the member. Note that the base type of the -bitfield can only be int or enum type. If the bitfield size -is 32, the base type can be either int or enum type. -If the bitfield size is not 32, the base type must be int, -and int type ``BTF_INT_BITS()`` encodes the bitfield size. +If the type info ``kind_flag`` is not set, the offset contains only the bit offset +of the member. Note that the base type of the bitfield can only be int or enum +type. If the bitfield size is 32, the base type can be either int or enum +type. If the bitfield size is not 32, the base type must be int, and int type +``BTF_INT_BITS()`` encodes the bitfield size. -If the ``kind_flag`` is set, the ``btf_member.offset`` -contains both member bitfield size and bit offset. The -bitfield size and bit offset are calculated as below.:: +If the ``kind_flag`` is set, the ``btf_member.offset`` contains both member +bitfield size and bit offset. The bitfield size and bit offset are calculated +as below:: #define BTF_MEMBER_BITFIELD_SIZE(val) ((val) >> 24) #define BTF_MEMBER_BIT_OFFSET(val) ((val) & 0xffffff) -In this case, if the base type is an int type, it must -be a regular int type: +In this case, if the base type is an int type, it must be a regular int type: * ``BTF_INT_OFFSET()`` must be 0. * ``BTF_INT_BITS()`` must be equal to ``{1,2,4,8,16} * 8``. -The following kernel patch introduced ``kind_flag`` and -explained why both modes exist: +The following kernel patch introduced ``kind_flag`` and explained why both +modes exist: https://github.com/torvalds/linux/commit/9d5f9f701b1891466fb3dbb1806ad97716f95cc3#diff-fa650a64fdd3968396883d2fe8215ff3 @@ -382,11 +362,11 @@ No additional type data follow ``btf_type``. No additional type data follow ``btf_type``. -A BTF_KIND_FUNC defines, not a type, but a subprogram (function) whose -signature is defined by ``type``. The subprogram is thus an instance of -that type. The BTF_KIND_FUNC may in turn be referenced by a func_info in -the :ref:`BTF_Ext_Section` (ELF) or in the arguments to -:ref:`BPF_Prog_Load` (ABI). +A BTF_KIND_FUNC defines not a type, but a subprogram (function) whose +signature is defined by ``type``. The subprogram is thus an instance of that +type. The BTF_KIND_FUNC may in turn be referenced by a func_info in the +:ref:`BTF_Ext_Section` (ELF) or in the arguments to :ref:`BPF_Prog_Load` +(ABI). 2.2.13 BTF_KIND_FUNC_PROTO ~~~~~~~~~~~~~~~~~~~~~~~~~~ @@ -405,13 +385,13 @@ the :ref:`BTF_Ext_Section` (ELF) or in the arguments to __u32 type; }; -If a BTF_KIND_FUNC_PROTO type is referred by a BTF_KIND_FUNC type, -then ``btf_param.name_off`` must point to a valid C identifier -except for the possible last argument representing the variable -argument. The btf_param.type refers to parameter type. +If a BTF_KIND_FUNC_PROTO type is referenced by a BTF_KIND_FUNC type, then +``btf_param.name_off`` must point to a valid C identifier except for the +possible last argument representing the variable argument. The ``btf_param.type`` +refers to the parameter type. -If the function has variable arguments, the last parameter -is encoded with ``name_off = 0`` and ``type = 0``.
+If the function has variable arguments, the last parameter is encoded with +``name_off = 0`` and ``type = 0``. 3. BTF Kernel API ***************** @@ -459,10 +439,9 @@ The workflow typically looks like: 3.1 BPF_BTF_LOAD ================ -Load a blob of BTF data into kernel. A blob of data -described in :ref:`BTF_Type_String` -can be directly loaded into the kernel. -A ``btf_fd`` returns to userspace. +Load a blob of BTF data into the kernel. A blob of data, described in +:ref:`BTF_Type_String`, can be directly loaded into the kernel. A ``btf_fd`` +is returned to userspace. 3.2 BPF_MAP_CREATE ================== @@ -484,18 +463,18 @@ In libbpf, the map can be defined with extra annotation like below: }; BPF_ANNOTATE_KV_PAIR(btf_map, int, struct ipv_counts); -Here, the parameters for macro BPF_ANNOTATE_KV_PAIR are map name, -key and value types for the map. -During ELF parsing, libbpf is able to extract key/value type_id's -and assigned them to BPF_MAP_CREATE attributes automatically. +Here, the parameters for macro BPF_ANNOTATE_KV_PAIR are the map name and the +key and value types for the map. During ELF parsing, libbpf is able to extract +key/value type_id's and assign them to BPF_MAP_CREATE attributes +automatically. .. _BPF_Prog_Load: 3.3 BPF_PROG_LOAD ================= -During prog_load, func_info and line_info can be passed to kernel with -proper values for the following attributes: +During prog_load, func_info and line_info can be passed to the kernel with +proper values for the following attributes: :: __u32 insn_cnt; @@ -522,9 +501,9 @@ The func_info and line_info are an array of below, respectively.:: __u32 line_col; /* line number and column number */ }; -func_info_rec_size is the size of each func_info record, and line_info_rec_size -is the size of each line_info record. Passing the record size to kernel make -it possible to extend the record itself in the future. +func_info_rec_size is the size of each func_info record, and +line_info_rec_size is the size of each line_info record. Passing the record +size to the kernel makes it possible to extend the record itself in the future. Below are requirements for func_info: * func_info[0].insn_off must be 0. @@ -532,7 +511,7 @@ Below are requirements for func_info: bpf func boundaries. Below are requirements for line_info: - * the first insn in each func must points to a line_info record. + * the first insn in each func must have a line_info record pointing to it. * the line_info insn_off is in strictly increasing order. For line_info, the line number and column number are defined as below: @@ -543,40 +522,38 @@ For line_info, the line number and column number are defined as below: 3.4 BPF_{PROG,MAP}_GET_NEXT_ID -In kernel, every loaded program, map or btf has a unique id. -The id won't change during the life time of the program, map or btf. +In the kernel, every loaded program, map or btf has a unique id. The id won't +change during the lifetime of a program, map, or btf. -The bpf syscall command BPF_{PROG,MAP}_GET_NEXT_ID -returns all id's, one for each command, to user space, for bpf -program or maps, -so the inspection tool can inspect all programs and maps. +The bpf syscall command BPF_{PROG,MAP}_GET_NEXT_ID returns all id's, one for +each command, to user space, for bpf programs or maps, respectively, so an +inspection tool can inspect all programs and maps. 3.5 BPF_{PROG,MAP}_GET_FD_BY_ID -The introspection tool cannot use id to get details about program or maps. -A file descriptor needs to be obtained first for reference counting purpose.
+An introspection tool cannot use an id to get details about programs or maps. +A file descriptor needs to be obtained first for reference-counting purposes. 3.6 BPF_OBJ_GET_INFO_BY_FD ========================== -Once a program/map fd is acquired, the introspection tool can -get the detailed information from kernel about this fd, -some of which is btf related. For example, -``bpf_map_info`` returns ``btf_id``, key/value type id. -``bpf_prog_info`` returns ``btf_id``, func_info and line info -for translated bpf byte codes, and jited_line_info. +Once a program/map fd is acquired, an introspection tool can get the detailed +information from the kernel about this fd, some of which are BTF-related. For +example, ``bpf_map_info`` returns ``btf_id`` and key/value type ids. +``bpf_prog_info`` returns ``btf_id``, func_info, and line info for translated +bpf byte codes, and jited_line_info. 3.7 BPF_BTF_GET_FD_BY_ID ======================== -With ``btf_id`` obtained in ``bpf_map_info`` and ``bpf_prog_info``, -bpf syscall command BPF_BTF_GET_FD_BY_ID can retrieve a btf fd. -Then, with command BPF_OBJ_GET_INFO_BY_FD, the btf blob, originally -loaded into the kernel with BPF_BTF_LOAD, can be retrieved. +With ``btf_id`` obtained in ``bpf_map_info`` and ``bpf_prog_info``, the bpf +syscall command BPF_BTF_GET_FD_BY_ID can retrieve a btf fd. Then, with the +command BPF_OBJ_GET_INFO_BY_FD, the btf blob, originally loaded into the +kernel with BPF_BTF_LOAD, can be retrieved. -With the btf blob, ``bpf_map_info`` and ``bpf_prog_info``, the introspection -tool has full btf knowledge and is able to pretty print map key/values, -dump func signatures, dump line info along with byte/jit codes. +With the btf blob, ``bpf_map_info``, and ``bpf_prog_info``, an introspection +tool has full btf knowledge and is able to pretty print map key/values, dump +func signatures and line info, along with byte/jit codes. 4. ELF File Format Interface **************************** @@ -584,19 +561,19 @@ dump func signatures, dump line info along with byte/jit codes. 4.1 .BTF section ================ -The .BTF section contains type and string data. The format of this section -is same as the one describe in :ref:`BTF_Type_String`. +The .BTF section contains type and string data. The format of this section is +the same as the one described in :ref:`BTF_Type_String`. .. _BTF_Ext_Section: 4.2 .BTF.ext section ==================== -The .BTF.ext section encodes func_info and line_info which -needs loader manipulation before loading into the kernel. +The .BTF.ext section encodes func_info and line_info which need loader +manipulation before loading into the kernel. -The specification for .BTF.ext section is defined at -``tools/lib/bpf/btf.h`` and ``tools/lib/bpf/btf.c``. +The specification for the .BTF.ext section is defined at ``tools/lib/bpf/btf.h`` +and ``tools/lib/bpf/btf.c``. The current header of .BTF.ext section:: @@ -613,9 +590,9 @@ The current header of .BTF.ext section:: __u32 line_info_len; }; -It is very similar to .BTF section. Instead of type/string section, -it contains func_info and line_info section. See :ref:`BPF_Prog_Load` -for details about func_info and line_info record format. +It is very similar to the .BTF section. Instead of type/string sections, it +contains func_info and line_info sections. See :ref:`BPF_Prog_Load` for details +about the func_info and line_info record formats. The func_info is organized as below.:: @@ -624,9 +601,9 @@ The func_info is organized as below.:: btf_ext_info_sec for section #2 /* func_info for section #2 */ ...
-``func_info_rec_size`` specifies the size of ``bpf_func_info`` structure -when .BTF.ext is generated. btf_ext_info_sec, defined below, is -the func_info for each specific ELF section.:: +``func_info_rec_size`` specifies the size of the ``bpf_func_info`` structure when +.BTF.ext is generated. ``btf_ext_info_sec``, defined below, is a collection of +func_info for each specific ELF section.:: struct btf_ext_info_sec { __u32 sec_name_off; /* offset to section name */ @@ -644,14 +621,14 @@ The line_info is organized as below.:: btf_ext_info_sec for section #2 /* line_info for section #2 */ ... -``line_info_rec_size`` specifies the size of ``bpf_line_info`` structure -when .BTF.ext is generated. +``line_info_rec_size`` specifies the size of the ``bpf_line_info`` structure when +.BTF.ext is generated. The interpretation of ``bpf_func_info->insn_off`` and -``bpf_line_info->insn_off`` is different between kernel API and ELF API. -For kernel API, the ``insn_off`` is the instruction offset in the unit -of ``struct bpf_insn``. For ELF API, the ``insn_off`` is the byte offset -from the beginning of section (``btf_ext_info_sec->sec_name_off``). +``bpf_line_info->insn_off`` is different between kernel API and ELF API. For +kernel API, the ``insn_off`` is the instruction offset in the unit of ``struct +bpf_insn``. For ELF API, the ``insn_off`` is the byte offset from the +beginning of the section (``btf_ext_info_sec->sec_name_off``). 5. Using BTF ************ @@ -659,10 +636,9 @@ from the beginning of section (``btf_ext_info_sec->sec_name_off``). 5.1 bpftool map pretty print ============================ -With BTF, the map key/value can be printed based on fields rather than -simply raw bytes. This is especially -valuable for large structure or if you data structure -has bitfields. For example, for the following map,:: +With BTF, the map key/value can be printed based on fields rather than simply +raw bytes. This is especially valuable for large structures or if your data +structure has bitfields. For example, for the following map:: enum A { A1, A2, A3, A4, A5 }; typedef enum A ___A; @@ -702,9 +678,9 @@ bpftool is able to pretty print like below: 5.2 bpftool prog dump ===================== -The following is an example to show func_info and line_info -can help prog dump with better kernel symbol name, function prototype -and line information.:: +The following is an example showing how func_info and line_info can help prog +dump with better kernel symbol names, function prototypes, and line +information:: $ bpftool prog dump jited pinned /sys/fs/bpf/test_btf_haskv [...] @@ -733,10 +709,11 @@ and line information.:: ; counts = bpf_map_lookup_elem(&btf_map, &key); [...] -5.3 verifier log +5.3 Verifier Log ================ -The following is an example how line_info can help verifier failure debug.:: +The following is an example of how line_info can help debug a verification +failure:: /* The code at tools/testing/selftests/bpf/test_xdp_noinline.c * is modified as below. @@ -765,8 +742,8 @@ You need latest pahole https://git.kernel.org/pub/scm/devel/pahole/pahole.git/ -or llvm (8.0 or later). The pahole acts as a dwarf2btf converter. It doesn't support .BTF.ext -and btf BTF_KIND_FUNC type yet. For example,:: +or llvm (8.0 or later). The pahole acts as a dwarf2btf converter. It doesn't +support .BTF.ext and btf BTF_KIND_FUNC type yet. For example,:: -bash-4.4$ cat t.c struct t { @@ -783,8 +760,9 @@ and btf BTF_KIND_FUNC type yet.
For example,:: c type_id=2 bitfield_size=2 bits_offset=5 [2] INT int size=4 bit_offset=0 nr_bits=32 encoding=SIGNED -The llvm is able to generate .BTF and .BTF.ext directly with -g for bpf target only. -The assembly code (-S) is able to show the BTF encoding in assembly format.:: +llvm is able to generate .BTF and .BTF.ext directly with -g for the bpf target +only. The assembly code (-S) is able to show the BTF encoding in assembly +format:: -bash-4.4$ cat t2.c typedef int __int32; @@ -867,4 +845,4 @@ The assembly code (-S) is able to show the BTF encoding in assembly format.:: 7. Testing ********** -Kernel bpf selftest `test_btf.c` provides extensive set of BTF related tests. +Kernel bpf selftest `test_btf.c` provides an extensive set of BTF-related tests. diff --git a/Documentation/networking/af_xdp.rst b/Documentation/networking/af_xdp.rst index 4ae4f9d8f8fe..e14d7d40fc75 100644 --- a/Documentation/networking/af_xdp.rst +++ b/Documentation/networking/af_xdp.rst @@ -295,6 +295,41 @@ using:: For XDP_SKB mode, use the switch "-S" instead of "-N" and all options can be displayed with "-h", as usual. +FAQ +======= + +Q: I am not seeing any traffic on the socket. What am I doing wrong? + +A: When a netdev of a physical NIC is initialized, Linux usually + allocates one Rx and Tx queue pair per core. So on an 8 core system, + queue ids 0 to 7 will be allocated, one per core. In the AF_XDP + bind call or the xsk_socket__create libbpf function call, you + specify a specific queue id to bind to and it is only the traffic + towards that queue you are going to get on your socket. So in the + example above, if you bind to queue 0, you are NOT going to get any + traffic that is distributed to queues 1 through 7. If you are + lucky, you will see the traffic, but usually it will end up on one + of the queues you have not bound to. + + There are a number of ways to solve the problem of getting the + traffic you want to the queue id you bound to. If you want to see + all the traffic, you can force the netdev to only have 1 queue, queue + id 0, and then bind to queue 0. You can use ethtool to do this:: + + sudo ethtool -L <interface> combined 1 + + If you want to only see part of the traffic, you can program the + NIC through ethtool to steer your traffic to a single queue id + that you can bind your XDP socket to. Here is one example in which + UDP traffic to and from port 4242 is sent to queue 2:: + + sudo ethtool -N <interface> rx-flow-hash udp4 fn + sudo ethtool -N <interface> flow-type udp4 src-port 4242 dst-port \ + 4242 action 2 + + A number of other ways are possible, all up to the capabilities of + the NIC you have. + Credits ======= @@ -309,4 +344,3 @@ Credits - Michael S. Tsirkin - Qi Z Zhang - Willem de Bruijn - diff --git a/Documentation/networking/filter.txt b/Documentation/networking/filter.txt index b5e060edfc38..319e5e041f38 100644 --- a/Documentation/networking/filter.txt +++ b/Documentation/networking/filter.txt @@ -829,7 +829,7 @@ tracing filters may do to maintain counters of events, for example. Register R9 is not used by socket filters either, but more complex filters may be running out of registers and would have to resort to spill/fill to stack. -Internal BPF can used as generic assembler for last step performance +Internal BPF can be used as a generic assembler for last step performance optimizations, socket filters and seccomp are using it as assembler. Tracing filters may use it as assembler to generate code from kernel.
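As an illustration of what "using internal BPF as an assembler" means in practice, the following kernel-side sketch (not part of this patch, written in the spirit of lib/test_bpf.c) hand-builds and runs a trivial "return 42" program; note the explicit preempt_disable(), which the cant_sleep() check added by this series now expects around BPF_PROG_RUN():

    #include <linux/filter.h>

    static u32 run_return_42(void)
    {
            struct bpf_insn insns[] = {
                    BPF_MOV64_IMM(BPF_REG_0, 42),   /* r0 = 42 */
                    BPF_EXIT_INSN(),                /* return r0 */
            };
            struct bpf_prog *fp;
            int err;
            u32 ret;

            fp = bpf_prog_alloc(bpf_prog_size(ARRAY_SIZE(insns)), 0);
            if (!fp)
                    return 0;
            fp->len = ARRAY_SIZE(insns);
            fp->type = BPF_PROG_TYPE_SOCKET_FILTER; /* as lib/test_bpf.c does */
            memcpy(fp->insnsi, insns, sizeof(insns));
            fp = bpf_prog_select_runtime(fp, &err); /* JIT if available */

            preempt_disable(); /* BPF_PROG_RUN() now asserts atomic context */
            ret = BPF_PROG_RUN(fp, NULL); /* this program never touches ctx */
            preempt_enable();

            bpf_prog_free(fp);
            return ret; /* 42 */
    }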
In kernel usage may not be bounded by security considerations, since generated internal BPF code diff --git a/include/linux/bpf.h b/include/linux/bpf.h index de18227b3d95..a2132e09dc1c 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -16,6 +16,7 @@ #include #include #include +#include <linux/u64_stats_sync.h> struct bpf_verifier_env; struct perf_event; @@ -340,6 +341,12 @@ enum bpf_cgroup_storage_type { #define MAX_BPF_CGROUP_STORAGE_TYPE __BPF_CGROUP_STORAGE_MAX +struct bpf_prog_stats { + u64 cnt; + u64 nsecs; + struct u64_stats_sync syncp; +}; + struct bpf_prog_aux { atomic_t refcnt; u32 used_map_cnt; @@ -389,6 +396,7 @@ struct bpf_prog_aux { * main prog always has linfo_idx == 0 */ u32 linfo_idx; + struct bpf_prog_stats __percpu *stats; union { struct work_struct work; struct rcu_head rcu; @@ -559,6 +567,7 @@ void bpf_map_area_free(void *base); void bpf_map_init_from_attr(struct bpf_map *map, union bpf_attr *attr); extern int sysctl_unprivileged_bpf_disabled; +extern int sysctl_bpf_stats_enabled; int bpf_map_new_fd(struct bpf_map *map, int flags); int bpf_prog_new_fd(struct bpf_prog *prog); diff --git a/include/linux/filter.h b/include/linux/filter.h index 95e2d7ebdf21..7e5e3db11106 100644 --- a/include/linux/filter.h +++ b/include/linux/filter.h @@ -533,7 +533,24 @@ struct sk_filter { struct bpf_prog *prog; }; -#define BPF_PROG_RUN(filter, ctx) (*(filter)->bpf_func)(ctx, (filter)->insnsi) +DECLARE_STATIC_KEY_FALSE(bpf_stats_enabled_key); + +#define BPF_PROG_RUN(prog, ctx) ({ \ + u32 ret; \ + cant_sleep(); \ + if (static_branch_unlikely(&bpf_stats_enabled_key)) { \ + struct bpf_prog_stats *stats; \ + u64 start = sched_clock(); \ + ret = (*(prog)->bpf_func)(ctx, (prog)->insnsi); \ + stats = this_cpu_ptr(prog->aux->stats); \ + u64_stats_update_begin(&stats->syncp); \ + stats->cnt++; \ + stats->nsecs += sched_clock() - start; \ + u64_stats_update_end(&stats->syncp); \ + } else { \ + ret = (*(prog)->bpf_func)(ctx, (prog)->insnsi); \ + } \ + ret; }) #define BPF_SKB_CB_LEN QDISC_CB_PRIV_LEN @@ -764,6 +781,7 @@ void bpf_prog_free_jited_linfo(struct bpf_prog *prog); void bpf_prog_free_unused_jited_linfo(struct bpf_prog *prog); struct bpf_prog *bpf_prog_alloc(unsigned int size, gfp_t gfp_extra_flags); +struct bpf_prog *bpf_prog_alloc_no_stats(unsigned int size, gfp_t gfp_extra_flags); struct bpf_prog *bpf_prog_realloc(struct bpf_prog *fp_old, unsigned int size, gfp_t gfp_extra_flags); void __bpf_prog_free(struct bpf_prog *fp); diff --git a/include/linux/kernel.h b/include/linux/kernel.h index 8f0e68e250a7..a8868a32098c 100644 --- a/include/linux/kernel.h +++ b/include/linux/kernel.h @@ -245,8 +245,10 @@ extern int _cond_resched(void); #endif #ifdef CONFIG_DEBUG_ATOMIC_SLEEP - void ___might_sleep(const char *file, int line, int preempt_offset); - void __might_sleep(const char *file, int line, int preempt_offset); +extern void ___might_sleep(const char *file, int line, int preempt_offset); +extern void __might_sleep(const char *file, int line, int preempt_offset); +extern void __cant_sleep(const char *file, int line, int preempt_offset); + /** * might_sleep - annotation for functions that can sleep * @@ -259,6 +261,13 @@ extern int _cond_resched(void); */ # define might_sleep() \ do { __might_sleep(__FILE__, __LINE__, 0); might_resched(); } while (0) +/** + * cant_sleep - annotation for functions that cannot sleep + * + * this macro will print a stack trace if it is executed with preemption enabled + */ +# define cant_sleep() \ + do { __cant_sleep(__FILE__, __LINE__, 0); } while (0) # define
sched_annotate_sleep() (current->task_state_change = 0) #else static inline void ___might_sleep(const char *file, int line, @@ -266,6 +275,7 @@ extern int _cond_resched(void); static inline void __might_sleep(const char *file, int line, int preempt_offset) { } # define might_sleep() do { might_resched(); } while (0) +# define cant_sleep() do { } while (0) # define sched_annotate_sleep() do { } while (0) #endif diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index bcdd2474eee7..3c38ac9a92a7 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -2359,6 +2359,13 @@ union bpf_attr { * Return * A **struct bpf_tcp_sock** pointer on success, or NULL in * case of failure. + * + * int bpf_skb_ecn_set_ce(struct sk_buff *skb) + * Description + * Sets ECN of IP header to ce (congestion encountered) if + * current value is ect (ECN capable). Works with IPv6 and IPv4. + * Return + * 1 if set, 0 if not set. */ #define __BPF_FUNC_MAPPER(FN) \ FN(unspec), \ @@ -2457,7 +2464,8 @@ union bpf_attr { FN(spin_lock), \ FN(spin_unlock), \ FN(sk_fullsock), \ - FN(tcp_sock), + FN(tcp_sock), \ + FN(skb_ecn_set_ce), /* integer value in 'imm' field of BPF_CALL instruction selects which helper * function eBPF program intends to call @@ -2813,6 +2821,8 @@ struct bpf_prog_info { __u32 jited_line_info_rec_size; __u32 nr_prog_tags; __aligned_u64 prog_tags; + __u64 run_time_ns; + __u64 run_cnt; } __attribute__((aligned(8))); struct bpf_map_info { diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c index ef88b167959d..3f08c257858e 100644 --- a/kernel/bpf/core.c +++ b/kernel/bpf/core.c @@ -78,7 +78,7 @@ void *bpf_internal_load_pointer_neg_helper(const struct sk_buff *skb, int k, uns return NULL; } -struct bpf_prog *bpf_prog_alloc(unsigned int size, gfp_t gfp_extra_flags) +struct bpf_prog *bpf_prog_alloc_no_stats(unsigned int size, gfp_t gfp_extra_flags) { gfp_t gfp_flags = GFP_KERNEL | __GFP_ZERO | gfp_extra_flags; struct bpf_prog_aux *aux; @@ -104,6 +104,32 @@ struct bpf_prog *bpf_prog_alloc(unsigned int size, gfp_t gfp_extra_flags) return fp; } + +struct bpf_prog *bpf_prog_alloc(unsigned int size, gfp_t gfp_extra_flags) +{ + gfp_t gfp_flags = GFP_KERNEL | __GFP_ZERO | gfp_extra_flags; + struct bpf_prog *prog; + int cpu; + + prog = bpf_prog_alloc_no_stats(size, gfp_extra_flags); + if (!prog) + return NULL; + + prog->aux->stats = alloc_percpu_gfp(struct bpf_prog_stats, gfp_flags); + if (!prog->aux->stats) { + kfree(prog->aux); + vfree(prog); + return NULL; + } + + for_each_possible_cpu(cpu) { + struct bpf_prog_stats *pstats; + + pstats = per_cpu_ptr(prog->aux->stats, cpu); + u64_stats_init(&pstats->syncp); + } + return prog; +} EXPORT_SYMBOL_GPL(bpf_prog_alloc); int bpf_prog_alloc_jited_linfo(struct bpf_prog *prog) @@ -231,7 +257,10 @@ struct bpf_prog *bpf_prog_realloc(struct bpf_prog *fp_old, unsigned int size, void __bpf_prog_free(struct bpf_prog *fp) { - kfree(fp->aux); + if (fp->aux) { + free_percpu(fp->aux->stats); + kfree(fp->aux); + } vfree(fp); } @@ -2069,6 +2098,10 @@ int __weak skb_copy_bits(const struct sk_buff *skb, int offset, void *to, return -EFAULT; } +DEFINE_STATIC_KEY_FALSE(bpf_stats_enabled_key); +EXPORT_SYMBOL(bpf_stats_enabled_key); +int sysctl_bpf_stats_enabled __read_mostly; + /* All definitions of tracepoints related to BPF.
*/ #define CREATE_TRACE_POINTS #include diff --git a/kernel/bpf/map_in_map.c b/kernel/bpf/map_in_map.c index 583346a0ab29..3dff41403583 100644 --- a/kernel/bpf/map_in_map.c +++ b/kernel/bpf/map_in_map.c @@ -58,6 +58,7 @@ struct bpf_map *bpf_map_meta_alloc(int inner_map_ufd) inner_map_meta->value_size = inner_map->value_size; inner_map_meta->map_flags = inner_map->map_flags; inner_map_meta->max_entries = inner_map->max_entries; + inner_map_meta->spin_lock_off = inner_map->spin_lock_off; /* Misc members not needed in bpf_map_meta_equal() check. */ inner_map_meta->ops = inner_map->ops; diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c index 797a99c7493e..bc34cf9fe9ee 100644 --- a/kernel/bpf/syscall.c +++ b/kernel/bpf/syscall.c @@ -1283,24 +1283,54 @@ static int bpf_prog_release(struct inode *inode, struct file *filp) return 0; } +static void bpf_prog_get_stats(const struct bpf_prog *prog, + struct bpf_prog_stats *stats) +{ + u64 nsecs = 0, cnt = 0; + int cpu; + + for_each_possible_cpu(cpu) { + const struct bpf_prog_stats *st; + unsigned int start; + u64 tnsecs, tcnt; + + st = per_cpu_ptr(prog->aux->stats, cpu); + do { + start = u64_stats_fetch_begin_irq(&st->syncp); + tnsecs = st->nsecs; + tcnt = st->cnt; + } while (u64_stats_fetch_retry_irq(&st->syncp, start)); + nsecs += tnsecs; + cnt += tcnt; + } + stats->nsecs = nsecs; + stats->cnt = cnt; +} + #ifdef CONFIG_PROC_FS static void bpf_prog_show_fdinfo(struct seq_file *m, struct file *filp) { const struct bpf_prog *prog = filp->private_data; char prog_tag[sizeof(prog->tag) * 2 + 1] = { }; + struct bpf_prog_stats stats; + bpf_prog_get_stats(prog, &stats); bin2hex(prog_tag, prog->tag, sizeof(prog->tag)); seq_printf(m, "prog_type:\t%u\n" "prog_jited:\t%u\n" "prog_tag:\t%s\n" "memlock:\t%llu\n" - "prog_id:\t%u\n", + "prog_id:\t%u\n" + "run_time_ns:\t%llu\n" + "run_cnt:\t%llu\n", prog->type, prog->jited, prog_tag, prog->pages * 1ULL << PAGE_SHIFT, - prog->aux->id); + prog->aux->id, + stats.nsecs, + stats.cnt); } #endif @@ -2122,6 +2152,7 @@ static int bpf_prog_get_info_by_fd(struct bpf_prog *prog, struct bpf_prog_info __user *uinfo = u64_to_user_ptr(attr->info.info); struct bpf_prog_info info = {}; u32 info_len = attr->info.info_len; + struct bpf_prog_stats stats; char __user *uinsns; u32 ulen; int err; @@ -2161,6 +2192,10 @@ static int bpf_prog_get_info_by_fd(struct bpf_prog *prog, if (err) return err; + bpf_prog_get_stats(prog, &stats); + info.run_time_ns = stats.nsecs; + info.run_cnt = stats.cnt; + if (!capable(CAP_SYS_ADMIN)) { info.jited_prog_len = 0; info.xlated_prog_len = 0; diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index ebc3b264aa4d..a7b96bf0e654 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -7320,7 +7320,12 @@ static int jit_subprogs(struct bpf_verifier_env *env) subprog_end = env->subprog_info[i + 1].start; len = subprog_end - subprog_start; - func[i] = bpf_prog_alloc(bpf_prog_size(len), GFP_USER); + /* BPF_PROG_RUN doesn't call subprogs directly, + * hence main prog stats include the runtime of subprogs. 
+ * subprogs don't have IDs and not reachable via prog_get_next_id + * func[i]->aux->stats will never be accessed and stays NULL + */ + func[i] = bpf_prog_alloc_no_stats(bpf_prog_size(len), GFP_USER); if (!func[i]) goto out_free; memcpy(func[i]->insnsi, &prog->insnsi[subprog_start], diff --git a/kernel/sched/core.c b/kernel/sched/core.c index d8d76a65cfdd..7cbb5658be80 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -6162,6 +6162,34 @@ void ___might_sleep(const char *file, int line, int preempt_offset) add_taint(TAINT_WARN, LOCKDEP_STILL_OK); } EXPORT_SYMBOL(___might_sleep); + +void __cant_sleep(const char *file, int line, int preempt_offset) +{ + static unsigned long prev_jiffy; + + if (irqs_disabled()) + return; + + if (!IS_ENABLED(CONFIG_PREEMPT_COUNT)) + return; + + if (preempt_count() > preempt_offset) + return; + + if (time_before(jiffies, prev_jiffy + HZ) && prev_jiffy) + return; + prev_jiffy = jiffies; + + printk(KERN_ERR "BUG: assuming atomic context at %s:%d\n", file, line); + printk(KERN_ERR "in_atomic(): %d, irqs_disabled(): %d, pid: %d, name: %s\n", + in_atomic(), irqs_disabled(), + current->pid, current->comm); + + debug_show_held_locks(current); + dump_stack(); + add_taint(TAINT_WARN, LOCKDEP_STILL_OK); +} +EXPORT_SYMBOL_GPL(__cant_sleep); #endif #ifdef CONFIG_MAGIC_SYSRQ diff --git a/kernel/seccomp.c b/kernel/seccomp.c index e815781ed751..a43c601ac252 100644 --- a/kernel/seccomp.c +++ b/kernel/seccomp.c @@ -267,6 +267,7 @@ static u32 seccomp_run_filters(const struct seccomp_data *sd, * All filters in the list are evaluated and the lowest BPF return * value always takes priority (ignoring the DATA). */ + preempt_disable(); for (; f; f = f->prev) { u32 cur_ret = BPF_PROG_RUN(f->prog, sd); @@ -275,6 +276,7 @@ static u32 seccomp_run_filters(const struct seccomp_data *sd, *match = f; } } + preempt_enable(); return ret; } #endif /* CONFIG_SECCOMP_FILTER */ diff --git a/kernel/sysctl.c b/kernel/sysctl.c index ba4d9e85feb8..7578e21a711b 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -224,6 +224,11 @@ static int proc_dostring_coredump(struct ctl_table *table, int write, #endif static int proc_dopipe_max_size(struct ctl_table *table, int write, void __user *buffer, size_t *lenp, loff_t *ppos); +#ifdef CONFIG_BPF_SYSCALL +static int proc_dointvec_minmax_bpf_stats(struct ctl_table *table, int write, + void __user *buffer, size_t *lenp, + loff_t *ppos); +#endif #ifdef CONFIG_MAGIC_SYSRQ /* Note: sysrq code uses its own private copy */ @@ -1229,6 +1234,15 @@ static struct ctl_table kern_table[] = { .extra1 = &one, .extra2 = &one, }, + { + .procname = "bpf_stats_enabled", + .data = &sysctl_bpf_stats_enabled, + .maxlen = sizeof(sysctl_bpf_stats_enabled), + .mode = 0644, + .proc_handler = proc_dointvec_minmax_bpf_stats, + .extra1 = &zero, + .extra2 = &one, + }, #endif #if defined(CONFIG_TREE_RCU) || defined(CONFIG_PREEMPT_RCU) { @@ -3260,6 +3274,29 @@ int proc_doulongvec_ms_jiffies_minmax(struct ctl_table *table, int write, #endif /* CONFIG_PROC_SYSCTL */ +#ifdef CONFIG_BPF_SYSCALL +static int proc_dointvec_minmax_bpf_stats(struct ctl_table *table, int write, + void __user *buffer, size_t *lenp, + loff_t *ppos) +{ + int ret, bpf_stats = *(int *)table->data; + struct ctl_table tmp = *table; + + if (write && !capable(CAP_SYS_ADMIN)) + return -EPERM; + + tmp.data = &bpf_stats; + ret = proc_dointvec_minmax(&tmp, write, buffer, lenp, ppos); + if (write && !ret) { + *(int *)table->data = bpf_stats; + if (bpf_stats) + static_branch_enable(&bpf_stats_enabled_key); + else + 
static_branch_disable(&bpf_stats_enabled_key); + } + return ret; +} +#endif /* * No sense putting this after each symbol definition, twice, * exception granted :-) diff --git a/lib/test_bpf.c b/lib/test_bpf.c index f3e570722a7e..0845f635f404 100644 --- a/lib/test_bpf.c +++ b/lib/test_bpf.c @@ -6668,12 +6668,14 @@ static int __run_one(const struct bpf_prog *fp, const void *data, u64 start, finish; int ret = 0, i; + preempt_disable(); start = ktime_get_ns(); for (i = 0; i < runs; i++) ret = BPF_PROG_RUN(fp, data); finish = ktime_get_ns(); + preempt_enable(); *duration = finish - start; do_div(*duration, runs); diff --git a/net/bpf/test_run.c b/net/bpf/test_run.c index 7f62d168411d..da7051d62727 100644 --- a/net/bpf/test_run.c +++ b/net/bpf/test_run.c @@ -296,31 +296,45 @@ int bpf_prog_test_run_flow_dissector(struct bpf_prog *prog, if (!repeat) repeat = 1; + rcu_read_lock(); + preempt_disable(); time_start = ktime_get_ns(); for (i = 0; i < repeat; i++) { - preempt_disable(); - rcu_read_lock(); retval = __skb_flow_bpf_dissect(prog, skb, &flow_keys_dissector, &flow_keys); - rcu_read_unlock(); - preempt_enable(); + + if (signal_pending(current)) { + preempt_enable(); + rcu_read_unlock(); + + ret = -EINTR; + goto out; + } if (need_resched()) { - if (signal_pending(current)) - break; time_spent += ktime_get_ns() - time_start; + preempt_enable(); + rcu_read_unlock(); + cond_resched(); + + rcu_read_lock(); + preempt_disable(); time_start = ktime_get_ns(); } } time_spent += ktime_get_ns() - time_start; + preempt_enable(); + rcu_read_unlock(); + do_div(time_spent, repeat); duration = time_spent > U32_MAX ? U32_MAX : (u32)time_spent; ret = bpf_test_finish(kattr, uattr, &flow_keys, sizeof(flow_keys), retval, duration); +out: kfree_skb(skb); kfree(sk); return ret; diff --git a/net/core/filter.c b/net/core/filter.c index 5132c054c981..5ceba98069d4 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -5422,6 +5422,32 @@ static const struct bpf_func_proto bpf_tcp_sock_proto = { .arg1_type = ARG_PTR_TO_SOCK_COMMON, }; +BPF_CALL_1(bpf_skb_ecn_set_ce, struct sk_buff *, skb) +{ + unsigned int iphdr_len; + + if (skb->protocol == cpu_to_be16(ETH_P_IP)) + iphdr_len = sizeof(struct iphdr); + else if (skb->protocol == cpu_to_be16(ETH_P_IPV6)) + iphdr_len = sizeof(struct ipv6hdr); + else + return 0; + + if (skb_headlen(skb) < iphdr_len) + return 0; + + if (skb_cloned(skb) && !skb_clone_writable(skb, iphdr_len)) + return 0; + + return INET_ECN_set_ce(skb); +} + +static const struct bpf_func_proto bpf_skb_ecn_set_ce_proto = { + .func = bpf_skb_ecn_set_ce, + .gpl_only = false, + .ret_type = RET_INTEGER, + .arg1_type = ARG_PTR_TO_CTX, +}; #endif /* CONFIG_INET */ bool bpf_helper_changes_pkt_data(void *func) @@ -5581,6 +5607,8 @@ cg_skb_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) #ifdef CONFIG_INET case BPF_FUNC_tcp_sock: return &bpf_tcp_sock_proto; + case BPF_FUNC_skb_ecn_set_ce: + return &bpf_skb_ecn_set_ce_proto; #endif default: return sk_filter_func_proto(func_id, prog); @@ -6275,6 +6303,7 @@ static bool tc_cls_act_is_valid_access(int off, int size, case bpf_ctx_range(struct __sk_buff, tc_classid): case bpf_ctx_range_till(struct __sk_buff, cb[0], cb[4]): case bpf_ctx_range(struct __sk_buff, tstamp): + case bpf_ctx_range(struct __sk_buff, queue_mapping): break; default: return false; @@ -6679,9 +6708,18 @@ static u32 bpf_convert_ctx_access(enum bpf_access_type type, break; case offsetof(struct __sk_buff, queue_mapping): - *insn++ = BPF_LDX_MEM(BPF_H, si->dst_reg, si->src_reg, - 
bpf_target_off(struct sk_buff, queue_mapping, 2, - target_size)); + if (type == BPF_WRITE) { + *insn++ = BPF_JMP_IMM(BPF_JGE, si->src_reg, NO_QUEUE_MAPPING, 1); + *insn++ = BPF_STX_MEM(BPF_H, si->dst_reg, si->src_reg, + bpf_target_off(struct sk_buff, + queue_mapping, + 2, target_size)); + } else { + *insn++ = BPF_LDX_MEM(BPF_H, si->dst_reg, si->src_reg, + bpf_target_off(struct sk_buff, + queue_mapping, + 2, target_size)); + } break; case offsetof(struct __sk_buff, vlan_present): diff --git a/samples/bpf/.gitignore b/samples/bpf/.gitignore index 8ae4940025f8..dbb817dbacfc 100644 --- a/samples/bpf/.gitignore +++ b/samples/bpf/.gitignore @@ -1,7 +1,6 @@ cpustat fds_example lathist -load_sock_ops lwt_len_hist map_perf_test offwaketime diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile index a0ef7eddd0b3..65e667bdf979 100644 --- a/samples/bpf/Makefile +++ b/samples/bpf/Makefile @@ -40,7 +40,6 @@ hostprogs-y += lwt_len_hist hostprogs-y += xdp_tx_iptunnel hostprogs-y += test_map_in_map hostprogs-y += per_socket_stats_example -hostprogs-y += load_sock_ops hostprogs-y += xdp_redirect hostprogs-y += xdp_redirect_map hostprogs-y += xdp_redirect_cpu @@ -53,6 +52,7 @@ hostprogs-y += xdpsock hostprogs-y += xdp_fwd hostprogs-y += task_fd_query hostprogs-y += xdp_sample_pkts +hostprogs-y += hbm # Libbpf dependencies LIBBPF = $(TOOLS_PATH)/lib/bpf/libbpf.a @@ -60,9 +60,9 @@ LIBBPF = $(TOOLS_PATH)/lib/bpf/libbpf.a CGROUP_HELPERS := ../../tools/testing/selftests/bpf/cgroup_helpers.o TRACE_HELPERS := ../../tools/testing/selftests/bpf/trace_helpers.o -fds_example-objs := bpf_load.o fds_example.o -sockex1-objs := bpf_load.o sockex1_user.o -sockex2-objs := bpf_load.o sockex2_user.o +fds_example-objs := fds_example.o +sockex1-objs := sockex1_user.o +sockex2-objs := sockex2_user.o sockex3-objs := bpf_load.o sockex3_user.o tracex1-objs := bpf_load.o tracex1_user.o tracex2-objs := bpf_load.o tracex2_user.o @@ -71,7 +71,6 @@ tracex4-objs := bpf_load.o tracex4_user.o tracex5-objs := bpf_load.o tracex5_user.o tracex6-objs := bpf_load.o tracex6_user.o tracex7-objs := bpf_load.o tracex7_user.o -load_sock_ops-objs := bpf_load.o load_sock_ops.o test_probe_write_user-objs := bpf_load.o test_probe_write_user_user.o trace_output-objs := bpf_load.o trace_output_user.o $(TRACE_HELPERS) lathist-objs := bpf_load.o lathist_user.o @@ -109,6 +108,7 @@ xdpsock-objs := xdpsock_user.o xdp_fwd-objs := xdp_fwd_user.o task_fd_query-objs := bpf_load.o task_fd_query_user.o $(TRACE_HELPERS) xdp_sample_pkts-objs := xdp_sample_pkts_user.o $(TRACE_HELPERS) +hbm-objs := bpf_load.o hbm.o $(CGROUP_HELPERS) # Tell kbuild to always build the programs always := $(hostprogs-y) @@ -163,10 +163,10 @@ always += xdp2skb_meta_kern.o always += syscall_tp_kern.o always += cpustat_kern.o always += xdp_adjust_tail_kern.o -always += xdpsock_kern.o always += xdp_fwd_kern.o always += task_fd_query_kern.o always += xdp_sample_pkts_kern.o +always += hbm_out_kern.o KBUILD_HOSTCFLAGS += -I$(objtree)/usr/include KBUILD_HOSTCFLAGS += -I$(srctree)/tools/lib/ @@ -266,6 +266,8 @@ $(BPF_SAMPLES_PATH)/*.c: verify_target_bpf $(LIBBPF) $(src)/*.c: verify_target_bpf $(LIBBPF) $(obj)/tracex5_kern.o: $(obj)/syscall_nrs.h +$(obj)/hbm_out_kern.o: $(src)/hbm.h $(src)/hbm_kern.h +$(obj)/hbm.o: $(src)/hbm.h # asm/sysreg.h - inline assembly used by it is incompatible with llvm. 
# But, there is no easy way to fix it, so just exclude it since it is diff --git a/samples/bpf/do_hbm_test.sh b/samples/bpf/do_hbm_test.sh new file mode 100755 index 000000000000..56c8b4115c95 --- /dev/null +++ b/samples/bpf/do_hbm_test.sh @@ -0,0 +1,436 @@ +#!/bin/bash +# SPDX-License-Identifier: GPL-2.0 +# +# Copyright (c) 2019 Facebook +# +# This program is free software; you can redistribute it and/or +# modify it under the terms of version 2 of the GNU General Public +# License as published by the Free Software Foundation. + +Usage() { + echo "Script for testing HBM (Host Bandwidth Manager) framework." + echo "It creates a cgroup to use for testing and loads a BPF program to limit" + echo "egress or ingress bandwidth. It then uses iperf3 or netperf to create" + echo "loads. The output is the goodput in Mbps (unless -D was used)." + echo "" + echo "USAGE: $name [out] [-b=|--bpf=] [-c=|--cc=] [-D]" + echo " [-d=|--delay=] [--debug] [-E]" + echo " [-f=<#flows>|--flows=<#flows>] [-h] [-i=|--id=]" + echo " [-l] [-N] [-p=|--port=] [-P]" + echo " [-q=] [-R] [-s=|--server=|--time=