Kumar Kartikeya Dwivedi says:
====================
This set enables storing pointers of a certain type in BPF map, and extends the
verifier to enforce type safety and lifetime correctness properties.
The infrastructure being added is generic enough for allowing storing any kind
of pointers whose type is available using BTF (user or kernel) in the future
(e.g. strongly typed memory allocation in BPF program), which are internally
tracked in the verifier as PTR_TO_BTF_ID, but for now the series limits them to
two kinds of pointers obtained from the kernel.
Obviously, use of this feature depends on map BTF.
1. Unreferenced kernel pointer
In this case, there are very few restrictions. The pointer type being stored
must match the type declared in the map value. However, such a pointer when
loaded from the map can only be dereferenced, but not passed to any in-kernel
helpers or kernel functions available to the program. This is because while the
verifier's exception handling mechanism coverts BPF_LDX to PROBE_MEM loads,
which are then handled specially by the JIT implementation, the same liberty is
not available to accesses inside the kernel. The pointer by the time it is
passed into a helper has no lifetime related guarantees about the object it is
pointing to, and may well be referencing invalid memory.
2. Referenced kernel pointer
This case imposes a lot of restrictions on the programmer, to ensure safety. To
transfer the ownership of a reference in the BPF program to the map, the user
must use the bpf_kptr_xchg helper, which returns the old pointer contained in
the map, as an acquired reference, and releases verifier state for the
referenced pointer being exchanged, as it moves into the map.
This a normal PTR_TO_BTF_ID that can be used with in-kernel helpers and kernel
functions callable by the program.
However, if BPF_LDX is used to load a referenced pointer from the map, it is
still not permitted to pass it to in-kernel helpers or kernel functions. To
obtain a reference usable with helpers, the user must invoke a kfunc helper
which returns a usable reference (which also must be eventually released before
BPF_EXIT, or moved into a map).
Since the load of the pointer (preserving data dependency ordering) must happen
inside the RCU read section, the kfunc helper will take a pointer to the map
value, which must point to the actual pointer of the object whose reference is
to be raised. The type will be verified from the BTF information of the kfunc,
as the prototype must be:
T *func(T **, ... /* other arguments */);
Then, the verifier checks whether pointer at offset of the map value points to
the type T, and permits the call.
This convention is followed so that such helpers may also be called from
sleepable BPF programs, where RCU read lock is not necessarily held in the BPF
program context, hence necessiating the need to pass in a pointer to the actual
pointer to perform the load inside the RCU read section.
Notes
-----
* C selftests require https://reviews.llvm.org/D119799 to pass.
* Unlike BPF timers, kptr is not reset or freed on map_release_uref.
* Referenced kptr storage is always treated as unsigned long * on kernel side,
as BPF side cannot mutate it. The storage (8 bytes) is sufficient for both
32-bit and 64-bit platforms.
* Use of WRITE_ONCE to reset unreferenced kptr on 32-bit systems is fine, as
the actual pointer is always word sized, so the store tearing into two 32-bit
stores won't be a problem as the other half is always zeroed out.
Changelog:
----------
v5 -> v6
v5: https://lore.kernel.org/bpf/20220415160354.1050687-1-memxor@gmail.com
* Address comments from Alexei
* Drop 'Revisit stack usage' comment
* Rename off_btf to kernel_btf
* Add comment about searching using type from map BTF
* Do kmemdup + btf_get instead of get + kmemdup + put
* Add comment for btf_struct_ids_match
* Add comment for assigning non-zero id for mark_ptr_or_null_reg
* Rename PTR_RELEASE to OBJ_RELEASE
* Rename BPF_MAP_OFF_DESC_TYPE_XXX_KPTR to BPF_KPTR_XXX
* Remove unneeded likely/unlikely in cold functions
* Fix other misc nits
* Keep release_regno instead of replacing with bool + regno
* Add a patch to prevent type match for first member when off == 0 for
release functions (kfunc + BPF helpers)
* Guard kptr/kptr_ref definition in libbpf header with __has_attribute
to prevent selftests compilation error with old clang not support
type tags
v4 -> v5
v4: https://lore.kernel.org/bpf/20220409093303.499196-1-memxor@gmail.com
* Address comments from Joanne
* Move __btf_member_bit_offset before strcmp
* Move strcmp conditional on name to unref kptr patch
* Directly return from btf_find_struct in patch 1
* Use enum btf_field_type vs int field_type
* Put btf and btf_id in off_desc in named struct 'kptr'
* Switch order for BTF_FIELD_IGNORE check
* Drop dead tab->nr_off = 0 store
* Use i instead of tab->nr_off to btf_put on failure
* Replace kzalloc + memcpy with kmemdup (kernel test robot)
* Reject both BPF_F_RDONLY_PROG and BPF_F_WRONLY_PROG
* Add logging statement for reject BPF_MODE(insn->code) != BPF_MEM
* Rename off_desc -> kptr_off_desc in check_mem_access
* Drop check for err, fallthrough to end of function
* Remove is_release_function, use meta.release_regno to detect release
function, release reference state, and remove check_release_regno
* Drop off_desc->flags, use off_desc->type
* Update comment for ARG_PTR_TO_KPTR
* Distinguish between direct/indirect access to kptr
* Drop check_helper_mem_access from process_kptr_func, check_mem_reg in kptr_get
* Add verifier test for helper accessing kptr indirectly
* Fix other misc nits, add Acked-by for patch 2
v3 -> v4
v3: https://lore.kernel.org/bpf/20220320155510.671497-1-memxor@gmail.com
* Use btf_parse_kptrs, plural kptrs naming (Joanne, Andrii)
* Remove unused parameters in check_map_kptr_access (Joanne)
* Handle idx < info_cnt kludge using tmp variable (Andrii)
* Validate tags always precede modifiers in BTF (Andrii)
* Split out into https://lore.kernel.org/bpf/20220406004121.282699-1-memxor@gmail.com
* Store u32 type_id in btf_field_info (Andrii)
* Use base_type in map_kptr_match_type (Andrii)
* Free kptr_off_tab when not bpf_capable (Martin)
* Use PTR_RELEASE flag instead of bools in bpf_func_proto (Joanne)
* Drop extra reg->off and reg->ref_obj_id checks in map_kptr_match_type (Martin)
* Use separate u32 and u8 arrays for offs and sizes in off_arr (Andrii)
* Simplify and remove map->value_size sentinel in copy_map_value (Andrii)
* Use sort_r to keep both arrays in sync while sorting (Andrii)
* Rename check_and_free_timers_and_kptr to check_and_free_fields (Andrii)
* Move dtor prototype checks to registration phase (Alexei)
* Use ret variable for checking ASSERT_XXX, use shorter strings (Andrii)
* Fix missing checks for other maps (Jiri)
* Fix various other nits, and bugs noticed during self review
v2 -> v3
v2: https://lore.kernel.org/bpf/20220317115957.3193097-1-memxor@gmail.com
* Address comments from Alexei
* Set name, sz, align in btf_find_field
* Do idx >= info_cnt check in caller of btf_find_field_*
* Use extra element in the info_arr to make this safe
* Remove while loop, reject extra tags
* Remove cases of defensive programming
* Move bpf_capable() check to map_check_btf
* Put check_ptr_off_reg reordering hunk into separate patch
* Warn for ref_ptr once
* Make the meta.ref_obj_id == 0 case simpler to read
* Remove kptr_percpu and kptr_user support, remove their tests
* Store size of field at offset in off_arr
* Fix BPF_F_NO_PREALLOC set wrongly for hash map in C selftest
* Add missing check_mem_reg call for kptr_get kfunc arg#0 check
v1 -> v2
v1: https://lore.kernel.org/bpf/20220220134813.3411982-1-memxor@gmail.com
* Address comments from Alexei
* Rename bpf_btf_find_by_name_kind_all to bpf_find_btf_id
* Reduce indentation level in that function
* Always take reference regardless of module or vmlinux BTF
* Also made it the same for btf_get_module_btf
* Use kptr, kptr_ref, kptr_percpu, kptr_user type tags
* Don't reserve tag namespace
* Refactor btf_find_field to be side effect free, allocate and populate
kptr_off_tab in caller
* Move module reference to dtor patch
* Remove support for BPF_XCHG, BPF_CMPXCHG insn
* Introduce bpf_kptr_xchg helper
* Embed offset array in struct bpf_map, populate and sort it once
* Adjust copy_map_value to memcpy directly using this offset array
* Removed size member from offset array to save space
* Fix some problems pointed out by kernel test robot
* Tidy selftests
* Lots of other minor fixes
====================
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
This uses the __kptr and __kptr_ref macros as well, and tries to test
the stuff that is supposed to work, since we have negative tests in
test_verifier suite. Also include some code to test map-in-map support,
such that the inner_map_meta matches the kptr_off_tab of map added as
element.
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20220424214901.2743946-12-memxor@gmail.com
The current of behavior of btf_struct_ids_match for release arguments is
that when type match fails, it retries with first member type again
(recursively). Since the offset is already 0, this is akin to just
casting the pointer in normal C, since if type matches it was just
embedded inside parent sturct as an object. However, we want to reject
cases for release function type matching, be it kfunc or BPF helpers.
An example is the following:
struct foo {
struct bar b;
};
struct foo *v = acq_foo();
rel_bar(&v->b); // btf_struct_ids_match fails btf_types_are_same, then
// retries with first member type and succeeds, while
// it should fail.
Hence, don't walk the struct and only rely on btf_types_are_same for
strict mode. All users of strict mode must be dealing with zero offset
anyway, since otherwise they would want the struct to be walked.
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20220424214901.2743946-10-memxor@gmail.com
We introduce a new style of kfunc helpers, namely *_kptr_get, where they
take pointer to the map value which points to a referenced kernel
pointer contained in the map. Since this is referenced, only
bpf_kptr_xchg from BPF side and xchg from kernel side is allowed to
change the current value, and each pointer that resides in that location
would be referenced, and RCU protected (this must be kept in mind while
adding kernel types embeddable as reference kptr in BPF maps).
This means that if do the load of the pointer value in an RCU read
section, and find a live pointer, then as long as we hold RCU read lock,
it won't be freed by a parallel xchg + release operation. This allows us
to implement a safe refcount increment scheme. Hence, enforce that first
argument of all such kfunc is a proper PTR_TO_MAP_VALUE pointing at the
right offset to referenced pointer.
For the rest of the arguments, they are subjected to typical kfunc
argument checks, hence allowing some flexibility in passing more intent
into how the reference should be taken.
For instance, in case of struct nf_conn, it is not freed until RCU grace
period ends, but can still be reused for another tuple once refcount has
dropped to zero. Hence, a bpf_ct_kptr_get helper not only needs to call
refcount_inc_not_zero, but also do a tuple match after incrementing the
reference, and when it fails to match it, put the reference again and
return NULL.
This can be implemented easily if we allow passing additional parameters
to the bpf_ct_kptr_get kfunc, like a struct bpf_sock_tuple * and a
tuple__sz pair.
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20220424214901.2743946-9-memxor@gmail.com
A destructor kfunc can be defined as void func(type *), where type may
be void or any other pointer type as per convenience.
In this patch, we ensure that the type is sane and capture the function
pointer into off_desc of ptr_off_tab for the specific pointer offset,
with the invariant that the dtor pointer is always set when 'kptr_ref'
tag is applied to the pointer's pointee type, which is indicated by the
flag BPF_MAP_VALUE_OFF_F_REF.
Note that only BTF IDs whose destructor kfunc is registered, thus become
the allowed BTF IDs for embedding as referenced kptr. Hence it serves
the purpose of finding dtor kfunc BTF ID, as well acting as a check
against the whitelist of allowed BTF IDs for this purpose.
Finally, wire up the actual freeing of the referenced pointer if any at
all available offsets, so that no references are leaked after the BPF
map goes away and the BPF program previously moved the ownership a
referenced pointer into it.
The behavior is similar to BPF timers, where bpf_map_{update,delete}_elem
will free any existing referenced kptr. The same case is with LRU map's
bpf_lru_push_free/htab_lru_push_free functions, which are extended to
reset unreferenced and free referenced kptr.
Note that unlike BPF timers, kptr is not reset or freed when map uref
drops to zero.
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20220424214901.2743946-8-memxor@gmail.com
To support storing referenced PTR_TO_BTF_ID in maps, we require
associating a specific BTF ID with a 'destructor' kfunc. This is because
we need to release a live referenced pointer at a certain offset in map
value from the map destruction path, otherwise we end up leaking
resources.
Hence, introduce support for passing an array of btf_id, kfunc_btf_id
pairs that denote a BTF ID and its associated release function. Then,
add an accessor 'btf_find_dtor_kfunc' which can be used to look up the
destructor kfunc of a certain BTF ID. If found, we can use it to free
the object from the map free path.
The registration of these pairs also serve as a whitelist of structures
which are allowed as referenced PTR_TO_BTF_ID in a BPF map, because
without finding the destructor kfunc, we will bail and return an error.
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20220424214901.2743946-7-memxor@gmail.com
Since now there might be at most 10 offsets that need handling in
copy_map_value, the manual shuffling and special case is no longer going
to work. Hence, let's generalise the copy_map_value function by using
a sorted array of offsets to skip regions that must be avoided while
copying into and out of a map value.
When the map is created, we populate the offset array in struct map,
Then, copy_map_value uses this sorted offset array is used to memcpy
while skipping timer, spin lock, and kptr. The array is allocated as
in most cases none of these special fields would be present in map
value, hence we can save on space for the common case by not embedding
the entire object inside bpf_map struct.
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20220424214901.2743946-6-memxor@gmail.com
While we can guarantee that even for unreferenced kptr, the object
pointer points to being freed etc. can be handled by the verifier's
exception handling (normal load patching to PROBE_MEM loads), we still
cannot allow the user to pass these pointers to BPF helpers and kfunc,
because the same exception handling won't be done for accesses inside
the kernel. The same is true if a referenced pointer is loaded using
normal load instruction. Since the reference is not guaranteed to be
held while the pointer is used, it must be marked as untrusted.
Hence introduce a new type flag, PTR_UNTRUSTED, which is used to mark
all registers loading unreferenced and referenced kptr from BPF maps,
and ensure they can never escape the BPF program and into the kernel by
way of calling stable/unstable helpers.
In check_ptr_to_btf_access, the !type_may_be_null check to reject type
flags is still correct, as apart from PTR_MAYBE_NULL, only MEM_USER,
MEM_PERCPU, and PTR_UNTRUSTED may be set for PTR_TO_BTF_ID. The first
two are checked inside the function and rejected using a proper error
message, but we still want to allow dereference of untrusted case.
Also, we make sure to inherit PTR_UNTRUSTED when chain of pointers are
walked, so that this flag is never dropped once it has been set on a
PTR_TO_BTF_ID (i.e. trusted to untrusted transition can only be in one
direction).
In convert_ctx_accesses, extend the switch case to consider untrusted
PTR_TO_BTF_ID in addition to normal PTR_TO_BTF_ID for PROBE_MEM
conversion for BPF_LDX.
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20220424214901.2743946-5-memxor@gmail.com
Extending the code in previous commits, introduce referenced kptr
support, which needs to be tagged using 'kptr_ref' tag instead. Unlike
unreferenced kptr, referenced kptr have a lot more restrictions. In
addition to the type matching, only a newly introduced bpf_kptr_xchg
helper is allowed to modify the map value at that offset. This transfers
the referenced pointer being stored into the map, releasing the
references state for the program, and returning the old value and
creating new reference state for the returned pointer.
Similar to unreferenced pointer case, return value for this case will
also be PTR_TO_BTF_ID_OR_NULL. The reference for the returned pointer
must either be eventually released by calling the corresponding release
function, otherwise it must be transferred into another map.
It is also allowed to call bpf_kptr_xchg with a NULL pointer, to clear
the value, and obtain the old value if any.
BPF_LDX, BPF_STX, and BPF_ST cannot access referenced kptr. A future
commit will permit using BPF_LDX for such pointers, but attempt at
making it safe, since the lifetime of object won't be guaranteed.
There are valid reasons to enforce the restriction of permitting only
bpf_kptr_xchg to operate on referenced kptr. The pointer value must be
consistent in face of concurrent modification, and any prior values
contained in the map must also be released before a new one is moved
into the map. To ensure proper transfer of this ownership, bpf_kptr_xchg
returns the old value, which the verifier would require the user to
either free or move into another map, and releases the reference held
for the pointer being moved in.
In the future, direct BPF_XCHG instruction may also be permitted to work
like bpf_kptr_xchg helper.
Note that process_kptr_func doesn't have to call
check_helper_mem_access, since we already disallow rdonly/wronly flags
for map, which is what check_map_access_type checks, and we already
ensure the PTR_TO_MAP_VALUE refers to kptr by obtaining its off_desc,
so check_map_access is also not required.
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20220424214901.2743946-4-memxor@gmail.com
Add the 'zone_cap_mb' kernel module parameter. This parameter defines the
zone capacity. The zone capacity must be less than or equal to the zone
size.
Report that sequential write zones and gap zones are paired in the Zoned
Block Device Characteristics VPD page (page B6h).
This patch has been tested as follows:
modprobe scsi_debug delay=0 sector_size=512 dev_size_mb=128 zbc=host-managed zone_nr_conv=16 zone_size_mb=4 zone_cap_mb=3
modprobe brd rd_nr=1 rd_size=$((1<<20))
mkfs.f2fs -m /dev/ram0 -c /dev/${scsi_debug_dev}
mount /dev/ram0 /mnt
# Run a fio job that uses /mnt
Link: https://lore.kernel.org/r/20220421183023.3462291-10-bvanassche@acm.org
Cc: Douglas Gilbert <dgilbert@interlog.com>
Acked-by: Douglas Gilbert <dgilbert@interlog.com>
Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
[ bvanassche: Switched to reporting a constant zone starting LBA granularity ]
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
ZBC-2 allows host-managed disks to report gap zones. This allow zoned disks
to report an offset between data zone starts that is a power of two even if
the number of logical blocks with data per zone is not a power of two.
Another new feature in ZBC-2 is support for constant zone starting LBA
offsets. For zoned disks that report a constant zone starting LBA offset,
hide the gap zones from the block layer. Report the offset between data
zone starts as zone size and report the number of logical blocks with data
per zone as the zone capacity.
Link: https://lore.kernel.org/r/20220421183023.3462291-7-bvanassche@acm.org
Acked-by: Douglas Gilbert <dgilbert@interlog.com>
Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
[ bvanassche: Reworked this patch ]
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>