Alexei Starovoitov says:
====================
pull-request: bpf-next 2020-05-15
The following pull-request contains BPF updates for your *net-next* tree.
We've added 37 non-merge commits during the last 1 day(s) which contain
a total of 67 files changed, 741 insertions(+), 252 deletions(-).
The main changes are:
1) bpf_xdp_adjust_tail() now allows to grow the tail as well, from Jesper.
2) bpftool can probe CONFIG_HZ, from Daniel.
3) CAP_BPF is introduced to isolate user processes that use BPF infra and
to secure BPF networking services by dropping CAP_SYS_ADMIN requirement
in certain cases, from Alexei.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Allow DSA to add VLAN entries even if VLAN filtering is disabled, so
enabling it will not block the traffic of existent ports in the bridge
Signed-off-by: DENG Qingfang <dqfext@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vlad Buslov says:
====================
Implement classifier-action terse dump mode
Output rate of current upstream kernel TC filter dump implementation if
relatively low (~100k rules/sec depending on configuration). This
constraint impacts performance of software switch implementation that
rely on TC for their datapath implementation and periodically call TC
filter dump to update rules stats. Moreover, TC filter dump output a lot
of static data that don't change during the filter lifecycle (filter
key, specific action details, etc.) which constitutes significant
portion of payload on resulting netlink packets and increases amount of
syscalls necessary to dump all filters on particular Qdisc. In order to
significantly improve filter dump rate this patch sets implement new
mode of TC filter dump operation named "terse dump" mode. In this mode
only parameters necessary to identify the filter (handle, action cookie,
etc.) and data that can change during filter lifecycle (filter flags,
action stats, etc.) are preserved in dump output while everything else
is omitted.
Userspace API is implemented using new TCA_DUMP_FLAGS tlv with only
available flag value TCA_DUMP_FLAGS_TERSE. Internally, new API requires
individual classifier support (new tcf_proto_ops->terse_dump()
callback). Support for action terse dump is implemented in act API and
don't require changing individual action implementations.
The following table provides performance comparison between regular
filter dump and new terse dump mode for two classifier-action profiles:
one minimal config with L2 flower classifier and single gact action and
another heavier config with L2+5tuple flower classifier with
tunnel_key+mirred actions.
Classifier-action type | dump | terse dump | X improvement
| (rules/sec) | (rules/sec) |
-----------------------------+-------------+-------------+---------------
L2 with gact | 141.8 | 293.2 | 2.07
L2+5tuple tunnel_key+mirred | 76.4 | 198.8 | 2.60
Benchmark details: to measure the rate tc filter dump and terse dump
commands are invoked on ingress Qdisc that have one million filters
configured using following commands.
> time sudo tc -s filter show dev ens1f0 ingress >/dev/null
> time sudo tc -s filter show terse dev ens1f0 ingress >/dev/null
Value in results table is calculated by dividing 1000000 total rules by
"real" time reported by time command.
Setup details: 2x Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz, 32GB memory
====================
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
"$err" is a variable pointing to a temp file. "$out" is not: only used
as a local variable in "check()" and representing the output of a
command line.
Fixes: eedbc68532 (selftests: add PM netlink functional tests)
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Depending on the WRIOP version, the buffer size on the RX path must by a
multiple of 64 or 256. Handle this restriction properly by aligning down
the buffer size to the necessary value. Also, use the new buffer size
dynamically computed instead of the compile time one.
Fixes: 27c874867c ("dpaa2-eth: Use a single page per Rx buffer")
Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Implement two basic tests to verify terse dump functionality of flower
classifier:
- Test that verifies that terse dump works.
- Test that verifies that terse dump doesn't print filter key.
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Implement tcf_proto_ops->terse_dump() callback for flower classifier. Only
dump handle, flags and action data in terse mode.
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Extend tcf_action_dump() with boolean argument 'terse' that is used to
request terse-mode action dump. In terse mode only essential data needed to
identify particular action (action kind, cookie, etc.) and its stats is put
to resulting skb and everything else is omitted. Implement
tcf_exts_terse_dump() helper in cls API that is intended to be used to
request terse dump of all exts (actions) attached to the filter.
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add new TCA_DUMP_FLAGS attribute and use it in cls API to request terse
filter output from classifiers with TCA_DUMP_FLAGS_TERSE flag. This option
is intended to be used to improve performance of TC filter dump when
userland only needs to obtain stats and not the whole classifier/action
data. Extend struct tcf_proto_ops with new terse_dump() callback that must
be defined by supporting classifier implementations.
Support of the options in specific classifiers and actions is
implemented in following patches in the series.
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The assumption that a device node is associated either with the
netdev's device, or the parent of that device, does not hold for all
drivers. E.g. Freescale's DPAA has two layers of platform devices
above the netdev. Instead, recursively walk up the tree from the
netdev, allowing any parent to match against the sought after node.
Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix ADC access synchronization problem with da9052 driver
Fix temperature limit and status reporting in nct7904 driver
Fix drivetemp temperature reporting if SCT is supported
but SCT data tables are not.
-----BEGIN PGP SIGNATURE-----
iQIyBAABCAAdFiEEiHPvMQj9QTOCiqgVyx8mb86fmYEFAl6+s2EACgkQyx8mb86f
mYEJKg/yAmslfyCTM0d6p+O4Vzl7r1PWNjI1d1BxBXJfDxMorSeZqKIlcb5VlvNa
aNXq61NfQdTv2HyMTiKzSuCCu6OuMUmwPnVwO92U/tfEsxX7UNKtcU+XM7CfyPyf
USVadR8Uq9JawJRs0Sq5wNRcPxaTC7DIM9MyKLRdirBl/HWgyLhGH2UL+9we2xy8
fukj060uFxm55QbIQsLhMN1oJ4u2gwvZuyGZdiZ0QRIHrDlacZKrVFRrXmm27qUr
ZGN3gHKVFRkQU9NQdflait/EJdb/AeVJIjZGvP195SFg1DaRQL7q5Xh6BEipadRt
vSZAJP4G8YqgYGsqRlIMiyzlSZjvmioCRrUtbcK1ATCsHtz5KhbtUnACmzDiEA0c
86xEkQ8ucIb+VO2qMLZsjJSFu5SQeQHzIh5scxQSbQBXYmGY9cK0EkvvALlqItM6
/afBISGoAcsCxyZQ9u4bLhzHS/A6pmRTE2tWcRbFfKMU1bvcxYESdF19P5UFSiix
gUEdbyMh8lF3jAB/zqcMR+J/Jlm+UGM+gFlBuwdOiu/kdrHT/F36UVcKclVjmkKL
iKkD37fgm4X+vUo/fPCCU6ISSzsxWBBlb7IetVNsDEGdmHwuivrJN1kNX3ra2pUE
4aovpn0tcLSjdMWNhqTe3c3+rYIC/DL6ZRIiNpTsO1xfednhUw==
=YCH3
-----END PGP SIGNATURE-----
Merge tag 'hwmon-for-v5.7-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging
Pull hwmon fixes from Guenter Roeck:
- Fix ADC access synchronization problem with da9052 driver
- Fix temperature limit and status reporting in nct7904 driver
- Fix drivetemp temperature reporting if SCT is supported but SCT data
tables are not.
* tag 'hwmon-for-v5.7-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging:
hwmon: (da9052) Synchronize access with mfd
hwmon: (nct7904) Fix incorrect range of temperature limit registers
hwmon: (nct7904) Read all SMI status registers in probe function
hwmon: (drivetemp) Fix SCT support if SCT data tables are not supported
The things look good and calming down; the only change to ALSA core
is the fix for racy rawmidi buffer accesses spotted by syzkaller,
and the rest are all small device-specific quirks for HD-audio and
USB-audio devices.
-----BEGIN PGP SIGNATURE-----
iQJCBAABCAAsFiEEIXTw5fNLNI7mMiVaLtJE4w1nLE8FAl6+NX0OHHRpd2FpQHN1
c2UuZGUACgkQLtJE4w1nLE9P9g//ZE9BbJqlDTyVKToZC1knpIsLe2nBknhsUsZC
AfYYlThPELB+P5YoCN0YfrTRecSVsCfhckEN7LDXjL30G/8R6z7TUe5as/+3PX1I
31G31XAm8mQKURttkjjuc+ImzjG1aWSDJHUqH9S9NL4yskZ6ckuD62tOTuMZhofV
2rjNL8JSwv//OJ0ThQzBvH42i81z1Pji3opqCm6ONEHhJzcv1EfZ20Mx1OcTE2E9
32lDk3xaekd8Vq+ulYqvV9VmeMEd9ffZn/v01s89vSuSI8tT253e3wf84S7dR4U9
1bAU13h74CdCGHcxiRiLRaao/s5txljyEmTaTsdWcUqUvsZ54D7lFdMCkcFDbJJZ
cP9nahKCUK45n2ynUq/NrgNLDJXlyE2QKnGUWABPcP4im/oCllKZFmtGVjb4IWun
tVsHs1jLBg9mF43XeNZXW4fG7RMnfD3g/r2c6P9pfQg/3nJ7pywqkN0PA+vuw/OC
Ff5I1HQq5IqyDi30hQJpbKg6YfgZ61zNZTX7lndYXvQveZIveSkcWp6eKuO6iSAn
u3CdhYEponJ23QJo6XxT4Nh9P0jgsEmweeoCORV14RIMmufDUI4f3QGP1G9C3KPT
h1+T8hNLuRfE0CcW9ddHYLsn8K7yKJZtC0zsRKivssjK7rlgMYpGyS5NyukidS3h
xrPvnQw=
=HlsV
-----END PGP SIGNATURE-----
Merge tag 'sound-5.7-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound
Pull sound fixes from Takashi Iwai:
"Things look good and calming down; the only change to ALSA core is the
fix for racy rawmidi buffer accesses spotted by syzkaller, and the
rest are all small device-specific quirks for HD-audio and USB-audio
devices"
* tag 'sound-5.7-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
ALSA: hda/realtek - Limit int mic boost for Thinkpad T530
ALSA: hda/realtek - Add COEF workaround for ASUS ZenBook UX431DA
ALSA: hda/realtek: Enable headset mic of ASUS UX581LV with ALC295
ALSA: hda/realtek - Enable headset mic of ASUS UX550GE with ALC295
ALSA: hda/realtek - Enable headset mic of ASUS GL503VM with ALC295
ALSA: hda/realtek: Add quirk for Samsung Notebook
ALSA: rawmidi: Fix racy buffer resize under concurrent accesses
ALSA: usb-audio: add mapping for ASRock TRX40 Creator
ALSA: hda/realtek - Fix S3 pop noise on Dell Wyse
Revert "ALSA: hda/realtek: Fix pop noise on ALC225"
ALSA: firewire-lib: fix 'function sizeof not defined' error of tracepoints format
ALSA: usb-audio: Add control message quirk delay for Kingston HyperX headset
i915 (two weeks):
- Handle idling during i915_gem_evict_something busy loops (Chris)
- Mark current submissions with a weak-dependency (Chris)
- Propagate error from completed fences (Chris)
- Fixes on execlist to avoid GPU hang situation (Chris)
- Fixes couple deadlocks (Chris)
- Timeslice preemption fixes (Chris)
- Fix Display Port interrupt handling on Tiger Lake (Imre)
- Reduce debug noise around Frame Buffer Compression (Peter)
- Fix logic around IPC W/a for Coffee Lake and Kaby Lake (Sultan)
- Avoid dereferencing a dead context (Chris)
tegra:
- tegra120/4 smmu fixes
amdgpu:
- Clockgating fixes
- Fix fbdev with scatter/gather display
- S4 fix for navi
- Soft recovery for gfx10
- Freesync fixes
- Atomic check cursor fix
- Add a gfxoff quirk
- MST fix
amdkfd:
- Fix GEM reference counting
meson:
- error code propogation fix
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJevjILAAoJEAx081l5xIa+eGMQAJ13DoMXrJUdaH8XCsh77kLn
cQzV/pHFJP8yjHaU1BFebbzcOMTIrNFXn2kncPgCzaPwh8mBoWeYxjMT3eWeJNvy
CDEbBvPTL/j13Yq0H9jzp3LxAMI/atXyL/ujUI3AAEtDoMaMbJ5ai//4rHxoXeQ5
UDXNk9VarbTGKXovJK2/Wq0xNMrMspLQi85+uejF6FA6uztgfyCk+Me5WvuRk2vx
A5OFa3Gl+PhVqjwQ1pVwO8eii33YVosQ4TfnYMrDV1YnjB6Od8oOeUCUIio9lh2x
3hIlhAR1CELdx66U2BmGPd0ZXiE8CBXZ8Eh8JBoFIms3S88kIKypeZglnJnIV9cA
ELC4eXiXNViyHwhgq8+h6le3JXdiYJ+PvUVvKXtKw2N9w67pUP8q571jwA4PGnTv
9iAiXc55K9cBTbIiL4BE9zU/Ap7eMkMMFiQsQeOtXZb8PrhvlfVChDpfcESjh6uR
9qIg1JyRYiOLZv3UOiR1MJVvjOssX0YzUsO4riw+5W1uaMNCcBFG9d4YphrRJjBK
ReBu0YbI9ZwhMGldj8iXANdxqV7B2VELBquDg65ev6epFOw40skAqsIxAfAGRXO9
fI7OWX25TPHqbLWZNBoSyYfMbsbMwfwX+5j7Sg0cF2T/CYXlCd1rJgYye55ifE3b
wYC4wlYNKj9K0LQs8Rik
=b/ek
-----END PGP SIGNATURE-----
Merge tag 'drm-fixes-2020-05-15' of git://anongit.freedesktop.org/drm/drm
Pull drm fixes from Dave Airlie:
"As mentioned last week an i915 PR came in late, but I left it, so the
i915 bits of this cover 2 weeks, which is why it's likely a bit larger
than usual.
Otherwise it's mostly amdgpu fixes, one tegra fix, one meson fix.
i915:
- Handle idling during i915_gem_evict_something busy loops (Chris)
- Mark current submissions with a weak-dependency (Chris)
- Propagate error from completed fences (Chris)
- Fixes on execlist to avoid GPU hang situation (Chris)
- Fixes couple deadlocks (Chris)
- Timeslice preemption fixes (Chris)
- Fix Display Port interrupt handling on Tiger Lake (Imre)
- Reduce debug noise around Frame Buffer Compression (Peter)
- Fix logic around IPC W/a for Coffee Lake and Kaby Lake (Sultan)
- Avoid dereferencing a dead context (Chris)
tegra:
- tegra120/4 smmu fixes
amdgpu:
- Clockgating fixes
- Fix fbdev with scatter/gather display
- S4 fix for navi
- Soft recovery for gfx10
- Freesync fixes
- Atomic check cursor fix
- Add a gfxoff quirk
- MST fix
amdkfd:
- Fix GEM reference counting
meson:
- error code propogation fix"
* tag 'drm-fixes-2020-05-15' of git://anongit.freedesktop.org/drm/drm: (29 commits)
drm/i915: Handle idling during i915_gem_evict_something busy loops
drm/meson: pm resume add return errno branch
drm/amd/amdgpu: Update update_config() logic
drm/amd/amdgpu: add raven1 part to the gfxoff quirk list
drm/i915: Mark concurrent submissions with a weak-dependency
drm/i915: Propagate error from completed fences
drm/i915/gvt: Fix kernel oops for 3-level ppgtt guest
drm/i915/gvt: Init DPLL/DDI vreg for virtual display instead of inheritance.
drm/amd/display: add basic atomic check for cursor plane
drm/amd/display: Fix vblank and pageflip event handling for FreeSync
drm/amdgpu: implement soft_recovery for gfx10
drm/amdgpu: enable hibernate support on Navi1X
drm/amdgpu: Use GEM obj reference for KFD BOs
drm/amdgpu: force fbdev into vram
drm/amd/powerplay: perform PG ungate prior to CG ungate
drm/amdgpu: drop unnecessary cancel_delayed_work_sync on PG ungate
drm/amdgpu: disable MGCG/MGLS also on gfx CG ungate
drm/i915/execlists: Track inflight CCID
drm/i915/execlists: Avoid reusing the same logical CCID
drm/i915/gem: Remove object_is_locked assertion from unpin_from_display_plane
...
Alexei Starovoitov says:
====================
v6->v7:
- permit SK_REUSEPORT program type under CAP_BPF as suggested by Marek Majkowski.
It's equivalent to SOCKET_FILTER which is unpriv.
v5->v6:
- split allow_ptr_leaks into four flags.
- retain bpf_jit_limit under cap_sys_admin.
- fixed few other issues spotted by Daniel.
v4->v5:
Split BPF operations that are allowed under CAP_SYS_ADMIN into combination of
CAP_BPF, CAP_PERFMON, CAP_NET_ADMIN and keep some of them under CAP_SYS_ADMIN.
The user process has to have
- CAP_BPF to create maps, do other sys_bpf() commands and load SK_REUSEPORT progs.
Note: dev_map, sock_hash, sock_map map types still require CAP_NET_ADMIN.
That could be relaxed in the future.
- CAP_BPF and CAP_PERFMON to load tracing programs.
- CAP_BPF and CAP_NET_ADMIN to load networking programs.
(or CAP_SYS_ADMIN for backward compatibility).
CAP_BPF solves three main goals:
1. provides isolation to user space processes that drop CAP_SYS_ADMIN and switch to CAP_BPF.
More on this below. This is the major difference vs v4 set back from Sep 2019.
2. makes networking BPF progs more secure, since CAP_BPF + CAP_NET_ADMIN
prevents pointer leaks and arbitrary kernel memory access.
3. enables fuzzers to exercise all of the verifier logic. Eventually finding bugs
and making BPF infra more secure. Currently fuzzers run in unpriv.
They will be able to run with CAP_BPF.
The patchset is long overdue follow-up from the last plumbers conference.
Comparing to what was discussed at LPC the CAP* checks at attach time are gone.
For tracing progs the CAP_SYS_ADMIN check was done at load time only. There was
no check at attach time. For networking and cgroup progs CAP_SYS_ADMIN was
required at load time and CAP_NET_ADMIN at attach time, but there are several
ways to bypass CAP_NET_ADMIN:
- if networking prog is using tail_call writing FD into prog_array will
effectively attach it, but bpf_map_update_elem is an unprivileged operation.
- freplace prog with CAP_SYS_ADMIN can replace networking prog
Consolidating all CAP checks at load time makes security model similar to
open() syscall. Once the user got an FD it can do everything with it.
read/write/poll don't check permissions. The same way when bpf_prog_load
command returns an FD the user can do everything (including attaching,
detaching, and bpf_test_run).
The important design decision is to allow ID->FD transition for
CAP_SYS_ADMIN only. What it means that user processes can run
with CAP_BPF and CAP_NET_ADMIN and they will not be able to affect each
other unless they pass FDs via scm_rights or via pinning in bpffs.
ID->FD is a mechanism for human override and introspection.
An admin can do 'sudo bpftool prog ...'. It's possible to enforce via LSM that
only bpftool binary does bpf syscall with CAP_SYS_ADMIN and the rest of user
space processes do bpf syscall with CAP_BPF isolating bpf objects (progs, maps,
links) that are owned by such processes from each other.
Another significant change from LPC is that the verifier checks are split into
four flags. The allow_ptr_leaks flag allows pointer manipulations. The
bpf_capable flag enables all modern verifier features like bpf-to-bpf calls,
BTF, bounded loops, dead code elimination, etc. All the goodness. The
bypass_spec_v1 flag enables indirect stack access from bpf programs and
disables speculative analysis and bpf array mitigations. The bypass_spec_v4
flag disables store sanitation. That allows networking progs with CAP_BPF +
CAP_NET_ADMIN enjoy modern verifier features while being more secure.
Some networking progs may need CAP_BPF + CAP_NET_ADMIN + CAP_PERFMON,
since subtracting pointers (like skb->data_end - skb->data) is a pointer leak,
but the verifier may get smarter in the future.
====================
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Implement permissions as stated in uapi/linux/capability.h
In order to do that the verifier allow_ptr_leaks flag is split
into four flags and they are set as:
env->allow_ptr_leaks = bpf_allow_ptr_leaks();
env->bypass_spec_v1 = bpf_bypass_spec_v1();
env->bypass_spec_v4 = bpf_bypass_spec_v4();
env->bpf_capable = bpf_capable();
The first three currently equivalent to perfmon_capable(), since leaking kernel
pointers and reading kernel memory via side channel attacks is roughly
equivalent to reading kernel memory with cap_perfmon.
'bpf_capable' enables bounded loops, precision tracking, bpf to bpf calls and
other verifier features. 'allow_ptr_leaks' enable ptr leaks, ptr conversions,
subtraction of pointers. 'bypass_spec_v1' disables speculative analysis in the
verifier, run time mitigations in bpf array, and enables indirect variable
access in bpf programs. 'bypass_spec_v4' disables emission of sanitation code
by the verifier.
That means that the networking BPF program loaded with CAP_BPF + CAP_NET_ADMIN
will have speculative checks done by the verifier and other spectre mitigation
applied. Such networking BPF program will not be able to leak kernel pointers
and will not be able to access arbitrary kernel memory.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20200513230355.7858-3-alexei.starovoitov@gmail.com
Split BPF operations that are allowed under CAP_SYS_ADMIN into
combination of CAP_BPF, CAP_PERFMON, CAP_NET_ADMIN.
For backward compatibility include them in CAP_SYS_ADMIN as well.
The end result provides simple safety model for applications that use BPF:
- to load tracing program types
BPF_PROG_TYPE_{KPROBE, TRACEPOINT, PERF_EVENT, RAW_TRACEPOINT, etc}
use CAP_BPF and CAP_PERFMON
- to load networking program types
BPF_PROG_TYPE_{SCHED_CLS, XDP, SK_SKB, etc}
use CAP_BPF and CAP_NET_ADMIN
There are few exceptions from this rule:
- bpf_trace_printk() is allowed in networking programs, but it's using
tracing mechanism, hence this helper needs additional CAP_PERFMON
if networking program is using this helper.
- BPF_F_ZERO_SEED flag for hash/lru map is allowed under CAP_SYS_ADMIN only
to discourage production use.
- BPF HW offload is allowed under CAP_SYS_ADMIN.
- bpf_probe_write_user() is allowed under CAP_SYS_ADMIN only.
CAPs are not checked at attach/detach time with two exceptions:
- loading BPF_PROG_TYPE_CGROUP_SKB is allowed for unprivileged users,
hence CAP_NET_ADMIN is required at attach time.
- flow_dissector detach doesn't check prog FD at detach,
hence CAP_NET_ADMIN is required at detach time.
CAP_SYS_ADMIN is required to iterate BPF objects (progs, maps, links) via get_next_id
command and convert them to file descriptor via GET_FD_BY_ID command.
This restriction guarantees that mutliple tasks with CAP_BPF are not able to
affect each other. That leads to clean isolation of tasks. For example:
task A with CAP_BPF and CAP_NET_ADMIN loads and attaches a firewall via bpf_link.
task B with the same capabilities cannot detach that firewall unless
task A explicitly passed link FD to task B via scm_rights or bpffs.
CAP_SYS_ADMIN can still detach/unload everything.
Two networking user apps with CAP_SYS_ADMIN and CAP_NET_ADMIN can
accidentely mess with each other programs and maps.
Two networking user apps with CAP_NET_ADMIN and CAP_BPF cannot affect each other.
CAP_NET_ADMIN + CAP_BPF allows networking programs access only packet data.
Such networking progs cannot access arbitrary kernel memory or leak pointers.
bpftool, bpftrace, bcc tools binaries should NOT be installed with
CAP_BPF and CAP_PERFMON, since unpriv users will be able to read kernel secrets.
But users with these two permissions will be able to use these tracing tools.
CAP_PERFMON is least secure, since it allows kprobes and kernel memory access.
CAP_NET_ADMIN can stop network traffic via iproute2.
CAP_BPF is the safest from security point of view and harmless on its own.
Having CAP_BPF and/or CAP_NET_ADMIN is not enough to write into arbitrary map
and if that map is used by firewall-like bpf prog.
CAP_BPF allows many bpf prog_load commands in parallel. The verifier
may consume large amount of memory and significantly slow down the system.
Existing unprivileged BPF operations are not affected.
In particular unprivileged users are allowed to load socket_filter and cg_skb
program types and to create array, hash, prog_array, map-in-map map types.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20200513230355.7858-2-alexei.starovoitov@gmail.com
In Cilium we've recently switched to make use of bpf_jiffies64() for
parts of our tc and XDP datapath since bpf_ktime_get_ns() is more
expensive and high-precision is not needed for our timeouts we have
anyway. Our agent has a probe manager which picks up the json of
bpftool's feature probe and we also use the macro output in our C
programs e.g. to have workarounds when helpers are not available on
older kernels.
Extend the kernel config info dump to also include the kernel's
CONFIG_HZ, and rework the probe_kernel_image_config() for allowing a
macro dump such that CONFIG_HZ can be propagated to BPF C code as a
simple define if available via config. Latter allows to have _compile-
time_ resolution of jiffies <-> sec conversion in our code since all
are propagated as known constants.
Given we cannot generally assume availability of kconfig everywhere,
we also have a kernel hz probe [0] as a fallback. Potentially, bpftool
could have an integrated probe fallback as well, although to derive it,
we might need to place it under 'bpftool feature probe full' or similar
given it would slow down the probing process overall. Yet 'full' doesn't
fit either for us since we don't want to pollute the kernel log with
warning messages from bpf_probe_write_user() and bpf_trace_printk() on
agent startup; I've left it out for the time being.
[0] https://github.com/cilium/cilium/blob/master/bpf/cilium-probe-kernel-hz.c
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Quentin Monnet <quentin@isovalent.com>
Cc: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20200513075849.20868-1-daniel@iogearbox.net
Daniel Borkmann says:
====================
Small set of fixes in order to restrict BPF helpers for tracing which are
broken on archs with overlapping address ranges as per discussion in [0].
I've targetted this for -bpf tree so they can be routed as fixes. Thanks!
v1 -> v2:
- switch to reusable %pks, %pus format specifiers (Yonghong)
- fixate %s on kernel_ds probing for archs with overlapping addr space
[0] https://lore.kernel.org/bpf/CAHk-=wjJKo0GVixYLmqPn-Q22WFu0xHaBSjKEo7e7Yw72y5SPQ@mail.gmail.com/T/
====================
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Usage of plain %s conversion specifier in bpf_trace_printk() suffers from the
very same issue as bpf_probe_read{,str}() helpers, that is, it is broken on
archs with overlapping address ranges.
While the helpers have been addressed through work in 6ae08ae3de ("bpf: Add
probe_read_{user, kernel} and probe_read_{user, kernel}_str helpers"), we need
an option for bpf_trace_printk() as well to fix it.
Similarly as with the helpers, force users to make an explicit choice by adding
%pks and %pus specifier to bpf_trace_printk() which will then pick the corresponding
strncpy_from_unsafe*() variant to perform the access under KERNEL_DS or USER_DS.
The %pk* (kernel specifier) and %pu* (user specifier) can later also be extended
for other objects aside strings that are probed and printed under tracing, and
reused out of other facilities like bpf_seq_printf() or BTF based type printing.
Existing behavior of %s for current users is still kept working for archs where it
is not broken and therefore gated through CONFIG_ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE.
For archs not having this property we fall-back to pick probing under KERNEL_DS as
a sensible default.
Fixes: 8d3b7dce86 ("bpf: add support for %s specifier to bpf_trace_printk()")
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Reported-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Brendan Gregg <brendan.d.gregg@gmail.com>
Link: https://lore.kernel.org/bpf/20200515101118.6508-4-daniel@iogearbox.net
Given bpf_probe_read{,str}() BPF helpers are now only available under
CONFIG_ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE, we need to add the drop-in
replacements of bpf_probe_read_{kernel,user}_str() to do_refine_retval_range()
as well to avoid hitting the same issue as in 849fa50662 ("bpf/verifier:
refine retval R0 state for bpf_get_stack helper").
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20200515101118.6508-3-daniel@iogearbox.net
Given the legacy bpf_probe_read{,str}() BPF helpers are broken on archs
with overlapping address ranges, we should really take the next step to
disable them from BPF use there.
To generally fix the situation, we've recently added new helper variants
bpf_probe_read_{user,kernel}() and bpf_probe_read_{user,kernel}_str().
For details on them, see 6ae08ae3de ("bpf: Add probe_read_{user, kernel}
and probe_read_{user,kernel}_str helpers").
Given bpf_probe_read{,str}() have been around for ~5 years by now, there
are plenty of users at least on x86 still relying on them today, so we
cannot remove them entirely w/o breaking the BPF tracing ecosystem.
However, their use should be restricted to archs with non-overlapping
address ranges where they are working in their current form. Therefore,
move this behind a CONFIG_ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE and
have x86, arm64, arm select it (other archs supporting it can follow-up
on it as well).
For the remaining archs, they can workaround easily by relying on the
feature probe from bpftool which spills out defines that can be used out
of BPF C code to implement the drop-in replacement for old/new kernels
via: bpftool feature probe macro
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Brendan Gregg <brendan.d.gregg@gmail.com>
Cc: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/bpf/20200515101118.6508-2-daniel@iogearbox.net
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQRcEzekXsqa64kGDp7j7w1vZxhRxQUCXrz0qQAKCRDj7w1vZxhR
xSImAQDRwP5tNB3ZZmnJ2ABF2uqT+YTzS5oPW//bxCgnq128FQEA4UxFvgVOXAGI
HUsEGriyhoYgPMApmKHsFTTCvphlcQY=
=Igwt
-----END PGP SIGNATURE-----
Merge tag 'drm-misc-fixes-2020-05-14' of git://anongit.freedesktop.org/drm/drm-misc into drm-fixes
Just one meson patch this time to propagate an error code
Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Maxime Ripard <maxime@cerno.tech>
Link: https://patchwork.freedesktop.org/patch/msgid/20200514073538.wvdtv5s2mt4wdrdj@gilmour.lan
Jesper Dangaard Brouer says:
====================
V4:
- Fixup checkpatch.pl issues
- Collected more ACKs
V3:
- Fix issue on virtio_net patch spotted by Jason Wang
- Adjust name for variable in mlx5 patch
- Collected more ACKs
V2:
- Fix bug in mlx5 for XDP_PASS case
- Collected nitpicks and ACKs from mailing list
V1:
- Fix bug in dpaa2
XDP have evolved to support several frame sizes, but xdp_buff was not
updated with this information. This have caused the side-effect that
XDP frame data hard end is unknown. This have limited the BPF-helper
bpf_xdp_adjust_tail to only shrink the packet. This patchset address
this and add packet tail extend/grow.
The purpose of the patchset is ALSO to reserve a memory area that can be
used for storing extra information, specifically for extending XDP with
multi-buffer support. One proposal is to use same layout as
skb_shared_info, which is why this area is currently 320 bytes.
When converting xdp_frame to SKB (veth and cpumap), the full tailroom
area can now be used and SKB truesize is now correct. For most
drivers this result in a much larger tailroom in SKB "head" data
area. The network stack can now take advantage of this when doing SKB
coalescing. Thus, a good driver test is to use xdp_redirect_cpu from
samples/bpf/ and do some TCP stream testing.
Use-cases for tail grow/extend:
(1) IPsec / XFRM needs a tail extend[1][2].
(2) DNS-cache responses in XDP.
(3) HAProxy ALOHA would need it to convert to XDP.
(4) Add tail info e.g. timestamp and collect via tcpdump
[1] http://vger.kernel.org/netconf2019_files/xfrm_xdp.pdf
[2] http://vger.kernel.org/netconf2019.html
Examples on howto access the tail area of an XDP packet is shown in the
XDP-tutorial example[3].
[3] https://github.com/xdp-project/xdp-tutorial/blob/master/experiment01-tailgrow/
====================
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Extend BPF selftest xdp_adjust_tail with grow tail tests, which is added
as subtest's. The first grow test stays in same form as original shrink
test. The second grow test use the newer bpf_prog_test_run_xattr() calls,
and does extra checking of data contents.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/158945350567.97035.9632611946765811876.stgit@firesoul
Update the memory requirements, when adding xdp.frame_sz in BPF test_run
function bpf_prog_test_run_xdp() which e.g. is used by XDP selftests.
Specifically add the expected reserved tailroom, but also allocated a
larger memory area to reflect that XDP frames usually comes in this
format. Limit the provided packet data size to 4096 minus headroom +
tailroom, as this also reflect a common 3520 bytes MTU limit with XDP.
Note that bpf_test_init already use a memory allocation method that clears
memory. Thus, this already guards against leaking uninit kernel memory.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/158945349549.97035.15316291762482444006.stgit@firesoul
Clearing memory of tail when grow happens, because it is too easy
to write a XDP_PASS program that extend the tail, which expose
this memory to users that can run tcpdump.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/bpf/158945349039.97035.5262100484553494.stgit@firesoul
Finally, after all drivers have a frame size, allow BPF-helper
bpf_xdp_adjust_tail() to grow or extend packet size at frame tail.
Remember that helper/macro xdp_data_hard_end have reserved some
tailroom. Thus, this helper makes sure that the BPF-prog don't have
access to this tailroom area.
V2: Remove one chicken check and use WARN_ONCE for other
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/158945348530.97035.12577148209134239291.stgit@firesoul
The mlx5 driver have multiple memory models, which are also changed
according to whether a XDP bpf_prog is attached.
The 'rx_striding_rq' setting is adjusted via ethtool priv-flags e.g.:
# ethtool --set-priv-flags mlx5p2 rx_striding_rq off
On the general case with 4K page_size and regular MTU packet, then
the frame_sz is 2048 and 4096 when XDP is enabled, in both modes.
The info on the given frame size is stored differently depending on the
RQ-mode and encoded in a union in struct mlx5e_rq union wqe/mpwqe.
In rx striding mode rq->mpwqe.log_stride_sz is either 11 or 12, which
corresponds to 2048 or 4096 (MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ).
In non-striding mode (MLX5_WQ_TYPE_CYCLIC) the frag_stride is stored
in rq->wqe.info.arr[0].frag_stride, for the first fragment, which is
what the XDP case cares about.
To reduce effect on fast-path, this patch determine the frame_sz at
setup time, to avoid determining the memory model runtime. Variable
is named frame0_sz to make it clear that this is only the frame
size of the first fragment.
This mlx5 driver does a DMA-sync on XDP_TX action, but grow is safe
as it have done a DMA-map on the entire PAGE_SIZE. The driver also
already does a XDP length check against sq->hw_mtu on the possible
XDP xmit paths mlx5e_xmit_xdp_frame() + mlx5e_xmit_xdp_frame_mpwqe().
V3+4: Change variable name first_frame_sz to frame0_sz
V2: Fix that frag_size need to be recalc before creating SKB.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Tariq Toukan <tariqt@mellanox.com>
Cc: Saeed Mahameed <saeedm@mellanox.com>
Link: https://lore.kernel.org/bpf/158945348021.97035.12295039384250022883.stgit@firesoul
Intel drivers implement native AF_XDP zerocopy in separate C-files,
that have its own invocation of bpf_prog_run_xdp(). The setup of
xdp_buff is also handled in separately from normal code path.
This patch update XDP frame_sz for AF_XDP zerocopy drivers i40e, ice
and ixgbe, as the code changes needed are very similar. Introduce a
helper function xsk_umem_xdp_frame_sz() for calculating frame size.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Björn Töpel <bjorn.topel@intel.com>
Cc: intel-wired-lan@lists.osuosl.org
Cc: Magnus Karlsson <magnus.karlsson@intel.com>
Link: https://lore.kernel.org/bpf/158945347511.97035.8536753731329475655.stgit@firesoul
This driver uses different memory models depending on PAGE_SIZE at
compile time. For PAGE_SIZE 4K it uses page splitting, meaning for
normal MTU frame size is 2048 bytes (and headroom 192 bytes). For
larger MTUs the driver still use page splitting, by allocating
order-1 pages (8192 bytes) for RX frames. For PAGE_SIZE larger than
4K, driver instead advance its rx_buffer->page_offset with the frame
size "truesize".
For XDP frame size calculations, this mean that in PAGE_SIZE larger
than 4K mode the frame_sz change on a per packet basis. For the page
split 4K PAGE_SIZE mode, xdp.frame_sz is more constant and can be
updated once outside the main NAPI loop.
The default setting in the driver uses build_skb(), which provides
the necessary headroom and tailroom for XDP-redirect in RX-frame
(in both modes).
There is one complication, which is legacy-rx mode (configurable via
ethtool priv-flags). There are zero headroom in this mode, which is a
requirement for XDP-redirect to work. The conversion to xdp_frame
(convert_to_xdp_frame) will detect this insufficient space, and
xdp_do_redirect() call will fail. This is deemed acceptable, as it
allows other XDP actions to still work in legacy-mode. In
legacy-mode + larger PAGE_SIZE due to lacking tailroom, we also
accept that xdp_adjust_tail shrink doesn't work.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Cc: intel-wired-lan@lists.osuosl.org
Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Cc: Alexander Duyck <alexander.duyck@gmail.com>
Link: https://lore.kernel.org/bpf/158945347002.97035.328088795813704587.stgit@firesoul
This driver uses different memory models depending on PAGE_SIZE at
compile time. For PAGE_SIZE 4K it uses page splitting, meaning for
normal MTU frame size is 2048 bytes (and headroom 192 bytes). For
larger MTUs the driver still use page splitting, by allocating
order-1 pages (8192 bytes) for RX frames. For PAGE_SIZE larger than
4K, driver instead advance its rx_buffer->page_offset with the frame
size "truesize".
For XDP frame size calculations, this mean that in PAGE_SIZE larger
than 4K mode the frame_sz change on a per packet basis. For the page
split 4K PAGE_SIZE mode, xdp.frame_sz is more constant and can be
updated once outside the main NAPI loop.
The default setting in the driver uses build_skb(), which provides
the necessary headroom and tailroom for XDP-redirect in RX-frame
(in both modes).
There is one complication, which is legacy-rx mode (configurable via
ethtool priv-flags). There are zero headroom in this mode, which is a
requirement for XDP-redirect to work. The conversion to xdp_frame
(convert_to_xdp_frame) will detect this insufficient space, and
xdp_do_redirect() call will fail. This is deemed acceptable, as it
allows other XDP actions to still work in legacy-mode. In
legacy-mode + larger PAGE_SIZE due to lacking tailroom, we also
accept that xdp_adjust_tail shrink doesn't work.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Cc: intel-wired-lan@lists.osuosl.org
Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Cc: Alexander Duyck <alexander.duyck@gmail.com>
Link: https://lore.kernel.org/bpf/158945346494.97035.12809400414566061815.stgit@firesoul
This patch mirrors the changes to ixgbe in previous patch.
This VF driver doesn't support XDP_REDIRECT, but correct tailroom is
still necessary for BPF-helper xdp_adjust_tail. In legacy-mode +
larger PAGE_SIZE, due to lacking tailroom, we accept that
xdp_adjust_tail shrink doesn't work.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Cc: intel-wired-lan@lists.osuosl.org
Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Cc: Alexander Duyck <alexander.duyck@gmail.com>
Link: https://lore.kernel.org/bpf/158945345984.97035.13518286183248025173.stgit@firesoul
This driver uses different memory models depending on PAGE_SIZE at
compile time. For PAGE_SIZE 4K it uses page splitting, meaning for
normal MTU frame size is 2048 bytes (and headroom 192 bytes). For
larger MTUs the driver still use page splitting, by allocating
order-1 pages (8192 bytes) for RX frames. For PAGE_SIZE larger than
4K, driver instead advance its rx_buffer->page_offset with the frame
size "truesize".
For XDP frame size calculations, this mean that in PAGE_SIZE larger
than 4K mode the frame_sz change on a per packet basis. For the page
split 4K PAGE_SIZE mode, xdp.frame_sz is more constant and can be
updated once outside the main NAPI loop.
The default setting in the driver uses build_skb(), which provides
the necessary headroom and tailroom for XDP-redirect in RX-frame
(in both modes).
There is one complication, which is legacy-rx mode (configurable via
ethtool priv-flags). There are zero headroom in this mode, which is a
requirement for XDP-redirect to work. The conversion to xdp_frame
(convert_to_xdp_frame) will detect this insufficient space, and
xdp_do_redirect() call will fail. This is deemed acceptable, as it
allows other XDP actions to still work in legacy-mode. In
legacy-mode + larger PAGE_SIZE due to lacking tailroom, we also
accept that xdp_adjust_tail shrink doesn't work.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Cc: intel-wired-lan@lists.osuosl.org
Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Cc: Alexander Duyck <alexander.duyck@gmail.com>
Link: https://lore.kernel.org/bpf/158945345455.97035.14334355929030628741.stgit@firesoul
The ixgbe driver have another memory model when compiled on archs with
PAGE_SIZE above 4096 bytes. In this mode it doesn't split the page in
two halves, but instead increment rx_buffer->page_offset by truesize of
packet (which include headroom and tailroom for skb_shared_info).
This is done correctly in ixgbe_build_skb(), but in ixgbe_rx_buffer_flip
which is currently only called on XDP_TX and XDP_REDIRECT, it forgets
to add the tailroom for skb_shared_info. This breaks XDP_REDIRECT, for
veth and cpumap. Fix by adding size of skb_shared_info tailroom.
Maintainers notice: This fix have been queued to Jeff.
Fixes: 6453073987 ("ixgbe: add initial support for xdp redirect")
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Link: https://lore.kernel.org/bpf/158945344946.97035.17031588499266605743.stgit@firesoul
The virtio_net driver is running inside the guest-OS. There are two
XDP receive code-paths in virtio_net, namely receive_small() and
receive_mergeable(). The receive_big() function does not support XDP.
In receive_small() the frame size is available in buflen. The buffer
backing these frames are allocated in add_recvbuf_small() with same
size, except for the headroom, but tailroom have reserved room for
skb_shared_info. The headroom is encoded in ctx pointer as a value.
In receive_mergeable() the frame size is more dynamic. There are two
basic cases: (1) buffer size is based on a exponentially weighted
moving average (see DECLARE_EWMA) of packet length. Or (2) in case
virtnet_get_headroom() have any headroom then buffer size is
PAGE_SIZE. The ctx pointer is this time used for encoding two values;
the buffer len "truesize" and headroom. In case (1) if the rx buffer
size is underestimated, the packet will have been split over more
buffers (num_buf info in virtio_net_hdr_mrg_rxbuf placed in top of
buffer area). If that happens the XDP path does a xdp_linearize_page
operation.
V3: Adjust frame_sz in receive_mergeable() case, spotted by Jason Wang.
The code is really hard to follow, so some hints to reviewers.
The receive_mergeable() case gets frames that were allocated in
add_recvbuf_mergeable() which uses headroom=virtnet_get_headroom(),
and 'buf' ptr is advanced this headroom. The headroom can only
be 0 or VIRTIO_XDP_HEADROOM, as virtnet_get_headroom is really
simple:
static unsigned int virtnet_get_headroom(struct virtnet_info *vi)
{
return vi->xdp_queue_pairs ? VIRTIO_XDP_HEADROOM : 0;
}
As frame_sz is an offset size from xdp.data_hard_start, reviewers
should notice how this is calculated in receive_mergeable():
int offset = buf - page_address(page);
[...]
data = page_address(xdp_page) + offset;
xdp.data_hard_start = data - VIRTIO_XDP_HEADROOM + vi->hdr_len;
The calculated offset will always be VIRTIO_XDP_HEADROOM when
reaching this code. Thus, xdp.data_hard_start will be page-start
address plus vi->hdr_len. Given this xdp.frame_sz need to be
reduced with vi->hdr_len size.
IMHO a followup patch should cleanup this code to make it easier
to maintain and understand, but it is outside the scope of this
patchset.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Link: https://lore.kernel.org/bpf/158945344436.97035.9445115070189151680.stgit@firesoul
In vhost_net_build_xdp() the 'buf' that gets queued via an xdp_buff
have embedded a struct tun_xdp_hdr (located at xdp->data_hard_start)
which contains the buffer length 'buflen' (with tailroom for
skb_shared_info). Also storing this buflen in xdp->frame_sz, does not
obsolete struct tun_xdp_hdr, as it also contains a struct
virtio_net_hdr with other information.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Link: https://lore.kernel.org/bpf/158945343928.97035.4620233649151726289.stgit@firesoul
The tun driver have two code paths for running XDP (bpf_prog_run_xdp).
In both cases 'buflen' contains enough tailroom for skb_shared_info.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Link: https://lore.kernel.org/bpf/158945343419.97035.9594485183958037621.stgit@firesoul
The netronome nfp driver use PAGE_SIZE when xdp_prog is set, but
xdp.data_hard_start begins at offset NFP_NET_RX_BUF_HEADROOM.
Thus, adjust for this when setting xdp.frame_sz, as it counts
from data_hard_start.
When doing XDP_TX this driver is smart and instead of a full DMA-map
does a DMA-sync on with packet length. As xdp_adjust_tail can now
grow packet length, add checks to make sure that grow size is within
the DMA-mapped size.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/bpf/158945342911.97035.11214251236208648808.stgit@firesoul
To help reviewers these are the defines related to RCV_FRAG_LEN
#define DMA_BUFFER_LEN 1536 /* In multiples of 128bytes */
#define RCV_FRAG_LEN (SKB_DATA_ALIGN(DMA_BUFFER_LEN + NET_SKB_PAD) + \
SKB_DATA_ALIGN(sizeof(struct skb_shared_info)))
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Cc: Sunil Goutham <sgoutham@marvell.com>
Cc: Robert Richter <rrichter@marvell.com>
Link: https://lore.kernel.org/bpf/158945342402.97035.12649844447148990032.stgit@firesoul
The mlx4 drivers size of memory backing the RX packet is stored in
frag_stride. For XDP mode this will be PAGE_SIZE (normally 4096).
For normal mode frag_stride is 2048.
Also adjust MLX4_EN_MAX_XDP_MTU to take tailroom into account.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Cc: Saeed Mahameed <saeedm@mellanox.com>
Link: https://lore.kernel.org/bpf/158945341893.97035.2688142527052329942.stgit@firesoul
Frame size ENA_PAGE_SIZE is limited to 16K on systems with larger
PAGE_SIZE than 16K. Change ENA_XDP_MAX_MTU to also take into account
the reserved tailroom.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Sameeh Jubran <sameehj@amazon.com>
Cc: Arthur Kiyanovski <akiyano@amazon.com>
Link: https://lore.kernel.org/bpf/158945341384.97035.907403694833419456.stgit@firesoul
The driver code cpsw.c and cpsw_new.c both use page_pool
with default order-0 pages or their RX-pages.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Grygorii Strashko <grygorii.strashko@ti.com>
Cc: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Link: https://lore.kernel.org/bpf/158945340875.97035.752144756428532878.stgit@firesoul
The hyperv NIC driver does memory allocation and copy even without XDP.
In XDP mode it will allocate a new page for each packet and copy over
the payload, before invoking the XDP BPF-prog.
The positive thing it that its easy to determine the xdp.frame_sz.
The XDP implementation for hv_netvsc transparently passes xdp_prog
to the associated VF NIC. Many of the Azure VMs are using SRIOV, so
majority of the data are actually processed directly on the VF driver's XDP
path. So the overhead of the synthetic data path (hv_netvsc) is minimal.
Then XDP is enabled on this driver, XDP_PASS and XDP_TX will create the
SKB via build_skb (based on the newly allocated page). Now using XDP
frame_sz this will provide more skb_tailroom, which netstack can use for
SKB coalescing (e.g tcp_try_coalesce -> skb_try_coalesce).
V3: Adjust patch desc to be more positive.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Cc: Wei Liu <wei.liu@kernel.org>
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Link: https://lore.kernel.org/bpf/158945339857.97035.10212138582505736163.stgit@firesoul
The dpaa2-eth driver reserve some headroom used for hardware and
software annotation area in RX/TX buffers. Thus, xdp.data_hard_start
doesn't start at page boundary.
When XDP is configured the area reserved via dpaa2_fd_get_offset(fd) is
448 bytes of which XDP have reserved 256 bytes. As frame_sz is
calculated as an offset from xdp_buff.data_hard_start, an adjust from
the full PAGE_SIZE == DPAA2_ETH_RX_BUF_RAW_SIZE.
When doing XDP_REDIRECT, the driver doesn't need this reserved headroom
any-longer and allows xdp_do_redirect() to use it. This is an advantage
for the drivers own ndo-xdp_xmit, as it uses part of this headroom for
itself. Patch also adjust frame_sz in this case.
The driver cannot support XDP data_meta, because it uses the headroom
just before xdp.data for struct dpaa2_eth_swa (DPAA2_ETH_SWA_SIZE=64),
when transmitting the packet. When transmitting a xdp_frame in
dpaa2_eth_xdp_xmit_frame (call via ndo_xdp_xmit) is uses this area to
store a pointer to xdp_frame and dma_size, which is used in TX
completion (free_tx_fd) to return frame via xdp_return_frame().
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Cc: Ioana Radulescu <ruxandra.radulescu@nxp.com>
Link: https://lore.kernel.org/bpf/158945339348.97035.8562488847066908856.stgit@firesoul
The veth driver can run XDP in "native" mode in it's own NAPI
handler, and since commit 9fc8d518d9 ("veth: Handle xdp_frames in
xdp napi ring") packets can come in two forms either xdp_frame or
skb, calling respectively veth_xdp_rcv_one() or veth_xdp_rcv_skb().
For packets to arrive in xdp_frame format, they will have been
redirected from an XDP native driver. In case of XDP_PASS or no
XDP-prog attached, the veth driver will allocate and create an SKB.
The current code in veth_xdp_rcv_one() xdp_frame case, had to guess
the frame truesize of the incoming xdp_frame, when using
veth_build_skb(). With xdp_frame->frame_sz this is not longer
necessary.
Calculating the frame_sz in veth_xdp_rcv_skb() skb case, is done
similar to the XDP-generic handling code in net/core/dev.c.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Lorenzo Bianconi <lorenzo@kernel.org>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Acked-by: Toshiaki Makita <toshiaki.makita1@gmail.com>
Link: https://lore.kernel.org/bpf/158945338840.97035.935897116345700902.stgit@firesoul
When native XDP redirect into a veth device, the frame arrives in the
xdp_frame structure. It is then processed in veth_xdp_rcv_one(),
which can run a new XDP bpf_prog on the packet. Doing so requires
converting xdp_frame to xdp_buff, but the tricky part is that
xdp_frame memory area is located in the top (data_hard_start) memory
area that xdp_buff will point into.
The current code tried to protect the xdp_frame area, by assigning
xdp_buff.data_hard_start past this memory. This results in 32 bytes
less headroom to expand into via BPF-helper bpf_xdp_adjust_head().
This protect step is actually not needed, because BPF-helper
bpf_xdp_adjust_head() already reserve this area, and don't allow
BPF-prog to expand into it. Thus, it is safe to point data_hard_start
directly at xdp_frame memory area.
Fixes: 9fc8d518d9 ("veth: Handle xdp_frames in xdp napi ring")
Reported-by: Mao Wenan <maowenan@huawei.com>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Toshiaki Makita <toshiaki.makita1@gmail.com>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/bpf/158945338331.97035.5923525383710752178.stgit@firesoul
Knowing the memory size backing the packet/xdp_frame data area, and
knowing it already have reserved room for skb_shared_info, simplifies
using build_skb significantly.
With this change we no-longer lie about the SKB truesize, but more
importantly a significant larger skb_tailroom is now provided, e.g. when
drivers uses a full PAGE_SIZE. This extra tailroom (in linear area) can be
used by the network stack when coalescing SKBs (e.g. in skb_try_coalesce,
see TCP cases where tcp_queue_rcv() can 'eat' skb).
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/bpf/158945337822.97035.13557959180460986059.stgit@firesoul