Daniel Borkmann says:
====================
pull-request: bpf-next 2020-09-01
The following pull-request contains BPF updates for your *net-next* tree.
There are two small conflicts when pulling, resolve as follows:
1) Merge conflict in tools/lib/bpf/libbpf.c between 88a8212028 ("libbpf: Factor
out common ELF operations and improve logging") in bpf-next and 1e891e513e
("libbpf: Fix map index used in error message") in net-next. Resolve by taking
the hunk in bpf-next:
[...]
scn = elf_sec_by_idx(obj, obj->efile.btf_maps_shndx);
data = elf_sec_data(obj, scn);
if (!scn || !data) {
pr_warn("elf: failed to get %s map definitions for %s\n",
MAPS_ELF_SEC, obj->path);
return -EINVAL;
}
[...]
2) Merge conflict in drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c between
9647c57b11 ("xsk: i40e: ice: ixgbe: mlx5: Test for dma_need_sync earlier for
better performance") in bpf-next and e20f0dbf20 ("net/mlx5e: RX, Add a prefetch
command for small L1_CACHE_BYTES") in net-next. Resolve the two locations by retaining
net_prefetch() and taking xsk_buff_dma_sync_for_cpu() from bpf-next. Should look like:
[...]
xdp_set_data_meta_invalid(xdp);
xsk_buff_dma_sync_for_cpu(xdp, rq->xsk_pool);
net_prefetch(xdp->data);
[...]
We've added 133 non-merge commits during the last 14 day(s) which contain
a total of 246 files changed, 13832 insertions(+), 3105 deletions(-).
The main changes are:
1) Initial support for sleepable BPF programs along with bpf_copy_from_user() helper
for tracing to reliably access user memory, from Alexei Starovoitov.
2) Add BPF infra for writing and parsing TCP header options, from Martin KaFai Lau.
3) bpf_d_path() helper for returning full path for given 'struct path', from Jiri Olsa.
4) AF_XDP support for shared umems between devices and queues, from Magnus Karlsson.
5) Initial prep work for full BPF-to-BPF call support in libbpf, from Andrii Nakryiko.
6) Generalize bpf_sk_storage map & add local storage for inodes, from KP Singh.
7) Implement sockmap/hash updates from BPF context, from Lorenz Bauer.
8) BPF xor verification for scalar types & add BPF link iterator, from Yonghong Song.
9) Use target's prog type for BPF_PROG_TYPE_EXT prog verification, from Udip Pant.
10) Rework BPF tracing samples to use libbpf loader, from Daniel T. Lee.
11) Fix xdpsock sample to really cycle through all buffers, from Weqaar Janjua.
12) Improve type safety for tun/veth XDP frame handling, from Maciej Żenczykowski.
13) Various smaller cleanups and improvements all over the place.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Remove unneeded return value cast.
This is detected by coccinelle.
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Remove dma_alloc_coherent return value cast.
This is detected by coccinelle.
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Implement the getsockopt SOL_TLS TLS_RX which is currently missing. The
primary usecase is to use it in conjunction with TCP_REPAIR to
checkpoint/restore the TLS record layer state.
TLS connection state usually exists on the user space library. So
basically we can easily extract it from there, but when the TLS
connections are delegated to the kTLS, it is not the case. We need to
have a way to extract the TLS state from the kernel for both of TX and
RX side.
The new TLS_RX getsockopt copies the crypto_info to user in the same
way as TLS_TX does.
We have described use cases in our research work in Netdev 0x14
Transport Workshop [1].
Also, there is an TLS implementation called tlse [2] which supports
TLS connection migration. They have support of kTLS and their code
shows that they are expecting the future support of this option.
[1] https://speakerdeck.com/yutarohayakawa/prism-proxies-without-the-pain
[2] https://github.com/eduardsui/tlse
Signed-off-by: Yutaro Hayakawa <yhayakawa3720@gmail.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Tonghao Zhang says:
====================
net: openvswitch: improve the codes
This series patches are not bug fix, just improve codes.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
keep_flows was introduced by [1], which used as flag to delete flows or not.
When rehashing or expanding the table instance, we will not flush the flows.
Now don't use it anymore, remove it.
[1] - acd051f176
Cc: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Decrease table->count and ufid_count unconditionally,
because we only don't use count or ufid_count to count
when flushing the flows. To simplify the codes, we
remove the "count" argument of table_instance_flow_free.
To avoid a bug when deleting flows in the future, add
WARN_ON in flush flows function.
Cc: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Not change the logic, just improve the coding style.
Cc: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The functions bq_enqueue(), bq_flush_to_queue(), and bq_xmit_all() in
{cpu,dev}map.c always return zero. Changing the return type from int
to void makes the code easier to follow.
Suggested-by: David Ahern <dsahern@gmail.com>
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/bpf/20200901083928.6199-1-bjorn.topel@gmail.com
Technically the bpf programs can sleep while attached to bpf_lsm_file_mprotect,
but such programs need to access user memory. So they're in might_fault()
category. Which means they cannot be called from file_mprotect lsm hook that
takes write lock on mm->mmap_lock.
Adjust the test accordingly.
Also add might_fault() to __bpf_prog_enter_sleepable() to catch such deadlocks early.
Fixes: 1e6c62a882 ("bpf: Introduce sleepable BPF programs")
Fixes: e68a144547 ("selftests/bpf: Add sleepable tests")
Reported-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20200831201651.82447-1-alexei.starovoitov@gmail.com
The txpush program in the xdpsock sample application is supposed
to send out all packets in the umem in a round-robin fashion.
The problem is that it only cycled through the first BATCH_SIZE
worth of packets. Fixed this so that it cycles through all buffers
in the umem as intended.
Fixes: 248c7f9c0e ("samples/bpf: convert xdpsock to use libbpf for AF_XDP access")
Signed-off-by: Weqaar Janjua <weqaar.a.janjua@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Björn Töpel <bjorn.topel@intel.com>
Link: https://lore.kernel.org/bpf/20200828161717.42705-1-weqaar.a.janjua@intel.com
Optimize the throughput performance of the l2fwd sub-app in the
xdpsock sample application by removing a duplicate syscall and
increasing the size of the fill ring.
The latter needs some further explanation. We recommend that you set
the fill ring size >= HW RX ring size + AF_XDP RX ring size. Make sure
you fill up the fill ring with buffers at regular intervals, and you
will with this setting avoid allocation failures in the driver. These
are usually quite expensive since drivers have not been written to
assume that allocation failures are common. For regular sockets,
kernel allocated memory is used that only runs out in OOM situations
that should be rare.
These two performance optimizations together lead to a 6% percent
improvement for the l2fwd app on my machine.
Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/1598619065-1944-1-git-send-email-magnus.karlsson@intel.com
The arg exact_dif is not used anymore, remove it. inet_exact_dif_match()
is no longer needed after the above is removed, so remove it too.
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The arg exact_dif is not used anymore, remove it. inet6_exact_dif_match()
is no longer needed after the above is removed, remove it too.
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ioana Ciornei says:
====================
net: phy: add Lynx PCS MDIO module
Add support for the Lynx PCS as a separate module in drivers/net/phy/.
The advantage of this structure is that multiple ethernet or switch
drivers used on NXP hardware (ENETC, Seville, Felix DSA switch etc) can
share the same implementation of PCS configuration and runtime
management.
The module implements phylink_pcs_ops and exports a phylink_pcs
(incorporated into a lynx_pcs) which can be directly passed to phylink
through phylink_pcs_set.
The first 3 patches add some missing pieces in phylink and the locked
mdiobus write accessor. Next, the Lynx PCS MDIO module is added as a
standalone module. The majority of the code is extracted from the Felix
DSA driver. The last patch makes the necessary changes in the Felix and
Seville drivers in order to use the new common PCS implementation.
At the moment, USXGMII (only with in-band AN), SGMII, QSGMII (with and
without in-band AN) and 2500Base-X (only w/o in-band AN) are supported
by the Lynx PCS MDIO module since these were also supported by Felix and
no functional change is intended at this time.
Changes in v2:
* got rid of the mdio_lynx_pcs structure and directly exported the
functions without the need of an indirection
* made the necessary adjustments for this in the Felix DSA driver
* solved the broken allmodconfig build test by making the module
tristate instead of bool
* fixed a memory leakage in the Felix driver (the pcs structure was
allocated twice)
Changes in v3:
* added support for PHYLINK PCS ops in DSA (patch 5/9)
* cleanup in Felix PHYLINK operations and migrate to
phylink_mac_link_up() being the callback of choice for applying MAC
configuration (patches 6-8)
Changes in v4:
* use the newly introduced phylink PCS mechanism
* install the phylink_pcs in the phylink_mac_config DSA ops
* remove the direct implementations of the PCS ops
* do no use the SGMII_ prefix when referring to the IF_MORE register
* add a phylink helper to decode the USXGMII code word
* remove cleanup patches for Felix (these have been already accepted)
* Seville (recently introduced) now has PCS support through the same
Lynx PCS module
Changes in v5:
- move the pcs-lynx driver to drivers/net/pcs
- reword the commit message a bit in 4/5
- add error checking and error propagation in 4/5
- s/IF_MODE_DUPLEX/IF_MODE_HALF_DUPLEX in 4/5
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Use the helper functions introduced by the newly added
Lynx PCS MDIO module in the Felix VSC9959 and Seville VSC9953.
Instead of representing the PCS as a phy_device, a mdio_device structure
will be passed to the Lynx module which is now actually implementing all
the PCS configuration and status reporting.
All code previously used for PCS monitoring and runtime configuration
is removed and replaced will calls to the Lynx PCS operations.
Tested on the following SERDES protocols of LS1028A: 0x7777
(2500Base-X), 0x85bb (QSGMII), 0x9999 (SGMII) and 0x13bb (USXGMII).
Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a Lynx PCS module which exposes the necessary operations to drive
the PCS using phylink.
The majority of the code is extracted from the Felix DSA driver, which
will be also modified in a later patch, and exposed as a separate module
for code reusability purposes.
As such, this aims at feature and bug parity with the existing Felix DSA
driver, and thus USXGMII, SGMII, QSGMII and 2500Base-X (only w/o in-band
AN) are supported by the Lynx PCS module since these were also supported
by Felix.
The module can only be enabled by the drivers in need and not user
selectable.
Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add the locked variant of the clause 45 mdiobus write accessor -
mdiobus_c45_write().
Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Reviewed-by: Russell King <rmk+kernel@armlinux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
The same link partner advertisement word is used for both QSGMII and
SGMII, thus treat both interface modes using the same
phylink_decode_sgmii_word() function.
Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Reviewed-by: Russell King <rmk+kernel@armlinux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
With the new addition of the USXGMII link partner ability constants we
can now introduce a phylink helper that decodes the USXGMII word and
populates the appropriate fields in the phylink_link_state structure
based on them.
Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Reviewed-by: Russell King <rmk+kernel@armlinux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is a pure codestyle cleanup patch. No functional change intended.
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
What 0xFFFF means here is actually the max mtu of a ip packet. Use help
macro IP_MAX_MTU here.
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The new bit TX_GENF_CLR_EN has been added in AM65x SR2.0 to fix i2083
errata, which can be just set unconditionally for all SoCs.
Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Edward Cree says:
====================
sfc: clean up some W=1 build warnings
A collection of minor fixes to issues flagged up by W=1.
After this series, the only remaining warnings in the sfc driver are
some 'member missing in kerneldoc' warnings from ptp.c.
Tested by building on x86_64 and running 'ethtool -p' on an EF10 NIC;
there was no error, but I couldn't observe the actual LED as I'm
working remotely.
[ Incidentally, ethtool_phys_id()'s behaviour on an error return
looks strange — if I'm reading it right, it will break out of the
inner loop but not the outer one, and eventually return the rc
from the last run of the inner loop. Is this intended? ]
====================
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
W=1 warnings indicated that 'rc' was unused in efx_mcdi_set_id_led();
change the function to return int instead of void and plumb the rc
through the caller efx_ethtool_phys_id().
Since (post-Falcon) all sfc NICs use MCDI for this, there's no point in
indirecting through a nic_type method, so remove that and just call
efx_mcdi_set_id_led() directly.
Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Missing 'struct' keyword caused "cannot understand function prototype"
warnings.
Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Thanks to some past refactor, 'spec' is not actually used in this
function; the code using it moved to the callee efx_farch_filter_remove.
Remove the variable to fix a W=1 warning.
Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Some of these RX-event flags aren't used at all, so remove them.
Others are used only #ifdef DEBUG to log a message; suppress the
unused-var warnings #ifndef DEBUG with a void cast.
Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
wenxu says:
====================
Add ip6_fragment in ipv6_stub
Add ip6_fragment in ipv6_stub and use it in openvswitch
This version add default function eafnosupport_ipv6_fragment
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Using ipv6_stub->ipv6_fragment to avoid the netfilter dependency
Signed-off-by: wenxu <wenxu@ucloud.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add ipv6_fragment to ipv6_stub to avoid calling netfilter when
access ip6_fragment.
Signed-off-by: wenxu <wenxu@ucloud.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>
Nicolas Dichtel says:
====================
gtp: minor enhancements
The first patch removes a useless rcu lock and the second relax alloc
constraints when a PDP context is added.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
When a PDP context is added, the rtnl lock is held, thus no need to force
a GFP_ATOMIC.
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The rtnl lock is taken just the line above, no need to take the rcu also.
Fixes: 1788b8569f ("gtp: fix use-after-free in gtp_encap_destroy()")
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If we intend to use PCS operations, mac_pcs_get_state() will not be
implemented, so will be NULL. If we also intend to register the PCS
operations in mac_prepare() or mac_config(), then this leads to an
attempt to call NULL function pointer during phylink_start(). Avoid
this, but we must report the link is down.
Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
Luo bin says:
====================
hinic: add debugfs support
add debugfs node for querying sq/rq info and function table
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
add debugfs node for querying function table, for example:
cat /sys/kernel/debug/hinic/0000:15:00.0/func_table/valid
Signed-off-by: Luo bin <luobin9@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
add debugfs node for querying rq info, for example:
cat /sys/kernel/debug/hinic/0000:15:00.0/RQs/0x0/rq_hw_pi
Signed-off-by: Luo bin <luobin9@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
add debugfs node for querying sq info, for example:
cat /sys/kernel/debug/hinic/0000:15:00.0/SQs/0x0/sq_pi
Signed-off-by: Luo bin <luobin9@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add documentation for the XDP_SHARED_UMEM feature when a UMEM is
shared between different queues and/or netdevs.
Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Björn Töpel <bjorn.topel@intel.com>
Link: https://lore.kernel.org/bpf/1598603189-32145-16-git-send-email-magnus.karlsson@intel.com
This sample code illustrates the packet forwarding between multiple
AF_XDP sockets in multi-threading environment. All the threads and
sockets are sharing a common buffer pool, with each socket having
its own private buffer cache. The sockets are created with the
xsk_socket__create_shared() function, which allows multiple AF_XDP
sockets to share the same UMEM object.
Example 1: Single thread handling two sockets. Packets received
from socket A (on top of interface IFA, queue QA) are forwarded
to socket B (on top of interface IFB, queue QB) and vice-versa.
The thread is affinitized to CPU core C:
./xsk_fwd -i IFA -q QA -i IFB -q QB -c C
Example 2: Two threads, each handling two sockets. Packets from
socket A are sent to socket B (by thread X), packets
from socket B are sent to socket A (by thread X); packets from
socket C are sent to socket D (by thread Y), packets from socket
D are sent to socket C (by thread Y). The two threads are bound
to CPU cores CX and CY:
./xdp_fwd -i IFA -q QA -i IFB -q QB -i IFC -q QC -i IFD -q QD -c CX -c CY
Signed-off-by: Cristian Dumitrescu <cristian.dumitrescu@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Björn Töpel <bjorn.topel@intel.com>
Link: https://lore.kernel.org/bpf/1598603189-32145-15-git-send-email-magnus.karlsson@intel.com
Add support for shared umems between hardware queues and devices to
the AF_XDP part of libbpf. This so that zero-copy can be achieved in
applications that want to send and receive packets between HW queues
on one device or between different devices/netdevs.
In order to create sockets that share a umem between hardware queues
and devices, a new function has been added called
xsk_socket__create_shared(). It takes the same arguments as
xsk_socket_create() plus references to a fill ring and a completion
ring. So for every socket that share a umem, you need to have one more
set of fill and completion rings. This in order to maintain the
single-producer single-consumer semantics of the rings.
You can create all the sockets via the new xsk_socket__create_shared()
call, or create the first one with xsk_socket__create() and the rest
with xsk_socket__create_shared(). Both methods work.
Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Björn Töpel <bjorn.topel@intel.com>
Link: https://lore.kernel.org/bpf/1598603189-32145-14-git-send-email-magnus.karlsson@intel.com
Add support to share a umem between different devices. This mode
can be invoked with the XDP_SHARED_UMEM bind flag. Previously,
sharing was only supported within the same device. Note that when
sharing a umem between devices, just as in the case of sharing a
umem between queue ids, you need to create a fill ring and a
completion ring and tie them to the socket (with two setsockopts,
one for each ring) before you do the bind with the
XDP_SHARED_UMEM flag. This so that the single-producer
single-consumer semantics of the rings can be upheld.
Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Björn Töpel <bjorn.topel@intel.com>
Link: https://lore.kernel.org/bpf/1598603189-32145-13-git-send-email-magnus.karlsson@intel.com