Use the more precise __kernel_sa_family_t for smctp_family, to match
struct sockaddr.
Also, use an unsigned int for the network member; negative networks
don't make much sense. We're already using unsigned for mctp_dev and
mctp_skb_cb, but need to change mctp_sock to suit.
Fixes: 60fc639816 ("mctp: Add sockaddr_mctp to uapi")
Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au>
Acked-by: Eugene Syromiatnikov <esyr@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently when a user uses "devlink dev info", the fw.mgmt.api will be
the major.minor numbers as shown below:
devlink dev info pci/0000:3b:00.0
pci/0000:3b:00.0:
driver ice
serial_number 00-01-00-ff-ff-00-00-00
versions:
fixed:
board.id K91258-000
running:
fw.mgmt 6.1.2
fw.mgmt.api 1.7 <--- No patch number included
fw.mgmt.build 0xd75e7d06
fw.mgmt.srev 5
fw.undi 1.2992.0
fw.undi.srev 5
fw.psid.api 3.10
fw.bundle_id 0x800085cc
fw.app.name ICE OS Default Package
fw.app 1.3.27.0
fw.app.bundle_id 0xc0000001
fw.netlist 3.10.2000-3.1e.0
fw.netlist.build 0x2a76e110
stored:
fw.mgmt.srev 5
fw.undi 1.2992.0
fw.undi.srev 5
fw.psid.api 3.10
fw.bundle_id 0x800085cc
fw.netlist 3.10.2000-3.1e.0
fw.netlist.build 0x2a76e110
There are many features in the driver that depend on the major, minor,
and patch version of the FW. Without the patch number in the output for
fw.mgmt.api debugging issues related to the FW API version is difficult.
Also, using major.minor.patch aligns with the existing firmware version
which uses a 3 digit value.
Fix this by making the fw.mgmt.api print the major.minor.patch
versions. Shown below is the result:
devlink dev info pci/0000:3b:00.0
pci/0000:3b:00.0:
driver ice
serial_number 00-01-00-ff-ff-00-00-00
versions:
fixed:
board.id K91258-000
running:
fw.mgmt 6.1.2
fw.mgmt.api 1.7.9 <--- patch number included
fw.mgmt.build 0xd75e7d06
fw.mgmt.srev 5
fw.undi 1.2992.0
fw.undi.srev 5
fw.psid.api 3.10
fw.bundle_id 0x800085cc
fw.app.name ICE OS Default Package
fw.app 1.3.27.0
fw.app.bundle_id 0xc0000001
fw.netlist 3.10.2000-3.1e.0
fw.netlist.build 0x2a76e110
stored:
fw.mgmt.srev 5
fw.undi 1.2992.0
fw.undi.srev 5
fw.psid.api 3.10
fw.bundle_id 0x800085cc
fw.netlist 3.10.2000-3.1e.0
fw.netlist.build 0x2a76e110
Fixes: ff2e5c700e ("ice: add basic handler for devlink .info_get")
Signed-off-by: Brett Creeley <brett.creeley@intel.com>
Tested-by: Gurucharan G <gurucharanx.g@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
The file sja1105.txt was converted to nxp,sja1105.yaml.
Signed-off-by: Alejandro Concepcion-Rodriguez <asconcepcion@acoro.eu>
Signed-off-by: David S. Miller <davem@davemloft.net>
wireless and can.
Current release - regressions:
- qrtr: revert check in qrtr_endpoint_post(), fixes audio and wifi
- ip_gre: validate csum_start only on pull
- bnxt_en: fix 64-bit doorbell operation on 32-bit kernels
- ionic: fix double use of queue-lock, fix a sleeping in atomic
- can: c_can: fix null-ptr-deref on ioctl()
- cs89x0: disable compile testing on powerpc
Current release - new code bugs:
- bridge: mcast: fix vlan port router deadlock, consistently disable BH
Previous releases - regressions:
- dsa: tag_rtl4_a: fix egress tags, only port 0 was working
- mptcp: fix possible divide by zero
- netfilter: nft_ct: protect nft_ct_pcpu_template_refcnt with mutex
- netfilter: socket: icmp6: fix use-after-scope
- stmmac: fix MAC not working when system resume back with WoL active
Previous releases - always broken:
- ip/ip6_gre: use the same logic as SIT interfaces when computing v6LL
address
- seg6: set fc_nlinfo in nh_create_ipv4, nh_create_ipv6
- mptcp: only send extra TCP acks in eligible socket states
- dsa: lantiq_gswip: fix maximum frame length
- stmmac: fix overall budget calculation for rxtx_napi
- bnxt_en: fix firmware version reporting via devlink
- renesas: sh_eth: add missing barrier to fix freeing wrong tx descriptor
Stragglers:
- netfilter: conntrack: switch to siphash
- netfilter: refuse insertion if chain has grown too large
- ncsi: add get MAC address command to get Intel i210 MAC address
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmE3uicACgkQMUZtbf5S
IrtJVA//XdE8qAmw1JukjyYC87JH2ale20eoZ6ERn7/09e4tdv3M6dOTI4YfrM6+
CMNP5MP2qit3IzY+lN0+yt9AAFH7k85z3MA8zLxsXN4z63OJcZvFv/G/OWy4Wp/0
vOo/DH+rF3LR+fZZvjJI+8Xi9/orsRpD12cwGmjGRxybh+XcnHKI/GvK2RgE6oBR
015RfBbbQBpzFQvESLnSwDzabN1XFEL1x/bz7N8ek3okfO/tab+f3E1tb6eYtTy+
jyDyOWpayd4xDttKNMUuxwS1q+/oAWOAq8PzkaF/ZG2sBH1Z4yZN9ZtsLNZmPG8N
5L1FEem/Nmgr54T9v/FhfiryhhGGysVfVgtQcCBkKRmVn1Kk2L6dFvtuanPtFFd3
llbi5PvCDJy3rbMmxKmyoM3T4jpMwWxQRZKsosw+k/WQfb8/SUOjgpY713V1Wx/P
S+2uadU4l9Ql9sF6X0IqZABnnt+j/BuDo6C6vVq7vyj0iQ9hEX9YxC0ybrAHOYpH
suHWKndodRfTxxVOg8xRNYwXyRLNbm1AP6LMDNKBlFUjwNSZ362qFX7W7DuXoRup
Rrnb8V1QFvM+pyFb2a0qNtBS68IXbjCdVQX5e8a5ELaAUnDPefNrfPN+/rrTLEtV
LnusmBF+02llVSYdr88t1e+LmzqS/aqXFy2ry4y6owjq20ld2O0=
=Zvuz
-----END PGP SIGNATURE-----
Merge tag 'net-5.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes and stragglers from Jakub Kicinski:
"Networking stragglers and fixes, including changes from netfilter,
wireless and can.
Current release - regressions:
- qrtr: revert check in qrtr_endpoint_post(), fixes audio and wifi
- ip_gre: validate csum_start only on pull
- bnxt_en: fix 64-bit doorbell operation on 32-bit kernels
- ionic: fix double use of queue-lock, fix a sleeping in atomic
- can: c_can: fix null-ptr-deref on ioctl()
- cs89x0: disable compile testing on powerpc
Current release - new code bugs:
- bridge: mcast: fix vlan port router deadlock, consistently disable
BH
Previous releases - regressions:
- dsa: tag_rtl4_a: fix egress tags, only port 0 was working
- mptcp: fix possible divide by zero
- netfilter: nft_ct: protect nft_ct_pcpu_template_refcnt with mutex
- netfilter: socket: icmp6: fix use-after-scope
- stmmac: fix MAC not working when system resume back with WoL active
Previous releases - always broken:
- ip/ip6_gre: use the same logic as SIT interfaces when computing
v6LL address
- seg6: set fc_nlinfo in nh_create_ipv4, nh_create_ipv6
- mptcp: only send extra TCP acks in eligible socket states
- dsa: lantiq_gswip: fix maximum frame length
- stmmac: fix overall budget calculation for rxtx_napi
- bnxt_en: fix firmware version reporting via devlink
- renesas: sh_eth: add missing barrier to fix freeing wrong tx
descriptor
Stragglers:
- netfilter: conntrack: switch to siphash
- netfilter: refuse insertion if chain has grown too large
- ncsi: add get MAC address command to get Intel i210 MAC address"
* tag 'net-5.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (76 commits)
ieee802154: Remove redundant initialization of variable ret
net: stmmac: fix MAC not working when system resume back with WoL active
net: phylink: add suspend/resume support
net: renesas: sh_eth: Fix freeing wrong tx descriptor
bonding: 3ad: pass parameter bond_params by reference
cxgb3: fix oops on module removal
can: c_can: fix null-ptr-deref on ioctl()
can: rcar_canfd: add __maybe_unused annotation to silence warning
net: wwan: iosm: Unify IO accessors used in the driver
net: wwan: iosm: Replace io.*64_lo_hi() with regular accessors
net: qcom/emac: Replace strlcpy with strscpy
ip6_gre: Revert "ip6_gre: add validation for csum_start"
net: hns3: make hclgevf_cmd_caps_bit_map0 and hclge_cmd_caps_bit_map0 static
selftests/bpf: Test XDP bonding nest and unwind
bonding: Fix negative jump label count on nested bonding
MAINTAINERS: add VM SOCKETS (AF_VSOCK) entry
stmmac: dwmac-loongson:Fix missing return value
iwlwifi: fix printk format warnings in uefi.c
net: create netdev->dev_addr assignment helpers
bnxt_en: Fix possible unintended driver initiated error recovery
...
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
1) Protect nft_ct template with global mutex, from Pavel Skripkin.
2) Two recent commits switched inet rt and nexthop exception hashes
from jhash to siphash. If those two spots are problematic then
conntrack is affected as well, so switch voer to siphash too.
While at it, add a hard upper limit on chain lengths and reject
insertion if this is hit. Patches from Florian Westphal.
3) Fix use-after-scope in nf_socket_ipv6 reported by KASAN,
from Benjamin Hesmans.
* git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf:
netfilter: socket: icmp6: fix use-after-scope
netfilter: refuse insertion if chain has grown too large
netfilter: conntrack: switch to siphash
netfilter: conntrack: sanitize table size default settings
netfilter: nft_ct: protect nft_ct_pcpu_template_refcnt with mutex
====================
Link: https://lore.kernel.org/r/20210903163020.13741-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
- A reworking of PDF generation to yield better results for documents
using CJK fonts in particular.
- A new set of translations into traditional Chinese, a dialect for which
I am assured there is a community of interested readers.
- A lot more regular Chinese translation work as well.
...plus the usual assortment of updates, fixes, typo tweaks, etc.
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEIw+MvkEiF49krdp9F0NaE2wMflgFAmEugrgACgkQF0NaE2wM
fliWWQf/RXf34QkMIe+r77WlTRKc+/6R/cO9VlYPtM9vqreKHZZvGgM1t76aOusb
M5QHwQGoZDzaE1wrv0PPm00HtB0Tw7GfZRUbZ4D+niJD1+gcbDTkTR6NdjOvWWUR
zHX2Sx8KJiNrFDtLtRtlUexM8GD124KZ0A8GF6Hpu3WR3HTFDInTdiylUOmj/4eO
3zUGgrJnUVzkqHLGZzV/kmE4kEHGpxyps2JwGq2iF7362t8R6xH3mEdKKKc1pUpx
lGSxfHs+OPWRsNxVJsdYh8kneIpML8OK6lKda1pzwNj8QhIMz/6tZoutKziHsalI
HkbC3exh+SHak2U6Had303vqkIM7cg==
=2QUy
-----END PGP SIGNATURE-----
Merge tag 'docs-5.15' of git://git.lwn.net/linux
Pull documentation updates from Jonathan Corbet:
"Yet another set of documentation changes:
- A reworking of PDF generation to yield better results for documents
using CJK fonts in particular.
- A new set of translations into traditional Chinese, a dialect for
which I am assured there is a community of interested readers.
- A lot more regular Chinese translation work as well.
... plus the usual assortment of updates, fixes, typo tweaks, etc"
* tag 'docs-5.15' of git://git.lwn.net/linux: (55 commits)
docs: sphinx-requirements: Move sphinx_rtd_theme to top
docs: pdfdocs: Enable language-specific font choice of zh_TW translations
docs: pdfdocs: Teach xeCJK about character classes of quotation marks
docs: pdfdocs: Permit AutoFakeSlant for CJK fonts
docs: pdfdocs: One-half spacing for CJK translations
docs: pdfdocs: Add conf.py local to translations for ascii-art alignment
docs: pdfdocs: Preserve inter-phrase space in Korean translations
docs: pdfdocs: Choose Serif font as CJK mainfont if possible
docs: pdfdocs: Add CJK-language-specific font settings
docs: pdfdocs: Refactor config for CJK document
scripts/kernel-doc: Override -Werror from KCFLAGS with KDOC_WERROR
docs/zh_CN: Add zh_CN/accounting/psi.rst
doc: align Italian translation
Documentation/features/vm: riscv supports THP now
docs/zh_CN: add infiniband user_verbs translation
docs/zh_CN: add infiniband user_mad translation
docs/zh_CN: add infiniband tag_matching translation
docs/zh_CN: add infiniband sysfs translation
docs/zh_CN: add infiniband opa_vnic translation
docs/zh_CN: add infiniband ipoib translation
...
Daniel Borkmann says:
====================
bpf-next 2021-08-31
We've added 116 non-merge commits during the last 17 day(s) which contain
a total of 126 files changed, 6813 insertions(+), 4027 deletions(-).
The main changes are:
1) Add opaque bpf_cookie to perf link which the program can read out again,
to be used in libbpf-based USDT library, from Andrii Nakryiko.
2) Add bpf_task_pt_regs() helper to access userspace pt_regs, from Daniel Xu.
3) Add support for UNIX stream type sockets for BPF sockmap, from Jiang Wang.
4) Allow BPF TCP congestion control progs to call bpf_setsockopt() e.g. to switch
to another congestion control algorithm during init, from Martin KaFai Lau.
5) Extend BPF iterator support for UNIX domain sockets, from Kuniyuki Iwashima.
6) Allow bpf_{set,get}sockopt() calls from setsockopt progs, from Prankur Gupta.
7) Add bpf_get_netns_cookie() helper for BPF_PROG_TYPE_{SOCK_OPS,CGROUP_SOCKOPT}
progs, from Xu Liu and Stanislav Fomichev.
8) Support for __weak typed ksyms in libbpf, from Hao Luo.
9) Shrink struct cgroup_bpf by 504 bytes through refactoring, from Dave Marchevsky.
10) Fix a smatch complaint in verifier's narrow load handling, from Andrey Ignatov.
11) Fix BPF interpreter's tail call count limit, from Daniel Borkmann.
12) Big batch of improvements to BPF selftests, from Magnus Karlsson, Li Zhijian,
Yucong Sun, Yonghong Song, Ilya Leoshkevich, Jussi Maki, Ilya Leoshkevich, others.
13) Another big batch to revamp XDP samples in order to give them consistent look
and feel, from Kumar Kartikeya Dwivedi.
* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (116 commits)
MAINTAINERS: Remove self from powerpc BPF JIT
selftests/bpf: Fix potential unreleased lock
samples: bpf: Fix uninitialized variable in xdp_redirect_cpu
selftests/bpf: Reduce more flakyness in sockmap_listen
bpf: Fix bpf-next builds without CONFIG_BPF_EVENTS
bpf: selftests: Add dctcp fallback test
bpf: selftests: Add connect_to_fd_opts to network_helpers
bpf: selftests: Add sk_state to bpf_tcp_helpers.h
bpf: tcp: Allow bpf-tcp-cc to call bpf_(get|set)sockopt
selftests: xsk: Preface options with opt
selftests: xsk: Make enums lower case
selftests: xsk: Generate packets from specification
selftests: xsk: Generate packet directly in umem
selftests: xsk: Simplify cleanup of ifobjects
selftests: xsk: Decrease sending speed
selftests: xsk: Validate tx stats on tx thread
selftests: xsk: Simplify packet validation in xsk tests
selftests: xsk: Rename worker_* functions that are not thread entry points
selftests: xsk: Disassociate umem size with packets sent
selftests: xsk: Remove end-of-test packet
...
====================
Link: https://lore.kernel.org/r/20210830225618.11634-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
The following patchset contains Netfilter updates for net-next:
1) Clean up and consolidate ct ecache infrastructure by merging ct and
expect notifiers, from Florian Westphal.
2) Missing counters and timestamp in nfnetlink_queue and _log conntrack
information.
3) Missing error check for xt_register_template() in iptables mangle,
as a incremental fix for the previous pull request, also from
Florian Westphal.
4) Add netfilter hooks for the SRv6 lightweigh tunnel driver, from
Ryoga Sato. The hooks are enabled via nf_hooks_lwtunnel sysctl
to make sure existing netfilter rulesets do not break. There is
a static key to disable the hooks by default.
The pktgen_bench_xmit_mode_netif_receive.sh shows no noticeable
impact in the seg6_input path for non-netfilter users: similar
numbers with and without this patch.
This is a sample of the perf report output:
11.67% kpktgend_0 [ipv6] [k] ipv6_get_saddr_eval
7.89% kpktgend_0 [ipv6] [k] __ipv6_addr_label
7.52% kpktgend_0 [ipv6] [k] __ipv6_dev_get_saddr
6.63% kpktgend_0 [kernel.vmlinux] [k] asm_exc_nmi
4.74% kpktgend_0 [ipv6] [k] fib6_node_lookup_1
3.48% kpktgend_0 [kernel.vmlinux] [k] pskb_expand_head
3.33% kpktgend_0 [ipv6] [k] ip6_rcv_core.isra.29
3.33% kpktgend_0 [ipv6] [k] seg6_do_srh_encap
2.53% kpktgend_0 [ipv6] [k] ipv6_dev_get_saddr
2.45% kpktgend_0 [ipv6] [k] fib6_table_lookup
2.24% kpktgend_0 [kernel.vmlinux] [k] ___cache_free
2.16% kpktgend_0 [ipv6] [k] ip6_pol_route
2.11% kpktgend_0 [kernel.vmlinux] [k] __ipv6_addr_type
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
conntrack has two distinct table size settings:
nf_conntrack_max and nf_conntrack_buckets.
The former limits how many conntrack objects are allowed to exist
in each namespace.
The second sets the size of the hashtable.
As all entries are inserted twice (once for original direction, once for
reply), there should be at least twice as many buckets in the table than
the maximum number of conntrack objects that can exist at the same time.
Change the default multiplier to 1 and increase the chosen bucket sizes.
This results in the same nf_conntrack_max settings as before but reduces
the average bucket list length.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This patch introduces netfilter hooks for solving the problem that
conntrack couldn't record both inner flows and outer flows.
This patch also introduces a new sysctl toggle for enabling lightweight
tunnel netfilter hooks.
Signed-off-by: Ryoga Saito <contact@proelbtn.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Currently, the pktgen.rst documentation doesn't cover the latest pktgen
sample usage options such as count and IPv6, and so on. Also, this
documentation includes the old sample scripts which are no longer use
because it was removed by the commit a4b6ade835 ("samples/pktgen :
remove remaining old pktgen sample scripts")
Thus, this commit documents pktgen sample usage using the latest options
and removes old sample scripts, and fixes a minor typo.
Signed-off-by: Juhee Kang <claudiajkang@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, there are many drivers who support CQE mode configuration,
some configure it as a fixed when initialized, some provide an
interface to change it by ethtool private flags. In order to make it
more generic, add two new 'ETHTOOL_A_COALESCE_USE_CQE_TX' and
'ETHTOOL_A_COALESCE_USE_CQE_RX' coalesce attributes, then these
parameters can be accessed by ethtool netlink coalesce uAPI.
Also add an new structure kernel_ethtool_coalesce, then the
new parameter can be added into this struct.
Signed-off-by: Yufeng Mo <moyufeng@huawei.com>
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
As suggested by David, document a somewhat unexpected behavior that results
from net.ipv4.tcp_l3mdev_accept=1. This behavior was encountered while
debugging FRR, a VRF-aware application, on a system which used
net.ipv4.tcp_l3mdev_accept=1 and where TCP connections for BGP with MD5
keys were failing to establish.
Cc: David Ahern <dsahern@gmail.com>
Signed-off-by: Benjamin Poirier <bpoirier@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Two new methods have been introduced, add some verbiage about what they do.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This function has disappeared in commit edac6f6332 ("Revert "net: dsa:
Allow drivers to filter packets they can decode source port from"").
Also, since commit 4e50025129 ("net: dsa: generalize overhead for
taggers that use both headers and trailers"), the next paragraph is no
longer true (it is still discouraged to do that, but it is now
supported, so no point in mentioning it). Delete.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Remove the paragraphs that talk about the various modes of traffic
support, bridging with foreign interfaces, etc etc. There is nothing
that the user needs to know now, it should all work out of the box as
expected.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The sja1105 driver has removed its devlink params, so there is nothing
to see here.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This series introduces the support for two new mlx5 features:
1) Sample offload for tunneled traffic
2) devlink rate objects support
1) From Chris Mi: Sample offload for tunneled traffic
=====================================================
Background and solution
-----------------------
Currently the sample offload actions send the encapsulated packet
to software. This series de-capsulates the packet before performing
the sampling and set the tunnel properties on the skb metadata
fields to make the behavior consistent with OVS sFlow.
If de-capsulating first, we can't use the same match like before in
default table. So instantiate a post action instance to continue
processing the action list. If HW can preserve reg_c, also use the
post action instance.
Post action infrastructure
--------------------------
Some tc actions are modeled in hardware using multiple tables
causing a tc action list split. For example, CT action is modeled
by jumping to a ct table which is controlled by nf flow table.
sFlow jumps in hardware to a sample table, which continues to a
"default table" where it should continue processing the action list.
Multi table actions are modeled in hardware using a unique fte_id.
The fte_id is set before jumping to a table. Split actions continue
to a post-action table where the matched fte_id value continues the
execution the tc action list.
This series also introduces post action infrastructure. Both ct and
sample use it.
Sample for tunnel in TC SW
--------------------------
tc filter add dev vxlan1 protocol ip parent ffff: prio 3 \
flower src_mac 24:25:d0:e1:00:00 dst_mac 02:25:d0:13:01:02 \
enc_src_ip 192.168.1.14 enc_dst_ip 192.168.1.13 \
enc_dst_port 4789 enc_key_id 4 \
action sample rate 1 group 6 \
action tunnel_key unset \
action mirred egress redirect dev enp4s0f0_1
MLX5 sample HW offload
----------------------
For the following typical flow table:
+-------------------------------+
+ original flow table +
+-------------------------------+
+ original match +
+-------------------------------+
+ sample action + other actions +
+-------------------------------+
We translate the tc filter with sample action to the following HW model:
+---------------------+
+ original flow table +
+---------------------+
+ original match +
+---------------------+
| set fte_id (if reg_c preserve cap)
| do decap
v
+------------------------------------------------+
+ Flow Sampler Object +
+------------------------------------------------+
+ sample ratio +
+------------------------------------------------+
+ sample table id | default table id +
+------------------------------------------------+
| |
v v
+-----------------------------+ +-------------------+
+ sample table + + default table +
+-----------------------------+ +-------------------+
+ forward to management vport + |
+-----------------------------+ |
+-------+------+
| |reg_c preserve cap
| |or decap action
v v
+-----------------+ +-------------+
+ per vport table + + post action +
+-----------------+ +-------------+
+ original match +
+-----------------+
+ other actions +
+-----------------+
2) From Dmytro Linkin: devlink rate object support for mlx5_core driver
=======================================================================
HIGH-LEVEL OVERVIEW
Devlink leaf rate objects created per vport (VF/SF, and PF on BlueField)
in switchdev mode on devlink port registration.
Implement devlink ops callbacks to create/destroy rate groups, set TX
rate values of the vport/group, assign vport to the group.
Driver accepts TX rate values as fraction of 1Mbps.
Refactor existing eswitch QoS infrastructure to be accessible by legacy
NDO rate API and new devlink rate API. NDO rate API is not
removed/disabled in switchdev mode to not break existing users. Rate
values configured with NDO rate API are not visible for devlink
infrastructure, therefore APIs should not be used simultaneously.
IMPLEMENTATION DETAILS
Driver provide two level rate hierarchy to manage bandwidth - group
level and vport level. Initially each vport added to internal unlimited
group created by default. Each rate element (vport or group) receive
bandwidth relative to its parent element (for groups the parent is a
physical link itself) in a Round Robin manner, where element get
bandwidth value according to its weight. Example:
Created four rate groups with tx_share limits:
$ devlink port function rate add \
pci/0000:06:00.0/group_1 tx_share 30gbit
$ devlink port function rate add \
pci/0000:06:00.0/group_2 tx_share 20gbit
$ devlink port function rate add \
pci/0000:06:00.0/group_3 tx_share 20gbit
$ devlink port function rate add \
pci/0000:06:00.0/group_4 tx_share 10gbit
Weights created in HW for each group are relative to the bigest tx_share
value, which is 30gbit:
<group_1> 1.0
<group_2> 0.67
<group_3> 0.67
<group_4> 0.33
Assuming link speed is 50 Gbit/sec and each group can sustain such
amount of traffic, maximum bandwidth is 50 / (1.0 + 0.67 + 0.67 + 0.33)
= ~18.75 Gbit/sec. Normilized bandwidth values for groups:
<group_1> 18.75 * 1.0 = 18.75 Gbit/sec
<group_2> 18.75 * 0.67 = 12.5 Gbit/sec
<group_3> 18.75 * 0.67 = 12.5 Gbit/sec
<group_4> 18.75 * 0.33 = 6.25 Gbit/sec
If in example above group_1 doesn't produce any traffic, then maximum
bandwidth becomes 50 / (0.67 + 0.67 + 0.33) = ~30.0 Gbit/sec. Normalized
values:
<group_2> 30.0 * 0.67 = 20.0 Gbit/sec
<group_3> 30.0 * 0.67 = 20.0 Gbit/sec
<group_4> 30.0 * 0.33 = 10.0 Gbit/sec
Same normalization applied to each vport in the group.
Normalized values are internal, therefore driver provides QoS
tracepoints for next events:
* vport rate element creation/deletion:
* vport rate element configuration;
* group rate element creation/deletion;
* group rate element configuration.
PATCHES OVERVIEW
1 - Moving and isolation of eswitch QoS logic in separate file;
2 - Implement devlink leaf rate object support for vports;
3 - Implement rate groups creation/deletion;
4 - Implement TX rate management for the groups;
5 - Implement parent set for vports;
6 - Eswitch QoS tracepoints.
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEGhZs6bAKwk/OTgTpSD+KveBX+j4FAmEfNKEACgkQSD+KveBX
+j4DVgf/ZTX3n7xDVrqNgM3hkUOT7QKVCq5zUlDKw1IizE6I+8xB6LO3KmUPyWIn
+VBXE3c+aqUNZu4XlqdkVn1JskVdfTdAGXIIeBgMQskJ/VFCNqU4E9uieEmpiHnI
DUUkEI6eBURJSu1KPD0xdMqtdGJE+/KjwmfZFnCsa4uxmRuV7B0BdxzGIA6AMFKn
+jNS/PFbQM6bUDqP2UUwd97sThtTzDIVH86gu36yK/mdcwdLreqKeuxoHJWePWHC
qLReBC5OQ9zXD2F1Dv2u3WU7EJT7qyCLNrBUrTwHcR9N0Di+2a6lGvjRL5tjWKKC
KOrNkkviurmPp+VieJU+rHHYQwjoGQ==
=XfOY
-----END PGP SIGNATURE-----
Merge tag 'mlx5-updates-2021-08-19' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux
Saeed Mahameed says:
====================
mlx5-updates-2021-08-19
This series introduces the support for two new mlx5 features:
1) Sample offload for tunneled traffic
2) devlink rate objects support
1) From Chris Mi: Sample offload for tunneled traffic
=====================================================
Background and solution
-----------------------
Currently the sample offload actions send the encapsulated packet
to software. This series de-capsulates the packet before performing
the sampling and set the tunnel properties on the skb metadata
fields to make the behavior consistent with OVS sFlow.
If de-capsulating first, we can't use the same match like before in
default table. So instantiate a post action instance to continue
processing the action list. If HW can preserve reg_c, also use the
post action instance.
Post action infrastructure
--------------------------
Some tc actions are modeled in hardware using multiple tables
causing a tc action list split. For example, CT action is modeled
by jumping to a ct table which is controlled by nf flow table.
sFlow jumps in hardware to a sample table, which continues to a
"default table" where it should continue processing the action list.
Multi table actions are modeled in hardware using a unique fte_id.
The fte_id is set before jumping to a table. Split actions continue
to a post-action table where the matched fte_id value continues the
execution the tc action list.
This series also introduces post action infrastructure. Both ct and
sample use it.
Sample for tunnel in TC SW
--------------------------
tc filter add dev vxlan1 protocol ip parent ffff: prio 3 \
flower src_mac 24:25:d0:e1:00:00 dst_mac 02:25:d0:13:01:02 \
enc_src_ip 192.168.1.14 enc_dst_ip 192.168.1.13 \
enc_dst_port 4789 enc_key_id 4 \
action sample rate 1 group 6 \
action tunnel_key unset \
action mirred egress redirect dev enp4s0f0_1
MLX5 sample HW offload
----------------------
For the following typical flow table:
+-------------------------------+
+ original flow table +
+-------------------------------+
+ original match +
+-------------------------------+
+ sample action + other actions +
+-------------------------------+
We translate the tc filter with sample action to the following HW model:
+---------------------+
+ original flow table +
+---------------------+
+ original match +
+---------------------+
| set fte_id (if reg_c preserve cap)
| do decap
v
+------------------------------------------------+
+ Flow Sampler Object +
+------------------------------------------------+
+ sample ratio +
+------------------------------------------------+
+ sample table id | default table id +
+------------------------------------------------+
| |
v v
+-----------------------------+ +-------------------+
+ sample table + + default table +
+-----------------------------+ +-------------------+
+ forward to management vport + |
+-----------------------------+ |
+-------+------+
| |reg_c preserve cap
| |or decap action
v v
+-----------------+ +-------------+
+ per vport table + + post action +
+-----------------+ +-------------+
+ original match +
+-----------------+
+ other actions +
+-----------------+
2) From Dmytro Linkin: devlink rate object support for mlx5_core driver
=======================================================================
HIGH-LEVEL OVERVIEW
Devlink leaf rate objects created per vport (VF/SF, and PF on BlueField)
in switchdev mode on devlink port registration.
Implement devlink ops callbacks to create/destroy rate groups, set TX
rate values of the vport/group, assign vport to the group.
Driver accepts TX rate values as fraction of 1Mbps.
Refactor existing eswitch QoS infrastructure to be accessible by legacy
NDO rate API and new devlink rate API. NDO rate API is not
removed/disabled in switchdev mode to not break existing users. Rate
values configured with NDO rate API are not visible for devlink
infrastructure, therefore APIs should not be used simultaneously.
IMPLEMENTATION DETAILS
Driver provide two level rate hierarchy to manage bandwidth - group
level and vport level. Initially each vport added to internal unlimited
group created by default. Each rate element (vport or group) receive
bandwidth relative to its parent element (for groups the parent is a
physical link itself) in a Round Robin manner, where element get
bandwidth value according to its weight. Example:
Created four rate groups with tx_share limits:
$ devlink port function rate add \
pci/0000:06:00.0/group_1 tx_share 30gbit
$ devlink port function rate add \
pci/0000:06:00.0/group_2 tx_share 20gbit
$ devlink port function rate add \
pci/0000:06:00.0/group_3 tx_share 20gbit
$ devlink port function rate add \
pci/0000:06:00.0/group_4 tx_share 10gbit
Weights created in HW for each group are relative to the bigest tx_share
value, which is 30gbit:
<group_1> 1.0
<group_2> 0.67
<group_3> 0.67
<group_4> 0.33
Assuming link speed is 50 Gbit/sec and each group can sustain such
amount of traffic, maximum bandwidth is 50 / (1.0 + 0.67 + 0.67 + 0.33)
= ~18.75 Gbit/sec. Normilized bandwidth values for groups:
<group_1> 18.75 * 1.0 = 18.75 Gbit/sec
<group_2> 18.75 * 0.67 = 12.5 Gbit/sec
<group_3> 18.75 * 0.67 = 12.5 Gbit/sec
<group_4> 18.75 * 0.33 = 6.25 Gbit/sec
If in example above group_1 doesn't produce any traffic, then maximum
bandwidth becomes 50 / (0.67 + 0.67 + 0.33) = ~30.0 Gbit/sec. Normalized
values:
<group_2> 30.0 * 0.67 = 20.0 Gbit/sec
<group_3> 30.0 * 0.67 = 20.0 Gbit/sec
<group_4> 30.0 * 0.33 = 10.0 Gbit/sec
Same normalization applied to each vport in the group.
Normalized values are internal, therefore driver provides QoS
tracepoints for next events:
* vport rate element creation/deletion:
* vport rate element configuration;
* group rate element creation/deletion;
* group rate element configuration.
PATCHES OVERVIEW
1 - Moving and isolation of eswitch QoS logic in separate file;
2 - Implement devlink leaf rate object support for vports;
3 - Implement rate groups creation/deletion;
4 - Implement TX rate management for the groups;
5 - Implement parent set for vports;
6 - Eswitch QoS tracepoints.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
- bump version strings, by Simon Wunderlich
- update docs about move IRC channel away from freenode,
by Sven Eckelmann
- Switch to kstrtox.h for kstrtou64, by Sven Eckelmann
- Update NULL checks, by Sven Eckelmann (2 patches)
- remove remaining skb-copy calls for broadcast packets,
by Linus Lüssing
-----BEGIN PGP SIGNATURE-----
iQJKBAABCgA0FiEE1ilQI7G+y+fdhnrfoSvjmEKSnqEFAmEeeK8WHHN3QHNpbW9u
d3VuZGVybGljaC5kZQAKCRChK+OYQpKeofhWD/0YwNndM/FFo/NHcO3GFDZx9eLM
dFuO7zdMilzgg462q7+mgi0jXA2Kp50Y+JcCqS2XVRIsMgKTVABflgmSlUIOdDoC
A3KKRVgQ1HNPD4WREaEV2CLvBdhR9wEI0jRHvZou7n/VWrfJcUHgdl9aDA2/ptlP
NcuYCKC99HCQmvaBt4GZgOunYDeplmo2qLip2gpwJWf9/vkL7HiBe3HtQSh1HI2y
EIn4SExZOFcxMmKeJMsYl35OZh9oFv7nTnpZBGyKjA+HS0pu03aaPNRGMjW/pdhF
f7V61aDJBU0xU6PjWvUegY4VMInrjW8F10EEJck461J/B9PXjUHUaH8BXXuGBkRM
0kU0Cv21a3Ovz23lgnXSnXu/xjqq5/zZHjnGvyPAMMppAI5f73q/0THtv9iOu+Cz
Qf/tYl0BIRir20ZWtddQ9x2W3+cBYPOYrf/tnmWqFhPddenn+xitwTysVA6fOykQ
pVksQ5UVpDZasZI9Al+R2M0CBttn7tS/iu95PV9CMST8aRgUuU90yd2Ocg3rRDNQ
iEor0AozmO879W460BFQcTILw+D7OdlErUV8H8VW4507imZ7JXGPwZTFxhjM2Xhx
wUXo/o2sxt/ITSdtZAeQj8zOXQMtOi3KlXtTl8ZzyRT//YLWah0j4oBf0a8K62/y
i1Pd5MgXDQAm8fHkBg==
=sHz+
-----END PGP SIGNATURE-----
Merge tag 'batadv-next-pullrequest-20210819' of git://git.open-mesh.org/linux-merge
Simon Wunderlich says:
====================
This cleanup patchset includes the following patches:
- bump version strings, by Simon Wunderlich
- update docs about move IRC channel away from freenode,
by Sven Eckelmann
- Switch to kstrtox.h for kstrtou64, by Sven Eckelmann
- Update NULL checks, by Sven Eckelmann (2 patches)
- remove remaining skb-copy calls for broadcast packets,
by Linus Lüssing
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Turn BPF_PROG_RUN into a proper always inlined function. No functional and
performance changes are intended, but it makes it much easier to understand
what's going on with how BPF programs are actually get executed. It's more
obvious what types and callbacks are expected. Also extra () around input
parameters can be dropped, as well as `__` variable prefixes intended to avoid
naming collisions, which makes the code simpler to read and write.
This refactoring also highlighted one extra issue. BPF_PROG_RUN is both
a macro and an enum value (BPF_PROG_RUN == BPF_PROG_TEST_RUN). Turning
BPF_PROG_RUN into a function causes naming conflict compilation error. So
rename BPF_PROG_RUN into lower-case bpf_prog_run(), similar to
bpf_prog_run_xdp(), bpf_prog_run_pin_on_cpu(), etc. All existing callers of
BPF_PROG_RUN, the macro, are switched to bpf_prog_run() explicitly.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20210815070609.987780-2-andrii@kernel.org
Add documentation for two bad signal integrity substates:
ETHTOOL_LINK_EXT_SUBSTATE_BSI_SERDES_REFERENCE_CLOCK_LOST
ETHTOOL_LINK_EXT_SUBSTATE_BSI_SERDES_ALOS.
Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The msk can use backup subflows to transmit in-sequence data
only if there are no other active subflow. On active backup
scenario, the MPTCP connection can do forward progress only
due to MPTCP retransmissions - rtx can pick backup subflows.
This patch introduces a new flag flow MPTCP subflows: if the
underlying TCP connection made no progresses for long time,
and there are other less problematic subflows available, the
given subflow become stale.
Stale subflows are not considered active: if all non backup
subflows become stale, the MPTCP scheduler can pick backup
subflows for plain transmissions.
Stale subflows can return in active state, as soon as any reply
from the peer is observed.
Active backup scenarios can now leverage the available b/w
with no restrinction.
Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/207
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add new device generic parameter to enable/disable creation of
VDPA net auxiliary device and associated device functionality
in the devlink instance.
User who prefers to disable such functionality can disable it using below
example.
$ devlink dev param set pci/0000:06:00.0 \
name enable_vnet value false cmode driverinit
$ devlink dev reload pci/0000:06:00.0
At this point devlink instance do not create auxiliary device for the
VDPA net functionality.
Signed-off-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add new device generic parameter to enable/disable creation of
RDMA auxiliary device and associated device functionality
in the devlink instance.
User who prefers to disable such functionality can disable it using below
example.
$ devlink dev param set pci/0000:06:00.0 \
name enable_rdma value false cmode driverinit
$ devlink dev reload pci/0000:06:00.0
At this point devlink instance do not create auxiliary device for the
RDMA functionality.
Signed-off-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add new device generic parameter to enable/disable creation of
Ethernet auxiliary device and associated device functionality
in the devlink instance.
User who prefers to disable such functionality can disable it using below
example.
$ devlink dev param set pci/0000:06:00.0 \
name enable_eth value false cmode driverinit
$ devlink dev reload pci/0000:06:00.0
At this point devlink instance do not create auxiliary device for the
Ethernet functionality.
Signed-off-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Due to recent developments around the Freenode.org IRC network, the
opinions about the usage of this service shifted dramatically. The majority
of the still active users of the #batman channel prefers a move to the
hackint.org network.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following patchset contains Netfilter fixes for net:
1) Restrict range element expansion in ipset to avoid soft lockup,
from Jozsef Kadlecsik.
2) Memleak in error path for nf_conntrack_bridge for IPv4 packets,
from Yajun Deng.
3) Simplify conntrack garbage collection strategy to avoid frequent
wake-ups, from Florian Westphal.
4) Fix NFNLA_HOOK_FUNCTION_NAME string, do not include module name.
5) Missing chain family netlink attribute in chain description
in nfnetlink_hook.
6) Incorrect sequence number on nfnetlink_hook dumps.
7) Use netlink request family in reply message for consistency.
8) Remove offload_pickup sysctl, use conntrack for established state
instead, from Florian Westphal.
9) Translate NFPROTO_INET/ingress to NFPROTO_NETDEV/ingress, since
NFPROTO_INET is not exposed through nfnetlink_hook.
* git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf:
netfilter: nfnetlink_hook: translate inet ingress to netdev
netfilter: conntrack: remove offload_pickup sysctl again
netfilter: nfnetlink_hook: Use same family as request message
netfilter: nfnetlink_hook: use the sequence number of the request message
netfilter: nfnetlink_hook: missing chain family
netfilter: nfnetlink_hook: strip off module name from hookfn
netfilter: conntrack: collect all entries in one cycle
netfilter: nf_conntrack_bridge: Fix memory leak when error
netfilter: ipset: Limit the maximal range of consecutive elements to add/delete
====================
Link: https://lore.kernel.org/r/20210806151149.6356-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
These two sysctls were added because the hardcoded defaults (2 minutes,
tcp, 30 seconds, udp) turned out to be too low for some setups.
They appeared in 5.14-rc1 so it should be fine to remove it again.
Marcelo convinced me that there should be no difference between a flow
that was offloaded vs. a flow that was not wrt. timeout handling.
Thus the default is changed to those for TCP established and UDP stream,
5 days and 120 seconds, respectively.
Marcelo also suggested to account for the timeout value used for the
offloading, this avoids increase beyond the value in the conntrack-sysctl
and will also instantly expire the conntrack entry with altered sysctls.
Example:
nf_conntrack_udp_timeout_stream=60
nf_flowtable_udp_timeout=60
This will remove offloaded udp flows after one minute, rather than two.
An earlier version of this patch also cleared the ASSURED bit to
allow nf_conntrack to evict the entry via early_drop (i.e., table full).
However, it looks like we can safely assume that connection timed out
via HW is still in established state, so this isn't needed.
Quoting Oz:
[..] the hardware sends all packets with a set FIN flags to sw.
[..] Connections that are aged in hardware are expected to be in the
established state.
In case it turns out that back-to-sw-path transition can occur for
'dodgy' connections too (e.g., one side disappeared while software-path
would have been in RETRANS timeout), we can adjust this later.
Cc: Oz Shlomo <ozsh@nvidia.com>
Cc: Paul Blakey <paulb@nvidia.com>
Suggested-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Reviewed-by: Oz Shlomo <ozsh@nvidia.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Build failure in drivers/net/wwan/mhi_wwan_mbim.c:
add missing parameter (0, assuming we don't want buffer pre-alloc).
Conflict in drivers/net/dsa/sja1105/sja1105_main.c between:
589918df93 ("net: dsa: sja1105: be stateless with FDB entries on SJA1105P/Q/R/S/SJA1110 too")
0fac6aa098 ("net: dsa: sja1105: delete the best_effort_vlan_filtering mode")
Follow the instructions from the commit message of the former commit
- removed the if conditions. When looking at commit 589918df93 ("net:
dsa: sja1105: be stateless with FDB entries on SJA1105P/Q/R/S/SJA1110 too")
note that the mask_iotag fields get removed by the following patch.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
There are aspects of netdevsim which are commonly
misunderstood and pointed out in review. Cong
suggest we document them.
Suggested-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add an option lacp_active, which is similar with team's runner.active.
This option specifies whether to send LACPDU frames periodically. If set
on, the LACPDU frames are sent along with the configured lacp_rate
setting. If set off, the LACPDU frames acts as "speak when spoken to".
Note, the LACPDU state frames still will be sent when init or unbind port.
v2: remove module parameter
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
IF_OPER_TESTING is in fact used today.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Andrii Nakryiko says:
====================
bpf-next 2021-07-30
We've added 64 non-merge commits during the last 15 day(s) which contain
a total of 83 files changed, 5027 insertions(+), 1808 deletions(-).
The main changes are:
1) BTF-guided binary data dumping libbpf API, from Alan.
2) Internal factoring out of libbpf CO-RE relocation logic, from Alexei.
3) Ambient BPF run context and cgroup storage cleanup, from Andrii.
4) Few small API additions for libbpf 1.0 effort, from Evgeniy and Hengqi.
5) bpf_program__attach_kprobe_opts() fixes in libbpf, from Jiri.
6) bpf_{get,set}sockopt() support in BPF iterators, from Martin.
7) BPF map pinning improvements in libbpf, from Martynas.
8) Improved module BTF support in libbpf and bpftool, from Quentin.
9) Bpftool cleanups and documentation improvements, from Quentin.
10) Libbpf improvements for supporting CO-RE on old kernels, from Shuyi.
11) Increased maximum cgroup storage size, from Stanislav.
12) Small fixes and improvements to BPF tests and samples, from various folks.
* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (64 commits)
tools: bpftool: Complete metrics list in "bpftool prog profile" doc
tools: bpftool: Document and add bash completion for -L, -B options
selftests/bpf: Update bpftool's consistency script for checking options
tools: bpftool: Update and synchronise option list in doc and help msg
tools: bpftool: Complete and synchronise attach or map types
selftests/bpf: Check consistency between bpftool source, doc, completion
tools: bpftool: Slightly ease bash completion updates
unix_bpf: Fix a potential deadlock in unix_dgram_bpf_recvmsg()
libbpf: Add btf__load_vmlinux_btf/btf__load_module_btf
tools: bpftool: Support dumping split BTF by id
libbpf: Add split BTF support for btf__load_from_kernel_by_id()
tools: Replace btf__get_from_id() with btf__load_from_kernel_by_id()
tools: Free BTF objects at various locations
libbpf: Rename btf__get_from_id() as btf__load_from_kernel_by_id()
libbpf: Rename btf__load() as btf__load_into_kernel()
libbpf: Return non-null error on failures in libbpf_find_prog_btf_id()
bpf: Emit better log message if bpf_iter ctx arg btf_id == 0
tools/resolve_btfids: Emit warnings and patch zero id for missing symbols
bpf: Increase supported cgroup storage value size
libbpf: Fix race when pinning maps in parallel
...
====================
Link: https://lore.kernel.org/r/20210730225606.1897330-1-andrii@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Document the mirroring capabilities of the dpaa2-switch driver,
any restrictions that are imposed and some example commands.
Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This change adds a brief document about the sockets API provided for
sending and receiving MCTP messages from userspace.
This is roughly based on the OpenBMC design document, at:
https://github.com/openbmc/docs/blob/master/designs/mctp/mctp-kernel.md
Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Append ioam6-sysctl to toctree in order to get rid of building warnings.
Signed-off-by: Hu Haowen <src.res@email.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>
All other user triggered operations are gone from ndo_ioctl, so move
the SIOCBOND family into a custom operation as well.
The .ndo_ioctl() helper is no longer called by the dev_ioctl.c code now,
but there are still a few definitions in obsolete wireless drivers as well
as the appletalk and ieee802154 layers to call SIOCSIFADDR/SIOCGIFADDR
helpers from inside the kernel.
Cc: Jay Vosburgh <j.vosburgh@gmail.com>
Cc: Veaceslav Falico <vfalico@gmail.com>
Cc: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
In order to further reduce the scope of ndo_do_ioctl(), move
out the SIOCWANDEV handling into a new network device operation
function.
Adjust the prototype to only pass the if_settings sub-structure
in place of the ifreq, and remove the redundant 'cmd' argument
in the process.
Cc: Krzysztof Halasa <khc@pm.waw.pl>
Cc: "Jan \"Yenya\" Kasprzak" <kas@fi.muni.cz>
Cc: Kevin Curtis <kevin.curtis@farsite.co.uk>
Cc: Zhao Qiang <qiang.zhao@nxp.com>
Cc: Martin Schiller <ms@dev.tdt.de>
Cc: Jiri Slaby <jirislaby@kernel.org>
Cc: linux-x25@vger.kernel.org
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Most users of ndo_do_ioctl are ethernet drivers that implement
the MII commands SIOCGMIIPHY/SIOCGMIIREG/SIOCSMIIREG, or hardware
timestamping with SIOCSHWTSTAMP/SIOCGHWTSTAMP.
Separate these from the few drivers that use ndo_do_ioctl to
implement SIOCBOND, SIOCBR and SIOCWANDEV commands.
This is a purely cosmetic change intended to help readers find
their way through the implementation.
Cc: Doug Ledford <dledford@redhat.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jay Vosburgh <j.vosburgh@gmail.com>
Cc: Veaceslav Falico <vfalico@gmail.com>
Cc: Andy Gospodarek <andy@greyhouse.net>
Cc: Andrew Lunn <andrew@lunn.ch>
Cc: Vivien Didelot <vivien.didelot@gmail.com>
Cc: Florian Fainelli <f.fainelli@gmail.com>
Cc: Vladimir Oltean <olteanv@gmail.com>
Cc: Leon Romanovsky <leon@kernel.org>
Cc: linux-rdma@vger.kernel.org
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
SIOCDEVPRIVATE ioctl commands are mainly used in really old
drivers, and they have a number of problems:
- They hide behind the normal .ndo_do_ioctl function that
is also used for other things in modern drivers, so it's
hard to spot a driver that actually uses one of these
- Since drivers use a number different calling conventions,
it is impossible to support compat mode for them in
a generic way.
- With all drivers using the same 16 commands codes, there
is no way to introspect the data being passed through
things like strace.
Add a new net_device_ops callback pointer, to address the
first two of these. Separating them from .ndo_do_ioctl
makes it easy to grep for drivers with a .ndo_siocdevprivate
callback, and the unwieldy name hopefully makes it easier
to spot in code review.
By passing the ifreq structure and the ifr_data pointer
separately, it is no longer necessary to overload these,
and the driver can use either one for a given command.
Cc: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a documentation entry for the DPAA2 switch listing its
requirements, features and some examples to go along them.
Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a file to document devlink support for hns3 driver, now support devlink
info and devlink reload.
Signed-off-by: Hao Chen <chenhao288@hisilicon.com>
Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix the DPAA2 DPIO driver chapter title by adding the necessary
overline. Without this, the index page of the DPAA2 documentation
doesn't display properly.
Fixes: d8e516bac7 ("soc: fsl: dpio: Convert DPIO documentation to .rst")
Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Link: https://lore.kernel.org/r/20210722100356.635078-5-ciorneiioana@gmail.com
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Multiple complaints have been raised from the TFO users on the internet
stating that the TFO blackhole logic is too aggressive and gets falsely
triggered too often.
(e.g. https://blog.apnic.net/2021/07/05/tcp-fast-open-not-so-fast/)
Considering that most middleboxes no longer drop TFO packets, we decide
to disable the blackhole logic by setting
/proc/sys/net/ipv4/tcp_fastopen_blackhole_timeout_set to 0 by default.
Fixes: cf1ef3f071 ("net/tcp_fastopen: Disable active side TFO in certain scenarios")
Signed-off-by: Wei Wang <weiwan@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add documentation for new IOAM sysctls:
- ioam6_id and ioam6_id_wide: two per-namespace sysctls
- ioam6_enabled, ioam6_id and ioam6_id_wide: three per-interface sysctls
Example of IOAM configuration based on the following simple topology:
_____ _____ _____
| | eth0 eth0 | | eth1 eth0 | |
| A |.----------.| B |.----------.| C |
|_____| |_____| |_____|
1) Node and interface IDs can be configured for IOAM:
# IOAM ID of A = 1, IOAM ID of A.eth0 = 11
(A) sysctl -w net.ipv6.ioam6_id=1
(A) sysctl -w net.ipv6.conf.eth0.ioam6_id=11
# IOAM ID of B = 2, IOAM ID of B.eth0 = 21, IOAM ID of B.eth1 = 22
(B) sysctl -w net.ipv6.ioam6_id=2
(B) sysctl -w net.ipv6.conf.eth0.ioam6_id=21
(B) sysctl -w net.ipv6.conf.eth1.ioam6_id=22
# IOAM ID of C = 3, IOAM ID of C.eth0 = 31
(C) sysctl -w net.ipv6.ioam6_id=3
(C) sysctl -w net.ipv6.conf.eth0.ioam6_id=31
Note that "_wide" IDs equivalents can be configured the same way.
2) Each node can be configured to form an IOAM domain. For instance,
we allow IOAM from A to C only (not the reverse path), i.e. enable
IOAM on ingress for B.eth0 and C.eth0:
(B) sysctl -w net.ipv6.conf.eth0.ioam6_enabled=1
(C) sysctl -w net.ipv6.conf.eth0.ioam6_enabled=1
3) An IOAM domain (e.g. ID=123) is defined and made known to each node:
(A) ip ioam namespace add 123
(B) ip ioam namespace add 123
(C) ip ioam namespace add 123
4) Finally, an IOAM Pre-allocated Trace can be inserted in traffic sent
by A when C (e.g. db02::2) is the destination:
(A) ip -6 route add db02::2/128 encap ioam6 trace type 0x800000 ns 123
size 12 dev eth0
Signed-off-by: Justin Iurman <justin.iurman@uliege.be>
Signed-off-by: David S. Miller <davem@davemloft.net>