Commit Graph

1073501 Commits

Author SHA1 Message Date
Geliang Tang
a224a847ae selftests: mptcp: add the id argument for set_flags
This patch added the id argument for setting the address flags in
pm_nl_ctl.

Usage:

    pm_nl_ctl set id 1 flags backup

Signed-off-by: Geliang Tang <geliang.tang@suse.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-02-04 20:30:24 -08:00
Geliang Tang
f014038625 selftests: mptcp: add wrapper for setting flags
This patch implemented a new function named pm_nl_set_endpoint(), wrapped
the PM netlink commands 'ip mptcp endpoint change flags' and 'pm_nl_ctl
set flags' in it, and used a new argument 'ip_mptcp' to choose which one
to use to set the flags of the PM endpoint.

'ip mptcp' used the ID number argument to find out the address to change
flags, while 'pm_nl_ctl' used the address and port number arguments. So
we need to parse the address ID from the PM dump output as well as the
address and port number.

Used this wrapper in do_transfer() instead of using the pm_nl_ctl command
directly.

Signed-off-by: Geliang Tang <geliang.tang@suse.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-02-04 20:30:24 -08:00
Geliang Tang
dda61b3dbe selftests: mptcp: add wrapper for showing addrs
This patch implemented a new function named pm_nl_show_endpoints(), wrapped
the PM netlink commands 'ip mptcp endpoint show' and 'pm_nl_ctl dump' in
it, used a new argument 'ip_mptcp' to choose which one to use to show all
the PM endpoints.

Used this wrapper in do_transfer() instead of using the pm_nl_ctl commands
directly.

The original 'pos+=5' in the remoing tests only works for the output of
'pm_nl_ctl show':

  id 1 flags subflow 10.0.1.1

It doesn't work for the output of 'ip mptcp endpoint show':

  10.0.1.1 id 1 subflow

So implemented a more flexible approach to get the address ID from the PM
dump output to fit for both commands.

Signed-off-by: Geliang Tang <geliang.tang@suse.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-02-04 20:30:24 -08:00
Geliang Tang
34aa6e3bcc selftests: mptcp: add ip mptcp wrappers
This patch added four basic 'ip mptcp' wrappers:

pm_nl_set_limits()
pm_nl_add_endpoint()
pm_nl_del_endpoint()
pm_nl_flush_endpoint().

Wrapped the PM netlink commands 'ip mptcp' and 'pm_nl_ctl' in them, and
used a new argument 'ip_mptcp' to choose which one to use for setting the
PM limits, adding or deleting the PM endpoint.

Used the wrappers in all the selftests in mptcp_join.sh instead of using
the pm_nl_ctl commands directly.

Signed-off-by: Geliang Tang <geliang.tang@suse.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-02-04 20:30:23 -08:00
Geliang Tang
33397b83ee selftests: mptcp: add backup with port testcase
This patch added the backup testcase using an address with a port number.

The original backup tests only work for the output of 'pm_nl_ctl dump'
without the port number. It chooses the last item in the dump to parse
the address in it, and in this case, the address is showed at the end
of the item.

But it doesn't work for the dump with the port number, in this case, the
port number is showed at the end of the item, not the address.

So implemented a more flexible approach to get the address and the port
number from the dump to fit for the port number case.

Signed-off-by: Geliang Tang <geliang.tang@suse.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-02-04 20:30:23 -08:00
Geliang Tang
d6a676e0e1 selftests: mptcp: add the port argument for set_flags
This patch added the port argument for setting the address flags in
pm_nl_ctl.

Usage:

    pm_nl_ctl set 10.0.2.1 flags backup port 10100

Signed-off-by: Geliang Tang <geliang.tang@suse.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-02-04 20:30:23 -08:00
Geliang Tang
09f12c3ab7 mptcp: allow to use port and non-signal in set_flags
It's illegal to use both port and non-signal flags for adding address.
But it's legal to use both of them for setting flags, which always uses
non-signal flags, backup or fullmesh.

This patch moves this non-signal flag with port check from
mptcp_pm_parse_addr() to mptcp_nl_cmd_add_addr(). Do the check only when
adding addresses, not setting flags or deleting addresses.

Signed-off-by: Geliang Tang <geliang.tang@suse.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-02-04 20:30:23 -08:00
Jakub Kicinski
660a38bf6f Merge branch 'support-for-the-ioam-insertion-frequency'
Justin Iurman says:

====================
Support for the IOAM insertion frequency

The insertion frequency is represented as "k/n", meaning IOAM will be
added to {k} packets over {n} packets, with 0 < k <= n and 1 <= {k,n} <=
1000000. Therefore, it provides the following percentages of insertion
frequency: [0.0001% (min) ... 100% (max)].

Not only this solution allows an operator to apply dynamic frequencies
based on the current traffic load, but it also provides some
flexibility, i.e., by distinguishing similar cases (e.g., "1/2" and
"2/4").

"1/2" = Y N Y N Y N Y N ...
"2/4" = Y Y N N Y Y N N ...
====================

Link: https://lore.kernel.org/r/20220202142554.9691-1-justin.iurman@uliege.be
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-02-04 20:24:47 -08:00
Justin Iurman
08731d30e7 ipv6: ioam: Insertion frequency in lwtunnel output
Add support for the IOAM insertion frequency inside its lwtunnel output
function. This patch introduces a new (atomic) counter for packets,
based on which the algorithm will decide if IOAM should be added or not.

Default frequency is "1/1" (i.e., applied to all packets) for backward
compatibility. The iproute2 patch is ready and will be submitted as soon
as this one is accepted.

Previous iproute2 command:
ip -6 ro ad fc00::1/128 encap ioam6 [ mode ... ] ...

New iproute2 command:
ip -6 ro ad fc00::1/128 encap ioam6 [ freq k/n ] [ mode ... ] ...

Signed-off-by: Justin Iurman <justin.iurman@uliege.be>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-02-04 20:24:45 -08:00
Justin Iurman
be847673cf uapi: ioam: Insertion frequency
Add the insertion frequency uapi for IOAM lwtunnels.

Signed-off-by: Justin Iurman <justin.iurman@uliege.be>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-02-04 20:24:45 -08:00
Yonghong Song
0908a66ad1 libbpf: Fix build issue with llvm-readelf
There are cases where clang compiler is packaged in a way
readelf is a symbolic link to llvm-readelf. In such cases,
llvm-readelf will be used instead of default binutils readelf,
and the following error will appear during libbpf build:

  Warning: Num of global symbols in
   /home/yhs/work/bpf-next/tools/testing/selftests/bpf/tools/build/libbpf/sharedobjs/libbpf-in.o (367)
   does NOT match with num of versioned symbols in
   /home/yhs/work/bpf-next/tools/testing/selftests/bpf/tools/build/libbpf/libbpf.so libbpf.map (383).
   Please make sure all LIBBPF_API symbols are versioned in libbpf.map.
  --- /home/yhs/work/bpf-next/tools/testing/selftests/bpf/tools/build/libbpf/libbpf_global_syms.tmp ...
  +++ /home/yhs/work/bpf-next/tools/testing/selftests/bpf/tools/build/libbpf/libbpf_versioned_syms.tmp ...
  @@ -324,6 +324,22 @@
   btf__str_by_offset
   btf__type_by_id
   btf__type_cnt
  +LIBBPF_0.0.1
  +LIBBPF_0.0.2
  +LIBBPF_0.0.3
  +LIBBPF_0.0.4
  +LIBBPF_0.0.5
  +LIBBPF_0.0.6
  +LIBBPF_0.0.7
  +LIBBPF_0.0.8
  +LIBBPF_0.0.9
  +LIBBPF_0.1.0
  +LIBBPF_0.2.0
  +LIBBPF_0.3.0
  +LIBBPF_0.4.0
  +LIBBPF_0.5.0
  +LIBBPF_0.6.0
  +LIBBPF_0.7.0
   libbpf_attach_type_by_name
   libbpf_find_kernel_btf
   libbpf_find_vmlinux_btf_id
  make[2]: *** [Makefile:184: check_abi] Error 1
  make[1]: *** [Makefile:140: all] Error 2

The above failure is due to different printouts for some ABS
versioned symbols. For example, with the same libbpf.so,
  $ /bin/readelf --dyn-syms --wide tools/lib/bpf/libbpf.so | grep "LIBBPF" | grep ABS
     134: 0000000000000000     0 OBJECT  GLOBAL DEFAULT  ABS LIBBPF_0.5.0
     202: 0000000000000000     0 OBJECT  GLOBAL DEFAULT  ABS LIBBPF_0.6.0
     ...
  $ /opt/llvm/bin/readelf --dyn-syms --wide tools/lib/bpf/libbpf.so | grep "LIBBPF" | grep ABS
     134: 0000000000000000     0 OBJECT  GLOBAL DEFAULT   ABS LIBBPF_0.5.0@@LIBBPF_0.5.0
     202: 0000000000000000     0 OBJECT  GLOBAL DEFAULT   ABS LIBBPF_0.6.0@@LIBBPF_0.6.0
     ...
The binutils readelf doesn't print out the symbol LIBBPF_* version and llvm-readelf does.
Such a difference caused libbpf build failure with llvm-readelf.

The proposed fix filters out all ABS symbols as they are not part of the comparison.
This works for both binutils readelf and llvm-readelf.

Reported-by: Delyan Kratunov <delyank@fb.com>
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20220204214355.502108-1-yhs@fb.com
2022-02-04 18:22:56 -08:00
Jakub Kicinski
c78b8b20e3 net: don't include ndisc.h from ipv6.h
Nothing in ipv6.h needs ndisc.h, drop it.

Link: https://lore.kernel.org/r/20220203043457.2222388-1-kuba@kernel.org
Acked-by: Jeremy Kerr <jk@codeconstruct.com.au>
Acked-by: Stefan Schmidt <stefan@datenfreihafen.org>
Link: https://lore.kernel.org/r/20220203231240.2297588-1-kuba@kernel.org
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-02-04 14:15:11 -08:00
Matteo Croce
976a38e05a selftests/bpf: Test bpf_core_types_are_compat() functionality.
Add several tests to check bpf_core_types_are_compat() functionality:
- candidate type name exists and types match
- candidate type name exists but types don't match
- nested func protos at kernel recursion limit
- nested func protos above kernel recursion limit. Such bpf prog
  is rejected during the load.

Signed-off-by: Matteo Croce <mcroce@microsoft.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20220204005519.60361-3-mcroce@linux.microsoft.com
2022-02-04 11:29:01 -08:00
Matteo Croce
e70e13e7d4 bpf: Implement bpf_core_types_are_compat().
Adopt libbpf's bpf_core_types_are_compat() for kernel duty by adding
explicit recursion limit of 2 which is enough to handle 2 levels of
function prototypes.

Signed-off-by: Matteo Croce <mcroce@microsoft.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20220204005519.60361-2-mcroce@linux.microsoft.com
2022-02-04 11:26:26 -08:00
Hou Tao
b5e975d256 bpf, arm64: Enable kfunc call
Since commit b2eed9b588 ("arm64/kernel: kaslr: reduce module
randomization range to 2 GB"), for arm64 whether KASLR is enabled
or not, the module is placed within 2GB of the kernel region, so
s32 in bpf_kfunc_desc is sufficient to represente the offset of
module function relative to __bpf_call_base. The only thing needed
is to override bpf_jit_supports_kfunc_call().

Signed-off-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20220130092917.14544-2-hotforest@gmail.com
2022-02-04 16:37:13 +01:00
David S. Miller
c531adaf88 Merge branch 'ipa-RX-replenish'
Alex Elder says:

====================
net: ipa: improve RX buffer replenishing

This series revises the algorithm used for replenishing receive
buffers on RX endpoints.  Currently there are two atomic variables
that track how many receive buffers can be sent to the hardware.
The new algorithm obviates the need for those, by just assuming we
always want to provide the hardware with buffers until it can hold
no more.

The first patch eliminates an atomic variable that's not required.
The next moves some code into the main replenish function's caller,
making one of the called function's arguments unnecessary.   The
next six refactor things a bit more, adding a new helper function
that allows us to eliminate an additional atomic variable.  And the
final two implement two more minor improvements.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-04 10:16:09 +00:00
Alex Elder
9654d8c462 net: ipa: determine replenish doorbell differently
Rather than tracking the number of receive buffer transactions that
have been submitted without a doorbell, just track the total number
of transactions that have been issued.  Then ring the doorbell when
that number modulo the replenish batch size is 0.

The effect is roughly the same, but the new count is slightly more
interesting, and this approach will someday allow the replenish
batch size to be tuned at runtime.

Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-04 10:16:08 +00:00
Alex Elder
5d6ac24fb1 net: ipa: replenish after delivering payload
Replenishing is now solely driven by whether transactions are
available for a channel, and it doesn't really matter whether
we replenish before or after we deliver received packets to the
network stack.

Replenishing before delivering the payload adds a little latency.
Eliminate that by requesting a replenish after the payload is
delivered.

Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-04 10:16:08 +00:00
Alex Elder
09b337deda net: ipa: kill replenish_backlog
We no longer use the replenish_backlog atomic variable to decide
when we've got work to do providing receive buffers to hardware.
Basically, we try to keep the hardware as full as possible, all the
time.  We keep supplying buffers until the hardware has no more
space for them.

As a result, we can get rid of the replenish_backlog field and the
atomic operations performed on it.

Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-04 10:16:08 +00:00
Alex Elder
5fc7f9ba2e net: ipa: introduce gsi_channel_trans_idle()
Create a new function that returns true if all transactions for a
channel are available for use.

Use it in ipa_endpoint_replenish_enable() to see whether to start
replenishing, and in ipa_endpoint_replenish() to determine whether
it's necessary after a failure to schedule delayed work to ensure a
future replenish attempt occurs.

Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-04 10:16:08 +00:00
Alex Elder
d0ac30e74e net: ipa: don't use replenish_backlog
Rather than determining when to stop replenishing using the
replenish backlog, just stop when we have exhausted all available
transactions.

Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-04 10:16:08 +00:00
Alex Elder
6a606b9015 net: ipa: allocate transaction in replenish loop
When replenishing, have ipa_endpoint_replenish() allocate a
transaction, and pass that to ipa_endpoint_replenish_one() to fill.
Then, if that produces no error, commit the transaction within the
replenish loop as well.  In this way we can distinguish between
transaction failures and buffer allocation/mapping failures.

Failure to allocate a transaction simply means the hardware already
has as many receive buffers as it can hold.  In that case we can
break out of the replenish loop because there's nothing more to do.

If we fail to allocate or map pages for the receive buffer, just
try again later.

Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-04 10:16:08 +00:00
Alex Elder
b9dbabc5ca net: ipa: decide on doorbell in replenish loop
Decide whether the doorbell should be signaled when committing a
replenish transaction in the main replenish loop, rather than in
ipa_endpoint_replenish_one().  This is a step to facilitate the
next patch.

Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-04 10:16:08 +00:00
Alex Elder
4b22d84195 net: ipa: increment backlog in replenish caller
Three spots call ipa_endpoint_replenish(), and just one of those
requests that the backlog be incremented after completing the
replenish operation.

Instead, have the caller increment the backlog, and get rid of the
add_one argument to ipa_endpoint_replenish().

Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-04 10:16:07 +00:00
Alex Elder
b4061c136b net: ipa: allocate transaction before pages when replenishing
A transaction failure only occurs if no more transactions are
available for an endpoint.  It's a very cheap test.

When replenishing an RX endpoint buffer, there's no point in
allocating pages if transactions are exhausted.  So don't bother
doing so unless the transaction allocation succeeds.

Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-04 10:16:07 +00:00
Alex Elder
a9bec7ae70 net: ipa: kill replenish_saved
The replenish_saved field keeps track of the number of times a new
buffer is added to the backlog when replenishing is disabled.  We
don't really use it though, so there's no need for us to track it
separately.  Whether replenishing is enabled or not, we can simply
increment the backlog.

Get rid of replenish_saved, and initialize and increment the backlog
where it would have otherwise been used.

Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-04 10:16:07 +00:00
Jakub Kicinski
b93235e689 tls: cap the output scatter list to something reasonable
TLS recvmsg() passes user pages as destination for decrypt.
The decrypt operation is repeated record by record, each
record being 16kB, max. TLS allocates an sg_table and uses
iov_iter_get_pages() to populate it with enough pages to
fit the decrypted record.

Even though we decrypt a single message at a time we size
the sg_table based on the entire length of the iovec.
This leads to unnecessarily large allocations, risking
triggering OOM conditions.

Use iov_iter_truncate() / iov_iter_reexpand() to construct
a "capped" version of iov_iter_npages(). Alternatively we
could parametrize iov_iter_npages() to take the size as
arg instead of using i->count, or do something else..

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-04 10:14:07 +00:00
Russell King (Oracle)
6ff6064605 net: dsa: realtek: convert to phylink_generic_validate()
Populate the supported interfaces and MAC capabilities for the Realtek
rtl8365 DSA switch and remove the old validate implementation to allow
DSA to use phylink_generic_validate() for this switch driver.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-04 10:12:53 +00:00
David S. Miller
eace555b4c Merge branch '40GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue
Tony Nguyen says:

====================
40GbE Intel Wired LAN Driver Updates 2022-02-03

This series contains updates to the i40e client header file and driver.

Mateusz disables HW TC offload by default.

Joe Damato removes a no longer used statistic.

Jakub Kicinski removes an unused enum from the client header file.

Jedrzej changes some admin queue commands to occur under atomic context
and adds new functions for admin queue MAC VLAN filters to avoid a
potential race that could occur due storing results in a structure that
could be overwritten by the next admin queue call.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-04 10:09:42 +00:00
Florian Westphal
c828414ac9 netfilter: nft_compat: suppress comment match
No need to have the datapath call the always-true comment match stub.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-02-04 06:30:28 +01:00
Florian Westphal
7890cbea66 netfilter: exthdr: add support for tcp option removal
This allows to replace a tcp option with nop padding to selectively disable
a particular tcp option.

Optstrip mode is chosen when userspace passes the exthdr expression with
neither a source nor a destination register attribute.

This is identical to xtables TCPOPTSTRIP extension.
The only difference is that TCPOPTSTRIP allows to pass in a bitmap
of options to remove rather than a single number.

Unlike TCPOPTSTRIP this expression can be used multiple times
in the same rule to get the same effect.

We could add a new nested attribute later on in case there is a
use case for single-expression-multi-remove.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-02-04 06:30:28 +01:00
Florian Westphal
20ff320246 netfilter: conntrack: pptp: use single option structure
Instead of exposing the four hooks individually use a sinle hook ops
structure.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-02-04 06:30:28 +01:00
Florian Westphal
1015c3de23 netfilter: conntrack: remove extension register api
These no longer register/unregister a meaningful structure so remove it.

Cc: Paul Blakey <paulb@nvidia.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-02-04 06:30:28 +01:00
Florian Westphal
1bc91a5ddf netfilter: conntrack: handle ->destroy hook via nat_ops instead
The nat module already exposes a few functions to the conntrack core.
Move the nat extension destroy hook to it.

After this, no conntrack extension needs a destroy hook.
'struct nf_ct_ext_type' and the register/unregister api can be removed
in a followup patch.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-02-04 06:30:28 +01:00
Florian Westphal
5f31edc067 netfilter: conntrack: move extension sizes into core
No need to specify this in the registration modules, we already
collect all sizes for build-time checks on the maximum combined size.

After this change, all extensions except nat have no meaningful content
in their nf_ct_ext_type struct definition.

Next patch handles nat, this will then allow to remove the dynamic
register api completely.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-02-04 06:30:28 +01:00
Florian Westphal
bb62a765b1 netfilter: conntrack: make all extensions 8-byte alignned
All extensions except one need 8 byte alignment, so just make that the
default.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-02-04 06:30:28 +01:00
Nicolas Dichtel
8b54136472 netfilter: nfqueue: enable to get skb->priority
This info could be useful to improve traffic analysis.

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-02-04 06:30:27 +01:00
Kevin Mitchell
5bed9f3f63 netfilter: conntrack: mark UDP zero checksum as CHECKSUM_UNNECESSARY
The udp_error function verifies the checksum of incoming UDP packets if
one is set. This has the desirable side effect of setting skb->ip_summed
to CHECKSUM_COMPLETE, signalling that this verification need not be
repeated further up the stack.

Conversely, when the UDP checksum is empty, which is perfectly legal (at least
inside IPv4), udp_error previously left no trace that the checksum had been
deemed acceptable.

This was a problem in particular for nf_reject_ipv4, which verifies the
checksum in nf_send_unreach() before sending ICMP_DEST_UNREACH. It makes
no accommodation for zero UDP checksums unless they are already marked
as CHECKSUM_UNNECESSARY.

This commit ensures packets with empty UDP checksum are marked as
CHECKSUM_UNNECESSARY, which is explicitly recommended in skbuff.h.

Signed-off-by: Kevin Mitchell <kevmitch@arista.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-02-04 06:30:27 +01:00
Horatiu Vultur
41414c9bdb net: lan966x: use .mac_select_pcs() interface
Convert lan966x to use the mac_select_interface instead of
phylink_set_pcs.

Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com>
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://lore.kernel.org/r/20220202114949.833075-1-horatiu.vultur@microchip.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-02-03 19:11:21 -08:00
Guillaume Nault
95eb6ef82b selftests: rtnetlink: Use more sensible tos values
Using tos 0x1 with 'ip route get <IPv4 address> ...' doesn't test much
of the tos option handling: 0x1 just sets an ECN bit, which is cleared
by inet_rtm_getroute() before doing the fib lookup. Let's use 0x10
instead, which is actually taken into account in the route lookup (and
is less surprising for the reader).

For consistency, use 0x10 for the IPv6 route lookup too (IPv6 currently
doesn't clear ECN bits, but might do so in the future).

Signed-off-by: Guillaume Nault <gnault@redhat.com>
Link: https://lore.kernel.org/r/d61119e68d01ba7ef3ba50c1345a5123a11de123.1643815297.git.gnault@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-02-03 19:11:21 -08:00
Guillaume Nault
bafe517af2 selftests: fib offload: use sensible tos values
Although both iproute2 and the kernel accept 1 and 2 as tos values for
new routes, those are invalid. These values only set ECN bits, which
are ignored during IPv4 fib lookups. Therefore, no packet can actually
match such routes. This selftest therefore only succeeds because it
doesn't verify that the new routes do actually work in practice (it
just checks if the routes are offloaded or not).

It makes more sense to use tos values that don't conflict with ECN.
This way, the selftest won't be affected if we later decide to warn or
even reject invalid tos configurations for new routes.

Signed-off-by: Guillaume Nault <gnault@redhat.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://lore.kernel.org/r/5e43b343720360a1c0e4f5947d9e917b26f30fbf.1643826556.git.gnault@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-02-03 19:11:21 -08:00
Eric Dumazet
25ee1660a5 net: minor __dev_alloc_name() optimization
__dev_alloc_name() allocates a private zeroed page,
then sets bits in it while iterating through net devices.

It can use __set_bit() to avoid unnecessary locked operations.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20220203064609.3242863-1-eric.dumazet@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-02-03 19:11:14 -08:00
Jakub Kicinski
c59400a68c Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
No conflicts.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-02-03 17:36:16 -08:00
Kees Cook
dcb85f85fa gcc-plugins/stackleak: Use noinstr in favor of notrace
While the stackleak plugin was already using notrace, objtool is now a
bit more picky.  Update the notrace uses to noinstr.  Silences the
following objtool warnings when building with:

CONFIG_DEBUG_ENTRY=y
CONFIG_STACK_VALIDATION=y
CONFIG_VMLINUX_VALIDATION=y
CONFIG_GCC_PLUGIN_STACKLEAK=y

  vmlinux.o: warning: objtool: do_syscall_64()+0x9: call to stackleak_track_stack() leaves .noinstr.text section
  vmlinux.o: warning: objtool: do_int80_syscall_32()+0x9: call to stackleak_track_stack() leaves .noinstr.text section
  vmlinux.o: warning: objtool: exc_general_protection()+0x22: call to stackleak_track_stack() leaves .noinstr.text section
  vmlinux.o: warning: objtool: fixup_bad_iret()+0x20: call to stackleak_track_stack() leaves .noinstr.text section
  vmlinux.o: warning: objtool: do_machine_check()+0x27: call to stackleak_track_stack() leaves .noinstr.text section
  vmlinux.o: warning: objtool: .text+0x5346e: call to stackleak_erase() leaves .noinstr.text section
  vmlinux.o: warning: objtool: .entry.text+0x143: call to stackleak_erase() leaves .noinstr.text section
  vmlinux.o: warning: objtool: .entry.text+0x10eb: call to stackleak_erase() leaves .noinstr.text section
  vmlinux.o: warning: objtool: .entry.text+0x17f9: call to stackleak_erase() leaves .noinstr.text section

Note that the plugin's addition of calls to stackleak_track_stack() from
noinstr functions is expected to be safe, as it isn't runtime
instrumentation and is self-contained.

Cc: Alexander Popov <alex.popov@linux.com>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-02-03 17:02:21 -08:00
Linus Torvalds
eb2eb5161c Networking fixes for 5.17-rc3, including fixes from bpf, netfilter,
and ieee802154.
 
 Current release - regressions:
 
  - Partially revert "net/smc: Add netlink net namespace support",
    fix uABI breakage
 
  - netfilter:
      - nft_ct: fix use after free when attaching zone template
      - nft_byteorder: track register operations
 
 Previous releases - regressions:
 
  - ipheth: fix EOVERFLOW in ipheth_rcvbulk_callback
 
  - phy: qca8081: fix speeds lower than 2.5Gb/s
 
  - sched: fix use-after-free in tc_new_tfilter()
 
 Previous releases - always broken:
 
  - tcp: fix mem under-charging with zerocopy sendmsg()
 
  - tcp: add missing tcp_skb_can_collapse() test in tcp_shift_skb_data()
 
  - neigh: do not trigger immediate probes on NUD_FAILED from
    neigh_managed_work, avoid a deadlock
 
  - bpf: use VM_MAP instead of VM_ALLOC for ringbuf, avoid KASAN
    false-positives
 
  - netfilter: nft_reject_bridge: fix for missing reply from prerouting
 
  - smc: forward wakeup to smc socket waitqueue after fallback
 
  - ieee802154:
      - return meaningful error codes from the netlink helpers
      - mcr20a: fix lifs/sifs periods
      - at86rf230, ca8210: stop leaking skbs on error paths
 
  - macsec: add missing un-offload call for NETDEV_UNREGISTER of parent
 
  - ax25: add refcount in ax25_dev to avoid UAF bugs
 
  - eth: mlx5e:
      - fix SFP module EEPROM query
      - fix broken SKB allocation in HW-GRO
      - IPsec offload: fix tunnel mode crypto for non-TCP/UDP flows
 
  - eth: amd-xgbe:
      - fix skb data length underflow
      - ensure reset of the tx_timer_active flag, avoid Tx timeouts
 
  - eth: stmmac: fix runtime pm use in stmmac_dvr_remove()
 
  - eth: e1000e: handshake with CSME starts from Alder Lake platforms
 
 Signed-off-by: Jakub Kicinski <kuba@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmH8X9UACgkQMUZtbf5S
 IrsxuhAAlAvFHGL6y5Y2gAmhKvVUvCYjiIJBcvk7R66CwYVRxofvlhmxi6GM/Czs
 9SrVSaN4RXu3p3d7UtAl1gAQwHqzLIHH3m2g5dSKVvHZWQgkm/+n74x0aZQ9Fll7
 mWs9uu5fWsQr/qZBnnjoQTvUxRUNVd4trBy7nXGzkNqJL5j0+2TT4BhH4qalhE28
 iPc9YFCyKPdjoWFksteZqD3hAQbXxK/xRRr6xuvFHENlZdEHM6ARftHnJthTG/fY
 32rdn9YUkQ9lNtOBJNMN9yP2z1B7TcxASBqjjk55I7XtT1QAI9/PskszavHC0hOk
 BCSMX779bLNW4+G0wiSKVB4tq4tvswtawq8Hxa6zdU4TKIzfQ84ZL/Nf66GtH+4W
 C0mbZohmyJV9hQFkNT0ZLeihljd7i4BkDttlbK3uz2IL9tHeX3uSo5V7AgS/Xaf6
 frXgbGgjQTaR6IL9AUhfN3GTCx60mzpH/aRpFho8A5xAl3EtHWCJcRhbY/CEhQBR
 zyCndcLcG5mUzbhx/TxlKrrpRCLxqCUG/Tsb2wCh5jMxO1zonW9Hhv4P1ie6EFuI
 h+XiJT2WWObS/KTze9S86WOR0zcqrtRqaOGJlNB+/+K8ClZU8UsDTFXLQ0dqpVZF
 Mvp7VchBzyFFJrrvO8WkkJgLTKdaPJmM9wuWUZb4J6d2MWlmDkE=
 =qKvf
 -----END PGP SIGNATURE-----

Merge tag 'net-5.17-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Jakub Kicinski:
 "Including fixes from bpf, netfilter, and ieee802154.

  Current release - regressions:

   - Partially revert "net/smc: Add netlink net namespace support", fix
     uABI breakage

   - netfilter:
      - nft_ct: fix use after free when attaching zone template
      - nft_byteorder: track register operations

  Previous releases - regressions:

   - ipheth: fix EOVERFLOW in ipheth_rcvbulk_callback

   - phy: qca8081: fix speeds lower than 2.5Gb/s

   - sched: fix use-after-free in tc_new_tfilter()

  Previous releases - always broken:

   - tcp: fix mem under-charging with zerocopy sendmsg()

   - tcp: add missing tcp_skb_can_collapse() test in
     tcp_shift_skb_data()

   - neigh: do not trigger immediate probes on NUD_FAILED from
     neigh_managed_work, avoid a deadlock

   - bpf: use VM_MAP instead of VM_ALLOC for ringbuf, avoid KASAN
     false-positives

   - netfilter: nft_reject_bridge: fix for missing reply from prerouting

   - smc: forward wakeup to smc socket waitqueue after fallback

   - ieee802154:
      - return meaningful error codes from the netlink helpers
      - mcr20a: fix lifs/sifs periods
      - at86rf230, ca8210: stop leaking skbs on error paths

   - macsec: add missing un-offload call for NETDEV_UNREGISTER of parent

   - ax25: add refcount in ax25_dev to avoid UAF bugs

   - eth: mlx5e:
      - fix SFP module EEPROM query
      - fix broken SKB allocation in HW-GRO
      - IPsec offload: fix tunnel mode crypto for non-TCP/UDP flows

   - eth: amd-xgbe:
      - fix skb data length underflow
      - ensure reset of the tx_timer_active flag, avoid Tx timeouts

   - eth: stmmac: fix runtime pm use in stmmac_dvr_remove()

   - eth: e1000e: handshake with CSME starts from Alder Lake platforms"

* tag 'net-5.17-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (69 commits)
  ax25: fix reference count leaks of ax25_dev
  net: stmmac: ensure PTP time register reads are consistent
  net: ipa: request IPA register values be retained
  dt-bindings: net: qcom,ipa: add optional qcom,qmp property
  tools/resolve_btfids: Do not print any commands when building silently
  bpf: Use VM_MAP instead of VM_ALLOC for ringbuf
  net, neigh: Do not trigger immediate probes on NUD_FAILED from neigh_managed_work
  tcp: add missing tcp_skb_can_collapse() test in tcp_shift_skb_data()
  net: sparx5: do not refer to skb after passing it on
  Partially revert "net/smc: Add netlink net namespace support"
  net/mlx5e: Avoid field-overflowing memcpy()
  net/mlx5e: Use struct_group() for memcpy() region
  net/mlx5e: Avoid implicit modify hdr for decap drop rule
  net/mlx5e: IPsec: Fix tunnel mode crypto offload for non TCP/UDP traffic
  net/mlx5e: IPsec: Fix crypto offload for non TCP/UDP encapsulated traffic
  net/mlx5e: Don't treat small ceil values as unlimited in HTB offload
  net/mlx5: E-Switch, Fix uninitialized variable modact
  net/mlx5e: Fix handling of wrong devices during bond netevent
  net/mlx5e: Fix broken SKB allocation in HW-GRO
  net/mlx5e: Fix wrong calculation of header index in HW_GRO
  ...
2022-02-03 16:54:18 -08:00
Linus Torvalds
551007a8f1 selinux/stable-5.17 PR 20220203
-----BEGIN PGP SIGNATURE-----
 
 iQJIBAABCAAyFiEES0KozwfymdVUl37v6iDy2pc3iXMFAmH8VkAUHHBhdWxAcGF1
 bC1tb29yZS5jb20ACgkQ6iDy2pc3iXNyLQ/9GAsvvB7PYeUpj0CGLMaAT9Hys5l+
 WPjP+NU+HF+r+AUsSCJwKkK4yKnpEDK9nidOdkwiYjAO/83yl7kkBPGRgisQep1A
 tEbuJ5vqZnR59jLxNKCmQE0gY+gByjk3jZIVFLSWwG/ho7s1LQyNoYpm7rbFIgAz
 6qe7IR1nsATzxRhDoJI3RIPlQjzhM1qEX9PBEtwW+LLieShtvMc+ijdiUw7bqNl9
 RTM6hRf4fTX4jLHtxfZYZ99bHEjIseksFbSAnjKxxkt0W5EFha73VX8hjwnG24J/
 XZQAhsyvpQmcZKJGZPWUSa+UFcytoauMnNdgJOQw7TcMT4Y2mMuvcoZ/KkFtDjdr
 30qhp46/gml2yqnByXRfzshGQm9E4ZoqSCn+lFWAfjlrhcqdgZFKpILpwMixbdin
 NgTA/pbwXovrlho8UflB0sbDMrbyV3qNGZXD/4hRg66Vm3F7ipgqPBbM89qoDniG
 CXiQnmRQ+rwcftyeE7me7+kD6djYTWOfEY5HRNiCf9NhnQG8GP7YzZ4KACxJ2PwQ
 R9+Egc9nAl4UG6PrEjZeud81rLzc+ws2SJLokxOcIGnid8lZidf83HfWekAmRloA
 J5+tmpx5q26ug/j2uXV/rp36xaQWhjJrrnhEKamIYYAVioXa9srRhtz3qRI8r/13
 mrZ5hu4le8aC/5s=
 =AyIM
 -----END PGP SIGNATURE-----

Merge tag 'selinux-pr-20220203' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux

Pull selinux fix from Paul Moore:
 "One small SELinux patch to ensure that a policy structure field is
  properly reset after freeing so that we don't inadvertently do a
  double-free on certain error conditions"

* tag 'selinux-pr-20220203' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux:
  selinux: fix double free of cond_list on error paths
2022-02-03 16:44:12 -08:00
Linus Torvalds
25b20ae815 linux-kselftest-fixes-5.17-rc3
This Kselftest fixes update for Linux 5.17-rc3 consists of important
 fixes to several tests and documentation clarification on running
 mainline kselftest on stable releases. A few notable fixes:
 
 - fix kselftest run hang due to child processes that haven't been
   terminated. Fix signals all child processes
 - fix false pass/fail results from vdso_test_abi, openat2, mincore
 - build failures when using -j (multiple jobs) option
 - exec test build failure due to incorrect build rule for a run-time
   created "pipe"
 - zram test fixes related to interaction with zram-generator to
   make sure zram test to coordinate deleted with zram-generator
 - zram test compression ratio calculation fix and skipping
   max_comp_streams.
 - increasing rtc test timeout
 - cpufreq test to write test results to stdout which will necessary on
   automated test systems
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEPZKym/RZuOCGeA/kCwJExA0NQxwFAmH8OcAACgkQCwJExA0N
 QxzTkg//YF+iSc1aao8nmUOvsK6oc1RBwIj3hkLUHjP3H1qFkm9OYxzTcLYGcnyo
 JahxNjeVoDuVESYx/AyLZ568aCCxRXJEDzNm5eIVNfBrGtVTfFPM19HwC/R3I1Ew
 KJUruxRx++8AvI1RYMEzsDumKpLVe3bor7sj3CcO1E9/qkOoUAukxt7FVmSNMZlW
 qYCDgc3yBa/XrImHCbJdZc4CUhbmh+l05sZgG3V3fxQSlgfIClY0Qg8W7Ucu+r4S
 6W5nwoEJIG32Zl2avaZ2VTF4T+CTQB70g/n4OBEX8TAxuIIi9W12N6zMZ76q8qbp
 iRs7UqgUSqPWdz/3ZHiQ5gy0WsCJ/W1379TtiG0doEeU2vwZ6fR8NMn+2FrEH18W
 xHBPOWeN+PkVAFjUeoyt1c5OGNprK6EEE2kQ6CLoBTwlKWLDQ87ZNPf84uAsez1x
 G0m7AX/T5adeTLoZfEXNXVY4OROs0nxbAkGC5ghVtKQu1giMcUKj+KHUowgj5OIJ
 Zaj+uSPiN3hnwj5L2fk+orOEC+3bZxVoQzqSB2Bs6stQOQFZLP18xHIjQIDZoQuC
 O512ZY+dwMzSyTi2KoQmb/M0Ft3gSfhRVXc7gfEFfOvC3ZbqRGFfeES1ZILup3rZ
 izMTJBDOe+BGSm/GCFPHxu36YfdPqiAyHBTVSYy5EhLGpxafZvI=
 =rH2m
 -----END PGP SIGNATURE-----

Merge tag 'linux-kselftest-fixes-5.17-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest

Pull Kselftest fixes from Shuah Khan:
 "Important fixes to several tests and documentation clarification on
  running mainline kselftest on stable releases. A few notable fixes:

   - fix kselftest run hang due to child processes that haven't been
     terminated. Fix signals all child processes

   - fix false pass/fail results from vdso_test_abi, openat2, mincore

   - build failures when using -j (multiple jobs) option

   - exec test build failure due to incorrect build rule for a run-time
     created "pipe"

   - zram test fixes related to interaction with zram-generator to make
     sure zram test to coordinate deleted with zram-generator

   - zram test compression ratio calculation fix and skipping
     max_comp_streams.

   - increasing rtc test timeout

   - cpufreq test to write test results to stdout which will necessary
     on automated test systems"

* tag 'linux-kselftest-fixes-5.17-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest:
  kselftest: Fix vdso_test_abi return status
  selftests: skip mincore.check_file_mmap when fs lacks needed support
  selftests: openat2: Skip testcases that fail with EOPNOTSUPP
  selftests: openat2: Add missing dependency in Makefile
  selftests: openat2: Print also errno in failure messages
  selftests: futex: Use variable MAKE instead of make
  selftests/exec: Remove pipe from TEST_GEN_FILES
  selftests/zram: Adapt the situation that /dev/zram0 is being used
  selftests/zram01.sh: Fix compression ratio calculation
  selftests/zram: Skip max_comp_streams interface on newer kernel
  docs/kselftest: clarify running mainline tests on stables
  kselftest: signal all child processes
  selftests: cpufreq: Write test output to stdout as well
  selftests: rtc: Increase test timeout so that all tests run
2022-02-03 16:36:26 -08:00
Andrii Nakryiko
227a0713b3 libbpf: Deprecate forgotten btf__get_map_kv_tids()
btf__get_map_kv_tids() is in the same group of APIs as
btf_ext__reloc_func_info()/btf_ext__reloc_line_info() which were only
used by BCC. It was missed to be marked as deprecated in [0]. Fixing
that to complete [1].

  [0] https://patchwork.kernel.org/project/netdevbpf/patch/20220201014610.3522985-1-davemarchevsky@fb.com/
  [1] Closes: https://github.com/libbpf/libbpf/issues/277

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20220203225017.1795946-1-andrii@kernel.org
2022-02-04 01:07:16 +01:00
Dave Ertman
b794eecb2a ice: add support for DSCP QoS for IDC
The ice driver provides QoS information to auxiliary drivers
through the exported function ice_get_qos_params.  This function
doesn't currently support L3 DSCP QoS.

Add the necessary defines, structure elements and code to support
DSCP QoS through the IIDC functions.

Signed-off-by: Dave Ertman <david.m.ertman@intel.com>
Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2022-02-03 15:22:03 -08:00
Duoming Zhou
87563a043c ax25: fix reference count leaks of ax25_dev
The previous commit d01ffb9eee ("ax25: add refcount in ax25_dev
to avoid UAF bugs") introduces refcount into ax25_dev, but there
are reference leak paths in ax25_ctl_ioctl(), ax25_fwd_ioctl(),
ax25_rt_add(), ax25_rt_del() and ax25_rt_opt().

This patch uses ax25_dev_put() and adjusts the position of
ax25_addr_ax25dev() to fix reference cout leaks of ax25_dev.

Fixes: d01ffb9eee ("ax25: add refcount in ax25_dev to avoid UAF bugs")
Signed-off-by: Duoming Zhou <duoming@zju.edu.cn>
Reviewed-by: Dan Carpenter <dan.carpenter@oracle.com>
Link: https://lore.kernel.org/r/20220203150811.42256-1-duoming@zju.edu.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-02-03 14:20:36 -08:00