Commit Graph

1060520 Commits

Author SHA1 Message Date
Florian Westphal
3fce16493d netfilter: core: move ip_ct_attach indirection to struct nf_ct_hook
ip_ct_attach predates struct nf_ct_hook, we can place it there and
remove the exported symbol.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-01-09 23:30:13 +01:00
Florian Westphal
7197743776 netfilter: conntrack: convert to refcount_t api
Convert nf_conn reference counting from atomic_t to refcount_t based api.
refcount_t api provides more runtime sanity checks and will warn on
certain constructs, e.g. refcount_inc() on a zero reference count, which
usually indicates use-after-free.

For this reason template allocation is changed to init the refcount to
1, the subsequenct add operations are removed.

Likewise, init_conntrack() is changed to set the initial refcount to 1
instead refcount_inc().

This is safe because the new entry is not (yet) visible to other cpus.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-01-09 23:30:13 +01:00
Jiapeng Chong
613a0c67d1 netfilter: conntrack: Use max() instead of doing it manually
Fix following coccicheck warning:

./include/net/netfilter/nf_conntrack.h:282:16-17: WARNING opportunity
for max().

Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-01-09 23:30:13 +01:00
Colin Ian King
2b71e2c7b5 netfilter: nft_set_pipapo_avx2: remove redundant pointer lt
The pointer lt is being assigned a value and then later
updated but that value is never read. The pointer is
redundant and can be removed.

Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-12-24 16:58:17 +01:00
Florian Westphal
c42ba4290b netfilter: flowtable: remove ipv4/ipv6 modules
Just place the structs and registration in the inet module.
nf_flow_table_ipv6, nf_flow_table_ipv4 and nf_flow_table_inet share
same module dependencies: nf_flow_table, nf_tables.

before:
   text	   data	    bss	    dec	    hex	filename
   2278	   1480	      0	   3758	    eae	nf_flow_table_inet.ko
   1159	   1352	      0	   2511	    9cf	nf_flow_table_ipv6.ko
   1154	   1352	      0	   2506	    9ca	nf_flow_table_ipv4.ko

after:
   2369	   1672	      0	   4041	    fc9	nf_flow_table_inet.ko

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-12-23 01:07:44 +01:00
Florian Westphal
878aed8db3 netfilter: nat: force port remap to prevent shadowing well-known ports
If destination port is above 32k and source port below 16k
assume this might cause 'port shadowing' where a 'new' inbound
connection matches an existing one, e.g.

inbound X:41234 -> Y:53 matches existing conntrack entry
        Z:53 -> X:4123, where Z got natted to X.

In this case, new packet is natted to Z:53 which is likely
unwanted.

We avoid the rewrite for connections that originate from local host:
port-shadowing is only possible with forwarded connections.

Also adjust test case.

v3: no need to call tuple_force_port_remap if already in random mode (Phil)

Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Phil Sutter <phil@nwl.cc>
Acked-by: Eric Garver <eric@garver.life>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-12-23 01:07:44 +01:00
Florian Westphal
4a6fbdd801 netfilter: conntrack: tag conntracks picked up in local out hook
This allows to identify flows that originate from local machine
in a followup patch.

It would be possible to make this a ->status bit instead.
For now I did not do that yet because I don't have a use-case for
exposing this info to userspace.

If one comes up the toggle can be replaced with a status bit.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-12-23 01:07:44 +01:00
Pablo Neira Ayuso
023223dfbf netfilter: nf_tables: make counter support built-in
Make counter support built-in to allow for direct call in case of
CONFIG_RETPOLINE.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-12-23 01:07:35 +01:00
Pablo Neira Ayuso
690d541739 netfilter: nf_tables: replace WARN_ON by WARN_ON_ONCE for unknown verdicts
Bug might trigger warning for each packet, call WARN_ON_ONCE instead.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-12-23 01:00:56 +01:00
Pablo Neira Ayuso
4765473fef netfilter: nf_tables: consolidate rule verdict trace call
Add function to consolidate verdict tracing.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-12-23 01:00:56 +01:00
Pablo Neira Ayuso
8801d791b4 netfilter: nft_payload: WARN_ON_ONCE instead of BUG
BUG() is too harsh for unknown payload base, use WARN_ON_ONCE() instead.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-12-23 01:00:56 +01:00
Pablo Neira Ayuso
0d1873a522 netfilter: nf_tables: remove rcu read-size lock
Chain stats are updated from the Netfilter hook path which already run
under rcu read-size lock section.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-12-23 01:00:56 +01:00
Eric Dumazet
fc0d026a2f netfilter: nf_nat_masquerade: add netns refcount tracker to masq_dev_work
If compiled with CONFIG_NET_NS_REFCNT_TRACKER=y,
using put_net_track() in iterate_cleanup_work()
and netns_tracker_alloc() in nf_nat_masq_schedule()
might help us finding netns refcount imbalances.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-12-16 12:49:34 +01:00
Eric Dumazet
a9382d9389 netfilter: nfnetlink: add netns refcount tracker to struct nfulnl_instance
If compiled with CONFIG_NET_NS_REFCNT_TRACKER=y,
using put_net_track() in nfulnl_instance_free_rcu()
and get_net_track() in instance_create()
might help us finding netns refcount imbalances.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-12-16 12:49:34 +01:00
Volodymyr Mytnyk
604ba23090 net: prestera: flower template support
Add user template explicit support. At this moment, max
TCAM rule size is utilized for all rules, doesn't matter
which and how much flower matches are provided by user. It
means that some of TCAM space is wasted, which impacts
the number of filters that can be offloaded.

Introducing the template, allows to have more HW offloaded
filters by specifying the template explicitly.

Example:
  tc qd add dev PORT clsact
  tc chain add dev PORT ingress protocol ip \
    flower dst_ip 0.0.0.0/16
  tc filter add dev PORT ingress protocol ip \
    flower skip_sw dst_ip 1.2.3.4/16 action drop

NOTE: chain 0 is the default chain id for "tc chain" & "tc filter"
      command, so it is omitted in the example above.

This patch adds only template support for default chain 0 suppoerted
by prestera driver at this moment. Chains are not supported yet,
and will be added later.

Signed-off-by: Volodymyr Mytnyk <vmytnyk@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-12-16 10:52:53 +00:00
Luiz Angelo Daros de Luca
a5dba0f207 net: dsa: rtl8365mb: add GMII as user port mode
Recent net-next fails to initialize ports with:

 realtek-smi switch: phy mode gmii is unsupported on port 0
 realtek-smi switch lan5 (uninitialized): validation of gmii with
 support 0000000,00000000,000062ef and advertisement
 0000000,00000000,000062ef failed: -22
 realtek-smi switch lan5 (uninitialized): failed to connect to PHY:
 -EINVAL
 realtek-smi switch lan5 (uninitialized): error -22 setting up PHY
 for tree 1, switch 0, port 0

Current net branch(3dd7d40b43) is not
affected.

I also noticed the same issue before with older versions but using
a MDIO interface driver, not realtek-smi.

Tested-by: Arınç ÜNAL <arinc.unal@arinc9.com>
Signed-off-by: Luiz Angelo Daros de Luca <luizluca@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-12-16 10:51:11 +00:00
David S. Miller
e85fbf5355 Merge branch 'gve-improvements'
Jeroen de Borst says:

====================
gve improvements

This patchset consists of unrelated changes:

A bug fix for an issue that disabled jumbo-frame support, a few code
improvements and minor funcitonal changes and 3 new features:
  Supporting tx|rx-coalesce-usec for DQO
  Suspend/resume/shutdown
  Optional metadata descriptors
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2021-12-16 10:41:54 +00:00
Tao Liu
6081ac2013 gve: Add tx|rx-coalesce-usec for DQO
Adding ethtool support for changing rx-coalesce-usec and tx-coalesce-usec
when using the DQO queue format.

Signed-off-by: Tao Liu <xliutaox@google.com>
Signed-off-by: Jeroen de Borst <jeroendb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-12-16 10:41:54 +00:00
Jordan Kim
2c9198356d gve: Add consumed counts to ethtool stats
Being able to see how many descriptors are in-use is helpful
when diagnosing certain issues.

Signed-off-by: Jeroen de Borst <jeroendb@google.com>
Signed-off-by: Jordan Kim <jrkim@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-12-16 10:41:54 +00:00
Catherine Sullivan
974365e518 gve: Implement suspend/resume/shutdown
Add support for suspend, resume and shutdown.

Signed-off-by: Catherine Sullivan <csully@google.com>
Signed-off-by: David Awogbemila <awogbemila@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-12-16 10:41:54 +00:00
Willem de Bruijn
497dbb2b97 gve: Add optional metadata descriptor type GVE_TXD_MTD
Allow drivers to pass metadata along with packet data to the device.
Introduce a new metadata descriptor type

* GVE_TXD_MTD

This descriptor is optional. If present it immediate follows the
packet descriptor and precedes the segment descriptor.

This descriptor may be repeated. Multiple metadata descriptors may
follow. There are no immediate uses for this, this is for future
proofing. At present devices allow only 1 MTD descriptor.

The lower four bits of the type_flags field encode GVE_TXD_MTD.
The upper four bits of the type_flags field encodes a *sub*type.

Introduce one such metadata descriptor subtype

* GVE_MTD_SUBTYPE_PATH

This shares path information with the device for network failure
discovery and robust response:

Linux derives ipv6 flowlabel and ECMP multipath from sk->sk_txhash,
and updates this field on error with sk_rethink_txhash. Allow the host
stack to do the same. Pass the tx_hash value if set. Also communicate
whether the path hash is set, or more exactly, what its type is. Define
two common types

  GVE_MTD_PATH_HASH_NONE
  GVE_MTD_PATH_HASH_L4

Concrete examples of error conditions that are resolved are
mentioned in the commits that add sk_rethink_txhash calls. Such as
commit 7788174e87 ("tcp: change IPv6 flow-label upon receiving
spurious retransmission").

Experimental results mirror what the theory suggests: where IPv6
FlowLabel is included in path selection (e.g., LAG/ECMP), flowlabel
rotation on TCP timeout avoids the vast majority of TCP disconnects
that would otherwise have occurred during link failures in long-haul
backbones, when an alternative path is available.

Rotation can be applied to various bad connection signals, such as
timeouts and spurious retransmissions. In aggregate, such flow level
signals can help locate network issues. Define initial common states:

  GVE_MTD_PATH_STATE_DEFAULT
  GVE_MTD_PATH_STATE_TIMEOUT
  GVE_MTD_PATH_STATE_CONGESTION
  GVE_MTD_PATH_STATE_RETRANSMIT

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David Awogbemila <awogbemila@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-12-16 10:41:54 +00:00
Catherine Sullivan
5fd07df47a gve: remove memory barrier around seqno
No longer needed after we introduced the barrier in gve_napi_poll.

Signed-off-by: Catherine Sullivan <csully@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-12-16 10:41:54 +00:00
Catherine Sullivan
13e7939c95 gve: Update gve_free_queue_page_list signature
The id field should be a u32 not a signed int.

Signed-off-by: Catherine Sullivan <csully@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-12-16 10:41:53 +00:00
Catherine Sullivan
d30baacc04 gve: Move the irq db indexes out of the ntfy block struct
Giving the device access to other kernel structs is not ideal.
Move the indexes into their own array and just keep pointers to
them in the ntfy block struct.

Signed-off-by: Catherine Sullivan <csully@google.com>
Signed-off-by: David Awogbemila <awogbemila@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-12-16 10:41:53 +00:00
Jeroen de Borst
a10834a36c gve: Correct order of processing device options
The legacy raw addressing device option was processed before the
new RDA queue format option.  This caused the supported features mask,
which is provided only on the RDA queue format option, not to be set.

This disabled jumbo-frame support when using raw adressing.

Fixes: 255489f5b3 ("gve: Add a jumbo-frame device option")
Signed-off-by: Jeroen de Borst <jeroendb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-12-16 10:41:53 +00:00
David S. Miller
75df1a2484 Merge branch 'phylink-pcs-validation'
Russell King says:

====================
net: phylink: add PCS validation

This series allows phylink to include the PCS in its validation step.
There are two reasons to make this change:

1. Some of the network drivers that are making use of the split PCS
   support are already manually calling into their PCS drivers to
   perform validation. E.g. stmmac with xpcs.

2. Logically, some network drivers such as mvneta and mvpp2, the
   restriction we impose in the validate() callback is a property of
   the "PCS" block that we provide rather than the MAC.

This series:

1. Gives phylink a mechanism to query the MAC driver which PCS is
   wishes to use for the PHY interface mode. This is necessary to allow
   the PCS to be involved in the validation step without making changes
   to the configuration.

2. Provide a pcs_validate() method that PCS can implement. This follows
   a similar model to the MAC's validate() callback, but with some minor
   differences due to observations from the various implementations.
   E.g. returning an error code for not-supported and the way the
   advertising bitmap is masked.

3. Convert mvpp2 and mvneta to this as examples of its use. Further
   Conversions are in the pipeline, including for stmmac+xpcs, as well
   as some DSA drivers. Note that DSA conversion to this is conditional
   upon all DSA drivers populating their supported_interfaces bitmap,
   since this is required before mac_select_pcs() can be used.

Existing drivers that set a PCS in mac_prepare() or mac_config(), or
shortly after phylink_create() will continue to work. However, it should
be noted that mac_select_pcs() will be called during phylink_create(),
and thus any PCS returned by mac_select_pcs() must be available by this
time - or we drop the check in phylink_create().

v2: fix kerneldoc typo in patch 1.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2021-12-16 10:37:14 +00:00
Russell King (Oracle)
d8c3669397 net: mvneta: convert to pcs_validate() and phylink_generic_validate()
Convert mvneta to validate the autoneg state for 1000base-X in the
pcs_validate() operation, rather than the MAC validate() operation.
This allows us to switch the MAC validate() to use
phylink_generic_validate().

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-12-16 10:37:14 +00:00
Russell King
c2e7d2df4a net: mvneta: convert to phylink pcs operations
An initial stab at converting mvneta to PCS operations.  There's a few
FIXMEs to be solved.

Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-12-16 10:37:14 +00:00
Russell King
5a7d895369 net: mvneta: convert to use mac_prepare()/mac_finish()
Convert mvneta to use the mac_prepare() and mac_finish() methods in
preparation to converting mvneta to split-PCS support.

Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-12-16 10:37:14 +00:00
Russell King (Oracle)
85e3e0ebdb net: mvpp2: convert to pcs_validate() and phylink_generic_validate()
Convert mvpp2 to validate the autoneg state for 1000base-X in the
pcs_validate() operation, rather than the MAC validate() operation.
This allows us to switch the MAC validate() to use
phylink_generic_validate().

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-12-16 10:37:13 +00:00
Russell King (Oracle)
cff0563223 net: mvpp2: use .mac_select_pcs() interface
Use the mac_select_pcs() method to choose between the GMAC and XLG
PCS implementations.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-12-16 10:37:13 +00:00
Russell King (Oracle)
0d22d4b626 net: phylink: add pcs_validate() method
Add a hook for PCS to validate the link parameters. This avoids MAC
drivers having to have knowledge of their PCS in their validate()
method, thereby allowing several MAC drivers to be simplfied.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-12-16 10:37:13 +00:00
Russell King (Oracle)
d1e86325af net: phylink: add mac_select_pcs() method to phylink_mac_ops
mac_select_pcs() allows us to have an explicit point to query which
PCS the MAC wishes to use for a particular PHY interface mode, thereby
allowing us to add support to validate the link settings with the PCS.

Phylink will also use this to select the PCS to be used during a major
configuration event without the MAC driver needing to call
phylink_set_pcs().

Note that if mac_select_pcs() is present, the supported_interfaces
bitmap must be filled in; this avoids mac_select_pcs() being called
with PHY_INTERFACE_MODE_NA when we want to get support for all
interface types. Phylink will return an error in phylink_create()
unless this condition is satisfied.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-12-16 10:37:13 +00:00
David S. Miller
4134c846b6 Merge branch '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/nex
t-queue

Tony Nguyen says:

====================
100GbE Intel Wired LAN Driver Updates 2021-12-15

This series contains updates to ice driver only.

Jake makes changes to flash update. This includes the following:

 * a new shadow-ram region similar to NVM region but for the device shadow
   RAM contents. This is distinct from NVM region because shadow RAM is
   built up during device init and may be different from the raw NVM flash
   data.
 * refactoring of the ice_flash_pldm_image to become the main flash update
   entry point. This is simpler than having both an
   ice_devlink_flash_update and an ice_flash_pldm_image. It will make
   additions like dry-run easier in the future.
 * reducing time to read Option ROM version information.
 * adding support for firmware activation via devlink reload, when
   possible.

The major new work is the reload support, which allows activating firmware
immediately without a reboot when possible. Reload support only supports
firmware activation.

Jesse improves transmit code: utilizing newer netif_tx* API, adding some
prefetch calls, correcting expected conditions when calling ice_vsi_down(),
and utilizing __netdev_tx_sent_queue() call.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2021-12-16 10:23:34 +00:00
David S. Miller
823f7a5497 Merge branch 'mlx5-next' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux
Saeed Mahameed says:

====================
mlx5-next branch 2021-12-15

Hi Dave, Jakub, Jason

This pulls mlx5-next branch into net-next and rdma branches.
All patches already reviewed on both rdma and netdev mailing lists.

Please pull and let me know if there's any problem.

1) Add multiple FDB steering priorities [1]
2) Introduce HW bits needed to configure MAC list size of VF/SF.
   Required for ("net/mlx5: Memory optimizations") upcoming series [2].

[1] https://lore.kernel.org/netdev/20211201193621.9129-1-saeed@kernel.org/
[2] https://lore.kernel.org/lkml/20211208141722.13646-1-shayd@nvidia.com/
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2021-12-16 10:22:20 +00:00
Jakub Kicinski
bd1d97d861 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
Pablo Neira Ayuso says:

====================
Netfilter updates for net-next

The following patchset contains Netfilter updates for net-next, mostly
rather small housekeeping patches:

1) Remove unused variable in IPVS, from GuoYong Zheng.

2) Use memset_after in conntrack, from Kees Cook.

3) Remove leftover function in nfnetlink_queue, from Florian Westphal.

4) Remove redundant test on bool in conntrack, from Bernard Zhao.

5) egress support for nft_fwd, from Lukas Wunner.

6) Make pppoe work for br_netfilter, from Florian Westphal.

7) Remove unused variable in conntrack resize routine, from luo penghao.

* git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next:
  netfilter: conntrack: Remove useless assignment statements
  netfilter: bridge: add support for pppoe filtering
  netfilter: nft_fwd_netdev: Support egress hook
  netfilter: ctnetlink: remove useless type conversion to bool
  netfilter: nf_queue: remove leftover synchronize_rcu
  netfilter: conntrack: Use memset_startat() to zero struct nf_conn
  ipvs: remove unused variable for ip_vs_new_dest
====================

Link: https://lore.kernel.org/r/20211215234911.170741-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-12-15 17:29:28 -08:00
luo penghao
284ca7647c netfilter: conntrack: Remove useless assignment statements
The old_size assignment here will not be used anymore

The clang_analyzer complains as follows:

Value stored to 'old_size' is never read

Reported-by: Zeal Robot <zealci@zte.com.cn>
Signed-off-by: luo penghao <luo.penghao@zte.com.cn>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-12-16 00:17:40 +01:00
Shay Drory
685b1afd79 net/mlx5: Introduce log_max_current_uc_list_wr_supported bit
Downstream patch will use this bit in order to know whether the device
supports changing of max_uc_list.

Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-12-15 10:21:50 -08:00
Jesse Brandeburg
9c99d099f7 ice: use modern kernel API for kick
The kernel gained a new interface for drivers to use to combine tail
bump (doorbell) and BQL updates, attempt to use those new interfaces.

Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Tested-by: Gurucharan G <gurucharanx.g@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2021-12-15 08:49:25 -08:00
Jesse Brandeburg
21c6e36b1e ice: tighter control over VSI_DOWN state
The driver had comments to the effect of: This flag should be set before
calling this function. While reviewing code it was found that there were
several violations of this policy, which could introduce hard to find
bugs or races.

Fix the violations of the "VSI DOWN state must be set before calling
ice_down" and make checking the state into code with a WARN_ON.

Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Tested-by: Gurucharan G <gurucharanx.g@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2021-12-15 08:48:26 -08:00
Jesse Brandeburg
cc14db11c8 ice: use prefetch methods
The kernel provides some prefetch mechanisms to speed up commonly
cold cache line accesses during receive processing. Since these are
software structures it helps to have these strategically placed
prefetches.

Be careful to call BQL prefetch complete only for non XDP queues.

Co-developed-by: Piotr Raczynski <piotr.raczynski@intel.com>
Signed-off-by: Piotr Raczynski <piotr.raczynski@intel.com>
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Tested-by: Gurucharan G <gurucharanx.g@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2021-12-15 08:46:28 -08:00
Jesse Brandeburg
1c96c16858 ice: update to newer kernel API
Use the netif_tx_* API from netdevice.h which has simpler parameters.

Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Tested-by: Gurucharan G <gurucharanx.g@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2021-12-15 08:45:28 -08:00
Jacob Keller
399e27dbbd ice: support immediate firmware activation via devlink reload
The ice hardware contains an embedded chip with firmware which can be
updated using devlink flash. The firmware which runs on this chip is
referred to as the Embedded Management Processor firmware (EMP
firmware).

Activating the new firmware image currently requires that the system be
rebooted. This is not ideal as rebooting the system can cause unwanted
downtime.

In practical terms, activating the firmware does not always require a
full system reboot. In many cases it is possible to activate the EMP
firmware immediately. There are a couple of different scenarios to
cover.

 * The EMP firmware itself can be reloaded by issuing a special update
   to the device called an Embedded Management Processor reset (EMP
   reset). This reset causes the device to reset and reload the EMP
   firmware.

 * PCI configuration changes are only reloaded after a cold PCIe reset.
   Unfortunately there is no generic way to trigger this for a PCIe
   device without a system reboot.

When performing a flash update, firmware is capable of responding with
some information about the specific update requirements.

The driver updates the flash by programming a secondary inactive bank
with the contents of the new image, and then issuing a command to
request to switch the active bank starting from the next load.

The response to the final command for updating the inactive NVM flash
bank includes an indication of the minimum reset required to fully
update the device. This can be one of the following:

 * A full power on is required
 * A cold PCIe reset is required
 * An EMP reset is required

The response to the command to switch flash banks includes an indication
of whether or not the firmware will allow an EMP reset request.

For most updates, an EMP reset is sufficient to load the new EMP
firmware without issues. In some cases, this reset is not sufficient
because the PCI configuration space has changed. When this could cause
incompatibility with the new EMP image, the firmware is capable of
rejecting the EMP reset request.

Add logic to ice_fw_update.c to handle the response data flash update
AdminQ commands.

For the reset level, issue a devlink status notification informing the
user of how to complete the update with a simple suggestion like
"Activate new firmware by rebooting the system".

Cache the status of whether or not firmware will restrict the EMP reset
for use in implementing devlink reload.

Implement support for devlink reload with the "fw_activate" flag. This
allows user space to request the firmware be activated immediately.

For the .reload_down handler, we will issue a request for the EMP reset
using the appropriate firmware AdminQ command. If we know that the
firmware will not allow an EMP reset, simply exit with a suitable
netlink extended ACK message indicating that the EMP reset is not
available.

For the .reload_up handler, simply wait until the driver has finished
resetting. Logic to handle processing of an EMP reset already exists in
the driver as part of its reset and rebuild flows.

Implement support for the devlink reload interface with the
"fw_activate" action. This allows userspace to request activation of
firmware without a reboot.

Note that support for indicating the required reset and EMP reset
restriction is not supported on old versions of firmware. The driver can
determine if the two features are supported by checking the device
capabilities report. I confirmed support has existed since at least
version 5.5.2 as reported by the 'fw.mgmt' version. Support to issue the
EMP reset request has existed in all version of the EMP firmware for the
ice hardware.

Check the device capabilities report to determine whether or not the
indications are reported by the running firmware. If the reset
requirement indication is not supported, always assume a full power on
is necessary. If the reset restriction capability is not supported,
always assume the EMP reset is available.

Users can verify if the EMP reset has activated the firmware by using
the devlink info report to check that the 'running' firmware version has
updated. For example a user might do the following:

 # Check current version
 $ devlink dev info

 # Update the device
 $ devlink dev flash pci/0000:af:00.0 file firmware.bin

 # Confirm stored version updated
 $ devlink dev info

 # Reload to activate new firmware
 $ devlink dev reload pci/0000:af:00.0 action fw_activate

 # Confirm running version updated
 $ devlink dev info

Finally, this change does *not* implement basic driver-only reload
support. I did look into trying to do this. However, it requires
significant refactor of how the ice driver probes and loads everything.
The ice driver probe and allocation flows were not designed with such
a reload in mind. Refactoring the flow to support this is beyond the
scope of this change.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Gurucharan G <gurucharanx.g@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2021-12-15 08:40:38 -08:00
Jacob Keller
af18d8866c ice: reduce time to read Option ROM CIVD data
During probe and device reset, the ice driver reads some data from the
NVM image as part of ice_init_nvm. Part of this data includes a section
of the Option ROM which contains version information.

The function ice_get_orom_civd_data is used to locate the '$CIV' data
section of the Option ROM.

Timing of ice_probe and ice_rebuild indicate that the
ice_get_orom_civd_data function takes about 10 seconds to finish
executing.

The function locates the section by scanning the Option ROM every 512
bytes. This requires a significant number of NVM read accesses, since
the Option ROM bank is 500KB. In the worst case it would take about 1000
reads. Worse, all PFs serialize this operation during reload because of
acquiring the NVM semaphore.

The CIVD section is located at the end of the Option ROM image data.
Unfortunately, the driver has no easy method to determine the offset
manually. Practical experiments have shown that the data could be at
a variety of locations, so simply reversing the scanning order is not
sufficient to reduce the overall read time.

Instead, copy the entire contents of the Option ROM into memory. This
allows reading the data using 4Kb pages instead of 512 bytes at a time.
This reduces the total number of firmware commands by a factor of 8. In
addition, reading the whole section together at once allows better
indication to firmware of when we're "done".

Re-write ice_get_orom_civd_data to allocate virtual memory to store the
Option ROM data. Copy the entire OptionROM contents at once using
ice_read_flash_module. Finally, use this memory copy to scan for the
'$CIV' section.

This change significantly reduces the time to read the Option ROM CIVD
section from ~10 seconds down to ~1 second. This has a significant
impact on the total time to complete a driver rebuild or probe.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Gurucharan G <gurucharanx.g@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2021-12-15 08:38:14 -08:00
Jacob Keller
c9f7a483e4 ice: move ice_devlink_flash_update and merge with ice_flash_pldm_image
The ice_devlink_flash_update function performs a few upfront checks and
then calls ice_flash_pldm_image.

Most if these checks make more sense in the context of code within
ice_flash_pldm_image. Merge ice_devlink_flash_update and
ice_flash_pldm_image into one function, placing it in ice_fw_update.c

Since this is still the entry point for devlink, call the function
ice_devlink_flash_update instead of ice_flash_pldm_image. This leaves a
single function which handles the devlink parameters and then initiates
a PLDM update.

With this change, the ice_devlink_flash_update function in
ice_fw_update.c becomes the main entry point for flash update. It
elimintes some unnecessary boiler plate code between the two previous
functions. The ultimate motivation for this is that it eases supporting
a dry run with the PLDM library in a future change.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Gurucharan G <gurucharanx.g@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2021-12-15 08:37:14 -08:00
Jacob Keller
c356eaa824 ice: move and rename ice_check_for_pending_update
The ice_devlink_flash_update function performs a few checks and then
calls ice_flash_pldm_image. One of these checks is to call
ice_check_for_pending_update. This function checks if the device has
a pending update, and cancels it if so. This is necessary to allow
a new flash update to proceed.

We want to refactor the ice code to eliminate ice_devlink_flash_update,
moving its checks into ice_flash_pldm_image.

To do this, ice_check_for_pending_update will become static, and only
called by ice_flash_pldm_image. To make this change easier to review,
first just move the function up within the ice_fw_update.c file.

While at it, note that the function has a misleading name. Its primary
action is to cancel a pending update. Using the verb "check" does not
imply this. Rename it to ice_cancel_pending_update.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Gurucharan G <gurucharanx.g@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2021-12-15 08:36:10 -08:00
Jacob Keller
78ad87da99 ice: devlink: add shadow-ram region to snapshot Shadow RAM
We have a region for reading the contents of the NVM flash as
a snapshot. This region does not allow reading the Shadow RAM, as it
always passes the FLASH_ONLY bit to the low level firmware interface.

Add a separate shadow-ram region which will allow snapshot of the
current contents of the Shadow RAM. This data is built from the NVM
contents but is distinct as the device builds up the Shadow RAM during
initialization, so being able to snapshot its contents can be useful
when attempting to debug flash related issues.

Fix the comment description of the nvm-flash region which incorrectly
stated that it filled the shadow-ram region, and add a comment
explaining that the nvm-flash region does not actually read the Shadow
RAM.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Gurucharan G <gurucharanx.g@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2021-12-15 08:34:54 -08:00
Jakub Kicinski
3bc14ea0d1 ethtool: always write dev in ethnl_parse_header_dev_get
Commit 0976b888a1 ("ethtool: fix null-ptr-deref on ref tracker")
made the write to req_info.dev conditional, but as Eric points out
in a different follow up the structure is often allocated on the
stack and not kzalloc()'d so seems safer to always write the dev,
in case it's garbage on input.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-12-15 15:09:24 +00:00
Eric Dumazet
f1d9268e06 net: add net device refcount tracker to struct packet_type
Most notable changes are in af_packet, tipc ones are trivial.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Jon Maloy <jmaloy@redhat.com>
Cc: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-12-15 15:07:04 +00:00
David S. Miller
ab8c83cf87 Merge branch 'mlxsw-ipv6-underlay'
Ido Schimmel says:

====================
mlxsw: Add support for VxLAN with IPv6 underlay

So far, mlxsw only supported VxLAN with IPv4 underlay. This patchset
extends mlxsw to also support VxLAN with IPv6 underlay. The main
difference is related to the way IPv6 addresses are handled by the
device. See patch #1 for a detailed explanation.

Patch #1 creates a common hash table to store the mapping from IPv6
addresses to KVDL indexes. This table is useful for both IP-in-IP and
VxLAN tunnels with an IPv6 underlay.

Patch #2 converts the IP-in-IP code to use the new hash table.

Patches #3-#6 are preparations.

Patch #7 finally adds support for VxLAN with IPv6 underlay.

Patch #8 removes a test case that checked that VxLAN configurations with
IPv6 underlay are vetoed by the driver.

A follow-up patchset will add forwarding selftests.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2021-12-15 15:05:44 +00:00