Create set scheduler aggregator node and move for VSIs into respective
scheduler node. Max children per aggregator node is 64.
There are two types of aggregator node(s) created.
1. dedicated node for PF and _CTRL VSIs
2. dedicated node(s) for VFs.
As part of reset and rebuild, aggregator nodes are recreated and VSIs
are moved to respective aggregator node.
Having related VSIs in respective tree avoid starvation between PF and VF
w.r.t Tx bandwidth.
Co-developed-by: Tarun Singh <tarun.k.singh@intel.com>
Signed-off-by: Tarun Singh <tarun.k.singh@intel.com>
Co-developed-by: Victor Raj <victor.raj@intel.com>
Signed-off-by: Victor Raj <victor.raj@intel.com>
Co-developed-by: Anirudh Venkataramanan <anirudh.venkataramanan@intel.com>
Signed-off-by: Anirudh Venkataramanan <anirudh.venkataramanan@intel.com>
Signed-off-by: Kiran Patil <kiran.patil@intel.com>
Tested-by: Tony Brelinski <tonyx.brelinski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Add the framework and initial implementation for receiving and processing
netdev bonding events. This is only the software support and the
implementation of the HW offload for bonding support will be coming at a
later time. There are some architectural gaps that need to be closed
before that happens.
Because this is a software only solution that supports in kernel bonding,
SR-IOV is not supported with this implementation.
Signed-off-by: Dave Ertman <david.m.ertman@intel.com>
Tested-by: Tony Brelinski <tonyx.brelinski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Current implementation of netdev already contains xsk_buff_pools.
We no longer have to contain these structures in ice_vsi.
Refactor the code to operate on netdev-provided xsk_buff_pools.
Move scheduling napi on each queue to a separate function to
simplify setup function.
Signed-off-by: Michal Swiatkowski <michal.swiatkowski@intel.com>
Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Tested-by: Kiran Bhandare <kiranx.bhandare@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
There is a regular need in the kernel to provide a way to declare having
a dynamically sized set of trailing elements in a structure. Kernel code
should always use “flexible array members”[1] for these cases. The older
style of one-element or zero-length arrays should no longer be used[2].
Refactor the code according to the use of a flexible-array member in
struct ice_res_tracker, instead of a one-element array and use the
struct_size() helper to calculate the size for the allocations.
Also, notice that the code below suggests that, currently, two too many
bytes are being allocated with devm_kzalloc(), as the total number of
entries (pf->irq_tracker->num_entries) for pf->irq_tracker->list[] is
_vectors_ and sizeof(*pf->irq_tracker) also includes the size of the
one-element array _list_ in struct ice_res_tracker.
drivers/net/ethernet/intel/ice/ice_main.c:3511:
3511 /* populate SW interrupts pool with number of OS granted IRQs. */
3512 pf->num_avail_sw_msix = (u16)vectors;
3513 pf->irq_tracker->num_entries = (u16)vectors;
3514 pf->irq_tracker->end = pf->irq_tracker->num_entries;
With this change, the right amount of dynamic memory is now allocated
because, contrary to one-element arrays which occupy at least as much
space as a single object of the type, flexible-array members don't
occupy such space in the containing structure.
[1] https://en.wikipedia.org/wiki/Flexible_array_member
[2] https://www.kernel.org/doc/html/v5.9-rc1/process/deprecated.html#zero-length-and-one-element-arrays
Built-tested-by: kernel test robot <lkp@intel.com>
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Tested-by: Tony Brelinski <tonyx.brelinski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
The current MSI-X enablement logic tries to enable best-case MSI-X
vectors and if that fails we only support a bare-minimum set. This
includes a single MSI-X for 1 Tx and 1 Rx queue and a single MSI-X
for the OICR interrupt. Unfortunately, the driver fails to load when we
don't get as many MSI-X as requested for a couple reasons.
First, the code to allocate MSI-X in the driver tries to allocate
num_online_cpus() MSI-X for LAN traffic without caring about the number
of MSI-X actually enabled/requested from the kernel for LAN traffic.
So, when calling ice_get_res() for the PF VSI, it returns failure
because the number of available vectors is less than requested. Fix
this by not allowing the PF VSI to allocation more than
pf->num_lan_msix MSI-X vectors and pf->num_lan_msix Rx/Tx queues.
Limiting the number of queues is done because we don't want more than
1 Tx/Rx queue per interrupt due to performance conerns.
Second, the driver assigns pf->num_lan_msix = 2, to account for LAN
traffic and the OICR. However, pf->num_lan_msix is only meant for LAN
MSI-X. This is causing a failure when the PF VSI tries to
allocate/reserve the minimum pf->num_lan_msix because the OICR MSI-X has
already been reserved, so there may not be enough MSI-X vectors left.
Fix this by setting pf->num_lan_msix = 1 for the failure case. Then the
ICE_MIN_MSIX accounts for the LAN MSI-X and the OICR MSI-X needed for
the failure case.
Update the related defines used in ice_ena_msix_range() to align with
the above behavior and remove the unused RDMA defines because RDMA is
currently not supported. Also, remove the now incorrect comment.
Fixes: 152b978a1f ("ice: Rework ice_ena_msix_range")
Signed-off-by: Brett Creeley <brett.creeley@intel.com>
Tested-by: Tony Brelinski <tonyx.brelinski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
vlan_ena was introduced to track whether VLAN filters are enabled on
the device, but
1) checking for num_vlan > 1 already gives us this information, and is
currently used in this way throughout the code
2) the logic for vlan_ena is broken when multiple VLANs are active
Just remove vlan_ena and use num_vlan instead.
Signed-off-by: Nick Nunley <nicholas.d.nunley@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Currently, the devlink_port structure is stored within the ice_pf. This
made sense because we create a single devlink_port for each PF. This
setup does not mesh with the abstractions in the driver very well, and
led to a flow where we accidentally call devlink_port_unregister twice
during error cleanup.
In particular, if devlink_port_register or devlink_port_unregister are
called twice, this leads to a kernel panic. This appears to occur during
some possible flows while cleaning up from a failure during driver
probe.
If register_netdev fails, then we will call devlink_port_unregister in
ice_cfg_netdev as it cleans up. Later, we again call
devlink_port_unregister since we assume that we must cleanup the port
that is associated with the PF structure.
This occurs because we cleanup the devlink_port for the main PF even
though it was not allocated. We allocated the port within a per-VSI
function for managing the main netdev, but did not release the port when
cleaning up that VSI, the allocation and destruction are not aligned.
Instead of attempting to manage the devlink_port as part of the PF
structure, manage it as part of the PF VSI. Doing this has advantages,
as we can match the de-allocation of the devlink_port with the
unregister_netdev associated with the main PF VSI.
Moving the port to the VSI is preferable as it paves the way for
handling devlink ports allocated for other purposes such as SR-IOV VFs.
Since we're changing up how we allocate the devlink_port, also change
the indexing. Originally, we indexed the port using the PF id number.
This came from an old goal of sharing a devlink for each physical
function. Managing devlink instances across multiple function drivers is
not workable. Instead, lets set the port number to the logical port
number returned by firmware and set the index using the VSI index
(sometimes referred to as VSI handle).
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This takes care of all of the trivial W=1 fixes in the Intel
Ethernet drivers, which allows developers and maintainers to
build more of the networking tree with more complete warning
checks.
There are three classes of kdoc warnings fixed:
- cannot understand function prototype: 'x'
- Excess function parameter 'x' description in 'y'
- Function parameter or member 'x' not described in 'y'
All of the changes were trivial comment updates on
function headers.
Inspired by Lee Jones' series of wireless work to do the same.
Compile tested only, and passes simple test of
$ git ls-files *.[ch] | egrep drivers/net/ethernet/intel | \
xargs scripts/kernel-doc -none
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Replace the explicit umem reference passed to the driver in AF_XDP
zero-copy mode with the buffer pool instead. This in preparation for
extending the functionality of the zero-copy mode so that umems can be
shared between queues on the same netdev and also between netdevs. In
this commit, only an umem reference has been added to the buffer pool
struct. But later commits will add other entities to it. These are
going to be entities that are different between different queue ids
and netdevs even though the umem is shared between them.
Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Björn Töpel <bjorn.topel@intel.com>
Link: https://lore.kernel.org/bpf/1598603189-32145-2-git-send-email-magnus.karlsson@intel.com
Display and count some useful hot-path statistics. The usefulness is as
follows:
- tx_restart: use to determine if the transmit ring size is too small or
if the transmit interrupt rate is too low.
- rx_gro_dropped: use to count drops from GRO layer, which previously were
completely uncounted when occurring.
- tx_busy: use to determine when the driver is miscounting number of
descriptors needed for an skb.
- tx_timeout: as our other drivers, count the number of times we've reset
due to timeout because the kernel only prints a warning once per netdev.
Several of these were already counted but not displayed.
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Use the newly added pldmfw library to implement device flash update for
the Intel ice networking device driver. This support uses the devlink
flash update interface.
The main parts of the flash include the Option ROM, the netlist module,
and the main NVM data. The PLDM firmware file contains modules for each
of these components.
Using the pldmfw library, the provided firmware file will be scanned for
the three major components, "fw.undi" for the Option ROM, "fw.mgmt" for
the main NVM module containing the primary device firmware, and
"fw.netlist" containing the netlist module.
The flash is separated into two banks, the active bank containing the
running firmware, and the inactive bank which we use for update. Each
module is updated in a staged process. First, the inactive bank is
erased, preparing the device for update. Second, the contents of the
component are copied to the inactive portion of the flash. After all
components are updated, the driver signals the device to switch the
active bank during the next EMP reset (which would usually occur during
the next reboot).
Although the firmware AdminQ interface does report an immediate status
for each command, the NVM erase and NVM write commands receive status
asynchronously. The driver must not continue writing until previous
erase and write commands have finished. The real status of the NVM
commands is returned over the receive AdminQ. Implement a simple
interface that uses a wait queue so that the main update thread can
sleep until the completion status is reported by firmware. For erasing
the inactive banks, this can take quite a while in practice.
To help visualize the process to the devlink application and other
applications based on the devlink netlink interface, status is reported
via the devlink_flash_update_status_notify. While we do report status
after each 4k block when writing, there is no real status we can report
during erasing. We simply must wait for the complete module erasure to
finish.
With this implementation, basic flash update for the ice hardware is
supported.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When the Port Disable bit is set in the Link Default Override Mask TLV PFA
module in the NVM, Total Port Shutdown mode is supported and enabled. In
this mode, the driver should act as if the link-down-on-close ethtool
private flag is always enabled and dis-allow any change to that flag.
Signed-off-by: Bruce Allan <bruce.w.allan@intel.com>
Signed-off-by: Paul Greenwalt <paul.greenwalt@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Adds functions to check for link override firmware support and get
the override settings for a port. The previously supported/default link
mode was strict mode.
In strict mode link is configured based on get PHY capabilities PHY types
with media.
Lenient mode is now the default link mode. In lenient mode the link is
configured based on get PHY capabilities PHY types without media. This
allows the user to configure link that the media does not report. Limit the
minimum supported link mode to 25G for devices that support 100G, and 1G
for devices that support less than 100G.
Default override is only supported in lenient mode. If default override
is supported and enabled, then default override values are used for
configuring speed and FEC. Default override provide persistent link
settings in the NVM.
Signed-off-by: Paul Greenwalt <paul.greenwalt@intel.com>
Signed-off-by: Evan Swanson <evan.swanson@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
After the transition from no media to media FW will clear the
set-phy-cfg data set by the user. Save initial PHY settings and any
settings later requested by the user and use that data to restore PHY
settings on media insertion. Since PHY configuration is now being stored,
replace calls that were calling FW to get the configuration with the saved
copy.
Signed-off-by: Paul Greenwalt <paul.greenwalt@intel.com>
Signed-off-by: Chinh T Cao <chinh.t.cao@intel.com>
Signed-off-by: Paul M Stillwell Jr <paul.m.stillwell.jr@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Add callbacks needed to support advanced power management for Wake on LAN.
Also make ice_pf_state_is_nominal function available for all configurations
not just CONFIG_PCI_IOV.
Signed-off-by: Akeem G Abodunrin <akeem.g.abodunrin@intel.com>
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Add a new devlink region used for capturing a snapshot of the device
capabilities buffer which is reported by the firmware over the AdminQ.
This information can useful in debugging driver and firmware
interactions.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
As with other networking drivers, remove the unnecessary driver version
from the Intel drivers. The ethtool driver information and module version
will then report the kernel version instead.
For ixgbe, i40e and ice drivers, the driver passes the driver version to
the firmware to confirm that we are up and running. So we now pass the
value of UTS_RELEASE to the firmware. This adminq call is required per
the HAS document. The Device then sends an indication to the BMC that the
PF driver is present. This is done using Host NC Driver Status Indication
in NC-SI Get Link command or via the Host Network Controller Driver Status
Change AEN.
What the BMC may do with this information is implementation-dependent, but
this is a standard NC-SI 1.1 command we honor per the HAS.
CC: Bruce Allan <bruce.w.allan@intel.com>
CC: Jesse Brandeburg <jesse.brandeburg@intel.com>
CC: Alek Loktionov <aleksandr.loktionov@intel.com>
CC: Kevin Liedtke <kevin.d.liedtke@intel.com>
CC: Aaron Rowden <aaron.f.rowden@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Co-developed-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Enable accelerated Receive Flow Steering (aRFS). It is used to steer Rx
flows to a specific queue. This functionality is triggered by the network
stack through ndo_rx_flow_steer and requires Flow Director (ntuple on) to
function.
The fltr_info is used to add/remove/update flow rules in the HW, the
fltr_state is used to determine what to do with the filter with respect
to HW and/or SW, and the flow_id is used in co-ordination with the
network stack.
The work for aRFS is split into two paths: the ndo_rx_flow_steer
operation and the ice_service_task. The former is where the kernel hands
us an Rx SKB among other items to setup aRFS and the latter is where
the driver adds/updates/removes filter rules from HW and updates filter
state.
In the Rx path the following things can happen:
1. New aRFS entries are added to the hash table and the state is
set to ICE_ARFS_INACTIVE so the filter can be updated in HW
by the ice_service_task path.
2. aRFS entries have their Rx Queue updated if we receive a
pre-existing flow_id and the filter state is ICE_ARFS_ACTIVE.
The state is set to ICE_ARFS_INACTIVE so the filter can be
updated in HW by the ice_service_task path.
3. aRFS entries marked as ICE_ARFS_TODEL are deleted
In the ice_service_task path the following things can happen:
1. New aRFS entries marked as ICE_ARFS_INACTIVE are added or
updated in HW.
and their state is updated to ICE_ARFS_ACTIVE.
2. aRFS entries are deleted from HW and their state is updated
to ICE_ARFS_TODEL.
Signed-off-by: Brett Creeley <brett.creeley@intel.com>
Signed-off-by: Madhu Chittim <madhu.chittim@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Following a reset, Flow Director filters are cleared from the hardware.
Rebuild the filters using the software structures containing the filter
rules.
Signed-off-by: Henry Tieman <henry.w.tieman@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Add functionality for ethtool --show-ntuple, allowing for filters to be
displayed when set functionality is added. Add statistics related to
Flow Director matches and status.
Signed-off-by: Henry Tieman <henry.w.tieman@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Flow Director allows for redirection based on ntuple rules. Rules are
programmed using the ethtool set-ntuple interface. Supported actions are
redirect to queue and drop.
Setup the initial framework to process Flow Director filters. Create and
allocate resources to manage and program filters to the hardware. Filters
are processed via a sideband interface; a control VSI is created to manage
communication and process requests through the sideband. Upon allocation of
resources, update the hardware tables to accept perfect filters.
Signed-off-by: Henry Tieman <henry.w.tieman@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
The vf_id variable is dealt with in the code in inconsistent
ways of sign usage, preventing compilation with -Werror=sign-compare.
Fix this problem in the code by always treating vf_id as unsigned, since
there are no valid values of vf_id that are negative.
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Change min() macros to min_t() which has compare type specified and it
helps avoid precision loss.
In some cases there was precision loss during calls or assignments.
Some fields in structs were unnecessarily large and gave multiple
warnings.
There were also some minor type differences which are now fixed as well as
some cases where a simple cast was needed.
Callers were were passing data that is a u16 to
ice_sched_cfg_node_bw_alloc() but the function was truncating that to a u8.
Fix that by changing the function to take a u16.
Signed-off-by: Karol Kolacinski <karol.kolacinski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
When printing the ice status or AQ error codes, instead of printing out the
numerical value, provide the description of the error code. This provides
more info about the issue than a number.
Signed-off-by: Lihong Yang <lihong.yang@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Implement promiscuous support for VF VSIs. Behaviour of promiscuous support
is based on VF trust as well as the, introduced, vf-true-promisc flag.
A trusted VF with vf-true-promisc disabled will be the default VSI, which
means that all traffic without a matching destination MAC address in the
device's internal switch will be forwarded to this VF VSI.
A trusted VF with vf-true-promisc enabled will go into "true promiscuous
mode". This amounts to the VF receiving all ingress and egress traffic
that hits the device's internal switch.
An untrusted VF will only receive traffic destined for that VF.
The vf-true-promisc-support flag cannot be toggled while any VF is in
promiscuous mode. This flag should be set prior to loading the iavf driver
or spawning VF(s).
Signed-off-by: Brett Creeley <brett.creeley@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Create a boost TCAM entry for each tunnel port in order to get a tunnel
PTYPE. Update netdev feature flags and implement the appropriate logic to
get and set values for hardware offloads.
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Signed-off-by: Henry Tieman <henry.w.tieman@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Add a devlink region for exposing the device's Non Volatime Memory flash
contents.
Support the recently added .snapshot operation, enabling userspace to
request a snapshot of the NVM contents via DEVLINK_CMD_REGION_NEW.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Begin implementing support for the devlink interface with the ice
driver.
The pf structure is currently memory managed through devres, via
a devm_alloc. To mimic this behavior, after allocating the devlink
pointer, use devm_add_action to add a teardown action for releasing the
devlink memory on exit.
The ice hardware is a multi-function PCIe device. Thus, each physical
function will get its own devlink instance. This means that each
function will be treated independently, with its own parameters and
configuration. This is done because the ice driver loads a separate
instance for each function.
Due to this, the implementation does not enable devlink to manage
device-wide resources or configuration, as each physical function will
be treated independently. This is done for simplicity, as managing
a devlink instance across multiple driver instances would significantly
increase the complexity for minimal gain.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Currently the PF's mailbox receive queue is only 512 entries. This fine,
but considering that all VF's mailbox send queues funnel into the PF's
single mailbox receive queue, let's increase it to the maximum size. This
will help prevent any possible bottleneck/slowdown occurring from the PF's
mailbox receive queue being full.
Signed-off-by: Lukasz Czapnik <lukasz.czapnik@intel.com>
Signed-off-by: Brett Creeley <brett.creeley@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Currently, if there are bare-metal VFs passing traffic and the ice
driver is removed, there is a possibility of VFs triggering a Tx timeout
right before iavf_remove(). This is causing iavf_close() to not be
called because there is a check in the beginning of iavf_remove() that
bails out early if (adapter->state < IAVF_DOWN_PENDING). This makes it
so some resources do not get cleaned up. Specifically, free_irq()
is never called for data interrupts, which results in the following line
of code to trigger:
pci_disable_msix()
free_msi_irqs()
...
BUG_ON(irq_has_action(entry->irq + i));
...
To prevent the Tx timeout from occurring on the VF during driver unload
for ice and the iavf there are a few changes that are needed.
[1] Don't disable all active VF Tx/Rx queues prior to calling
pci_disable_sriov.
[2] Call ice_free_vfs() before disabling the service task.
[3] Disable VF resets when the ice driver is being unloaded by setting
the pf->state flag __ICE_VF_RESETS_DISABLED.
Changing [1] and [2] allow each VF driver's remove flow to successfully
send VIRTCHNL requests, which includes queue disable. This prevents
unexpected Tx timeouts because the PF driver is no longer forcefully
disabling queues.
Due to [1] and [2] there is a possibility that the PF driver will get a
VFLR or reset request over VIRTCHNL from a VF during PF driver unload.
Prevent that by doing [3].
Signed-off-by: Brett Creeley <brett.creeley@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Currently when the device runs out of MSI-X interrupts a cryptic and
unhelpful message is printed. This will cause confusion when hitting this
case. Fix this by clearing up the error message for both SR-IOV and non
SR-IOV use cases.
Also, make a few minor changes to increase clarity of variables.
1. Change per VF MSI-X and queue pair variables in the PF structure.
2. Use ICE_NONQ_VECS_VF when determining pf->num_msix_per_vf instead of
the magic number "1". This vector is reserved for the OICR.
All of the resource tracking functions were moved to avoid adding
any forward declaration function prototypes.
Signed-off-by: Brett Creeley <brett.creeley@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Unlike the XL710 series, 800-series hardware can allocate more than 4
MSI-X vectors per VF. This patch enables that functionality. We
dynamically allocate vectors and queues depending on how many VFs are
enabled. Allocating the maximum number of VFs replicates XL710
behavior with 4 queues and 4 vectors. But allocating a smaller number
of VFs will give you 16 queues and 16 vectors.
Signed-off-by: Mitch Williams <mitch.a.williams@intel.com>
Signed-off-by: Brett Creeley <brett.creeley@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Update the PF VFs MDD event message to rate limit once per second and
report the total number Rx|Tx event count. Add support to print pending
MDD events that occur during the rate limit. The use of net_ratelimit did
not allow for per VF Rx|Tx granularity.
Additional PF MDD log messages are guarded by netif_msg_[rx|tx]_err().
Since VF RX MDD events disable the queue, add ethtool private flag
mdd-auto-reset-vf to configure VF reset to re-enable the queue.
Disable anti-spoof detection interrupt to prevent spurious events
during a function reset.
To avoid race condition do not make PF MDD register reads conditional
on global MDD result.
Signed-off-by: Paul Greenwalt <paul.greenwalt@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
In ice_xsk_umem(), variable qid which is later used as an array index,
is not validated for a possible boundary exceedance. Because of that,
a calling function might receive an invalid address, which causes
general protection fault when dereferenced.
To address this, add a boundary check to see if qid is greater than the
size of a UMEM array. Also, don't let user change vsi->num_xsk_umems
just by trying to setup a second UMEM if its value is already set up
(i.e. UMEM region has already been allocated for this VSI).
While at it, make sure that ring->zca.free pointer is always zeroed out
if there is no UMEM on a specified ring.
Signed-off-by: Krzysztof Kazimierczak <krzysztof.kazimierczak@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
We can't have more than one default VSI so prevent another VSI from
overwriting the current dflt_vsi. This was achieved by adding the
following functions:
ice_is_dflt_vsi_in_use()
- Used to check if the default VSI is already being used.
ice_is_vsi_dflt_vsi()
- Used to check if VSI passed in is in fact the default VSI.
ice_set_dflt_vsi()
- Used to set the default VSI via a switch rule
ice_clear_dflt_vsi()
- Used to clear the default VSI via a switch rule.
Also, there was no need to introduce any locking because all mailbox
events and synchronization of switch filters for the PF happen in the
service task.
Signed-off-by: Brett Creeley <brett.creeley@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
There are many things wrong with the function
ice_set_vf_spoofchk().
1. The VSI being modified is the PF VSI, not the VF VSI.
2. We are enabling Rx VLAN pruning instead of Tx VLAN anti-spoof.
3. The spoofchk setting for each VF is not initialized correctly
or re-initialized correctly on reset.
To fix [1] we need to make sure we are modifying the VF VSI.
This is done by using the vf->lan_vsi_idx to index into the PF's
VSI array.
To fix [2] replace setting Rx VLAN pruning in ice_set_vf_spoofchk()
with setting Tx VLAN anti-spoof.
To Fix [3] we need to make sure the initial VSI settings match what
is done in ice_set_vf_spoofchk() for spoofchk=on. Also make sure
this also works for VF reset. This was done by modifying ice_vsi_init()
to account for the current spoofchk state of the VF VSI.
Because of these changes, Tx VLAN anti-spoof needs to be removed
from ice_cfg_vlan_pruning(). This is okay for the VF because this
is now controlled from the admin enabling/disabling spoofchk. For the
PF, Tx VLAN anti-spoof should not be set. This change requires us to
call ice_set_vf_spoofchk() when configuring promiscuous mode for
the VF which requires ice_set_vf_spoofchk() to move in order to prevent
a forward declaration prototype.
Also, add VLAN 0 by default when allocating a VF since the PF is unaware
if the guest OS is running the 8021q module. Without this, MDD events will
trigger on untagged traffic because spoofcheck is enabled by default. Due
to this change, ignore add/delete messages for VLAN 0 from VIRTCHNL since
this is added/deleted during VF initialization/teardown respectively and
should not be modified.
Signed-off-by: Brett Creeley <brett.creeley@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Add code to query and set the number of channels on the primary VSI for a
PF. This is accessed from the 'ethtool -l' and 'ethtool -L' commands,
respectively. Though the ice driver supports asymmetric queues report an
IRQ vector that has both Rx and Tx queues attached and is counted as a
'combined' channel.
Signed-off-by: Henry Tieman <henry.w.tieman@intel.com>
Co-developed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
We use &pf->dev->pdev all over the code. Add a simple
macro to do this for us. When multiple de-references
like this are being done add a local struct device
variable.
Signed-off-by: Brett Creeley <brett.creeley@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Implement interface layer for the DCBNL subsystem. These are the functions
to support the callbacks defined in the dcbnl_rtnl_ops struct. These
callbacks are going to be used to interface with the DCB settings of the
device. Implementation of dcb_nl set functions and supporting SW DCB
functions.
Signed-off-by: Dave Ertman <david.m.ertman@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
DCB configuration flow needs to disable and enable only the PF (main)
VSI, so use ice_ena_vsi and ice_dis_vsi. To avoid the use of ifdef to
control the staticness of these functions, move them to ice_lib.c.
Also replace the allocate and copy of old_cfg to kmemdup() in
ice_pf_dcb_cfg().
Signed-off-by: Anirudh Venkataramanan <anirudh.venkataramanan@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Add an ethtool "legacy-rx" priv flag for toggling the Rx path. This
control knob will be mainly used for build_skb usage as well as buffer
size/MTU manipulation.
In preparation for adding build_skb support in a way that it takes
care of how we set the values of max_frame and rx_buf_len fields of
struct ice_vsi. Specifically, in this patch mentioned fields are set to
values that will allow us to provide headroom and tailroom in-place.
This can be mostly broken down onto following:
- for legacy-rx "on" ethtool control knob, old behaviour is kept;
- for standard 1500 MTU size configure the buffer of size 1536, as
network stack is expecting the NET_SKB_PAD to be provided and
NET_IP_ALIGN can have a non-zero value (these can be typically equal
to 32 and 2, respectively);
- for larger MTUs go with max_frame set to 9k and configure the 3k
buffer in case when PAGE_SIZE of underlying arch is less than 8k; 3k
buffer is implying the need for order 1 page, so that our page
recycling scheme can still be applied;
With that said, substitute the hardcoded ICE_RXBUF_2048 and PAGE_SIZE
values in DMA API that we're making use of with rx_ring->rx_buf_len and
ice_rx_pg_size(rx_ring). The latter is an introduced helper for
determining the page size based on its order (which was figured out via
ice_rx_pg_order). Last but not least, take care of truesize calculation.
In the followup patch the headroom/tailroom computation logic will be
introduced.
This change aligns the buffer and frame configuration with other Intel
drivers, most importantly with iavf.
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Add zero copy AF_XDP support. This patch adds zero copy support for
Tx and Rx; code for zero copy is added to ice_xsk.h and ice_xsk.c.
For Tx, implement ndo_xsk_wakeup. As with other drivers, reuse
existing XDP Tx queues for this task, since XDP_REDIRECT guarantees
mutual exclusion between different NAPI contexts based on CPU ID. In
turn, a netdev can XDP_REDIRECT to another netdev with a different
NAPI context, since the operation is bound to a specific core and each
core has its own hardware ring.
For Rx, allocate frames as MEM_TYPE_ZERO_COPY on queues that AF_XDP is
enabled.
Signed-off-by: Krzysztof Kazimierczak <krzysztof.kazimierczak@intel.com>
Co-developed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Add support for XDP. Implement ndo_bpf and ndo_xdp_xmit. Upon load of
an XDP program, allocate additional Tx rings for dedicated XDP use.
The following actions are supported: XDP_TX, XDP_DROP, XDP_REDIRECT,
XDP_PASS, and XDP_ABORTED.
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Remove a few uses of kernel configuration flags from ice_lib.c by
introducing a new source file ice_base.c. Also move corresponding
function prototypes from ice_lib.h to ice_base.h and include ice_base.h
where required.
Signed-off-by: Anirudh Venkataramanan <anirudh.venkataramanan@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Attempt to request an optional device-specific DDP package file
(one with the PCIe Device Serial Number in its name so that different DDP
package files can be used on different devices). If the optional package
file exists, download it to the device. If not, download the default
package file.
Log an appropriate message based on whether or not a DDP package
file exists and the return code from the attempt to download it to the
device. If the download fails and there is not already a package file on
the device, go into "Safe Mode" where some features are not supported.
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
The FW build id is currently being displayed as an int which doesn't make
sense. Instead display FW build id as a hex value. Also add other useful
information to the output such as NVM version, API patch info, and FW
build hash.
Signed-off-by: Lukasz Czapnik <lukasz.czapnik@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
The driver is required to send a version to the firmware
to indicate that the driver is up. If the driver doesn't
do this the firmware doesn't behave properly.
Signed-off-by: Paul M Stillwell Jr <paul.m.stillwell.jr@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
The driver should start out with a reasonable number of descriptors that
can prevent drops due to a CPU being in a power management state.
Change the default number of descriptors to 2048.
The user can always change the value at runtime. Transmit descriptor
counts are not modified because they don't need to change due to the
speed of the interface, or for power managed CPUs, but the code is
simplified to a fixed value for the transmit default.
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Remove q_left_tx and q_left_rx from the PF struct as these can be
obtained by calling ice_get_avail_txq_count and ice_get_avail_rxq_count
respectively.
The function ice_determine_q_usage is only setting num_lan_tx and
num_lan_rx in the PF structure, and these are later assigned to
vsi->alloc_txq and vsi->alloc_rxq respectively. This is an unnecessary
indirection, so remove ice_determine_q_usage and just assign values
for vsi->alloc_txq and vsi->alloc_rxq in ice_vsi_set_num_qs and use
these to set num_lan_tx and num_lan_rx respectively.
Signed-off-by: Anirudh Venkataramanan <anirudh.venkataramanan@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>