Commit Graph

824423 Commits

Author SHA1 Message Date
Heiner Kallweit
2d64610934 net: phy: aquantia: inform about proprietary 1000Base-T2 mode being in use
The AQCS109 supports a proprietary 2-pair 1Gbps mode. The standard
registers don't allow distinguishing between 1000BaseT and 1000BaseT2.
Add reporting this proprietary mode based on a vendor register.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-26 11:33:43 -07:00
Heiner Kallweit
43429a0353 net: phy: aquantia: report PHY details like firmware version
Add reporting firmware details. These details are available only once
the firmware has finished initializing the chip. This can take some
time and we need to poll for init completion.

v2:
- Propagate timeout in aqr107_wait_reset_complete(). Don't bail out
  completely on timeout because chip may be functional even w/o
  firmware image.
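
The polling described above can be sketched in plain C. This is a minimal userspace model, not the driver code: the callback type, the helper names and the poll budget are all hypothetical, and a real driver would sleep between polls.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for reading the PHY's "firmware ready" state. */
typedef bool (*fw_ready_fn)(void *ctx);

/*
 * Poll ready() up to max_polls times. Returns 0 once the firmware
 * reports ready, or -ETIMEDOUT so the caller can decide what to skip.
 */
static int wait_reset_complete(fw_ready_fn ready, void *ctx, int max_polls)
{
    for (int i = 0; i < max_polls; i++) {
        if (ready(ctx))
            return 0;
        /* A real driver would sleep here before polling again. */
    }
    return -ETIMEDOUT;
}

/* Demo helper (hypothetical): reports ready after *ctx polls. */
static bool ready_after(void *ctx)
{
    int *remaining = ctx;

    return --(*remaining) <= 0;
}
```

On timeout the caller would only skip the firmware report rather than fail the whole probe, matching the v2 note.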

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-26 11:33:43 -07:00
Heiner Kallweit
9d685c11bf net: phy: aquantia: print remote capabilities if link partner is Aquantia PHY
If both link partners are Aquantia PHYs then additional information is
exchanged as part of the auto-negotiation. Report remote capabilities
if link partner is Aquantia PHY.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-26 11:33:43 -07:00
Vladimir Oltean
6146dd453e net: dsa: Avoid null pointer when failing to connect to PHY
When phylink_of_phy_connect fails, dsa_slave_phy_setup tries to save the
day by connecting to an alternative PHY, none other than a PHY on the
switch's internal MDIO bus, at an address equal to the port's index.

However this does not take into consideration the scenario when the
switch that failed to probe an external PHY does not have an internal
MDIO bus at all.

Fixes: aab9c4067d ("net: dsa: Plug in PHYLINK support")
Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-26 11:26:55 -07:00
Heiner Kallweit
9675db398b net: phy: aquantia: simplify aqr_config_aneg
Simplify aqr_config_aneg().

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-26 11:16:22 -07:00
David S. Miller
be67101fbf Merge branch '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue
Jeff Kirsher says:

====================
100GbE Intel Wired LAN Driver Updates 2019-03-25

This series contains updates to the ice driver only.

Victor updates the ice driver to be able to update the VSI queue
configuration dynamically, by providing the ability to increase or
decrease the VSI's number of queues.

Michal fixes an issue when the VM starts or the VF driver is reloaded,
the VLAN switch rule was lost (i.e. not added), so ensure it gets added
in these cases.

Brett updates the driver to support link events over the admin receive
queue, instead of polling link events.

Maciej refactors the code a bit to introduce a new function to fetch the
receiver buffer and do the DMA synchronization to reduce the code
duplication.  Also added ice_can_reuse_rx_page() to verify whether the
page can be reused so that in the future, we can use this check
elsewhere in the driver.  Additional driver optimizations so that we can
drop the ice_pull_tail() altogether.  Added support for bulk updates of
refcount instead of doing it one by one.  Refactored the page counting
and buffer recycling so that we can use this code to clean up receive
buffers when there is no skb allocated, like XDP.  Added
DMA_ATTR_WEAK_ORDERING and DMA_ATTR_SKIP_CPU_SYNC attributes to the DMA
API during the mapping operations on the receive side, so that nonx86
platforms will be able to sync with what is being used (2k buffers)
instead of the entire page.

Dave fixes the driver to perform the most intrusive of the resets
requested and clear the other request bits so that we do not end up with
repeated reset, after reset.

Bruce adds an iterator macro to clean up several for() loops.

Chinh modifies the packet flags to be more generic so that they can be
used for both receive and transmit.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-26 09:41:27 -07:00
Chinh T Cao
86e81794ac ice: Create a generic name for the ice_rx_flg64_bits structure
This structure is used to define the packet flags. These flags are
applicable to both TX and RX packets. Thus, this patch changes its
name from ice_rx_flg64_bits to ice_flg64_bits, and its member definition.

Signed-off-by: Chinh T Cao <chinh.t.cao@intel.com>
Reviewed-by: Bruce Allan <bruce.w.allan@intel.com>
Signed-off-by: Anirudh Venkataramanan <anirudh.venkataramanan@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2019-03-25 10:40:04 -07:00
Bruce Allan
2bdc97be97 ice: add and use new ice_for_each_traffic_class() macro
There are numerous for() loops iterating over each of the max traffic
classes.  Use a simple iterator macro instead to make the code cleaner.
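
A sketch of what such an iterator macro can look like. The names and the limit of 8 classes are assumptions chosen for illustration, not copied from the driver.

```c
/* Assumed upper bound on traffic classes, for illustration only. */
#define MAX_TRAFFIC_CLASS 8

/* Hypothetical iterator macro: hides the loop bounds in one place. */
#define for_each_traffic_class(_i) \
    for ((_i) = 0; (_i) < MAX_TRAFFIC_CLASS; (_i)++)

/* Example user: count the traffic classes enabled in a bitmap. */
static int count_enabled_tcs(unsigned int ena_tc_bitmap)
{
    int i, count = 0;

    for_each_traffic_class(i)
        if (ena_tc_bitmap & (1u << i))
            count++;

    return count;
}
```

Every loop that previously spelled out the bounds by hand shrinks to a single line, and the bound can only be wrong in one place.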

Signed-off-by: Bruce Allan <bruce.w.allan@intel.com>
Signed-off-by: Anirudh Venkataramanan <anirudh.venkataramanan@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2019-03-25 10:33:54 -07:00
Preethi Banala
105e5bc23a ice: change VF VSI tc info along with num_queues
Update VF VSI tc info along with vsi->num_txq/num_rxq when VF requests to
configure queues.

Signed-off-by: Preethi Banala <preethi.banala@intel.com>
Signed-off-by: Anirudh Venkataramanan <anirudh.venkataramanan@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2019-03-25 10:14:02 -07:00
Dave Ertman
2ebd4428d9 ice: Prevent unintended multiple chain resets
In the current implementation of ice_reset_subtask, if multiple reset
types are set in the pf->state, the most intrusive one is meant to be
performed only, but the bits requesting the other types are not being
cleared. This would lead to another reset being performed the next time
the service task is scheduled.

Change the flow of ice_reset_subtask so that all reset request bits in
pf->state are cleared, and we still perform the most intrusive of the
resets requested.
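
The changed flow can be modelled with a plain bitmask. This is a sketch under assumptions: the flag names and the ordering (global, then core, then PF as most to least intrusive) are hypothetical, and the real driver uses atomic bit operations on pf->state.

```c
/* Hypothetical reset request bits; higher value = more intrusive. */
#define RESET_PF     (1u << 0)
#define RESET_CORE   (1u << 1)
#define RESET_GLOBAL (1u << 2)
#define RESET_ALL    (RESET_PF | RESET_CORE | RESET_GLOBAL)

/*
 * Pick the most intrusive pending reset and clear every reset request
 * bit, so the next service-task run does not start another reset.
 * Returns 0 when nothing was requested.
 */
static unsigned int pick_reset(unsigned int *state)
{
    unsigned int pending = *state & RESET_ALL;
    unsigned int chosen = 0;

    if (pending & RESET_GLOBAL)
        chosen = RESET_GLOBAL;
    else if (pending & RESET_CORE)
        chosen = RESET_CORE;
    else if (pending & RESET_PF)
        chosen = RESET_PF;

    *state &= ~RESET_ALL;   /* clear all requests, not just the chosen one */
    return chosen;
}
```

Calling pick_reset() a second time on the same state returns 0, which is the point of the fix.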

Signed-off-by: Dave Ertman <david.m.ertman@intel.com>
Signed-off-by: Anirudh Venkataramanan <anirudh.venkataramanan@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2019-03-25 10:12:21 -07:00
Maciej Fijalkowski
a65f71fed5 ice: map Rx buffer pages with DMA attributes
Provide DMA_ATTR_WEAK_ORDERING and DMA_ATTR_SKIP_CPU_SYNC attributes to
the DMA API during the mapping operations on Rx side. With this change
the non-x86 platforms will be able to sync only with what is being used
(2k buffer) instead of entire page. This should yield a slight
performance improvement.

Furthermore, DMA unmap may destroy the changes that were made to the
buffer by CPU when platform is not an x86 one. DMA_ATTR_SKIP_CPU_SYNC
attribute usage fixes this issue.

Also add a sync_single_for_device call during the Rx buffer assignment,
to make sure that the cache lines are cleared before device attempting
to write to the buffer.

Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Anirudh Venkataramanan <anirudh.venkataramanan@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2019-03-25 10:10:39 -07:00
Maciej Fijalkowski
712edbbb67 ice: Limit the ice_add_rx_frag to frag addition
Refactor ice_fetch_rx_buf and ice_add_rx_frag in a way that we have
standalone functions that do either the skb construction or frag
addition to previously constructed skb.

The skb handling between rx_bufs is spread among various functions. The
ice_get_rx_buf will retrieve the skb pointer from rx_buf and if it is a
NULL pointer then we do the ice_construct_skb, otherwise we add a frag
to the current skb via ice_add_rx_frag. Then, on the ice_put_rx_buf the
skb pointer that belongs to rx_buf will be cleared. Moving further, if
the current frame is not EOP frame we assign the current skb to the
rx_buf that is pointed by updated next_to_clean indicator.

What is more, during the buffer reuse let's assign each member of
ice_rx_buf individually so we avoid the unnecessary copy of skb.

Last but not least, this logic split will allow for better code reuse
when adding support for build_skb.

Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Anirudh Venkataramanan <anirudh.venkataramanan@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2019-03-25 09:55:35 -07:00
Maciej Fijalkowski
1d032bc77b ice: Gather the rx buf clean-up logic for better reuse
Pull out the code responsible for page counting and buffer recycling so
that it will be possible to clean up the Rx buffers in cases where we
won't allocate skb (e.g. XDP).

Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Anirudh Venkataramanan <anirudh.venkataramanan@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2019-03-25 09:45:07 -07:00
Maciej Fijalkowski
03c66a1376 ice: Introduce bulk update for page count
{get,put}_page are atomic operations which we use for page count
handling. The current logic for refcount handling is that we increment
it when passing a skb with the data from the first half of page up to
netstack and recycle the second half of page. This operation protects us
from losing a page since the network stack can decrement the refcount of
page from skb.

The performance can be slightly improved by doing the bulk updates of
refcount instead of doing it one by one. During the buffer initialization,
maximize the page's refcount and don't allow the refcount to become
less than two.
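
One common way to do such bulk updates is a driver-local bias counter. The single-threaded model below assumes that approach; the struct, its field names and the atomic-operation counter are hypothetical, and plain integers stand in for the kernel's atomic page refcount.

```c
#include <limits.h>

/* Userspace model of an Rx page; not the kernel's struct page. */
struct rx_page {
    unsigned int refcount;    /* models the shared, atomic page count */
    unsigned int bias;        /* references still owned by the driver */
    unsigned int driver_atomic_ops; /* counts driver-side atomic updates */
};

/* Take almost all references up front with one bulk update. */
static void rx_page_init(struct rx_page *p)
{
    p->refcount = 1;                 /* reference from allocation */
    p->refcount += USHRT_MAX - 1;    /* single bulk add */
    p->bias = USHRT_MAX;
    p->driver_atomic_ops = 1;
}

/* Hand one reference to the stack: only the local bias changes. */
static void rx_page_give_to_stack(struct rx_page *p)
{
    p->bias--;
    if (p->bias == 1) {              /* running low: top up in bulk */
        p->refcount += USHRT_MAX - 1;
        p->bias = USHRT_MAX;
        p->driver_atomic_ops++;
    }
}

/* The stack dropping its reference (put_page() in the kernel). */
static void stack_put_page(struct rx_page *p)
{
    p->refcount--;
}

/* References currently held outside the driver. */
static unsigned int rx_page_outstanding(const struct rx_page *p)
{
    return p->refcount - p->bias;
}
```

In this model the driver touches the shared counter once per roughly 65k buffers instead of once per buffer.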

Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Anirudh Venkataramanan <anirudh.venkataramanan@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2019-03-25 09:33:13 -07:00
Maciej Fijalkowski
1857ca42a7 ice: Get rid of ice_pull_tail
Instead of adding a frag and later when dealing with EOP frame accessing
that frag in order to copy the headers onto linear part of skb, we can do
this in ice_add_rx_frag in case where the data_len is still 0 and frame
won't fit onto the linear part as a whole.

Function comment of ice_pull_tail was a bit misleading because of
mentioned optimizations that can be performed (drop a frag/maintaining
accurate truesize of skb) - it seems that this part of logic was dropped
and the comment was not updated to reflect this change.

Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Anirudh Venkataramanan <anirudh.venkataramanan@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2019-03-25 09:22:21 -07:00
Maciej Fijalkowski
bbb97808a0 ice: Pull out page reuse checks onto separate function
Introduce ice_can_reuse_rx_page which will verify whether the page can
be reused and return the boolean result to caller.

Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Anirudh Venkataramanan <anirudh.venkataramanan@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2019-03-25 08:00:14 -07:00
Maciej Fijalkowski
6c869cb7a8 ice: Retrieve rx_buf in separate function
Introduce ice_get_rx_buf, which will fetch the Rx buffer and do the DMA
synchronization. Length of the packet that hardware Rx descriptor
contains is now read in ice_clean_rx_irq, so we can feed ice_get_rx_buf
with it and drop the rx_desc argument from ice_fetch_rx_buf
and ice_add_rx_frag.

Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Anirudh Venkataramanan <anirudh.venkataramanan@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2019-03-25 07:58:17 -07:00
Brett Creeley
250c3b3e0a ice: Enable link events over the ARQ
The hardware now supports link events over the admin receive queue (ARQ),
so enable HW link events over the ARQ and remove code for link event
polling.

Signed-off-by: Brett Creeley <brett.creeley@intel.com>
Reviewed-by: Bruce Allan <bruce.w.allan@intel.com>
Signed-off-by: Anirudh Venkataramanan <anirudh.venkataramanan@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2019-03-25 07:54:44 -07:00
Alan Brady
8d051b8b5d ice: use irq_num var in ice_vsi_req_irq_msix
Someone went through the effort of making this a variable so let's use
it instead of recalculating it again.

Signed-off-by: Alan Brady <alan.brady@intel.com>
Signed-off-by: Anirudh Venkataramanan <anirudh.venkataramanan@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2019-03-25 07:52:51 -07:00
Michal Swiatkowski
840bcd88f8 ice: Restore VLAN switch rule if port VLAN existed before
The VLAN rule is lost when VM starts or the AVF driver (iavf.ko) is
reloaded. So it is necessary to add this rule again.

Signed-off-by: Michal Swiatkowski <michal.swiatkowski@intel.com>
Signed-off-by: Anirudh Venkataramanan <anirudh.venkataramanan@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2019-03-25 07:41:49 -07:00
Victor Raj
b0153fdd7e ice: update VSI config dynamically
When VSI increases the number of queues dynamically, the scheduler
just needs to add the new required nodes rather than re-adjusting with
previously allocated number of nodes. Readjusting didn't provide enough
parents to add the upper layer nodes and also couldn't place the LAN and
RDMA subtrees separately.

In the decrease case, always keep the VSI configuration with the max number
of queues. This will leave some extra nodes in the tree but no harm done.

Signed-off-by: Victor Raj <victor.raj@intel.com>
Reviewed-by: Bruce Allan <bruce.w.allan@intel.com>
Signed-off-by: Anirudh Venkataramanan <anirudh.venkataramanan@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2019-03-25 07:23:20 -07:00
David S. Miller
68cc2999f6 Merge branch 'devlink-small-spring-cleanup'
Jiri Pirko says:

====================
devlink: small spring cleanup

Mostly cosmetics and janitor work.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-24 14:55:31 -04:00
Jiri Pirko
f6b19b354d net: devlink: select NET_DEVLINK from drivers
Some drivers are becoming more dependent on NET_DEVLINK being selected
in configuration. With upcoming compat functions, the behavior would be
wrong in case devlink was not compiled in. So make the drivers select
NET_DEVLINK and rely on the functions being there, not just stubs.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-24 14:55:31 -04:00
Jiri Pirko
b8f975545c net: devlink: add port type spinlock
Add spinlock to protect port type and type_dev pointer consistency.
Without that, userspace may see inconsistent type and type_dev
combinations.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
v1->v2:
- rebased
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-24 14:55:31 -04:00
Jiri Pirko
2b239e7090 net: devlink: warn on setting type on unregistered port
Port needs to be registered first before the type is set. Warn and
bail-out in case it is not.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-24 14:55:31 -04:00
Jiri Pirko
d0d54e8c35 bnxt: set devlink port type after registration
Move the type set of devlink port after it is registered.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-24 14:55:31 -04:00
Jiri Pirko
faaccbe6eb nfp: move devlink port type set after netdev registration
Similar to other drivers, move the port type set after netdev registration
is done. Along with that, clear the type before unregistration.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-24 14:55:31 -04:00
Jiri Pirko
45b861120e net: devlink: disallow port_attrs_set() to be called before register
Since the port attributes are static and cannot change during the port
lifetime, WARN_ON if some driver calls it after registration. Also, no
need to call notifications as it is noop anyway due to check of
devlink_port->registered there.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-24 14:55:31 -04:00
Jiri Pirko
d8ba36204c dsa: move devlink_port_attrs_set() call before register
Since attrs are static during the existence of devlink port, set them
before registration of the port.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-24 14:55:31 -04:00
Jiri Pirko
e519418f89 mlxsw: Move devlink_port_attrs_set() call before register
Since attrs are static during the existence of devlink port, set them
before registration of the port.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-24 14:55:31 -04:00
Jiri Pirko
773b1f38e3 net: devlink: don't pass return value of __devlink_port_type_set()
__devlink_port_type_set() returns void, it makes no sense to pass it on,
so don't do that.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-24 14:55:31 -04:00
Jiri Pirko
e0dcd386d1 net: devlink: don't take devlink_mutex for devlink_compat_*
The netdevice is guaranteed to not disappear so we can rely that
devlink_port and devlink won't disappear as well. No need to take
devlink_mutex so don't take it here.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-24 14:55:31 -04:00
Jiri Pirko
c3f10cbcaa bnxt: call devlink_port_type_eth_set() before port register
Call devlink_port_type_eth_set() before devlink_port_register(). Bnxt
instances won't change type during lifetime. This avoids one extra
userspace devlink notification.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-24 14:55:30 -04:00
Jiri Pirko
a0e18132ec bnxt: set devlink port attrs properly
Set the attrs properly so devlink has enough info to generate physical
port names.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-24 14:55:30 -04:00
Jiri Pirko
402f99e550 dsa: add missing net/devlink.h include
devlink functions are in use, so include the related header file.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-24 14:55:30 -04:00
Jiri Pirko
477edb7806 bnxt: add missing net/devlink.h include
devlink functions are in use, so include the related header file.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-24 14:55:30 -04:00
Jiri Pirko
375cf8c643 net: devlink: add couple of missing mutex_destroy() calls
Add missing calls to mutex_destroy() for two mutexes used
in devlink code.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-24 14:55:30 -04:00
David S. Miller
956ca8fc5c Merge branch 'aquantia-rx-perf'
Igor Russkikh says:

====================
net: aquantia: RX performance optimization patches

Here is a set of patches targeting performance improvements
on various platforms and protocols.

Our main target was rx performance on iommu systems, notably
NVIDIA Jetson TX2 and NVIDIA Xavier platforms.

We introduce page reuse strategy to better deal with iommu dma mapping costs.
With it we see 80-90% of page reuse under some test configurations on UDP traffic.

This shows good improvements on other systems with IOMMU hardware, like
AMD Ryzen.

We've also improved TCP LRO configuration parameters, allowing packets to better
coalesce.

Page reuse tests were carried out using iperf3, iperf2, netperf and pktgen.
Mainly on UDP traffic, with various packet lengths.

Jetson TX2, UDP, Default MTU:
RX Lost Datagrams
  Before: Max: 69%  Min: 68% Avg: 68.5%
  After:  Max: 41%  Min: 38% Avg: 39.2%
Maximum throughput
  Before: 1.27 Gbits/sec
  After:  2.41 Gbits/sec

AMD Ryzen 5 2400G, UDP, Default MTU:
RX Lost Datagrams
  Before:  Max: 12%  Min: 4.5% Avg: 7.17%
  After:   Max: 6.2% Min: 2.3% Avg: 4.26%
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-23 22:16:54 -04:00
Igor Russkikh
d0d443cddb net: aquantia: enable driver build for arm64 or compile_test
The driver is now constantly tested in our lab on aarch64 hardware:
Jetson tx2, Pascal and Xavier tegra based hardware.
Many of the tegra smmu related HW bugs were already fixed or worked around.

Thus, add ARM64 into Kconfig.

Add also COMPILE_TEST dependency.

Signed-off-by: Igor Russkikh <igor.russkikh@aquantia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-23 22:16:53 -04:00
Nikita Danilov
1eef4757ce net: aquantia: improve LRO configuration
Default LRO HW configuration was very conservative.

Low Number of Descriptors per LRO Sequence, small session
timeout, inefficient settings in interrupt generation logic.

Change max number of LRO descriptors from 2 to 16 to
increase performance. Increase maximum coalescing interval
in HW to 250uS. Tune up HW LRO interrupt generation setting
to prevent hw issues with long LRO sessions.

Signed-off-by: Nikita Danilov <nikita.danilov@aquantia.com>
Signed-off-by: Igor Russkikh <igor.russkikh@aquantia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-23 22:16:53 -04:00
Igor Russkikh
1b09e72d16 net: aquantia: Increase rx ring default size from 1K to 2K
For multigig rates 1K ring size is often not enough and causes extra
packet drops in hardware.

Signed-off-by: Igor Russkikh <igor.russkikh@aquantia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-23 22:16:53 -04:00
Igor Russkikh
8bd7e7639d net: aquantia: Make RX default frame size 2K
This correlates with default internet MTU. This also allows page
flip/reuse to be activated, since each allocated RX page now serves for
two frags/packets.

Signed-off-by: Igor Russkikh <igor.russkikh@aquantia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-23 22:16:53 -04:00
Igor Russkikh
9773ef18b8 net: aquantia: Introduce rx refill threshold value
Before that, we refilled the ring even on a single descriptor move.
Under high packet load that caused page allocation logic to be triggered
too often. That made overall ring processing slower.

Moreover, with page buffer reuse implemented, we should give higher
networking levels a chance to process received packets faster, release
the pages they consumed and therefore give a higher chance for these
pages to be reused.

RX ring is now refilled only when AQ_CFG_RX_REFILL_THRES or more
descriptors were processed (32 by default). Under regular traffic this
gives quite enough time for packet to be consumed and page to be reused.
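
The threshold logic can be modelled as follows. This is a minimal sketch: the struct and function names are hypothetical, the refill counter stands in for the real buffer allocation, and only the default of 32 comes from the text above.

```c
/* Default threshold named in the commit message. */
#define RX_REFILL_THRES 32

/* Userspace model of an Rx ring; field names are hypothetical. */
struct rx_ring {
    unsigned int pending;   /* descriptors processed since last refill */
    unsigned int refills;   /* how many refill passes have run */
};

/* Account for n processed descriptors; refill only past the threshold. */
static void rx_ring_processed(struct rx_ring *r, unsigned int n)
{
    r->pending += n;
    if (r->pending >= RX_REFILL_THRES) {
        r->refills++;       /* stand-in for allocating/mapping buffers */
        r->pending = 0;
    }
}
```

Batching the refill this way trades a slightly emptier ring for fewer trips into the page allocator.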

Signed-off-by: Igor Russkikh <igor.russkikh@aquantia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-23 22:16:53 -04:00
Igor Russkikh
46f4c29d9d net: aquantia: optimize rx performance by page reuse strategy
We introduce internal aq_rxpage wrapper over regular page
where extra field is tracked: rxpage offset inside of allocated page.

This offset allows reusing one page for multiple packets.
When needed (for example with large frames processing), allocated
pageorder could be customized. This gives even larger page reuse
efficiency.

page_ref_count is used to track page users. If during rx refill
underlying page has users, we increase pg_off by rx frame size
thus the top half of the page is reused.
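
A rough userspace model of the refill decision described above. Everything here is an assumption made for illustration: the struct layout, the 4K page and 2K frame sizes, and the plain `users` counter standing in for page_ref_count().

```c
#include <stdbool.h>

#define PAGE_SZ  4096u   /* assumed base page size */
#define RX_FRAME 2048u   /* assumed Rx frame size */

/* Hypothetical model of the aq_rxpage wrapper. */
struct rxpage {
    bool allocated;
    unsigned int order;   /* page order: size is PAGE_SZ << order */
    unsigned int pg_off;  /* offset of the current frame in the page */
    unsigned int users;   /* models page_ref_count() */
};

/*
 * Prepare the page for the next Rx frame.
 * Returns true if the existing page was reused, false if a fresh
 * page had to be "allocated".
 */
static bool rxpage_refill(struct rxpage *rp)
{
    unsigned int size = PAGE_SZ << rp->order;

    if (rp->allocated) {
        if (rp->users > 1) {
            /* Someone still reads the old frame: move to the next slot. */
            rp->pg_off += RX_FRAME;
            if (rp->pg_off + RX_FRAME <= size)
                return true;
            /* No free slot left: fall through and take a new page. */
        } else {
            /* Nobody else holds the page: start over at offset 0. */
            rp->pg_off = 0;
            return true;
        }
    }

    rp->allocated = true;
    rp->pg_off = 0;
    rp->users = 1;
    return false;
}
```

With a larger `order` the same page offers more slots, which is why a customized page order can raise the reuse rate.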

Signed-off-by: Igor Russkikh <igor.russkikh@aquantia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-23 22:16:53 -04:00
Igor Russkikh
7e2698c4fd net: aquantia: optimize rx path using larger preallocated skb len
Atlantic driver used 14 bytes preallocated skb size. That made L3 protocol
processing inefficient because pskb_pull had to fetch all the L3/L4 headers
from extra fragments.

Especially on UDP flows, that caused extra packet drops because CPU was
overloaded with pskb_pull.

This patch uses eth_get_headlen for skb preallocation.

Signed-off-by: Igor Russkikh <igor.russkikh@aquantia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-23 22:16:53 -04:00
David S. Miller
d64fee0a03 mlx5-updates-2019-03-20
Merge tag 'mlx5-updates-2019-03-20' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux

Saeed Mahameed says:

====================
mlx5-updates-2019-03-20

This series includes updates to mlx5 driver,

1) Compiler warnings cleanup from Saeed Mahameed
2) Parav Pandit simplifies sriov enable/disables
3) Gustavo A. R. Silva, Removes a redundant assignment
4) Moshe Shemesh, Adds Geneve tunnel stateless offload support
5) Eli Britstein, Adds the Support for VLAN modify action and
   Replaces TC VLAN pop and push actions with VLAN modify

Note: This series includes two simple non-mlx5 patches,

1) Declare IANA_VXLAN_UDP_PORT definition in include/net/vxlan.h,
and use it in some drivers.
2) Declare GENEVE_UDP_PORT definition in include/net/geneve.h,
and use it in mlx5 and nfp drivers.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-23 22:03:44 -04:00
David S. Miller
071d08af38 Merge branch '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue
Jeff Kirsher says:

====================
100GbE Intel Wired LAN Driver Updates 2019-03-22

This series contains updates to ice driver only.

Akeem enables MAC anti-spoofing by default when a new VSI is being
created.  Fixes an issue when reclaiming VF resources back to the pool
after reset, by freeing VF resources separately using the first VF
vector index to traverse the list, instead of starting at the last
assigned vectors list.  Added support for VF & PF promiscuous mode in
the ice driver.  Fixed the PF driver from letting the VF know it is "not
trusted" when it attempts to add more than its permitted additional MAC
addresses.  Altered how the driver gets the VF VSIs instances, instead
of using the mailbox messages to retrieve VSIs, get it directly via the
VF object in the PF data structure.

Bruce fixes return values to resolve static analysis warnings.  Made
whitespace changes to increase readability and reduce code wrapping.

Anirudh cleans up code by removing a function prototype that was never
implemented and removed an unused field in the ice_sched_vsi_info
structure.

Kiran fixes a potential divide by zero issue by adding a check.

Victor cleans up the transmit scheduler by adjusting the stack variable
usage and added/modified debug prints to make them more useful.

Yashaswini updates the driver in VEB mode to ensure that the LAN_EN bit
is set if all the right conditions are met.

Christopher ensures the loopback enable bit is not set for prune switch
rules, since all transmit traffic would be looped back to the internal
switch and dropped.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-23 22:02:54 -04:00
David S. Miller
bdaba8959e Merge branch 'tcp-rx-tx-cache'
Eric Dumazet says:

====================
tcp: add rx/tx cache to reduce lock contention

On hosts with many cpus we can observe a very serious contention
on spinlocks used in mm slab layer.

The following can happen quite often :

1) TX path
  sendmsg() allocates one (fclone) skb on CPU A, sends a clone.
  ACK is received on CPU B, and consumes the skb that was in the retransmit
  queue.

2) RX path
  network driver allocates skb on CPU C
  recvmsg() happens on CPU D, freeing the skb after it has been delivered
  to user space.

In both cases, we are hitting the asymmetric alloc/free pattern
for which slab has to drain alien caches. At 8 Mpps,
this represents 16 million alloc/free operations per second and has a huge penalty.

In an interesting experiment, I tried to use a single kmem_cache for all the skbs
(in skb_init() : skbuff_fclone_cache = skbuff_head_cache =
                  kmem_cache_create("skbuff_fclone_cache", sizeof(struct sk_buff_fclones),);
and most of the contention disappeared, since cpus could better use
their local slab per-cpu cache.

But we can do actually better, in the following patches.

TX : at ACK time, no longer free the skb but put it back in a tcp socket cache,
     so that next sendmsg() can reuse it immediately.

RX : at recvmsg() time, do not free the skb but put it in a tcp socket cache
   so that it can be freed by the cpu feeding the incoming packets in BH.

This increased the performance of small RPC benchmark by about 10 % on a host
with 112 hyperthreads.

v2 : - Solved a race condition : sk_stream_alloc_skb() to make sure the prior
       clone has been freed.
     - Really test rps_needed in sk_eat_skb() as claimed.
     - Fixed rps_needed use in drivers/net/tun.c

v3: Added a #ifdef CONFIG_RPS, to avoid compile error (kbuild robot)
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-23 21:57:38 -04:00
Eric Dumazet
8b27dae5a2 tcp: add one skb cache for rx
Often times, recvmsg() system calls and BH handling for a particular
TCP socket are done on different cpus.

This means the incoming skb had to be allocated on a cpu,
but freed on another.

This incurs a high spinlock contention in slab layer for small rpc,
but also a high number of cache line ping pongs for larger packets.

A full size GRO packet might use 45 page fragments, meaning
that up to 45 put_page() can be involved.

Moreover, performing the __kfree_skb() in the recvmsg() context
adds a latency for user applications, and increases the probability
of trapping them in backlog processing, since the BH handler
might find the socket owned by the user.

This patch, combined with the prior one increases the rpc
performance by about 10 % on servers with large number of cores.

(tcp_rr workload with 10,000 flows and 112 threads reach 9 Mpps
 instead of 8 Mpps)

This also increases single bulk flow performance on 40Gbit+ links,
since in this case there are often two cpus working in tandem :

 - CPU handling the NIC rx interrupts, feeding the receive queue,
  and (after this patch) freeing the skbs that were consumed.

 - CPU in recvmsg() system call, essentially 100 % busy copying out
  data to user space.

Having at most one skb in a per-socket cache has very little risk
of memory exhaustion, and since it is protected by socket lock,
its management is essentially free.

Note that if rps/rfs is used, we do not enable this feature, because
there is high chance that the same cpu is handling both the recvmsg()
system call and the TCP rx path, but that another cpu did the skb
allocations in the device driver right before the RPS/RFS logic.

To properly handle this case, it seems we would need to record
on which cpu skb was allocated, and use a different channel
to give skbs back to this cpu.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-23 21:57:38 -04:00
Eric Dumazet
472c2e07ee tcp: add one skb cache for tx
On hosts with a lot of cores, RPC workloads suffer from heavy contention on slab spinlocks.

    20.69%  [kernel]       [k] queued_spin_lock_slowpath
     5.64%  [kernel]       [k] _raw_spin_lock
     3.83%  [kernel]       [k] syscall_return_via_sysret
     3.48%  [kernel]       [k] __entry_text_start
     1.76%  [kernel]       [k] __netif_receive_skb_core
     1.64%  [kernel]       [k] __fget

For each sendmsg(), we allocate one skb, and free it at the time ACK packet comes.

In many cases, ACK packets are handled by other cpus, and this unfortunately
incurs heavy costs for slab layer.

This patch uses an extra pointer in socket structure, so that we try to reuse
the same skb and avoid these expensive costs.

We cache at most one skb per socket so this should be safe as far as
memory pressure is concerned.
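
The one-slot cache idea can be illustrated in userspace C. This is a sketch, not the kernel change: `struct buf`, `struct sock_model` and both helpers are hypothetical, malloc()/free() stand in for the slab allocator, and the socket lock that protects the slot in the kernel is omitted.

```c
#include <stdlib.h>

/* Stand-in for an skb. */
struct buf {
    char data[256];
};

/* Stand-in for a socket with one extra cache pointer. */
struct sock_model {
    struct buf *tx_cache;   /* at most one cached buffer */
};

/* Allocate a buffer, preferring the per-socket cached one. */
static struct buf *sock_alloc_buf(struct sock_model *sk)
{
    struct buf *b = sk->tx_cache;

    if (b) {
        sk->tx_cache = NULL;     /* reuse: no allocator call needed */
        return b;
    }
    return malloc(sizeof(*b));   /* cache miss: fall back to allocator */
}

/* Release a buffer: park it in the cache if the slot is empty. */
static void sock_free_buf(struct sock_model *sk, struct buf *b)
{
    if (!sk->tx_cache) {
        sk->tx_cache = b;
        return;
    }
    free(b);                     /* slot taken: really free it */
}
```

Because the slot holds a single buffer, the worst-case extra memory is one buffer per socket, which is the safety argument made above.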

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-23 21:57:38 -04:00