Zero-length and one-element arrays are deprecated, see
Documentation/process/deprecated.rst
Flexible-array members should be used instead.
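As a rough illustration of the conversion (a userspace sketch, not the code touched by this patch; struct gate_list and gate_list_alloc() are hypothetical names), a trailing flexible-array member is declared with empty brackets and sized at allocation time:

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical structure, not the one changed by this patch. */
struct gate_list {
	size_t num_entries;
	unsigned int entries[];	/* flexible-array member, must be last */
};

/* Allocate a gate_list with room for n entries and copy them from src. */
static struct gate_list *gate_list_alloc(const unsigned int *src, size_t n)
{
	struct gate_list *gl;

	/* Header plus n trailing elements in a single allocation. */
	gl = malloc(sizeof(*gl) + n * sizeof(gl->entries[0]));
	if (!gl)
		return NULL;

	gl->num_entries = n;
	memcpy(gl->entries, src, n * sizeof(gl->entries[0]));
	return gl;
}
```

In kernel code the allocation size would normally be computed with the struct_size() helper, which also guards against overflow.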
Generated by: scripts/coccinelle/misc/flexible_array.cocci
Fixes: 23ae3a7877 ("net: dsa: felix: add stream gate settings for psfp")
CC: Xiaoliang Yang <xiaoliang.yang_1@nxp.com>
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: kernel test robot <lkp@intel.com>
Signed-off-by: Julia Lawall <julia.lawall@inria.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
Guangbin Huang says:
====================
hns3: some cleanups for -next
To improve code readability and simplicity, this series refactors some
functions in the HNS3 ethernet driver.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Function hns3_set_l2l3l4() is a bit too long. So add two
new functions hns3_set_l3_type() and hns3_set_l4_csum_length()
to simplify code and improve code readability.
Signed-off-by: Yufeng Mo <moyufeng@huawei.com>
Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Function hns3_handle_bdinfo() is a bit too long. So add two
new functions hns3_handle_rx_ts_info() and hns3_handle_rx_vlan_tag()
to simplify code and improve code readability.
Signed-off-by: Yufeng Mo <moyufeng@huawei.com>
Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Function hns3_nic_get_stats64() is a bit too long. So add a
new function hns3_fetch_stats() to simplify code and improve
code readability.
Signed-off-by: Yufeng Mo <moyufeng@huawei.com>
Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch encapsulates the queue to qset configuration code of the two
modes (tc based and vnet based) into two functions, to make the code more
concise.
Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch encapsulates the process code of tc based schedule mode of
function hclge_tm_lvl34_schd_mode_cfg() into a new function
hclge_tm_schd_mode_tc_base_cfg(). It makes the code more concise and the
new function can be reused.
Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
To reuse the code of converting speed of driver to speed of firmware in
function hclge_cfg_mac_speed_dup_hw(), encapsulate them into a new
function hclge_convert_to_fw_speed().
Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Function hns3_get_tx_timeo_queue_info() is a bit too long. So add two
new functions hns3_dump_queue_stats() and hns3_dump_queue_reg() to
simplify code and improve code readability.
Signed-off-by: Yufeng Mo <moyufeng@huawei.com>
Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use for loops to simplify the print code in functions
hclge_dbg_dump_rst_info() and hclge_dbg_dump_mac_enable_status() to
improve code simplicity.
Signed-off-by: Jie Wang <wangjie125@huawei.com>
Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Split rx copybreak handle into a separate function from function
hns3_nic_reuse_page() to improve code simplicity.
Signed-off-by: Hao Chen <chenhao288@hisilicon.com>
Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, the hclge_reset_prepare_general function uses the goto
statement to jump upwards, which increases code complexity and makes
the program structure difficult to understand. In addition, retry_cnt is
not incremented when reset_pending is set, so the loop may never exit or
may retry more often than intended.
Use a while statement instead to make the program easier to understand
and to guarantee that the retry loop terminates.
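The shape of such a bounded retry loop can be sketched in userspace C as follows. This is only an illustration under assumptions: struct fake_dev, prepare_once(), reset_is_pending() and RETRY_MAX are hypothetical stand-ins, not the driver's names or values.

```c
#include <stdbool.h>

#define RETRY_MAX 5	/* hypothetical retry budget */

/* Hypothetical stand-in for the device state. */
struct fake_dev {
	int pending_left;	/* checks for which a reset is still pending */
	int fail_left;		/* attempts for which prepare fails */
	int attempts;		/* number of prepare attempts made */
};

static int prepare_once(struct fake_dev *d)
{
	d->attempts++;
	if (d->fail_left > 0) {
		d->fail_left--;
		return -1;
	}
	return 0;
}

static bool reset_is_pending(struct fake_dev *d)
{
	if (d->pending_left > 0) {
		d->pending_left--;
		return true;
	}
	return false;
}

/* Returns 0 on success, -1 once the retry budget is exhausted. */
static int reset_prepare(struct fake_dev *d)
{
	int retry_cnt = 0;

	/* The counter advances on every pass, so the loop always ends. */
	while (retry_cnt++ < RETRY_MAX) {
		if (prepare_once(d) == 0 && !reset_is_pending(d))
			return 0;
	}
	return -1;
}
```

Because the counter is incremented in the loop condition, a permanently pending reset can no longer keep the loop alive.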
Signed-off-by: Jiaran Zhang <zhangjiaran@huawei.com>
Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When the TCP small queue check fails in tcp_small_queue_check(), TCP
throughput is limited, and it is hard to tell whether the limit comes
from this check or from TCP congestion control.
Add the LINUX_MIB_TCPSMALLQUEUEFAILURE statistic for this case.
Signed-off-by: Menglong Dong <imagedong@tencent.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
DEVLINK_CMD_HEALTH_REPORTER_DUMP_GET command doesn't have a .doit callback
and makes no use of internal_flags at all. Remove this misleading assignment.
Fixes: e44ef4e451 ("devlink: Hang reporter's dump method on a dumpit cb")
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Colin Foster says:
====================
update seville to use shared MDIO driver
This patch set exposes and utilizes the shared MDIO bus in
drivers/net/mdio/mdio-mscc-miim.c
v3:
* Fix errors using uninitialized "dev" inside the probe function.
* Remove phy_regmap from the setup function, since it currently
isn't used
* Remove GCB_PHY_PHY_CFG definition from ocelot.h - it isn't used
yet...
v2:
* Error handling (thanks Andrew Lunn)
* Fix logic errors calling mscc_miim_setup during patch 1/3 (thanks
Jakub Kicinski)
* Remove unnecessary felix_mdio file (thanks Vladimir Oltean)
* Pass NULL to mscc_miim_setup instead of GCB_PHY_PHY_CFG, since the
phy reset isn't handled at that point of the Seville driver (patch
3/3)
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Switch to a shared MDIO access implementation by way of the mdio-mscc-miim
driver.
Signed-off-by: Colin Foster <colin.foster@in-advantage.com>
Tested-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Switch seville to use of_mdiobus_register(bus, NULL) instead of just
mdiobus_register. This code is about to be pulled into a separate module
that can optionally define ports by the device_node.
Signed-off-by: Colin Foster <colin.foster@in-advantage.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Tested-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Utilize regmap instead of __iomem to perform indirect mdio access. This
will allow for custom regmaps to be used by way of the mscc_miim_setup
function.
Signed-off-by: Colin Foster <colin.foster@in-advantage.com>
Tested-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Horatiu Vultur says:
====================
net: lan966x: Add lan966x switch driver
This patch series adds support for the Microchip lan966x driver.
The lan966x switch is a multi-port Gigabit AVB/TSN Ethernet Switch with
two integrated 10/100/1000Base-T PHYs. In addition to the integrated PHYs,
it supports up to 2 RGMII/RMII, up to 3 BASE-X/SERDES/2.5GBASE-X and up to
2 Quad-SGMII/Quad-USGMII interfaces.
Initially it adds support only for the ports to behave as simple
NIC cards. In future patches it will be extended with other
functionality like Switchdev, PTP, Frame DMA, VCAP, etc.
v4->v5:
- more fixes to the reset of the switch, require all resources before
activating the hardware
- fix to lan966x-switch binding
- implement get/set_pauseparam in ethtool_ops
- stop calling lan966x_port_link_down when calling lan966x_port_pcs_set and
call it in lan966x_phylink_mac_link_down
v3->v4:
- add timeouts when injecting/extracting frames, in case the HW breaks
- simplify the creation of the IFH
- fix the order of operations in lan966x_cleanup_ports
- fixes to phylink based on Russell's review
v2->v3:
- fix compiling issues for x86
- fix resource management in first patch
v1->v2:
- add new patch for MAINTAINERS
- add functions lan966x_mac_cpu_learn/forget
- fix build issues with second patch
- fix the reset of the switch, return error if there is no reset controller
- start to use phylink_mii_c22_pcs_decode_state and
phylink_mii_c22_pcs_encode_advertisement to remove duplicate code
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Update MAINTAINERS to include lan966x driver
Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds support for statistics counters for the network
interfaces. Also adds support for configuring the network interface via
ethtool like: speed, duplex etc.
Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds support for MAC table operations like add and forget.
Also add the functionality to read the MAC address from DT; if no MAC
is set in DT, a random one is used.
Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds support for netdev and phylink in the switch. The
injection + extraction is register based. This will be replaced with DMA
access.
Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds basic SwitchDev driver framework for lan966x. It
includes only the IO range mapping and probing of the switch.
Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Document the lan966x switch device driver bindings
Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
IXP4xx is being migrated to device tree only. Convert this
driver to use device tree probing.
Pull in all the boardfile code from the one boardfile and
make it local, pull all the boardfile parameters from the
device tree instead of the board file.
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This adds device tree bindings for the IXP4xx V.35 WAN high
speed serial (HSS) link.
An example is added to the NPE example where the HSS appears
as a child.
Cc: devicetree@vger.kernel.org
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
Reviewed-by: Rob Herring <robh@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
A contact at Realtek has clarified what exactly the units of RGMII RX
delay are. The answer is that the unit of RX delay is "about 0.3 ns".
Take this into account when parsing rx-internal-delay-ps by
approximating the closest step value. Delays of more than 2.1 ns are
rejected.
This obviously contradicts the previous assumption in the driver that a
step value of 4 was "about 2 ns", but Realtek also points out that it is
easy to find more than one RX delay step value which makes RGMII work.
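The rounding described above can be sketched as follows. This is a standalone illustration, not the driver code; the function and macro names are hypothetical, and the maximum of 7 steps is an assumption derived from 2.1 ns / 0.3 ns.

```c
#include <errno.h>

#define RX_DELAY_STEP_PS	300	/* "about 0.3 ns" per step */
#define RX_DELAY_MAX_STEP	7	/* assumed: 7 * 300 ps = 2.1 ns */

/* Map a delay in picoseconds to the closest step, or -EINVAL if too large. */
static int rx_delay_ps_to_step(unsigned int delay_ps)
{
	if (delay_ps > RX_DELAY_STEP_PS * RX_DELAY_MAX_STEP)
		return -EINVAL;

	/* Round to nearest by adding half a step before dividing. */
	return (int)((delay_ps + RX_DELAY_STEP_PS / 2) / RX_DELAY_STEP_PS);
}
```

With these assumptions a requested delay of 2000 ps maps to step 7 rather than step 4, which reflects the corrected understanding of the unit.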
Fixes: 4af2950c50 ("net: dsa: realtek-smi: add rtl8365mb subdriver for RTL8365MB-VC")
Cc: Arınç ÜNAL <arinc.unal@arinc9.com>
Signed-off-by: Alvin Šipraga <alsi@bang-olufsen.dk>
Acked-by: Arınç ÜNAL <arinc.unal@arinc9.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fixes: 4af2950c50 ("net: dsa: realtek-smi: add rtl8365mb subdriver for RTL8365MB-VC")
Signed-off-by: Alvin Šipraga <alsi@bang-olufsen.dk>
Signed-off-by: David S. Miller <davem@davemloft.net>
Probe deferral is not an error, so don't log this as an error:
[0.590156] realtek-smi ethernet-switch: unable to register switch ret = -517
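The decision can be sketched in userspace C as below. The helper name is hypothetical and this is not the patch itself; in kernel code the dev_err_probe() helper exists for this purpose.

```c
#include <stdbool.h>

#define EPROBE_DEFER 517	/* kernel-internal code, matches "ret = -517" above */

/* Return true when a registration failure deserves an error-level message. */
static bool should_log_as_error(int ret)
{
	/* Deferral only means "try again later", so it is not an error. */
	return ret < 0 && ret != -EPROBE_DEFER;
}
```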
Signed-off-by: Alvin Šipraga <alsi@bang-olufsen.dk>
Signed-off-by: David S. Miller <davem@davemloft.net>
Prior patch:
# TESTS=vlmc_filtering_test ./bridge_vlan_mcast.sh
TEST: Vlan multicast snooping enable [ OK ]
Device "bridge" does not exist.
TEST: Disable multicast vlan snooping when vlan filtering is disabled [FAIL]
Vlan filtering is disabled but multicast vlan snooping is still enabled
After patch:
# TESTS=vlmc_filtering_test ./bridge_vlan_mcast.sh
TEST: Vlan multicast snooping enable [ OK ]
TEST: Disable multicast vlan snooping when vlan filtering is disabled [ OK ]
Fixes: f5a9dd58f4 ("selftests: net: bridge: add test for vlan_filtering dependency")
Cc: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Benjamin Poirier says:
====================
net: mpls: Cleanup nexthop iterator macros
The mpls macros for_nexthops and change_nexthops were probably copied
from decnet or ipv4 but they grew a superfluous variable and lost a
beneficial "const".
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
There are separate for_nexthops and change_nexthops iterators. The
for_nexthops variant should use const.
Signed-off-by: Benjamin Poirier <bpoirier@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
__nh is just a copy of nh with a different type.
Signed-off-by: Benjamin Poirier <bpoirier@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Stephan Gerhold says:
====================
net: wwan: Add Qualcomm BAM-DMUX WWAN network driver
The BAM Data Multiplexer provides access to the network data channels
of modems integrated into many older Qualcomm SoCs, e.g. Qualcomm MSM8916
or MSM8974. This series adds a driver that allows using it.
All the changes in this patch series are based on a quite complicated
driver from Qualcomm [1]. The driver has been used in postmarketOS [2]
on various smartphones/tablets based on Qualcomm MSM8916 and MSM8974
for more than a year now with no reported problems. It works out of
the box with open-source WWAN userspace such as ModemManager.
[1]: https://source.codeaurora.org/quic/la/kernel/msm-3.10/tree/drivers/soc/qcom/bam_dmux.c?h=LA.BR.1.2.9.1-02310-8x16.0
[2]: https://postmarketos.org/
Changes in v3:
- Clarify DT schema based on discussion
- Drop bam_dma/dmaengine patches since they already landed in 5.16
- Rebase on net-next
- Simplify cover letter and commit messages
Changes in v2:
- Rename "qcom,remote-power-collapse" -> "qcom,powered-remotely"
- Rebase on net-next and fix conflicts
- Rename network interfaces from "rmnet%d" -> "wwan%d"
- Fix wrong file name in MAINTAINERS entry
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
The BAM Data Multiplexer provides access to the network data channels of
modems integrated into many older Qualcomm SoCs, e.g. Qualcomm MSM8916 or
MSM8974. It is built using a simple protocol layer on top of a DMA engine
(Qualcomm BAM) and bidirectional interrupts to coordinate power control.
The modem announces a fixed set of channels by sending an OPEN command.
The driver exports each channel as separate network interface so that
a connection can be established via QMI from userspace. The network
interface can work either in Ethernet or Raw-IP mode (configurable via
QMI). However, Ethernet mode seems to be broken with most firmwares
(network packets are actually received as Raw-IP), therefore the driver
only supports Raw-IP mode.
Note that the control channel (QMI/AT) is entirely separate from
BAM-DMUX and is already supported by the RPMSG_WWAN_CTRL driver.
The driver uses runtime PM to coordinate power control with the modem.
TX/RX buffers are put in a kind of "ring queue" and submitted via
the bam_dma driver of the DMAEngine subsystem.
The basic architecture looks roughly like this:
           +------------+                +-------+
 [IPv4/6]  |  BAM-DMUX  |                |       |
 [Data...] |            |                |       |
---------->|wwan0       | [DMUX chan: x] |       |
 [IPv4/6]  | (chan: 0)  |    [IPv4/6]    |       |
 [Data...] |            |   [Data...]    |       |
---------->|wwan1       |--------------->| Modem |
           | (chan: 1)  |      BAM       |       |
 [IPv4/6]  | ...        |  (DMA Engine)  |       |
 [Data...] |            |                |       |
---------->|wwan7       |                |       |
           | (chan: 7)  |                |       |
           +------------+                +-------+
Note that some newer firmware versions support QMAP ("rmnet" driver)
as additional multiplexing layer on top of BAM-DMUX, but this is not
currently supported by this driver.
Signed-off-by: Stephan Gerhold <stephan@gerhold.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
The BAM Data Multiplexer provides access to the network data channels of
modems integrated into many older Qualcomm SoCs, e.g. Qualcomm MSM8916 or
MSM8974. It is built using a simple protocol layer on top of a DMA engine
(Qualcomm BAM) and bidirectional interrupts to coordinate power control.
The device tree node combines the incoming interrupt with the outgoing
interrupts (smem-states) as well as the two DMA channels, which allows
the BAM-DMUX driver to request all necessary resources.
Signed-off-by: Stephan Gerhold <stephan@gerhold.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Guangbin Huang says:
====================
net: vxlan: add macro definition for number of IANA VXLAN-GPE port
This series adds a macro definition for the IANA VXLAN-GPE port number
as a cleanup.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch uses macro IANA_VXLAN_GPE_UDP_PORT to replace number 4790 for
cleanup.
Signed-off-by: Hao Chen <chenhao288@hisilicon.com>
Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add macro definition for number of IANA VXLAN-GPE port for generic use.
Signed-off-by: Hao Chen <chenhao288@hisilicon.com>
Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The writer acquires dev_base_lock with disabled bottom halves.
The reader can acquire dev_base_lock without disabling bottom halves
because there is no writer in softirq context.
On PREEMPT_RT the softirqs are preemptible and local_bh_disable() acts
as a lock to ensure that resources, that are protected by disabling
bottom halves, remain protected.
This leads to a circular locking dependency if the lock is acquired with
disabled bottom halves (as in write_lock_bh()) and somewhere else with
enabled bottom halves (as by read_lock() in netstat_show()) followed by
disabling bottom halves (cxgb_get_stats() -> t4_wr_mbox_meat_timeout()
-> spin_lock_bh()). This is the reverse locking order.
All read_lock() invocations are from sysfs callbacks, which are not invoked
from softirq context. Therefore there is no need to disable bottom
halves while acquiring a write lock.
Acquire the write lock of dev_base_lock without disabling bottom halves.
Reported-by: Pei Zhang <pezhang@redhat.com>
Reported-by: Luis Claudio R. Goncalves <lgoncalv@redhat.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Previously commit e02d494d2c ("l2tp: Convert rwlock to RCU") converted
most, but not all, rwlock instances in the l2tp subsystem to RCU.
The remaining rwlock protects the per-tunnel hashlist of sessions which
is used for session lookups in the UDP-encap data path.
Convert the remaining rwlock to rcu to improve performance of UDP-encap
tunnels.
Note that the tunnel and session, which both live on RCU-protected
lists, use slightly different approaches to incrementing their refcounts
in the various getter functions.
The tunnel has to use refcount_inc_not_zero because the tunnel shutdown
process involves dropping the refcount to zero prior to synchronizing
RCU readers (via kfree_rcu).
By contrast, the session shutdown removes the session from the list(s)
it is on, synchronizes with readers, and then decrements the session
refcount. Since the getter functions increment the session refcount
with the RCU read lock held we prevent getters seeing a zero session
refcount, and therefore don't need to use refcount_inc_not_zero.
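The "increment unless zero" semantics that the tunnel getters rely on can be illustrated with C11 atomics. This is a userspace sketch of the idea only, not the kernel's refcount_t implementation; ref_inc_not_zero() is a hypothetical name.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Take a reference only if the object is still live (count > 0). */
static bool ref_inc_not_zero(atomic_int *ref)
{
	int old = atomic_load(ref);

	while (old != 0) {
		/* On failure, old is reloaded with the current value. */
		if (atomic_compare_exchange_weak(ref, &old, old + 1))
			return true;
	}

	/* The count already hit zero: the object is being torn down. */
	return false;
}
```

A reader that finds a tunnel on the RCU list but gets false here must treat the lookup as a miss. The session getters can use a plain increment because, as described above, the count cannot reach zero while a reader holds the RCU read lock.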
Signed-off-by: Tom Parkin <tparkin@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Maxime Chevallier says:
====================
net: mvneta: mqprio cleanups and shaping support
This is the second version of the series that adds some improvements to the
existing mqprio implementation in mvneta, and adds support for
egress shaping offload.
The first 3 patches are minor cleanups: using the
tc_mqprio_qopt_offload structure to get access to more offloading
options, cleaning up the logic that detects whether or not we should
offload the mqprio setting, and allowing a 1 to N mapping between TCs and
queues.
The last patch adds traffic shaping offload, using mvneta's per-queue
token buckets, which can limit rates from 10Kbps up to 5Gbps in
10Kbps increments.
This was tested only on an Armada 3720, with traffic up to 2.5Gbps.
Changes since V1: fix the build for 32-bit kernels, using the right
div helpers as suggested by Jakub.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
The mvneta controller is able to do some token-bucket per-queue traffic
shaping. This commit adds support for setting these using the TC mqprio
interface.
The token-bucket parameters are customisable, but the current
implementation configures them to have a 10kbps resolution for the
rate limitation, since this covers the whole range of max_rate
values from 10kbps to 5Gbps in 10kbps increments.
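A sketch of what a 10kbps resolution implies for rate validation is shown below. The names are hypothetical, the rate is taken in kbps purely for illustration, and rejecting values that are not a multiple of the step (rather than rounding them) is an assumption, not necessarily what the driver does.

```c
#include <errno.h>
#include <stdint.h>

#define RATE_STEP_KBPS	10u		/* resolution of the token bucket */
#define RATE_MIN_KBPS	10u		/* 10 kbps */
#define RATE_MAX_KBPS	5000000u	/* 5 Gbps */

/* Convert a rate in kbps to token-bucket units of 10 kbps. */
static int rate_kbps_to_units(uint32_t kbps, uint32_t *units)
{
	if (kbps < RATE_MIN_KBPS || kbps > RATE_MAX_KBPS)
		return -EINVAL;

	/* Assumption: rates between two steps are rejected, not rounded. */
	if (kbps % RATE_STEP_KBPS)
		return -EINVAL;

	*units = kbps / RATE_STEP_KBPS;
	return 0;
}
```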
Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The current mqprio implementation assumed that we are only using one
queue per TC. Use the offset and count parameters to allow using
multiple queues per TC. In that case, the controller will use a standard
round-robin algorithm to pick queues assigned to the same TC, with the
same priority.
This only applies to VLAN priorities in ingress traffic, each TC
corresponding to a vlan priority.
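The offset/count scheme can be illustrated with a small standalone sketch. NUM_QUEUES and map_queues_to_tcs() are hypothetical, and the range and overlap checks are assumptions made for the example rather than the driver's actual validation.

```c
#include <errno.h>
#include <stdint.h>

#define NUM_QUEUES 8	/* hypothetical queue count */

/*
 * Fill queue_tc[q] with the TC that owns queue q, or -1 if unassigned.
 * offset[tc] and count[tc] describe one contiguous queue range per TC.
 */
static int map_queues_to_tcs(const uint16_t *offset, const uint16_t *count,
			     int num_tc, int8_t queue_tc[NUM_QUEUES])
{
	int tc, q;

	for (q = 0; q < NUM_QUEUES; q++)
		queue_tc[q] = -1;

	for (tc = 0; tc < num_tc; tc++) {
		if (offset[tc] + count[tc] > NUM_QUEUES)
			return -EINVAL;	/* range runs past the last queue */

		for (q = offset[tc]; q < offset[tc] + count[tc]; q++) {
			if (queue_tc[q] != -1)
				return -EINVAL;	/* queue already owned */
			queue_tc[q] = (int8_t)tc;
		}
	}
	return 0;
}
```

For example, offsets {0, 2} with counts {2, 3} give TC 0 queues 0-1 and TC 1 queues 2-4, leaving queues 5-7 unassigned.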
Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The qopt->hw flag is set by the TC code according to the offloading mode
asked by user. Don't force-set it in the driver, but instead read it to
make sure we do what's asked.
Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The struct tc_mqprio_qopt_offload is a container for struct tc_mqprio_qopt,
that allows passing extra parameters, such as traffic shaping. This commit
converts the current mqprio code to that new struct.
Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use devm_ioremap() instead of ioremap() to avoid a missing iounmap().
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Kuniyuki Iwashima says:
====================
af_unix: Replace unix_table_lock with per-hash locks.
The hash table of AF_UNIX sockets is protected by a single big lock,
unix_table_lock. This series replaces it with small per-hash locks.
1st - 2nd : Misc refactoring
3rd - 8th : Separate BSD/abstract address logics
9th - 11th : Prep to save a hash in each socket
12th : Replace the big lock
13th : Speed up autobind()
Note to maintainers:
The 12th patch adds two kinds of Sparse warnings on patchwork:
about unix_table_double_lock/unlock()
We can avoid this by adding two apparent acquires/releases annotations,
but there are the same kinds of warnings about unix_state_double_lock().
about unix_next_socket() and unix_seq_stop() (/proc/net/unix)
This is because Sparse does not understand logic in unix_next_socket(),
which leaves a spin lock held until it returns NULL.
Also, tcp_seq_stop() causes a warning for the same reason.
These warnings seem reasonable, but let me know if there is any better way.
Please see [0] for details.
[0]: https://lore.kernel.org/netdev/20211117001611.74123-1-kuniyu@amazon.co.jp/
====================
Link: https://lore.kernel.org/r/20211124021431.48956-1-kuniyu@amazon.co.jp
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When we bind an AF_UNIX socket without a name specified, the kernel selects
an available one from 0x00000 to 0xFFFFF. unix_autobind() starts searching
from a number in the 'static' variable and increments it after acquiring
two locks.
If multiple processes try autobind, they obtain the same lock and check if
a socket in the hash list has the same name. If not, one process uses it,
and all the others end up retrying with the _next_ number (or a later one,
since other processes may have incremented it in the meantime). The more
we autobind sockets in parallel, the longer the latency gets. We can avoid
such a race by searching for a name from a random number.
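The search strategy can be sketched in userspace C as follows. This is only an illustration of probing from a random starting point with wraparound; pick_free_name() and the byte-per-name table are hypothetical, and the real code checks the socket hash table under a lock instead.

```c
#include <stdint.h>

/*
 * Find a free name in [0, space), probing from start and wrapping around.
 * used[name] is nonzero when that name is already bound.
 * Returns the chosen name, or -1 if every name is taken.
 */
static long pick_free_name(const unsigned char *used, uint32_t space,
			   uint32_t start)
{
	uint32_t first = start % space;
	uint32_t name = first;

	do {
		if (!used[name])
			return (long)name;
		name = (name + 1) % space;
	} while (name != first);

	return -1;
}
```

For autobind the name space would be 0x100000 entries (0x00000 to 0xFFFFF), and start would come from a random number generator so that concurrent callers rarely probe the same name first.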
These show latency in unix_autobind() while 64 CPUs are simultaneously
autobind-ing 1024 sockets each.
Without this patch:
usec : count distribution
0 : 1176 |*** |
2 : 3655 |*********** |
4 : 4094 |************* |
6 : 3831 |************ |
8 : 3829 |************ |
10 : 3844 |************ |
12 : 3638 |*********** |
14 : 2992 |********* |
16 : 2485 |******* |
18 : 2230 |******* |
20 : 2095 |****** |
22 : 1853 |***** |
24 : 1827 |***** |
26 : 1677 |***** |
28 : 1473 |**** |
30 : 1573 |***** |
32 : 1417 |**** |
34 : 1385 |**** |
36 : 1345 |**** |
38 : 1344 |**** |
40 : 1200 |*** |
With this patch:
usec : count distribution
0 : 1855 |****** |
2 : 6464 |********************* |
4 : 9936 |******************************** |
6 : 12107 |****************************************|
8 : 10441 |********************************** |
10 : 7264 |*********************** |
12 : 4254 |************** |
14 : 2538 |******** |
16 : 1596 |***** |
18 : 1088 |*** |
20 : 800 |** |
22 : 670 |** |
24 : 601 |* |
26 : 562 |* |
28 : 525 |* |
30 : 446 |* |
32 : 378 |* |
34 : 337 |* |
36 : 317 |* |
38 : 314 |* |
40 : 298 | |
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>