mirror of
https://github.com/torvalds/linux.git
synced 2024-12-14 15:13:52 +00:00
320e77acb3
Issue Description: We have seen cpu lock up issue from fields if system has greater (more than 96) logical cpu count. SAS3.0 controller (Invader series) supports at max 96 msix vector and SAS3.5 product (Ventura) supports at max 128 msix vectors. This may be a generic issue (if PCI device supports completion on multiple reply queues). Let me explain it w.r.t to mpt3sas supported h/w just to simplify the problem and possible changes to handle such issues. IT HBA (mpt3sas) supports multiple reply queues in completion path. Driver creates MSI-x vectors for controller as "min of (FW supported Reply queue, Logical CPUs)". If submitter is not interrupted via completion on same CPU, there is a loop in the IO path. This behavior can cause hard/soft CPU lockups, IO timeout, system sluggish etc. Example - one CPU (e.g. CPU A) is busy submitting the IOs and another CPU (e.g. CPU B) is busy with processing the corresponding IO's reply descriptors from reply descriptor queue upon receiving the interrupts from HBA. If the CPU A is continuously pumping the IOs then always CPU B (which is executing the ISR) will see the valid reply descriptors in the reply descriptor queue and it will be continuously processing those reply descriptor in a loop without quitting the ISR handler. Mpt3sas driver will exit ISR handler if it finds unused reply descriptor in the reply descriptor queue. Since CPU A will be continuously sending the IOs, CPU B may always see a valid reply descriptor (posted by HBA Firmware after processing the IO) in the reply descriptor queue. In worst case, driver will not quit from this loop in the ISR handler. Eventually, CPU lockup will be detected by watchdog. Above mentioned behavior is not common if "rq_affinity" set to 2 or affinity_hint is honored by irqbalance as "exact". If rq_affinity is set to 2, submitter will be always interrupted via completion on same CPU. If irqbalance is using "exact" policy, interrupt will be delivered to submitter CPU. If CPU counts to MSI-X vectors (reply descriptor Queues) count ratio is not 1:1, we still have exposure of issue explained above and for that we don't have any solution. Exposure of soft/hard lockup if CPU count is more than MSI-x supported by device. If CPUs count to MSI-x vectors count ratio is not 1:1, (Other way, if CPU counts to MSI-x vector count ratio is something like X:1, where X > 1) then 'exact' irqbalance policy OR rq_affinity = 2 won't help to avoid CPU hard/soft lockups. There won't be any one to one mapping between CPU to MSI-x vector instead one MSI-x interrupt (or reply descriptor queue) is shared with group/set of CPUs and there is a possibility of having a loop in the IO path within that CPU group and may observe lockups. For example: Consider a system having two NUMA nodes and each node having four logical CPUs and also consider that number of MSI-x vectors enabled on the HBA is two, then CPUs count to MSI-x vector count ratio as 4:1. e.g. MSIx vector 0 is affinity to CPU 0, CPU 1, CPU 2 & CPU 3 of NUMA node 0 and MSI-x vector 1 is affinity to CPU 4, CPU 5, CPU 6 & CPU 7 of NUMA node 1. numactl --hardware available: 2 nodes (0-1) node 0 cpus: 0 1 2 3 --> MSI-x 0 node 0 size: 65536 MB node 0 free: 63176 MB node 1 cpus: 4 5 6 7 -->MSI-x 1 node 1 size: 65536 MB node 1 free: 63176 MB Assume that user started an application which uses all the CPUs of NUMA node 0 for issuing the IOs. Only one CPU from affinity list (it can be any cpu since this behavior depends upon irqbalance) CPU0 will receive the interrupts from MSIx vector 0 for all the IOs. Eventually, CPU 0 IO submission percentage will be decreasing and ISR processing percentage will be increasing as it is more busy with processing the interrupts. Gradually IO submission percentage on CPU 0 will be zero and it's ISR processing percentage will be 100 percentage as IO loop has already formed within the NUMA node 0, i.e. CPU 1, CPU 2 & CPU 3 will be continuously busy with submitting the heavy IOs and only CPU 0 is busy in the ISR path as it always find the valid reply descriptor in the reply descriptor queue. Eventually, we will observe the hard lockup here. Chances of occurring of hard/soft lockups are directly proportional to value of X. If value of X is high, then chances of observing CPU lockups is high. Solution: Use IRQ poll interface defined in " irq_poll.c". mpt3sas driver will execute ISR routine in Softirq context and it will always quit the loop based on budget provided in IRQ poll interface. In these scenarios (i.e. where CPUs count to MSI-X vectors count ratio is X:1 (where X > 1)), IRQ poll interface will avoid CPU hard lockups due to voluntary exit from the reply queue processing based on budget. Note - Only one MSI-x vector is busy doing processing. Irqstat output: IRQs / 1 second(s) IRQ# TOTAL NODE0 NODE1 NODE2 NODE3 NAME 44 122871 122871 0 0 0 IR-PCI-MSI-edge mpt3sas0-msix0 45 0 0 0 0 0 IR-PCI-MSI-edge mpt3sas0-msix1 We use this approach only if cpu count is more than FW supported MSI-x vector Signed-off-by: Suganath Prabu <suganath-prabu.subramani@broadcom.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
84 lines
3.5 KiB
Plaintext
84 lines
3.5 KiB
Plaintext
#
|
|
# Kernel configuration file for the MPT3SAS
|
|
#
|
|
# This code is based on drivers/scsi/mpt3sas/Kconfig
|
|
# Copyright (C) 2012-2014 LSI Corporation
|
|
# (mailto:DL-MPTFusionLinux@lsi.com)
|
|
|
|
# This program is free software; you can redistribute it and/or
|
|
# modify it under the terms of the GNU General Public License
|
|
# as published by the Free Software Foundation; either version 2
|
|
# of the License, or (at your option) any later version.
|
|
|
|
# This program is distributed in the hope that it will be useful,
|
|
# but WITHOUT ANY WARRANTY; without even the implied warranty of
|
|
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
|
|
# GNU General Public License for more details.
|
|
|
|
# NO WARRANTY
|
|
# THE PROGRAM IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OR
|
|
# CONDITIONS OF ANY KIND, EITHER EXPRESS OR IMPLIED INCLUDING, WITHOUT
|
|
# LIMITATION, ANY WARRANTIES OR CONDITIONS OF TITLE, NON-INFRINGEMENT,
|
|
# MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Each Recipient is
|
|
# solely responsible for determining the appropriateness of using and
|
|
# distributing the Program and assumes all risks associated with its
|
|
# exercise of rights under this Agreement, including but not limited to
|
|
# the risks and costs of program errors, damage to or loss of data,
|
|
# programs or equipment, and unavailability or interruption of operations.
|
|
|
|
# DISCLAIMER OF LIABILITY
|
|
# NEITHER RECIPIENT NOR ANY CONTRIBUTORS SHALL HAVE ANY LIABILITY FOR ANY
|
|
# DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
|
|
# DAMAGES (INCLUDING WITHOUT LIMITATION LOST PROFITS), HOWEVER CAUSED AND
|
|
# ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR
|
|
# TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE
|
|
# USE OR DISTRIBUTION OF THE PROGRAM OR THE EXERCISE OF ANY RIGHTS GRANTED
|
|
# HEREUNDER, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGES
|
|
|
|
# You should have received a copy of the GNU General Public License
|
|
# along with this program; if not, write to the Free Software
|
|
# Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301,
|
|
# USA.
|
|
|
|
config SCSI_MPT3SAS
|
|
tristate "LSI MPT Fusion SAS 3.0 & SAS 2.0 Device Driver"
|
|
depends on PCI && SCSI
|
|
select SCSI_SAS_ATTRS
|
|
select RAID_ATTRS
|
|
select IRQ_POLL
|
|
---help---
|
|
This driver supports PCI-Express SAS 12Gb/s Host Adapters.
|
|
|
|
config SCSI_MPT2SAS_MAX_SGE
|
|
int "LSI MPT Fusion SAS 2.0 Max number of SG Entries (16 - 256)"
|
|
depends on PCI && SCSI && SCSI_MPT3SAS
|
|
default "128"
|
|
range 16 256
|
|
---help---
|
|
This option allows you to specify the maximum number of scatter-
|
|
gather entries per I/O. The driver default is 128, which matches
|
|
MAX_PHYS_SEGMENTS in most kernels. However in SuSE kernels this
|
|
can be 256. However, it may decreased down to 16. Decreasing this
|
|
parameter will reduce memory requirements on a per controller instance.
|
|
|
|
config SCSI_MPT3SAS_MAX_SGE
|
|
int "LSI MPT Fusion SAS 3.0 Max number of SG Entries (16 - 256)"
|
|
depends on PCI && SCSI && SCSI_MPT3SAS
|
|
default "128"
|
|
range 16 256
|
|
---help---
|
|
This option allows you to specify the maximum number of scatter-
|
|
gather entries per I/O. The driver default is 128, which matches
|
|
MAX_PHYS_SEGMENTS in most kernels. However in SuSE kernels this
|
|
can be 256. However, it may decreased down to 16. Decreasing this
|
|
parameter will reduce memory requirements on a per controller instance.
|
|
|
|
config SCSI_MPT2SAS
|
|
tristate "Legacy MPT2SAS config option"
|
|
default n
|
|
select SCSI_MPT3SAS
|
|
depends on PCI && SCSI
|
|
---help---
|
|
Dummy config option for backwards compatiblity: configure the MPT3SAS
|
|
driver instead.
|