linux/drivers
Shivasharan S 62a04f81e6 scsi: megaraid_sas: IRQ poll to avoid CPU hard lockups
Issue Description:

We have seen CPU lockup issues in the field on systems with a large (more
than 96) logical CPU count.  SAS3.0 controllers (Invader series) support at
most 96 MSI-X vectors and SAS3.5 products (Ventura) support at most 128
MSI-X vectors.

This may be a generic issue for any PCI device that supports completion on
multiple reply queues.

Let me explain it with respect to the hardware supported by megaraid_sas,
just to simplify the problem and the possible changes to handle such
issues.  The MegaRAID controller supports multiple reply queues in the
completion path.  The driver creates MSI-X vectors for the controller as
"minimum of (FW supported reply queues, logical CPUs)".  If the submitter
is not interrupted via completion on the same CPU, there is a loop in the
IO path.  This behavior can cause hard/soft CPU lockups, IO timeouts,
system sluggishness, etc.
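
The vector-count rule quoted above can be sketched in plain user-space C.
This is only an illustration: the helper names msix_vector_count() and
cpus_share_vectors() are hypothetical and are not taken from the driver.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical helper (not driver code): number of MSI-X vectors created,
 * following "minimum of (FW supported reply queues, logical CPUs)".
 */
static unsigned int msix_vector_count(unsigned int fw_reply_queues,
                                      unsigned int logical_cpus)
{
    assert(fw_reply_queues > 0 && logical_cpus > 0);
    return fw_reply_queues < logical_cpus ? fw_reply_queues : logical_cpus;
}

/* True when there are more CPUs than vectors, so some CPUs must share one. */
static bool cpus_share_vectors(unsigned int fw_reply_queues,
                               unsigned int logical_cpus)
{
    return logical_cpus > msix_vector_count(fw_reply_queues, logical_cpus);
}
```

With 96 firmware reply queues and 128 logical CPUs this yields 96 vectors,
so at least some vectors serve more than one CPU.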

Example - one CPU (e.g. CPU A) is busy submitting IOs and another CPU
(e.g. CPU B) is busy processing the corresponding reply descriptors from
the reply descriptor queue upon receiving interrupts from the HBA.  If
CPU A is continuously pumping IOs, then CPU B (which is executing the ISR)
will always see valid reply descriptors in the reply descriptor queue and
will keep processing those reply descriptors in a loop without leaving the
ISR handler.

The megaraid_sas driver exits the ISR handler only when it finds an unused
reply descriptor in the reply descriptor queue.  Since CPU A will be
continuously sending IOs, CPU B may always see a valid reply descriptor
(posted by the HBA firmware after processing the IO) in the reply
descriptor queue.  In the worst case, the driver will never leave this
loop in the ISR handler.  Eventually, the CPU lockup will be detected by
the watchdog.
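
The loop described above can be modelled in a few lines of user-space C.
This is a simplified simulation, not driver code: the queue layout, the
names reply_queue_post() and isr_drain(), and the max_iter cap (added only
so the simulation terminates) are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define QUEUE_DEPTH 8u /* illustrative size, not a hardware value */

struct reply_queue {
    bool valid[QUEUE_DEPTH]; /* true = descriptor posted by firmware */
    size_t head;             /* next slot the handler inspects */
    size_t tail;             /* next slot the firmware fills */
};

/* Model the firmware posting one reply descriptor, if there is room. */
static void reply_queue_post(struct reply_queue *q)
{
    if (!q->valid[q->tail]) {
        q->valid[q->tail] = true;
        q->tail = (q->tail + 1) % QUEUE_DEPTH;
    }
}

/*
 * Model the handler: keep processing until an unused descriptor is seen.
 * When submitter_busy is true, another CPU completes one more IO for each
 * descriptor consumed, so the head slot is always valid. max_iter exists
 * only to stop the simulation; the loop being modelled has no such bound.
 */
static size_t isr_drain(struct reply_queue *q, bool submitter_busy,
                        size_t max_iter)
{
    size_t done = 0;

    while (done < max_iter && q->valid[q->head]) {
        q->valid[q->head] = false;
        q->head = (q->head + 1) % QUEUE_DEPTH;
        done++;
        if (submitter_busy)
            reply_queue_post(q);
    }
    return done;
}
```

With an idle submitter the handler returns after draining what was posted;
with a busy submitter it only stops when the artificial cap is reached,
which is the situation the watchdog eventually reports as a lockup.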

The above-mentioned behavior is not common if "rq_affinity" is set to 2 or
affinity_hint is honored by irqbalance as "exact".  If rq_affinity is set
to 2, the submitter will always be interrupted via completion on the same
CPU.  If irqbalance is using the "exact" policy, the interrupt will be
delivered to the submitter CPU.

Problem statement:

If the ratio of CPU count to MSI-X vector (reply descriptor queue) count is
not 1:1, we are still exposed to the issue explained above, and for that
case we don't have any solution.

Exposure to soft/hard lockups is seen if the CPU count is higher than the
number of MSI-X vectors supported by the device.

If the ratio of CPU count to MSI-X vector count is not 1:1 (in other
words, if it is something like X:1, where X > 1), then the 'exact'
irqbalance policy or rq_affinity = 2 won't help to avoid CPU hard/soft
lockups.  There is no one-to-one mapping between CPUs and MSI-X vectors;
instead, one MSI-X interrupt (or reply descriptor queue) is shared by a
group/set of CPUs, so a loop in the IO path can form within that CPU group
and lockups may be observed.

For example: consider a system with two NUMA nodes, each node having four
logical CPUs, and assume that the number of MSI-X vectors enabled on the
HBA is two.  The ratio of CPU count to MSI-X vector count is then 4:1.
e.g.
MSI-X vector 0 has affinity to CPU 0, CPU 1, CPU 2 & CPU 3 of NUMA node 0
and MSI-X vector 1 has affinity to CPU 4, CPU 5, CPU 6 & CPU 7 of NUMA
node 1.

numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3                 --> MSI-X 0
node 0 size: 65536 MB
node 0 free: 63176 MB
node 1 cpus: 4 5 6 7                 --> MSI-X 1
node 1 size: 65536 MB
node 1 free: 63176 MB
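
The CPU-to-vector grouping in this example can be written down as a small
C sketch. It assumes a contiguous mapping that matches the listing above;
cpu_to_vector() is a hypothetical name and the constants are just the
numbers from the example, not values read from hardware.

```c
#include <assert.h>

/* Topology from the example: 2 NUMA nodes x 4 CPUs, 2 MSI-X vectors. */
enum { NUM_CPUS = 8, NUM_VECTORS = 2 };

/*
 * Hypothetical contiguous mapping: consecutive CPUs share one vector.
 * With 8 CPUs and 2 vectors, X (CPUs per vector) is 4.
 */
static unsigned int cpu_to_vector(unsigned int cpu)
{
    unsigned int cpus_per_vector = NUM_CPUS / NUM_VECTORS;

    assert(cpu < NUM_CPUS);
    return cpu / cpus_per_vector;
}
```

CPUs 0 to 3 map to vector 0 and CPUs 4 to 7 map to vector 1, so
completions for IO submitted on any CPU of a node are delivered through
that node's single vector.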

Assume that the user starts an application which uses all the CPUs of NUMA
node 0 for issuing IOs.  Only one CPU from the affinity list (it can be
any CPU, since this depends on irqbalance; say CPU 0) will receive the
interrupts from MSI-X 0 for all the IOs.  Over time, CPU 0's IO submission
percentage decreases and its ISR processing percentage increases, as it is
kept busy processing interrupts.  Gradually, the IO submission percentage
on CPU 0 drops to zero and its ISR processing percentage reaches 100%,
because an IO loop has formed within NUMA node 0: CPU 1, CPU 2 & CPU 3
are continuously busy submitting heavy IO, and only CPU 0 is busy in the
ISR path, as it always finds a valid reply descriptor in the reply
descriptor queue.  Eventually, we will observe a hard lockup here.

The chance of hard/soft lockups grows with the value of X: the higher X
is, the more likely CPU lockups are.

Solution:

Use the IRQ poll interface defined in "irq_poll.c".

The megaraid_sas driver will execute the ISR routine in softirq context and
will always quit the loop based on the budget provided by the IRQ poll
interface.  The driver switches to IRQ poll only when more than a
threshold number of reply descriptors are handled in one ISR.  Currently
the threshold is set to 1/4th of the HBA queue depth.

In these scenarios (i.e. where the ratio of CPU count to MSI-X vector
count is X:1, with X > 1), the IRQ poll interface avoids CPU hard lockups
because reply queue processing exits voluntarily based on the budget.
Note - Only one MSI-X vector is busy doing the processing.
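
The threshold and budget behavior described above can be illustrated with
a user-space simulation. It does not call the kernel IRQ poll interface;
the structure and the names poll_threshold(), isr_sim() and poll_sim() are
assumptions made for this sketch, and a simple counter stands in for the
reply descriptor queue.

```c
#include <assert.h>
#include <stdbool.h>

struct reply_queue_sim {
    unsigned int pending; /* valid reply descriptors waiting */
    bool poll_mode;       /* true while deferred polling is active */
};

/* Threshold from the text: one quarter of the HBA queue depth. */
static unsigned int poll_threshold(unsigned int hba_queue_depth)
{
    return hba_queue_depth / 4;
}

/*
 * Interrupt part: handle descriptors inline, but hand over to polling
 * once more than 'threshold' have been handled in this invocation.
 * Returns the number of descriptors processed.
 */
static unsigned int isr_sim(struct reply_queue_sim *q, unsigned int threshold)
{
    unsigned int done = 0;

    while (q->pending > 0) {
        q->pending--;
        done++;
        if (done > threshold) {
            q->poll_mode = true; /* defer the rest to the poll routine */
            break;
        }
    }
    return done;
}

/*
 * Poll part: process at most 'budget' descriptors, then return so the CPU
 * can do other work. Polling ends when the queue drains before the budget
 * is used up.
 */
static unsigned int poll_sim(struct reply_queue_sim *q, unsigned int budget)
{
    unsigned int done = 0;

    assert(budget > 0);
    while (done < budget && q->pending > 0) {
        q->pending--;
        done++;
    }
    if (done < budget)
        q->poll_mode = false;
    return done;
}
```

The point of the budget is that poll_sim() always returns after a bounded
amount of work, even if descriptors keep arriving, which is what removes
the unbounded loop from the completion path.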

Select CONFIG_IRQ_POLL from driver Kconfig for driver compilation.

Signed-off-by: Kashyap Desai <kashyap.desai@broadcom.com>
Signed-off-by: Shivasharan S <shivasharan.srikanteshwara@broadcom.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2019-06-18 19:46:19 -04:00
..
accessibility
acpi One more patch to remove io.h from clk-provider.h. We used to need this 2019-05-16 19:05:35 -07:00
amba amba: tegra-ahb: Mark PM functions as __maybe_unused 2019-05-08 14:40:39 +02:00
android Char/Misc patches for 5.2-rc1 - part 2 2019-05-07 13:39:22 -07:00
ata for-5.2/block-post-20190516 2019-05-16 19:08:15 -07:00
atm
auxdisplay
base More power management updates for 5.2-rc1 2019-05-15 08:46:44 -07:00
bcma
block for-5.2/block-post-20190516 2019-05-16 19:08:15 -07:00
bluetooth Bluetooth: hci_qca: Rename STATE_<flags> to QCA_<flags> 2019-05-05 19:34:00 +02:00
bus ARM: SoC-related driver updates 2019-05-16 09:19:14 -07:00
cdrom
char Some minor cleanups for the IPMI driver. 2019-05-08 10:34:17 -07:00
clk One more patch to remove io.h from clk-provider.h. We used to need this 2019-05-16 19:05:35 -07:00
clocksource Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2019-05-19 11:11:20 -07:00
connector
counter
cpufreq One more patch to remove io.h from clk-provider.h. We used to need this 2019-05-16 19:05:35 -07:00
cpuidle
crypto ARM: SoC platform updates 2019-05-16 08:31:32 -07:00
dax libnvdimm fixes 5.2-rc1 2019-05-15 18:56:50 -07:00
dca
devfreq
dio
dma dmaengine updates for v5.2-rc1 2019-05-09 08:51:45 -07:00
dma-buf drm i915, amdgpu, nouveau, msm, panfrost, bridge, pl111 fixes 2019-05-16 07:22:42 -07:00
edac * Do not build mpc85_edac as a module (Michael Ellerman) 2019-05-16 11:55:35 -07:00
eisa
extcon Char/Misc patches for 5.2-rc1 - part 2 2019-05-07 13:39:22 -07:00
firewire drivers/firewire/core-iso.c: convert to use vm_map_pages_zero() 2019-05-14 09:47:50 -07:00
firmware Merge branch 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2019-05-19 10:58:45 -07:00
fmc
fpga ARM: SoC-related driver updates 2019-05-16 09:19:14 -07:00
fsi
gnss Char/Misc patches for 5.2-rc1 - part 2 2019-05-07 13:39:22 -07:00
gpio Merge branch 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2019-05-19 10:58:45 -07:00
gpu One more patch to remove io.h from clk-provider.h. We used to need this 2019-05-16 19:05:35 -07:00
hid treewide: prefix header search paths with $(srctree)/ 2019-05-18 11:49:57 +09:00
hsi
hv
hwmon Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/rzhang/linux 2019-05-16 16:16:18 -07:00
hwspinlock
hwtracing Char/Misc patches for 5.2-rc1 - part 2 2019-05-07 13:39:22 -07:00
i2c i2c: core: add device-managed version of i2c_new_dummy 2019-05-17 19:29:40 +02:00
i3c * Fix a shift wrap bug in the core 2019-05-07 08:50:40 -07:00
ide ide: officially deprecated the legacy IDE driver 2019-05-08 16:47:23 -07:00
idle
iio power supply and reset changes for the v5.2 series 2019-05-15 18:50:40 -07:00
infiniband 5.2 Merge Window second pull request 2019-05-14 20:56:31 -07:00
input ARM: SoC platform updates 2019-05-16 08:31:32 -07:00
interconnect
iommu Merge branch 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2019-05-19 10:58:45 -07:00
ipack
irqchip Merge branch 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2019-05-19 10:58:45 -07:00
isdn Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next 2019-05-07 22:03:58 -07:00
leds - Core Frameworks 2019-05-14 10:39:08 -07:00
lightnvm lightnvm: pblk: use nvm_rq_to_ppa_list() 2019-05-06 10:19:19 -06:00
macintosh
mailbox One more patch to remove io.h from clk-provider.h. We used to need this 2019-05-16 19:05:35 -07:00
mcb
md - Improve DM snapshot target's scalability by using finer grained 2019-05-16 15:55:48 -07:00
media media: prefix header search paths with $(srctree)/ 2019-05-18 11:49:56 +09:00
memory One more patch to remove io.h from clk-provider.h. We used to need this 2019-05-16 19:05:35 -07:00
memstick MMC core: 2019-05-07 12:56:19 -07:00
message
mfd One more patch to remove io.h from clk-provider.h. We used to need this 2019-05-16 19:05:35 -07:00
misc Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2019-05-19 11:11:20 -07:00
mmc One more patch to remove io.h from clk-provider.h. We used to need this 2019-05-16 19:05:35 -07:00
mtd treewide: replace #include <asm/sizes.h> with #include <linux/sizes.h> 2019-05-14 19:52:52 -07:00
mux
net treewide: prefix header search paths with $(srctree)/ 2019-05-18 11:49:57 +09:00
nfc Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next 2019-05-07 22:03:58 -07:00
ntb
nubus
nvdimm libnvdimm fixes 5.2-rc1 2019-05-15 18:56:50 -07:00
nvme for-5.2/block-post-20190516 2019-05-16 19:08:15 -07:00
nvmem ARM: SoC-related driver updates 2019-05-16 09:19:14 -07:00
of of_net: Fix missing of_find_device_by_node ref count drop 2019-05-13 08:52:37 -07:00
opp
oprofile
parisc parisc: Skip registering LED when running in QEMU 2019-05-03 23:47:39 +02:00
parport DMA mapping updates for 5.2 2019-05-09 08:40:55 -07:00
pci pci-v5.2-changes 2019-05-14 10:30:10 -07:00
pcmcia treewide: replace #include <asm/sizes.h> with #include <linux/sizes.h> 2019-05-14 19:52:52 -07:00
perf
phy USB/PHY patches for 5.2-rc1 2019-05-08 10:03:52 -07:00
pinctrl - Core Frameworks 2019-05-14 10:39:08 -07:00
platform - Core Frameworks 2019-05-14 10:39:08 -07:00
pnp
power power supply and reset changes for the v5.2 series 2019-05-15 18:50:40 -07:00
powercap
pps pps: pps-gpio PPS ECHO implementation 2019-05-14 19:52:51 -07:00
ps3
ptp ptp_qoriq: fix NULL access if ptp dt node missing 2019-05-09 09:19:26 -07:00
pwm Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2019-05-19 11:11:20 -07:00
rapidio rapidio: fix a NULL pointer dereference when create_workqueue() fails 2019-05-14 19:52:50 -07:00
ras
regulator Merge branch 'regulator-5.2' into regulator-next 2019-05-06 22:52:14 +09:00
remoteproc
reset ARM: SoC-related driver updates 2019-05-16 09:19:14 -07:00
rpmsg
rtc ARM: SoC-related driver updates 2019-05-16 09:19:14 -07:00
s390 s390 updates for the 5.2 merge window #2 2019-05-17 10:08:59 -07:00
sbus mm/gup: change GUP fast to use flags rather than a write 'bool' 2019-05-14 09:47:46 -07:00
scsi scsi: megaraid_sas: IRQ poll to avoid CPU hard lockups 2019-06-18 19:46:19 -04:00
sfi
sh treewide: replace #include <asm/sizes.h> with #include <linux/sizes.h> 2019-05-14 19:52:52 -07:00
siox
slimbus
sn
soc Merge branch 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2019-05-19 10:58:45 -07:00
soundwire
spi ARM: SoC-related driver updates 2019-05-16 09:19:14 -07:00
spmi
ssb Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next 2019-05-07 22:03:58 -07:00
staging media updates for v5.2-rc1 2019-05-16 11:57:16 -07:00
target treewide: prefix header search paths with $(srctree)/ 2019-05-18 11:49:57 +09:00
tc
tee ARM: SoC-related driver updates 2019-05-16 09:19:14 -07:00
thermal Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/rzhang/linux 2019-05-16 16:16:18 -07:00
thunderbolt Char/Misc patches for 5.2-rc1 - part 2 2019-05-07 13:39:22 -07:00
tty RISC-V Patches for the 5.2 Merge Window, Part 1 v3 2019-05-19 09:56:36 -07:00
uio
usb treewide: prefix header search paths with $(srctree)/ 2019-05-18 11:49:57 +09:00
uwb
vfio mm/gup: change GUP fast to use flags rather than a write 'bool' 2019-05-14 09:47:46 -07:00
vhost virtio: fixes, features 2019-05-14 14:12:59 -07:00
video fbdev/efifb: Ignore framebuffer memmap entries that lack any memory types 2019-05-17 11:07:42 +02:00
virt drivers/virt/fsl_hypervisor.c: prevent integer overflow in ioctl 2019-05-14 19:52:52 -07:00
virtio virtio/virtio_ring: do some comment fixes 2019-05-12 13:11:35 -04:00
visorbus
vlynq
vme
w1 Char/Misc patches for 5.2-rc1 - part 2 2019-05-07 13:39:22 -07:00
watchdog ARM: SoC platform updates 2019-05-16 08:31:32 -07:00
xen xen: fixes and features for 5.2-rc1 2019-05-15 18:44:52 -07:00
zorro
Kconfig
Makefile