linux

mirror of https://github.com/torvalds/linux.git synced 2024-11-22 12:11:40 +00:00

A mirror of the official Linux kernel repository just in case

Mahesh Salgaonkar 77583f77ed		2023-08-18 23:30:22 +10:00

PCI: rpaphp: Error out on busy status from get-sensor-state

When certain PHB HW failure causes pHyp to recover PHB, it marks the PE
state as temporarily unavailable until recovery is complete. This also
triggers an EEH handler in Linux which needs to notify drivers, and perform
recovery. But before notifying the driver about the PCI error it uses
get_adapter_status()->rpaphp_get_sensor_state()->rtas_call(get-sensor-state)
operation of the hotplug_slot to determine if the slot contains a device or
not. If the slot is empty, the recovery is skipped entirely.

    eeh_event_handler()
    ->eeh_handle_normal_event()
      ->eeh_slot_presence_check()
        ->get_adapter_status()
          ->rpaphp_get_sensor_state()
            ->rtas_get_sensor()
              ->rtas_call(get-sensor-state)

However on certain PHB failures, the RTAS call rtas_call(get-sensor-state)
returns extended busy error (9902) until PHB is recovered by pHyp. Once PHB
is recovered, the rtas_call(get-sensor-state) returns success with correct
presence status. The RTAS call interface rtas_get_sensor() loops over the
RTAS call on extended delay return code (9902) until the return value is
either success (0) or error (-1). This causes the EEH handler to get stuck
for ~6 seconds before it can notify that the PCI error has been detected
and stop any active operations. Hence with running I/O traffic, during
these 6 seconds, the network driver continues its operation and hits a
timeout (netdev watchdog).

    ------------
    [52732.244731] DEBUG: ibm_read_slot_reset_state2()
    [52732.244762] DEBUG: ret = 0, rets[0]=5, rets[1]=1, rets[2]=4000, rets[3]=>
    [52732.244798] DEBUG: in eeh_slot_presence_check
    [52732.244804] DEBUG: error state check
    [52732.244807] DEBUG: Is slot hotpluggable
    [52732.244810] DEBUG: hotpluggable ops ?
    [52732.244953] DEBUG: Calling ops->get_adapter_status
    [52732.244958] DEBUG: calling rpaphp_get_sensor_state
    [52736.564262] ------------[ cut here ]------------
    [52736.564299] NETDEV WATCHDOG: enP64p1s0f3 (tg3): transmit queue 0 timed o>
    [52736.564324] WARNING: CPU: 1442 PID: 0 at net/sched/sch_generic.c:478 dev>
    [...]
    [52736.564505] NIP [c000000000c32368] dev_watchdog+0x438/0x440
    [52736.564513] LR [c000000000c32364] dev_watchdog+0x434/0x440
    ------------

On timeouts, the network driver starts dumping debug information to the
console (e.g. the bnx2x driver calls bnx2x_panic_dump()), and goes into its
recovery path while pHyp is still recovering the PHB. As part of recovery,
the driver tries to reset the device and it keeps failing since every PCI
read/write returns ff's. And when EEH recovery kicks in, the driver is
unable to recover the device. This impacts the ssh connection and leads to
the system being inaccessible. Getting the NIC working again requires a
reboot or re-assigning the I/O adapter from the HMC.

    [ 9531.168587] EEH: Beginning: 'slot_reset'
    [ 9531.168601] PCI 0013:01:00.0#10000: EEH: Invoking bnx2x->slot_reset()
    [...]
    [ 9614.110094] bnx2x: [bnx2x_func_stop:9129(enP19p1s0f0)]FUNC_STOP ramrod failed. Running a dry transaction
    [ 9614.110300] bnx2x: [bnx2x_igu_int_disable:902(enP19p1s0f0)]BUG! Proper val not read from IGU!
    [ 9629.178067] bnx2x: [bnx2x_fw_command:3055(enP19p1s0f0)]FW failed to respond!
    [ 9629.178085] bnx2x 0013:01:00.0 enP19p1s0f0: bc 7.10.4
    [ 9629.178091] bnx2x: [bnx2x_fw_dump_lvl:789(enP19p1s0f0)]Cannot dump MCP info while in PCI error
    [ 9644.241813] bnx2x: [bnx2x_io_slot_reset:14245(enP19p1s0f0)]IO slot reset --> driver unload
    [...]
    [ 9644.241819] PCI 0013:01:00.0#10000: EEH: bnx2x driver reports: 'disconnect'
    [ 9644.241823] PCI 0013:01:00.1#10000: EEH: Invoking bnx2x->slot_reset()
    [ 9644.241827] bnx2x: [bnx2x_io_slot_reset:14229(enP19p1s0f1)]IO slot reset initializing...
    [ 9644.241916] bnx2x 0013:01:00.1: enabling device (0140 -> 0142)
    [ 9644.258604] bnx2x: [bnx2x_io_slot_reset:14245(enP19p1s0f1)]IO slot reset --> driver unload
    [ 9644.258612] PCI 0013:01:00.1#10000: EEH: bnx2x driver reports: 'disconnect'
    [ 9644.258615] EEH: Finished:'slot_reset' with aggregate recovery state:'disconnect'
    [ 9644.258620] EEH: Unable to recover from failure from PHB#13-PE#10000.
    [ 9644.261811] EEH: Beginning: 'error_detected(permanent failure)'
    [...]
    [ 9644.261823] EEH: Finished:'error_detected(permanent failure)'

Hence, it becomes important to inform the driver about the PCI error
detection as early as possible, so that the driver is aware of the PCI
error and waits for the EEH handler's next action for successful recovery.

The current implementation uses the rtas_get_sensor() API, which blocks the
slot state check until the RTAS call returns success. To avoid this, fix
the PCI hotplug driver (rpaphp) to return an error (-EBUSY) if the slot
presence state cannot be detected immediately while the PE is in EEH
recovery state. Change rpaphp_get_sensor_state() to invoke
rtas_call(get-sensor-state) directly only if the respective PE is in EEH
recovery state, and take actions based on the RTAS return status. This way
the EEH handler will not be blocked on rpaphp_get_sensor_state() and can
immediately notify the driver about the PCI error and stop any active
operations.

In normal cases (non-EEH case) rpaphp_get_sensor_state() will continue to
invoke rtas_get_sensor() as it was earlier with no change in existing
behavior.

Signed-off-by: Mahesh Salgaonkar <mahesh@linux.ibm.com>
Reviewed-by: Nathan Lynch <nathanl@linux.ibm.com>
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://msgid.link/169235815601.193557.13989873835811325343.stgit@jupiter

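
The control flow described in the commit message above can be sketched as a small userspace simulation. This is not the kernel patch. `fake_get_sensor_state()`, `busy_calls_left`, `map_rc()`, `blocking_get_sensor()`, `get_sensor_state()` and `sensor_state_demo()` are hypothetical names invented for illustration, the mapping of other failures to `-EIO` is a simplifying assumption, and the retry loop omits the delay that a real firmware retry would include. Only the status values 0, -1 and 9902 and the `-EBUSY` result come from the commit message.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* RTAS status values named in the commit message. */
#define RTAS_SUCCESS          0
#define RTAS_ERROR          (-1)
#define RTAS_EXTENDED_BUSY  9902

/* Hypothetical firmware stub: answers "busy" for the next busy_calls_left calls. */
static int busy_calls_left;

static int fake_get_sensor_state(int *state)
{
	if (busy_calls_left > 0) {
		busy_calls_left--;
		return RTAS_EXTENDED_BUSY;
	}
	*state = 1;	/* pretend the slot holds an adapter */
	return RTAS_SUCCESS;
}

/* Simplified error mapping; an assumption, not the kernel's rtas_error_rc(). */
static int map_rc(int rc)
{
	return rc == RTAS_SUCCESS ? 0 : -EIO;
}

/* Blocking behaviour: keep retrying while firmware reports busy (no sleep here). */
static int blocking_get_sensor(int *state, int *attempts)
{
	int rc;

	*attempts = 0;
	do {
		rc = fake_get_sensor_state(state);
		(*attempts)++;
	} while (rc == RTAS_EXTENDED_BUSY);

	return map_rc(rc);
}

/* Proposed behaviour: one call while the PE is in EEH recovery, -EBUSY if busy. */
static int get_sensor_state(bool pe_in_recovery, int *state, int *attempts)
{
	int rc;

	if (!pe_in_recovery)
		return blocking_get_sensor(state, attempts);

	*attempts = 1;
	rc = fake_get_sensor_state(state);
	if (rc == RTAS_EXTENDED_BUSY)
		return -EBUSY;

	return map_rc(rc);
}

/* Usage demo: firmware stays busy for three calls in each scenario. */
int sensor_state_demo(void)
{
	int state = 0, attempts = 0;

	/* EEH recovery path: returns at once instead of waiting for firmware. */
	busy_calls_left = 3;
	assert(get_sensor_state(true, &state, &attempts) == -EBUSY);
	assert(attempts == 1);

	/* Normal path: retries until firmware answers, as before the change. */
	busy_calls_left = 3;
	assert(get_sensor_state(false, &state, &attempts) == 0);
	assert(attempts == 4 && state == 1);
	return 0;
}
```

Calling `sensor_state_demo()` from a `main()` exercises both paths. The point of the split is that the caller handling an EEH event gets an immediate `-EBUSY` and can go on to notify the driver, while ordinary hotplug queries keep the retrying behaviour.
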
arch	powerpc/rtas: export rtas_error_rc() for reuse.	2023-08-18 23:28:57 +10:00
block	block-6.5-2023-07-21	2023-07-22 11:05:15 -07:00
certs	KEYS: Add missing function documentation	2023-04-24 16:15:52 +03:00
crypto	crypto: algif_hash - Fix race between MORE and non-MORE sends	2023-07-08 22:48:42 +10:00
Documentation	powerpc/idle: Add support for nohlt	2023-08-18 21:19:13 +10:00
drivers	PCI: rpaphp: Error out on busy status from get-sensor-state	2023-08-18 23:30:22 +10:00
fs	Bug and regression fixes for 6.5-rc3 for ext4's mballoc and jbd2's	2023-07-23 10:21:49 -07:00
include	perf/hw_breakpoint: Remove arch breakpoint hooks	2023-08-16 23:54:50 +10:00
init	Kbuild updates for v6.5	2023-07-01 09:24:31 -07:00
io_uring	io_uring-6.5-2023-07-21	2023-07-22 10:46:30 -07:00
ipc	Merge branch 'work.namespace' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs	2023-02-24 19:20:07 -08:00
kernel	perf/hw_breakpoint: Remove arch breakpoint hooks	2023-08-16 23:54:50 +10:00
lib	block-6.5-2023-07-21	2023-07-22 11:05:15 -07:00
LICENSES	LICENSES: Add the copyleft-next-0.3.1 license	2022-11-08 15:44:01 +01:00
mm	mm/mlock: fix vma iterator conversion of apply_vma_lock_flags()	2023-07-17 12:53:21 -07:00
net	Including fixes from BPF, netfilter, bluetooth and CAN.	2023-07-20 14:46:39 -07:00
rust	rust: error: `impl Debug` for `Error` with `errname()` integration	2023-06-13 01:24:42 +02:00
samples	arm64: ftrace: Add direct call trampoline samples support	2023-07-10 17:51:54 -04:00
scripts	Kbuild fixes for v6.5	2023-07-23 14:55:41 -07:00
security	security: keys: Modify mismatched function name	2023-07-17 19:40:27 +00:00
sound	ASoC: Fixes for v6.5	2023-07-20 15:16:11 +02:00
tools	selftests/powerpc: add const qualification where possible	2023-08-18 17:03:15 +10:00
usr	initramfs: Encode dependency on KBUILD_BUILD_TIMESTAMP	2023-06-06 17:54:49 +09:00
virt	ARM64:	2023-07-03 15:32:22 -07:00
.clang-format	iommu: Add for_each_group_device()	2023-05-23 08:15:51 +02:00
.cocciconfig
.get_maintainer.ignore	get_maintainer: add Alan to .get_maintainer.ignore	2022-08-20 15:17:44 -07:00
.gitattributes	.gitattributes: set diff driver for Rust source code files	2023-05-31 17:48:25 +02:00
.gitignore	Revert ".gitignore: ignore .cover and .mbx"	2023-07-04 15:05:12 -07:00
.mailmap	Including fixes from BPF, netfilter, bluetooth and CAN.	2023-07-20 14:46:39 -07:00
.rustfmt.toml	rust: add `.rustfmt.toml`	2022-09-28 09:02:20 +02:00
COPYING	COPYING: state that all contributions really are covered by this file	2020-02-10 13:32:20 -08:00
CREDITS	- Address -Wmissing-prototype warnings	2023-06-26 16:43:54 -07:00
Kbuild	Kbuild updates for v6.1	2022-10-10 12:00:45 -07:00
Kconfig	kbuild: ensure full rebuild when the compiler is updated	2020-05-12 13:28:33 +09:00
MAINTAINERS	ASoC: Fixes for v6.5	2023-07-17 08:21:09 +02:00
Makefile	Linux 6.5-rc3	2023-07-23 15:24:10 -07:00
README	Drop all 00-INDEX files from Documentation/	2018-09-09 15:08:58 -06:00

README

Linux kernel
============

There are several guides for kernel developers and users. These guides can
be rendered in a number of formats, like HTML and PDF. Please read
Documentation/admin-guide/README.rst first.

In order to build the documentation, use ``make htmldocs`` or
``make pdfdocs``.  The formatted documentation can also be read online at:

    https://www.kernel.org/doc/html/latest/

There are various text files in the Documentation/ subdirectory,
several of them using the Restructured Text markup notation.

Please read the Documentation/process/changes.rst file, as it contains the
requirements for building and running the kernel, and information about
the problems which may result from upgrading your kernel.