linux/drivers
Steffen Maier fdb7cee3b9 scsi: zfcp: trace HBA FSF response by default on dismiss or timedout late response
At the default trace level, we only trace unsuccessful events including
FSF responses.

zfcp_dbf_hba_fsf_response() only used protocol status and FSF status to
decide on an unsuccessful response. However, this is only one of multiple
possible sources determining a failed struct zfcp_fsf_req.

An FSF request can also "fail" if its response runs into an ERP timeout
or if it gets dismissed because a higher level recovery was triggered
[trace tags "erscf_1" or "erscf_2" in zfcp_erp_strategy_check_fsfreq()].
FSF requests with ERP timeout are:
FSF_QTCB_EXCHANGE_CONFIG_DATA, FSF_QTCB_EXCHANGE_PORT_DATA,
FSF_QTCB_OPEN_PORT_WITH_DID or FSF_QTCB_CLOSE_PORT or
FSF_QTCB_CLOSE_PHYSICAL_PORT for target ports,
FSF_QTCB_OPEN_LUN, FSF_QTCB_CLOSE_LUN.
One example is slow queue processing which can cause follow-on errors,
e.g. FSF_PORT_ALREADY_OPEN after FSF_QTCB_OPEN_PORT_WITH_DID timed out.
In order to see the root cause, we need to see late responses even if the
channel presented them successfully with FSF_PROT_GOOD and FSF_GOOD.
Example trace records formatted with zfcpdbf from the s390-tools package:

Timestamp      : ...
Area           : REC
Subarea        : 00
Level          : 1
Exception      : -
CPU ID         : ..
Caller         : ...
Record ID      : 1
Tag            : fcegpf1
LUN            : 0xffffffffffffffff
WWPN           : 0x<WWPN>
D_ID           : 0x00<D_ID>
Adapter status : 0x5400050b
Port status    : 0x41200000
LUN status     : 0x00000000
Ready count    : 0x00000001
Running count  : 0x...
ERP want       : 0x02				ZFCP_ERP_ACTION_REOPEN_PORT
ERP need       : 0x02				ZFCP_ERP_ACTION_REOPEN_PORT
|
Timestamp      : ...				30 seconds later
Area           : REC
Subarea        : 00
Level          : 1
Exception      : -
CPU ID         : ..
Caller         : ...
Record ID      : 2
Tag            : erscf_2
LUN            : 0xffffffffffffffff
WWPN           : 0x<WWPN>
D_ID           : 0x00<D_ID>
Adapter status : 0x5400050b
Port status    : 0x41200000
LUN status     : 0x00000000
Request ID     : 0x<request_ID>
ERP status     : 0x10000000			ZFCP_STATUS_ERP_TIMEDOUT
ERP step       : 0x0800				ZFCP_ERP_STEP_PORT_OPENING
ERP action     : 0x02				ZFCP_ERP_ACTION_REOPEN_PORT
ERP count      : 0x00
|
Timestamp      : ...				later than previous record
Area           : HBA
Subarea        : 00
Level          : 5	> default level		=> 3	<= default level
Exception      : -
CPU ID         : 00
Caller         : ...
Record ID      : 1
Tag            : fs_qtcb			=> fs_rerr
Request ID     : 0x<request_ID>
Request status : 0x00001010			ZFCP_STATUS_FSFREQ_DISMISSED
						| ZFCP_STATUS_FSFREQ_CLEANUP
FSF cmnd       : 0x00000005
FSF sequence no: 0x...
FSF issued     : ...				> 30 seconds ago
FSF stat       : 0x00000000			FSF_GOOD
FSF stat qual  : 00000000 00000000 00000000 00000000
Prot stat      : 0x00000001			FSF_PROT_GOOD
Prot stat qual : 00000000 00000000 00000000 00000000
Port handle    : 0x...
LUN handle     : 0x00000000
QTCB log length: ...
QTCB log info  : ...

In case of problems detecting that new responses are waiting on the input
queue, we sooner or later trigger adapter recovery due to an FSF request
timeout (trace tag "fsrth_1").
FSF requests with FSF request timeout are:
typically FSF_QTCB_ABORT_FCP_CMND; but theoretically also
FSF_QTCB_EXCHANGE_CONFIG_DATA or FSF_QTCB_EXCHANGE_PORT_DATA via sysfs,
FSF_QTCB_OPEN_PORT_WITH_DID or FSF_QTCB_CLOSE_PORT for WKA ports,
FSF_QTCB_FCP_CMND for task management function (LUN / target reset).
One or more pending requests can meanwhile have FSF_PROT_GOOD and FSF_GOOD
because the channel filled in the response via DMA into the request's QTCB.

In a theroretical case, inject code can create an erroneous FSF request
on purpose. If data router is enabled, it uses deferred error reporting.
A READ SCSI command can succeed with FSF_PROT_GOOD, FSF_GOOD, and
SAM_STAT_GOOD. But on writing the read data to host memory via DMA,
it can still fail, e.g. if an intentionally wrong scatter list does not
provide enough space. Rather than getting an unsuccessful response,
we get a QDIO activate check which in turn triggers adapter recovery.
One or more pending requests can meanwhile have FSF_PROT_GOOD and FSF_GOOD
because the channel filled in the response via DMA into the request's QTCB.
Example trace records formatted with zfcpdbf from the s390-tools package:

Timestamp      : ...
Area           : HBA
Subarea        : 00
Level          : 6	> default level		=> 3	<= default level
Exception      : -
CPU ID         : ..
Caller         : ...
Record ID      : 1
Tag            : fs_norm			=> fs_rerr
Request ID     : 0x<request_ID2>
Request status : 0x00001010			ZFCP_STATUS_FSFREQ_DISMISSED
						| ZFCP_STATUS_FSFREQ_CLEANUP
FSF cmnd       : 0x00000001
FSF sequence no: 0x...
FSF issued     : ...
FSF stat       : 0x00000000			FSF_GOOD
FSF stat qual  : 00000000 00000000 00000000 00000000
Prot stat      : 0x00000001			FSF_PROT_GOOD
Prot stat qual : ........ ........ 00000000 00000000
Port handle    : 0x...
LUN handle     : 0x...
|
Timestamp      : ...
Area           : SCSI
Subarea        : 00
Level          : 3
Exception      : -
CPU ID         : ..
Caller         : ...
Record ID      : 1
Tag            : rsl_err
Request ID     : 0x<request_ID2>
SCSI ID        : 0x...
SCSI LUN       : 0x...
SCSI result    : 0x000e0000			DID_TRANSPORT_DISRUPTED
SCSI retries   : 0x00
SCSI allowed   : 0x05
SCSI scribble  : 0x<request_ID2>
SCSI opcode    : 28...				Read(10)
FCP rsp inf cod: 0x00
FCP rsp IU     : 00000000 00000000 00000000 00000000
                                         ^^	SAM_STAT_GOOD
                 00000000 00000000

Only with luck in both above cases, we could see a follow-on trace record
of an unsuccesful event following a successful but late FSF response with
FSF_PROT_GOOD and FSF_GOOD. Typically this was the case for I/O requests
resulting in a SCSI trace record "rsl_err" with DID_TRANSPORT_DISRUPTED
[On ZFCP_STATUS_FSFREQ_DISMISSED, zfcp_fsf_protstatus_eval() sets
ZFCP_STATUS_FSFREQ_ERROR seen by the request handler functions as failure].
However, the reason for this follow-on trace was invisible because the
corresponding HBA trace record was missing at the default trace level
(by default hidden records with tags "fs_norm", "fs_qtcb", or "fs_open").

On adapter recovery, after we had shut down the QDIO queues, we perform
unsuccessful pseudo completions with flag ZFCP_STATUS_FSFREQ_DISMISSED
for each pending FSF request in zfcp_fsf_req_dismiss_all().
In order to find the root cause, we need to see all pseudo responses even
if the channel presented them successfully with FSF_PROT_GOOD and FSF_GOOD.

Therefore, check zfcp_fsf_req.status for ZFCP_STATUS_FSFREQ_DISMISSED
or ZFCP_STATUS_FSFREQ_ERROR and trace with a new tag "fs_rerr".

It does not matter that there are numerous places which set
ZFCP_STATUS_FSFREQ_ERROR after the location where we trace an FSF response
early. These cases are based on protocol status != FSF_PROT_GOOD or
== FSF_PROT_FSF_STATUS_PRESENTED and are thus already traced by default
as trace tag "fs_perr" or "fs_ferr" respectively.

NB: The trace record with tag "fssrh_1" for status read buffers on dismiss
all remains. zfcp_fsf_req_complete() handles this and returns early.
All other FSF request types are handled separately and as described above.

Signed-off-by: Steffen Maier <maier@linux.vnet.ibm.com>
Fixes: 8a36e4532e ("[SCSI] zfcp: enhancement of zfcp debug features")
Fixes: 2e261af84c ("[SCSI] zfcp: Only collect FSF/HBA debug data for matching trace levels")
Cc: <stable@vger.kernel.org> #2.6.38+
Reviewed-by: Benjamin Block <bblock@linux.vnet.ibm.com>
Signed-off-by: Benjamin Block <bblock@linux.vnet.ibm.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2017-08-10 19:36:57 -04:00
..
accessibility
acpi Merge branches 'acpi-soc', 'acpi-wdat' and 'acpi-cppc' 2017-08-03 20:30:18 +02:00
amba
android binder: Use wake up hint for synchronous transactions. 2017-07-17 14:44:19 +02:00
ata libata: fix a couple of doc build warnings 2017-07-31 08:03:06 -07:00
atm atm: zatm: Fix an error handling path in 'zatm_init_one()' 2017-07-18 11:37:46 -07:00
auxdisplay
base dma mapping fixes for 4.13-rc2: 2017-07-25 17:17:18 -07:00
bcma
block Merge branch 'for-linus' of git://git.kernel.dk/linux-block 2017-07-28 12:13:34 -07:00
bluetooth Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next 2017-07-05 12:31:59 -07:00
bus bus: uniphier-system-bus: set up registers when resuming 2017-08-04 12:57:18 +02:00
cdrom block: don't set bounce limit in blk_init_queue 2017-06-27 12:13:45 -06:00
char Add wait_for_random_bytes() and get_random_*_wait() functions so that 2017-07-15 12:44:02 -07:00
clk clk: keystone: sci-clk: Fix sci_clk_get 2017-08-02 18:37:26 -07:00
clocksource clocksource/drivers/timer-of: Handle of_irq_get_byname() result correctly 2017-07-17 22:43:00 +02:00
connector
cpufreq Merge branches 'pm-cpufreq-x86', 'pm-cpufreq-docs' and 'intel_pstate' 2017-08-03 20:29:24 +02:00
cpuidle powerpc updates for 4.13 2017-07-07 13:55:45 -07:00
crypto crypto: brcm - remove BCM_PDC_MBOX dependency in Kconfig 2017-07-18 17:01:08 +08:00
dax - A few DM integrity fixes that improve performance. One that address 2017-07-28 12:17:17 -07:00
dca
devfreq PM / devfreq: constify attribute_group structures. 2017-07-06 10:17:24 +09:00
dio
dma dmaengine updates for 4.13-rc1 2017-07-08 12:36:50 -07:00
dma-buf Merge branch 'drm-misc-next-fixes' into drm-misc-fixes 2017-07-17 11:56:07 -04:00
edac EDAC, pnd2: Fix Apollo Lake DIMM detection 2017-06-29 10:37:50 +02:00
eisa
extcon
firewire
firmware efi: avoid fortify checks in EFI stub 2017-07-12 16:26:02 -07:00
fmc
fpga
fsi drivers/fsi: fix fsi_slave_mode prototype 2017-07-17 16:13:54 +02:00
gpio gpio: tegra: fix unbalanced chained_irq_enter/exit 2017-08-02 10:42:38 +02:00
gpu Merge branch 'drm-fixes-4.13' of git://people.freedesktop.org/~agd5f/linux into drm-fixes 2017-08-04 11:43:14 +10:00
hid HID: ortek: add one more buggy device 2017-07-24 17:38:21 +02:00
hsi HSI changes for the v4.13 series 2017-07-04 14:28:22 -07:00
hv vmbus: re-enable channel tasklet 2017-07-17 15:00:47 +02:00
hwmon hwmon: (applesmc) Avoid buffer overruns 2017-07-15 16:38:56 -07:00
hwspinlock
hwtracing Char/Misc patches for 4.13-rc1 2017-07-03 20:55:59 -07:00
i2c Merge branch 'i2c/for-4.13' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux 2017-07-12 10:04:56 -07:00
ide ide: avoid warning for timings calculation 2017-07-21 04:37:22 +01:00
idle intel_idle: Use more common logging style 2017-06-29 22:58:35 +02:00
iio hwmon updates for v4.13: 2017-07-04 11:48:27 -07:00
infiniband RDMA/core: Initialize port_num in qp_attr 2017-07-20 11:24:13 -04:00
input Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input 2017-07-14 22:53:37 -07:00
iommu iommu/amd: Fix schedule-while-atomic BUG in initialization code 2017-07-26 15:39:14 +02:00
ipack
irqchip irqchip/digicolor: Drop unnecessary static 2017-07-18 21:59:23 +02:00
isdn isdn/i4l: fix buffer overflow 2017-08-02 20:43:36 -07:00
leds LED updates for 4.13 2017-07-06 11:32:40 -07:00
lguest
lightnvm lightnvm: pblk: advance bio according to lba index 2017-07-28 08:06:00 -06:00
macintosh Merge branch 'work.misc-set_fs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2017-07-05 13:13:32 -07:00
mailbox mailbox: pcc: Fix crash when request PCC channel 0 2017-07-26 02:11:47 +02:00
mcb
md Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md 2017-07-28 12:24:21 -07:00
media media: platform: davinci: drop VPFE_CMD_S_CCDC_RAW_PARAMS 2017-07-26 06:14:33 -04:00
memory ARM: SoC driver updates 2017-07-04 14:47:47 -07:00
memstick
message scsi: fusion: fix string overflow warning 2017-08-07 14:04:02 -04:00
mfd chrome-platform-for-linus-4.13 2017-07-11 09:55:47 -07:00
misc powerpc updates for 4.13 2017-07-07 13:55:45 -07:00
mmc mmc: block: bypass the queue even if usage is present for hotplug 2017-08-03 11:00:39 +02:00
mtd MTD updates for v4.13-rc1: 2017-07-13 12:07:44 -07:00
mux mux: mux-core: unregister mux_class in mux_exit() 2017-07-17 16:38:35 +02:00
net Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2017-07-31 22:36:42 -07:00
nfc NFC 4.13 pull request 2017-07-01 14:30:39 -07:00
ntb ntb: Add error path/handling to Debug FS entry creation 2017-07-06 11:30:08 -04:00
nubus
nvdimm libnvdimm: fix badblock range handling of ARS range 2017-07-17 11:43:58 -07:00
nvme nvme: validate admin queue before unquiesce 2017-07-26 17:41:41 +02:00
nvmem nvmem: rockchip-efuse: amend compatible rk322x-efuse to rk3228-efuse 2017-07-17 16:15:57 +02:00
of Merge remote-tracking branches 'asoc/fix/dpcm', 'asoc/fix/imx', 'asoc/fix/msm8916', 'asoc/fix/multi-pcm', 'asoc/fix/of-graph' and 'asoc/fix/pxa' into asoc-linus 2017-08-01 15:17:06 +01:00
oprofile
parisc parisc: pdc_stable: Fix locking when creating sysfs links 2017-07-31 16:43:13 +02:00
parport
pci Power management fixes for v4.13-rc1 2017-07-14 22:24:25 -07:00
pcmcia
perf drivers/perf: arm_pmu: Request PMU SPIs with IRQF_PER_CPU 2017-07-27 13:43:22 +01:00
phy phy: bcm-ns-usb3: fix MDIO_BUS dependency 2017-07-27 17:20:19 -07:00
pinctrl pinctrl: stm32: select IRQ_DOMAIN_HIERARCHY instead of depends on 2017-08-01 10:04:41 +02:00
platform platform/x86: intel-vbtn: match power button on press rather than release 2017-08-05 14:37:19 -07:00
pnp This is the bulk of GPIO changes for the v4.13 series: 2017-07-07 12:40:27 -07:00
power power supply and reset changes for the v4.13 series (part 2) 2017-07-13 11:47:59 -07:00
powercap powercap/RAPL: prevent overridding bits outside of the mask 2017-06-28 00:38:34 +02:00
pps
ps3
ptp ptp: dte: Use LL suffix for 64-bit constants 2017-07-06 11:40:58 +01:00
pwm pwm: Changes for v4.13-rc1 2017-07-13 11:49:52 -07:00
rapidio
ras arm64 updates for 4.13: 2017-07-05 17:09:27 -07:00
regulator Merge remote-tracking branches 'regulator/topic/settle', 'regulator/topic/tps65910' and 'regulator/topic/tps65917' into regulator-next 2017-07-03 16:52:21 +01:00
remoteproc remoteproc/keystone: Fix circular dependencies for ARM configs 2017-06-27 16:21:34 -07:00
reset ARM: SoC driver updates 2017-07-04 14:47:47 -07:00
rpmsg rpmsg updates for v4.13 2017-07-06 15:38:31 -07:00
rtc RTC for 4.13 2017-07-13 12:15:06 -07:00
s390 scsi: zfcp: trace HBA FSF response by default on dismiss or timedout late response 2017-08-10 19:36:57 -04:00
sbus sbus: Convert to using %pOF instead of full_name 2017-07-20 12:37:10 -07:00
scsi scsi: libcxgbi: use ndev->ifindex to find route 2017-08-10 19:36:52 -04:00
sfi
sh drivers/sh/intc/virq.c: delete an error message for a failed memory allocation in add_virq_to_pirq() 2017-07-06 16:24:30 -07:00
sn
soc soc: zte: Restrict SOC_ZTE to ARCH_ZX or COMPILE_TEST 2017-07-27 13:12:34 +02:00
spi Merge branch 'for-spi' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2017-07-08 10:41:53 -07:00
spmi spmi: pmic-arb: Always allocate ppid_to_apid table 2017-07-17 15:00:47 +02:00
ssb
staging media fixes for v4.13-rc2 2017-08-05 14:09:26 -07:00
target Add wait_for_random_bytes() and get_random_*_wait() functions so that 2017-07-15 12:44:02 -07:00
tc
tee
thermal Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/rzhang/linux 2017-07-14 13:12:32 -07:00
thunderbolt Merge branches 'pm-core' and 'pm-misc' 2017-08-03 20:29:45 +02:00
tty gpio: exar: Use correct property prefix and document bindings 2017-08-01 13:43:55 +02:00
uio
usb xhci: fix memleak in xhci_run() 2017-07-20 14:40:36 +02:00
uwb driver core patches for 4.13-rc1 2017-07-03 20:27:48 -07:00
vfio vfio/pci: Fix handling of RC integrated endpoint PCIe capability size 2017-07-27 10:39:33 -06:00
vhost Revert "vhost: cache used event for better performance" 2017-07-29 14:15:56 -07:00
video Merge branch 'akpm' (patches from Andrew) 2017-07-13 12:38:49 -07:00
virt
virtio virtio-balloon: coding format cleanup 2017-07-25 16:37:35 +03:00
vlynq
vme
w1 w1: omap-hdq: fix error return code in omap_hdq_probe() 2017-07-17 16:48:15 +02:00
watchdog Merge git://www.linux-watchdog.org/linux-watchdog 2017-07-11 09:59:37 -07:00
xen xen: dont fiddle with event channel masking in suspend/resume 2017-07-27 19:55:46 +02:00
zorro
Kconfig
Makefile