55 Commits

Author SHA1 Message Date
Gavin Shan
feadf7c0a1 powerpc/eeh: Lock module while handling EEH event
The EEH core talks to the PCI device driver to determine the action
to take (a plain reset or removal of the PCI device). During that period
the driver might be unloaded, which in turn crashes the kernel as follows:

EEH: Detected PCI bus error on PHB#4-PE#10000
EEH: This PCI device has failed 3 times in the last hour
lpfc 0004:01:00.0: 0:2710 PCI channel disable preparing for reset
Unable to handle kernel paging request for data at address 0x00000490
Faulting instruction address: 0xd00000000e682c90
cpu 0x1: Vector: 300 (Data Access) at [c000000fc75ffa20]
    pc: d00000000e682c90: .lpfc_io_error_detected+0x30/0x240 [lpfc]
    lr: d00000000e682c8c: .lpfc_io_error_detected+0x2c/0x240 [lpfc]
    sp: c000000fc75ffca0
   msr: 8000000000009032
   dar: 490
 dsisr: 40000000
  current = 0xc000000fc79b88b0
  paca    = 0xc00000000edb0380	 softe: 0	 irq_happened: 0x00
    pid   = 3386, comm = eehd
enter ? for help
[c000000fc75ffca0] c000000fc75ffd30 (unreliable)
[c000000fc75ffd30] c00000000004fd3c .eeh_report_error+0x7c/0xf0
[c000000fc75ffdc0] c00000000004ee00 .eeh_pe_dev_traverse+0xa0/0x180
[c000000fc75ffe70] c00000000004ffd8 .eeh_handle_event+0x68/0x300
[c000000fc75fff00] c0000000000503a0 .eeh_event_handler+0x130/0x1a0
[c000000fc75fff90] c000000000020138 .kernel_thread+0x54/0x70
1:mon>

The patch takes a reference on the corresponding driver module
while the EEH core negotiates with the PCI device driver, so that the
module can't be unloaded during that period and it is safe to call
the driver's callbacks.
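
The reference counting described above can be sketched in plain C. This is
a minimal userspace model, not the patch itself: struct demo_module and all
demo_* functions are hypothetical names invented for illustration. In the
kernel, the equivalent pinning would normally go through try_module_get()
and module_put().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a driver module; not the kernel's struct module. */
struct demo_module {
    int refcount;    /* number of users currently pinning the module */
    bool unloading;  /* set once an unload has been accepted */
};

/* Take a reference; fails if the module is already going away. */
static bool demo_module_get(struct demo_module *m)
{
    assert(m != NULL);
    if (m->unloading)
        return false;
    m->refcount++;
    return true;
}

/* Drop a reference taken by demo_module_get(). */
static void demo_module_put(struct demo_module *m)
{
    assert(m != NULL && m->refcount > 0);
    m->refcount--;
}

/* An unload request succeeds only while nobody holds a reference. */
static bool demo_module_unload(struct demo_module *m)
{
    assert(m != NULL);
    if (m->refcount > 0)
        return false;
    m->unloading = true;
    return true;
}

/* Invoke a driver callback only while the module is pinned.
 * Returns -1 if the module could not be pinned. */
static int demo_call_pinned(struct demo_module *m, int (*cb)(void *), void *arg)
{
    int rc;

    if (!demo_module_get(m))
        return -1;
    rc = cb(arg);
    demo_module_put(m);
    return rc;
}

/* Example callback: tries to unload the module in the middle of the
 * callback and returns 1 if that unload was refused. */
static int demo_cb_try_unload(void *arg)
{
    struct demo_module *m = arg;

    return demo_module_unload(m) ? 0 : 1;
}
```

With the module pinned, an unload attempted during the callback is refused.
Once the callback has returned and the reference is dropped, the unload goes
through and later callback attempts are rejected.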

Cc: stable@vger.kernel.org
Reported-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2012-09-18 15:32:48 +10:00
Gavin Shan
20ee6a9708 powerpc/eeh: Remove EEH PE for normal PCI hotplug
Function eeh_rmv_from_parent_pe() could be called by the path of
either normal PCI hotplug, or EEH recovery. For the former case,
we need to purge the corresponding PE on removal of the associated
PE bus.

The patch tries to cover that by passing more information to function
pcibios_remove_pci_devices() so that we know if the corresponding PE
needs to be purged or be marked as "invalid".

Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2012-09-18 15:32:23 +10:00
Gavin Shan
dbbceee12f powerpc/eeh: Move stats to PE
The patch removes the EEH-related statistics from the eeh device since
they are now maintained by the corresponding EEH PE. Also, the
flags used to track the state of the eeh device and PE have been
reworked slightly.

Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2012-09-10 09:35:43 +10:00
Gavin Shan
9b3c76f081 powerpc/eeh: Handle EEH error based on PE
The patch reworks the current implementation so that EEH errors
are handled based on the PE instead of the eeh device.

Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2012-09-10 09:35:43 +10:00
Gavin Shan
40a7cd9219 powerpc/eeh: Replace pci_dn with eeh_dev for EEH aux components
The original EEH implementation depends heavily on struct pci_dn, so
EEH-related information has to be stored in pci_dn. We could instead
split struct pci_dn so that the EEH-sensitive information forms a
struct of its own, which makes EEH more independent.

The patch replaces pci_dn with eeh_dev for EEH aux components such as
event and driver. The eeh_event struct has also been adjusted
slightly: since eeh_dev already links the associated FDT (Flat Device
Tree) node and PCI device, eeh_event no longer needs to track the FDT
node and PCI device itself. It can simply track eeh_dev.

The patch also renames function pcid_name() to eeh_pcid_name(), which
was missed in the previous patch that cleaned up the EEH aux
components.

Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2012-03-09 11:39:46 +11:00
Gavin Shan
29f8bf1b7f powerpc/pseries: Cleanup comments in EEH aux components
There are several EEH aux components, and the patch cleans them up:

        * Duplicated comments have been removed from the header file.
        * Comments have been reorganized so that they read more cleanly.
        * The leading comments of functions have been adjusted slightly
          so that the result of "make pdfdocs" is more uniform.
        * Function calls of the form "xxx ()" have been replaced by "xxx()".

Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2012-03-09 11:11:20 +11:00
Gavin Shan
1823fbf119 powerpc/eeh: pseries platform EEH configure bridge
To enable a particular PCI device that is included in the parent PE,
any PCI bridges involved must be enabled explicitly. On the pSeries
platform, dedicated RTAS calls serve that purpose.

The patch implements the function of configuring PCI bridges through
the dedicated RTAS calls. Besides, the function has been abstracted
by struct eeh_ops::configure_bridge so that the EEH core components
could support multiple platforms in future.

Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2012-03-09 11:11:11 +11:00
Gavin Shan
8d633291b4 powerpc/eeh: pseries platform EEH error log retrieval
On RTAS compliant pSeries platform, one dedicated RTAS call has
been introduced to retrieve EEH temporary or permanent error log.

The patch implements retrieval of the EEH error log through an
RTAS call. Besides, it has been abstracted by struct eeh_ops::get_log
so that EEH core components could support multiple platforms in future.

Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2012-03-09 11:11:01 +11:00
Gavin Shan
b0e5f742f1 powerpc/eeh: pseries platform EEH wait PE state
On pSeries platform, the PE state might be temporarily unavailable.
In that case, the firmware will return the corresponding wait time.
That means the kernel has to wait for the indicated time before it
can get the PE state.
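
A wait-and-retry loop of this kind might look like the following sketch.
It is a userspace model under stated assumptions: the demo_* names, the
DEMO_STATE_UNAVAILABLE code, and the fake firmware are all hypothetical, and
the sleep function only records the requested delay instead of sleeping.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_STATE_UNAVAILABLE (-1)  /* hypothetical "ask again later" code */

/* Hypothetical firmware query: returns a state >= 0, or
 * DEMO_STATE_UNAVAILABLE with *wait_ms set to the requested delay. */
typedef int (*demo_get_state_fn)(void *ctx, int *wait_ms);
typedef void (*demo_sleep_fn)(void *ctx, int ms);

/* Poll until the state is available or the wait budget is exhausted. */
static int demo_wait_state(demo_get_state_fn get_state, demo_sleep_fn do_sleep,
                           void *ctx, int max_wait_ms)
{
    assert(get_state != NULL && do_sleep != NULL);

    for (;;) {
        int wait_ms = 0;
        int state = get_state(ctx, &wait_ms);

        if (state != DEMO_STATE_UNAVAILABLE)
            return state;
        /* Give up if the requested delay does not fit in the budget. */
        if (wait_ms <= 0 || wait_ms > max_wait_ms)
            return DEMO_STATE_UNAVAILABLE;
        do_sleep(ctx, wait_ms);
        max_wait_ms -= wait_ms;
    }
}

/* Fake firmware used to exercise the loop. */
struct demo_fw {
    int busy_polls;   /* how many polls still report "unavailable" */
    int wait_ms;      /* delay requested on each busy poll */
    int final_state;  /* state reported once no longer busy */
    int slept_ms;     /* total delay the caller has waited */
};

static int demo_fw_get_state(void *ctx, int *wait_ms)
{
    struct demo_fw *fw = ctx;

    if (fw->busy_polls > 0) {
        fw->busy_polls--;
        *wait_ms = fw->wait_ms;
        return DEMO_STATE_UNAVAILABLE;
    }
    return fw->final_state;
}

static void demo_fw_sleep(void *ctx, int ms)
{
    struct demo_fw *fw = ctx;

    fw->slept_ms += ms;  /* record the delay instead of really sleeping */
}
```

Bounding the total wait is a design choice of this sketch. It keeps a
misbehaving firmware from stalling the caller forever.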

The patch does the implementation for that. Besides, the function
has been abstracted through struct eeh_ops::wait_state so that EEH core
components could support multiple platforms in future.

Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2012-03-09 11:10:39 +11:00
Gavin Shan
eb594a4754 powerpc/eeh: pseries platform PE state retrieval
On pSeries platform, there're 2 dedicated RTAS calls introduced to
retrieve the corresponding PE's state: ibm,read-slot-reset-state and
ibm,read-slot-reset-state2.

The patch implements the retrieval of PE's state according to the
given PE address. Besides, the implementation has been abstracted by
struct eeh_ops::get_state so that EEH core components could support
multiple platforms in future.

Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2012-03-09 11:10:26 +11:00
Gavin Shan
8fb8f70902 powerpc/eeh: pseries platform EEH operations
There're 4 EEH operations that are covered by the dedicated RTAS
call <ibm,set-eeh-option>: enable or disable EEH, enable MMIO and
enable DMA. Early in system boot, the kernel tries to enable EEH on
each PCI-device-related device node. MMIO and DMA for a
particular PE should be enabled when recovering from EEH errors
so that the PE can function properly again.

The patch implements these operations and abstracts them through
struct eeh_ops::set_eeh, which should help EEH support multiple
platforms in future.

Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2012-03-09 11:09:49 +11:00
Gavin Shan
cce4b2d243 powerpc/eeh: Cleanup function names in the EEH core
EEH has been implemented on the pSeries platform. The original
code looks a little bit nasty. The patch cleans up the current
EEH implementation:

        * Add the prefix "eeh" to function names where possible.
        * Adjust some function names so that they are shorter and
          more meaningful.

Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2012-03-09 11:08:37 +11:00
Richard A Lary
82578e192b powerpc/eeh: Display eeh error location for bus and device
For adapters which have devices under a PCIe switch/bridge it is informative
  to display information for both the PCIe switch/bridge and the device on
  which the bus error was detected.

  rebased to powerpc-next

Signed-off-by: Richard A Lary <rlary@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2011-05-06 13:32:31 +10:00
Breno Leitao
8d3d50bf19 powerpc/eeh: Fix a bug when pci structure is null
During an EEH recovery, the pci_dev structure can be NULL, mainly if an
EEH event is detected during a PCI config operation. In this case the
pci_dev is not known (and is NULL), and the kernel crashes
with the following message:

Unable to handle kernel paging request for data at address 0x000000a0
Faulting instruction address: 0xc00000000006b8b4
Oops: Kernel access of bad area, sig: 11 [#1]

NIP [c00000000006b8b4] .eeh_event_handler+0x10c/0x1a0
LR [c00000000006b8a8] .eeh_event_handler+0x100/0x1a0
Call Trace:
[c0000003a80dff00] [c00000000006b8a8] .eeh_event_handler+0x100/0x1a0
[c0000003a80dff90] [c000000000031f1c] .kernel_thread+0x54/0x70

The bug occurs because pci_name() tries to access a null pointer.
This patch just guarantees that pci_name() is not called on NULL pointers.

Signed-off-by: Breno Leitao <leitao@linux.vnet.ibm.com>
Signed-off-by: Linas Vepstas <linasvepstas@gmail.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2010-02-17 14:02:47 +11:00
Frans Pop
8354be9c10 powerpc: Remove trailing space in messages
Signed-off-by: Frans Pop <elendil@planet.nl>
Cc: linuxppc-dev@ozlabs.org
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2010-02-09 13:56:23 +11:00
Michael Ellerman
59e3f83702 powerpc/pseries: Use irq_has_action() in eeh_disable_irq()
Rather than open-coding our own check, use irq_has_action()
to check if an irq has an action - ie. is "in use".

irq_has_action() doesn't take the descriptor lock, but it
shouldn't matter - we're just using it as an indicator
that the irq is in use. disable_irq_nosync() will also take
the descriptor lock before doing anything.

Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Acked-by: Grant Likely <grant.likely@secretlab.ca>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2009-10-30 17:20:54 +11:00
Zhang, Yanmin
70298c6e6c PCI AER: support Multiple Error Received and no error source id
Based on PCI Express AER specs, a root port might receive multiple
TLP errors while it could only save a correctable error source id
and an uncorrectable error source id at the same time. In addition,
some root port hardware might be unable to provide a correct source
id, i.e., the source id, or the bus id part of the source id provided
by root port might be equal to 0.

The patchset implements the support in kernel by searching the device
tree under the root port.

Patch 1 changes parameter cb of function pci_walk_bus to return a value.
When cb returns non-zero, pci_walk_bus stops searching the
device tree.
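
The early-termination idea can be illustrated with a small tree walk. This
is a sketch only: struct demo_dev and the demo_* functions are hypothetical
and are not the kernel's pci_walk_bus(). Having the walk itself return the
callback's value is a choice made for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical device node; not the kernel's struct pci_dev or pci_bus. */
struct demo_dev {
    int id;
    struct demo_dev *child;    /* first device on the subordinate bus */
    struct demo_dev *sibling;  /* next device on the same bus */
};

typedef int (*demo_walk_cb)(struct demo_dev *dev, void *data);

/* Depth-first walk that stops as soon as the callback returns non-zero
 * and hands that value back to the caller. */
static int demo_walk_bus(struct demo_dev *first, demo_walk_cb cb, void *data)
{
    struct demo_dev *dev;
    int rc;

    assert(cb != NULL);
    for (dev = first; dev != NULL; dev = dev->sibling) {
        rc = cb(dev, data);
        if (rc)
            return rc;
        rc = demo_walk_bus(dev->child, cb, data);
        if (rc)
            return rc;
    }
    return 0;
}

/* Search state shared with the callback. */
struct demo_search {
    int wanted_id;
    int visited;             /* number of devices the callback has seen */
    struct demo_dev *found;
};

/* Callback: returns 1 to stop the walk once the wanted device is found. */
static int demo_find_cb(struct demo_dev *dev, void *data)
{
    struct demo_search *s = data;

    s->visited++;
    if (dev->id == s->wanted_id) {
        s->found = dev;
        return 1;
    }
    return 0;
}
```

Stopping on the first non-zero return avoids visiting the rest of the tree
once the error source has been located.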

Reviewed-by: Andrew Patterson <andrew.patterson@hp.com>
Signed-off-by: Zhang Yanmin <yanmin_zhang@linux.intel.com>
Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
2009-06-16 14:30:13 -07:00
Mike Mason
c58dc575f3 powerpc/pseries: Set error_state to pci_channel_io_normal in eeh_report_reset()
While adding native EEH support to Emulex and Qlogic drivers, it was
discovered that dev->error_state was set to pci_channel_io_normal too
late in the recovery process. These drivers rely on error_state to
determine if they can access the device in their slot_reset callback,
thus error_state needs to be set to pci_channel_io_normal in
eeh_report_reset(). Below is a detailed explanation (courtesy of Richard
Lary) as to why this is necessary.

Background:
PCI MMIO or DMA accesses to a frozen slot generate additional EEH
errors. If the number of additional EEH errors exceeds EEH_MAX_FAILS the
adapter will be shut down. To avoid triggering excessive EEH errors and
an undesirable adapter shutdown, some drivers use the
pci_channel_offline(dev) wrapper function to return a Boolean value
based on the value of pci_dev->error_state to determine if PCI MMIO or
DMA accesses are safe. If the wrapper returns TRUE, drivers must not
make PCI MMIO or DMA access to their hardware.

The pci_dev structure member error_state reflects one of three values,
1) pci_channel_io_normal, 2) pci_channel_io_frozen, 3)
pci_channel_io_perm_failure.  Function pci_channel_offline(dev) returns
TRUE if error_state is pci_channel_io_frozen or pci_channel_io_perm_failure.
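
A driver-side guard built on this check could look like the sketch below.
The enum, struct, and demo_* functions are hypothetical userspace stand-ins
that mirror the three states named above. They are not the kernel
definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical mirror of the three error_state values described above. */
enum demo_channel_state {
    DEMO_IO_NORMAL = 1,
    DEMO_IO_FROZEN,
    DEMO_IO_PERM_FAILURE,
};

/* Hypothetical stand-in for the part of pci_dev that matters here. */
struct demo_pci_dev {
    enum demo_channel_state error_state;
};

/* True when the channel is frozen or has failed permanently. */
static bool demo_channel_offline(const struct demo_pci_dev *dev)
{
    assert(dev != NULL);
    return dev->error_state == DEMO_IO_FROZEN ||
           dev->error_state == DEMO_IO_PERM_FAILURE;
}

/* Driver-side guard: refuse to touch the (simulated) register while the
 * channel is offline, so no additional errors are triggered. */
static int demo_read_reg(const struct demo_pci_dev *dev,
                         const unsigned int *reg, unsigned int *out)
{
    assert(reg != NULL && out != NULL);
    if (demo_channel_offline(dev))
        return -1;
    *out = *reg;
    return 0;
}
```

The guard only helps if error_state tracks the real hardware state, which
is why the point at which it returns to normal matters.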

The EEH driver sets pci_dev->error_state to pci_channel_io_frozen at the
point where the PCI slot is frozen. Currently, the EEH driver restores
dev->error_state to pci_channel_io_normal in eeh_report_resume() before
calling the driver's resume callback. However, when the EEH driver calls
the driver's slot_reset callback() from eeh_report_reset(), it
incorrectly indicates the error state is still pci_channel_io_frozen.

Waiting until eeh_report_resume() to restore dev->error_state to
pci_channel_io_normal is too late for Emulex and QLogic FC drivers and
any other drivers which are designed to use common code paths in these
two cases: i) those called after the driver's slot_reset callback() and
ii) those called after the PCI slot is frozen but before the driver's
slot_reset callback is called. Case i) covers all driver paths executed to
reinitialize the hardware after a reset, and case ii) covers all code paths
executed by driver kernel threads that run asynchronously to the main
driver thread, such as interrupt handlers and worker threads that process
driver work queues.

Emulex and QLogic FC drivers are designed with common code paths which
require that pci_channel_offline(dev) reflect the true state of the
hardware. The state transitions that the hardware takes from Normal
Operations to Slot Frozen to Reset to Normal Operations are documented
in the Power Architecture™ Platform Requirements+ (PAPR+) in Table 75.
PE State Control.

PAPR defines the following 3 states:

0 -- Not reset, Not EEH stopped, MMIO load/store allowed, DMA allowed
     (Normal Operations)
1 -- Reset, Not EEH stopped, MMIO load/store disabled, DMA disabled
2 -- Not reset, EEH stopped, MMIO load/store disabled, DMA disabled
     (Slot Frozen)

An EEH error places the slot in state 2 (Frozen) and the adapter driver
is notified that an EEH error was detected. If the adapter driver
returns PCI_ERS_RESULT_NEED_RESET, the EEH driver calls
eeh_reset_device() to place the slot into state 1 (Reset) and
eeh_reset_device completes by placing the slot into State 0 (Normal
Operations). Upon return from eeh_reset_device(), the EEH driver calls
eeh_report_reset, which then calls the adapter's slot_reset callback. At
the time the adapter's slot_reset callback is called, the true state of
the hardware is Normal Operations and should be accurately reflected by
setting dev->error_state to pci_channel_io_normal.

The current implementation of EEH driver does not do so and requires
this change to correct this deficiency.

Signed-off-by: Mike Mason <mmlnx@us.ibm.com>
Acked-by: Linas Vepstas <linasvepstas@gmail.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
2009-04-15 15:23:53 +10:00
Mike Mason
8535ef05a6 powerpc/eeh: Only disable/enable LSI interrupts in EEH
The EEH code disables and enables interrupts during the
device recovery process.  This is unnecessary for MSI
and MSI-X interrupts because they are effectively disabled
by the DMA Stopped state when an EEH error occurs.  The
current code is also incorrect for MSI-X interrupts.  It
doesn't take into account that MSI-X interrupts are tracked
in a different way than LSI/MSI interrupts.  This patch
ensures only LSI interrupts are disabled/enabled.

Signed-off-by: Mike Mason <mmlnx@us.ibm.com>
Acked-by: Linas Vepstas <linasvepstas@gmail.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2009-02-11 16:00:08 +11:00
Tony Breeds
dcfcfe7567 powerpc: Guard print_device_node_tree() with #if 0
Currently print_device_node_tree() isn't called but it can be useful for
debugging.  Leave the function there but hide it behind '#if 0' to save
it being rewritten.  If you want to call it you're already editing this
file anyway. ;P

Signed-off-by: Tony Breeds <tony@bakeyournoodle.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
2008-08-20 16:34:57 +10:00
Andrew Morton
8e01520c06 [POWERPC] Fix warning in pseries/eeh_driver.c
Fix this:

/usr/src/devel/arch/powerpc/platforms/pseries/eeh_driver.c: In function 'print_device_node_tree':
/usr/src/devel/arch/powerpc/platforms/pseries/eeh_driver.c:55: warning: ISO C90 forbids mixed declarations and code

also make that function look like it's part of Linux.

Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
2008-06-16 15:00:44 +10:00
Stephen Rothwell
b76e5e9398 [POWERPC] EEH: Avoid a possible NULL pointer dereference
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>
2007-12-11 13:46:12 +11:00
Linas Vepstas
5f1a7c811b [POWERPC] EEH: Report errors as soon as possible
Do not wait for the pci slot status before reporting an error
to the device driver. Some systems may take many seconds to
report the slot status, and this can confuse unsuspecting
device drivers.

Signed-off-by: Linas Vepstas <linas@austin.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
2007-12-03 13:56:26 +11:00
Linas Vepstas
2a50f144fc [POWERPC] EEH: Drivers that need reset trump others
Bugfix: if a driver controlling one part of a multi-function PCI card
has asked for a reset, honor that request above all others.
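
One way to combine the answers of several drivers so that a reset request
wins is sketched here. The enum values and demo_* functions are
hypothetical. Keeping the first non-empty answer when nobody asks for a
reset is an assumption of this sketch, not something the message states.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-driver answers; not the kernel's pci_ers_result_t. */
enum demo_result {
    DEMO_NONE = 0,      /* driver gave no answer */
    DEMO_CAN_RECOVER,   /* driver can recover without a reset */
    DEMO_NEED_RESET,    /* driver asks for a slot reset */
};

/* Merge one more answer into the running result: a reset request
 * overrides everything, otherwise the first real answer is kept. */
static enum demo_result demo_merge_result(enum demo_result so_far,
                                          enum demo_result next)
{
    if (next == DEMO_NEED_RESET)
        return DEMO_NEED_RESET;
    if (so_far == DEMO_NONE)
        return next;
    return so_far;
}

/* Collect the answers of all functions of a multi-function card. */
static enum demo_result demo_collect(const enum demo_result *votes, size_t n)
{
    enum demo_result res = DEMO_NONE;
    size_t i;

    assert(votes != NULL || n == 0);
    for (i = 0; i < n; i++)
        res = demo_merge_result(res, votes[i]);
    return res;
}
```

Because the reset check comes first, the order in which the functions
answer does not change the outcome when any of them asks for a reset.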

Signed-off-by: Linas Vepstas <linas@austin.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
2007-11-08 14:15:32 +11:00
Linas Vepstas
638799b335 [POWERPC] EEH: Clean up comments
Clean up commentary, remove dead code.

Signed-off-by: Linas Vepstas <linas@austin.ibm.com>

Signed-off-by: Paul Mackerras <paulus@samba.org>
2007-11-08 14:15:32 +11:00
Linas Vepstas
3c8c90ab88 [POWERPC] Tweak EEH copyright info
Twiddle the copyright notices. Per current guidelines, the use
of the (C) or (c) in source code is deprecated.

Signed-off-by: Linas Vepstas <linas@austin.ibm.com>

----
 arch/powerpc/platforms/pseries/eeh.c        |    6 +++++-
 arch/powerpc/platforms/pseries/eeh_cache.c  |    3 ++-
 arch/powerpc/platforms/pseries/eeh_driver.c |    6 +++---
 3 files changed, 10 insertions(+), 5 deletions(-)
Signed-off-by: Paul Mackerras <paulus@samba.org>
2007-06-14 22:29:56 +10:00
Linas Vepstas
17213c3bf6 [POWERPC] Assorted janitorial EEH cleanups
Assorted minor cleanups to EEH code; -- use literals, use
kerneldoc format.

Signed-off-by: Linas Vepstas <linas@austin.ibm.com>

----
 arch/powerpc/platforms/pseries/eeh.c        |   13 ++++++++++---
 arch/powerpc/platforms/pseries/eeh_driver.c |    7 ++++---
 include/asm-powerpc/ppc-pci.h               |   18 +++++++++++++++---
 3 files changed, 29 insertions(+), 9 deletions(-)
Signed-off-by: Paul Mackerras <paulus@samba.org>
2007-05-10 21:28:13 +10:00
Linas Vepstas
b455b24cf2 [POWERPC] EEH: Split up long error msg
Make some minor adjustments to the EEH error messages.

Signed-off-by: Linas Vepstas <linas@austin.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
2007-05-09 16:35:01 +10:00
Linas Vepstas
ede8ca269f [POWERPC] EEH: log error only after driver notification.
It turns out many/most versions of firmware enable MMIO when
the slot-error-detail rtas call is made (in violation of the
architecture). Thus, it would be best to call slot-error-detail
only after notifying device drivers of a freeze, as otherwise,
a variety of strange and unexpected things may happen.

Signed-off-by: Linas Vepstas <linas@austin.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
2007-05-09 16:35:00 +10:00
Stephen Rothwell
e2eb63927b [POWERPC] Rename get_property to of_get_property: arch/powerpc
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
2007-04-13 03:55:19 +10:00
Linas Vepstas
4980d5eb75 [POWERPC] EEH: restructure multi-function support
Rework how multi-function PCI devices are identified and traversed.
This fixes a bug with multi-function recovery on Power4 that was
introduced by a recent Power4 EEH patch.

Signed-off-by: Linas Vepstas <linas@austin.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
2007-03-22 22:52:57 +11:00
Linas Vepstas
fa1be476a2 [POWERPC] EEH: verify state change
After requesting a state change, verify that the state change
actually occurred, and that the system ends up in the expected state.

Signed-off-by: Linas Vepstas <linas@austin.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
2007-03-22 22:52:56 +11:00
Linas Vepstas
d0ab95ca98 [POWERPC] EEH: rm un-needed data
The EEH event notification system passes around data that is
not needed or at least, not used properly. Stop passing this
data; get it in a more reliable fashion.

Signed-off-by: Linas Vepstas <linas@austin.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
2007-03-22 22:52:55 +11:00
Linas Vepstas
5794dbcbab [POWERPC] EEH: multifunction recovery bugfix
If the second or higher function of a multi-function device fails
to recover, this failure is not reported upwards. Fix this.

Signed-off-by: Linas Vepstas <linas@austin.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
2007-03-22 22:52:53 +11:00
Linas Vepstas
90fdd6130f [POWERPC] EEH: hotplug recovery bugfix
If a device driver does not have native PCI error recovery,
a hotplug error recovery will be attempted. In this case,
the device driver will not report back whether it's healthy
or not; simply assume that it is.

Signed-off-by: Linas Vepstas <linas@austin.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
2007-03-22 22:52:52 +11:00
Linas Vepstas
e0f90b6418 [POWERPC] EEH: Add clarifying messages.
There are multiple code paths that result in a "permanent
failure"; when examining rare events, it can be hard to see
which was taken. This patch adds printk's to assist.

Signed-off-by: Linas Vepstas <linas@austin.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
2007-03-22 22:52:50 +11:00
Linas Vepstas
a885902de3 [POWERPC] Clarify EEH error message
Clarify error message re EEH permanent failure.

Signed-off-by: Linas Vepstas <linas@austin.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
2007-01-24 21:13:56 +11:00
Linas Vepstas
d0e70341c0 [POWERPC] EEH recovery tweaks
If one attempts to create a device driver recovery sequence that
does not depend on a hard reset of the device, but simply just
attempts to resume processing, then one discovers that the
recovery sequence implemented on powerpc is not quite right.
This patch fixes this up.

Signed-off-by: Linas Vepstas <linas@austin.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
2006-12-08 17:10:18 +11:00
Linas Vepstas
6a1ca373a1 [POWERPC] EEH: support MMIO enable recovery step
Update to the PowerPC PCI error recovery code.

Add code to enable MMIO if a device driver reports that it is capable
of recovering on its own.  One anticipated use of this is having a device
driver enable MMIO so that it can take a register dump, which might
then be followed by the device driver requesting a full reset.

Signed-off-by: Linas Vepstas <linas@austin.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
2006-09-21 22:59:20 +10:00
Linas Vepstas
cb5b562444 [POWERPC] EEH: code comment cleanup
Clean up subroutine documentation; mostly formatting changes, with
some new content.

Signed-off-by: Linas Vepstas <linas@austin.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
2006-09-21 22:59:10 +10:00
Jeremy Kerr
954a46e2d5 [POWERPC] pseries: Constify & voidify get_property()
Now that get_property() returns a void *, there's no need to cast its
return value. Also, treat the return value as const, so we can
constify get_property later.

pseries platform changes.

Built for pseries_defconfig

Signed-off-by: Jeremy Kerr <jk@ozlabs.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
2006-07-31 15:55:04 +10:00
Adrian Bunk
80f7228b59 typo fixes: occuring -> occurring
Signed-off-by: Adrian Bunk <bunk@stusta.de>
2006-06-30 18:27:16 +02:00
Linas Vepstas
0aa8d15b01 [POWERPC] pseries: Print PCI slot location code on failure
The PCI error recovery code will printk diagnostic info when
a PCI error event occurs. Change the messages to include the slot
location code, which is how most sysadmins will know the device.

Signed-off-by: Linas Vepstas <linas@austin.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
2006-06-21 15:01:32 +10:00
Linas Vepstas
4240545661 [PATCH] powerpc/pseries: Increment fail counter in PCI recovery
When a PCI device driver does not support PCI error recovery,
the powerpc/pseries code takes a walk through a branch of code
that resets the failure counter. Because of this, if a broken
PCI card is present, the kernel will attempt to reset it an
infinite number of times. (This is annoying but mostly harmless:
each reset takes about 10-20 seconds, and uses almost no CPU time).

This patch preserves the failure count across resets.

Signed-off-by: Linas Vepstas <linas@austin.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
2006-05-19 13:51:12 +10:00
Linas Vepstas
ac325acd50 [PATCH] powerpc/pseries: clear PCI failure counter if no new failures
The current PCI error recovery system keeps track of the number of PCI card
resets, and refuses to bring a card back up if this number is too large.
The goal of doing this was to avoid an infinite loop of resets if a card is
obviously dead.  However, if the failures are rare, but the machine has a
high uptime, this mechanism might still be triggered; this is too harsh.

This patch avoids this problem by decrementing the fail count after an
hour.  Thus, as long as a PCI card BSOD's fewer than 6 times an hour, it
will continue to be reset indefinitely.  If its failure rate is greater
than that, it will be taken off-line permanently.
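
The decaying counter can be modelled as follows. This is a sketch under
assumptions, not the patch: all demo_* names are hypothetical, the limit of
5 is chosen only to match the "fewer than 6 times an hour" statement, and
time is passed in as plain seconds. Each recorded failure stops counting one
hour after it happened, which stands in for the delayed decrement.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_MAX_FREEZES   5      /* assumption: tolerate up to 5 per hour */
#define DEMO_DECAY_SECONDS 3600L  /* a failure stops counting after an hour */

struct demo_fail_tracker {
    long stamps[DEMO_MAX_FREEZES + 1];  /* times of failures still counted */
    size_t count;
    bool offline;                       /* set once the card is given up on */
};

/* Forget failures that are at least one hour old. */
static void demo_expire(struct demo_fail_tracker *t, long now)
{
    size_t i, kept = 0;

    for (i = 0; i < t->count; i++)
        if (now - t->stamps[i] < DEMO_DECAY_SECONDS)
            t->stamps[kept++] = t->stamps[i];
    t->count = kept;
}

/* Record a failure at time `now` (in seconds).
 * Returns true if the device may be reset again, false if it is
 * (or already was) taken off-line permanently. */
static bool demo_record_failure(struct demo_fail_tracker *t, long now)
{
    assert(t != NULL);
    if (t->offline)
        return false;
    demo_expire(t, now);
    t->stamps[t->count++] = now;
    if (t->count > DEMO_MAX_FREEZES)
        t->offline = true;
    return !t->offline;
}
```

A card that fails rarely keeps being reset no matter how long the machine
stays up, while a burst of failures inside one hour still takes it
off-line.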

This patch is larger than it might otherwise be because it changes
indentation by removing a pointless while-loop.  The while loop is not
needed, as the handler is invoked once for each event (by schedule_work());
the loop is leftover cruft from an earlier implementation.

Signed-off-by: Linas Vepstas <linas@austin.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
2006-04-22 18:46:13 +10:00
Linas Vepstas
a219be2cf4 [PATCH] powerpc/pseries: fix device name printing, again.
The recent patch to print device names in EEH reset messages
was lacking ... this patch works better.

Signed-off-by: Linas Vepstas <linas@linas.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
2006-04-01 22:37:02 +11:00
Linas Vepstas
8df83028cf [PATCH] powerpc/pseries: print message if EEH recovery fails
The current code prints an ambiguous message if the recovery
of a failed PCI device fails. Give this special case its own
unique message.

Signed-off-by: Linas Vepstas <linas@austin.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
2006-04-01 22:35:01 +11:00
Linas Vepstas
b4f382a3e5 [PATCH] powerpc/pseries: Cleanup device name printing.
This avoids printk'ing a NULL string.

Signed-off-by: Linas Vepstas <linas@linas.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
2006-03-27 14:48:46 +11:00
Olaf Hering
273d280381 [PATCH] powerpc: fix NULL pointer in handle_eeh_events
This patch fixes a crash in handle_eeh_events,
but ethtool -t still doesn't work right.

...
pepino:~ # cpu 0x3: Vector: 300 (Data Access) at [c00000005192bbe0]
    pc: c00000000004a380: .handle_eeh_events+0xe0/0x23c
    lr: c00000000004a374: .handle_eeh_events+0xd4/0x23c
    sp: c00000005192be60
   msr: 9000000000009032
   dar: 268
 dsisr: 40000000
  current = 0xc0000001fe7bf1a0
  paca    = 0xc00000000048b280
    pid   = 16322, comm = eehd
enter ? for help
[c00000005192bf00] c00000000004a808 .eeh_event_handler+0xcc/0x130
[c00000005192bf90] c000000000025e00 .kernel_thread+0x4c/0x68

...

(none):/# /usr/sbin/ethtool -i eth0
driver: e100
version: 3.5.10-k2-NAPI
firmware-version: N/A
bus-info: 0000:21:01.0
(none):/# /usr/sbin/ethtool -t eth0
Call Trace:
[C00000000F8DEFF0] [C00000000000F270] .show_stack+0x74/0x1b4 (unreliable)
[C00000000F8DF0A0] [C000000000049D04] .eeh_dn_check_failure+0x290/0x2d8
[C00000000F8DF150] [C000000000049E58] .eeh_check_failure+0x10c/0x138
[C00000000F8DF1E0] [C0000000002DFDB0] .e100_hw_reset+0x70/0xf4
[C00000000F8DF270] [C0000000002E1BBC] .e100_hw_init+0x2c/0x260
[C00000000F8DF310] [C0000000002E2464] .e100_loopback_test+0x8c/0x220
[C00000000F8DF3C0] [C0000000002E28DC] .e100_diag_test+0xdc/0x16c
[C00000000F8DF490] [C000000000420BE0] .dev_ethtool+0xf24/0x14f8
[C00000000F8DF8F0] [C00000000041F4A8] .dev_ioctl+0x5cc/0x740
[C00000000F8DFA20] [C00000000040FEFC] .sock_ioctl+0x3d0/0x404
[C00000000F8DFAC0] [C0000000000D513C] .do_ioctl+0x68/0x108
[C00000000F8DFB50] [C0000000000D56B0] .vfs_ioctl+0x4d4/0x510
[C00000000F8DFC10] [C0000000000D5740] .sys_ioctl+0x54/0x94
[C00000000F8DFCC0] [C0000000000FB6EC] .ethtool_ioctl+0x11c/0x150
[C00000000F8DFD60] [C0000000000F7E40] .compat_sys_ioctl+0x338/0x3bc
[C00000000F8DFE30] [C00000000000871C] syscall_exit+0x0/0x40
EEH: Detected PCI bus error on device 0000:21:01.0
EEH: This PCI device has failed 1 times since last reboot: <NULL> -

modprobe: FATAL: Could not load /lib/modules/2.6.16-rc4-git7/modules.dep: No such file or directory

Cannot get strings: No such device
(none):/#
(none):/# EEH: Unable to configure device bridge (-3) for /pci@400000000110/pci@2,2

(none):/# Call Trace:
[C00000000FA17940] [C00000000000F270] .show_stack+0x74/0x1b4 (unreliable)
[C00000000FA179F0] [C000000000049D04] .eeh_dn_check_failure+0x290/0x2d8
[C00000000FA17AA0] [C00000000001E114] .rtas_read_config+0x120/0x154
[C00000000FA17B40] [C000000000049664] .early_enable_eeh+0x274/0x2bc
[C00000000FA17C00] [C000000000049708] .eeh_add_device_early+0x5c/0x6c
[C00000000FA17C90] [C000000000049748] .eeh_add_device_tree_early+0x30/0x5c
[C00000000FA17D20] [C000000000046568] .pcibios_add_pci_devices+0x8c/0x1f8
[C00000000FA17DD0] [C00000000004A528] .eeh_reset_device+0xe0/0x110
[C00000000FA17E60] [C00000000004A698] .handle_eeh_events+0x140/0x250
[C00000000FA17F00] [C00000000004AC7C] .eeh_event_handler+0xe8/0x140
[C00000000FA17F90] [C000000000025784] .kernel_thread+0x4c/0x68
EEH: Detected PCI bus error on device <NULL>
EEH: This PCI device has failed 1 times since last reboot: <NULL> -
EEH: Unable to configure device bridge (-3) for /pci@400000000110/pci@2,2
Call Trace:
[C00000000FA17940] [C00000000000F270] .show_stack+0x74/0x1b4 (unreliable)
[C00000000FA179F0] [C000000000049D04] .eeh_dn_check_failure+0x290/0x2d8
[C00000000FA17AA0] [C00000000001E114] .rtas_read_config+0x120/0x154
[C00000000FA17B40] [C000000000049664] .early_enable_eeh+0x274/0x2bc
[C00000000FA17C00] [C000000000049708] .eeh_add_device_early+0x5c/0x6c
[C00000000FA17C90] [C000000000049748] .eeh_add_device_tree_early+0x30/0x5c
[C00000000FA17D20] [C000000000046568] .pcibios_add_pci_devices+0x8c/0x1f8
[C00000000FA17DD0] [C00000000004A528] .eeh_reset_device+0xe0/0x110
[C00000000FA17E60] [C00000000004A698] .handle_eeh_events+0x140/0x250
[C00000000FA17F00] [C00000000004AC7C] .eeh_event_handler+0xe8/0x140
[C00000000FA17F90] [C000000000025784] .kernel_thread+0x4c/0x68
EEH: Detected PCI bus error on device <NULL>
EEH: This PCI device has failed 1 times since last reboot: <NULL> -
EEH: Unable to configure device bridge (-3) for /pci@400000000110/pci@2,2
Call Trace:
[C00000000FA17940] [C00000000000F270] .show_stack+0x74/0x1b4 (unreliable)
[C00000000FA179F0] [C000000000049D04] .eeh_dn_check_failure+0x290/0x2d8
[C00000000FA17AA0] [C00000000001E114] .rtas_read_config+0x120/0x154
[C00000000FA17B40] [C000000000049664] .early_enable_eeh+0x274/0x2bc
[C00000000FA17C00] [C000000000049708] .eeh_add_device_early+0x5c/0x6c
[C00000000FA17C90] [C000000000049748] .eeh_add_device_tree_early+0x30/0x5c
[C00000000FA17D20] [C000000000046568] .pcibios_add_pci_devices+0x8c/0x1f8
[C00000000FA17DD0] [C00000000004A528] .eeh_reset_device+0xe0/0x110
[C00000000FA17E60] [C00000000004A698] .handle_eeh_events+0x140/0x250
[C00000000FA17F00] [C00000000004AC7C] .eeh_event_handler+0xe8/0x140
[C00000000FA17F90] [C000000000025784] .kernel_thread+0x4c/0x68
EEH: Detected PCI bus error on device <NULL>
and so on

Signed-off-by: Olaf Hering <olh@suse.de>
Acked-by: Linas Vepstas <linas@austin.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
2006-02-28 16:25:54 +11:00
Al Viro
d04e4e115b [PATCH] eeh_driver NULL noise removal
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-02-07 20:58:33 -05:00