Commit Graph

183 Commits

Author SHA1 Message Date
Mauro Carvalho Chehab
2068def56c i7core_edac: fix error codes for sysfs error injection interface
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
2010-05-10 11:44:55 -03:00
Mauro Carvalho Chehab
276b824c30 i7core_edac: some fixes at error injection code
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
2010-05-10 11:44:54 -03:00
Mauro Carvalho Chehab
17cb7b0cf7 i7core_edac: Some cleanups at displayed info
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
2010-05-10 11:44:54 -03:00
Mauro Carvalho Chehab
086271a037 i7core: remove some unneeded noisy debug messages
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
2010-05-10 11:44:54 -03:00
Mauro Carvalho Chehab
3a7dde7fcd i7core: add socket info at the debug msg
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
2010-05-10 11:44:53 -03:00
Mauro Carvalho Chehab
ec6df24c15 i7core: better document i7core_get_active_channels()
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
2010-05-10 11:44:53 -03:00
Mauro Carvalho Chehab
c77720b954 i7core: fix get_devices routine for Xeon55xx
i7core_get_devices() was prepared to get just the first device found of each type.
Because of that, on Xeon 55xx, only socket 1 was retrieved.

Rework i7core_get_devices() to clean it and to properly support Xeon 55xx.

While here, fix a small typo.

Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
2010-05-10 11:44:53 -03:00
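The rework described in c77720b954 comes down to keeping every device that matches a given type instead of stopping at the first one. The following is a minimal sketch of that difference, not the driver's code: the scan list, the bus numbers, and both function names are hypothetical, and only the two PCI IDs are taken from the log output elsewhere in this history.

```python
from collections import defaultdict

# Hypothetical scan result: (bus, device, function, pci_id) tuples.
# The bus numbers are illustrative and were not read from real hardware.
SCAN = [
    (255, 3, 0, "8086:2c18"),
    (254, 3, 0, "8086:2c18"),
    (255, 3, 1, "8086:2c19"),
    (254, 3, 1, "8086:2c19"),
]

def first_match_only(scan):
    """Old behaviour as described: keep one device per PCI ID."""
    found = {}
    for bus, dev, fn, pci_id in scan:
        found.setdefault(pci_id, (bus, dev, fn))
    return found

def all_matches(scan):
    """Reworked behaviour: keep every device found for each PCI ID."""
    found = defaultdict(list)
    for bus, dev, fn, pci_id in scan:
        found[pci_id].append((bus, dev, fn))
    return dict(found)
```

With the first variant, a second socket exposing the same PCI IDs on another bus is silently dropped, which matches the symptom the commit message reports.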
Mauro Carvalho Chehab
a639539fa2 i7core: enrich error information based on memory transaction type
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
2010-05-10 11:44:53 -03:00
Mauro Carvalho Chehab
c5d3452869 i7core: check if the memory error is fatal or non-fatal
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
2010-05-10 11:44:53 -03:00
Mauro Carvalho Chehab
310cbb7284 i7core: fix probing on Xeon55xx
Xeon55xx fails to probe with this error message:

EDAC DEBUG: in drivers/edac/i7core_edac.c, line at 1660: MC: drivers/edac/i7core_edac.c: i7core_init()
EDAC i7core: Device not found: dev 00:00.0 PCI ID 8086:2c41
i7core_edac: probe of 0000:00:14.0 failed with error -22

This is due to the fact that, on Xeon35xx (and i7core), device 00.0 has
PCI ID 8086:2c40.

Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
2010-05-10 11:44:52 -03:00
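The probing failure in 310cbb7284 arises because device 00.0 can carry one of two PCI IDs. One way to tolerate that, sketched here under assumptions, is to accept a short list of IDs rather than a single one. The function name and the set-based lookup are hypothetical. Both IDs come from the commit message, and the sketch does not assert which processor family carries which.

```python
# Both IDs appear in the commit message above. The order is an assumption:
# the ID named in the "Device not found" line is tried first.
ACCEPTED_IDS = ("8086:2c41", "8086:2c40")

def find_device(present_ids, accepted=ACCEPTED_IDS):
    """Return the first accepted PCI ID that is present, or None."""
    for pci_id in accepted:
        if pci_id in present_ids:
            return pci_id
    return None
```

A caller would treat `None` as the "Device not found" case and fail the probe only then.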
Mauro Carvalho Chehab
f237fcf2b7 i7core_edac: some fixes at memory error parser
m->bank is not related to the memory bank but, instead, to the MCA Error
register bank. Fix it accordingly. While here, improve the comments for
Nehalem bank.

A later fix is needed, in order to get bank/rank information from MCA
error log.

Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
2010-05-10 11:44:52 -03:00
Mauro Carvalho Chehab
8a2f118e3a i7core_edac: decode mcelog error and send it via edac interface
Enriches mcelog error by using the encoded information at MCE status and
misc registers (IA32_MCx_STATUS, IA32_MCx_MISC).

Some fixes are still needed here, in order to properly fill the EDAC
fields.

Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
2010-05-10 11:44:52 -03:00
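Commit 8a2f118e3a decodes information packed into IA32_MCx_STATUS. As a rough illustration of what such decoding involves, the sketch below splits a 64-bit status value into the architectural fields defined for the machine-check status registers (flag bits 63 down to 57, the model-specific error code in bits 31:16, and the MCA error code in bits 15:0). The function name and the dictionary keys are hypothetical, and how i7core_edac interprets the model-specific bits is not shown here.

```python
def decode_mci_status(status):
    """Split a 64-bit IA32_MCi_STATUS value into its architectural fields."""
    def bit(n):
        return bool((status >> n) & 1)

    return {
        "valid": bit(63),        # VAL: the register holds a logged error
        "overflow": bit(62),     # OVER: an earlier error was overwritten
        "uncorrected": bit(61),  # UC: the error was not corrected
        "enabled": bit(60),      # EN: error reporting was enabled
        "miscv": bit(59),        # MISCV: IA32_MCi_MISC is valid
        "addrv": bit(58),        # ADDRV: IA32_MCi_ADDR is valid
        "pcc": bit(57),          # PCC: processor context corrupt
        "model_specific_code": (status >> 16) & 0xFFFF,
        "mca_error_code": status & 0xFFFF,
    }
```

For example, `decode_mci_status((1 << 63) | (1 << 61) | 0x009F)` reports a valid, uncorrected error with MCA error code `0x9F`.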
Mauro Carvalho Chehab
ba6c5c62ee i7core_edac: maps all sockets as if they are one MC controller
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
2010-05-10 11:44:52 -03:00
Mauro Carvalho Chehab
67166af4ab i7core_edac: add support for more than one MC socket
Some Nehalem architectures have more than one MC socket. Socket 0 is
located at bus 255.

Currently, it is using up to 2 sockets, but increasing it to a larger
number is just a matter of increasing MAX_SOCKETS definition.

This seems to be required for proper support of Xeon 55xx.

Still needs testing with Xeon 55xx.

Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
2010-05-10 11:44:51 -03:00
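Commit 67166af4ab states that socket 0 is located at bus 255 and that the socket limit is set by MAX_SOCKETS. A small sketch of a socket-to-bus mapping follows. It assumes that each additional socket sits one bus lower (255, then 254), which agrees with the buses named for Xeon 55xx elsewhere in this history but is not stated in this commit. The function name is hypothetical.

```python
MAX_SOCKETS = 2  # current limit according to the commit message
LAST_BUS = 255   # socket 0 is located at bus 255 per the commit message

def socket_to_bus(socket):
    """Map a socket index to its PCI bus number.

    Assumption: socket n lives at bus 255 - n.
    """
    if not 0 <= socket < MAX_SOCKETS:
        raise ValueError(f"socket {socket} out of range 0..{MAX_SOCKETS - 1}")
    return LAST_BUS - socket
```

Under that assumption, supporting more sockets would only require raising `MAX_SOCKETS`, as the commit message suggests.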
Mauro Carvalho Chehab
d1fd4fb69e i7core_edac: Add code to probe Xeon 55xx bus
This code changes the detection procedure of i7core_edac. Instead of
directly probing for MC registers, it probes for another register found
on Nehalem. If found, it tries to pick the first MC PCI BUS. This should
work fine with Xeon 35xx, but on Xeon 55xx the MC sits at buses 254 and
255, which are not properly detected by the non-legacy PCI methods.

The new detection code scans specifically at buses 254 and 255 for the
Xeon 55xx devices.

This code has not been tested yet. Once it works, a further change to the
code will be needed, since i7core is not yet ready to work with 2 sets of
MC.

Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
2010-05-10 11:44:51 -03:00
Mauro Carvalho Chehab
e9bd2e7379 i7core_edac: Adds write unlock to MC registers
The public Intel Xeon 5500 volume 2 datasheet describes, on page 53,
section 2.6.7, a register called MC_CFG_CONTROL that can lock/unlock the
Memory Controller configuration registers.

Adds support for it in the hope that software error injection would
work. With my tests with Xeon 35xx, there's still something missing.
With a program that does sequential bit writes at dev 0.0, sometimes, it
produces error injection, after unblocking the MC_CFG_CONTROL (and,
sometimes, it just locks my testing machine).

I'll try later to discover by trial and error what's the register that
solves this issue on Xeon 35xx.

Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
2010-05-10 11:44:50 -03:00
Mauro Carvalho Chehab
d5381642ab i7core_edac: Add edac_mce glue
Adds glue code to allow i7core to work with mcelog. With the glue,
i7core registers itself on edac_mce. At mce, when an error is detected,
it calls all registered drivers (in this case, i7core), for EDAC error
handling.

TODO: It currently just prints the MCE error log using about the same
      format as mce panic messages. The error message should be enhanced
      with mcelog userspace info and converted into the proper EDAC format,
      to feed the EDAC error counts.

Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
2010-05-10 11:44:50 -03:00
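The glue described in d5381642ab follows a registration pattern: drivers register a handler, and the MCE path calls every registered handler when an error is detected. The sketch below illustrates that pattern only. The class, its methods, and the dictionary standing in for an MCE record are all hypothetical and do not mirror the kernel's edac_mce interface.

```python
class EdacMceRegistry:
    """Toy registry: MCE events are fanned out to registered drivers."""

    def __init__(self):
        self._handlers = []

    def register(self, handler):
        # A driver such as i7core would register its handler at probe time.
        self._handlers.append(handler)

    def unregister(self, handler):
        self._handlers.remove(handler)

    def notify(self, mce):
        """Call all handlers; return how many reported handling the event."""
        return sum(1 for handler in self._handlers if handler(mce))


def i7core_like_handler(mce):
    # Hypothetical check: only claim events that carry a valid status.
    return bool(mce.get("status_valid"))
```

A handler returning a truthy value signals that it consumed the event, which is where conversion into EDAC error counts would eventually hook in.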
Mauro Carvalho Chehab
41fcb7feed i7core_edac: CodingStyle fixes
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
2010-05-10 11:44:48 -03:00
Mauro Carvalho Chehab
eb94fc402f i7core_edac: fill csrows edac sysfs info
csrows is still fake, since we can't identify its representation with
Nehalem registers.

Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
2010-05-10 11:44:48 -03:00
Mauro Carvalho Chehab
5566cb7c91 i7core_edac: Memory info fixes and preparation for properly filling csrow data
Now, memory size is properly displayed:

    EDAC i7core: DOD Max limits: DIMMS: 2, 1-ranked, 8-banked
    EDAC i7core: DOD Max rows x colums = 0x4000 x 0x400
    EDAC i7core: Memory channel configuration:
    EDAC i7core: Ch0 phy rd0, wr0 (0x063f7c31): 2 ranks, UDIMMs
    EDAC i7core:    dimm 0 (0x00000288) 1024 Mb offset: 0, numbank: 8,
                    numrank: 1, numrow: 0x4000, numcol: 0x400
    EDAC i7core:    dimm 1 (0x00001288) 1024 Mb offset: 4, numbank: 8,
                    numrank: 1, numrow: 0x4000, numcol: 0x400
    EDAC i7core: Ch1 phy rd1, wr1 (0x063f7c31): 2 ranks, UDIMMs
    EDAC i7core:    dimm 0 (0x00000288) 1024 Mb offset: 0, numbank: 8,
                    numrank: 1, numrow: 0x4000, numcol: 0x400
    EDAC i7core: Ch2 phy rd3, wr3 (0x063f7c31): 2 ranks, UDIMMs
    EDAC i7core:    dimm 0 (0x00000288) 1024 Mb offset: 0, numbank: 8,
                    numrank: 1, numrow: 0x4000, numcol: 0x400

Still, as the way to retrieve csrows info is not known, it maps what is
available onto the csrows basic unit of the EDAC core.

Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
2010-05-10 11:44:48 -03:00
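The sizes in the log output of 5566cb7c91 can be cross-checked from the geometry printed on the same lines. The sketch below assumes that each (bank, rank, row, column) location holds 8 bytes, that is, a 64-bit data path, and that "Mb" in the log means mebibytes. Both are assumptions, and the function name is hypothetical.

```python
def dimm_size_mib(banks, ranks, rows, cols, bytes_per_location=8):
    """Estimate DIMM capacity in MiB from its geometry.

    Assumption: every addressable location stores 8 bytes.
    """
    total_bytes = banks * ranks * rows * cols * bytes_per_location
    return total_bytes >> 20  # bytes -> MiB
```

For the values in the log (8 banks, 1 rank, 0x4000 rows, 0x400 columns), `dimm_size_mib(8, 1, 0x4000, 0x400)` gives 1024, which agrees with the reported size.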
Mauro Carvalho Chehab
854d334997 i7core_edac: Get more info about the memory DIMMs
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
2010-05-10 11:44:48 -03:00
Mauro Carvalho Chehab
7dd6953c5f i7core_edac: Add more information about each active dimm
Thanks-to: Aristeu Rozanski <aris@redhat.com> for part of the code

Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
2010-05-10 11:44:47 -03:00
Mauro Carvalho Chehab
b7c761512c i7core_edac: Improve error handling
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
2010-05-10 11:44:47 -03:00
Mauro Carvalho Chehab
1c6fed808f i7core_edac: Properly fill struct csrow_info
Thanks-to: Aristeu Rozanski <aris@redhat.com> for part of the code

Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
2010-05-10 11:44:47 -03:00
Mauro Carvalho Chehab
ef708b53b9 i7core_edac: Add additional tests for error detection
Properly check the number of channels and improve probing error detection

Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
2010-05-10 11:44:47 -03:00
Mauro Carvalho Chehab
442305b152 i7core_edac: Add a memory check routine, based on device 3 function 4
This function appears only in the Xeon 5500 datasheet. Yet, testing with a
Xeon 3503 showed that this is also implemented on other Nehalem
processors.

At the first read, MC_TEST_ERR_RCV1 and MC_TEST_ERR_RCV0 can contain any
value. Modify CE error logic to update the error count only after the
second read.

An alternative approach would be to do a write at rcv0 and rcv1
registers, but it seemed better to keep them untouched, since the BIOS
might assume that they are reserved for its own use.

Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
2010-05-10 11:44:46 -03:00
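The "count only after the second read" logic from 442305b152 can be pictured as a delta tracker: the first value read is stored as a baseline, and only later reads add to the total. The sketch below is an illustration under assumptions. The class name is hypothetical, the 15-bit counter width is a guess chosen for the example, and the wraparound handling is an addition that the commit message does not mention.

```python
class CorrectedErrorCounter:
    """Accumulate corrected-error deltas, ignoring the first raw read."""

    def __init__(self, width_bits=15):  # width is an assumption
        self._mask = (1 << width_bits) - 1
        self._last = None  # no baseline until the first read
        self.total = 0

    def update(self, raw):
        """Feed one raw register read; return the newly counted errors."""
        raw &= self._mask
        if self._last is None:
            # First read: the register may contain any value, so only
            # remember it as the baseline.
            self._last = raw
            return 0
        # Modular subtraction keeps the delta correct across a wraparound.
        delta = (raw - self._last) & self._mask
        self._last = raw
        self.total += delta
        return delta
```

This approach leaves the hardware registers untouched, in line with the commit's preference for not writing to rcv0 and rcv1.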
Mauro Carvalho Chehab
87d1d272ba i7core_edac: need mci->edac_check, otherwise module removal doesn't work
There are some locking troubles with edac_core: if you don't declare an
edac_check, the module may suffer from a soft lockup.

Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
2010-05-10 11:44:46 -03:00
Mauro Carvalho Chehab
7b029d03c3 i7core_edac: A few fixes at error injection code
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
2010-05-10 11:44:46 -03:00
Mauro Carvalho Chehab
f122a89222 i7core_edac: Show read/write virtual/physical channel association
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
2010-05-10 11:44:46 -03:00
Mauro Carvalho Chehab
8f33190757 i7core_edac: Registers all supported MC functions
Now, it will try to register on all supported Memory Controller
functions.

Note that dev 3, function 2 is present only on chips with Registered
DIMMs, according to the datasheet. So, the driver doesn't return -ENODEV
if all functions but this one were successfully registered and enabled:

    EDAC i7core: Registered device 8086:2c18 fn=3 0
    EDAC i7core: Registered device 8086:2c19 fn=3 1
    EDAC i7core: Device not found: PCI ID 8086:2c1a (dev 3, func 2)
    EDAC i7core: Registered device 8086:2c1c fn=3 4
    EDAC i7core: Registered device 8086:2c20 fn=4 0
    EDAC i7core: Registered device 8086:2c21 fn=4 1
    EDAC i7core: Registered device 8086:2c22 fn=4 2
    EDAC i7core: Registered device 8086:2c23 fn=4 3
    EDAC i7core: Registered device 8086:2c28 fn=5 0
    EDAC i7core: Registered device 8086:2c29 fn=5 1
    EDAC i7core: Registered device 8086:2c2a fn=5 2
    EDAC i7core: Registered device 8086:2c2b fn=5 3
    EDAC i7core: Registered device 8086:2c30 fn=6 0
    EDAC i7core: Registered device 8086:2c31 fn=6 1
    EDAC i7core: Registered device 8086:2c32 fn=6 2
    EDAC i7core: Registered device 8086:2c33 fn=6 3
    EDAC i7core: Driver loaded.

Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
2010-05-10 11:44:45 -03:00
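The registration rule in 8f33190757 (fail with -ENODEV only when a required function is missing) can be sketched with a table that marks some entries as optional. This is an illustration, not the driver's code: the table layout and function name are hypothetical, and only four of the PCI IDs from the log above are included.

```python
import errno

# (device, function, pci_id, optional) -- IDs taken from the log above.
# Dev 3, function 2 is marked optional because the commit message says it
# exists only on chips with Registered DIMMs.
MC_FUNCTIONS = [
    (3, 0, "8086:2c18", False),
    (3, 1, "8086:2c19", False),
    (3, 2, "8086:2c1a", True),
    (3, 4, "8086:2c1c", False),
]

def probe(present_ids, table=MC_FUNCTIONS):
    """Register every function found; fail only if a required one is absent."""
    registered = []
    for dev, fn, pci_id, optional in table:
        if pci_id in present_ids:
            registered.append(pci_id)
        elif not optional:
            return -errno.ENODEV, registered
    return 0, registered
```

With the optional entry absent, `probe` still returns 0, which corresponds to the "Device not found" line followed by "Driver loaded." in the log.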
Mauro Carvalho Chehab
0b2b7b7ec0 i7core_edac: Add more status functions to EDAC driver
This patch was co-authored with Aristeu Rozanski.

Signed-off-by: Aristeu Sergio <arozansk@redhat.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
2010-05-10 11:44:45 -03:00
Mauro Carvalho Chehab
194a40feab i7core_edac: Add error insertion code for Nehalem
Implements set_inject_error() with the low-level code needed to inject
memory errors on Nehalem, and adds some sysfs nodes to allow error injection.

The next patch will add an API for error injection.

Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
2010-05-10 11:44:45 -03:00
Mauro Carvalho Chehab
a0c36a1f0f i7core_edac: Add an EDAC memory controller driver for Nehalem chipsets
This driver is meant to support i7 core/i7core extreme desktop
processors and Xeon 35xx/55xx series with integrated memory controller.
It is likely that it can be expanded in the future to work with other
processor series based on the same Memory Controller design.

For now, it has just a few MCH status reads.

Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
2010-05-10 11:44:45 -03:00