2022 Commits

Author SHA1 Message Date
Linus Torvalds
77286b868f Merge tag 'edac_updates_for_v6.13' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras

Pull EDAC updates from Borislav Petkov:

 - Add support for BlueField-2 SoCs to bluefield_edac

 - Add support for Intel Panther Lake-H to igen6_edac

 - Add polling support to igen6_edac as some Intel N100 chips have
   trouble with error interrupts

 - Add Kaby Lake-S support to ie31200_edac

 - Fix memory source detection in the SKX common module which is used by
   a couple of Intel EDAC drivers

 - Add support for the NXP i.MX9 memory controller to fsl_edac

 - The usual fixes and cleanups all over the place

* tag 'edac_updates_for_v6.13' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras:
  EDAC/igen6: Add polling support
  EDAC/igen6: Initialize edac_op_state according to the configuration data
  EDAC/igen6: Avoid segmentation fault on module unload
  EDAC/ie31200: Add Kaby Lake-S dual-core host bridge ID
  MAINTAINERS: Change FSL DDR EDAC maintainership
  EDAC/{skx_common,i10nm}: Fix incorrect far-memory error source indicator
  EDAC/skx_common: Differentiate memory error sources
  EDAC/fsl_ddr: Add support for i.MX9 DDR controller
  dt-bindings: memory: fsl: Add compatible string nxp,imx9-memory-controller
  EDAC/fsl_ddr: Fix bad bit shift operations
  EDAC/fsl_ddr: Move global variables into struct fsl_mc_pdata
  EDAC/fsl_ddr: Pass down fsl_mc_pdata in ddr_in32() and ddr_out32()
  RAS/AMD/ATL: Add debug prints for DF register reads
  EDAC/bluefield: Use Arm SMC for EMI access on BlueField-2
  EDAC/bluefield: Fix potential integer overflow
  EDAC/igen6: Add Intel Panther Lake-H SoCs support
2024-11-19 12:00:10 -08:00
Borislav Petkov (AMD)
1b38da0115 Merge branch 'edac-misc' into edac-updates
* edac-misc:
  MAINTAINERS: Change FSL DDR EDAC maintainership
  RAS/AMD/ATL: Add debug prints for DF register reads
  EDAC/bluefield: Use Arm SMC for EMI access on BlueField-2
  EDAC/bluefield: Fix potential integer overflow
  EDAC/igen6: Add Intel Panther Lake-H SoCs support

Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
2024-11-18 11:33:23 +01:00
Orange Kao
e14232afa9 EDAC/igen6: Add polling support
Some PCs with Intel N100 (with PCI device 8086:461c, DID_ADL_N_SKU4)
experienced issues with error interrupts not working, even with the
following configuration in the BIOS.

    In-Band ECC Support: Enabled
    In-Band ECC Operation Mode: 2 (make all requests protected and
                                   ignore range checks)
    IBECC Error Injection Control: Inject Correctable Error on insertion
                                   counter
    Error Injection Insertion Count: 251658240 (0xf000000)

Add polling mode support for these machines to ensure that memory error
events are handled.

Signed-off-by: Orange Kao <orange@aiven.io>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Link: https://lore.kernel.org/all/20241106114024.941659-3-orange@aiven.io
2024-11-08 13:36:55 -08:00
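As a rough illustration of the polling approach in e14232afa9 above, the following user-space sketch checks an error log on each pass instead of waiting for an interrupt. The names `fake_imc`, `LOG_VALID` and `poll_once()` and the bit layout are hypothetical and not taken from igen6_edac; the real driver reads hardware registers from kernel context.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical error-log layout: bit 63 flags a logged error. */
#define LOG_VALID (1ULL << 63)

struct fake_imc {
	uint64_t ecc_log;   /* stands in for the hardware error log */
	unsigned handled;   /* number of errors the poller has processed */
};

/*
 * One polling pass: if an error is logged, consume it and clear the log
 * so that the next error can be recorded. Returns true if an error was
 * found on this pass.
 */
bool poll_once(struct fake_imc *imc)
{
	uint64_t log = imc->ecc_log;

	if (!(log & LOG_VALID))
		return false;

	imc->ecc_log = 0;
	imc->handled++;
	return true;
}
```

In a driver, a pass like this would presumably be scheduled periodically, so an error is noticed even when no interrupt arrives.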
Qiuxu Zhuo
1d512b1aa5 EDAC/igen6: Initialize edac_op_state according to the configuration data
Currently, igen6_edac sets edac_op_state to EDAC_OPSTATE_NMI, while the
driver also supports memory errors reported from Machine Check. Initialize
edac_op_state to the correct value according to the configuration data
that the driver probed.

Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/all/20241106114024.941659-2-orange@aiven.io
2024-11-08 13:35:21 -08:00
Orange Kao
fefaae9039 EDAC/igen6: Avoid segmentation fault on module unload
The segmentation fault happens because:

During modprobe:
1. In igen6_probe(), igen6_pvt will be allocated with kzalloc()
2. In igen6_register_mci(), mci->pvt_info will point to
   &igen6_pvt->imc[mc]

During rmmod:
1. In mci_release() in edac_mc.c, it will kfree(mci->pvt_info)
2. In igen6_remove(), it will kfree(igen6_pvt);

Fix this issue by setting mci->pvt_info to NULL to avoid the double
kfree.

Fixes: 10590a9d4f ("EDAC/igen6: Add EDAC driver for Intel client SoCs using IBECC")
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=219360
Signed-off-by: Orange Kao <orange@aiven.io>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/20241104124237.124109-2-orange@aiven.io
2024-11-04 12:09:45 -08:00
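The ownership problem described in fefaae9039 above can be reproduced in miniature in user space. This is a sketch under assumptions: the struct layouts are simplified, `counted_free()` and `teardown()` are hypothetical helpers, and free() stands in for kfree(). It only shows why clearing the borrowed pointer before the generic release path runs leaves a single free per allocation.

```c
#include <stdlib.h>

/* Simplified stand-ins for the structures named in the commit message. */
struct imc { int id; };
struct pvt { struct imc imc[2]; };
struct mci { void *pvt_info; };

int frees;  /* counts releases of non-NULL pointers */

/* free() wrapper; like kfree(), passing NULL is a harmless no-op. */
void counted_free(void *p)
{
	if (p)
		frees++;
	free(p);
}

/* Mirrors a generic release path that frees whatever pvt_info points to. */
void mci_release(struct mci *mci)
{
	counted_free(mci->pvt_info);
	counted_free(mci);
}

/* Teardown with the fix applied: drop the borrowed pointer first. */
void teardown(struct mci *mci, struct pvt *pvt)
{
	mci->pvt_info = NULL;   /* borrowed from pvt, not owned by mci */
	mci_release(mci);
	counted_free(pvt);      /* the single owner frees the allocation */
}
```

Without the assignment to NULL, `mci_release()` would free memory that belongs to `pvt`, and the later free of `pvt` would release it a second time.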
James Ye
f12c946ee7 EDAC/ie31200: Add Kaby Lake-S dual-core host bridge ID
Add device ID for dual-core Kaby Lake-S processors e.g. i3-7100.

Signed-off-by: James Ye <jye836@gmail.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Jason Baron <jbaron@akamai.com>
Link: https://lore.kernel.org/r/20240824120622.46226-1-jye836@gmail.com
2024-11-04 17:40:22 +01:00
Qiuxu Zhuo
a36667037a EDAC/{skx_common,i10nm}: Fix incorrect far-memory error source indicator
Granite Rapids CPUs with Flat2LM memory configurations may
mistakenly report near-memory errors as far-memory errors, resulting
in invalid decoded ADXL results:

  EDAC skx: Bad imc -1

Fix this incorrect far-memory error source indicator by prefetching the
decoded far-memory controller ID, and adjust the error source indicator
to near-memory if the far-memory controller ID is invalid.

Fixes: ba987eaaab ("EDAC/i10nm: Add Intel Granite Rapids server support")
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Tested-by: Diego Garcia Rodriguez <diego.garcia.rodriguez@intel.com>
Link: https://lore.kernel.org/r/20241015072236.24543-3-qiuxu.zhuo@intel.com
2024-10-23 11:59:21 -07:00
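The fallback described in a36667037a above can be sketched as a small pure function. The enumerator names and `fixup_err_src()` are hypothetical; the sketch assumes an invalid far-memory controller ID shows up as a negative value, as the "Bad imc -1" message suggests.

```c
/* The four error sources the series distinguishes; names are illustrative. */
enum err_src {
	ERR_SRC_2LM_NM,      /* near memory of a 2LM system */
	ERR_SRC_2LM_FM,      /* far memory of a 2LM system */
	ERR_SRC_1LM,         /* 1LM system */
	ERR_SRC_NOT_MEMORY,  /* not a memory error */
};

/*
 * If the hardware reports a far-memory error but the decoded far-memory
 * controller ID is invalid (negative here), treat the error as coming
 * from near memory instead. All other sources pass through unchanged.
 */
enum err_src fixup_err_src(enum err_src reported, int fm_imc)
{
	if (reported == ERR_SRC_2LM_FM && fm_imc < 0)
		return ERR_SRC_2LM_NM;
	return reported;
}
```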
Qiuxu Zhuo
2397f79573 EDAC/skx_common: Differentiate memory error sources
The current skx_common determines whether the memory error source is the
near memory of the 2LM system and then retrieves the decoded error results
from the ADXL components (near-memory vs. far-memory) accordingly.

However, some memory controllers may have limitations in correctly
reporting the memory error source, leading to the retrieval of incorrect
decoded parts from the ADXL.

To address these limitations, instead of simply determining whether the
memory error is from the near memory of the 2LM system, it is necessary to
distinguish the memory error source details as follows:

  Memory error from the near memory of the 2LM system.
  Memory error from the far memory of the 2LM system.
  Memory error from the 1LM system.
  Not a memory error.

This will enable the i10nm_edac driver to take appropriate actions for
those memory controllers that have limitations in reporting the memory
error source.

Fixes: ba987eaaab ("EDAC/i10nm: Add Intel Granite Rapids server support")
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Tested-by: Diego Garcia Rodriguez <diego.garcia.rodriguez@intel.com>
Link: https://lore.kernel.org/r/20241015072236.24543-2-qiuxu.zhuo@intel.com
2024-10-23 11:58:43 -07:00
Ye Li
ddb8a8a022 EDAC/fsl_ddr: Add support for i.MX9 DDR controller
Add support for the i.MX9 DDR controller, which has different register
offsets and some function changes compared to the existing fsl_ddr
controller. The ECC and error injection functions are almost the same,
so update and reuse the driver for i.MX9. Add a special type 'TYPE_IMX9'
specifically for the i.MX9 controller to distinguish the differences.

Signed-off-by: Ye Li <ye.li@nxp.com>
Signed-off-by: Frank Li <Frank.Li@nxp.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Peng Fan <peng.fan@nxp.com>
Link: https://lore.kernel.org/r/20241016-imx95_edac-v3-5-86ae6fc2756a@nxp.com
2024-10-23 16:53:55 +02:00
Priyanka Singh
9ec22ac4fe EDAC/fsl_ddr: Fix bad bit shift operations
Fix undefined behavior caused by left-shifting a negative value in the
expression:

    cap_high ^ (1 << (bad_data_bit - 32))

The variable bad_data_bit ranges from 0 to 63. When it is less than 32,
bad_data_bit - 32 becomes negative, and left-shifting by a negative
value in C is undefined behavior.

Fix this by combining cap_high and cap_low into a 64-bit variable.

  [ bp: Massage commit message, simplify error bits handling. ]

Fixes: ea2eb9a8b6 ("EDAC, fsl-ddr: Separate FSL DDR driver from MPC85xx")
Signed-off-by: Priyanka Singh <priyanka.singh@nxp.com>
Signed-off-by: Li Yang <leoyang.li@nxp.com>
Signed-off-by: Frank Li <Frank.Li@nxp.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20241016-imx95_edac-v3-3-86ae6fc2756a@nxp.com
2024-10-23 16:52:58 +02:00
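The idea behind the fix in 9ec22ac4fe above, combining the two 32-bit halves before flipping a bit, can be sketched as follows. `flip_captured_bit()` is a hypothetical helper, not the driver's code; it assumes bad_data_bit is in the range 0 to 63, as the commit message states.

```c
#include <stdint.h>

/*
 * Flip one bit (0..63) in a 64-bit value that is split across two 32-bit
 * capture values. Combining the halves first keeps the shift count
 * non-negative and within the width of the type for every valid index.
 */
uint64_t flip_captured_bit(uint32_t cap_high, uint32_t cap_low,
			   unsigned bad_data_bit)
{
	uint64_t cap = ((uint64_t)cap_high << 32) | cap_low;

	return cap ^ (1ULL << bad_data_bit);
}
```

Because the shift operand is a 64-bit constant (1ULL), indices below 32 no longer need the `bad_data_bit - 32` adjustment that produced a negative shift count.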
Frank Li
5d9aeaa607 EDAC/fsl_ddr: Move global variables into struct fsl_mc_pdata
Move global variables into the struct fsl_mc_pdata to handle systems
with multiple DDR controllers.

No functional change.

Signed-off-by: Frank Li <Frank.Li@nxp.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20241016-imx95_edac-v3-2-86ae6fc2756a@nxp.com
2024-10-23 13:25:48 +02:00
Frank Li
6c9748fbdf EDAC/fsl_ddr: Pass down fsl_mc_pdata in ddr_in32() and ddr_out32()
Pass down fsl_mc_pdata in helper functions ddr_in32() and ddr_out32() to
prepare for adding i.MX9 support. The i.MX9 has a slightly different
register layout.

No functional change.

Signed-off-by: Frank Li <Frank.Li@nxp.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20241016-imx95_edac-v3-1-86ae6fc2756a@nxp.com
2024-10-23 12:59:32 +02:00
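Why an accessor benefits from receiving per-controller data, as in 6c9748fbdf above, can be shown with a minimal sketch. The struct fields and `mc_in32()` are hypothetical and much simpler than the real fsl_mc_pdata and ddr_in32(); the only point is that a layout difference can be resolved inside the helper once it knows which controller it is reading.

```c
#include <stdint.h>

/* Hypothetical per-controller data; the field names are illustrative. */
struct mc_pdata {
	const uint32_t *regs;  /* register block, indexed in 32-bit words */
	unsigned err_base;     /* word index where the error registers start */
};

/* Read an error register; the layout difference is hidden in pdata. */
uint32_t mc_in32(const struct mc_pdata *pdata, unsigned word)
{
	return pdata->regs[pdata->err_base + word];
}
```

Callers use the same logical register index for both layouts, and two controllers in one system can each carry their own `mc_pdata`.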
David Thompson
e419675754 EDAC/bluefield: Use Arm SMC for EMI access on BlueField-2
The BlueField EDAC driver supports the first generation BlueField-1 SoC, but
not the second generation BlueField-2 SoC. The BlueField-2 SoC is different in
that only secure accesses are allowed to the External Memory Interface (EMI)
register block. On BlueField-2, all read/write accesses from Linux to EMI
registers are routed via the Arm Secure Monitor Call (SMC) through Arm Trusted
Firmware (ATF), which runs at EL3 privileged state.

On BlueField-1, EMI registers are mapped and accessed directly. In order to
support BlueField-2, the driver's read and write access methods must be
extended with additional logic to include secure access to the EMI registers
via SMCs.

  [ bp: Move struct member comments above them, simplify. ]

Signed-off-by: David Thompson <davthompson@nvidia.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Shravan Kumar Ramani <shravankr@nvidia.com>
Link: https://lore.kernel.org/r/20241021233013.18405-1-davthompson@nvidia.com
2024-10-22 18:36:13 +02:00
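The two access methods described in e419675754 above can be pictured as one read helper that dispatches on a per-device flag. This is a user-space sketch with hypothetical names (`bf_priv`, `fake_smc_read()`, `bf_read_reg()`); a real driver would issue an actual SMC to firmware rather than call a local function.

```c
#include <stdint.h>

/* Hypothetical private data: one flag selects the access method. */
struct bf_priv {
	int use_smc;         /* nonzero: registers reachable only via firmware */
	uint32_t *emi_regs;  /* directly mapped registers (first generation) */
};

/* Stand-in for a firmware call; returns 0 on success. */
int fake_smc_read(uint32_t *regs, unsigned word, uint32_t *val)
{
	*val = regs[word];
	return 0;
}

/* Single entry point: callers do not need to know which path is taken. */
int bf_read_reg(struct bf_priv *priv, unsigned word, uint32_t *val)
{
	if (priv->use_smc)
		return fake_smc_read(priv->emi_regs, word, val);

	*val = priv->emi_regs[word];
	return 0;
}
```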
David Thompson
1fe774a93b EDAC/bluefield: Fix potential integer overflow
The 64-bit argument for the "get DIMM info" SMC call consists of mem_ctrl_idx
left-shifted 16 bits and OR-ed with the DIMM index. With mem_ctrl_idx defined as
32 bits wide, the left-shift operation truncates the upper 16 bits of
information during the calculation of the SMC argument.

The mem_ctrl_idx stack variable must be defined as 64 bits wide to prevent any
potential integer overflow, i.e. loss of data from the upper 16 bits.

Fixes: 82413e562e ("EDAC, mellanox: Add ECC support for BlueField DDR4")
Signed-off-by: David Thompson <davthompson@nvidia.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Shravan Kumar Ramani <shravankr@nvidia.com>
Link: https://lore.kernel.org/r/20240930151056.10158-1-davthompson@nvidia.com
2024-10-17 14:10:18 +02:00
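The truncation described in 1fe774a93b above is easy to demonstrate in isolation. Both helpers below are hypothetical; they differ only in the width of the first parameter, which decides whether the shift is carried out in 32 or 64 bits.

```c
#include <stdint.h>

/* Wide variant: the shift is done in 64 bits, so no bits are lost. */
uint64_t dimm_info_arg(uint64_t mem_ctrl_idx, uint32_t dimm_idx)
{
	return (mem_ctrl_idx << 16) | dimm_idx;
}

/*
 * Narrow variant: the shift is done in 32 bits and only then widened,
 * so the upper 16 bits of mem_ctrl_idx are discarded.
 */
uint64_t dimm_info_arg_narrow(uint32_t mem_ctrl_idx, uint32_t dimm_idx)
{
	return (mem_ctrl_idx << 16) | dimm_idx;
}
```

For small controller indexes both variants agree, which is why the commit calls the overflow "potential".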
Lili Li
0be9f1af39 EDAC/igen6: Add Intel Panther Lake-H SoCs support
Panther Lake-H SoCs share the same IBECC registers with Meteor Lake-P
SoCs. Add Panther Lake-H SoC compute die IDs for EDAC support.

Signed-off-by: Lili Li <lili.li@intel.com>
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Link: https://lore.kernel.org/r/20241012071439.54165-1-qiuxu.zhuo@intel.com
2024-10-14 11:28:41 -07:00
Rajendra Nayak
0a97195d21 EDAC/qcom: Make irq configuration optional
On most modern Qualcomm SoCs, the configuration necessary to propagate the
Tag/Data RAM related irqs to the SoC irq controller is
already done in firmware (in DSF or 'DDR System Firmware').

On some, like the x1e80100, these registers aren't even accessible to the
kernel, causing a crash when the edac device is probed.

Hence, make the irq configuration optional in the driver and mark x1e80100
as the SoC on which this should be avoided.

Fixes: af16b00578 ("arm64: dts: qcom: Add base X1E80100 dtsi and the QCP dts")
Reported-by: Bjorn Andersson <andersson@kernel.org>
Signed-off-by: Rajendra Nayak <quic_rjendra@quicinc.com>
Reviewed-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Reviewed-by: Abel Vesa <abel.vesa@linaro.org>
Link: https://lore.kernel.org/r/20240903101510.3452734-1-quic_rjendra@quicinc.com
Signed-off-by: Bjorn Andersson <andersson@kernel.org>
2024-10-05 22:17:08 -05:00
Linus Torvalds
7dfc15c473 Merge tag 'edac_updates_for_v6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras

Pull EDAC updates from Borislav Petkov:

 - Drop a now obsolete ppc4xx_edac driver

 - Fix conversion to physical memory addresses on Intel's Elkhart Lake
   and Ice Lake hardware when the system address is above the
   (Top-Of-Memory) TOM address

 - Pay attention to the memory hole on Zynq UltraScale+ MPSoC DDR
   controllers when injecting errors for testing purposes

 - Add support for translating normalized error addresses reported by an
   AMD memory controller into system physical addresses using a UEFI
   mechanism called Platform Runtime Mechanism (PRM).

 - The usual cleanups and fixes

* tag 'edac_updates_for_v6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras:
  EDAC: Drop obsolete PPC4xx driver
  EDAC/sb_edac: Fix the compile warning of large frame size
  EDAC/{skx_common,i10nm}: Remove the AMAP register for determining DDR5
  EDAC/{skx_common,skx,i10nm}: Move the common debug code to skx_common
  EDAC/igen6: Fix conversion of system address to physical memory address
  EDAC/synopsys: Fix error injection on Zynq UltraScale+
  RAS/AMD/ATL: Translate normalized to system physical addresses using PRM
  ACPI: PRM: Add PRM handler direct call support
2024-09-16 06:36:37 +02:00
Borislav Petkov (AMD)
92f8358bce Merge remote-tracking branches 'ras/edac-amd-atl', 'ras/edac-misc' and 'ras/edac-drivers' into edac-updates
* ras/edac-amd-atl:
  RAS/AMD/ATL: Translate normalized to system physical addresses using PRM
  ACPI: PRM: Add PRM handler direct call support

* ras/edac-misc:
  EDAC/synopsys: Fix error injection on Zynq UltraScale+

* ras/edac-drivers:
  EDAC: Drop obsolete PPC4xx driver
  EDAC/sb_edac: Fix the compile warning of large frame size
  EDAC/{skx_common,i10nm}: Remove the AMAP register for determining DDR5
  EDAC/{skx_common,skx,i10nm}: Move the common debug code to skx_common
  EDAC/igen6: Fix conversion of system address to physical memory address

Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
2024-09-09 10:51:30 +02:00
Rob Herring (Arm)
a5f285d9cf EDAC: Drop obsolete PPC4xx driver
Since

  47d13a269b ("powerpc/40x: Remove 40x platforms.")

support for PPC40x platforms has been removed. While the EDAC driver also
mentions PPC440 and PPC460 processors, the driver refuses to probe on anything
other than PPC405. It's unlikely support will ever be added at this point for
these other old platforms, so the driver can be removed.

Signed-off-by: Rob Herring (Arm) <robh@kernel.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
Link: https://lore.kernel.org/r/20240904192224.3060307-2-robh@kernel.org
2024-09-05 16:56:38 +02:00
Qiuxu Zhuo
43247abd09 EDAC/sb_edac: Fix the compile warning of large frame size
Compiling sb_edac driver with GCC 11.4.0 and the W=1 option reported
the following warning:

  drivers/edac/sb_edac.c: In function ‘sbridge_mce_output_error’:
  drivers/edac/sb_edac.c:3249:1: warning: the frame size of 1032 bytes is larger than 1024 bytes [-Wframe-larger-than=]

As there is no concurrent invocation of sbridge_mce_output_error(),
fix this warning by moving the large-size variables 'msg' and 'msg_full'
from the stack to the pre-allocated data segment.

[Tony: Fix checkpatch warnings for code alignment & use of strcpy()]

Reported-by: Zhang Rui <rui.zhang@intel.com>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/all/20240829120903.84152-1-qiuxu.zhuo@intel.com
2024-09-03 15:09:22 -07:00
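The technique used in 43247abd09 above, moving large scratch buffers off the stack, looks roughly like this in user space. The buffer sizes, the message format and `format_error()` are hypothetical. As the commit notes for the real function, this is only safe when the function is never invoked concurrently.

```c
#include <stdio.h>

/*
 * Large scratch buffers kept in static storage instead of on the stack.
 * They are shared by all calls, so the function is not reentrant.
 */
static char msg[256];
static char msg_full[512];

/* Builds a message in the static buffers and returns a pointer to it. */
const char *format_error(int channel, int rank)
{
	snprintf(msg, sizeof(msg), "channel:%d rank:%d", channel, rank);
	snprintf(msg_full, sizeof(msg_full), "memory error (%s)", msg);
	return msg_full;
}
```

The trade-off is a smaller stack frame in exchange for permanently reserved memory and the loss of reentrancy.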
Qiuxu Zhuo
7a33c144c2 EDAC/{skx_common,i10nm}: Remove the AMAP register for determining DDR5
The configuration flag 'res_config->support_ddr5 = true' sufficiently
indicates DDR5 memory support for Sapphire Rapids and Granite Rapids.
Additionally, the i10nm_edac driver doesn't need to use the AMAP
register for setting the 'fine_grain_bank' of each DIMM. Therefore,
remove the AMAP register for determining DDR5.

Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/all/20240829061309.57738-1-qiuxu.zhuo@intel.com
2024-09-03 12:36:59 -07:00
Qiuxu Zhuo
8b93582353 EDAC/{skx_common,skx,i10nm}: Move the common debug code to skx_common
Commit

  afdb82fd763c ("EDAC, i10nm: make skx_common.o a separate module")

made skx_common.o a separate module. With skx_common.o now a separate
module, move the common debug code setup_{skx,i10nm}_debug() and
teardown_{skx,i10nm}_debug() in {skx,i10nm}_base.c to skx_common.c to
reduce code duplication. Additionally, prefix these function names with
'skx' to maintain consistency with other names in the file.

Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/all/20240829055101.56245-1-qiuxu.zhuo@intel.com
2024-09-03 12:35:06 -07:00
Qiuxu Zhuo
0ad875f442 EDAC/igen6: Fix conversion of system address to physical memory address
The conversion of system address to physical memory address (as viewed by
the memory controller) by igen6_edac is incorrect when the system address
is above the TOM (Total amount Of populated physical Memory) for Elkhart
Lake and Ice Lake (Neural Network Processor). Fix this conversion.

Fixes: 10590a9d4f ("EDAC/igen6: Add EDAC driver for Intel client SoCs using IBECC")
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/stable/20240814061011.43545-1-qiuxu.zhuo%40intel.com
2024-09-03 12:27:19 -07:00
Shubhrajyoti Datta
35e6dbfe18 EDAC/synopsys: Fix error injection on Zynq UltraScale+
The Zynq UltraScale+ MPSoC DDR address map has a hole from 2GB to 32GB.
The DDR host interface sees contiguous memory, so when injecting
errors the driver must remove the hole; otherwise the injection fails
because the address translation is incorrect.

Introduce a get_mem_info() function pointer and set it for Zynq
UltraScale+ platform to return host address.

Fixes: 1a81361f75 ("EDAC, synopsys: Add Error Injection support for ZynqMP DDR controller")
Signed-off-by: Shubhrajyoti Datta <shubhrajyoti.datta@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20240711100656.31376-1-shubhrajyoti.datta@amd.com
2024-08-01 16:27:46 +02:00
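Removing the hole described in 35e6dbfe18 above amounts to a simple address adjustment. The sketch below assumes, for illustration only, a low region that ends at 2 GiB and a high region that starts at 32 GiB; the constant names and `to_host_addr()` are hypothetical and are not the driver's get_mem_info() implementation.

```c
#include <stdint.h>

/* Assumed layout: low region below 2 GiB, high region from 32 GiB up. */
#define LOW_MEM_SIZE   (2ULL << 30)
#define HIGH_MEM_BASE  (32ULL << 30)

/*
 * Translate a system address into the contiguous address space seen by
 * the DDR host interface by closing the gap between the two regions.
 */
uint64_t to_host_addr(uint64_t sys_addr)
{
	if (sys_addr >= HIGH_MEM_BASE)
		return sys_addr - HIGH_MEM_BASE + LOW_MEM_SIZE;
	return sys_addr;
}
```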
Linus Torvalds
1a251f52cf minmax: make generic MIN() and MAX() macros available everywhere
This just standardizes the use of MIN() and MAX() macros, with the very
traditional semantics.  The goal is to use these for C constant
expressions and for top-level / static initializers, and so be able to
simplify the min()/max() macros.

These macro names were used by various kernel code - they are very
traditional, after all - and all such users have been fixed up, with a
few different approaches:

 - trivial duplicated macro definitions have been removed

   Note that 'trivial' here means that it's obviously kernel code that
   already included all the major kernel headers, and thus gets the new
   generic MIN/MAX macros automatically.

 - non-trivial duplicated macro definitions are guarded with #ifndef

   This is the "yes, they define their own versions, but no, the include
   situation is not entirely obvious, and maybe they don't get the
   generic version automatically" case.

 - strange use case #1

   A couple of drivers decided that the way they want to describe their
   versioning is with

	#define MAJ 1
	#define MIN 2
	#define DRV_VERSION __stringify(MAJ) "." __stringify(MIN)

   which adds zero value and I just did my Alexander the Great
   impersonation, and rewrote that pointless Gordian knot as

	#define DRV_VERSION "1.2"

   instead.

 - strange use case #2

   A couple of drivers thought that it's a good idea to have a random
   'MIN' or 'MAX' define for a value or index into a table, rather than
   the traditional macro that takes arguments.

   These values were re-written as C enum's instead. The new
   function-like macros only expand when followed by an open
   parenthesis, and thus don't clash with enum use.

Happily, there weren't really all that many of these cases, and a lot of
users already had the pattern of using '#ifndef' guarding (or in one
case just using '#undef MIN') before defining their own private version
that does the same thing. I left such cases alone.

Cc: David Laight <David.Laight@aculab.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-07-28 15:49:18 -07:00
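The remark in 1a251f52cf above about function-like macros not clashing with enum use can be checked with a few lines of C. The macro bodies here are the traditional textbook form, assumed for illustration rather than copied from the kernel headers; `limit_index`, `table` and `smaller_limit()` are hypothetical.

```c
/* Traditional function-like macros (arguments may be evaluated twice). */
#define MIN(a, b) ((a) < (b) ? (a) : (b))
#define MAX(a, b) ((a) > (b) ? (a) : (b))

/*
 * Enumerators with the same names. Without a following '(' the
 * preprocessor leaves MIN and MAX alone, so these are plain identifiers.
 */
enum limit_index { MIN, MAX };

int table[] = { [MIN] = 10, [MAX] = 90 };

/* Uses MIN both as a macro call and as an array index in one expression. */
int smaller_limit(void)
{
	return MIN(table[MIN], table[MAX]);
}
```

An object-like definition such as `#define MIN 2` would not coexist this way, which is why those drivers had to be changed.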
Linus Torvalds
4477b39c32 minmax: add a few more MIN_T/MAX_T users
Commit 3a7e02c040 ("minmax: avoid overly complicated constant
expressions in VM code") added the simpler MIN_T/MAX_T macros in order
to avoid some excessive expansion from the rather complicated regular
min/max macros.

The complexity of those macros stems from two issues:

 (a) trying to use them in situations that require a C constant
     expression (in static initializers and for array sizes)

 (b) the type sanity checking

and MIN_T/MAX_T avoids both of these issues.

Now, in the whole (long) discussion about all this, it was pointed out
that the whole type sanity checking is entirely unnecessary for
min_t/max_t which get a fixed type that the comparison is done in.

But that still leaves min_t/max_t unnecessarily complicated due to
worries about the C constant expression case.

However, it turns out that there really aren't very many cases that use
min_t/max_t for this, and we can just force-convert those.

This does exactly that.

Which in turn will then allow for much simpler implementations of
min_t()/max_t().  All the usual "macros in all upper case will evaluate
the arguments multiple times" rules apply.

We should do all the same things for the regular min/max() vs MIN/MAX()
cases, but that has the added complexity of various drivers defining
their own local versions of MIN/MAX, so that needs another level of
fixes first.

Link: https://lore.kernel.org/all/b47fad1d0cf8449886ad148f8c013dae@AcuMS.aculab.com/
Cc: David Laight <David.Laight@aculab.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-07-28 13:41:14 -07:00
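The caveat in 4477b39c32 above, that macros in all upper case evaluate their arguments multiple times, can be made visible with a counter. The `MIN_T` definition below is a simplified stand-in written for this sketch, not the kernel's definition; `next_value()` and `min_with_side_effect()` are hypothetical.

```c
/* Simple variant: casts both sides to one type, no other checks. */
#define MIN_T(type, a, b) ((type)(a) < (type)(b) ? (type)(a) : (type)(b))

int calls;  /* how many times next_value() has run */

int next_value(void)
{
	return ++calls;
}

/*
 * The first argument appears twice in the expansion: once in the
 * comparison and once in the selected branch, so next_value() runs twice
 * and the result comes from the second call.
 */
int min_with_side_effect(void)
{
	return MIN_T(int, next_value(), 100);
}
```

In exchange for this restriction, such a macro can be used where a C constant expression is required, for example with constant arguments in an array size.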
Linus Torvalds
222dfb8326 Merge tag 'x86_misc_for_v6.11_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull misc x86 updates from Borislav Petkov:

 - Make error checking of AMD SMN accesses more robust in the callers as
   they're the only ones who can interpret the results properly

 - The usual cleanups and fixes, left and right

* tag 'x86_misc_for_v6.11_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/kmsan: Fix hook for unaligned accesses
  x86/platform/iosf_mbi: Convert PCIBIOS_* return codes to errnos
  x86/pci/xen: Fix PCIBIOS_* return code handling
  x86/pci/intel_mid_pci: Fix PCIBIOS_* return code handling
  x86/of: Return consistent error type from x86_of_pci_irq_enable()
  hwmon: (k10temp) Rename _data variable
  hwmon: (k10temp) Remove unused HAVE_TDIE() macro
  hwmon: (k10temp) Reduce k10temp_get_ccd_support() parameters
  hwmon: (k10temp) Define a helper function to read CCD temperature
  x86/amd_nb: Enhance SMN access error checking
  hwmon: (k10temp) Check return value of amd_smn_read()
  EDAC/amd64: Check return value of amd_smn_read()
  EDAC/amd64: Remove unused register accesses
  tools/x86/kcpuid: Add missing dir via Makefile
  x86, arm: Add missing license tag to syscall tables files
2024-07-15 19:53:07 -07:00
Linus Torvalds
8028e290b6 Merge tag 'edac_updates_for_v6.11' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras

Pull EDAC updates from Borislav Petkov:

 - The AMD memory controllers data fabric version 4.5 supports
   non-power-of-2 denormalization in the sense that certain bits of the
   system physical address cannot be reconstructed from the normalized
   address reported by the RAS hardware. Add support for handling such
   addresses

 - Switch the EDAC drivers to the new Intel CPU model defines

 - The usual fixes and cleanups all over the place

* tag 'edac_updates_for_v6.11' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras:
  EDAC: Add missing MODULE_DESCRIPTION() macros
  EDAC/dmc520: Use devm_platform_ioremap_resource()
  EDAC/igen6: Add Intel Arrow Lake-U/H SoCs support
  RAS/AMD/FMPM: Use atl internal.h for INVALID_SPA
  RAS/AMD/ATL: Implement DF 4.5 NP2 denormalization
  RAS/AMD/ATL: Validate address map when information is gathered
  RAS/AMD/ATL: Expand helpers for adding and removing base and hole
  RAS/AMD/ATL: Read DRAM hole base early
  RAS/AMD/ATL: Add amd_atl pr_fmt() prefix
  RAS/AMD/ATL: Add a missing module description
  EDAC, i10nm: make skx_common.o a separate module
  EDAC/skx: Switch to new Intel CPU model defines
  EDAC/sb_edac: Switch to new Intel CPU model defines
  EDAC, pnd2: Switch to new Intel CPU model defines
  EDAC/i10nm: Switch to new Intel CPU model defines
  EDAC/ghes: Add missing newline to pr_info() statement
  RAS/AMD/ATL: Add missing newline to pr_info() statement
  EDAC/thunderx: Remove unused struct error_syndrome
2024-07-15 18:20:24 -07:00
Jeff Johnson
3afa157f43 EDAC: Add missing MODULE_DESCRIPTION() macros
With ARCH=arm64

  make allmodconfig && make W=1 C=1

reports:

  WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/edac/layerscape_edac_mod.o

Add the missing invocation of the MODULE_DESCRIPTION() macro to all
files which have a MODULE_LICENSE().

This includes mpc85xx_edac.c and four octeon_edac-*.c files which,
although they did not produce a warning with the arm64 allmodconfig
configuration, may cause this warning with other configurations.

  [ bp: s/module/driver/ for layerscape_edac ]

Signed-off-by: Jeff Johnson <quic_jjohnson@quicinc.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20240617-md-arm64-drivers-edac-v2-1-6d6c5dd1e5da@quicinc.com
2024-06-29 16:21:01 +02:00
Jai Arora
420c324d59 EDAC/dmc520: Use devm_platform_ioremap_resource()
platform_get_resource() and devm_ioremap_resource() are wrapped up in the
devm_platform_ioremap_resource() helper. Use the helper and get rid of the
local variable for struct resource *.

Signed-off-by: Jai Arora <jai.arora@samsung.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20240618110226.97395-1-jai.arora@samsung.com
2024-06-23 10:48:55 +02:00
Qiuxu Zhuo
88150cd950 EDAC/igen6: Add Intel Arrow Lake-U/H SoCs support
Arrow Lake-U/H SoCs share the same IBECC registers as Meteor Lake-P
SoCs. Add Arrow Lake-U/H SoC compute die IDs for EDAC support.

Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/20240614030354.69180-1-qiuxu.zhuo@intel.com
2024-06-14 08:08:12 -07:00
Yazen Ghannam
5ac6293047 EDAC/amd64: Check return value of amd_smn_read()
Check the return value of amd_smn_read() before saving a value. This
ensures invalid values aren't saved. The struct umc instance is
initialized to 0 during memory allocation. Therefore, a bad read will
keep the value as 0, providing the expected Read-as-Zero behavior.
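
The pattern described above can be sketched in user-space C. This is only an illustration, not the driver code: `struct sk_umc`, `sk_read_fn`, and the two fake readers are hypothetical stand-ins for the UMC structure and amd_smn_read().

```c
#include <stdint.h>

/* Hypothetical stand-in for a zero-initialized per-UMC register cache. */
struct sk_umc {
	uint32_t dimm_cfg;
};

/* Hypothetical reader type: returns 0 on success, non-zero on failure. */
typedef int (*sk_read_fn)(uint32_t addr, uint32_t *val);

/* Fake reader that succeeds and returns made-up register contents. */
int sk_read_ok(uint32_t addr, uint32_t *val)
{
	*val = addr ^ 0xabcdu;
	return 0;
}

/* Fake reader that fails and leaves garbage in *val. */
int sk_read_fail(uint32_t addr, uint32_t *val)
{
	(void)addr;
	*val = 0xffffffffu;
	return -1;
}

/* Save the register value only if the read succeeded. */
void sk_cache_dimm_cfg(struct sk_umc *umc, sk_read_fn rd, uint32_t addr)
{
	uint32_t tmp;

	if (!rd(addr, &tmp))
		umc->dimm_cfg = tmp;
}
```

With a zero-initialized `struct sk_umc`, a failed read leaves `dimm_cfg` at 0, which is the Read-as-Zero behavior the message refers to.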

Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Mario Limonciello <mario.limonciello@amd.com>
Link: https://lore.kernel.org/r/20240606-fix-smn-bad-read-v4-2-ffde21931c3f@amd.com
2024-06-12 11:33:45 +02:00
Yazen Ghannam
f97a8b9170 EDAC/amd64: Remove unused register accesses
A number of UMC registers are read only for the purpose of debug printing. They
are not used in any calculations. Nor do they have any specific debug value.

Remove them.

Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Mario Limonciello <mario.limonciello@amd.com>
Link: https://lore.kernel.org/r/20240606-fix-smn-bad-read-v4-1-ffde21931c3f@amd.com
2024-06-12 11:33:45 +02:00
Ilpo Järvinen
f8367a74ae EDAC/igen6: Convert PCIBIOS_* return codes to errnos
errcmd_enable_error_reporting() uses pci_{read,write}_config_word()
that return PCIBIOS_* codes. The return code is then returned all the
way into the probe function igen6_probe() that returns it as is. The
probe functions, however, should return normal errnos.

Convert the PCIBIOS_* return codes into normal errnos using
pcibios_err_to_errno() before returning them from
errcmd_enable_error_reporting().
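
A minimal user-space sketch of the conversion idea follows. The `SK_`-prefixed constants and all function names are hypothetical and cover only a subset of codes; the real definitions and the real pcibios_err_to_errno() live in include/linux/pci.h and should be consulted there.

```c
#include <errno.h>

/*
 * Illustrative positive status codes in the style of PCIBIOS_*.
 * These values are assumptions of this sketch, not a kernel reference.
 */
#define SK_PCIBIOS_SUCCESSFUL        0x00
#define SK_PCIBIOS_DEVICE_NOT_FOUND  0x86
#define SK_PCIBIOS_SET_FAILED        0x88

/* Map a positive PCIBIOS-style code to a negative errno. */
int sk_pcibios_err_to_errno(int err)
{
	if (err <= SK_PCIBIOS_SUCCESSFUL)
		return err;	/* zero, or already a negative errno */

	switch (err) {
	case SK_PCIBIOS_DEVICE_NOT_FOUND:
		return -ENODEV;
	case SK_PCIBIOS_SET_FAILED:
		return -EIO;
	default:
		return -ERANGE;	/* unknown positive code */
	}
}

/* Hypothetical config-space read that fails with a positive code. */
int sk_read_config(int *val)
{
	*val = 0;
	return SK_PCIBIOS_DEVICE_NOT_FOUND;
}

/* Convert at the point where the code leaves the helper. */
int sk_enable_reporting(void)
{
	int val;
	int rc = sk_read_config(&val);

	if (rc)
		return sk_pcibios_err_to_errno(rc);

	return 0;
}
```

Converting inside the helper means every caller up to the probe function only ever sees zero or a negative errno.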

Fixes: 10590a9d4f ("EDAC/igen6: Add EDAC driver for Intel client SoCs using IBECC")
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20240527132236.13875-2-ilpo.jarvinen@linux.intel.com
2024-06-04 11:29:52 +02:00
Ilpo Järvinen
3ec8ebd8a5 EDAC/amd64: Convert PCIBIOS_* return codes to errnos
gpu_get_node_map() uses pci_read_config_dword() that returns PCIBIOS_*
codes. The return code is then returned all the way into the module
init function amd64_edac_init() that returns it as is. The module init
functions, however, should return normal errnos.

Convert the PCIBIOS_* return codes into normal errnos using
pcibios_err_to_errno() before returning them from gpu_get_node_map().

For consistency, also convert the other similar cases which return
PCIBIOS_* codes, even though they do not cause any bugs at the moment.

Fixes: 4251566ebc ("EDAC/amd64: Cache and use GPU node map")
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20240527132236.13875-1-ilpo.jarvinen@linux.intel.com
2024-06-04 11:24:16 +02:00
Arnd Bergmann
123b158635 EDAC, i10nm: make skx_common.o a separate module
Commit 598afa0504 ("kbuild: warn objects shared among multiple modules")
was added to track down cases where the same object is linked into
multiple modules. This can cause serious problems if some modules are
builtin while others are not.

That test triggers this warning:

scripts/Makefile.build:236: drivers/edac/Makefile: skx_common.o is added to multiple modules: i10nm_edac skx_edac

Make this a separate module instead.

[Tony: Added more background details to commit message]

Fixes: d4dc89d069 ("EDAC, i10nm: Add a driver for Intel 10nm server processors")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/all/20240529095132.1929397-1-arnd@kernel.org/
2024-05-29 13:30:10 -07:00
Tony Luck
c2c887e9f9 EDAC/skx: Switch to new Intel CPU model defines
New CPU #defines encode vendor and family as well as model.
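
As a rough illustration of what it means for one define to encode vendor, family, and model, the following sketch packs the three fields into a single integer. The field layout and every `SK_`/`sk_` name are assumptions made for this example; the kernel's own macros are defined in its x86 headers.

```c
#include <stdint.h>

/* Assumed field layout for this sketch: vendor | family | model. */
#define SK_VFM_MODEL_SHIFT   0
#define SK_VFM_FAMILY_SHIFT  8
#define SK_VFM_VENDOR_SHIFT  16

#define SK_VFM_MAKE(v, f, m) \
	(((uint32_t)(v) << SK_VFM_VENDOR_SHIFT) | \
	 ((uint32_t)(f) << SK_VFM_FAMILY_SHIFT) | \
	 ((uint32_t)(m) << SK_VFM_MODEL_SHIFT))

/* Extract the individual fields from a packed value. */
uint32_t sk_vfm_vendor(uint32_t vfm)
{
	return (vfm >> SK_VFM_VENDOR_SHIFT) & 0xff;
}

uint32_t sk_vfm_family(uint32_t vfm)
{
	return (vfm >> SK_VFM_FAMILY_SHIFT) & 0xff;
}

uint32_t sk_vfm_model(uint32_t vfm)
{
	return (vfm >> SK_VFM_MODEL_SHIFT) & 0xff;
}
```

Because the family is part of the value, two CPUs that happen to share a model number but belong to different families no longer compare equal in a match table.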

Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/20240520224620.9480-39-tony.luck@intel.com
2024-05-28 16:04:44 -07:00
Tony Luck
9593189cf0 EDAC/sb_edac: Switch to new Intel CPU model defines
New CPU #defines encode vendor and family as well as model.

Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/20240520224620.9480-38-tony.luck@intel.com
2024-05-28 16:04:17 -07:00
Tony Luck
e09d576c86 EDAC, pnd2: Switch to new Intel CPU model defines
New CPU #defines encode vendor and family as well as model.

Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/20240520224620.9480-37-tony.luck@intel.com
2024-05-28 16:03:43 -07:00
Tony Luck
bc39bfbaa2 EDAC/i10nm: Switch to new Intel CPU model defines
New CPU #defines encode vendor and family as well as model.

Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/20240520224620.9480-36-tony.luck@intel.com
2024-05-28 16:02:44 -07:00
Vasyl Gomonovych
e6f53274c0 EDAC/ghes: Add missing newline to pr_info() statement
printk() adds a newline to strings which are not \n-terminated. Add the
missing newline character anyway because, in the unlikely case that a
KERN_CONT print statement is added after the unterminated statement, the
two will get glued together, which is not the expected behavior.

[ bp: Rewrite commit message. ]

Signed-off-by: Vasyl Gomonovych <gomonovych@gmail.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20240517204951.2019031-1-gomonovych@gmail.com
2024-05-28 16:13:09 +02:00
Dr. David Alan Gilbert
9aa31612d9 EDAC/thunderx: Remove unused struct error_syndrome
struct error_syndrome appears never to have been used. Remove it,
together with the MAX_SYNDROME_REGS it used.

Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20240516133404.251397-1-linux@treblig.org
2024-05-27 14:42:04 +02:00
Linus Torvalds
eba77c0477 Merge tag 'edac_updates_for_v6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras

Pull EDAC updates from Borislav Petkov:

 - Have skx_edac decode error addresses belonging to SGX properly

 - Remove a bunch of unused struct members

 - Other cleanups

* tag 'edac_updates_for_v6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras:
  EDAC/skx_common: Allow decoding of SGX addresses
  EDAC/mc_sysfs: Convert sprintf()/snprintf() to sysfs_emit()
  EDAC: Remove unused struct members
  EDAC: Remove dynamic attributes from edac_device_alloc_ctl_info()
  EDAC/device: Remove edac_dev_sysfs_block_attribute::store()
  EDAC/device: Remove edac_dev_sysfs_block_attribute::{block,value}
  EDAC/amd64: Remove unused struct member amd64_pvt::ext_nbcfg
2024-05-14 08:31:10 -07:00
Serge Semin
591c946675 EDAC/synopsys: Fix ECC status and IRQ control race condition
The race condition around the ECCCLR register access happens in the IRQ
disable method called in the device remove() procedure and in the ECC IRQ
handler:

  1. Enable IRQ:
     a. ECCCLR = EN_CE | EN_UE
  2. Disable IRQ:
     a. ECCCLR = 0
  3. IRQ handler:
     a. ECCCLR = CLR_CE | CLR_CE_CNT | CLR_UE | CLR_UE_CNT
     b. ECCCLR = 0
     c. ECCCLR = EN_CE | EN_UE

So if the IRQ disabling procedure is called concurrently with the IRQ
handler method, the IRQ might actually be left enabled due to
statement 3c.

The root cause of the problem is that the ECCCLR register (which has been
called ECCCTL since v3.10a) intermixes the ECC status data clear
flags and the IRQ enable/disable flags. Thus the IRQ disabling (clear EN
flags) and handling (write 1 to clear ECC status data) procedures must
be serialised around the ECCCTL register modification to prevent the
race.

So fix the problem described above by adding a spin-lock around the
ECCCLR modifications and preventing the IRQ handler from modifying the
IRQ enable flags (there is no point in disabling the IRQ and then
re-enabling it again within a single IRQ handler call, see
statements 3a/3b and 3c above).
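
The serialisation described above can be modelled in user-space C. This is a sketch under stated assumptions, not the driver: the bit positions, the `sk_ecc` structure, and the register model (only the enable bits are stored, and a write with any clear bit set just bumps a counter) are all hypothetical. A C11 `atomic_flag` stands in for the kernel spin-lock.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical bit layout: status-clear flags and IRQ enable flags share one register. */
#define SK_CLR_CE      (1u << 0)
#define SK_CLR_UE      (1u << 1)
#define SK_CLR_CE_CNT  (1u << 2)
#define SK_CLR_UE_CNT  (1u << 3)
#define SK_CLR_MASK    (SK_CLR_CE | SK_CLR_UE | SK_CLR_CE_CNT | SK_CLR_UE_CNT)
#define SK_EN_CE       (1u << 8)
#define SK_EN_UE       (1u << 9)
#define SK_EN_MASK     (SK_EN_CE | SK_EN_UE)

struct sk_ecc {
	atomic_flag lock;	/* stand-in for a spin-lock */
	uint32_t reg;		/* modelled register: enable bits only */
	unsigned int clears;	/* number of status-clear writes seen */
};

static void sk_lock(struct sk_ecc *e)
{
	while (atomic_flag_test_and_set_explicit(&e->lock, memory_order_acquire))
		;	/* spin until the flag is released */
}

static void sk_unlock(struct sk_ecc *e)
{
	atomic_flag_clear_explicit(&e->lock, memory_order_release);
}

/* Model of a register write (assumption: clear bits do not read back). */
static void sk_reg_write(struct sk_ecc *e, uint32_t val)
{
	if (val & SK_CLR_MASK)
		e->clears++;
	e->reg = val & SK_EN_MASK;
}

void sk_init(struct sk_ecc *e)
{
	atomic_flag_clear(&e->lock);
	e->reg = 0;
	e->clears = 0;
}

void sk_enable_irq(struct sk_ecc *e)
{
	sk_lock(e);
	sk_reg_write(e, e->reg | SK_EN_MASK);
	sk_unlock(e);
}

void sk_disable_irq(struct sk_ecc *e)
{
	sk_lock(e);
	sk_reg_write(e, e->reg & ~SK_EN_MASK);
	sk_unlock(e);
}

/* The handler clears status but preserves whatever enable state it finds. */
void sk_irq_handler(struct sk_ecc *e)
{
	sk_lock(e);
	sk_reg_write(e, (e->reg & SK_EN_MASK) | SK_CLR_MASK);
	sk_unlock(e);
}
```

Since the handler never sets the enable bits itself, a handler invocation that runs after `sk_disable_irq()` cannot turn the interrupt back on, and the lock keeps the read-modify-write sequences from interleaving.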

Fixes: f7824ded41 ("EDAC/synopsys: Add support for version 3 of the Synopsys EDAC DDR")
Signed-off-by: Serge Semin <fancer.lancer@gmail.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20240222181324.28242-2-fancer.lancer@gmail.com
2024-05-06 14:19:07 +02:00
Shubhrajyoti Datta
1a24733e80 EDAC/versal: Do not log total error counts
When logging errors, the driver currently logs the total error count.
However, it should log the current error only. Fix it.

  [ bp: Rewrite text. ]

Fixes: 6f15b178cd ("EDAC/versal: Add a Xilinx Versal memory controller driver")
Signed-off-by: Shubhrajyoti Datta <shubhrajyoti.datta@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20240425121942.26378-4-shubhrajyoti.datta@amd.com
2024-04-25 18:08:05 +02:00
Shubhrajyoti Datta
de87ba848d EDAC/versal: Check user-supplied data before injecting an error
The function inject_data_ue_store() lacks a NULL check for the
user-passed values. To prevent the kernel crash below, include a NULL check.

Call trace:

  kstrtoull
  kstrtou8
  inject_data_ue_store
  full_proxy_write
  vfs_write
  ksys_write
  __arm64_sys_write
  invoke_syscall
  el0_svc_common.constprop.0
  do_el0_svc
  el0_svc
  el0t_64_sync_handler
  el0t_64_sync
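
The kind of guard this fix adds can be sketched in user-space C as follows. `sk_parse_u8()` is a hypothetical helper written for this example, with strtoul() standing in for kstrtou8(); the error code chosen here is an assumption and not necessarily what the driver returns.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Parse an unsigned 8-bit value, rejecting missing input before parsing. */
int sk_parse_u8(const char *s, uint8_t *out)
{
	char *end;
	unsigned long v;

	if (!s || !out)
		return -EINVAL;	/* the NULL check: never hand NULL to the parser */

	errno = 0;
	v = strtoul(s, &end, 0);
	if (errno || end == s)
		return -EINVAL;	/* overflow or no digits at all */

	if (*end == '\n')
		end++;		/* tolerate one trailing newline */
	if (*end != '\0' || v > UINT8_MAX)
		return -EINVAL;	/* trailing junk or out of range */

	*out = (uint8_t)v;
	return 0;
}
```

Checking the pointer first turns what would otherwise be a crash inside the parser into an ordinary error return to the writer.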

Fixes: 83bf24051a ("EDAC/versal: Make the bit position of injected errors configurable")
Signed-off-by: Shubhrajyoti Datta <shubhrajyoti.datta@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20240425121942.26378-3-shubhrajyoti.datta@amd.com
2024-04-25 18:04:47 +02:00
Shubhrajyoti Datta
edbe59428e EDAC/versal: Do not register for NOC errors
The NOC errors are not handled in the driver. Remove the request for
registration.

Signed-off-by: Shubhrajyoti Datta <shubhrajyoti.datta@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20240425121942.26378-2-shubhrajyoti.datta@amd.com
2024-04-25 18:04:14 +02:00
Qiuxu Zhuo
e0d3350778 EDAC/skx_common: Allow decoding of SGX addresses
There are no "struct page" associations with SGX pages, causing the check
pfn_to_online_page() to fail. This results in the inability to decode the
SGX addresses and warning messages like:

  Invalid address 0x34cc9a98840 in IA32_MC17_ADDR

Add an additional check to allow the decoding of the error address and to
skip the warning message, if the error address is an SGX address.
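
The additional check can be pictured with a small user-space model. Everything here is hypothetical: the address ranges, `struct sk_platform`, and `sk_addr_decodable()` only illustrate the "online page or SGX page" decision and do not reproduce the kernel helpers involved.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Half-open address range [start, end). */
struct sk_range {
	uint64_t start;
	uint64_t end;
};

/* Hypothetical description of the memory a platform knows about. */
struct sk_platform {
	const struct sk_range *online;	/* memory with page metadata */
	size_t n_online;
	const struct sk_range *sgx;	/* SGX memory, no page metadata */
	size_t n_sgx;
};

static bool sk_in_ranges(const struct sk_range *r, size_t n, uint64_t addr)
{
	for (size_t i = 0; i < n; i++)
		if (addr >= r[i].start && addr < r[i].end)
			return true;
	return false;
}

/* Decode if the address is ordinary online memory or an SGX address. */
bool sk_addr_decodable(const struct sk_platform *p, uint64_t addr)
{
	if (sk_in_ranges(p->online, p->n_online, addr))
		return true;

	return sk_in_ranges(p->sgx, p->n_sgx, addr);
}
```

Only an address that fails both checks is treated as invalid, so SGX addresses are decoded without triggering the warning.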

Fixes: 1e92af09fa ("EDAC/skx_common: Filter out the invalid address")
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/20240408120419.50234-1-qiuxu.zhuo@intel.com
2024-04-08 09:49:45 -07:00
Li Zhijian
d7518ad4ed EDAC/mc_sysfs: Convert sprintf()/snprintf() to sysfs_emit()
Per Documentation/filesystems/sysfs.rst, show() should only use
sysfs_emit() or sysfs_emit_at() when formatting the value to be returned
to user space.
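
To show the shape of such a show() callback outside the kernel, here is a user-space model. `sk_emit()` is a hypothetical approximation written for this example: it assumes the output buffer is one page of `SK_PAGE_SIZE` bytes and returns the number of characters actually written. It is not the kernel's sysfs_emit() implementation.

```c
#include <stdarg.h>
#include <stdio.h>

#define SK_PAGE_SIZE 4096	/* assumed page size for this sketch */

/* Format into a page-sized buffer; return the number of characters written. */
int sk_emit(char *buf, const char *fmt, ...)
{
	va_list ap;
	int n;

	if (!buf)
		return 0;

	va_start(ap, fmt);
	n = vsnprintf(buf, SK_PAGE_SIZE, fmt, ap);
	va_end(ap);

	if (n < 0)
		return 0;			/* formatting error */
	if (n >= SK_PAGE_SIZE)
		return SK_PAGE_SIZE - 1;	/* output was truncated */
	return n;
}

/* A show()-style callback: one value, newline-terminated. */
int sk_show_count(char *buf, unsigned int count)
{
	return sk_emit(buf, "%u\n", count);
}
```

The benefit over a bare sprintf() is that the bound is applied in one place, so individual show() callbacks cannot overrun the page.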

Generated by:

  make coccicheck M=<path/to/file> MODE=patch \
    COCCI=scripts/coccinelle/api/device_attr_show.cocci

No functional change intended.

  [ bp: Massage. ]

Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20240314084628.1322006-1-lizhijian@fujitsu.com
2024-04-04 18:23:57 +02:00
Jiri Slaby (SUSE)
c8d37084e9 EDAC: Remove unused struct members
Remove unused

- edac_pci_ctl_info::edac_subsys
- edac_pci_ctl_info::complete
- edac_device_ctl_info::removal_complete

members.

Found by https://github.com/jirislaby/clang-struct.

  [ bp: Squash three almost identical trivial patches into one. ]

Signed-off-by: Jiri Slaby (SUSE) <jirislaby@kernel.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20240213112051.27715-6-jirislaby@kernel.org
2024-03-27 18:26:58 +01:00