errcmd_enable_error_reporting() uses pci_{read,write}_config_word()
that return PCIBIOS_* codes. The return code is then returned all the
way into the probe function igen6_probe() that returns it as is. The
probe functions, however, should return normal errnos.
Convert PCIBIOS_* returns code using pcibios_err_to_errno() into normal
errno before returning it from errcmd_enable_error_reporting().
Fixes: 10590a9d4f ("EDAC/igen6: Add EDAC driver for Intel client SoCs using IBECC")
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20240527132236.13875-2-ilpo.jarvinen@linux.intel.com
gpu_get_node_map() uses pci_read_config_dword() that returns PCIBIOS_*
codes. The return code is then returned all the way into the module
init function amd64_edac_init() that returns it as is. The module init
functions, however, should return normal errnos.
Convert PCIBIOS_* returns code using pcibios_err_to_errno() into normal
errno before returning it from gpu_get_node_map().
For consistency, convert also the other similar cases which return
PCIBIOS_* codes even if they do not have any bugs at the moment.
Fixes: 4251566ebc ("EDAC/amd64: Cache and use GPU node map")
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20240527132236.13875-1-ilpo.jarvinen@linux.intel.com
The race condition around the ECCCLR register access happens in the IRQ
disable method called in the device remove() procedure and in the ECC IRQ
handler:
1. Enable IRQ:
a. ECCCLR = EN_CE | EN_UE
2. Disable IRQ:
a. ECCCLR = 0
3. IRQ handler:
a. ECCCLR = CLR_CE | CLR_CE_CNT | CLR_CE | CLR_CE_CNT
b. ECCCLR = 0
c. ECCCLR = EN_CE | EN_UE
So if the IRQ disabling procedure is called concurrently with the IRQ
handler method the IRQ might be actually left enabled due to the
statement 3c.
The root cause of the problem is that ECCCLR register (which since
v3.10a has been called as ECCCTL) has intermixed ECC status data clear
flags and the IRQ enable/disable flags. Thus the IRQ disabling (clear EN
flags) and handling (write 1 to clear ECC status data) procedures must
be serialised around the ECCCTL register modification to prevent the
race.
So fix the problem described above by adding the spin-lock around the
ECCCLR modifications and preventing the IRQ-handler from modifying the
IRQs enable flags (there is no point in disabling the IRQ and then
re-enabling it again within a single IRQ handler call, see the
statements 3a/3b and 3c above).
Fixes: f7824ded41 ("EDAC/synopsys: Add support for version 3 of the Synopsys EDAC DDR")
Signed-off-by: Serge Semin <fancer.lancer@gmail.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20240222181324.28242-2-fancer.lancer@gmail.com
When logging errors, the driver currently logs the total error count.
However, it should log the current error only. Fix it.
[ bp: Rewrite text. ]
Fixes: 6f15b178cd ("EDAC/versal: Add a Xilinx Versal memory controller driver")
Signed-off-by: Shubhrajyoti Datta <shubhrajyoti.datta@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20240425121942.26378-4-shubhrajyoti.datta@amd.com
The function inject_data_ue_store() lacks a NULL check for the user
passed values. To prevent below kernel crash include a NULL check.
Call trace:
kstrtoull
kstrtou8
inject_data_ue_store
full_proxy_write
vfs_write
ksys_write
__arm64_sys_write
invoke_syscall
el0_svc_common.constprop.0
do_el0_svc
el0_svc
el0t_64_sync_handler
el0t_64_sync
Fixes: 83bf24051a ("EDAC/versal: Make the bit position of injected errors configurable")
Signed-off-by: Shubhrajyoti Datta <shubhrajyoti.datta@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20240425121942.26378-3-shubhrajyoti.datta@amd.com
There are no "struct page" associations with SGX pages, causing the check
pfn_to_online_page() to fail. This results in the inability to decode the
SGX addresses and warning messages like:
Invalid address 0x34cc9a98840 in IA32_MC17_ADDR
Add an additional check to allow the decoding of the error address and to
skip the warning message, if the error address is an SGX address.
Fixes: 1e92af09fa ("EDAC/skx_common: Filter out the invalid address")
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/20240408120419.50234-1-qiuxu.zhuo@intel.com
Per Documentation/filesystems/sysfs.rst, show() should only use
sysfs_emit() or sysfs_emit_at() when formatting the value to be returned
to user space.
Generated by:
make coccicheck M=<path/to/file> MODE=patch \
COCCI=scripts/coccinelle/api/device_attr_show.cocci
No functional change intended.
[ bp: Massage. ]
Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20240314084628.1322006-1-lizhijian@fujitsu.com
Dynamic attributes are not passed from any caller of
edac_device_alloc_ctl_info(). Drop this unused/untested functionality
completely.
Signed-off-by: Jiri Slaby (SUSE) <jirislaby@kernel.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20240213112051.27715-5-jirislaby@kernel.org
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Commit
cfe40fdb4a ("amd64_edac: add driver header")
added amd64_pvt struct with ext_nbcfg in it. But no one used that member
since then.
Therefore, remove it.
Found by https://github.com/jirislaby/clang-struct.
Signed-off-by: Jiri Slaby (SUSE) <jirislaby@kernel.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Yazen Ghannam <yazen.ghannam@amd.com>
Link: https://lore.kernel.org/r/20240213112051.27715-2-jirislaby@kernel.org
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
collects and manages previously encountered hw errors in order to
save them to persistent storage across reboots. Previously recorded
errors are "replayed" upon reboot in order to poison memory which has
caused said errors in the past.
The main use case is stacked, on-chip memory which cannot simply be
replaced so poisoning faulty areas of it and thus making them
inaccessible is the only strategy to prolong its lifetime.
- Add an AMD address translation library glue which converts the
reported addresses of hw errors into system physical addresses in
order to be used by other subsystems like memory failure, for
example. Add support for MI300 accelerators to that library.
- igen6: Add support for Alder Lake-N SoC
- i10nm: Add Grand Ridge support
- The usual fixlets and cleanups
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmXvKHcACgkQEsHwGGHe
VUo4Lg/+OwXDI1EaCDyaHJ+f6JRmNok1EGjKMVjpp71/XmE3eUjiXfCv/b0bwl3V
oIXGlXpJ5RSME+9aFDWADaE3h5zAGzTwQXuKtOUQPiJ6UuCebXodm8SaIG8V8trG
yaW/hhP98AoJD+fN6qzv4XWYvTG8VRQs4tdISg9FXiljTjv4mKA+sxuCu8KpfrDh
Tg+9F4Rre6gyR5GaB6N7Cc0k97DM7n5yKBZZGKucv+oYzDyf6n631ZSJ2zA9NC51
CJlux917hCXI/IWrCQ2nkyfPPXxn8AaznUAA30wKgwlt8TFSdKTW+DvRA2zyuAU3
0UDHO4FezOKuzVnWkzdnKsIMAnDyTGOz3Fi2LU4mC+JHaHHmI2quSWDxp5phWBuy
S+T3XHxpbSsLGEI7zxT5F9u1oAlCvYu1C7HJw+yxNSn2iCy5LoNo0H/kl/nhR8Xr
FgVp8SYgQRU2Pp8vgGOibMYY/TAHX55EticKdxvBI0yY+iqoJyAbZ0fb0XyLNc7s
GqoWfvrK1KQzf5/Ya1Mm//0/QTPyFmJwujMJ2eEnMRRER+23bYpGvVBBT8E1sG9s
gqEJkKjmVCPt9xJTcivm96sLJ7CG36w8+r/axSqpKXdcvDG9ec8G8PRqjlo5pcvh
gYevmCBIcKny1xuhALwD6Rn2mkPip7araycDx9X9nd5z1qCxBaU=
=FR2l
-----END PGP SIGNATURE-----
Merge tag 'edac_updates_for_v6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras
Pull EDAC updates from Borislav Petkov:
- Add a FRU (Field Replaceable Unit) memory poison manager which
collects and manages previously encountered hw errors in order to
save them to persistent storage across reboots. Previously recorded
errors are "replayed" upon reboot in order to poison memory which has
caused said errors in the past.
The main use case is stacked, on-chip memory which cannot simply be
replaced so poisoning faulty areas of it and thus making them
inaccessible is the only strategy to prolong its lifetime.
- Add an AMD address translation library glue which converts the
reported addresses of hw errors into system physical addresses in
order to be used by other subsystems like memory failure, for
example. Add support for MI300 accelerators to that library.
- igen6: Add support for Alder Lake-N SoC
- i10nm: Add Grand Ridge support
- The usual fixlets and cleanups
* tag 'edac_updates_for_v6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras:
EDAC/versal: Convert to platform remove callback returning void
RAS/AMD/FMPM: Fix off by one when unwinding on error
RAS/AMD/FMPM: Add debugfs interface to print record entries
RAS/AMD/FMPM: Save SPA values
RAS: Export helper to get ras_debugfs_dir
RAS/AMD/ATL: Fix bit overflow in denorm_addr_df4_np2()
RAS: Introduce a FRU memory poison manager
RAS/AMD/ATL: Add MI300 row retirement support
Documentation: Move RAS section to admin-guide
EDAC/versal: Make the bit position of injected errors configurable
EDAC/i10nm: Add Intel Grand Ridge micro-server support
EDAC/igen6: Add one more Intel Alder Lake-N SoC support
RAS/AMD/ATL: Add MI300 DRAM to normalized address translation support
RAS/AMD/ATL: Fix array overflow in get_logical_coh_st_fabric_id_mi300()
RAS/AMD/ATL: Add MI300 support
Documentation: RAS: Add index and address translation section
EDAC/amd64: Use new AMD Address Translation Library
RAS: Introduce AMD Address Translation Library
EDAC/synopsys: Convert to devm_platform_ioremap_resource()
* ras/edac-drivers:
EDAC/i10nm: Add Intel Grand Ridge micro-server support
EDAC/igen6: Add one more Intel Alder Lake-N SoC support
* ras/edac-misc:
EDAC/versal: Convert to platform remove callback returning void
EDAC/versal: Make the bit position of injected errors configurable
EDAC/synopsys: Convert to devm_platform_ioremap_resource()
* ras/edac-amd-atl:
RAS/AMD/FMPM: Fix off by one when unwinding on error
RAS/AMD/FMPM: Add debugfs interface to print record entries
RAS/AMD/FMPM: Save SPA values
RAS: Export helper to get ras_debugfs_dir
RAS/AMD/ATL: Fix bit overflow in denorm_addr_df4_np2()
RAS: Introduce a FRU memory poison manager
RAS/AMD/ATL: Add MI300 row retirement support
Documentation: Move RAS section to admin-guide
RAS/AMD/ATL: Add MI300 DRAM to normalized address translation support
RAS/AMD/ATL: Fix array overflow in get_logical_coh_st_fabric_id_mi300()
RAS/AMD/ATL: Add MI300 support
Documentation: RAS: Add index and address translation section
EDAC/amd64: Use new AMD Address Translation Library
RAS: Introduce AMD Address Translation Library
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
The .remove() callback for a platform driver returns an int which makes
many driver authors wrongly assume it's possible to do error handling by
returning an error code. However the value returned is ignored (apart
from emitting a warning) and this typically results in resource leaks.
To improve this, there is a quest to make the remove callback return
void. In the first step of this quest all drivers are converted to
.remove_new(), which already returns void. Eventually after all drivers
are converted, .remove_new() will be renamed to .remove().
Trivially convert this driver from always returning zero in the remove
callback to the void returning variant.
Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Shubhrajyoti Datta <shubhrajyoti.datta@amd.com>
Link: https://lore.kernel.org/r/83deca1ce260f7e17ff3cb106c9a6946d4ca4505.1709886922.git.u.kleine-koenig@pengutronix.de
AMD (ab)uses topology_die_id() to store the Node ID information and
topology_max_dies_per_pkg to store the number of nodes per package.
This collides with the proper processor die level enumeration which is
coming on AMD with CPUID 8000_0026, unless there is a correlation between
the two. There is zero documentation about that.
So provide new storage and new accessors which for now still access die_id
and topology_max_die_per_pkg(). Will be mopped up after AMD and HYGON are
converted over.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Juergen Gross <jgross@suse.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Tested-by: Michael Kelley <mhklinux@outlook.com>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Tested-by: Wang Wendy <wendy.wang@intel.com>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://lore.kernel.org/r/20240212153624.956116738@linutronix.de
Currently, the bit positions to inject correctable and uncorrectable
errors are hardcoded. To make that configurable add separate sysfs entries
to set the bit positions for injecting CE and UE errors. Allow for
single bit error for CE and two bits errors for UE injection.
[ bp: Massage. ]
Signed-off-by: Shubhrajyoti Datta <shubhrajyoti.datta@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20240208094653.11704-1-shubhrajyoti.datta@amd.com
The Grand Ridge CPU model uses similar memory controller registers with
Granite Rapids server. Add Grand Ridge CPU model ID for EDAC support.
Tested-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/20240129062040.60809-3-qiuxu.zhuo@intel.com
Add a new Intel Alder Lake-N SoC compute die ID for EDAC support.
Signed-off-by: Lili Li <lili.li@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Link: https://lore.kernel.org/r/20240129062040.60809-2-qiuxu.zhuo@intel.com
Remove old address translation code and use the new AMD Address
Translation Library.
Use "imply" in Kconfig so that the "AMD_ATL" config option takes the
value of "EDAC_AMD64" as its default.
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20240123041401.79812-3-yazen.ghannam@amd.com
Use devm_platform_ioremap_resource() to simplify code.
Signed-off-by: Yangtao Li <frank.li@vivo.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Michal Simek <michal.simek@amd.com>
Link: https://lore.kernel.org/r/20230704101811.49637-3-frank.li@vivo.com
Here are the set of driver core and kernfs changes for 6.8-rc1. Nothing
major in here this release cycle, just lots of small cleanups and some
tweaks on kernfs that in the very end, got reverted and will come back
in a safer way next release cycle.
Included in here are:
- more driver core 'const' cleanups and fixes
- fw_devlink=rpm is now the default behavior
- kernfs tiny changes to remove some string functions
- cpu handling in the driver core is updated to work better on many
systems that add topologies and cpus after booting
- other minor changes and cleanups
All of the cpu handling patches have been acked by the respective
maintainers and are coming in here in one series. Everything has been
in linux-next for a while with no reported issues.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCZaeOrg8cZ3JlZ0Brcm9h
aC5jb20ACgkQMUfUDdst+ymtcwCffzvKKkSY9qAp6+0v2WQNkZm1JWoAoJCPYUwF
If6wEoPLWvRfKx4gIoq9
=D96r
-----END PGP SIGNATURE-----
Merge tag 'driver-core-6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core updates from Greg KH:
"Here are the set of driver core and kernfs changes for 6.8-rc1.
Nothing major in here this release cycle, just lots of small cleanups
and some tweaks on kernfs that in the very end, got reverted and will
come back in a safer way next release cycle.
Included in here are:
- more driver core 'const' cleanups and fixes
- fw_devlink=rpm is now the default behavior
- kernfs tiny changes to remove some string functions
- cpu handling in the driver core is updated to work better on many
systems that add topologies and cpus after booting
- other minor changes and cleanups
All of the cpu handling patches have been acked by the respective
maintainers and are coming in here in one series. Everything has been
in linux-next for a while with no reported issues"
* tag 'driver-core-6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (51 commits)
Revert "kernfs: convert kernfs_idr_lock to an irq safe raw spinlock"
kernfs: convert kernfs_idr_lock to an irq safe raw spinlock
class: fix use-after-free in class_register()
PM: clk: make pm_clk_add_notifier() take a const pointer
EDAC: constantify the struct bus_type usage
kernfs: fix reference to renamed function
driver core: device.h: fix Excess kernel-doc description warning
driver core: class: fix Excess kernel-doc description warning
driver core: mark remaining local bus_type variables as const
driver core: container: make container_subsys const
driver core: bus: constantify subsys_register() calls
driver core: bus: make bus_sort_breadthfirst() take a const pointer
kernfs: d_obtain_alias(NULL) will do the right thing...
driver core: Better advertise dev_err_probe()
kernfs: Convert kernfs_path_from_node_locked() from strlcpy() to strscpy()
kernfs: Convert kernfs_name_locked() from strlcpy() to strscpy()
kernfs: Convert kernfs_walk_ns() from strlcpy() to strscpy()
initramfs: Expose retained initrd as sysfs file
fs/kernfs/dir: obey S_ISGID
kernel/cgroup: use kernfs_create_dir_ns()
...
Here is the big set of char/misc and other driver subsystem changes for
6.8-rc1. Lots of stuff in here, but first off, you will get a merge
conflict in drivers/android/binder_alloc.c when merging this tree due to
changing coming in through the -mm tree.
The resolution of the merge issue can be found here:
https://lore.kernel.org/r/20231207134213.25631ae9@canb.auug.org.au
or in a simpler patch form in that thread:
https://lore.kernel.org/r/ZXHzooF07LfQQYiE@google.com
If there are issues with the merge of this file, please let me know.
Other than lots of binder driver changes (as you can see by the merge
conflicts) included in here are:
- lots of iio driver updates and additions
- spmi driver updates
- eeprom driver updates
- firmware driver updates
- ocxl driver updates
- mhi driver updates
- w1 driver updates
- nvmem driver updates
- coresight driver updates
- platform driver remove callback api changes
- tags.sh script updates
- bus_type constant marking cleanups
- lots of other small driver updates
All of these have been in linux-next for a while with no reported issues
(other than the binder merge conflict.)
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCZaeMMQ8cZ3JlZ0Brcm9h
aC5jb20ACgkQMUfUDdst+ynWNgCfQ/Yz7QO6EMLDwHO5LRsb3YMhjL4AoNVdanjP
YoI7f1I4GBcC0GKNfK6s
=+Kyv
-----END PGP SIGNATURE-----
Merge tag 'char-misc-6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc
Pull char/misc and other driver updates from Greg KH:
"Here is the big set of char/misc and other driver subsystem changes
for 6.8-rc1.
Other than lots of binder driver changes (as you can see by the merge
conflicts) included in here are:
- lots of iio driver updates and additions
- spmi driver updates
- eeprom driver updates
- firmware driver updates
- ocxl driver updates
- mhi driver updates
- w1 driver updates
- nvmem driver updates
- coresight driver updates
- platform driver remove callback api changes
- tags.sh script updates
- bus_type constant marking cleanups
- lots of other small driver updates
All of these have been in linux-next for a while with no reported
issues"
* tag 'char-misc-6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (341 commits)
android: removed duplicate linux/errno
uio: Fix use-after-free in uio_open
drivers: soc: xilinx: add check for platform
firmware: xilinx: Export function to use in other module
scripts/tags.sh: remove find_sources
scripts/tags.sh: use -n to test archinclude
scripts/tags.sh: add local annotation
scripts/tags.sh: use more portable -path instead of -wholename
scripts/tags.sh: Update comment (addition of gtags)
firmware: zynqmp: Convert to platform remove callback returning void
firmware: turris-mox-rwtm: Convert to platform remove callback returning void
firmware: stratix10-svc: Convert to platform remove callback returning void
firmware: stratix10-rsu: Convert to platform remove callback returning void
firmware: raspberrypi: Convert to platform remove callback returning void
firmware: qemu_fw_cfg: Convert to platform remove callback returning void
firmware: mtk-adsp-ipc: Convert to platform remove callback returning void
firmware: imx-dsp: Convert to platform remove callback returning void
firmware: coreboot_table: Convert to platform remove callback returning void
firmware: arm_scpi: Convert to platform remove callback returning void
firmware: arm_scmi: Convert to platform remove callback returning void
...
solution which allows for more timely detection and reporting of
errors
- Start a documentation section which will hold down relevant
RAS features description and how they should be used
- Add new AMD error bank types
- Slim down and remove error type descriptions from the kernel side of
error decoding to rasdaemon which can be used from now on to decode
hw errors on AMD
- Mark pages containing uncorrectable errors as poison so that kdump can
avoid them and thus not cause another panic
- The usual cleanups and fixlets
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmWamqkACgkQEsHwGGHe
VUrkMw/+Jv//z5pKFMXF2GlI/Xefh8pXDB57D22B1T6zNJm+Qq1b58U44WpxoU24
b9gPtkgxjzA3JwoQG+cBDkCSs1xLV3McUS0UoEqQ28QvFN+mzxYKu4ura3F+rcZG
PiCT4gPEYgZirYl0PKYeypnBPq3Krx/RYeTUE0vQ9HtmeBCmH71X1egyB4TqFbFU
7ui9aITLAnLBwgO6On1qviGPvJoEDGAsQ656XoY0Js+8dUeYwAI4qpaaUwtXUo1D
ARGzss55/qTRo2G+IkDDhbJ8e8G+eX22oa0n1tdhNFBqfYAN6gM+t4NiFlNnn+18
nlbaSZfDygciE8DVjEnkVIrRJtkq7uj0dk7LvnqEI2y7J0LybHojC3hDKrqYLK3o
PRgfPwmykOCwZQldGFYbShvmY8KoEQc/V9OWi/+A/M/uTJsForQmHn78Z2YkO9kG
K6VaLuYszSCqz47wf56pHBwtMrivEPmkcxaz9ErkK90vM/NmV7kfLl+R8IK8apNJ
cJkuLBjfgGpBP+AlpXGl9OE0lRJK5MCbEBlbGBBl58REBaB4DNkM4QHItrUSRR82
ADLcfLAIRWT8UwwXieDbWF1jb+4L+IJnXCKGCCQ7+eYxcFI9V9TABD2B8io+Dzvz
evZwLCPKmjuPc2CMcDu/eUdBKLNTn3QAoN/NLcVmzJ23lguQW/M=
=o3UM
-----END PGP SIGNATURE-----
Merge tag 'ras_core_for_v6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 RAS updates from Borislav Petkov:
- Convert the hw error storm handling into a finer-grained, per-bank
solution which allows for more timely detection and reporting of
errors
- Start a documentation section which will hold down relevant RAS
features description and how they should be used
- Add new AMD error bank types
- Slim down and remove error type descriptions from the kernel side of
error decoding to rasdaemon which can be used from now on to decode
hw errors on AMD
- Mark pages containing uncorrectable errors as poison so that kdump
can avoid them and thus not cause another panic
- The usual cleanups and fixlets
* tag 'ras_core_for_v6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mce: Handle Intel threshold interrupt storms
x86/mce: Add per-bank CMCI storm mitigation
x86/mce: Remove old CMCI storm mitigation code
Documentation: Begin a RAS section
x86/MCE/AMD: Add new MA_LLC, USR_DP, and USR_CP bank types
EDAC/mce_amd: Remove SMCA Extended Error code descriptions
x86/mce/amd, EDAC/mce_amd: Move long names to decoder module
x86/mce/inject: Clear test status value
x86/mce: Remove redundant check from mce_device_create()
x86/mce: Mark fatal MCE's page as poison to avoid panic in the kdump kernel
driver callback return void
- Add support for AMD AI accelerators
- Add support for a number of Intel SoCs: Alder Lake-N, Raptor Lake-P,
Meteor Lake-{P,PS}
- Random fixes and cleanups all over the place
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmWZRIAACgkQEsHwGGHe
VUpesxAAux/mgyqJ4lfzpP4I3LkDA7X9uU3Iv6j9NtS64TGvulT4ZMiSL/U/nqTY
/jsJYNJE4yIXm2Bf/T3zx+Dxh6MmUZNEYzSgZyyNx9NPEkHi4gX+b9F2nUx9i66s
D0Pn9IVbM1liNfGr2RSSsKwFa/5UBg+58nGF1DSUz+rOaQ0KOoeta9WE98ne4n81
fC0o4RCzcPswy3NVhi3wPoBiAfyMYmrtvBcSezIbV6CaVJB8aFkvZK4xGKkEBFoJ
ApVrSr6bEjJ7I7DdudisfIhLfU6iji7ZGfkwiouN2TIfVCaOZdAVan/OETpPLC6b
IPNac1iIIl6v2OfxQp6s6b/Vq+dL+vVa4SW5tNj242s9eHunuevEcly37x4W254m
vt91K9dvZI3I5DzAbVagw3Z9tFHbIH7uiSYyjMoyYw11EvwY+bZPXfZExvEUuPQK
2f+VL4ELlBj0imrWSCur5hF1g8R2ejTjb+kYlTcULyaXB4ANeBVzeShff3qc/21g
3QojVA7v+7xkPwURKxRwlqkwE113cz0Rr/kKjXhXnBOuPulX0T5eSzX8WExZzSCT
4FsEvAwZPFt7iCTxB+cspykNxyxLqzMb8GZFG+Ji8szGNbVgiidn/QEQzohs8yBu
aFenuLEM7XwvpPCjmLGGnRkFV9mBDfPATLegTFvMpPQVYhjPdp4=
=Bptq
-----END PGP SIGNATURE-----
Merge tag 'edac_updates_for_v6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras
Pull EDAC updates from Borislav Petkov:
- The EDAC drivers part of the effort to make the ->remove() platform
driver callback return void
- Add support for AMD AI accelerators
- Add support for a number of Intel SoCs: Alder Lake-N, Raptor Lake-P,
Meteor Lake-{P,PS}
- Random fixes and cleanups all over the place
* tag 'edac_updates_for_v6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras: (39 commits)
EDAC/skx_common: Filter out the invalid address
EDAC, pnd2: Sort headers alphabetically
EDAC, pnd2: Correct misleading error message in mk_region_mask()
EDAC, pnd2: Apply bit macros and helpers where it makes sense
EDAC, pnd2: Replace custom definition by one from sizes.h
EDAC/igen6: Add Intel Meteor Lake-P SoCs support
EDAC/igen6: Add Intel Meteor Lake-PS SoCs support
EDAC/igen6: Add Intel Raptor Lake-P SoCs support
EDAC/igen6: Add Intel Alder Lake-N SoCs support
EDAC/igen6: Make get_mchbar() helper function
EDAC/amd64: Add support for family 0x19, models 0x90-9f devices
EDAC/mc: Add support for HBM3 memory type
EDAC/{sb,i7core}_edac: Do not use a plain integer for a NULL pointer
EDAC/armada_xp: Explicitly include correct DT includes
EDAC/pci_sysfs: Use PCI_HEADER_TYPE_MASK instead of literals
EDAC/thunderx: Fix possible out-of-bounds string access
EDAC/fsl_ddr: Convert to platform remove callback returning void
EDAC/zynqmp: Convert to platform remove callback returning void
EDAC/xgene: Convert to platform remove callback returning void
EDAC/ti: Convert to platform remove callback returning void
...
Some error event IDs for Versal and Versal NET are different.
Both the platforms should access their respective error event
IDs so use sub_family_code to check for platform and check
error IDs for respective platforms. The family code is passed
via platform data to avoid platform detection again.
Platform data is setup when even driver is registered.
Signed-off-by: Jay Buddhabhatti <jay.buddhabhatti@amd.com>
Link: https://lore.kernel.org/r/20231219055025.27570-3-jay.buddhabhatti@amd.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
In many places in the edac code, struct bus_type pointers are passed
around and then eventually sent to the driver core, which can handle a
constant pointer. So constantify all of the edac usage of these as well
because the data in them is never modified by the edac code either.
Cc: Borislav Petkov <bp@alien8.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: James Morse <james.morse@arm.com>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: Robert Richter <rric@kernel.org>
Cc: <linux-edac@vger.kernel.org>
Link: https://lore.kernel.org/r/2023121909-tribute-punctuate-4b22@gregkh
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Decoding an invalid address with certain firmware decoders could
cause a #PF (Page Fault) in the EFI runtime context, which could
subsequently hang the system. To make {i10nm,skx}_edac more robust
against such bogus firmware decoders, filter out invalid addresses
before allowing the firmware decoder to process them.
Suggested-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/20231207014512.78564-1-qiuxu.zhuo@intel.com
Sort the headers in alphabetic order in order to ease
the maintenance for this part.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
The mask parameter is expected to be of a sequence of the set bits.
It does not mean it must be power of two (only single bit set).
Correct misleading error message.
Suggested-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Apply bit macros (BIT()/BIT_ULL()/GENMASK()/etc) and helpers
(for_each_set_bit()/etc) where it makes sense.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
The sizes.h provides a set of common size definitions, use it.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Add Intel Meteor Lake-PS SoC compute die IDs for EDAC support.
These SoCs share similar IBECC registers with Alder Lake-P SoCs.
The only difference is that IBECC presence is detected through an
MMIO-mapped register instead of the capability register in the
PCI configuration space.
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Add Intel Raptor Lake-P SoC compute die IDs for EDAC support.
These Raptor Lake-P SoCs share similar IBECC registers with Alder Lake-P
SoCs but extend the most significant bit of the error address logged
in IBECC from bit 38 to bit 45.
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Add Intel Alder Lake-N SoC compute die IDs for EDAC support.
Alder Lake-N, with one memory controller, is a reduced version of
Alder Lake-P, which has two memory controllers.
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Make get_mchbar() helper function to retrieve the BAR address of
the memory controller. No function changes.
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
AMD Models 90h-9fh are APUs. They have built-in HBM3 memory. ECC support
is enabled by default.
APU models have a single Data Fabric (DF) per Package. Each DF is
visible to the OS in the same way as chiplet-based systems like Zen2
CPUs and later. However, the Unified Memory Controllers (UMCs) are
arranged in the same way as GPU-based MI200 devices rather than
CPU-based systems.
Use the existing gpu_ops for hetergeneous systems to support enumeration
of nodes and memory topology with few fixups.
[ bp: Massage comments. ]
Signed-off-by: Muralidhara M K <muralidhara.mk@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20231102114225.2006878-5-muralimk@amd.com
Sparse warns about the use of the integer constant 0 as a NULL pointer
with the -Wnon-pointer-null switch.
Even though the C standard requires that 0 == NULL and type conversion
rules turn an integer constant 0 into a NULL pointer when cast to a void
* type, Linus notes that this is a very poor situation from a type
safety angle and a pointer should be initialized with a pointer type
- not an integer constant.
See https://www.spinics.net/lists/linux-sparse/msg10066.html for more
info.
[ bp: Rewrite commit message, drop useless comments in the code. ]
Signed-off-by: Abhinav Singh <singhabhinav9051571833@gmail.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20231128141703.614605-1-singhabhinav9051571833@gmail.com
On AMD systems with Scalable MCA each machine check error of a SMCA bank
type has an associated bit position in the bank's control (CTL)
register.
An error's bit position in the CTL register is used during error decoding
for offsetting into the corresponding bank's error description structure.
As new errors are being added in newer AMD systems for existing SMCA bank
types, the underlying SMCA architecture guarantees that the bit positions
of existing errors are not altered.
However, on some AMD systems some of the existing bit definitions in the
CTL register of SMCA bank type are reassigned without defining new HWID
and McaType. Consequently, the errors whose bit definitions have been
reassigned in the CTL register are being erroneously decoded.
Remove SMCA Extended Error Code descriptions, this avoids decoding
issues for incorrectly reassigned bits, and avoids the related
maintenance burden in the kernel. But the bank type and Extended Error
Code value for an error will continue to be printed as a convenience.
The decoding of SMCA Extended Error Code description can be done by
referring to AMD documentation or use external tools such as rasdaemon.
Offline decoding can be done using below option in rasdaemon. For example:
$ rasdaemon -p --status <STATUS> --ipid <IPID> --smca
Also, the user can pass particular family and model to decode the error
string.
$ rasdaemon -p --status <STATUS> --ipid <IPID> --smca --family <CPU Family>
--model <CPU Model> --bank <BANK_NUM>
Refer to the rasdaemon commit for details:
https://github.com/mchehab/rasdaemon/commit/932118b04a04104dfac6b8536
Signed-off-by: Muralidhara M K <muralidhara.mk@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Yazen Ghannam <yazen.ghannam@amd.com>
Link: https://lore.kernel.org/r/20231102114225.2006878-2-muralimk@amd.com
The DT of_device.h and of_platform.h date back to the separate
of_platform_bus_type before it was merged into the regular platform bus.
As part of that merge prepping Arm DT support 13 years ago, they
"temporarily" include each other. They also include platform_device.h
and of.h.
As a result, there's a pretty much random mix of those include files
used throughout the tree. In order to detangle these headers and replace
the implicit includes with struct declarations, users need to explicitly
include the correct includes.
Signed-off-by: Rob Herring <robh@kernel.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Jan Luebbe <jlu@pengutronix.de>
Link: https://lore.kernel.org/r/20231013190342.246973-1-robh@kernel.org
The long names of the SMCA banks are only used by the MCE decoder
module.
Move them out of the arch code and into the decoder module.
[ bp: Name the long names array "smca_long_names", drop local ptr in
decode_smca_error(), constify arrays. ]
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20231118193248.1296798-5-yazen.ghannam@amd.com
Enabling -Wstringop-overflow globally exposes a warning for a common bug
in the usage of strncat():
drivers/edac/thunderx_edac.c: In function 'thunderx_ocx_com_threaded_isr':
drivers/edac/thunderx_edac.c:1136:17: error: 'strncat' specified bound 1024 equals destination size [-Werror=stringop-overflow=]
1136 | strncat(msg, other, OCX_MESSAGE_SIZE);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
...
1145 | strncat(msg, other, OCX_MESSAGE_SIZE);
...
1150 | strncat(msg, other, OCX_MESSAGE_SIZE);
...
Apparently the author of this driver expected strncat() to behave the
way that strlcat() does, which uses the size of the destination buffer
as its third argument rather than the length of the source buffer. The
result is that there is no check on the size of the allocated buffer.
Change it to strlcat().
[ bp: Trim compiler output, fixup commit message. ]
Fixes: 41003396f9 ("EDAC, thunderx: Add Cavium ThunderX EDAC driver")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Link: https://lore.kernel.org/r/20231122222007.3199885-1-arnd@kernel.org
The .remove() callback for a platform driver returns an int which makes
many driver authors wrongly assume it's possible to do error handling by
returning an error code. However the value returned is ignored (apart
from emitting a warning) and this typically results in resource leaks.
To improve here there is a quest to make the remove callback return
void. In the first step of this quest all drivers are converted to
.remove_new(), which already returns void. Eventually after all drivers
are converted, .remove_new() will be renamed to .remove().
fsl_mc_err_remove() is used as callback in two drivers. So these have to
be converted together to the void returning remove callback.
Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20231013100422.1382040-2-u.kleine-koenig@pengutronix.de
The .remove() callback for a platform driver returns an int which makes
many driver authors wrongly assume it's possible to do error handling by
returning an error code. However the value returned is ignored (apart
from emitting a warning) and this typically results in resource leaks.
To improve here there is a quest to make the remove callback return
void. In the first step of this quest all drivers are converted to
.remove_new(), which already returns void. Eventually after all drivers
are converted, .remove_new() will be renamed to .remove().
Trivially convert this driver from always returning zero in the remove
callback to the void returning variant.
Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Michal Simek <michal.simek@amd.com>
Link: https://lore.kernel.org/r/20231004131254.2673842-22-u.kleine-koenig@pengutronix.de