linux

mirror of https://github.com/torvalds/linux.git synced 2024-11-29 07:31:29 +00:00

Author	SHA1	Message	Date
Yazen Ghannam	6f15e617cc	RAS: Introduce a FRU memory poison manager Memory errors are an expected occurrence on systems with high memory density. Generally, errors within a small number of unique physical locations are acceptable, based on manufacturer and/or admin policy. During run time, memory with errors may be retired so it is no longer used by the system. This is done in mm through page poisoning, and the effect will remain until the system is restarted. If a memory location is consistently faulty, then the same run time error handling may occur in the next reboot cycle, leading to terminating jobs due to that already known bad memory. This could be prevented if information from the previous boot was not lost. Some add-in cards with driver-managed memory have on-board persistent storage. Their driver saves memory error information to the persistent storage during run time. The information is then restored after reset, and known bad memory will be retired before the hardware is used. A running log of bad memory locations is kept across multiple resets. A similar solution is desirable for CPUs. However, this solution should leverage industry-standard components as much as possible, rather than a bespoke platform driver. Two components are needed: a record format and a persistent storage interface. Implement a new module to manage the record formats on persistent storage. Use the requirements for an AMD MI300-based system to start. Vendor- and platform-specific details can be abstracted later as needed. [ bp: Massage commit message and code, squash 30-ish more fixes from Yazen and me. ] Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Co-developed-by: <naveenkrishna.chatradhi@amd.com> Signed-off-by: <naveenkrishna.chatradhi@amd.com> Co-developed-by: <muralidhara.mk@amd.com> Signed-off-by: <muralidhara.mk@amd.com> Tested-by: <sathyapriya.k@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20240214033516.1344948-3-yazen.ghannam@amd.com	2024-02-20 18:56:15 +01:00
Yazen Ghannam	3f3174996b	RAS: Introduce AMD Address Translation Library AMD Zen-based systems report memory errors through Machine Check banks representing Unified Memory Controllers (UMCs). The address value reported for DRAM ECC errors is a "normalized address" that is relative to the UMC. This normalized address must be converted to a system physical address to be usable by the OS. Support for this address translation was introduced to the MCA subsystem with Zen1 systems. The code was later moved to the AMD64 EDAC module, since this was the only user of the code at the time. However, there are uses for this translation outside of EDAC. The system physical address can be used in MCA for preemptive page offlining as done in some MCA notifier functions. Also, this translation is needed as the basis of similar functionality needed for some CXL configurations on AMD systems. Introduce a common address translation library that can be used for multiple subsystems including MCA, EDAC, and CXL. Include support for UMC normalized to system physical address translation for current CPU systems. The Data Fabric Indirect register access offsets and one of the register fields were changed. Default to the current offsets and register field definition. And fallback to the older values if running on a "legacy" system. Provide built-in code to facilitate the loading and unloading of the library module without affecting other modules or built-in code. Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20240123041401.79812-2-yazen.ghannam@amd.com	2024-01-24 12:49:35 +01:00
Valdis Kletnieks	b6ff24f7b5	RAS: Build debugfs.o only when enabled in Kconfig In addition, the 0day bot reported this build error: >> drivers/ras/debugfs.c:10:5: error: redefinition of 'ras_userspace_consumers' int ras_userspace_consumers(void) ^~~~~~~~~~~~~~~~~~~~~~~ In file included from drivers/ras/debugfs.c:3:0: include/linux/ras.h:14:19: note: previous definition of 'ras_userspace_consumers' was here static inline int ras_userspace_consumers(void) { return 0; } ^~~~~~~~~~~~~~~~~~~~~~~ for a riscv-specific .config where CONFIG_DEBUG_FS is not set. Fix all that by making debugfs.o depend on that define. [ bp: Rewrite commit message. ] Reported-by: kbuild test robot <lkp@intel.com> Signed-off-by: Valdis Kletnieks <valdis.kletnieks@vt.edu> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: Tony Luck <tony.luck@intel.com> Cc: linux-edac@vger.kernel.org Cc: x86@kernel.org Link: http://lkml.kernel.org/r/7053.1565218556@turing-police	2019-08-08 17:44:02 +02:00
Thomas Gleixner	ec8f24b7fa	treewide: Add SPDX license identifier - Makefile/Kconfig Add SPDX license identifiers to all Make/Kconfig files which: - Have no license information of any form These files fall under the project license, GPL v2 only. The resulting SPDX license identifier is: GPL-2.0-only Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2019-05-21 10:50:46 +02:00
Borislav Petkov	011d826111	RAS: Add a Corrected Errors Collector Introduce a simple data structure for collecting correctable errors along with accessors. More detailed description in the code itself. The error decoding is done with the decoding chain now and mce_first_notifier() gets to see the error first and the CEC decides whether to log it and then the rest of the chain doesn't hear about it - basically the main reason for the CE collector - or to continue running the notifiers. When the CEC hits the action threshold, it will try to soft-offine the page containing the ECC and then the whole decoding chain gets to see the error. Signed-off-by: Borislav Petkov <bp@suse.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-edac <linux-edac@vger.kernel.org> Link: http://lkml.kernel.org/r/20170327093304.10683-5-bp@alien8.de Signed-off-by: Ingo Molnar <mingo@kernel.org>	2017-03-28 08:54:48 +02:00
Chen, Gong	d963cd95be	RAS, debugfs: Add debugfs interface for RAS subsystem Implement a new debugfs interface for RAS susbsystem. A file named daemon_active is added there accordingly. This file is used to track if user space daemon accesses perf/trace interface or not. One can track which daemon opens it via "lsof /path/to/debugfs/ras/daemon_active". Signed-off-by: Chen, Gong <gong.chen@linux.intel.com> Link: http://lkml.kernel.org/r/1402475691-30045-5-git-send-email-gong.chen@linux.intel.com Signed-off-by: Borislav Petkov <bp@suse.de> Signed-off-by: Tony Luck <tony.luck@intel.com>	2014-06-25 11:22:03 -07:00
Chen, Gong	76ac8275f2	trace, RAS: Add basic RAS trace event To avoid confuision and conflict of usage for RAS related trace event, add an unified RAS trace event stub. Start a RAS subsystem menu which will be fleshed out in time, when more features get added to it. Signed-off-by: Chen, Gong <gong.chen@linux.intel.com> Link: http://lkml.kernel.org/r/1402475691-30045-2-git-send-email-gong.chen@linux.intel.com Signed-off-by: Borislav Petkov <bp@suse.de> Signed-off-by: Tony Luck <tony.luck@intel.com>	2014-06-23 10:12:19 -07:00

7 Commits