present on AMD zen2 servers
- The usual fixes and cleanups all over EDAC land
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmF/rQAACgkQEsHwGGHe
VUqTsg//QSI/akiJnnZ5Ea3CMsFAWH/LIAZzjWa3zcit6OcxrjRMsym1JGs4llUY
7Uis0lGTUH2je3kFrYdPVGXyoWPNdsQCQi/89Anabl5Jrf2u7N2MmUIKcCnzQWjS
PHb4BSSWW2zR8wU8UXOo2I1QKltenKlaO97uE9Te9YZCYZ6PzS0z2qevd0SCPUtF
tPQ+AqYNKE0JNNAqlPietSHs7MisyCuxTHqLiwC0QNuKJ3GhCtVhmYGKqaLOMcfr
oKdBFaAZOhmtUFT0z8cLka8sKO5WDfqli9Qp6Zphv9E6iz6gJq7NCBaKRz8gOkPi
boOjmSOhCi/xJq63DUtfKOogVRM+qhms1Kxni3EgJxyGR/oLwnVX4n0AopKBstb6
AsHPUB8AxrtnEtoBPFM/bO18Eto3HkCZD2yCzztOc5NmtUnVlIh2/qN4oYjBG7pc
20iXwZ49OAYVQwrLdoNvyXQ5BdKGC5tyMgvPWVyOO8Iad2YYfM5iUWDGSycEWSo3
/xZgVljbFRBwjxaTqgoUDxOlBEgQsDg4ENTathCpxuR+meac/kqGZDSLc0zc7hV1
zV2kHB7y6gWGAQiuS3Lq+j0BTNtgbUi43R61J00ecQEIOomsBaWLi6uewNJmvOZ9
Ef7c0t/ree/PMTaXVer0IA8lU3vcHBMTrbNtQsPb4TCRsZOgFhI=
=O0zA
-----END PGP SIGNATURE-----
Merge tag 'edac_updates_for_v5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras
Pull EDAC updates from Borislav Petkov:
"A small pile of EDAC updates which the autumn wind blew my way. :)
- amd64_edac: Add support for three-rank interleaving mode which is
present on AMD zen2 servers
- The usual fixes and cleanups all over EDAC land"
* tag 'edac_updates_for_v5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras:
EDAC/sb_edac: Fix top-of-high-memory value for Broadwell/Haswell
EDAC/ti: Remove redundant error messages
EDAC/amd64: Handle three rank interleaving mode
EDAC/mc_sysfs: Print MC-scope sysfs counters unsigned
EDAC/al_mc: Make use of the helper function devm_add_action_or_reset()
EDAC/mc: Replace strcpy(), sprintf() and snprintf() with strscpy() or scnprintf()
The number of correctable errors is displayed as uncorrectable
errors because the "SBE" error count is passed to both calls of
edac_mc_handle_error().
Pass the correct uncorrectable error count to the second
edac_mc_handle_error() call when logging uncorrectable errors.
[ bp: Massage commit message. ]
Fixes: 7f6998a412 ("ARM: 8888/1: EDAC: Add driver for the Marvell Armada XP SDRAM and L2 cache ECC")
Signed-off-by: Hans Potsch <hans.potsch@nokia.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/20211006121332.58788-1-hans.potsch@nokia.com
The computation of TOHM is off by one bit. This missed bit results in
too low a value for TOHM, which can cause errors in regular memory to
incorrectly report:
EDAC MC0: 1 CE Error at MMIOH area, on addr 0x000000207fffa680 on any memory
Fixes: 50d1bb9367 ("sb_edac: add support for Haswell based systems")
Cc: stable@vger.kernel.org
Reported-by: Meeta Saggi <msaggi@purestorage.com>
Signed-off-by: Eric Badger <ebadger@purestorage.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/20211010170127.848113-1-ebadger@purestorage.com
In the function ti_edac_probe(), devm_ioremap_resource() and
platform_get_irq() already issue error messages when they fail so remove
the redundant error messages in the EDAC driver.
Signed-off-by: Tang Bin <tangbin@cmss.chinamobile.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210811112626.27848-1-tangbin@cmss.chinamobile.com
AMD Rome systems and later support interleaving between three identical
ranks within a channel.
Check for this mode by counting the number of enabled chip selects and
comparing their masks. If there are exactly three enabled chip selects
and their masks are identical, then three rank interleaving is enabled.
The size of a rank is determined from its mask value. However, three
rank interleaving doesn't follow the method of swapping an interleave
bit with the most significant bit. Rather, the interleave bit is flipped
and the most significant bit remains the same. There is only a single
interleave bit in this case.
Account for this when determining the chip select size by keeping the
most significant bit at its original value and ignoring any zero bits.
This will return a full bitmask in [MSB:1].
Fixes: e53a3b267f ("EDAC/amd64: Find Chip Select memory size using Address Mask")
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20211005154419.2060504-1-yazen.ghannam@amd.com
This is cosmetically nicer for counts > INT32_MAX, and aligns the
MC-scope format with that of the lower layer sysfs counter files.
Signed-off-by: Eric Badger <ebadger@purestorage.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20211003181653.GA685515@ebadger-ThinkPad-T590
The helper function devm_add_action_or_reset() will internally call
devm_add_action(), and if devm_add_action() fails then it will
execute the action mentioned and return the error code. So use
devm_add_action_or_reset() instead of devm_add_action() to simplify the
error handling, reduce the code.
Signed-off-by: Cai Huoqing <caihuoqing@baidu.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Talel Shenhar <talel@amazon.com>
Link: https://lkml.kernel.org/r/20210922125924.321-1-caihuoqing@baidu.com
strcpy() performs no bounds checking on the destination buffer. This
could result in linear overflows beyond the end of the buffer, leading
to all kinds of misbehavior. The safe replacement is strscpy().
[1][2]
However, to simplify and clarify the code, to concatenate labels use the
scnprintf() function. This way it is not necessary to check the return
value of strscpy() (-E2BIG if the parameter count is 0 or the src was
truncated) since scnprintf() always returns the number of chars written
into the buffer. This function always returns a nul-terminated string
even if it needs to be truncated.
While at it, fix all other broken string generation code that wrongly
interprets snprintf()'s return code or just uses sprintf(), implement
that using scnprintf() here too. Drop breaks in loops around
scnprintf() as it is safe now to loop.
Moreover, the check is not needed: for the case when the buffer is
exhausted, len never gets zero because scnprintf() takes the full buffer
length as input parameter, but excludes the trailing '\0' in its return
code and thus, 1 is the minimum len.
[1] https://www.kernel.org/doc/html/latest/process/deprecated.html#strcpy
[2] https://github.com/KSPP/linux/issues/88
[ rric: Replace snprintf() with scnprintf(), rework sprintf() user,
drop breaks in loops around scnprintf(), introduce 'end' pointer to
reduce pointer arithmetic, use prefix pattern for e->location,
adjust subject and description ]
Co-developed-by: Joe Perches <joe@perches.com>
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Len Baker <len.baker@gmx.com>
Signed-off-by: Robert Richter <rrichter@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210903150539.7282-1-len.baker@gmx.com
Core changes:
- The usual set of small fixes and improvements all over the place, but nothing
outstanding
MSI changes:
- Further consolidation of the PCI/MSI interrupt chip code
- Make MSI sysfs code independent of PCI/MSI and expose the MSI interrupts
of platform devices in the same way as PCI exposes them.
Driver changes:
- Support for ARM GICv3 EPPI partitions
- Treewide conversion to generic_handle_domain_irq() for all chained
interrupt controllers
- Conversion to bitmap_zalloc() throughout the irq chip drivers
- The usual set of small fixes and improvements
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmEsnpsTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoS+/EACQdpRkzl3IDIYqThxVZ8KQzp2rKKVn
qisAQiWg/6koNJx/yYy62KNAUyKjCIObNtRnWi7OAOx6OvNtQTD2WOLAwkh3Pgw1
8ePYYl55k+yCs8VoITsZM9jYeO+Tk878pU2A6R943zR+g6G7bskGJrxEyZ9TbzIe
qKfusNKnRY9/jMQaRALUAAtA9VIVR867GqORX5X8hKz8yE2rqlpb4y+1CFba5BTV
Vlxw7cIXvXBn7BKAom5diRqEGDNJEbX+56jJ7yDZshgLo7m11D7QLw72kmb6TNVC
g7PchvFi4afpc1ifEAAp0tk4RiSIAQ91nS3n0+jLcLbodOjIkl14eY02ZCJGAP29
uslyzUbmy1wgejG6CA63JtZ4MYdrf/OSMGuoN78qnOKYcIsWFzOvlJmBWWNW34qW
LCaUF9QdJ/slXu6B4vIx30GfN9q4myml8bFUobE5q9mBRrEk4R0B7iyBvPu1xKYr
ZEan67prI5VEu+afJGpp4r294m4HNVkMLfl3nYmE5+y4MoLeMNKDY3IPTvI9iP4G
kaFgoPvQo23WnuclNYpJ+CaA4aRASlB2nTY+oAXIYfehbey9EW5vq4/EK864ek6w
oyUTepxxNhE81tG2jpQbf2tR4COsEHy986clxqPP4AvsZXcbypCw8O2FcflpQbHO
5DLEAfTmp7cziQ==
=qyll
-----END PGP SIGNATURE-----
Merge tag 'irq-core-2021-08-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq updates from Thomas Gleixner:
"Updates to the interrupt core and driver subsystems:
Core changes:
- The usual set of small fixes and improvements all over the place,
but nothing stands out
MSI changes:
- Further consolidation of the PCI/MSI interrupt chip code
- Make MSI sysfs code independent of PCI/MSI and expose the MSI
interrupts of platform devices in the same way as PCI exposes them.
Driver changes:
- Support for ARM GICv3 EPPI partitions
- Treewide conversion to generic_handle_domain_irq() for all chained
interrupt controllers
- Conversion to bitmap_zalloc() throughout the irq chip drivers
- The usual set of small fixes and improvements"
* tag 'irq-core-2021-08-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (57 commits)
platform-msi: Add ABI to show msi_irqs of platform devices
genirq/msi: Move MSI sysfs handling from PCI to MSI core
genirq/cpuhotplug: Demote debug printk to KERN_DEBUG
irqchip/qcom-pdc: Trim unused levels of the interrupt hierarchy
irqdomain: Export irq_domain_disconnect_hierarchy()
irqchip/gic-v3: Fix priority comparison when non-secure priorities are used
irqchip/apple-aic: Fix irq_disable from within irq handlers
pinctrl/rockchip: drop the gpio related codes
gpio/rockchip: drop irq_gc_lock/irq_gc_unlock for irq set type
gpio/rockchip: support next version gpio controller
gpio/rockchip: use struct rockchip_gpio_regs for gpio controller
gpio/rockchip: add driver for rockchip gpio
dt-bindings: gpio: change items restriction of clock for rockchip,gpio-bank
pinctrl/rockchip: add pinctrl device to gpio bank struct
pinctrl/rockchip: separate struct rockchip_pin_bank to a head file
pinctrl/rockchip: always enable clock for gpio controller
genirq: Fix kernel doc indentation
EDAC/altera: Convert to generic_handle_domain_irq()
powerpc: Bulk conversion to generic_handle_domain_irq()
nios2: Bulk conversion to generic_handle_domain_irq()
...
to the Intel SKx drivers
- Print additional useful per-channel error information on i10nm, like on SKL
- Check whether the AMD decoder is loaded on a guest and if so, don't
- The usual round of fixes and cleanups
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmEso2QACgkQEsHwGGHe
VUqZRRAAqraSdImfO4a7dV4lHErI43GDx9WZIKhrfNyvEv32pDc7E6dwUqJ+NHa1
ON469MjmV4N3X5Deq+Ecaij6giieBKazfqbbZA9pSbNtX4+ta+TEDTCmO7IX38BC
pESwHlWinJbpy8SfopXQh9wgx0pWOHfPysStHS0pmG5mH+o9mIDqsn/20Q0o/kIY
Cn+PZ1mJ2f7iJXvjQv7XsGMM/HotLFkJ5RNUoVjZqteOOdTcHRWytuL/r7QY0SpX
iWAcmkkW3gcW1qypvo/n53ghxvSydZpqWJhI25QBJ79CGOCVocYfitkIdIYAYOtt
xH82GFuC5EtDBTsQXQa2gzbraTwnxPCr6BXKY819a7+xsw/WX2x7MctZc6vCH4R+
UoDRQoUAqgNG2D89lgtQHspy3h3gSSWbGRysLdFCohJlDv3mCU4hk7yyvoCtatqR
ki725nkJn0XT1HZAQc+o3p6E+PCmVYd2Km8TtNJ2ZS8z+97rjsZQAV5eeZKLnUC7
lVX57lZ9MciSlrBNe1k9b1er1Ir3/rLs1X33K+yqE8IwwCX1hEw1yPNEf6jSXQwK
qce95INCamyl1Z8y0uhFZfgl8aEN5ZbaL/lBYz1Gk5R7IQCSVd9Amq5fqxOUdPb/
PcJk24Ng+h0oPKiBBVHrsDhFiDnZKAv7JGqEPFNmOi3dWZPrNrI=
=JYRz
-----END PGP SIGNATURE-----
Merge tag 'edac_updates_for_v5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras
Pull EDAC updates from Borislav Petkov:
"The usual EDAC stuff which managed to trickle in for 5.15:
- Add new HBM2 (High Bandwidth Memory Gen 2) type and add support for
it to the Intel SKx drivers
- Print additional useful per-channel error information on i10nm,
like on SKL
- Don't load the AMD EDAC decoder in virtual images
- The usual round of fixes and cleanups"
* tag 'edac_updates_for_v5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras:
EDAC/i10nm: Retrieve and print retry_rd_err_log registers
EDAC/i10nm: Fix NVDIMM detection
EDAC/skx_common: Set the memory type correctly for HBM memory
EDAC/altera: Skip defining unused structures for specific configs
EDAC/mce_amd: Do not load edac_mce_amd module on guests
EDAC/mc: Add new HBM2 memory type
EDAC/amd64: Use DEVICE_ATTR helper macros
Retrieve and print retry_rd_err_log registers like the earlier change:
commit e80634a75a ("EDAC, skx: Retrieve and print retry_rd_err_log registers")
This is a little trickier than on Skylake because of potential
interference with BIOS use of the same registers. The default
behavior is to ignore these registers.
A module parameter retry_rd_err_log(default=0) controls the mode of operation:
- 0=off : Default.
- 1=bios : Linux doesn't reset any control bits, but just reports values.
This is "no harm" mode, but it may miss reporting some data.
- 2=linux: Linux tries to take control and resets mode bits,
clears valid/UC bits after reading. This should be
more reliable (especially if BIOS interference is reduced
by disabling eMCA reporting mode in BIOS setup).
Co-developed-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Youquan Song <youquan.song@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/20210818175701.1611513-3-tony.luck@intel.com
MCDDRCFG is a per-channel register and uses bit{0,1} to indicate
the NVDIMM presence on DIMM slot{0,1}. Current i10nm_edac driver
wrongly uses MCDDRCFG as per-DIMM register and fails to detect
the NVDIMM.
Fix it by reading MCDDRCFG as per-channel register and using its
bit{0,1} to check whether the NVDIMM is populated on DIMM slot{0,1}.
Fixes: d4dc89d069 ("EDAC, i10nm: Add a driver for Intel 10nm server processors")
Reported-by: Fan Du <fan.du@intel.com>
Tested-by: Wen Jin <wen.jin@intel.com>
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/20210818175701.1611513-2-tony.luck@intel.com
The Altera EDAC driver has several features conditionally built
depending on Kconfig options. The edac_device_prv_data structures
are conditionally used in of_device_id tables. They reference other
functions and structures which can be defined as __maybe_unused.
Silence build warnings like:
drivers/edac/altera_edac.c:643:37: warning:
‘altr_edac_device_inject_fops’ defined but not used [-Wunused-const-variable=]
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Dinh Nguyen <dinguyen@kernel.org>
Link: https://lkml.kernel.org/r/20210601092704.203555-1-krzysztof.kozlowski@canonical.com
Hypervisors likely do not expose the SMCA feature to the guest and
loading this module leads to false warnings. This module should not be
loaded in guests to begin with, but people tend to do so, especially
when testing kernels in VMs. And then they complain about those false
warnings.
Do the practical thing and do not load this module when running as a
guest to avoid all that complaining.
[ bp: Rewrite commit message. ]
Suggested-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Yazen Ghannam <yazen.ghannam@amd.com>
Tested-by: Kim Phillips <kim.phillips@amd.com>
Link: https://lkml.kernel.org/r/20210628172740.245689-1-Smita.KoralahalliChannabasappa@amd.com
Add a new entry to 'enum mem_type' and a new string to 'edac_mem_types[]'
for HBM2 (High Bandwidth Memory Gen 2) new memory type.
Reviewed-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Muralidhara M K <muralimk@amd.com>
Signed-off-by: Naveen Krishna Chatradhi <nchatrad@amd.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/20210630152828.162659-4-nchatrad@amd.com
Instead of "open coding" DEVICE_ATTR, use the corresponding
helper macros DEVICE_ATTR_{RW,RO,WO} in amd64_edac.c
Some function names needed to be changed to match the device
conventions <foo>_show and <foo>_store, but the functionality
itself is unchanged.
The devices using EDAC_DCT_ATTR_SHOW() are left unchanged.
Reviewed-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Dwaipayan Ray <dwaipayanray1@gmail.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/20210713065130.2151-1-dwaipayanray1@gmail.com
Merge more updates from Andrew Morton:
"190 patches.
Subsystems affected by this patch series: mm (hugetlb, userfaultfd,
vmscan, kconfig, proc, z3fold, zbud, ras, mempolicy, memblock,
migration, thp, nommu, kconfig, madvise, memory-hotplug, zswap,
zsmalloc, zram, cleanups, kfence, and hmm), procfs, sysctl, misc,
core-kernel, lib, lz4, checkpatch, init, kprobes, nilfs2, hfs,
signals, exec, kcov, selftests, compress/decompress, and ipc"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (190 commits)
ipc/util.c: use binary search for max_idx
ipc/sem.c: use READ_ONCE()/WRITE_ONCE() for use_global_lock
ipc: use kmalloc for msg_queue and shmid_kernel
ipc sem: use kvmalloc for sem_undo allocation
lib/decompressors: remove set but not used variabled 'level'
selftests/vm/pkeys: exercise x86 XSAVE init state
selftests/vm/pkeys: refill shadow register after implicit kernel write
selftests/vm/pkeys: handle negative sys_pkey_alloc() return code
selftests/vm/pkeys: fix alloc_random_pkey() to make it really, really random
kcov: add __no_sanitize_coverage to fix noinstr for all architectures
exec: remove checks in __register_bimfmt()
x86: signal: don't do sas_ss_reset() until we are certain that sigframe won't be abandoned
hfsplus: report create_date to kstat.btime
hfsplus: remove unnecessary oom message
nilfs2: remove redundant continue statement in a while-loop
kprobes: remove duplicated strong free_insn_page in x86 and s390
init: print out unknown kernel parameters
checkpatch: do not complain about positive return values starting with EPOLL
checkpatch: improve the indented label test
checkpatch: scripts/spdxcheck.py now requires python3
...
kernel.h is being used as a dump for all kinds of stuff for a long time.
Here is the attempt to start cleaning it up by splitting out panic and
oops helpers.
There are several purposes of doing this:
- dropping dependency in bug.h
- dropping a loop by moving out panic_notifier.h
- unload kernel.h from something which has its own domain
At the same time convert users tree-wide to use new headers, although for
the time being include new header back to kernel.h to avoid twisted
indirected includes for existing users.
[akpm@linux-foundation.org: thread_info.h needs limits.h]
[andriy.shevchenko@linux.intel.com: ia64 fix]
Link: https://lkml.kernel.org/r/20210520130557.55277-1-andriy.shevchenko@linux.intel.com
Link: https://lkml.kernel.org/r/20210511074137.33666-1-andriy.shevchenko@linux.intel.com
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Bjorn Andersson <bjorn.andersson@linaro.org>
Co-developed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Mike Rapoport <rppt@linux.ibm.com>
Acked-by: Corey Minyard <cminyard@mvista.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Wei Liu <wei.liu@kernel.org>
Acked-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Sebastian Reichel <sre@kernel.org>
Acked-by: Luis Chamberlain <mcgrof@kernel.org>
Acked-by: Stephen Boyd <sboyd@kernel.org>
Acked-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Acked-by: Helge Deller <deller@gmx.de> # parisc
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
- Clean up error messages from thunderx_edac
- Add MODULE_DEVICE_TABLE to ti_edac so it will autoload
- Use %pR to print resources in aspeed_edac
- Add Yazen Ghannam as MAINTAINER for AMD edac drivers
- Fix Ice Lake and Sapphire Rapids drivers to report correct
"near" or "far" device for errors in 2LM configurations
- Add support of on package high bandwidth memory in Sapphire Rapids
- New CPU support for three CPUs supporting in-band ECC (IOT SKUs for
ICL-NNPI, Tiger Lake and Alder Lake)
- Don't even try to load Intel EDAC drivers when running as a guest
- Fix Kconfig dependency on X86_MCE_INTEL for EDAC_IGEN6
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQQW3WBGcnu5yJnSXn0kTJLX0iGMLAUCYNujCRQcdG9ueS5sdWNr
QGludGVsLmNvbQAKCRAkTJLX0iGMLCazAPsHoGc9ymStw0hL06Lw71Va/3VyqiUZ
ha1OTWXCcV42UgD/QlwhKkHC3UkEI1dTEAI6McXcC88s8+yXQ8fKAPeQyQw=
=iTWK
-----END PGP SIGNATURE-----
Merge tag 'edac_updates_for_v5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras
Pull EDAC updates from Tony Luck:
"Various fixes and support for new CPUs:
- Clean up error messages from thunderx_edac
- Add MODULE_DEVICE_TABLE to ti_edac so it will autoload
- Use %pR to print resources in aspeed_edac
- Add Yazen Ghannam as MAINTAINER for AMD edac drivers
- Fix Ice Lake and Sapphire Rapids drivers to report correct "near"
or "far" device for errors in 2LM configurations
- Add support of on package high bandwidth memory in Sapphire Rapids
- New CPU support for three CPUs supporting in-band ECC (IOT SKUs for
ICL-NNPI, Tiger Lake and Alder Lake)
- Don't even try to load Intel EDAC drivers when running as a guest
- Fix Kconfig dependency on X86_MCE_INTEL for EDAC_IGEN6"
* tag 'edac_updates_for_v5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras:
EDAC/igen6: fix core dependency
EDAC/Intel: Do not load EDAC driver when running as a guest
EDAC/igen6: Add Intel Alder Lake SoC support
EDAC/igen6: Add Intel Tiger Lake SoC support
EDAC/igen6: Add Intel ICL-NNPI SoC support
EDAC/i10nm: Add support for high bandwidth memory
EDAC/i10nm: Add detection of memory levels for ICX/SPR servers
EDAC/skx_common: Add new ADXL components for 2-level memory
MAINTAINERS: Make Yazen Ghannam maintainer for EDAC-AMD64
EDAC/aspeed: Use proper format string for printing resource
EDAC/ti: Add missing MODULE_DEVICE_TABLE
EDAC/thunderx: Remove irrelevant variable from error messages
igen6_edac needs mce_register()/unregister() functions,
so it should depend on X86_MCE (or X86_MCE_INTEL).
That change prevents these build errors:
ld: drivers/edac/igen6_edac.o: in function `igen6_remove':
igen6_edac.c:(.text+0x494): undefined reference to `mce_unregister_decode_chain'
ld: drivers/edac/igen6_edac.o: in function `igen6_probe':
igen6_edac.c:(.text+0xf5b): undefined reference to `mce_register_decode_chain'
Fixes: 10590a9d4f ("EDAC/igen6: Add EDAC driver for Intel client SoCs using IBECC")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/20210619160203.2026-1-rdunlap@infradead.org
There's little to no point in loading an EDAC driver running in a guest:
1) The CPU model reported by CPUID may not represent actual h/w
2) The hypervisor likely does not pass in access to memory controller devices
3) Hypervisors generally do not pass corrected error details to guests
Add a check in each of the Intel EDAC drivers for X86_FEATURE_HYPERVISOR
and simply return -ENODEV in the init routine.
Acked-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/20210615174419.GA1087688@agluck-desk2.amr.corp.intel.com
Alder Lake SoC shares the same memory controller and In-Band ECC
(IBECC) IP with Tiger Lake SoC. Like Tiger Lake, it also has two
memory controllers each associated one IBECC instance. The minor
differences include the MMIO offset of each memory controller and
the type of memory error address logged in the IBECC.
So add Alder Lake compute die IDs, adjust the MMIO offset for each
memory controller and handle the type of memory error address logged
in the IBECC for Alder Lake EDAC support.
Tested-by: Vrukesh V Panse <vrukesh.v.panse@intel.com>
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/20210611170123.1057025-7-tony.luck@intel.com
Tiger Lake SoC shares the same memory controller and In-Band ECC
(IBECC) IP with Elkhart Lake SoC. The main differences are that Tiger
Lake has two memory controllers each associated with one IBECC and
uses Machine Check for the memory error notification.
So add Tiger Lake compute die IDs, MCE decoding chain registration,
and memory slice decoding for Tiger Lake EDAC support.
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/20210611170123.1057025-6-tony.luck@intel.com
The Ice Lake Neural Network Processor for Deep Learning Inference
(ICL-NNPI) SoC shares the same memory controller and In-Band ECC with
Elkhart Lake SoC. Add the ICL-NNPI compute die IDs for EDAC support.
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/20210611170123.1057025-5-tony.luck@intel.com
A future Xeon processor will include in-package HBM (high bandwidth
memory). The in-package HBM memory controller shares the same
architecture with the regular DDR memory controller.
Add the HBM memory controller devices for EDAC support.
Tested-by: Hongyu Ning <hongyu.ning@linux.intel.com>
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/20210611170123.1057025-4-tony.luck@intel.com
Current i10nm_edac driver is only for system configured in 1-level
memory. If the system is configured in 2-level memory, the driver
doesn't report the 1st level memory DIMM for the error address, even
if the error occurs in the 1st level memory.
Both Ice Lake servers and Sapphire Rapids servers can be configured
in 2-level memory. Add detection of memory levels to i10nm_edac for
the two kinds of servers so that the driver can report the 2nd level
memory DIMM or the 1st level memory DIMM according to error source.
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/20210611170123.1057025-3-tony.luck@intel.com
Some Intel servers may configure memory in 2 levels, using
fast "near" memory (e.g. DDR) as a cache for larger, slower,
"far" memory (e.g. 3D X-point).
In these configurations the BIOS ADXL address translation for
an address in a 2-level memory range will provide details of
both the "near" and far components.
Current exported ADXL components are only for 1-level memory
system or for 2nd level memory of 2-level memory system. So
add new ADXL components for 1st level memory of 2-level memory
system to fully support 2-level memory system and the detection
of memory error source(1st level memory or 2nd level memory).
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/20210611170123.1057025-2-tony.luck@intel.com
There is an uppercase letter I in one of the MCE error descriptions
instead of a lowercase one. Fix it.
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Yazen Ghannam <yazen.ghannam@amd.com>
Link: https://lkml.kernel.org/r/20210603103349.79117-1-colin.king@canonical.com
Add the (HWID, MCATYPE) tuples and names for new SMCA bank types.
Also, add their respective error descriptions to the MCE decoding module
edac_mce_amd. Also while at it, optimize the string names for some SMCA
banks.
[ bp: Drop repeated comments, explain why UMC_V2 is a separate entry. ]
Signed-off-by: Muralidhara M K <muralimk@amd.com>
Signed-off-by: Naveen Krishna Chatradhi <nchatrad@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Yazen Ghannam <yazen.ghannam@amd.com>
Link: https://lkml.kernel.org/r/20210526164601.66228-1-nchatrad@amd.com
On ARMv7, resource_size_t can be 64-bit, which breaks printing
it as %x:
drivers/edac/aspeed_edac.c: In function 'init_csrows':
drivers/edac/aspeed_edac.c:257:28: error: format '%x' expects argument of \
type 'unsigned int', but argument 4 has type 'resource_size_t' {aka 'long \
long unsigned int'} [-Werror=format=]
257 | dev_dbg(mci->pdev, "dt: /memory node resources: first page \
r.start=0x%x, resource_size=0x%x, PAGE_SHIFT macro=0x%x\n",
Use the special %pR format string to pretty-print the entire resource
instead.
Fixes: edfc2d73ca ("EDAC/aspeed: Add support for AST2400 and AST2600")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Andrew Jeffery <andrew@aj.id.au>
Link: https://lkml.kernel.org/r/20210421135500.3518661-1-arnd@kernel.org
The module misses MODULE_DEVICE_TABLE() for of_device_id tables and thus
never autoloads on ID matches.
Add the missing declaration.
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Bixuan Cui <cuibixuan@huawei.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Tero Kristo <kristo@kernel.org>
Link: https://lkml.kernel.org/r/20210512033727.26701-1-cuibixuan@huawei.com
The SYSCFG MSR continued being updated beyond the K8 family; drop the K8
name from it.
Suggested-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Joerg Roedel <jroedel@suse.de>
Link: https://lkml.kernel.org/r/20210427111636.1207-4-brijesh.singh@amd.com
Simplify 32-bit and 64-bit Intel SoCFPGA Kconfig options by having only
one for both of them. This the common practice for other platforms.
Additionally, the ARCH_SOCFPGA is too generic as SoCFPGA designs come
from multiple vendors.
Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com>
Signed-off-by: Dinh Nguyen <dinguyen@kernel.org>
amd64_edac was converted to CPU family autoprobing (from PCI device
IDs) to not have to add a new PCI device ID each time a new platform is
shipped but to support the whole family out-of-the-box.
However, this caused a lot of noise in dmesg even when the machine
doesn't have ECC DIMMs or ECC has been disabled in the BIOS:
EDAC MC: Ver: 3.0.0
EDAC amd64: F17h detected (node 0).
EDAC amd64: Node 0: DRAM ECC disabled.
EDAC amd64: F17h detected (node 1).
EDAC amd64: Node 1: DRAM ECC disabled.
EDAC amd64: F17h detected (node 2).
EDAC amd64: Node 2: DRAM ECC disabled.
EDAC amd64: F17h detected (node 3).
EDAC amd64: Node 3: DRAM ECC disabled.
EDAC amd64: F17h detected (node 4).
EDAC amd64: Node 4: DRAM ECC disabled.
EDAC amd64: F17h detected (node 5).
EDAC amd64: Node 5: DRAM ECC disabled.
EDAC amd64: F17h detected (node 6).
EDAC amd64: Node 6: DRAM ECC disabled.
EDAC amd64: F17h detected (node 7).
EDAC amd64: Node 7: DRAM ECC disabled.
or even
$ grep EDAC dmesg.log | sed 's/\[.*\] //' | sort | uniq -c
128 EDAC amd64: F17h detected (node 0).
128 EDAC amd64: Node 0: DRAM ECC disabled.
1 EDAC MC: Ver: 3.0.0
on a big machine. Yap, that's once per CPU for 128 of them.
So move the init messages after all probing has succeeded to avoid
unnecessary spew in dmesg.
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210119164141.17417-1-bp@alien8.de
Coccinelle reports a redundant error print in xgene_edac_probe() because
platform_get_irq() will already print an error message when it is unable
to get an IRQ.
Use platform_get_irq_optional() instead which avoids the error message
and keep the driver-specific one.
[ bp: Sanitize commit message. ]
Signed-off-by: Menglong Dong <dong.menglong@zte.com.cn>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Robert Richter <rric@kernel.org>
Link: https://lkml.kernel.org/r/20210112103540.7818-1-dong.menglong@zte.com.cn
Families up to and including 0x16 allow access to the injection
hardware. Starting with family 0x17, access to those registers is
blocked by security policy.
Limit that only on the families which support it.
Suggested-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20201222180013.GD13463@zn.tnic
Merge them into the main driver and put them inside an EDAC_DEBUG
ifdeffery to simplify the driver and have all debugging/injection stuff
behind a debug build-time switch.
No functional changes.
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Yazen Ghannam <yazen.ghannam@amd.com>
Link: https://lkml.kernel.org/r/20201215110517.5215-2-bp@alien8.de
There's no need for them to be in a separate file so merge them into the
main driver compilation unit like the other EDAC drivers do.
Drop now-unneeded function export, make the function static and shorten
static function names.
No functional changes.
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Yazen Ghannam <yazen.ghannam@amd.com>
Link: https://lkml.kernel.org/r/20201215110517.5215-1-bp@alien8.de
Give these messages a debug severity as they are really only useful to
the module developers.
Also, drop the "(broken BIOS?)" phrase, since this can cause churn for
BIOS folks. The PCI IDs needed by the module, at least on modern systems,
are fixed in hardware.
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20201215170131.8496-1-Yazen.Ghannam@amd.com
Those were only laptops and are very very unlikely to have ECC memory.
Currently, when the driver attempts to load, it issues:
EDAC amd64: Error: F1 not found: device 0x1601 (broken BIOS?)
because the PCI device is the wrong one (it uses the F15h default one).
So do not load the driver on them as that is pointless.
Reported-by: Don Curtis <bugrprt21882@online.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Tested-by: Don Curtis <bugrprt21882@online.de>
Link: http://bugzilla.opensuse.org/show_bug.cgi?id=1179763
Link: https://lkml.kernel.org/r/20201218160622.20146-1-bp@alien8.de
Merge misc updates from Andrew Morton:
- a few random little subsystems
- almost all of the MM patches which are staged ahead of linux-next
material. I'll trickle to post-linux-next work in as the dependents
get merged up.
Subsystems affected by this patch series: kthread, kbuild, ide, ntfs,
ocfs2, arch, and mm (slab-generic, slab, slub, dax, debug, pagecache,
gup, swap, shmem, memcg, pagemap, mremap, hmm, vmalloc, documentation,
kasan, pagealloc, memory-failure, hugetlb, vmscan, z3fold, compaction,
oom-kill, migration, cma, page-poison, userfaultfd, zswap, zsmalloc,
uaccess, zram, and cleanups).
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (200 commits)
mm: cleanup kstrto*() usage
mm: fix fall-through warnings for Clang
mm: slub: convert sysfs sprintf family to sysfs_emit/sysfs_emit_at
mm: shmem: convert shmem_enabled_show to use sysfs_emit_at
mm:backing-dev: use sysfs_emit in macro defining functions
mm: huge_memory: convert remaining use of sprintf to sysfs_emit and neatening
mm: use sysfs_emit for struct kobject * uses
mm: fix kernel-doc markups
zram: break the strict dependency from lzo
zram: add stat to gather incompressible pages since zram set up
zram: support page writeback
mm/process_vm_access: remove redundant initialization of iov_r
mm/zsmalloc.c: rework the list_add code in insert_zspage()
mm/zswap: move to use crypto_acomp API for hardware acceleration
mm/zswap: fix passing zero to 'PTR_ERR' warning
mm/zswap: make struct kernel_param_ops definitions const
userfaultfd/selftests: hint the test runner on required privilege
userfaultfd/selftests: fix retval check for userfaultfd_open()
userfaultfd/selftests: always dump something in modes
userfaultfd: selftests: make __{s,u}64 format specifiers portable
...