The current all-in-one code is unreadable and really not suited for
adding future features like uniform loading at package or system
scope.
Provide a set of new control functions which split the handling of the
primary and secondary CPUs. These will replace the current all-in-one
rendezvous function in the next step. This is intentionally a separate
change because the diff would otherwise be a completely unreadable mess.
So the flow separates the primary and the secondary CPUs into their own
functions which use the control field in the per CPU ucode_ctrl struct:

    primary()                   secondary()
     wait_for_all()              wait_for_all()
     apply_ucode()               wait_for_release()
     release()                   apply_ucode()
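For illustration, a minimal sketch of the split; apart from the helper
names in the diagram, all names (the SCTRL_* states, load_primary() etc.)
are assumptions, not necessarily the final code:

    enum sibling_ctrl {
            SCTRL_WAIT,     /* Secondary spins until released by the primary */
            SCTRL_APPLY,    /* Secondary may apply the microcode */
            SCTRL_DONE,     /* Update finished on this CPU */
    };

    struct ucode_ctrl {
            enum sibling_ctrl       ctrl;
            enum ucode_state        result;
    };

    static DEFINE_PER_CPU(struct ucode_ctrl, ucode_ctrl);

    static void load_primary(unsigned int cpu)
    {
            if (!wait_for_all())            /* Rendezvous with every CPU */
                    return;

            apply_ucode(cpu);               /* Primary thread updates first */
            release();                      /* Set the siblings' ctrl to SCTRL_APPLY */
    }

    static void load_secondary(unsigned int cpu)
    {
            if (!wait_for_all())            /* Rendezvous with every CPU */
                    return;

            wait_for_release();             /* Spin on this CPU's ucode_ctrl.ctrl */
            apply_ucode(cpu);               /* Sibling applies after the primary */
    }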
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20231002115903.377922731@linutronix.de
Add a per CPU control field to ucode_ctrl and define constants for it
which are going to be used to control the loading state machine.
In theory this could be a global control field, but a global control does
not cover the following case:
15 primary CPUs load microcode successfully
1 primary CPU fails and returns with an error code
With global control the sibling of the failed CPU would either try again or
the whole operation would be aborted with the consequence that the 15
siblings do not invoke the apply path and end up with inconsistent software
state. The result in dmesg would be inconsistent too.
There are two additional fields added and initialized:
ctrl_cpu and secondaries. ctrl_cpu is the CPU number of the primary thread
for now, but with the upcoming uniform loading at package or system scope
this will be one CPU per package or just one CPU overall. The
secondaries field hands the control CPU a CPU mask which will be
required to release the secondary CPUs out of the wait loop.
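A sketch of the resulting structure; the exact types and the cpumask
representation are assumptions derived from the description above:

    struct ucode_ctrl {
            enum sibling_ctrl       ctrl;           /* Loading state machine control */
            enum ucode_state        result;         /* Per CPU result code */
            unsigned int            ctrl_cpu;       /* Controlling (primary) CPU */
            cpumask_t               secondaries;    /* CPUs to release from the wait loop */
    };

    /* For now the primary thread of the core controls its siblings: */
    /*      ctrl_cpu = cpumask_first(topology_sibling_cpumask(cpu));  */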
Preparatory change for implementing a properly split control flow for
primary and secondary CPUs.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20231002115903.319959519@linutronix.de
The microcode rendezvous is purely acting on global state, which does
not allow analyzing failures in a coherent way.
Introduce per CPU state into which the results are written, which allows
analyzing the return codes of the individual CPUs.
Initialize the state when walking the cpu_present_mask in the online
check to avoid another for_each_cpu() loop.
Enhance the result printout with that.
The structure is intentionally named ucode_ctrl as it will gain control
fields in subsequent changes.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20231017211723.632681010@linutronix.de
The code is too complicated for no reason:
- The return value is pointless as this is a strict boolean.
- It's way simpler to count down from num_online_cpus() and check for
zero.
- The timeout argument is pointless as this is always one second.
- Touching the NMI watchdog every 100ns does not make any sense, and
neither does checking every 100ns. This is really not a hotpath operation.
Preload the atomic counter with the number of online CPUs and simplify the
whole timeout logic. Delay for one microsecond and touch the NMI watchdog
once per millisecond.
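A sketch of the resulting wait logic (the function name is an
assumption):

    static atomic_t late_cpus_in;

    /* Preloaded before the rendezvous starts:
     *      atomic_set(&late_cpus_in, num_online_cpus());
     */
    static bool wait_for_cpus(atomic_t *cnt)
    {
            unsigned int timeout;

            WARN_ON_ONCE(atomic_dec_return(cnt) < 0);

            for (timeout = 0; timeout < USEC_PER_SEC; timeout++) {
                    if (!atomic_read(cnt))
                            return true;

                    udelay(1);                      /* Check once per microsecond */

                    if (!(timeout % USEC_PER_MSEC)) /* Once per millisecond */
                            touch_nmi_watchdog();
            }
            return false;                           /* Strict boolean: timed out */
    }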
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20231002115903.204251527@linutronix.de
reload_store() is way too complicated. Split the inner workings out and
make the following enhancements:
- Taint the kernel only when the microcode was actually updated. If,
e.g., the rendezvous fails, then nothing happened and there is no
reason for tainting.
- Return useful error codes
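A sketch of the resulting split; the helper name and the exact error
codes are assumptions:

    static int ucode_load_late_locked(void)
    {
            int ret;

            switch (microcode_ops->request_microcode_fw(0, &microcode_pdev->dev)) {
            case UCODE_NEW:
                    break;
            case UCODE_NFOUND:
                    return -ENOENT;
            default:
                    return -EBADFD;
            }

            ret = microcode_reload_late();
            /* Taint only when the microcode was actually updated */
            if (!ret)
                    add_taint(TAINT_CPU_OUT_OF_SPEC, LOCKDEP_STILL_OK);
            return ret;
    }

    static ssize_t reload_store(struct device *dev, struct device_attribute *attr,
                                const char *buf, size_t size)
    {
            unsigned long val;
            ssize_t ret;

            ret = kstrtoul(buf, 0, &val);
            if (ret || val != 1)
                    return -EINVAL;

            cpus_read_lock();
            ret = ucode_load_late_locked();
            cpus_read_unlock();

            return ret ? : size;
    }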
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
Link: https://lore.kernel.org/r/20231002115903.145048840@linutronix.de
On CPUs where microcode loading is not NMI-safe the SMT siblings which
are parked in one of the play_dead() variants still react to NMIs.
So if an NMI hits while the primary thread updates the microcode, the
resulting behaviour is undefined. The default play_dead() implementation
on modern CPUs uses MWAIT, which is not guaranteed to be safe against
a microcode update which affects MWAIT.
Take the cpus_booted_once_mask into account to detect this case and
refuse to load late if the vendor specific driver does not advertise
that late loading is NMI safe.
AMD stated that this is safe, so mark the AMD driver accordingly.
This requirement will be partially lifted in later changes.
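A sketch of the check; the function name is an assumption and nmi_safe
would be a new flag in microcode_ops:

    static bool late_load_is_nmi_safe(void)
    {
            if (microcode_ops->nmi_safe)
                    return true;

            /*
             * CPUs which booted once but are offline now sit in one of
             * the play_dead() variants and still react to NMIs.
             */
            return cpumask_subset(&cpus_booted_once_mask, cpu_online_mask);
    }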
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20231002115903.087472735@linutronix.de
This function has nothing to do with suspend. It's a hotplug
callback. Remove the bogus comment.
Drop the pointless debug printk. The hotplug core provides tracepoints
which track the invocation of those callbacks.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20231002115903.028651784@linutronix.de
Scheduling work on all CPUs to collect the microcode information is
just an extra step which adds no value. Let the CPU hotplug callback
registration do it.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20231017211723.354748138@linutronix.de
Get rid of the initrd_gone hack which was required to keep
find_microcode_in_initrd() functional after init.
As find_microcode_in_initrd() is now only used during init, mark it
accordingly.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20231017211723.298854846@linutronix.de
Now that the microcode cache is initialized before the APs are brought
up, there is no point in scanning builtin/initrd microcode during AP
loading.
Convert the AP loader to utilize the cache, which in turn makes the CPU
hotplug callback that applies the microcode after initrd/builtin are
gone obsolete: the early loading during late hotplug operations,
including the resume path, now depends only on the cache.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20231017211723.243426023@linutronix.de
There is no reason to scan builtin/initrd microcode on each AP.
Cache the builtin/initrd microcode in an early initcall so that the
early AP loader can utilize the cache.
The existing fs initcall which invoked save_microcode_in_initrd_amd() is
still required to maintain the initrd_gone flag. Rename it accordingly.
This will be removed once the AP loader code is converted to use the
cache.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20231017211723.187566507@linutronix.de
save_microcode_in_initrd_amd() fails to cache builtin microcode and only
scans initrd.
Use find_blobs_in_containers() instead which covers both.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20231010150702.495139089@linutronix.de
find_blobs_in_containers() is invoked on every CPU but unconditionally
overwrites the ucode_cpu_info of CPU0.
Fix this by using the proper CPU data and move the assignment into the
call site apply_ucode_from_containers() so that the function can be
reused.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20231010150702.433454320@linutronix.de
Microcode is applied on the APs during early bringup. There is no point
in trying to apply the microcode again during hotplug operations, nor
at the point where the microcode device is initialized.
Collect CPU info and microcode revision in setup_online_cpu() for now.
This will move to the CPU hotplug callback later.
[ bp: Leave the starting notifier for the following scenario:
- boot, late load, suspend to disk, resume
without the starting notifier, only the last core manages to update the
microcode upon resume:
# rdmsr -a 0x8b
10000bf
10000bf
10000bf
10000bf
10000bf
10000dc <----
This is on an AMD F10h machine.
For the future, one should check whether potential unification of
the CPU init path could cover the resume path too so that this can
be simplified even more.
tglx: This is caused by the odd handling of APs which try to find the
microcode blob in builtin or initrd instead of caching the microcode
blob during early init before the APs are brought up. Will be cleaned
up in a later step. ]
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20231017211723.018821624@linutronix.de
Take a cpu_signature argument and work from there. Move the match()
helper next to the callsite as there is no point in having it in
a header.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20231002115902.797820205@linutronix.de
Nothing needs struct ucode_cpu_info. Make it take struct cpu_signature,
let it return a boolean and simplify the implementation. Rename it now
that the silly name clash with collect_cpu_info() is gone.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20231017211722.851573238@linutronix.de
Deduplicate the early and late apply() functions.
[ bp: Rename the function which does the actual application to
__apply_microcode() to differentiate it from
microcode_ops.apply_microcode(). ]
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20231017211722.795508212@linutronix.de
There are situations where the late microcode is loaded into memory but
is not applied:
1) The rendezvous fails
2) The microcode is rejected by the CPUs
If either of those happens then the pointer which was updated at
firmware load time is stale and subsequent CPU hotplug operations
either fail to update or create inconsistent microcode state.
Save the loaded microcode in a separate pointer before the late load is
attempted and when successful, update the hotplug pointer accordingly
via a new microcode_ops callback.
Remove the pointless fallback in the loader to a microcode pointer which
is never populated.
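A sketch of the Intel side; the pointer and callback names are
assumptions, with the callback invoked by the core once the rendezvous
result is known:

    static struct microcode_intel *ucode_patch_va;   /* Used by CPU hotplug */
    static struct microcode_intel *ucode_patch_late; /* Late load candidate */

    static void finalize_late_load(int result)
    {
            if (!result)
                    ucode_patch_va = ucode_patch_late;  /* Success: promote */
            else
                    kvfree(ucode_patch_late);           /* Failure: discard */

            ucode_patch_late = NULL;
    }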
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20231002115902.505491309@linutronix.de
The early loading code is overly complicated:
- It scans the builtin/initrd for microcode not only on the BSP, but also
on all APs during early boot and then later in the boot process it
scans again to duplicate and save the microcode before initrd goes
away.
That's a pointless exercise because this can be simply done before
bringing up the APs when the memory allocator is up and running.
- Saving the microcode from within the scan loop is completely
non-obvious and a leftover of the microcode cache.
This can be done at the call site now, which makes it obvious.
Rework the code so that only the BSP scans the builtin/initrd microcode
once during early boot and saves it away in an early initcall for later
use.
[ bp: Test and fold in a fix from tglx on top which handles the need to
distinguish what save_microcode() does depending on when it is
called:
- when on the BSP during early load, it needs to find a newer
revision than the one currently loaded on the BSP
- later, before SMP init, it still runs on the BSP and gets the BSP
revision just loaded and uses that revision to know which patch
to save for the APs. For that it needs to find the exact one as
on the BSP.
]
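A sketch of that distinction (function name assumed):

    static bool ucode_validate_revision(unsigned int new_rev,
                                        unsigned int curr_rev, bool early)
    {
            /*
             * Early on the BSP only a newer revision than the running
             * one is interesting. Later, before SMP init, the cached
             * patch must match exactly the revision just loaded on the
             * BSP.
             */
            return early ? new_rev > curr_rev : new_rev == curr_rev;
    }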
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20231017211722.629085215@linutronix.de
Sanitize the microcode scan loop, fixup printks and move the loading
function for builtin microcode next to the place where it is used and mark
it __init.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20231002115902.389400871@linutronix.de
so it becomes less obfuscated and rename it because there is nothing
generic about it.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20231002115902.330295409@linutronix.de
Mixed steppings aren't supported on Intel CPUs. Only one microcode patch
is required for the entire system. The caching of microcode blobs which
match the family and model is therefore pointless and in fact is
dysfunctional as CPU hotplug updates use only a single microcode blob,
i.e. the one intel_ucode_patch points to.
Remove the microcode cache and make it an AMD local feature.
[ tglx:
- save only at the end. Otherwise random microcode ends up in the
pointer for early loading
- free the ucode patch pointer in save_microcode_patch() only
after kmemdup() has succeeded, as reported by Andrew Cooper ]
Originally-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20231017211722.404362809@linutronix.de
32-bit loads microcode before paging is enabled. The commit which
introduced that has zero justification in the changelog. The cover
letter has slightly more content, but it does not give any technical
justification either:
"The problem in current microcode loading method is that we load a
microcode way, way too late; ideally we should load it before turning
paging on. This may only be practical on 32 bits since we can't get
to 64-bit mode without paging on, but we should still do it as early
as at all possible."
Handwaving word salad with zero technical content.
Someone claimed in an offlist conversation that this is required for
curing the ATOM erratum AAE44/AAF40/AAG38/AAH41. That erratum requires
a microcode update in order to make the usage of PSE safe. But during
early boot, PSE is completely irrelevant and it is evaluated way later.
Neither is it relevant for the AP on single core HT enabled CPUs as the
microcode loading on the AP is not doing anything.
On dual core CPUs there is a theoretical problem if an executable large
page is split between enabling paging (including PSE) and loading the
microcode. But that's only theoretical; it's practically
irrelevant because the affected dual core CPUs are 64bit enabled and
therefore have paging and PSE enabled before loading the microcode on
the second core. So why would it work on 64-bit but not on 32-bit?
The erratum:
"AAG38 Code Fetch May Occur to Incorrect Address After a Large Page is
Split Into 4-Kbyte Pages
Problem: If software clears the PS (page size) bit in a present PDE
(page directory entry), that will cause linear addresses mapped through
this PDE to use 4-KByte pages instead of using a large page after old
TLB entries are invalidated. Due to this erratum, if a code fetch uses
this PDE before the TLB entry for the large page is invalidated then it
may fetch from a different physical address than specified by either the
old large page translation or the new 4-KByte page translation. This
erratum may also cause speculative code fetches from incorrect addresses."
The practical relevance for this is exactly zero because there is no
splitting of large text pages during early boot-time, i.e. between paging
enable and microcode loading, and neither during CPU hotplug.
IOW, loading microcode before paging enable is yet another voodoo
programming solution in search of a problem. What's worse is that it
causes at least two serious problems:
1) When stackprotector is enabled, the microcode loader code has the
stackprotector mechanics enabled. The read from the per CPU variable
__stack_chk_guard always accesses the virtual address, either directly
on UP or via %fs on SMP. In physical address mode this results in an
access to memory above 3GB. So this works by chance, as the hardware
returns the same value when there is no RAM at this physical address.
When there is RAM populated above 3G, the read still happens to work
because nothing changes that memory during the very early boot stage.
That's not necessarily true during runtime CPU hotplug.
2) When function tracing is enabled, the relevant microcode loader
functions and the functions invoked from there will call into the
tracing code and evaluate global and per CPU variables in physical
address mode. What could potentially go wrong?
Cure this and move the microcode loading after the early paging enable, use
the new temporary initrd mapping and remove the gunk in the microcode
loader which is required to handle physical address mode.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20231017211722.348298216@linutronix.de
Early microcode loading on 32-bit runs in physical address mode because
the initrd is not covered by the initial page tables. That results in
a horrible mess all over the microcode loader code.
Provide a temporary mapping for the initrd in the initial page tables by
appending it to the actual initial mapping starting with a new PGD or
PMD depending on the configured page table levels ([non-]PAE).
The page table entries are located after _brk_end so they do not
permanently consume memory space. The mapping is invalidated right away in
i386_start_kernel() after the early microcode loader has run.
This prepares for removing the physical address mode oddities from all
over the microcode loader code, which in turn allows further cleanups.
Provide the map and unmap code and document the place where the
microcode loader needs to be invoked with a comment.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20231017211722.292291436@linutronix.de
Create an aggregate config switch which covers X86_32, MICROCODE and
BLK_DEV_INITRD to avoid lengthy #ifdeffery in upcoming code.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20231017211722.236208250@linutronix.de
Prepare it for adding a temporary initrd mapping by splitting out the
actual map loop.
No functional change.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20231017211722.175910753@linutronix.de
Move the ifdeffery out of the function and use proper typedefs to make it
work for both 2 and 3 level paging.
No functional change.
[ bp: Move mk_early_pgtbl_32() declaration into a header. ]
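A sketch of the typedef approach; the names are assumptions:

    #ifdef CONFIG_X86_PAE
    typedef pmd_t               pl2_t;
    #define pl2_base            initial_pg_pmd
    #define SET_PL2(val)        { .pmd = (val) }
    #else
    typedef pgd_t               pl2_t;
    #define pl2_base            initial_page_table
    #define SET_PL2(val)        { .pgd = (val) }
    #endif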
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20231017211722.111059491@linutronix.de
Use the existing macro instead of undefining and redefining __pa().
No functional change.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20231017211722.051625827@linutronix.de
Stackprotector cannot work before paging is enabled. The read from the
per CPU variable __stack_chk_guard always accesses the virtual address,
either directly on UP or via FS on SMP. In physical address mode this
results in an access to memory above 3GB.
So this works by chance, as the hardware returns the same value when
there is no RAM at this physical address. When there is RAM populated
above 3G, the read still happens to work because nothing changes that
memory during the very early boot stage.
Stop relying on pure luck and disable the stack protector for the only C
function which is called during early boot before paging is enabled.
Remove function tracing from the whole source file as there is no way
to trace this at all; with CONFIG_DYNAMIC_FTRACE=n, mk_early_pgtbl_32()
would access global function tracer variables in physical address mode,
which again might work only by chance.
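A sketch of the function-level part; __no_stack_protector comes from
<linux/compiler_attributes.h>, while the tracing removal is a CFLAGS
change for the file in the Makefile:

    /* Runs in physical address mode: no stackprotector, no tracing */
    void __init __no_stack_protector mk_early_pgtbl_32(void)
    {
            /* Build the early page tables without virtual addresses */
    }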
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20231002115902.156063939@linutronix.de
Building with GCC 11.x results in the following warning:
arch/x86/kernel/cpu/microcode/amd.c: In function ‘find_blobs_in_containers’:
arch/x86/kernel/cpu/microcode/amd.c:504:58: error: ‘h.bin’ directive output may be truncated writing 5 bytes into a region of size between 1 and 7 [-Werror=format-truncation=]
arch/x86/kernel/cpu/microcode/amd.c:503:17: note: ‘snprintf’ output between 35 and 41 bytes into a destination of size 36
The issue is that GCC does not know that the family can only be a byte
(it ultimately comes from CPUID). Suggest the right size to the compiler
by marking the argument as char-sized ("hh"). While at it, instead of
using the slightly more obscure precision specifier, use the width with
zero padding (over 23000 occurrences in kernel sources, vs 500 for
the idiom using the precision).
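For illustration, a self-contained sketch of the two idioms (the buffer
size mirrors the warning above; the values are assumptions):

    #include <stdio.h>

    int main(void)
    {
            unsigned int family = 0x19;  /* From CPUID; GCC sees a full int range */
            char fw_name[36];

            /* Old idiom: precision only, GCC assumes up to 8 hex digits */
            snprintf(fw_name, sizeof(fw_name),
                     "amd-ucode/microcode_amd_fam%.2xh.bin", family);

            /* Fix: zero-padded width plus "hh" bounds the output to 2 digits */
            snprintf(fw_name, sizeof(fw_name),
                     "amd-ucode/microcode_amd_fam%02hhxh.bin", family);

            puts(fw_name);
            return 0;
    }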
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Closes: https://lore.kernel.org/oe-kbuild-all/202308252255.2HPJ6x5Q-lkp@intel.com/
Link: https://lore.kernel.org/r/20231016224858.2829248-1-pbonzini@redhat.com
This reverts commit 45e34c8af5, and the
two subsequent fixes to it:
3f874c9b2a ("x86/smp: Don't send INIT to non-present and non-booted CPUs")
b1472a60a5 ("x86/smp: Don't send INIT to boot CPU")
because it seems to result in hung machines at shutdown. Particularly
some Dell machines, but Thomas says
"The rest seems to be Lenovo and Sony with Alderlake/Raptorlake CPUs -
at least that's what I could figure out from the various bug reports.
I don't know which CPUs the DELL machines have, so I can't say it's a
pattern.
I agree with the revert for now"
Ashok Raj chimes in:
"There was a report (probably this same one), and it turns out it was a
bug in the BIOS SMI handler.
The client BIOSes were waiting for the lowest APICID to be the SMI
rendezvous master. If this is MeteorLake, the BSP wasn't the one with
the lowest APICID and it tripped here.
The BIOS change is also being pushed to others for assimilation :)
Server BIOSes had this correctly for a while now"
and it does look likely to be some bad interaction between SMI and the
non-BSP cores having been put into INIT (and thus unresponsive until reset).
Link: https://bbs.archlinux.org/viewtopic.php?pid=2124429
Link: https://www.reddit.com/r/openSUSE/comments/16qq99b/tumbleweed_shutdown_did_not_finish_completely/
Link: https://forum.artixlinux.org/index.php/topic,5997.0.html
Link: https://bugzilla.redhat.com/show_bug.cgi?id=2241279
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Merge tag 'smp-urgent-2023-10-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull CPU hotplug fix from Ingo Molnar:
"Fix a Longsoon build warning by harmonizing the
arch_[un]register_cpu() prototypes between architectures"
* tag 'smp-urgent-2023-10-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
cpu-hotplug: Provide prototypes for arch CPU registration
Fei has reported that KASAN triggers during apply_alternatives() on
a 5-level paging machine:
BUG: KASAN: out-of-bounds in rcu_is_watching()
Read of size 4 at addr ff110003ee6419a0 by task swapper/0/0
...
__asan_load4()
rcu_is_watching()
trace_hardirqs_on()
text_poke_early()
apply_alternatives()
...
On machines with 5-level paging, cpu_feature_enabled(X86_FEATURE_LA57)
gets patched. It includes KASAN code, where KASAN_SHADOW_START depends on
__VIRTUAL_MASK_SHIFT, which is defined with cpu_feature_enabled().
KASAN gets confused when apply_alternatives() patches the
KASAN_SHADOW_START users. A test patch that makes KASAN_SHADOW_START
static, by replacing __VIRTUAL_MASK_SHIFT with 56, works around the issue.
Fix it for real by disabling KASAN while the kernel is patching alternatives.
[ mingo: updated the changelog ]
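A sketch of the shape of the fix; the real change wraps the patching
inside apply_alternatives(), and the helper name here is made up:

    static void patch_alternative(void *instr, const void *insn_buff, size_t len)
    {
            /*
             * KASAN_SHADOW_START depends on cpu_feature_enabled(X86_FEATURE_LA57),
             * which is among the things patched here, so keep KASAN away
             * while the shadow layout users are rewritten.
             */
            kasan_disable_current();
            text_poke_early(instr, insn_buff, len);
            kasan_enable_current();
    }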
Fixes: 6657fca06e ("x86/mm: Allow to boot without LA57 if CONFIG_X86_5LEVEL=y")
Reported-by: Fei Yang <fei.yang@intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20231012100424.1456-1-kirill.shutemov@linux.intel.com
Provide common prototypes for arch_register_cpu() and
arch_unregister_cpu(). These are called by acpi_processor.c, with weak
versions, so the prototype for this is already set. It is generally not
necessary for function prototypes to be conditional on preprocessor macros.
Some architectures (e.g. Loongarch) are missing the prototype for this, and
rather than add it to Loongarch's asm/cpu.h, do the job once for everyone.
Since this covers everyone, remove the now unnecessary prototypes in
asm/cpu.h, and therefore remove the 'static' from one of ia64's
arch_register_cpu() definitions.
[ tglx: Bring back the ia64 part and remove the ACPI prototypes ]
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/E1qkoRr-0088Q8-Da@rmk-PC.armlinux.org.uk
Fix erratum #1485 on Zen4 parts where running with STIBP disabled can
cause an #UD exception. The performance impact of the fix is negligible.
Reported-by: René Rebe <rene@exactcode.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: René Rebe <rene@exactcode.de>
Cc: <stable@kernel.org>
Link: https://lore.kernel.org/r/D99589F4-BC5D-430B-87B2-72C20370CF57@exactcode.com
Merge tag 'hyperv-fixes-signed-20231009' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux
Pull hyperv fixes from Wei Liu:
- fixes for Hyper-V VTL code (Saurabh Sengar and Olaf Hering)
- fix hv_kvp_daemon to support keyfile based connection profile
(Shradha Gupta)
* tag 'hyperv-fixes-signed-20231009' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux:
hv/hv_kvp_daemon:Support for keyfile based connection profile
hyperv: reduce size of ms_hyperv_info
x86/hyperv: Add common print prefix "Hyper-V" in hv_init
x86/hyperv: Remove hv_vtl_early_init initcall
x86/hyperv: Restrict get_vtl to only VTL platforms
We found that a panic can occur when a vsyscall is made while LBR sampling
is active. If the vsyscall is interrupted (NMI) for perf sampling, this
call sequence can occur (most recent at top):
__insn_get_emulate_prefix()
insn_get_emulate_prefix()
insn_get_prefixes()
insn_get_opcode()
decode_branch_type()
get_branch_type()
intel_pmu_lbr_filter()
intel_pmu_handle_irq()
perf_event_nmi_handler()
Within __insn_get_emulate_prefix() at frame 0, a macro is called:
peek_nbyte_next(insn_byte_t, insn, i)
Within this macro, this dereference occurs:
(insn)->next_byte
Inspecting registers at this point, the value of the next_byte field is the
address of the vsyscall made, for example the location of the vsyscall
version of gettimeofday() at 0xffffffffff600000. The access to an address
in the vsyscall region will trigger an oops due to an unhandled page fault.
To fix the bug, filtering for vsyscalls can be done when determining
the branch type. This patch will return a "none" branch if a kernel
address is found to lie in the vsyscall region.
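A sketch of such a filter; is_vsyscall_vaddr() already exists in the
x86 fault handling code, and the wrapper name is made up:

    static bool is_vsyscall_vaddr(unsigned long vaddr)
    {
            return (vaddr & PAGE_MASK) == VSYSCALL_ADDR; /* 0xffffffffff600000 */
    }

    static int checked_branch_type(unsigned long from, unsigned long to, int abort)
    {
            /* Never decode instruction bytes from the vsyscall page */
            if (is_vsyscall_vaddr(from) || is_vsyscall_vaddr(to))
                    return X86_BR_NONE;

            return branch_type(from, to, abort); /* Existing decoding path */
    }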
Suggested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: JP Kobryn <inwardvessel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: stable@vger.kernel.org
The kernel test robot reported kernel-doc warnings here:
monitor.c:34: warning: Cannot understand * @rmid_free_lru A least recently used list of free RMIDs on line 34 - I thought it was a doc line
monitor.c:41: warning: Cannot understand * @rmid_limbo_count count of currently unused but (potentially) on line 41 - I thought it was a doc line
monitor.c:50: warning: Cannot understand * @rmid_entry - The entry in the limbo and free lists. on line 50 - I thought it was a doc line
We don't have a syntax for documenting individual data items via
kernel-doc, so remove the "/**" kernel-doc markers and add a hyphen
for consistency.
Fixes: 6a445edce6 ("x86/intel_rdt/cqm: Add RDT monitoring initialization")
Fixes: 24247aeeab ("x86/intel_rdt/cqm: Improve limbo list processing")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20231006235132.16227-1-rdunlap@infradead.org
In snp_accept_memory(), the npages variable's value is calculated from
phys_addr_t variables but is an unsigned int. A very large range passed
into snp_accept_memory() could lead to truncating npages to zero. This
doesn't happen at the moment, but let's be prepared.
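A sketch of the hardened variant, assuming the current shape of the
function:

    void snp_accept_memory(phys_addr_t start, phys_addr_t end)
    {
            unsigned long vaddr, npages;    /* npages was an unsigned int */

            if (!cc_platform_has(CC_ATTR_GUEST_SEV_SNP))
                    return;

            vaddr = (unsigned long)__va(start);
            npages = (end - start) >> PAGE_SHIFT;   /* No 32-bit truncation */

            set_pages_state(vaddr, npages, SNP_PAGE_STATE_PRIVATE);
    }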
Fixes: 6c32117963 ("x86/sev: Add SNP-specific unaccepted memory support")
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Cc: <stable@kernel.org>
Link: https://lore.kernel.org/r/6d511c25576494f682063c9fb6c705b526a3757e.1687441505.git.thomas.lendacky@amd.com
SNP retrieves the majority of CPUID information from the SNP CPUID page.
But there are times when that information needs to be supplemented by the
hypervisor, for example, obtaining the initial APIC ID of the vCPU from
leaf 1.
The current implementation uses the MSR protocol to retrieve the data from
the hypervisor, even when a GHCB exists. The problem arises when an NMI
arrives on return from the VMGEXIT. The NMI will be immediately serviced
and may generate a #VC requiring communication with the hypervisor.
Since a GHCB exists in this case, it will be used. As part of using the
GHCB, the #VC handler will write the GHCB physical address into the GHCB
MSR and the #VC will be handled.
When the NMI completes, processing resumes at the site of the VMGEXIT
which is expecting to read the GHCB MSR and find a CPUID MSR protocol
response. Since the NMI handling overwrote the GHCB MSR response, the
guest will see an invalid reply from the hypervisor and self-terminate.
Fix this problem by using the GHCB when it is available. Any NMI
received is properly handled because the GHCB contents are copied into
a backup page and restored on NMI exit, thus preserving the active GHCB
request or result.
[ bp: Touchups. ]
Fixes: ee0bfa08a3 ("x86/compressed/64: Add support for SEV-SNP CPUID table in #VC handlers")
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Cc: <stable@kernel.org>
Link: https://lore.kernel.org/r/a5856fa1ebe3879de91a8f6298b6bbd901c61881.1690578565.git.thomas.lendacky@amd.com
Merge tag 'mm-hotfixes-stable-2023-10-01-08-34' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morton:
"Fourteen hotfixes, eleven of which are cc:stable. The remainder
pertain to issues which were introduced after 6.5"
* tag 'mm-hotfixes-stable-2023-10-01-08-34' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
Crash: add lock to serialize crash hotplug handling
selftests/mm: fix awk usage in charge_reserved_hugetlb.sh and hugetlb_reparenting_test.sh that may cause error
mm: mempolicy: keep VMA walk if both MPOL_MF_STRICT and MPOL_MF_MOVE are specified
mm/damon/vaddr-test: fix memory leak in damon_do_test_apply_three_regions()
mm, memcg: reconsider kmem.limit_in_bytes deprecation
mm: zswap: fix potential memory corruption on duplicate store
arm64: hugetlb: fix set_huge_pte_at() to work with all swap entries
mm: hugetlb: add huge page size param to set_huge_pte_at()
maple_tree: add MAS_UNDERFLOW and MAS_OVERFLOW states
maple_tree: add mas_is_active() to detect in-tree walks
nilfs2: fix potential use after free in nilfs_gccache_submit_read_data()
mm: abstract moving to the next PFN
mm: report success more often from filemap_map_folio_range()
fs: binfmt_elf_efpic: fix personality for ELF-FDPIC
Merge tag 'x86-urgent-2023-10-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Ingo Molnar:
"Misc fixes: a kerneldoc build warning fix, add SRSO mitigation for
AMD-derived Hygon processors, and fix a SGX kernel crash in the page
fault handler that can trigger when ksgxd races to reclaim the SECS
special page, by making the SECS page unswappable"
* tag 'x86-urgent-2023-10-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/sgx: Resolves SECS reclaim vs. page fault for EAUG race
x86/srso: Add SRSO mitigation for Hygon processors
x86/kgdb: Fix a kerneldoc warning when build with W=1
Merge tag 'perf-urgent-2023-10-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf event fixes from Ingo Molnar:
"Misc fixes: work around an AMD microcode bug on certain models, and
fix kexec kernel PMI handlers on AMD systems that get loaded on older
kernels that have an unexpected register state"
* tag 'perf-urgent-2023-10-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86/amd: Do not WARN() on every IRQ
perf/x86/amd/core: Fix overflow reset on hotplug
In order to fix the L1TF vulnerability, x86 can invert the PTE bits for
PROT_NONE VMAs, which means we cannot move from one PTE to the next by
adding 1 to the PFN field of the PTE. This results in the BUG reported at
[1].
Abstract advancing the PTE to the next PFN through a pte_next_pfn()
function/macro.
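A sketch of the abstraction: a generic default plus an x86 variant
which accounts for PTE inversion (the x86 body is an assumption):

    /* Generic default: moving to the next PFN is a plain increment */
    #ifndef pte_next_pfn
    static inline pte_t pte_next_pfn(pte_t pte)
    {
            return __pte(pte_val(pte) + (1UL << PFN_PTE_SHIFT));
    }
    #endif

    /* x86: inverted (PROT_NONE) PTEs store the complemented PFN */
    static inline pte_t x86_pte_next_pfn(pte_t pte)
    {
            if (__pte_needs_invert(pte_val(pte)))
                    return __pte(pte_val(pte) - (1UL << PFN_PTE_SHIFT));
            return __pte(pte_val(pte) + (1UL << PFN_PTE_SHIFT));
    }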
Link: https://lkml.kernel.org/r/20230920040958.866520-1-willy@infradead.org
Fixes: bcc6cc8325 ("mm: add default definition of set_ptes()")
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reported-by: syzbot+55cc72f8cc3a549119df@syzkaller.appspotmail.com
Closes: https://lkml.kernel.org/r/000000000000d099fa0604f03351@google.com [1]
Reviewed-by: Yin Fengwei <fengwei.yin@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The SGX EPC reclaimer (ksgxd) may reclaim the SECS EPC page for an
enclave and set secs.epc_page to NULL. The SECS page is used for EAUG
and ELDU in the SGX page fault handler. However, the NULL check for
secs.epc_page is done only for ELDU, not for EAUG, before it is used.
Fix this by doing the same NULL check and reloading of the SECS page as
needed for both EAUG and ELDU.
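A sketch of the fix (helper name assumed), shared by the EAUG and ELDU
paths:

    static struct sgx_epc_page *sgx_encl_load_secs(struct sgx_encl *encl)
    {
            struct sgx_epc_page *epc_page = encl->secs.epc_page;

            /* Reload the SECS page if the reclaimer (ksgxd) paged it out */
            if (!epc_page)
                    epc_page = sgx_encl_eldu(&encl->secs, NULL);

            return epc_page;
    }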
The SECS page holds global enclave metadata. It can only be reclaimed
when there are no other enclave pages remaining. At that point,
virtually nothing can be done with the enclave until the SECS page is
paged back in.
An enclave cannot run nor generate page faults without a resident SECS
page. But it is still possible for a #PF for a non-SECS page to race
with paging out the SECS page: when the last resident non-SECS page A
triggers a #PF in a non-resident page B, and then page A and the SECS
both are paged out before the #PF on B is handled.
Hitting this bug requires that the race be triggered with a #PF for
EAUG. The following is a trace of when it happens.
BUG: kernel NULL pointer dereference, address: 0000000000000000
RIP: 0010:sgx_encl_eaug_page+0xc7/0x210
Call Trace:
? __kmem_cache_alloc_node+0x16a/0x440
? xa_load+0x6e/0xa0
sgx_vma_fault+0x119/0x230
__do_fault+0x36/0x140
do_fault+0x12f/0x400
__handle_mm_fault+0x728/0x1110
handle_mm_fault+0x105/0x310
do_user_addr_fault+0x1ee/0x750
? __this_cpu_preempt_check+0x13/0x20
exc_page_fault+0x76/0x180
asm_exc_page_fault+0x27/0x30
Fixes: 5a90d2c3f5 ("x86/sgx: Support adding of pages to an initialized enclave")
Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Acked-by: Reinette Chatre <reinette.chatre@intel.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/all/20230728051024.33063-1-haitao.huang%40linux.intel.com