generic_processor_info() aside of being a complete misnomer is used for
both early boot registration and ACPI CPU hotplug.
While it's arguable that this can share some code, it results in code which
is hard to understand and kept around post init for no real reason.
Also the call sites do lots of manual fiddling in topology related
variables instead of having proper interfaces for the purpose which handle
the topology internals correctly.
Provide topology_register_apic(), topology_hotplug_apic() and
topology_hotunplug_apic() which have the extra magic of the call sites
incorporated and for now are wrappers around generic_processor_info().
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Michael Kelley <mhklinux@outlook.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Link: https://lore.kernel.org/r/20240213210251.605007456@linutronix.de
The APIC/CPU registration sits in the middle of the APIC code. In fact this
is a topology evaluation function and has nothing to do with the inner
workings of the local APIC.
Move it out into a file which reflects what this is about.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Michael Kelley <mhklinux@outlook.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Link: https://lore.kernel.org/r/20240213210251.543948812@linutronix.de
The ACPI ID for CPUs is preset with U32_MAX which is completely non
obvious. Use a proper define for it.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Michael Kelley <mhklinux@outlook.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Link: https://lore.kernel.org/r/20240212154640.177504138@linutronix.de
Paranoia is not wrong, but having an APIC callback which is in most
implementations a complete NOOP and in one actually looking whether the
APICID of an upcoming CPU has been registered. The same APICID which was
used to bring the CPU out of wait for startup.
That's paranoia for the paranoia sake. Remove the voodoo.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Michael Kelley <mhklinux@outlook.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Link: https://lore.kernel.org/r/20240212154640.116510935@linutronix.de
There is absolutely no point to write the APIC ID which was read from the
local APIC earlier, back into the local APIC for the 64-bit UP case.
Remove that along with the apic callback which is solely there for this
pointless exercise.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Michael Kelley <mhklinux@outlook.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Link: https://lore.kernel.org/r/20240212154640.055288922@linutronix.de
physid_t is a wrapper around bitmap. Just remove the onion layer and use
bitmap functionality directly.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Michael Kelley <mhklinux@outlook.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Link: https://lore.kernel.org/r/20240212154639.994904510@linutronix.de
There is no reason to have the early mptable evaluation conditionally
invoked only from the AMD numa topology code.
Make it explicit and invoke it from setup_arch() right after the
corresponding ACPI init call. Remove the pointless wrapper and invoke
x86_init::mpparse::early_parse_smp_config() directly.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Michael Kelley <mhklinux@outlook.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Link: https://lore.kernel.org/r/20240212154639.931761608@linutronix.de
Now that all platforms have the new split SMP configuration callbacks set
up, flip the switch and remove the old callback pointer and mop up the
platform code.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Michael Kelley <mhklinux@outlook.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Link: https://lore.kernel.org/r/20240212154639.870883080@linutronix.de
Provide a wrapper around the existing function and fill the new callbacks
in.
No functional change as the new callbacks are not yet operational.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Michael Kelley <mhklinux@outlook.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Link: https://lore.kernel.org/r/20240212154639.683073662@linutronix.de
x86_dtb_init() is a misnomer and it really should be used as a SMP
configuration parser which is selected by the platform via
x86_init::mpparse:parse_smp_config().
Rename it to x86_dtb_parse_smp_config() in preparation for that.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Michael Kelley <mhklinux@outlook.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Link: https://lore.kernel.org/r/20240212154639.495992801@linutronix.de
In preparation of splitting the get_smp_config() callback, rename
default_get_smp_config() to mpparse_get_smp_config() and provide an early
and late wrapper.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Michael Kelley <mhklinux@outlook.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Link: https://lore.kernel.org/r/20240212154639.433811243@linutronix.de
MPTABLE is no longer the default SMP configuration mechanism. Rename it to
mpparse_find_mptable() because that's what it does.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Michael Kelley <mhklinux@outlook.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Link: https://lore.kernel.org/r/20240212154639.306287711@linutronix.de
No need to go through APIC callbacks. It's already established that this is
an ancient APIC. So just copy the present mask and use the direct physid*
functions all over the place.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Michael Kelley <mhklinux@outlook.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Link: https://lore.kernel.org/r/20240212154639.181901887@linutronix.de
No need to go through APIC callbacks. It's already established that this is
an ancient APIC. So just copy the present mask and use the direct physid*
functions all over the place.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Michael Kelley <mhklinux@outlook.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Link: https://lore.kernel.org/r/20240212154639.119261725@linutronix.de
There is no point for this function. The only case where this is used is
when there is no XAPIC available, which means the broadcast address is 0xF.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Michael Kelley <mhklinux@outlook.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Link: https://lore.kernel.org/r/20240212154639.057209154@linutronix.de
Yet another set_bit() operation wrapped in oring a mask.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Michael Kelley <mhklinux@outlook.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Link: https://lore.kernel.org/r/20240212154638.995080989@linutronix.de
There is no point to do that. The ATOMs have an XAPIC for which this
function is a pointless exercise.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Michael Kelley <mhklinux@outlook.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Link: https://lore.kernel.org/r/20240212154638.931617775@linutronix.de
Detect all possible combinations of mismatch right in the CPUID evaluation
code.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Michael Kelley <mhklinux@outlook.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Link: https://lore.kernel.org/r/20240212154638.867699078@linutronix.de
The package shift has been already evaluated by the early CPU init.
Put the mindless copy right next to the original leaf 0xb parser.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Juergen Gross <jgross@suse.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Tested-by: Michael Kelley <mhklinux@outlook.com>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Tested-by: Wang Wendy <wendy.wang@intel.com>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://lore.kernel.org/r/20240212153625.637385562@linutronix.de
Now that the core code does not use this monstrosity anymore, it's time to
put it to rest.
The only real purpose was to read the APIC ID on UV and VSMP systems for
the actual evaluation. That's what the core code does now.
For doing the actual shift operation there is truly no APIC callback
required.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Juergen Gross <jgross@suse.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Tested-by: Michael Kelley <mhklinux@outlook.com>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Tested-by: Wang Wendy <wendy.wang@intel.com>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://lore.kernel.org/r/20240212153625.516536121@linutronix.de
No more users. Stick it into the ugly code museum.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Juergen Gross <jgross@suse.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Tested-by: Michael Kelley <mhklinux@outlook.com>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Tested-by: Wang Wendy <wendy.wang@intel.com>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://lore.kernel.org/r/20240212153625.395230346@linutronix.de
Now that everything is converted switch it over and remove the intermediate
operation.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Juergen Gross <jgross@suse.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Tested-by: Michael Kelley <mhklinux@outlook.com>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Tested-by: Wang Wendy <wendy.wang@intel.com>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://lore.kernel.org/r/20240212153625.334185785@linutronix.de
Switch it over to use the consolidated topology evaluation and remove the
temporary safe guards which are not longer needed.
No functional change intended.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Juergen Gross <jgross@suse.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Tested-by: Michael Kelley <mhklinux@outlook.com>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Tested-by: Wang Wendy <wendy.wang@intel.com>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://lore.kernel.org/r/20240212153625.207750409@linutronix.de
Switch it over to the new topology evaluation mechanism and remove the
random bits and pieces which are sprinkled all over the place.
No functional change intended.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Juergen Gross <jgross@suse.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Tested-by: Michael Kelley <mhklinux@outlook.com>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Tested-by: Wang Wendy <wendy.wang@intel.com>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://lore.kernel.org/r/20240212153625.145745053@linutronix.de
When switching AMD over to the new topology parser then the match functions
need to look for AMD systems with the extended topology feature at the new
topo.amd_node_id member which is then holding the node id information.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Juergen Gross <jgross@suse.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Tested-by: Michael Kelley <mhklinux@outlook.com>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Tested-by: Wang Wendy <wendy.wang@intel.com>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://lore.kernel.org/r/20240212153625.082979150@linutronix.de
AMD/HYGON uses various methods for topology evaluation:
- Leaf 0x80000008 and 0x8000001e based with an optional leaf 0xb,
which is the preferred variant for modern CPUs.
Leaf 0xb will be superseded by leaf 0x80000026 soon, which is just
another variant of the Intel 0x1f leaf for whatever reasons.
- Subleaf 0x80000008 and NODEID_MSR base
- Legacy fallback
That code is following the principle of random bits and pieces all over the
place which results in multiple evaluations and impenetrable code flows in
the same way as the Intel parsing did.
Provide a sane implementation by clearly separating the three variants and
bringing them in the proper preference order in one place.
This provides the parsing for both AMD and HYGON because there is no point
in having a separate HYGON parser which only differs by 3 lines of
code. Any further divergence between AMD and HYGON can be handled in
different functions, while still sharing the existing parsers.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Juergen Gross <jgross@suse.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Tested-by: Michael Kelley <mhklinux@outlook.com>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Tested-by: Wang Wendy <wendy.wang@intel.com>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://lore.kernel.org/r/20240212153625.020038641@linutronix.de
AMD (ab)uses topology_die_id() to store the Node ID information and
topology_max_dies_per_pkg to store the number of nodes per package.
This collides with the proper processor die level enumeration which is
coming on AMD with CPUID 8000_0026, unless there is a correlation between
the two. There is zero documentation about that.
So provide new storage and new accessors which for now still access die_id
and topology_max_die_per_pkg(). Will be mopped up after AMD and HYGON are
converted over.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Juergen Gross <jgross@suse.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Tested-by: Michael Kelley <mhklinux@outlook.com>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Tested-by: Wang Wendy <wendy.wang@intel.com>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://lore.kernel.org/r/20240212153624.956116738@linutronix.de
Intel CPUs use either topology leaf 0xb/0x1f evaluation or the legacy
SMP/HT evaluation based on CPUID leaf 0x1/0x4.
Move it over to the consolidated topology code and remove the random
topology hacks which are sprinkled into the Intel and the common code.
No functional change intended.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Juergen Gross <jgross@suse.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Tested-by: Michael Kelley <mhklinux@outlook.com>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Tested-by: Wang Wendy <wendy.wang@intel.com>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://lore.kernel.org/r/20240212153624.893644349@linutronix.de
detect_extended_topology() along with it's early() variant is a classic
example for duct tape engineering:
- It evaluates an array of subleafs with a boatload of local variables
for the relevant topology levels instead of using an array to save the
enumerated information and propagate it to the right level
- It has no boundary checks for subleafs
- It prevents updating the die_id with a crude workaround instead of
checking for leaf 0xb which does not provide die information.
- It's broken vs. the number of dies evaluation as it uses:
num_processors[DIE_LEVEL] / num_processors[CORE_LEVEL]
which "works" only correctly if there is none of the intermediate
topology levels (MODULE/TILE) enumerated.
There is zero value in trying to "fix" that code as the only proper fix is
to rewrite it from scratch.
Implement a sane parser with proper code documentation, which will be used
for the consolidated topology evaluation in the next step.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Juergen Gross <jgross@suse.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Tested-by: Michael Kelley <mhklinux@outlook.com>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Tested-by: Wang Wendy <wendy.wang@intel.com>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://lore.kernel.org/r/20240212153624.830571770@linutronix.de
In preparation of a complete replacement for the topology leaf 0xb/0x1f
evaluation, move __max_die_per_package into the common code.
Will be removed once everything is converted over.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Juergen Gross <jgross@suse.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Tested-by: Michael Kelley <mhklinux@outlook.com>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Tested-by: Wang Wendy <wendy.wang@intel.com>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://lore.kernel.org/r/20240212153624.768188958@linutronix.de
Centaur and Zhaoxin CPUs use only the legacy SMP detection. Remove the
invocations from their 32bit path and exclude them from the 64-bit call
path.
No functional change intended.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Juergen Gross <jgross@suse.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Tested-by: Michael Kelley <mhklinux@outlook.com>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Tested-by: Wang Wendy <wendy.wang@intel.com>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://lore.kernel.org/r/20240212153624.706794189@linutronix.de
The legacy topology detection via CPUID leaf 4, which provides the number
of cores in the package and CPUID leaf 1 which provides the number of
logical CPUs in case that FEATURE_HT is enabled and the CMP_LEGACY feature
is not set, is shared for Intel, Centaur and Zhaoxin CPUs.
Lift the code from common.c without the early detection hack and provide it
as common fallback mechanism.
Will be utilized in later changes.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Juergen Gross <jgross@suse.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Tested-by: Michael Kelley <mhklinux@outlook.com>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Tested-by: Wang Wendy <wendy.wang@intel.com>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://lore.kernel.org/r/20240212153624.644448852@linutronix.de
Topology evaluation is a complete disaster and impenetrable mess. It's
scattered all over the place with some vendor implementations doing early
evaluation and some not. The most horrific part is the permanent
overwriting of smt_max_siblings and __max_die_per_package, instead of
establishing them once on the boot CPU and validating the result on the
APs.
The goals are:
- One topology evaluation entry point
- Proper sharing of pointlessly duplicated code
- Proper structuring of the evaluation logic and preferences.
- Evaluating important system wide information only once on the boot CPU
- Making the 0xb/0x1f leaf parsing less convoluted and actually fixing
the short comings of leaf 0x1f evaluation.
Start to consolidate the topology evaluation code by providing the entry
points for the early boot CPU evaluation and for the final parsing on the
boot CPU and the APs.
Move the trivial pieces into that new code:
- The initialization of cpuinfo_x86::topo
- The evaluation of CPUID leaf 1, which presets topo::initial_apicid
- topo_apicid is set to topo::initial_apicid when invoked from early
boot. When invoked for the final evaluation on the boot CPU it reads
the actual APIC ID, which makes apic_get_initial_apicid() obsolete
once everything is converted over.
Provide a temporary helper function topo_converted() which shields off the
not yet converted CPU vendors from invoking code which would break them.
This shielding covers all vendor CPUs which support SMP, but not the
historical pure UP ones as they only need the topology info init and
eventually the initial APIC initialization.
Provide two new members in cpuinfo_x86::topo to store the maximum number of
SMT siblings and the number of dies per package and add them to the debugfs
readout. These two members will be used to populate this information on the
boot CPU and to validate the APs against it.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Juergen Gross <jgross@suse.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Tested-by: Michael Kelley <mhklinux@outlook.com>
Tested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Tested-by: Wang Wendy <wendy.wang@intel.com>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20240212153624.581436579@linutronix.de
Make sure the default return thunk is not used after all return
instructions have been patched by the alternatives because the default
return thunk is insufficient when it comes to mitigating Retbleed or
SRSO.
Fix based on an earlier version by David Kaplan <david.kaplan@amd.com>.
[ bp: Fix the compilation error of warn_thunk_thunk being an invisible
symbol, hoist thunk macro into calling.h ]
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Co-developed-by: Borislav Petkov (AMD) <bp@alien8.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20231010171020.462211-4-david.kaplan@amd.com
Link: https://lore.kernel.org/r/20240104132446.GEZZaxnrIgIyat0pqf@fat_crate.local
such hw can boot again
- Do not take into accout XSTATE buffer size info supplied by userspace
when constructing a sigreturn frame
- Switch get_/put_user* to EX_TYPE_UACCESS exception handling when an
MCE is encountered so that it can be properly recovered from instead
of simply panicking
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmXIo3cACgkQEsHwGGHe
VUpnvg//THpQodOkgc8SLMut0fx/qcmWTZAxXKBPQklZkBq3sbA6wEDQqvBNkXfl
ovSss8TeL0KRrq3OsurJK+QXP94+nFt11q9SEhqPmhGb9d4H7aBimCrNjP0yEE1f
YuvkhGhylIPnrwYoJUrK024tuxkFFgIVqr+adv1PrvtohnpVhICJY2oTpxtpQDZi
r+k7P7VBG1oNvYETAbljbTQr5KV84YTmZa899/tncZaZbE+18bK/VJhL728ztSzD
Xdwoztrf37fqYk03l40MJwJwpiAC5t2g/qwa5yvHjr9Eavb5YeLX34nxeG2AdOpx
GTwrWkIW1dY4ck3lC4HR/igd2bDB4ZEfxJMMLkQAIvurGpQjU/jVXC28V4r6N5MW
UF1gf4i9m2/BrpX+wpDOi11tl5RQQcV7Y8qsMN1lqRM5sDjjh4PV9oT2TXKmuYn6
2T4Xv0A94FROFkQ9F52MFqTcwh0Yu9vtGsmtbCRP/em5OwqyyVFHWdEFR4PSZUpU
89V7zVFlLWTEuPjrUAU9sQmTL56gNlVmejWAzearhHgeFKUs0EK1hcn310454aVm
CzDN+4u8uCHFDKsF915nQnRI6jpRnf3mC4xWYheHcoCg02iSImWwVGGVHbJrWSNV
fFYxwWtpFw0N9jzCfUHnElp3jN1Ll1LkkWQC4NvCtZxeUioqKJI=
=b7B7
-----END PGP SIGNATURE-----
Merge tag 'x86_urgent_for_v6.8_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Borislav Petkov:
- Correct the minimum CPU family for Transmeta Crusoe in Kconfig so
that such hw can boot again
- Do not take into accout XSTATE buffer size info supplied by userspace
when constructing a sigreturn frame
- Switch get_/put_user* to EX_TYPE_UACCESS exception handling when an
MCE is encountered so that it can be properly recovered from instead
of simply panicking
* tag 'x86_urgent_for_v6.8_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/Kconfig: Transmeta Crusoe is CPU family 5, not 6
x86/fpu: Stop relying on userspace for info to fault in xsave buffer
x86/lib: Revert to _ASM_EXTABLE_UA() for {get,put}_user() fixups
It is more accurate to check if KVM is enabled, instead of having the
architecture say so. Architectures always "have" KVM, so for example
checking CONFIG_HAVE_KVM in x86 code is pointless, but if KVM is disabled
in a specific build, there is no need for support code.
Alternatively, many of the #ifdefs could simply be deleted. However,
this would add completely dead code. For example, when KVM is disabled,
there should not be any posted interrupts, i.e. NOT wiring up the "dummy"
handlers and treating IRQs on those vectors as spurious is the right
thing to do.
Cc: x86@kernel.org
Cc: kbingham@kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* Avoid false positive for check that only matters on AMD processors
x86:
* Give a hint when Win2016 might fail to boot due to XSAVES && !XSAVEC configuration
* Do not allow creating an in-kernel PIT unless an IOAPIC already exists
RISC-V:
* Allow ISA extensions that were enabled for bare metal in 6.8
(Zbc, scalar and vector crypto, Zfh[min], Zihintntl, Zvfh[min], Zfa)
S390:
* fix CC for successful PQAP instruction
* fix a race when creating a shadow page
-----BEGIN PGP SIGNATURE-----
iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmXB9EIUHHBib256aW5p
QHJlZGhhdC5jb20ACgkQv/vSX3jHroNF6Qf/VbNzzntY2BBNL6ZReqH+7GqMCMo7
Q8OYsP+B7TWc0C84JNBTmvC5lwY0FmXEV+i9XFUnyMt/eEHEfr/rko1McRf+byAM
vcfbTAz8t24bFSfojg7QJGM+pfUTrqjGmWqHwke/DuARsGB8Zntgtb50m966+xso
kDtcsrfGOlpHbnnWZQLLQKJ6tVv7Z2/clFlf4gCT/Quex4Jo76Uq08MA9BFS9iw1
e1oftwuXe6pCUcyt1M/AwOe8FnkP+Xm8oVmW0eJgO0TVDwob0Msx2LpVS2N/+/Oj
1mtBSz4rUQyDdI1j6D0+HkdAlNnwEWSV6eQb+qtjXbhIWBOHUpFXNpQWkg==
=LVAr
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm fixes from Paolo Bonzini:
"x86 guest:
- Avoid false positive for check that only matters on AMD processors
x86:
- Give a hint when Win2016 might fail to boot due to XSAVES &&
!XSAVEC configuration
- Do not allow creating an in-kernel PIT unless an IOAPIC already
exists
RISC-V:
- Allow ISA extensions that were enabled for bare metal in 6.8 (Zbc,
scalar and vector crypto, Zfh[min], Zihintntl, Zvfh[min], Zfa)
S390:
- fix CC for successful PQAP instruction
- fix a race when creating a shadow page"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
x86/coco: Define cc_vendor without CONFIG_ARCH_HAS_CC_PLATFORM
x86/kvm: Fix SEV check in sev_map_percpu_data()
KVM: x86: Give a hint when Win2016 might fail to boot due to XSAVES erratum
KVM: x86: Check irqchip mode before create PIT
KVM: riscv: selftests: Add Zfa extension to get-reg-list test
RISC-V: KVM: Allow Zfa extension for Guest/VM
KVM: riscv: selftests: Add Zvfh[min] extensions to get-reg-list test
RISC-V: KVM: Allow Zvfh[min] extensions for Guest/VM
KVM: riscv: selftests: Add Zihintntl extension to get-reg-list test
RISC-V: KVM: Allow Zihintntl extension for Guest/VM
KVM: riscv: selftests: Add Zfh[min] extensions to get-reg-list test
RISC-V: KVM: Allow Zfh[min] extensions for Guest/VM
KVM: riscv: selftests: Add vector crypto extensions to get-reg-list test
RISC-V: KVM: Allow vector crypto extensions for Guest/VM
KVM: riscv: selftests: Add scaler crypto extensions to get-reg-list test
RISC-V: KVM: Allow scalar crypto extensions for Guest/VM
KVM: riscv: selftests: Add Zbc extension to get-reg-list test
RISC-V: KVM: Allow Zbc extension for Guest/VM
KVM: s390: fix cc for successful PQAP
KVM: s390: vsie: fix race during shadow creation
The KVM PTP driver now refers to the clocksource ID CSID_X86_KVM_CLK, not
to the clocksource itself any more. There are no remaining users of the
clocksource export.
Therefore, make the clocksource static again.
Signed-off-by: Peter Hilber <peter.hilber@opensynergy.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20240201010453.2212371-9-peter.hilber@opensynergy.com
The clocksource pointer in struct system_counterval_t is not evaluated any
more. Remove the code setting the member, and the member itself.
Signed-off-by: Peter Hilber <peter.hilber@opensynergy.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20240201010453.2212371-8-peter.hilber@opensynergy.com
Add a clocksource ID for the x86 kvmclock.
Also, for ptp_kvm, set the recently added struct system_counterval_t member
cs_id to the clocksource ID (x86 kvmclock or ARM Generic Timer). In the
future, get_device_system_crosststamp() will compare the clocksource ID in
struct system_counterval_t, rather than the clocksource.
For now, to avoid touching too many subsystems at once, extract the
clocksource ID from the clocksource. The clocksource dereference will be
removed once everything is converted over..
Signed-off-by: Peter Hilber <peter.hilber@opensynergy.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20240201010453.2212371-5-peter.hilber@opensynergy.com
Add a clocksource ID for TSC and a distinct one for the early TSC.
Use distinct IDs for TSC and early TSC, since those also have distinct
clocksource structs. This should help to keep existing semantics when
comparing clocksources.
Also, set the recently added struct system_counterval_t member cs_id to the
TSC ID in the cases where the clocksource member is being set to the TSC
clocksource. In the future, get_device_system_crosststamp() will compare
the clocksource ID in struct system_counterval_t, rather than the
clocksource.
For the x86 ART related code, system_counterval_t.cs == NULL corresponds to
system_counterval_t.cs_id == CSID_GENERIC (0).
Signed-off-by: Peter Hilber <peter.hilber@opensynergy.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20240201010453.2212371-4-peter.hilber@opensynergy.com
Add or modify function descriptions to remove kernel-doc warnings:
tsc.c:655: warning: missing initial short description on line:
* native_calibrate_tsc
tsc.c:1339: warning: Excess function parameter 'cycles' description in 'convert_art_ns_to_tsc'
tsc.c:1339: warning: Excess function parameter 'cs' description in 'convert_art_ns_to_tsc'
tsc.c:1373: warning: Function parameter or member 'work' not described in 'tsc_refine_calibration_work'
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20231221033620.32379-1-rdunlap@infradead.org
Refer to commit fd10cde929 ("KVM paravirt: Add async PF initialization
to PV guest") and commit 344d9588a9 ("KVM: Add PV MSR to enable
asynchronous page faults delivery"). It turns out that at the time when
asyncpf was introduced, the purpose was defining the shared PV data 'struct
kvm_vcpu_pv_apf_data' with the size of 64 bytes. However, it made a mistake
and defined the size to 68 bytes, which failed to make fit in a cache line
and made the code inconsistent with the documentation.
Below justification quoted from Sean[*]
KVM (the host side) has *never* read kvm_vcpu_pv_apf_data.enabled, and
the documentation clearly states that enabling is based solely on the
bit in the synthetic MSR.
So rather than update the documentation, fix the goof by removing the
enabled filed and use the separate percpu variable instread.
KVM-as-a-host obviously doesn't enforce anything or consume the size,
and changing the header will only affect guests that are rebuilt against
the new header, so there's no chance of ABI breakage between KVM and its
guests. The only possible breakage is if some other hypervisor is
emulating KVM's async #PF (LOL) and relies on the guest to set
kvm_vcpu_pv_apf_data.enabled. But (a) I highly doubt such a hypervisor
exists, (b) that would arguably be a violation of KVM's "spec", and
(c) the worst case scenario is that the guest would simply lose async
#PF functionality.
[*] https://lore.kernel.org/all/ZS7ERnnRqs8Fl0ZF@google.com/T/#u
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20231025055914.1201792-2-xiaoyao.li@intel.com
[sean: use true/false instead of 1/0 for booleans]
Signed-off-by: Sean Christopherson <seanjc@google.com>
The early startup code executes from a 1:1 mapping of memory, which
differs from the mapping that the code was linked and/or relocated to
run at. The latter mapping is not active yet at this point, and so
symbol references that rely on it will fault.
Given that the core kernel is built without -fPIC, symbol references are
typically emitted as absolute, and so any such references occuring in
the early startup code will therefore crash the kernel.
While an attempt was made to work around this for the early SEV/SME
startup code, by forcing RIP-relative addressing for certain global
SEV/SME variables via inline assembly (see snp_cpuid_get_table() for
example), RIP-relative addressing must be pervasively enforced for
SEV/SME global variables when accessed prior to page table fixups.
__startup_64() already handles this issue for select non-SEV/SME global
variables using fixup_pointer(), which adjusts the pointer relative to a
`physaddr` argument. To avoid having to pass around this `physaddr`
argument across all functions needing to apply pointer fixups, introduce
a macro RIP_RELATIVE_REF() which generates a RIP-relative reference to
a given global variable. It is used where necessary to force
RIP-relative accesses to global variables.
For backporting purposes, this patch makes no attempt at cleaning up
other occurrences of this pattern, involving either inline asm or
fixup_pointer(). Those will be addressed later.
[ bp: Call it "rip_rel_ref" everywhere like other code shortens
"rIP-relative reference" and make the asm wrapper __always_inline. ]
Co-developed-by: Kevin Loughlin <kevinloughlin@google.com>
Signed-off-by: Kevin Loughlin <kevinloughlin@google.com>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Cc: <stable@kernel.org>
Link: https://lore.kernel.org/all/20240130220845.1978329-1-kevinloughlin@google.com
The function sev_map_percpu_data() checks if it is running on an SEV
platform by checking the CC_ATTR_GUEST_MEM_ENCRYPT attribute. However,
this attribute is also defined for TDX.
To avoid false positives, add a cc_vendor check.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Fixes: 4d96f91091 ("x86/sev: Replace occurrences of sev_active() with cc_platform_has()")
Suggested-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: David Rientjes <rientjes@google.com>
Message-Id: <20240124130317.495519-1-kirill.shutemov@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Let cpu_init_exception_handling() call cpu_init_fred_exceptions() to
initialize FRED. However if FRED is unavailable or disabled, it falls
back to set up TSS IST and initialize IDT.
Co-developed-by: Xin Li <xin3.li@intel.com>
Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Signed-off-by: Xin Li <xin3.li@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: Shan Kang <shan.kang@intel.com>
Link: https://lore.kernel.org/r/20231205105030.8698-36-xin3.li@intel.com
Add cpu_init_fred_exceptions() to:
- Set FRED entrypoints for events happening in ring 0 and 3.
- Specify the stack level for IRQs occurred ring 0.
- Specify dedicated event stacks for #DB/NMI/#MCE/#DF.
- Enable FRED and invalidtes IDT.
- Force 32-bit system calls to use "int $0x80" only.
Add fred_complete_exception_setup() to:
- Initialize system_vectors as done for IDT systems.
- Set unused sysvec_table entries to fred_handle_spurious_interrupt().
Co-developed-by: Xin Li <xin3.li@intel.com>
Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Signed-off-by: Xin Li <xin3.li@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: Shan Kang <shan.kang@intel.com>
Link: https://lore.kernel.org/r/20231205105030.8698-35-xin3.li@intel.com
Because FRED uses the ring 3 FRED entrypoint for SYSCALL and SYSENTER and
ERETU is the only legit instruction to return to ring 3, there is NO need
to setup SYSCALL and SYSENTER MSRs for FRED, except the IA32_STAR MSR.
Split IDT syscall setup code into idt_syscall_init() to make it easy to
skip syscall setup code when FRED is enabled.
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Xin Li <xin3.li@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: Shan Kang <shan.kang@intel.com>
Link: https://lore.kernel.org/r/20231205105030.8698-34-xin3.li@intel.com
Add sysvec_install() to install a system interrupt handler into the IDT
or the FRED system interrupt handler table.
Signed-off-by: Xin Li <xin3.li@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: Shan Kang <shan.kang@intel.com>
Link: https://lore.kernel.org/r/20231205105030.8698-28-xin3.li@intel.com
Like #DB, when occurred on different ring level, i.e., from user or kernel
context, #MCE needs to be handled on different stack: User #MCE on current
task stack, while kernel #MCE on a dedicated stack.
This is exactly how FRED event delivery invokes an exception handler: ring
3 event on level 0 stack, i.e., current task stack; ring 0 event on the
the FRED machine check entry stub doesn't do stack switch.
Signed-off-by: Xin Li <xin3.li@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: Shan Kang <shan.kang@intel.com>
Link: https://lore.kernel.org/r/20231205105030.8698-26-xin3.li@intel.com
On a FRED system, NMIs nest both with themselves and faults, transient
information is saved into the stack frame, and NMI unblocking only
happens when the stack frame indicates that so should happen.
Thus, the NMI entry stub for FRED is really quite small...
Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Signed-off-by: Xin Li <xin3.li@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: Shan Kang <shan.kang@intel.com>
Link: https://lore.kernel.org/r/20231216063139.25567-1-xin3.li@intel.com
When occurred on different ring level, i.e., from user or kernel context,
stack, while kernel #DB on a dedicated stack. This is exactly how FRED
event delivery invokes an exception handler: ring 3 event on level 0
stack, i.e., current task stack; ring 0 event on the #DB dedicated stack
specified in the IA32_FRED_STKLVLS MSR. So unlike IDT, the FRED debug
exception entry stub doesn't do stack switch.
On a FRED system, the debug trap status information (DR6) is passed on
the stack, to avoid the problem of transient state. Furthermore, FRED
transitions avoid a lot of ugly corner cases the handling of which can,
and should be, skipped.
The FRED debug trap status information saved on the stack differs from
DR6 in both stickiness and polarity; it is exactly in the format which
debug_read_clear_dr6() returns for the IDT entry points.
Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Signed-off-by: Xin Li <xin3.li@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: Shan Kang <shan.kang@intel.com>
Link: https://lore.kernel.org/r/20231205105030.8698-24-xin3.li@intel.com
Entering a new task is logically speaking a return from a system call
(exec, fork, clone, etc.). As such, if ptrace enables single stepping
a single step exception should be allowed to trigger immediately upon
entering user space. This is not optional.
NMI should *never* be disabled in user space. As such, this is an
optional, opportunistic way to catch errors.
Allow single-step trap and NMI when starting a new task, thus once
the new task enters user space, single-step trap and NMI are both
enabled immediately.
Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Signed-off-by: Xin Li <xin3.li@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: Shan Kang <shan.kang@intel.com>
Link: https://lore.kernel.org/r/20231205105030.8698-21-xin3.li@intel.com
Because FRED always restores the full value of %rsp, ESPFIX is
no longer needed when it's enabled.
Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Signed-off-by: Xin Li <xin3.li@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: Shan Kang <shan.kang@intel.com>
Link: https://lore.kernel.org/r/20231205105030.8698-20-xin3.li@intel.com
SWAPGS is no longer needed thus NOT allowed with FRED because FRED
transitions ensure that an operating system can _always_ operate
with its own GS base address:
- For events that occur in ring 3, FRED event delivery swaps the GS
base address with the IA32_KERNEL_GS_BASE MSR.
- ERETU (the FRED transition that returns to ring 3) also swaps the
GS base address with the IA32_KERNEL_GS_BASE MSR.
And the operating system can still setup the GS segment for a user
thread without the need of loading a user thread GS with:
- Using LKGS, available with FRED, to modify other attributes of the
GS segment without compromising its ability always to operate with
its own GS base address.
- Accessing the GS segment base address for a user thread as before
using RDMSR or WRMSR on the IA32_KERNEL_GS_BASE MSR.
Note, LKGS loads the GS base address into the IA32_KERNEL_GS_BASE MSR
instead of the GS segment's descriptor cache. As such, the operating
system never changes its runtime GS base address.
Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Signed-off-by: Xin Li <xin3.li@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: Shan Kang <shan.kang@intel.com>
Link: https://lore.kernel.org/r/20231205105030.8698-19-xin3.li@intel.com
struct pt_regs is hard to read because the member or section related
comments are not aligned with the members.
The 'cs' and 'ss' members of pt_regs are type of 'unsigned long' while
in reality they are only 16-bit wide. This works so far as the
remaining space is unused, but FRED will use the remaining bits for
other purposes.
To prepare for FRED:
- Cleanup the formatting
- Convert 'cs' and 'ss' to u16 and embed them into an union
with a u64
- Fixup the related printk() format strings
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Originally-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Signed-off-by: Xin Li <xin3.li@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: Shan Kang <shan.kang@intel.com>
Link: https://lore.kernel.org/r/20231205105030.8698-14-xin3.li@intel.com
Add X86_CR4_FRED macro for the FRED bit in %cr4. This bit must not be
changed after initialization, so add it to the pinned CR4 bits.
Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Signed-off-by: Xin Li <xin3.li@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: Shan Kang <shan.kang@intel.com>
Link: https://lore.kernel.org/r/20231205105030.8698-12-xin3.li@intel.com
Since
866b556efa ("x86/head/64: Install startup GDT")
the primary startup sequence sets the code segment register (CS) to
__KERNEL_CS before calling into the startup code shared between primary
and secondary boot.
This means a simple indirect call is sufficient here.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20240129180502.4069817-24-ardb+git@google.com
Let command line option "fred" accept multiple options to make it
easier to tweak its behavior.
Currently, two options 'on' and 'off' are allowed, and the default
behavior is to disable FRED. To enable FRED, append "fred=on" to the
kernel command line.
[ bp: Use cpu_feature_enabled(), touch ups. ]
Signed-off-by: Xin Li <xin3.li@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: Shan Kang <shan.kang@intel.com>
Link: https://lore.kernel.org/r/20231205105030.8698-9-xin3.li@intel.com
Before this change, the expected size of the user space buffer was
taken from fx_sw->xstate_size. fx_sw->xstate_size can be changed
from user-space, so it is possible construct a sigreturn frame where:
* fx_sw->xstate_size is smaller than the size required by valid bits in
fx_sw->xfeatures.
* user-space unmaps parts of the sigrame fpu buffer so that not all of
the buffer required by xrstor is accessible.
In this case, xrstor tries to restore and accesses the unmapped area
which results in a fault. But fault_in_readable succeeds because buf +
fx_sw->xstate_size is within the still mapped area, so it goes back and
tries xrstor again. It will spin in this loop forever.
Instead, fault in the maximum size which can be touched by XRSTOR (taken
from fpstate->user_size).
[ dhansen: tweak subject / changelog ]
Fixes: fcb3635f50 ("x86/fpu/signal: Handle #PF in the direct restore path")
Reported-by: Konstantin Bogomolov <bogomolov@google.com>
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrei Vagin <avagin@google.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc:stable@vger.kernel.org
Link: https://lore.kernel.org/all/20240130063603.3392627-1-avagin%40google.com
Remove the include statement for <asm/bootparam.h> from several files
that don't require it and limit the exposure of those definitions within
the Linux kernel code.
[ bp: Massage commit message. ]
Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Link: https://lore.kernel.org/r/20240112095000.8952-5-tzimmermann@suse.de
Add a kdump safe version of sev_firmware_shutdown() and register it as a
crash_kexec_post_notifier so it will be invoked during panic/crash to do
SEV/SNP shutdown. This is required for transitioning all IOMMU pages to
reclaim/hypervisor state, otherwise re-init of IOMMU pages during
crashdump kernel boot fails and panics the crashdump kernel.
This panic notifier runs in atomic context, hence it ensures not to
acquire any locks/mutexes and polls for PSP command completion instead
of depending on PSP command completion interrupt.
[ mdr: Remove use of "we" in comments. ]
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
Signed-off-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20240126041126.1927228-21-michael.roth@amd.com
The memory integrity guarantees of SEV-SNP are enforced through a new
structure called the Reverse Map Table (RMP). The RMP is a single data
structure shared across the system that contains one entry for every 4K
page of DRAM that may be used by SEV-SNP VMs. The APM Volume 2 section
on Secure Nested Paging (SEV-SNP) details a number of steps needed to
detect/enable SEV-SNP and RMP table support on the host:
- Detect SEV-SNP support based on CPUID bit
- Initialize the RMP table memory reported by the RMP base/end MSR
registers and configure IOMMU to be compatible with RMP access
restrictions
- Set the MtrrFixDramModEn bit in SYSCFG MSR
- Set the SecureNestedPagingEn and VMPLEn bits in the SYSCFG MSR
- Configure IOMMU
RMP table entry format is non-architectural and it can vary by
processor. It is defined by the PPR document for each respective CPU
family. Restrict SNP support to CPU models/families which are compatible
with the current RMP table entry format to guard against any undefined
behavior when running on other system types. Future models/support will
handle this through an architectural mechanism to allow for broader
compatibility.
SNP host code depends on CONFIG_KVM_AMD_SEV config flag which may be
enabled even when CONFIG_AMD_MEM_ENCRYPT isn't set, so update the
SNP-specific IOMMU helpers used here to rely on CONFIG_KVM_AMD_SEV
instead of CONFIG_AMD_MEM_ENCRYPT.
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Co-developed-by: Ashish Kalra <ashish.kalra@amd.com>
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
Co-developed-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Co-developed-by: Borislav Petkov (AMD) <bp@alien8.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Co-developed-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Michael Roth <michael.roth@amd.com>
Link: https://lore.kernel.org/r/20240126041126.1927228-5-michael.roth@amd.com
Without SEV-SNP, Automatic IBRS protects only the kernel. But when
SEV-SNP is enabled, the Automatic IBRS protection umbrella widens to all
host-side code, including userspace. This protection comes at a cost:
reduced userspace indirect branch performance.
To avoid this performance loss, don't use Automatic IBRS on SEV-SNP
hosts and all back to retpolines instead.
[ mdr: squash in changes from review discussion. ]
Signed-off-by: Kim Phillips <kim.phillips@amd.com>
Signed-off-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Dave Hansen <dave.hansen@intel.com>
Link: https://lore.kernel.org/r/20240126041126.1927228-3-michael.roth@amd.com
Add CPU feature detection for Secure Encrypted Virtualization with
Secure Nested Paging. This feature adds a strong memory integrity
protection to help prevent malicious hypervisor-based attacks like
data replay, memory re-mapping, and more.
Since enabling the SNP CPU feature imposes a number of additional
requirements on host initialization and handling legacy firmware APIs
for SEV/SEV-ES guests, only introduce the CPU feature bit so that the
relevant handling can be added, but leave it disabled via a
disabled-features mask.
Once all the necessary changes needed to maintain legacy SEV/SEV-ES
support are introduced in subsequent patches, the SNP feature bit will
be unmasked/enabled.
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Jarkko Sakkinen <jarkko@profian.com>
Signed-off-by: Ashish Kalra <Ashish.Kalra@amd.com>
Signed-off-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20240126041126.1927228-2-michael.roth@amd.com
Compare the opcode bytes at rIP for each #VC exit reason to verify the
instruction which raised the #VC exception is actually the right one.
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/r/20240105101407.11694-1-bp@alien8.de
Any FRED enabled CPU will always have the following features as its
baseline:
1) LKGS, load attributes of the GS segment but the base address into
the IA32_KERNEL_GS_BASE MSR instead of the GS segment’s descriptor
cache.
2) WRMSRNS, non-serializing WRMSR for faster MSR writes.
Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Signed-off-by: Xin Li <xin3.li@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Shan Kang <shan.kang@intel.com>
Link: https://lore.kernel.org/r/20231205105030.8698-7-xin3.li@intel.com
On some AMD machines, unknown NMI messages were printed on the console
continuously when using perf command with IBS. It was reported that it
can slow down the kernel. Ratelimit the unknown NMI messages.
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Ravi Bangoria <ravi.bangoria@amd.com>
Acked-by: Guilherme Amadio <amadio@gentoo.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20231209015211.357983-1-namhyung@kernel.org
The kernel test robot reported the following warning after commit
54e35eb861 ("x86/resctrl: Read supported bandwidth sources from CPUID").
even though the issue is present even in the original commit
92bd5a1390 ("x86/resctrl: Add interface to write mbm_total_bytes_config")
which added this function. The reported warning is:
$ make C=1 CHECK=scripts/coccicheck arch/x86/kernel/cpu/resctrl/rdtgroup.o
...
arch/x86/kernel/cpu/resctrl/rdtgroup.c:1621:5-8: Unneeded variable: "ret". Return "0" on line 1655
Remove the local variable 'ret'.
[ bp: Massage commit message, make mbm_config_write_domain() void. ]
Fixes: 92bd5a1390 ("x86/resctrl: Add interface to write mbm_total_bytes_config")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202401241810.jbd8Ipa1-lkp@intel.com/
Signed-off-by: Babu Moger <babu.moger@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/r/202401241810.jbd8Ipa1-lkp@intel.com
The mba_MBps feedback loop increases throttling when a group is using
more bandwidth than the target set by the user in the schemata file, and
decreases throttling when below target.
To avoid possibly stepping throttling up and down on every poll a flag
"delta_comp" is set whenever throttling is changed to indicate that the
actual change in bandwidth should be recorded on the next poll in
"delta_bw". Throttling is only reduced if the current bandwidth plus
delta_bw is below the user target.
This algorithm works well if the workload has steady bandwidth needs.
But it can go badly wrong if the workload moves to a different phase
just as the throttling level changed. E.g. if the workload becomes
essentially idle right as throttling level is increased, the value
calculated for delta_bw will be more or less the old bandwidth level.
If the workload then resumes, Linux may never reduce throttling because
current bandwidth plus delta_bw is above the target set by the user.
Implement a simpler heuristic by assuming that in the worst case the
currently measured bandwidth is being controlled by the current level of
throttling. Compute how much it may increase if throttling is relaxed to
the next higher level. If that is still below the user target, then it
is ok to reduce the amount of throttling.
Fixes: ba0f26d852 ("x86/intel_rdt/mba_sc: Prepare for feedback loop")
Reported-by: Xiaochen Shen <xiaochen.shen@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Xiaochen Shen <xiaochen.shen@intel.com>
Link: https://lore.kernel.org/r/20240122180807.70518-1-tony.luck@intel.com
If the BMEC (Bandwidth Monitoring Event Configuration) feature is
supported, the bandwidth events can be configured. The maximum supported
bandwidth bitmask can be read from CPUID:
CPUID_Fn80000020_ECX_x03 [Platform QoS Monitoring Bandwidth Event Configuration]
Bits Description
31:7 Reserved
6:0 Identifies the bandwidth sources that can be tracked.
While at it, move the mask checking to mon_config_write() before
iterating over all the domains. Also, print the valid bitmask when the
user tries to configure invalid event configuration value.
The CPUID details are documented in the Processor Programming Reference
(PPR) Vol 1.1 for AMD Family 19h Model 11h B1 - 55901 Rev 0.25 in the
Link tag.
Fixes: dc2a3e8579 ("x86/resctrl: Add interface to read mbm_total_bytes_config")
Signed-off-by: Babu Moger <babu.moger@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537
Link: https://lore.kernel.org/r/669896fa512c7451319fa5ca2fdb6f7e015b5635.1705359148.git.babu.moger@amd.com
The QOS Memory Bandwidth Enforcement Limit is reported by
CPUID_Fn80000020_EAX_x01 and CPUID_Fn80000020_EAX_x02:
Bits Description
31:0 BW_LEN: Size of the QOS Memory Bandwidth Enforcement Limit.
Newer processors can support higher bandwidth limit than the current
hard-coded value. Remove latter and detect using CPUID instead. Also,
update the register variables eax and edx to match the AMD CPUID
definition.
The CPUID details are documented in the Processor Programming Reference
(PPR) Vol 1.1 for AMD Family 19h Model 11h B1 - 55901 Rev 0.25 in the
Link tag below.
Fixes: 4d05bf71f1 ("x86/resctrl: Introduce AMD QOS feature")
Signed-off-by: Babu Moger <babu.moger@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537
Link: https://lore.kernel.org/r/c26a8ca79d399ed076cf8bf2e9fbc58048808289.1705359148.git.babu.moger@amd.com
In a "W=1" build gcc throws a warning:
arch/x86/kernel/cpu/resctrl/core.c: In function ‘cache_alloc_hsw_probe’:
arch/x86/kernel/cpu/resctrl/core.c:139:16: warning: variable ‘h’ set but not used
Switch from wrmsr_safe() to wrmsrl_safe(), and from rdmsr() to rdmsrl()
using a single u64 argument for the MSR value instead of the pair of u32
for the high and low halves.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Babu Moger <babu.moger@amd.com>
Acked-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/r/ZULCd/TGJL9Dmncf@agluck-desk3
Several inlined functions subject to paravirt patching are referencing
BUG_func() after the recent switch to the alternative patching
mechanism.
As those functions can legally be used by non-GPL modules, BUG_func()
must be usable by those modules, too. So use EXPORT_SYMBOL() when
exporting BUG_func().
Fixes: 9824b00c2b ("x86/paravirt: Move some functions and defines to alternative.c")
Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20240109082232.22657-1-jgross@suse.com
Subsytem:
New driver:
- Analog Devices MAX31335
- Nuvoton ma35d1
- Texas Instrument TPS6594 PMIC RTC
Drivers:
- cmos: use ACPI alarm instead of HPET on recent AMD platforms
- nuvoton: add NCT3015Y-R and NCT3018Y-R support
- rv8803: proper suspend/resume and wakeup-source support
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEBqsFVZXh8s/0O5JiY6TcMGxwOjIFAmWpkqQACgkQY6TcMGxw
OjJ2eQ//cDaUPgvOUQ2+qQiZOyOrjnck+QEyXZr2mcnNfRnMPiafJX09v3W4BMPh
0oOVNF5BbGofTgJRnTC9BCRQnq8XZUwXaMr1B+mLQF4V9Xq3896pILkdRmd0q7EW
3qwKCNP1vkYx2hGyWB9wVAMESAdUIFHCLxJWeQZ3ESGUacMoON0cdFa96TUx4fKa
m29ybMTHRHKpnZsIYpegxG42lWp84IPvTbtySbT52dr2ucLToVos/dX23juQ40D0
nyUa8Q+g6aLoTxjZPcFwK6dHJJIwWz56s40IbMRGr6dVfRis5QZfIQ/cB8ULj48L
AkCtN6kptVsov/W2R9ZriTf5p53K7Fmwz+dhccW7SxA82cGyswWOD3BNzzOYnzPY
pKSVeTnR7mD2IC28pGrekSpg3ExuNu/4+WHnz0EwpjgyXmtlxnfK6oV9nf8bIUsn
JsY3tNOvenv8NQSQ1at3GugeQ4bGMbxay6pL9zm5EZjYGMDX0z7IcPFr9KeYC5tJ
60dYWCGuB2JF0WocuAXSNrj2l9VFFqhn7OdDVuiB00rBROQcKKV/H30EnmAwRUI1
8SZhVJ0TIeoG6XsajhbapvOH8EWUnKPn1mgJxSiyVR6Hz93oZldIPK3cUtfjcltx
xl7LdVeseE1CXGh4g222CI0MCX6x1QPQ5L8jhgc67dyTUx/wQ4Q=
=7u4i
-----END PGP SIGNATURE-----
Merge tag 'rtc-6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/abelloni/linux
Pull RTC updates from Alexandre Belloni:
"There are three new drivers this cycle. Also the cmos driver is
getting fixes for longstanding wakeup issues on AMD.
New drivers:
- Analog Devices MAX31335
- Nuvoton ma35d1
- Texas Instrument TPS6594 PMIC RTC
Drivers:
- cmos: use ACPI alarm instead of HPET on recent AMD platforms
- nuvoton: add NCT3015Y-R and NCT3018Y-R support
- rv8803: proper suspend/resume and wakeup-source support"
* tag 'rtc-6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/abelloni/linux: (26 commits)
rtc: nuvoton: Compatible with NCT3015Y-R and NCT3018Y-R
rtc: da9063: Use dev_err_probe()
rtc: da9063: Use device_get_match_data()
rtc: da9063: Make IRQ as optional
rtc: max31335: Fix comparison in max31335_volatile_reg()
rtc: max31335: use regmap_update_bits_check
rtc: max31335: remove unecessary locking
rtc: max31335: add driver support
dt-bindings: rtc: max31335: add max31335 bindings
rtc: rv8803: add wakeup-source support
rtc: ac100: remove misuses of kernel-doc
rtc: class: Remove usage of the deprecated ida_simple_xx() API
rtc: MAINTAINERS: drop Alessandro Zummo
rtc: ma35d1: remove hardcoded UIE support
dt-bindings: rtc: qcom-pm8xxx: fix inconsistent example
rtc: rv8803: Add power management support
rtc: ds3232: avoid unused-const-variable warning
rtc: lpc24xx: add missing dependency
rtc: tps6594: Add driver for TPS6594 RTC
rtc: Add driver for Nuvoton ma35d1 rtc controller
...
Including:
- Core changes:
- Fix race conditions in device probe path
- Retire IOMMU bus_ops
- Support for passing custom allocators to page table drivers
- Clean up Kconfig around IOMMU_SVA
- Support for sharing SVA domains with all devices bound to
a mm
- Firmware data parsing cleanup
- Tracing improvements for iommu-dma code
- Some smaller fixes and cleanups
- ARM-SMMU drivers:
- Device-tree binding updates:
- Add additional compatible strings for Qualcomm SoCs
- Document Adreno clocks for Qualcomm's SM8350 SoC
- SMMUv2:
- Implement support for the ->domain_alloc_paging() callback
- Ensure Secure context is restored following suspend of Qualcomm SMMU
implementation
- SMMUv3:
- Disable stalling mode for the "quiet" context descriptor
- Minor refactoring and driver cleanups
- Intel VT-d driver:
- Cleanup and refactoring
- AMD IOMMU driver:
- Improve IO TLB invalidation logic
- Small cleanups and improvements
- Rockchip IOMMU driver:
- DT binding update to add Rockchip RK3588
- Apple DART driver:
- Apple M1 USB4/Thunderbolt DART support
- Cleanups
- Virtio IOMMU driver:
- Add support for iotlb_sync_map
- Enable deferred IO TLB flushes
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEr9jSbILcajRFYWYyK/BELZcBGuMFAmWecQoACgkQK/BELZcB
GuN5ZxAAzC5QUKAzANx0puk7QhPpKKlbSvj6Q7iRgCLk00KJO1+VQh9v4ouCmXqF
kn3Ko8gddjhtrgwN0OQ54F39cLUrp1SBemy71K5YOR+vu8VKtwtmawZGeeRZ+k+B
Eohw58oaXTiR1maYvoLixLYczLrjklqyJOQ1vZ0GxFGxDqrFByAryHDgG/3OCpJx
C9e6PsLbbfhfqA8Kv97iKcBqniGbXxAMuodqSUG0buQ3oZgfpIP6Bt3EgUzFGPGk
3BTlYxowS/gkjUWd3fgjQFIFLTA01u9FhpA2Jb0a4v67pUCR64YxHN7rBQ6ZChtG
kB9laQfU9re79RsHhqQzr0JT9x/eyq7pzGzjp5TV5TPW6IW+sqjMIPhzd9P08Ef7
BclkCVobx0jSAHOhnnG4QJiKANr2Y2oM3HfsAJccMMY45RRhUKmVqM7jxMPfGn3A
i+inlee73xTjZXJse1EWG1fmKKMLvX9LDEp4DyOfn9CqVT+7hpZvzPjfbGr937Rm
JlwXhF3rQXEpOCagEsbt1vOf+V0e9QiCLf1Y2KpkIkDbE5wwSD/2qLm3tFhJG3oF
fkW+J14Cid0pj+hY0afGe0kOUOIYlimu0nFmSf0pzMH+UktZdKogSfyb1gSDsy+S
rsZRGPFhMJ832ExqhlDfxqBebqh+jsfKynlskui6Td5C9ZULaHA=
=q751
-----END PGP SIGNATURE-----
Merge tag 'iommu-updates-v6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu
Pull iommu updates from Joerg Roedel:
"Core changes:
- Fix race conditions in device probe path
- Retire IOMMU bus_ops
- Support for passing custom allocators to page table drivers
- Clean up Kconfig around IOMMU_SVA
- Support for sharing SVA domains with all devices bound to a mm
- Firmware data parsing cleanup
- Tracing improvements for iommu-dma code
- Some smaller fixes and cleanups
ARM-SMMU drivers:
- Device-tree binding updates:
- Add additional compatible strings for Qualcomm SoCs
- Document Adreno clocks for Qualcomm's SM8350 SoC
- SMMUv2:
- Implement support for the ->domain_alloc_paging() callback
- Ensure Secure context is restored following suspend of Qualcomm
SMMU implementation
- SMMUv3:
- Disable stalling mode for the "quiet" context descriptor
- Minor refactoring and driver cleanups
Intel VT-d driver:
- Cleanup and refactoring
AMD IOMMU driver:
- Improve IO TLB invalidation logic
- Small cleanups and improvements
Rockchip IOMMU driver:
- DT binding update to add Rockchip RK3588
Apple DART driver:
- Apple M1 USB4/Thunderbolt DART support
- Cleanups
Virtio IOMMU driver:
- Add support for iotlb_sync_map
- Enable deferred IO TLB flushes"
* tag 'iommu-updates-v6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu: (66 commits)
iommu: Don't reserve 0-length IOVA region
iommu/vt-d: Move inline helpers to header files
iommu/vt-d: Remove unused vcmd interfaces
iommu/vt-d: Remove unused parameter of intel_pasid_setup_pass_through()
iommu/vt-d: Refactor device_to_iommu() to retrieve iommu directly
iommu/sva: Fix memory leak in iommu_sva_bind_device()
dt-bindings: iommu: rockchip: Add Rockchip RK3588
iommu/dma: Trace bounce buffer usage when mapping buffers
iommu/arm-smmu: Convert to domain_alloc_paging()
iommu/arm-smmu: Pass arm_smmu_domain to internal functions
iommu/arm-smmu: Implement IOMMU_DOMAIN_BLOCKED
iommu/arm-smmu: Convert to a global static identity domain
iommu/arm-smmu: Reorganize arm_smmu_domain_add_master()
iommu/arm-smmu-v3: Remove ARM_SMMU_DOMAIN_NESTED
iommu/arm-smmu-v3: Master cannot be NULL in arm_smmu_write_strtab_ent()
iommu/arm-smmu-v3: Add a type for the STE
iommu/arm-smmu-v3: disable stall for quiet_cd
iommu/qcom: restore IOMMU state if needed
iommu/arm-smmu-qcom: Add QCM2290 MDSS compatible
iommu/arm-smmu-qcom: Add missing GMU entry to match table
...
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEV76QKkVc4xCGURexaDWVMHDJkrAFAmWfCRQACgkQaDWVMHDJ
krDUqQ//VCvkpf0mAbYDJa1oTXFW8O5cVTusBtPi8k7cFbtjQpjno/9AqKol+sK8
AKg+y5iHHl7QJmDmEcpS+O9OBbmFOpvDzm3QZhk8RkWS5pe0B108dnINYtS0eP9R
MkzZwfrI2yC6NX4hvHGdD8WGHjrt+oxY0bojehX87JZsyRU+xqc/g1OO7a5bUPQe
3Ip0kKiCeqFv0y+Q1pFMEd9RdZ8XxqzUHCJT3hfgZ6FajJ2eVy6jNrPOm6LozycB
eOtYYNapSgw3k/WhJCOYWHX7kePXibLxBRONLpi6P3U6pMVk4n8wrgl7qPtdW1Qx
nR2UHX5P6eFkxNCuU1BzvmPBROe37C51MFVw29eRnigvuX3j/vfCH1+17xQOVKVv
5JyxYA0rJWqoOz6mX7YaNJHlmrxHzeKXudICyOFuu1j5c8CuGjh8NQsOSCq16XfZ
hPzfYDUS8I7/kHYQPJlnB+kF9pmbyjTM70h74I8D6ZWvXESHJZt+TYPyWfkBXP/P
L9Pwx1onAyoBApGxCWuvgGTLonzNredgYG4ABbqhUqxqncJS9M7Y/yJa+f+3SOkR
T6LxoByuDVld5cIfbOzRwIaRezZDe/NL7rkHm/DWo98OaV3zILsr20Hx1lPZ1Vce
ryZ9lCdZGGxm2jmpzr/VymPQz/E+ezahRHE1+F3su8jpCU41txg=
=1EJI
-----END PGP SIGNATURE-----
Merge tag 'x86_tdx_for_6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 TDX updates from Dave Hansen:
"This contains the initial support for host-side TDX support so that
KVM can run TDX-protected guests. This does not include the actual
KVM-side support which will come from the KVM folks. The TDX host
interactions with kexec also needs to be ironed out before this is
ready for prime time, so this code is currently Kconfig'd off when
kexec is on.
The majority of the code here is the kernel telling the TDX module
which memory to protect and handing some additional memory over to it
to use to store TDX module metadata. That sounds pretty simple, but
the TDX architecture is rather flexible and it takes quite a bit of
back-and-forth to say, "just protect all memory, please."
There is also some code tacked on near the end of the series to handle
a hardware erratum. The erratum can make software bugs such as a
kernel write to TDX-protected memory cause a machine check and
masquerade as a real hardware failure. The erratum handling watches
out for these and tries to provide nicer user errors"
* tag 'x86_tdx_for_6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (21 commits)
x86/virt/tdx: Make TDX host depend on X86_MCE
x86/virt/tdx: Disable TDX host support when kexec is enabled
Documentation/x86: Add documentation for TDX host support
x86/mce: Differentiate real hardware #MCs from TDX erratum ones
x86/cpu: Detect TDX partial write machine check erratum
x86/virt/tdx: Handle TDX interaction with sleep and hibernation
x86/virt/tdx: Initialize all TDMRs
x86/virt/tdx: Configure global KeyID on all packages
x86/virt/tdx: Configure TDX module with the TDMRs and global KeyID
x86/virt/tdx: Designate reserved areas for all TDMRs
x86/virt/tdx: Allocate and set up PAMTs for TDMRs
x86/virt/tdx: Fill out TDMRs to cover all TDX memory regions
x86/virt/tdx: Add placeholder to construct TDMRs to cover all TDX memory regions
x86/virt/tdx: Get module global metadata for module initialization
x86/virt/tdx: Use all system memory when initializing TDX module as TDX memory
x86/virt/tdx: Add skeleton to enable TDX on demand
x86/virt/tdx: Add SEAMCALL error printing for module initialization
x86/virt/tdx: Handle SEAMCALL no entropy error in common code
x86/virt/tdx: Make INTEL_TDX_HOST depend on X86_X2APIC
x86/virt/tdx: Define TDX supported page sizes as macros
...
Here are the set of driver core and kernfs changes for 6.8-rc1. Nothing
major in here this release cycle, just lots of small cleanups and some
tweaks on kernfs that in the very end, got reverted and will come back
in a safer way next release cycle.
Included in here are:
- more driver core 'const' cleanups and fixes
- fw_devlink=rpm is now the default behavior
- kernfs tiny changes to remove some string functions
- cpu handling in the driver core is updated to work better on many
systems that add topologies and cpus after booting
- other minor changes and cleanups
All of the cpu handling patches have been acked by the respective
maintainers and are coming in here in one series. Everything has been
in linux-next for a while with no reported issues.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCZaeOrg8cZ3JlZ0Brcm9h
aC5jb20ACgkQMUfUDdst+ymtcwCffzvKKkSY9qAp6+0v2WQNkZm1JWoAoJCPYUwF
If6wEoPLWvRfKx4gIoq9
=D96r
-----END PGP SIGNATURE-----
Merge tag 'driver-core-6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core updates from Greg KH:
"Here are the set of driver core and kernfs changes for 6.8-rc1.
Nothing major in here this release cycle, just lots of small cleanups
and some tweaks on kernfs that in the very end, got reverted and will
come back in a safer way next release cycle.
Included in here are:
- more driver core 'const' cleanups and fixes
- fw_devlink=rpm is now the default behavior
- kernfs tiny changes to remove some string functions
- cpu handling in the driver core is updated to work better on many
systems that add topologies and cpus after booting
- other minor changes and cleanups
All of the cpu handling patches have been acked by the respective
maintainers and are coming in here in one series. Everything has been
in linux-next for a while with no reported issues"
* tag 'driver-core-6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (51 commits)
Revert "kernfs: convert kernfs_idr_lock to an irq safe raw spinlock"
kernfs: convert kernfs_idr_lock to an irq safe raw spinlock
class: fix use-after-free in class_register()
PM: clk: make pm_clk_add_notifier() take a const pointer
EDAC: constantify the struct bus_type usage
kernfs: fix reference to renamed function
driver core: device.h: fix Excess kernel-doc description warning
driver core: class: fix Excess kernel-doc description warning
driver core: mark remaining local bus_type variables as const
driver core: container: make container_subsys const
driver core: bus: constantify subsys_register() calls
driver core: bus: make bus_sort_breadthfirst() take a const pointer
kernfs: d_obtain_alias(NULL) will do the right thing...
driver core: Better advertise dev_err_probe()
kernfs: Convert kernfs_path_from_node_locked() from strlcpy() to strscpy()
kernfs: Convert kernfs_name_locked() from strlcpy() to strscpy()
kernfs: Convert kernfs_walk_ns() from strlcpy() to strscpy()
initramfs: Expose retained initrd as sysfs file
fs/kernfs/dir: obey S_ISGID
kernel/cgroup: use kernfs_create_dir_ns()
...
-----BEGIN PGP SIGNATURE-----
iQJIBAABCgAyFiEEgMe7l+5h9hnxdsnuWYigwDrT+vwFAmWldYsUHGJoZWxnYWFz
QGdvb2dsZS5jb20ACgkQWYigwDrT+vyxUhAAs2ctoK/sMAfTOO2b1UAD/ig7CGGz
DlDt38RezFU4uqeY0Ix4heFs3RIt8YGuns76Fejfyevh1I7SOA9lbhFuMLBfO9j0
LU+KuZeGoXtIe5Kd6hCQIUgVvwISs407yp7JUUzqxFQ2rv7bin64xiDb407ZQGaK
5v4oRsnQn1KBhgZ2wfQ/S+adAma9IroK9F3C/Bm+IJ+mpNxJcbWPqnf9+5ExoxzU
MFyu0azan1crqWA/geJBetL4zVoRJx4qNEve0gqwk06vwLeIKyzB2jPO5dmn9pAb
kfAFCQgtTUGZHvZWyBZMWQcMKEQLSupOLYXU4b2Vf+oR9U0jvevqs3LArBsUceM9
vQw8Vg9RZiWs9lVeVYSQErYQecMhdiHYCXFuteaNH9tvATN4PumXiT2ZM9OsX6uy
jrXW7YLawJbGLIDNsAyrn8JESzY/CsRPpCIUq3JzL2VQdInC3mEl18rTEuKTBeZF
zE/RgwudhWDT58/vceS2LHa5KNd/vAzMTmUHEUwHg1N7TV3qkSgpPaVcvx4KklXv
1nKT2KcfD5K1Yy/InjxUYdGhRPYa7azl+l7W4hJ+NCGxwL+tUCg3knp80+empTJ0
mZm6/VSbc245nKjx3ydLlTbQ/xNMQXgHHDKPW6eO4ezZaydJZG2xkK3x6eF1+i0k
PWHSLjUxrK1AGrg=
=ri0M
-----END PGP SIGNATURE-----
Merge tag 'pci-v6.8-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci
Pull pci updates from Bjorn Helgaas:
"Enumeration:
- Reserve ECAM so we don't assign it to PCI BARs; this works around
bugs where BIOS included ECAM in a PNP0A03 host bridge window,
didn't reserve it via a PNP0C02 motherboard device, and didn't
allocate space for SR-IOV VF BARs (Bjorn Helgaas)
- Add MMCONFIG/ECAM debug logging (Bjorn Helgaas)
- Rename 'MMCONFIG' to 'ECAM' to match spec usage (Bjorn Helgaas)
- Log device type (Root Port, Switch Port, etc) during enumeration
(Bjorn Helgaas)
- Log bridges before downstream devices so the dmesg order is more
logical (Bjorn Helgaas)
- Log resource names (BAR 0, VF BAR 0, bridge window, etc)
consistently instead of a mix of names and "reg 0x10" (Puranjay
Mohan, Bjorn Helgaas)
- Fix 64GT/s effective data rate calculation to use 1b/1b encoding
rather than the 8b/10b or 128b/130b used by lower rates (Ilpo
Järvinen)
- Use PCI_HEADER_TYPE_* instead of literals in x86, powerpc, SCSI
lpfc (Ilpo Järvinen)
- Clean up open-coded PCIBIOS return code mangling (Ilpo Järvinen)
Resource management:
- Restructure pci_dev_for_each_resource() to avoid computing the
address of an out-of-bounds array element (the bounds check was
performed later so the element was never actually *read*, but it's
nicer to avoid even computing an out-of-bounds address) (Andy
Shevchenko)
Driver binding:
- Convert pci-host-common.c platform .remove() callback to
.remove_new() returning 'void' since it's not useful to return
error codes here (Uwe Kleine-König)
- Convert exynos, keystone, kirin from .remove() to .remove_new(),
which returns void instead of int (Uwe Kleine-König)
- Drop unused struct pci_driver.node member (Mathias Krause)
Virtualization:
- Add ACS quirk for more Zhaoxin Root Ports (LeoLiuoc)
Error handling:
- Log AER errors as "Correctable" (not "Corrected") or
"Uncorrectable" to match spec terminology (Bjorn Helgaas)
- Decode Requester ID when no error info found instead of printing
the raw hex value (Bjorn Helgaas)
Endpoint framework:
- Use a unique test pattern for each BAR in the pci_endpoint_test to
make it easier to debug address translation issues (Niklas Cassel)
Broadcom STB PCIe controller driver:
- Add DT property "brcm,clkreq-mode" and driver support for different
CLKREQ# modes to make ASPM L1.x states possible (Jim Quinlan)
Freescale Layerscape PCIe controller driver:
- Add suspend/resume support for Layerscape LS1043a and LS1021a,
including software-managed PME_Turn_Off and transitions between L0,
L2/L3_Ready Link states (Frank Li)
MediaTek PCIe controller driver:
- Clear MSI interrupt status before handler to avoid missing MSIs
that occur after the handler (qizhong cheng)
MediaTek PCIe Gen3 controller driver:
- Update mediatek-gen3 translation window setup to handle MMIO space
that is not a power of two in size (Jianjun Wang)
Qualcomm PCIe controller driver:
- Increase qcom iommu-map maxItems to accommodate SDX55 (five
entries) and SDM845 (sixteen entries) (Krzysztof Kozlowski)
- Describe qcom,pcie-sc8180x clocks and resets accurately (Krzysztof
Kozlowski)
- Describe qcom,pcie-sm8150 clocks and resets accurately (Krzysztof
Kozlowski)
- Correct the qcom "reset-name" property, previously incorrectly
called "reset-names" (Krzysztof Kozlowski)
- Document qcom,pcie-sm8650, based on qcom,pcie-sm8550 (Neil
Armstrong)
Renesas R-Car PCIe controller driver:
- Replace of_device.h with explicit of.h include to untangle header
usage (Rob Herring)
- Add DT and driver support for optional miniPCIe 1.5v and 3.3v
regulators on KingFisher (Wolfram Sang)
SiFive FU740 PCIe controller driver:
- Convert fu740 CONFIG_PCIE_FU740 dependency from SOC_SIFIVE to
ARCH_SIFIVE (Conor Dooley)
Synopsys DesignWare PCIe controller driver:
- Align iATU mapping for endpoint MSI-X (Niklas Cassel)
- Drop "host_" prefix from struct dw_pcie_host_ops members (Yoshihiro
Shimoda)
- Drop "ep_" prefix from struct dw_pcie_ep_ops members (Yoshihiro
Shimoda)
- Rename struct dw_pcie_ep_ops.func_conf_select() to
.get_dbi_offset() to be more descriptive (Yoshihiro Shimoda)
- Add Endpoint DBI accessors to encapsulate offset lookups (Yoshihiro
Shimoda)
TI J721E PCIe driver:
- Add j721e DT and driver support for 'num-lanes' for devices that
support x1, x2, or x4 Links (Matt Ranostay)
- Add j721e DT compatible strings and driver support for j784s4 (Matt
Ranostay)
- Make TI J721E Kconfig depend on ARCH_K3 since the hardware is
specific to those TI SoC parts (Peter Robinson)
TI Keystone PCIe controller driver:
- Hold power management references to all PHYs while enabling them to
avoid a race when one provides clocks to others (Siddharth
Vadapalli)
Xilinx XDMA PCIe controller driver:
- Remove redundant dev_err(), since platform_get_irq() and
platform_get_irq_byname() already log errors (Yang Li)
- Fix uninitialized symbols in xilinx_pl_dma_pcie_setup_irq()
(Krzysztof Wilczyński)
- Fix xilinx_pl_dma_pcie_init_irq_domain() error return when
irq_domain_add_linear() fails (Harshit Mogalapalli)
MicroSemi Switchtec management driver:
- Do dma_mrpc cleanup during switchtec_pci_remove() to match its devm
ioremapping in switchtec_pci_probe(). Previously the cleanup was
done in stdev_release(), which used stale pointers if stdev->cdev
happened to be open when the PCI device was removed (Daniel
Stodden)
Miscellaneous:
- Convert interrupt terminology from "legacy" to "INTx" to be more
specific and match spec terminology (Damien Le Moal)
- In dw-xdata-pcie, pci_endpoint_test, and vmd, replace usage of
deprecated ida_simple_*() API with ida_alloc() and ida_free()
(Christophe JAILLET)"
* tag 'pci-v6.8-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci: (97 commits)
PCI: Fix kernel-doc issues
PCI: brcmstb: Configure HW CLKREQ# mode appropriate for downstream device
dt-bindings: PCI: brcmstb: Add property "brcm,clkreq-mode"
PCI: mediatek-gen3: Fix translation window size calculation
PCI: mediatek: Clear interrupt status before dispatching handler
PCI: keystone: Fix race condition when initializing PHYs
PCI: xilinx-xdma: Fix error code in xilinx_pl_dma_pcie_init_irq_domain()
PCI: xilinx-xdma: Fix uninitialized symbols in xilinx_pl_dma_pcie_setup_irq()
PCI: rcar-gen4: Fix -Wvoid-pointer-to-enum-cast error
PCI: iproc: Fix -Wvoid-pointer-to-enum-cast warning
PCI: dwc: Add dw_pcie_ep_{read,write}_dbi[2] helpers
PCI: dwc: Rename .func_conf_select to .get_dbi_offset in struct dw_pcie_ep_ops
PCI: dwc: Rename .ep_init to .init in struct dw_pcie_ep_ops
PCI: dwc: Drop host prefix from struct dw_pcie_host_ops members
misc: pci_endpoint_test: Use a unique test pattern for each BAR
PCI: j721e: Make TI J721E depend on ARCH_K3
PCI: j721e: Add TI J784S4 PCIe configuration
PCI/AER: Use explicit register sizes for struct members
PCI/AER: Decode Requester ID when no error info found
PCI/AER: Use 'Correctable' and 'Uncorrectable' spec terms for errors
...
- Use memdup_array_user() to harden against overflow.
- Unconditionally advertise KVM_CAP_DEVICE_CTRL for all architectures.
- Clean up Kconfigs that all KVM architectures were selecting
- New functionality around "guest_memfd", a new userspace API that
creates an anonymous file and returns a file descriptor that refers
to it. guest_memfd files are bound to their owning virtual machine,
cannot be mapped, read, or written by userspace, and cannot be resized.
guest_memfd files do however support PUNCH_HOLE, which can be used to
switch a memory area between guest_memfd and regular anonymous memory.
- New ioctl KVM_SET_MEMORY_ATTRIBUTES allowing userspace to specify
per-page attributes for a given page of guest memory; right now the
only attribute is whether the guest expects to access memory via
guest_memfd or not, which in Confidential SVMs backed by SEV-SNP,
TDX or ARM64 pKVM is checked by firmware or hypervisor that guarantees
confidentiality (AMD PSP, Intel TDX module, or EL2 in the case of pKVM).
x86:
- Support for "software-protected VMs" that can use the new guest_memfd
and page attributes infrastructure. This is mostly useful for testing,
since there is no pKVM-like infrastructure to provide a meaningfully
reduced TCB.
- Fix a relatively benign off-by-one error when splitting huge pages during
CLEAR_DIRTY_LOG.
- Fix a bug where KVM could incorrectly test-and-clear dirty bits in non-leaf
TDP MMU SPTEs if a racing thread replaces a huge SPTE with a non-huge SPTE.
- Use more generic lockdep assertions in paths that don't actually care
about whether the caller is a reader or a writer.
- let Xen guests opt out of having PV clock reported as "based on a stable TSC",
because some of them don't expect the "TSC stable" bit (added to the pvclock
ABI by KVM, but never set by Xen) to be set.
- Revert a bogus, made-up nested SVM consistency check for TLB_CONTROL.
- Advertise flush-by-ASID support for nSVM unconditionally, as KVM always
flushes on nested transitions, i.e. always satisfies flush requests. This
allows running bleeding edge versions of VMware Workstation on top of KVM.
- Sanity check that the CPU supports flush-by-ASID when enabling SEV support.
- On AMD machines with vNMI, always rely on hardware instead of intercepting
IRET in some cases to detect unmasking of NMIs
- Support for virtualizing Linear Address Masking (LAM)
- Fix a variety of vPMU bugs where KVM fail to stop/reset counters and other state
prior to refreshing the vPMU model.
- Fix a double-overflow PMU bug by tracking emulated counter events using a
dedicated field instead of snapshotting the "previous" counter. If the
hardware PMC count triggers overflow that is recognized in the same VM-Exit
that KVM manually bumps an event count, KVM would pend PMIs for both the
hardware-triggered overflow and for KVM-triggered overflow.
- Turn off KVM_WERROR by default for all configs so that it's not
inadvertantly enabled by non-KVM developers, which can be problematic for
subsystems that require no regressions for W=1 builds.
- Advertise all of the host-supported CPUID bits that enumerate IA32_SPEC_CTRL
"features".
- Don't force a masterclock update when a vCPU synchronizes to the current TSC
generation, as updating the masterclock can cause kvmclock's time to "jump"
unexpectedly, e.g. when userspace hotplugs a pre-created vCPU.
- Use RIP-relative address to read kvm_rebooting in the VM-Enter fault paths,
partly as a super minor optimization, but mostly to make KVM play nice with
position independent executable builds.
- Guard KVM-on-HyperV's range-based TLB flush hooks with an #ifdef on
CONFIG_HYPERV as a minor optimization, and to self-document the code.
- Add CONFIG_KVM_HYPERV to allow disabling KVM support for HyperV "emulation"
at build time.
ARM64:
- LPA2 support, adding 52bit IPA/PA capability for 4kB and 16kB
base granule sizes. Branch shared with the arm64 tree.
- Large Fine-Grained Trap rework, bringing some sanity to the
feature, although there is more to come. This comes with
a prefix branch shared with the arm64 tree.
- Some additional Nested Virtualization groundwork, mostly
introducing the NV2 VNCR support and retargetting the NV
support to that version of the architecture.
- A small set of vgic fixes and associated cleanups.
Loongarch:
- Optimization for memslot hugepage checking
- Cleanup and fix some HW/SW timer issues
- Add LSX/LASX (128bit/256bit SIMD) support
RISC-V:
- KVM_GET_REG_LIST improvement for vector registers
- Generate ISA extension reg_list using macros in get-reg-list selftest
- Support for reporting steal time along with selftest
s390:
- Bugfixes
Selftests:
- Fix an annoying goof where the NX hugepage test prints out garbage
instead of the magic token needed to run the test.
- Fix build errors when a header is delete/moved due to a missing flag
in the Makefile.
- Detect if KVM bugged/killed a selftest's VM and print out a helpful
message instead of complaining that a random ioctl() failed.
- Annotate the guest printf/assert helpers with __printf(), and fix the
various bugs that were lurking due to lack of said annotation.
There are two non-KVM patches buried in the middle of guest_memfd support:
fs: Rename anon_inode_getfile_secure() and anon_inode_getfd_secure()
mm: Add AS_UNMOVABLE to mark mapping as completely unmovable
The first is small and mostly suggested-by Christian Brauner; the second
a bit less so but it was written by an mm person (Vlastimil Babka).
-----BEGIN PGP SIGNATURE-----
iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmWcMWkUHHBib256aW5p
QHJlZGhhdC5jb20ACgkQv/vSX3jHroO15gf/WLmmg3SET6Uzw9iEq2xo28831ZA+
6kpILfIDGKozV5safDmMvcInlc/PTnqOFrsKyyN4kDZ+rIJiafJdg/loE0kPXBML
wdR+2ix5kYI1FucCDaGTahskBDz8Lb/xTpwGg9BFLYFNmuUeHc74o6GoNvr1uliE
4kLZL2K6w0cSMPybUD+HqGaET80ZqPwecv+s1JL+Ia0kYZJONJifoHnvOUJ7DpEi
rgudVdgzt3EPjG0y1z6MjvDBXTCOLDjXajErlYuZD3Ej8N8s59Dh2TxOiDNTLdP4
a4zjRvDmgyr6H6sz+upvwc7f4M4p+DBvf+TkWF54mbeObHUYliStqURIoA==
=66Ws
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm updates from Paolo Bonzini:
"Generic:
- Use memdup_array_user() to harden against overflow.
- Unconditionally advertise KVM_CAP_DEVICE_CTRL for all
architectures.
- Clean up Kconfigs that all KVM architectures were selecting
- New functionality around "guest_memfd", a new userspace API that
creates an anonymous file and returns a file descriptor that refers
to it. guest_memfd files are bound to their owning virtual machine,
cannot be mapped, read, or written by userspace, and cannot be
resized. guest_memfd files do however support PUNCH_HOLE, which can
be used to switch a memory area between guest_memfd and regular
anonymous memory.
- New ioctl KVM_SET_MEMORY_ATTRIBUTES allowing userspace to specify
per-page attributes for a given page of guest memory; right now the
only attribute is whether the guest expects to access memory via
guest_memfd or not, which in Confidential SVMs backed by SEV-SNP,
TDX or ARM64 pKVM is checked by firmware or hypervisor that
guarantees confidentiality (AMD PSP, Intel TDX module, or EL2 in
the case of pKVM).
x86:
- Support for "software-protected VMs" that can use the new
guest_memfd and page attributes infrastructure. This is mostly
useful for testing, since there is no pKVM-like infrastructure to
provide a meaningfully reduced TCB.
- Fix a relatively benign off-by-one error when splitting huge pages
during CLEAR_DIRTY_LOG.
- Fix a bug where KVM could incorrectly test-and-clear dirty bits in
non-leaf TDP MMU SPTEs if a racing thread replaces a huge SPTE with
a non-huge SPTE.
- Use more generic lockdep assertions in paths that don't actually
care about whether the caller is a reader or a writer.
- let Xen guests opt out of having PV clock reported as "based on a
stable TSC", because some of them don't expect the "TSC stable" bit
(added to the pvclock ABI by KVM, but never set by Xen) to be set.
- Revert a bogus, made-up nested SVM consistency check for
TLB_CONTROL.
- Advertise flush-by-ASID support for nSVM unconditionally, as KVM
always flushes on nested transitions, i.e. always satisfies flush
requests. This allows running bleeding edge versions of VMware
Workstation on top of KVM.
- Sanity check that the CPU supports flush-by-ASID when enabling SEV
support.
- On AMD machines with vNMI, always rely on hardware instead of
intercepting IRET in some cases to detect unmasking of NMIs
- Support for virtualizing Linear Address Masking (LAM)
- Fix a variety of vPMU bugs where KVM fail to stop/reset counters
and other state prior to refreshing the vPMU model.
- Fix a double-overflow PMU bug by tracking emulated counter events
using a dedicated field instead of snapshotting the "previous"
counter. If the hardware PMC count triggers overflow that is
recognized in the same VM-Exit that KVM manually bumps an event
count, KVM would pend PMIs for both the hardware-triggered overflow
and for KVM-triggered overflow.
- Turn off KVM_WERROR by default for all configs so that it's not
inadvertantly enabled by non-KVM developers, which can be
problematic for subsystems that require no regressions for W=1
builds.
- Advertise all of the host-supported CPUID bits that enumerate
IA32_SPEC_CTRL "features".
- Don't force a masterclock update when a vCPU synchronizes to the
current TSC generation, as updating the masterclock can cause
kvmclock's time to "jump" unexpectedly, e.g. when userspace
hotplugs a pre-created vCPU.
- Use RIP-relative address to read kvm_rebooting in the VM-Enter
fault paths, partly as a super minor optimization, but mostly to
make KVM play nice with position independent executable builds.
- Guard KVM-on-HyperV's range-based TLB flush hooks with an #ifdef on
CONFIG_HYPERV as a minor optimization, and to self-document the
code.
- Add CONFIG_KVM_HYPERV to allow disabling KVM support for HyperV
"emulation" at build time.
ARM64:
- LPA2 support, adding 52bit IPA/PA capability for 4kB and 16kB base
granule sizes. Branch shared with the arm64 tree.
- Large Fine-Grained Trap rework, bringing some sanity to the
feature, although there is more to come. This comes with a prefix
branch shared with the arm64 tree.
- Some additional Nested Virtualization groundwork, mostly
introducing the NV2 VNCR support and retargetting the NV support to
that version of the architecture.
- A small set of vgic fixes and associated cleanups.
Loongarch:
- Optimization for memslot hugepage checking
- Cleanup and fix some HW/SW timer issues
- Add LSX/LASX (128bit/256bit SIMD) support
RISC-V:
- KVM_GET_REG_LIST improvement for vector registers
- Generate ISA extension reg_list using macros in get-reg-list
selftest
- Support for reporting steal time along with selftest
s390:
- Bugfixes
Selftests:
- Fix an annoying goof where the NX hugepage test prints out garbage
instead of the magic token needed to run the test.
- Fix build errors when a header is delete/moved due to a missing
flag in the Makefile.
- Detect if KVM bugged/killed a selftest's VM and print out a helpful
message instead of complaining that a random ioctl() failed.
- Annotate the guest printf/assert helpers with __printf(), and fix
the various bugs that were lurking due to lack of said annotation"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (185 commits)
x86/kvm: Do not try to disable kvmclock if it was not enabled
KVM: x86: add missing "depends on KVM"
KVM: fix direction of dependency on MMU notifiers
KVM: introduce CONFIG_KVM_COMMON
KVM: arm64: Add missing memory barriers when switching to pKVM's hyp pgd
KVM: arm64: vgic-its: Avoid potential UAF in LPI translation cache
RISC-V: KVM: selftests: Add get-reg-list test for STA registers
RISC-V: KVM: selftests: Add steal_time test support
RISC-V: KVM: selftests: Add guest_sbi_probe_extension
RISC-V: KVM: selftests: Move sbi_ecall to processor.c
RISC-V: KVM: Implement SBI STA extension
RISC-V: KVM: Add support for SBI STA registers
RISC-V: KVM: Add support for SBI extension registers
RISC-V: KVM: Add SBI STA info to vcpu_arch
RISC-V: KVM: Add steal-update vcpu request
RISC-V: KVM: Add SBI STA extension skeleton
RISC-V: paravirt: Implement steal-time support
RISC-V: Add SBI STA extension definitions
RISC-V: paravirt: Add skeleton for pv-time support
RISC-V: KVM: Fix indentation in kvm_riscv_vcpu_set_reg_csr()
...
Core & protocols
----------------
- Analyze and reorganize core networking structs (socks, netdev,
netns, mibs) to optimize cacheline consumption and set up
build time warnings to safeguard against future header changes.
This improves TCP performances with many concurrent connections
up to 40%.
- Add page-pool netlink-based introspection, exposing the
memory usage and recycling stats. This helps indentify
bad PP users and possible leaks.
- Refine TCP/DCCP source port selection to no longer favor even
source port at connect() time when IP_LOCAL_PORT_RANGE is set.
This lowers the time taken by connect() for hosts having
many active connections to the same destination.
- Refactor the TCP bind conflict code, shrinking related socket
structs.
- Refactor TCP SYN-Cookie handling, as a preparation step to
allow arbitrary SYN-Cookie processing via eBPF.
- Tune optmem_max for 0-copy usage, increasing the default value
to 128KB and namespecifying it.
- Allow coalescing for cloned skbs coming from page pools, improving
RX performances with some common configurations.
- Reduce extension header parsing overhead at GRO time.
- Add bridge MDB bulk deletion support, allowing user-space to
request the deletion of matching entries.
- Reorder nftables struct members, to keep data accessed by the
datapath first.
- Introduce TC block ports tracking and use. This allows supporting
multicast-like behavior at the TC layer.
- Remove UAPI support for retired TC qdiscs (dsmark, CBQ and ATM) and
classifiers (RSVP and tcindex).
- More data-race annotations.
- Extend the diag interface to dump TCP bound-only sockets.
- Conditional notification of events for TC qdisc class and actions.
- Support for WPAN dynamic associations with nearby devices, to form
a sub-network using a specific PAN ID.
- Implement SMCv2.1 virtual ISM device support.
- Add support for Batman-avd mulicast packet type.
BPF
---
- Tons of verifier improvements:
- BPF register bounds logic and range support along with a large
test suite
- log improvements
- complete precision tracking support for register spills
- track aligned STACK_ZERO cases as imprecise spilled registers. It
improves the verifier "instructions processed" metric from single
digit to 50-60% for some programs
- support for user's global BPF subprogram arguments with few
commonly requested annotations for a better developer experience
- support tracking of BPF_JNE which helps cases when the compiler
transforms (unsigned) "a > 0" into "if a == 0 goto xxx" and the
like
- several fixes
- Add initial TX metadata implementation for AF_XDP with support in
mlx5 and stmmac drivers. Two types of offloads are supported right
now, that is, TX timestamp and TX checksum offload.
- Fix kCFI bugs in BPF all forms of indirect calls from BPF into
kernel and from kernel into BPF work with CFI enabled. This allows
BPF to work with CONFIG_FINEIBT=y.
- Change BPF verifier logic to validate global subprograms lazily
instead of unconditionally before the main program, so they can be
guarded using BPF CO-RE techniques.
- Support uid/gid options when mounting bpffs.
- Add a new kfunc which acquires the associated cgroup of a task
within a specific cgroup v1 hierarchy where the latter is identified
by its id.
- Extend verifier to allow bpf_refcount_acquire() of a map value field
obtained via direct load which is a use-case needed in sched_ext.
- Add BPF link_info support for uprobe multi link along with bpftool
integration for the latter.
- Support for VLAN tag in XDP hints.
- Remove deprecated bpfilter kernel leftovers given the project
is developed in user-space (https://github.com/facebook/bpfilter).
Misc
----
- Support for parellel TC self-tests execution.
- Increase MPTCP self-tests coverage.
- Updated the bridge documentation, including several so-far
undocumented features.
- Convert all the net self-tests to run in unique netns, to
avoid random failures due to conflict and allow concurrent
runs.
- Add TCP-AO self-tests.
- Add kunit tests for both cfg80211 and mac80211.
- Autogenerate Netlink families documentation from YAML spec.
- Add yml-gen support for fixed headers and recursive nests, the
tool can now generate user-space code for all genetlink families
for which we have specs.
- A bunch of additional module descriptions fixes.
- Catch incorrect freeing of pages belonging to a page pool.
Driver API
----------
- Rust abstractions for network PHY drivers; do not cover yet the
full C API, but already allow implementing functional PHY drivers
in rust.
- Introduce queue and NAPI support in the netdev Netlink interface,
allowing complete access to the device <> NAPIs <> queues
relationship.
- Introduce notifications filtering for devlink to allow control
application scale to thousands of instances.
- Improve PHY validation, requesting rate matching information for
each ethtool link mode supported by both the PHY and host.
- Add support for ethtool symmetric-xor RSS hash.
- ACPI based Wifi band RFI (WBRF) mitigation feature for the AMD
platform.
- Expose pin fractional frequency offset value over new DPLL generic
netlink attribute.
- Convert older drivers to platform remove callback returning void.
- Add support for PHY package MMD read/write.
New hardware / drivers
----------------------
- Ethernet:
- Octeon CN10K devices
- Broadcom 5760X P7
- Qualcomm SM8550 SoC
- Texas Instrument DP83TG720S PHY
- Bluetooth:
- IMC Networks Bluetooth radio
Removed
-------
- WiFi:
- libertas 16-bit PCMCIA support
- Atmel at76c50x drivers
- HostAP ISA/PCMCIA style 802.11b driver
- zd1201 802.11b USB dongles
- Orinoco ISA/PCMCIA 802.11b driver
- Aviator/Raytheon driver
- Planet WL3501 driver
- RNDIS USB 802.11b driver
Drivers
-------
- Ethernet high-speed NICs:
- Intel (100G, ice, idpf):
- allow one by one port representors creation and removal
- add temperature and clock information reporting
- add get/set for ethtool's header split ringparam
- add again FW logging
- adds support switchdev hardware packet mirroring
- iavf: implement symmetric-xor RSS hash
- igc: add support for concurrent physical and free-running timers
- i40e: increase the allowable descriptors
- nVidia/Mellanox:
- Preparation for Socket-Direct multi-dev netdev. That will allow
in future releases combining multiple PFs devices attached to
different NUMA nodes under the same netdev
- Broadcom (bnxt):
- TX completion handling improvements
- add basic ntuple filter support
- reduce MSIX vectors usage for MQPRIO offload
- add VXLAN support, USO offload and TX coalesce completion for P7
- Marvell Octeon EP:
- xmit-more support
- add PF-VF mailbox support and use it for FW notifications for VFs
- Wangxun (ngbe/txgbe):
- implement ethtool functions to operate pause param, ring param,
coalesce channel number and msglevel
- Netronome/Corigine (nfp):
- add flow-steering support
- support UDP segmentation offload
- Ethernet NICs embedded, slower, virtual:
- Xilinx AXI: remove duplicate DMA code adopting the dma engine driver
- stmmac: add support for HW-accelerated VLAN stripping
- TI AM654x sw: add mqprio, frame preemption & coalescing
- gve: add support for non-4k page sizes.
- virtio-net: support dynamic coalescing moderation
- nVidia/Mellanox Ethernet datacenter switches:
- allow firmware upgrade without a reboot
- more flexible support for bridge flooding via the compressed
FID flooding mode
- Ethernet embedded switches:
- Microchip:
- fine-tune flow control and speed configurations in KSZ8xxx
- KSZ88X3: enable setting rmii reference
- Renesas:
- add jumbo frames support
- Marvell:
- 88E6xxx: add "eth-mac" and "rmon" stats support
- Ethernet PHYs:
- aquantia: add firmware load support
- at803x: refactor the driver to simplify adding support for more
chip variants
- NXP C45 TJA11xx: Add MACsec offload support
- Wifi:
- MediaTek (mt76):
- NVMEM EEPROM improvements
- mt7996 Extremely High Throughput (EHT) improvements
- mt7996 Wireless Ethernet Dispatcher (WED) support
- mt7996 36-bit DMA support
- Qualcomm (ath12k):
- support for a single MSI vector
- WCN7850: support AP mode
- Intel (iwlwifi):
- new debugfs file fw_dbg_clear
- allow concurrent P2P operation on DFS channels
- Bluetooth:
- QCA2066: support HFP offload
- ISO: more broadcast-related improvements
- NXP: better recovery in case receiver/transmitter get out of sync
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQJGBAABCAAwFiEEg1AjqC77wbdLX2LbKSR5jcyPE6QFAmWdamsSHHBhYmVuaUBy
ZWRoYXQuY29tAAoJECkkeY3MjxOkGC4P/2xjLzdw22ckSssuE9ORbGko9SNjnqHk
PQh1E+26BHiCg5KB8VvzMsL78E79MRNXEattSW+1g7dhCvln3oi+Vd0WkdRkgt35
98Iv18zLbbwFAJeyKvmLAPAkQkMLtVj19QILBBRrugF+egEZgVSE3JBcTAiKv2ZQ
HzkabA171Ri6LpCcEEtY5XuaKvimGnGzF8YMFf8rX0wtqd2p5kbY9aMe47WAGxvU
Vf9548XvH+A5yVH2/4/gujtUOpA/RHuhuCMb+oo0cZ+VCC1x9MGzoXzj6r87OTkf
k2W1whNzcGoin92f+9Lk1JYMuiGKBH4QVaDdNXJnYFSJWPTE7RvRsPzYTSD4/GzK
yEZbzSJXpy/2vDQm16NoAxl7evRs8Sorzkw4LQRviZHI/5SAkK2ZQiCK5CO8QSYy
C1LELcV5kn6Foe24xWnrWLjAGug9oJnYoGPMU5gvPmFJMvUMXqm5rmbBgUWL5Rxw
q1M6gVzabCyWUy6z2G2vaqW2ZntNVvCkdsLtIX0XZkcTzNoP0MA+TuhyGz4wbiuo
PeyQp/mbGnDgCYggqKIA0YWrTVxkhFrKN520cbO8qXBQytV9oFbM/0/+C0/r/5WX
pL1JVzLrh6l5ME7EIQfha8UOF9j8q4ueSwb40P3AR2NaZiDABM0zfUZ6+sx+91WF
ucqPEcZB5cRE
=1bW6
-----END PGP SIGNATURE-----
Merge tag 'net-next-6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from Paolo Abeni:
"The most interesting thing is probably the networking structs
reorganization and a significant amount of changes is around
self-tests.
Core & protocols:
- Analyze and reorganize core networking structs (socks, netdev,
netns, mibs) to optimize cacheline consumption and set up build
time warnings to safeguard against future header changes
This improves TCP performances with many concurrent connections up
to 40%
- Add page-pool netlink-based introspection, exposing the memory
usage and recycling stats. This helps indentify bad PP users and
possible leaks
- Refine TCP/DCCP source port selection to no longer favor even
source port at connect() time when IP_LOCAL_PORT_RANGE is set. This
lowers the time taken by connect() for hosts having many active
connections to the same destination
- Refactor the TCP bind conflict code, shrinking related socket
structs
- Refactor TCP SYN-Cookie handling, as a preparation step to allow
arbitrary SYN-Cookie processing via eBPF
- Tune optmem_max for 0-copy usage, increasing the default value to
128KB and namespecifying it
- Allow coalescing for cloned skbs coming from page pools, improving
RX performances with some common configurations
- Reduce extension header parsing overhead at GRO time
- Add bridge MDB bulk deletion support, allowing user-space to
request the deletion of matching entries
- Reorder nftables struct members, to keep data accessed by the
datapath first
- Introduce TC block ports tracking and use. This allows supporting
multicast-like behavior at the TC layer
- Remove UAPI support for retired TC qdiscs (dsmark, CBQ and ATM) and
classifiers (RSVP and tcindex)
- More data-race annotations
- Extend the diag interface to dump TCP bound-only sockets
- Conditional notification of events for TC qdisc class and actions
- Support for WPAN dynamic associations with nearby devices, to form
a sub-network using a specific PAN ID
- Implement SMCv2.1 virtual ISM device support
- Add support for Batman-avd mulicast packet type
BPF:
- Tons of verifier improvements:
- BPF register bounds logic and range support along with a large
test suite
- log improvements
- complete precision tracking support for register spills
- track aligned STACK_ZERO cases as imprecise spilled registers.
This improves the verifier "instructions processed" metric from
single digit to 50-60% for some programs
- support for user's global BPF subprogram arguments with few
commonly requested annotations for a better developer
experience
- support tracking of BPF_JNE which helps cases when the compiler
transforms (unsigned) "a > 0" into "if a == 0 goto xxx" and the
like
- several fixes
- Add initial TX metadata implementation for AF_XDP with support in
mlx5 and stmmac drivers. Two types of offloads are supported right
now, that is, TX timestamp and TX checksum offload
- Fix kCFI bugs in BPF all forms of indirect calls from BPF into
kernel and from kernel into BPF work with CFI enabled. This allows
BPF to work with CONFIG_FINEIBT=y
- Change BPF verifier logic to validate global subprograms lazily
instead of unconditionally before the main program, so they can be
guarded using BPF CO-RE techniques
- Support uid/gid options when mounting bpffs
- Add a new kfunc which acquires the associated cgroup of a task
within a specific cgroup v1 hierarchy where the latter is
identified by its id
- Extend verifier to allow bpf_refcount_acquire() of a map value
field obtained via direct load which is a use-case needed in
sched_ext
- Add BPF link_info support for uprobe multi link along with bpftool
integration for the latter
- Support for VLAN tag in XDP hints
- Remove deprecated bpfilter kernel leftovers given the project is
developed in user-space (https://github.com/facebook/bpfilter)
Misc:
- Support for parellel TC self-tests execution
- Increase MPTCP self-tests coverage
- Updated the bridge documentation, including several so-far
undocumented features
- Convert all the net self-tests to run in unique netns, to avoid
random failures due to conflict and allow concurrent runs
- Add TCP-AO self-tests
- Add kunit tests for both cfg80211 and mac80211
- Autogenerate Netlink families documentation from YAML spec
- Add yml-gen support for fixed headers and recursive nests, the tool
can now generate user-space code for all genetlink families for
which we have specs
- A bunch of additional module descriptions fixes
- Catch incorrect freeing of pages belonging to a page pool
Driver API:
- Rust abstractions for network PHY drivers; do not cover yet the
full C API, but already allow implementing functional PHY drivers
in rust
- Introduce queue and NAPI support in the netdev Netlink interface,
allowing complete access to the device <> NAPIs <> queues
relationship
- Introduce notifications filtering for devlink to allow control
application scale to thousands of instances
- Improve PHY validation, requesting rate matching information for
each ethtool link mode supported by both the PHY and host
- Add support for ethtool symmetric-xor RSS hash
- ACPI based Wifi band RFI (WBRF) mitigation feature for the AMD
platform
- Expose pin fractional frequency offset value over new DPLL generic
netlink attribute
- Convert older drivers to platform remove callback returning void
- Add support for PHY package MMD read/write
New hardware / drivers:
- Ethernet:
- Octeon CN10K devices
- Broadcom 5760X P7
- Qualcomm SM8550 SoC
- Texas Instrument DP83TG720S PHY
- Bluetooth:
- IMC Networks Bluetooth radio
Removed:
- WiFi:
- libertas 16-bit PCMCIA support
- Atmel at76c50x drivers
- HostAP ISA/PCMCIA style 802.11b driver
- zd1201 802.11b USB dongles
- Orinoco ISA/PCMCIA 802.11b driver
- Aviator/Raytheon driver
- Planet WL3501 driver
- RNDIS USB 802.11b driver
Driver updates:
- Ethernet high-speed NICs:
- Intel (100G, ice, idpf):
- allow one by one port representors creation and removal
- add temperature and clock information reporting
- add get/set for ethtool's header split ringparam
- add again FW logging
- adds support switchdev hardware packet mirroring
- iavf: implement symmetric-xor RSS hash
- igc: add support for concurrent physical and free-running
timers
- i40e: increase the allowable descriptors
- nVidia/Mellanox:
- Preparation for Socket-Direct multi-dev netdev. That will
allow in future releases combining multiple PFs devices
attached to different NUMA nodes under the same netdev
- Broadcom (bnxt):
- TX completion handling improvements
- add basic ntuple filter support
- reduce MSIX vectors usage for MQPRIO offload
- add VXLAN support, USO offload and TX coalesce completion
for P7
- Marvell Octeon EP:
- xmit-more support
- add PF-VF mailbox support and use it for FW notifications
for VFs
- Wangxun (ngbe/txgbe):
- implement ethtool functions to operate pause param, ring
param, coalesce channel number and msglevel
- Netronome/Corigine (nfp):
- add flow-steering support
- support UDP segmentation offload
- Ethernet NICs embedded, slower, virtual:
- Xilinx AXI: remove duplicate DMA code adopting the dma engine
driver
- stmmac: add support for HW-accelerated VLAN stripping
- TI AM654x sw: add mqprio, frame preemption & coalescing
- gve: add support for non-4k page sizes.
- virtio-net: support dynamic coalescing moderation
- nVidia/Mellanox Ethernet datacenter switches:
- allow firmware upgrade without a reboot
- more flexible support for bridge flooding via the compressed
FID flooding mode
- Ethernet embedded switches:
- Microchip:
- fine-tune flow control and speed configurations in KSZ8xxx
- KSZ88X3: enable setting rmii reference
- Renesas:
- add jumbo frames support
- Marvell:
- 88E6xxx: add "eth-mac" and "rmon" stats support
- Ethernet PHYs:
- aquantia: add firmware load support
- at803x: refactor the driver to simplify adding support for more
chip variants
- NXP C45 TJA11xx: Add MACsec offload support
- Wifi:
- MediaTek (mt76):
- NVMEM EEPROM improvements
- mt7996 Extremely High Throughput (EHT) improvements
- mt7996 Wireless Ethernet Dispatcher (WED) support
- mt7996 36-bit DMA support
- Qualcomm (ath12k):
- support for a single MSI vector
- WCN7850: support AP mode
- Intel (iwlwifi):
- new debugfs file fw_dbg_clear
- allow concurrent P2P operation on DFS channels
- Bluetooth:
- QCA2066: support HFP offload
- ISO: more broadcast-related improvements
- NXP: better recovery in case receiver/transmitter get out of sync"
* tag 'net-next-6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1714 commits)
lan78xx: remove redundant statement in lan78xx_get_eee
lan743x: remove redundant statement in lan743x_ethtool_get_eee
bnxt_en: Fix RCU locking for ntuple filters in bnxt_rx_flow_steer()
bnxt_en: Fix RCU locking for ntuple filters in bnxt_srxclsrldel()
bnxt_en: Remove unneeded variable in bnxt_hwrm_clear_vnic_filter()
tcp: Revert no longer abort SYN_SENT when receiving some ICMP
Revert "mlx5 updates 2023-12-20"
Revert "net: stmmac: Enable Per DMA Channel interrupt"
ipvlan: Remove usage of the deprecated ida_simple_xx() API
ipvlan: Fix a typo in a comment
net/sched: Remove ipt action tests
net: stmmac: Use interrupt mode INTM=1 for per channel irq
net: stmmac: Add support for TX/RX channel interrupt
net: stmmac: Make MSI interrupt routine generic
dt-bindings: net: snps,dwmac: per channel irq
net: phy: at803x: make read_status more generic
net: phy: at803x: add support for cdt cross short test for qca808x
net: phy: at803x: refactor qca808x cable test get status function
net: phy: at803x: generalize cdt fault length function
net: ethernet: cortina: Drop TSO support
...
A series from Baoquan He cleans up the asm-generic/io.h to remove the
ioremap_uc() definition from everything except x86, which still needs it
for pre-PAT systems. This series notably contains a patch from Jiaxun Yang
that converts MIPS to use asm-generic/io.h like every other architecture
does, enabling future cleanups.
Some of my own patches fix -Wmissing-prototype warnings in architecture
specific code across several architectures. This is now needed as the
warning is enabled by default. There are still some remaining warnings
in minor platforms, but the series should catch most of the widely used
ones make them more consistent with one another.
David McKay fixes a bug in __generic_cmpxchg_local() when this is used
on 64-bit architectures. This could currently only affect parisc64
and sparc64.
Additional cleanups address from Linus Walleij, Uwe Kleine-König,
Thomas Huth, and Kefeng Wang help reduce unnecessary inconsistencies
between architectures.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEiK/NIGsWEZVxh/FrYKtH/8kJUicFAmWeak8ACgkQYKtH/8kJ
UidSiQ/+LL1WTO9d3Zx5HI0GGGjaIYpYs6jUNSf9Y5GPQiOrvjfEWj7CU11/4vxl
GlQRpRyncYm8Eiz0Qu+aNxZFiiMah8Uful75yfbX8P1L4EPTbAYNDjkyNJrTjIAK
jPK4sl8awIrapOeFUz++PsEj22R/4Is4f0mo+CqoCkL5RKlHe5oFdXzcwjmds4yK
CvU6Ldn+M7FZ3EItMdjXaB3D3HS9uictFiO5JByZY8p+IcqgNRI/iHNnZIMsltJ+
XjDi0DG+x4jCj6teElSchw7AofE4OcNSP3xbR1PLKv6+xBLGYaAGZhNuPTz88eV/
Gj0loDQrrR5McGUfDBRHK9zN2Jd0O/FKnfh9kLOt1FLFyGPvC78Q/2HkpVCjbBr2
Pr1aqhLDHA+tGNSsThsV8RUa8/tiEnxAki43tfBFS3SEKhtQsTm2g1z4miwbE3p0
BJIrSgTqrP/SBq7a9z/thPrkzdZcNuA9FUETTbaMeUlJS51n1V9E5A1t7sOG7jaI
vV/gbuR6FjvD49mTyQiOSCt3V4ygRqgN1Q+C4QM8WLqq2keUq0AhGodquv8F78in
J3x2j2r27lHY7jKf8B0dua/JXAsF20u8qD6yDQ9ymkjt/MWhGXBgK0jpT7RTIuMS
e2jmTywUVD4UohAcx3inkOojUhIJ5KDB0I4Pzv4zWcHNbyFNKcY=
=4VQl
-----END PGP SIGNATURE-----
Merge tag 'asm-generic-6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic
Pull asm-generic cleanups from Arnd Bergmann:
"A series from Baoquan He cleans up the asm-generic/io.h to remove the
ioremap_uc() definition from everything except x86, which still needs
it for pre-PAT systems. This series notably contains a patch from
Jiaxun Yang that converts MIPS to use asm-generic/io.h like every
other architecture does, enabling future cleanups.
Some of my own patches fix -Wmissing-prototype warnings in
architecture specific code across several architectures. This is now
needed as the warning is enabled by default. There are still some
remaining warnings in minor platforms, but the series should catch
most of the widely used ones make them more consistent with one
another.
David McKay fixes a bug in __generic_cmpxchg_local() when this is used
on 64-bit architectures. This could currently only affect parisc64 and
sparc64.
Additional cleanups address from Linus Walleij, Uwe Kleine-König,
Thomas Huth, and Kefeng Wang help reduce unnecessary inconsistencies
between architectures"
* tag 'asm-generic-6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic:
asm-generic: Fix 32 bit __generic_cmpxchg_local
Hexagon: Make pfn accessors statics inlines
ARC: mm: Make virt_to_pfn() a static inline
mips: remove extraneous asm-generic/iomap.h include
sparc: Use $(kecho) to announce kernel images being ready
arm64: vdso32: Define BUILD_VDSO32_64 to correct prototypes
csky: fix arch_jump_label_transform_static override
arch: add do_page_fault prototypes
arch: add missing prepare_ftrace_return() prototypes
arch: vdso: consolidate gettime prototypes
arch: include linux/cpu.h for trap_init() prototype
arch: fix asm-offsets.c building with -Wmissing-prototypes
arch: consolidate arch_irq_work_raise prototypes
hexagon: Remove CONFIG_HEXAGON_ARCH_VERSION from uapi header
asm/io: remove unnecessary xlate_dev_mem_ptr() and unxlate_dev_mem_ptr()
mips: io: remove duplicated codes
arch/*/io.h: remove ioremap_uc in some architectures
mips: add <asm-generic/io.h> including
The goal is to get sched.h down to a type only header, so the main thing
happening in this patchset is splitting out various _types.h headers and
dependency fixups, as well as moving some things out of sched.h to
better locations.
This is prep work for the memory allocation profiling patchset which
adds new sched.h interdepencencies.
Testing - it's been in -next, and fixes from pretty much all
architectures have percolated in - nothing major.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEKnAFLkS8Qha+jvQrE6szbY3KbnYFAmWfBwwACgkQE6szbY3K
bnZPwBAAmuRojXaeWxi01IPIOehSGDe68vw44PR9glEMZvxdnZuPOdvE4/+245/L
bRKU2WBCjBUokUbV9msIShwRkFTZAmEMPNfPAAsFMA+VXeDYHKB+ZRdwTggNAQ+I
SG6fZgh5m0HsewCDxU8oqVHkjVq4fXn0cy+aL6xLEd9gu67GoBzX2pDieS2Kvy6j
jnyoKTxFwb+LTQgph0P4EIpq5I2umAsdLwdSR8EJ+8e9NiNvMo1pI00Lx/ntAnFZ
JftWUJcMy3TQ5u1GkyfQN9y/yThX1bZK5GvmHS9SJ2Dkacaus5d+xaKCHtRuFS1I
7C6b8PsNgRczUMumBXus44HdlNfNs1yU3lvVxFvBIPE1qC9pYRHrkWIXXIocXLLC
oxTEJ6B2G3BQZVQgLIA4fOaxMVhmvKffi/aEZLi9vN9VVosd1a6XNKI6KbyRnXFp
GSs9qDqszhn5I3GYNlDNQTc/8UsRlhPFgS6nS0By6QnvxtGi9QkU2tBRBsXvqwCy
cLoCYIhc2tvugHvld70dz26umiJ4rnmxGlobStNoigDvIKAIUt1UmIdr1so8P8eH
xehnL9ZcOX6xnANDL0AqMFFHV6I58CJynhFdUoXfVQf/DWLGX48mpi9LVNsYBzsI
CAwVOAQ0UjGrpdWmJ9ueY/ABYqg9vRjzaDEXQ+MhAYO55CLaVsg=
=3tyT
-----END PGP SIGNATURE-----
Merge tag 'header_cleanup-2024-01-10' of https://evilpiepirate.org/git/bcachefs
Pull header cleanups from Kent Overstreet:
"The goal is to get sched.h down to a type only header, so the main
thing happening in this patchset is splitting out various _types.h
headers and dependency fixups, as well as moving some things out of
sched.h to better locations.
This is prep work for the memory allocation profiling patchset which
adds new sched.h interdepencencies"
* tag 'header_cleanup-2024-01-10' of https://evilpiepirate.org/git/bcachefs: (51 commits)
Kill sched.h dependency on rcupdate.h
kill unnecessary thread_info.h include
Kill unnecessary kernel.h include
preempt.h: Kill dependency on list.h
rseq: Split out rseq.h from sched.h
LoongArch: signal.c: add header file to fix build error
restart_block: Trim includes
lockdep: move held_lock to lockdep_types.h
sem: Split out sem_types.h
uidgid: Split out uidgid_types.h
seccomp: Split out seccomp_types.h
refcount: Split out refcount_types.h
uapi/linux/resource.h: fix include
x86/signal: kill dependency on time.h
syscall_user_dispatch.h: split out *_types.h
mm_types_task.h: Trim dependencies
Split out irqflags_types.h
ipc: Kill bogus dependency on spinlock.h
shm: Slim down dependencies
workqueue: Split out workqueue_types.h
...
Step 5/10 of the namespace unification of CPU mitigations related Kconfig options.
[ mingo: Converted a few more uses in comments/messages as well. ]
Suggested-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Ariel Miculas <amiculas@cisco.com>
Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20231121160740.1249350-6-leitao@debian.org
Step 4/10 of the namespace unification of CPU mitigations related Kconfig options.
[ mingo: Converted new uses that got added since the series was posted. ]
Suggested-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20231121160740.1249350-5-leitao@debian.org
So the CPU mitigations Kconfig entries - there's 10 meanwhile - are named
in a historically idiosyncratic and hence rather inconsistent fashion
and have become hard to relate with each other over the years:
https://lore.kernel.org/lkml/20231011044252.42bplzjsam3qsasz@treble/
When they were introduced we never expected that we'd eventually have
about a dozen of them, and that more organization would be useful,
especially for Linux distributions that want to enable them in an
informed fashion, and want to make sure all mitigations are configured
as expected.
For example, the current CONFIG_SPECULATION_MITIGATIONS namespace is only
halfway populated, where some mitigations have entries in Kconfig, and
they could be modified, while others mitigations do not have Kconfig entries,
and can not be controlled at build time.
Fine-grained control over these Kconfig entries can help in a number of ways:
1) Users can choose and pick only mitigations that are important for
their workloads.
2) Users and developers can choose to disable mitigations that mangle
the assembly code generation, making it hard to read.
3) Separate Kconfigs for just source code readability,
so that we see *which* butt-ugly piece of crap code is for what
reason...
In most cases, if a mitigation is disabled at compilation time, it
can still be enabled at runtime using kernel command line arguments.
This is the first patch of an initial series that renames various
mitigation related Kconfig options, unifying them under a single
CONFIG_MITIGATION_* namespace:
CONFIG_GDS_FORCE_MITIGATION => CONFIG_MITIGATION_GDS_FORCE
CONFIG_CPU_IBPB_ENTRY => CONFIG_MITIGATION_IBPB_ENTRY
CONFIG_CALL_DEPTH_TRACKING => CONFIG_MITIGATION_CALL_DEPTH_TRACKING
CONFIG_PAGE_TABLE_ISOLATION => CONFIG_MITIGATION_PAGE_TABLE_ISOLATION
CONFIG_RETPOLINE => CONFIG_MITIGATION_RETPOLINE
CONFIG_SLS => CONFIG_MITIGATION_SLS
CONFIG_CPU_UNRET_ENTRY => CONFIG_MITIGATION_UNRET_ENTRY
CONFIG_CPU_IBRS_ENTRY => CONFIG_MITIGATION_IBRS_ENTRY
CONFIG_CPU_SRSO => CONFIG_MITIGATION_SRSO
CONFIG_RETHUNK => CONFIG_MITIGATION_RETHUNK
Implement step 1/10 of the namespace unification of CPU mitigations related
Kconfig options and rename CONFIG_GDS_FORCE_MITIGATION to
CONFIG_MITIGATION_GDS_FORCE.
[ mingo: Rewrote changelog for clarity. ]
Suggested-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20231121160740.1249350-2-leitao@debian.org
many places. The notable patch series are:
- nilfs2 folio conversion from Matthew Wilcox in "nilfs2: Folio
conversions for file paths".
- Additional nilfs2 folio conversion from Ryusuke Konishi in "nilfs2:
Folio conversions for directory paths".
- IA64 remnant removal in Heiko Carstens's "Remove unused code after
IA-64 removal".
- Arnd Bergmann has enabled the -Wmissing-prototypes warning everywhere
in "Treewide: enable -Wmissing-prototypes". This had some followup
fixes:
- Nathan Chancellor has cleaned up the hexagon build in the series
"hexagon: Fix up instances of -Wmissing-prototypes".
- Nathan also addressed some s390 warnings in "s390: A couple of
fixes for -Wmissing-prototypes".
- Arnd Bergmann addresses the same warnings for MIPS in his series
"mips: address -Wmissing-prototypes warnings".
- Baoquan He has made kexec_file operate in a top-down-fitting manner
similar to kexec_load in the series "kexec_file: Load kernel at top of
system RAM if required"
- Baoquan He has also added the self-explanatory "kexec_file: print out
debugging message if required".
- Some checkstack maintenance work from Tiezhu Yang in the series
"Modify some code about checkstack".
- Douglas Anderson has disentangled the watchdog code's logging when
multiple reports are occurring simultaneously. The series is "watchdog:
Better handling of concurrent lockups".
- Yuntao Wang has contributed some maintenance work on the crash code in
"crash: Some cleanups and fixes".
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZZ2R6AAKCRDdBJ7gKXxA
juCVAP4t76qUISDOSKugB/Dn5E4Nt9wvPY9PcufnmD+xoPsgkQD+JVl4+jd9+gAV
vl6wkJDiJO5JZ3FVtBtC3DFA/xHtVgk=
=kQw+
-----END PGP SIGNATURE-----
Merge tag 'mm-nonmm-stable-2024-01-09-10-33' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull non-MM updates from Andrew Morton:
"Quite a lot of kexec work this time around. Many singleton patches in
many places. The notable patch series are:
- nilfs2 folio conversion from Matthew Wilcox in 'nilfs2: Folio
conversions for file paths'.
- Additional nilfs2 folio conversion from Ryusuke Konishi in 'nilfs2:
Folio conversions for directory paths'.
- IA64 remnant removal in Heiko Carstens's 'Remove unused code after
IA-64 removal'.
- Arnd Bergmann has enabled the -Wmissing-prototypes warning
everywhere in 'Treewide: enable -Wmissing-prototypes'. This had
some followup fixes:
- Nathan Chancellor has cleaned up the hexagon build in the series
'hexagon: Fix up instances of -Wmissing-prototypes'.
- Nathan also addressed some s390 warnings in 's390: A couple of
fixes for -Wmissing-prototypes'.
- Arnd Bergmann addresses the same warnings for MIPS in his series
'mips: address -Wmissing-prototypes warnings'.
- Baoquan He has made kexec_file operate in a top-down-fitting manner
similar to kexec_load in the series 'kexec_file: Load kernel at top
of system RAM if required'
- Baoquan He has also added the self-explanatory 'kexec_file: print
out debugging message if required'.
- Some checkstack maintenance work from Tiezhu Yang in the series
'Modify some code about checkstack'.
- Douglas Anderson has disentangled the watchdog code's logging when
multiple reports are occurring simultaneously. The series is
'watchdog: Better handling of concurrent lockups'.
- Yuntao Wang has contributed some maintenance work on the crash code
in 'crash: Some cleanups and fixes'"
* tag 'mm-nonmm-stable-2024-01-09-10-33' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (157 commits)
crash_core: fix and simplify the logic of crash_exclude_mem_range()
x86/crash: use SZ_1M macro instead of hardcoded value
x86/crash: remove the unused image parameter from prepare_elf_headers()
kdump: remove redundant DEFAULT_CRASH_KERNEL_LOW_SIZE
scripts/decode_stacktrace.sh: strip unexpected CR from lines
watchdog: if panicking and we dumped everything, don't re-enable dumping
watchdog/hardlockup: use printk_cpu_sync_get_irqsave() to serialize reporting
watchdog/softlockup: use printk_cpu_sync_get_irqsave() to serialize reporting
watchdog/hardlockup: adopt softlockup logic avoiding double-dumps
kexec_core: fix the assignment to kimage->control_page
x86/kexec: fix incorrect end address passed to kernel_ident_mapping_init()
lib/trace_readwrite.c:: replace asm-generic/io with linux/io
nilfs2: cpfile: fix some kernel-doc warnings
stacktrace: fix kernel-doc typo
scripts/checkstack.pl: fix no space expression between sp and offset
x86/kexec: fix incorrect argument passed to kexec_dprintk()
x86/kexec: use pr_err() instead of kexec_dprintk() when an error occurs
nilfs2: add missing set_freezable() for freezable kthread
kernel: relay: remove relay_file_splice_read dead code, doesn't work
docs: submit-checklist: remove all of "make namespacecheck"
...
- Add branch stack counters ABI extension to better capture
the growing amount of information the PMU exposes via
branch stack sampling. There's matching tooling support.
- Fix race when creating the nr_addr_filters sysfs file
- Add Intel Sierra Forest and Grand Ridge intel/cstate
PMU support.
- Add Intel Granite Rapids, Sierra Forest and Grand Ridge
uncore PMU support.
- Misc cleanups & fixes.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmWb4lURHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1jlnQ/+NSzrPQ9hEiS5a1iMMxdwC6IoXCmeFVsv
s5NsGaVC7FEgjm3oCfvQlP63HolMO9R7TNLZsgINzOda5IHtE7WUcgBK7gbZr+NT
WabdTyFrdmUr+Br0rLrEe0bxDSQU7r41ptqKE5HZRM9/3SbLhWgaXSJbfFAG2JV0
xboZ/2qzb7Puch6VTWv1YhuIpr1Pi817As4SOo7JR4V8jBB2bh2eZ7XBN1z23aw2
xuglbYml5gs4dOaFTqkRLWyn2PmrZ9wYKcdp63FVUscZ4LxvSw749BxEcNpTbxLp
PT6uXIKw9PnStNfscfrsk6fDocVJzqrOK71blgiOKbmhWTE0UimEpFf1Hd3ooewg
hFp3hmkE5Bc2MTUnwivkBxj96fz5rXH+3+Cue/5NsvDNlhlkswIIxzDw8M1G4rOI
KQMDUYFOhQPa3Hi1lSp2SgHI5AcYHudepr/Z3QMxD3iLs+Wo2cmDcp8d2VrMLfb7
GHSITG592iYcZPYsJosxby8CSFaUPxIl9l3AODQwWuEjd4PcOYa6iB2HbEa/mC3R
wXcs8mFIMAaH/HRYUlqUDA5pOqN5chb13iDtS4JqJqBKyWgdrDLCVxoZSQvB64+I
bldyy1e5oQSVVwJ42WLkUK3Eld2x75ki1JLZFwMgYuOgQv3jfu2VNenUWJ5ig0La
dPpHP8PwOoc=
=2O/5
-----END PGP SIGNATURE-----
Merge tag 'perf-core-2024-01-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull performance events updates from Ingo Molnar:
- Add branch stack counters ABI extension to better capture the growing
amount of information the PMU exposes via branch stack sampling.
There's matching tooling support.
- Fix race when creating the nr_addr_filters sysfs file
- Add Intel Sierra Forest and Grand Ridge intel/cstate PMU support
- Add Intel Granite Rapids, Sierra Forest and Grand Ridge uncore PMU
support
- Misc cleanups & fixes
* tag 'perf-core-2024-01-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86/intel/uncore: Factor out topology_gidnid_map()
perf/x86/intel/uncore: Fix NULL pointer dereference issue in upi_fill_topology()
perf/x86/amd: Reject branch stack for IBS events
perf/x86/intel/uncore: Support Sierra Forest and Grand Ridge
perf/x86/intel/uncore: Support IIO free-running counters on GNR
perf/x86/intel/uncore: Support Granite Rapids
perf/x86/uncore: Use u64 to replace unsigned for the uncore offsets array
perf/x86/intel/uncore: Generic uncore_get_uncores and MMIO format of SPR
perf: Fix the nr_addr_filters fix
perf/x86/intel/cstate: Add Grand Ridge support
perf/x86/intel/cstate: Add Sierra Forest support
x86/smp: Export symbol cpu_clustergroup_mask()
perf/x86/intel/cstate: Cleanup duplicate attr_groups
perf/core: Fix narrow startup race when creating the perf nr_addr_filters sysfs file
perf/x86/intel: Support branch counters logging
perf/x86/intel: Reorganize attrs and is_visible
perf: Add branch_sample_call_stack
perf/x86: Add PERF_X86_EVENT_NEEDS_BRANCH_STACK flag
perf: Add branch stack counters
- A micro-optimization got misplaced as a cleanup:
- Micro-optimize the asm code in secondary_startup_64_no_verify()
- Change global variables to local
- Add missing kernel-doc function parameter descriptions
- Remove unused parameter from a macro
- Remove obsolete Kconfig entry
- Fix comments
- Fix typos, mostly scripted, manually reviewed
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmWb2i8RHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1iFIQ//RjqKWmEBfv0UVCNgtRgkUKOvYVkfhC1R
FykHWbSE+/oDODS7B+gbWqzl9Fq2Oxx9re4KZuMfnojE96KZ6H1flQn7z3UVRUrf
pfMx13E+uyf7qbVZktqH38lUS4s/AHdX2PKCiXlU/0hIkiBdjbAl3ylyqMv7ytIL
Fi2N9iYJN+eLlMkc3A5IK83xNiU8rb0gO6Uywn3nUbqadY/YX2gDpND5kfzRIneR
lTKy4rX3+E65qYB2Ly1wDr7e0Q0rgaTzPctx6twFrxQXK+MsHiartJhM5juND/tU
DEjSW9ISOHlitKEJI/zbdrvJlr5AKDNy2zHYmQQuqY6+YHRamCKqwIjLIPkKj52g
lAbosNwvp/o8W3zUHgUfVZR5hVxN863zV2qa/ehoQ3b/9kNjQC8actILjYEgIVu9
av1sd+nETbjCUABIF9H9uAoRbgc+wQs2nupJZrjvginFz8+WVhgaBdJDMYCNAmjc
fNMjGtRS7YXiIMj09ZAXFThVW302FdbTgggDh/qlQlDOXFu5HRbyuWR+USr4/jkP
qs2G6m/BHDs9HxDRo/no+ccSrUBV5phfhZbO7qwjTf2NJJvPHW+cxGpT00zU2v8A
lgfVI7SDkxwbyi1gacJ054GqEhsWuEdi40ikqxjhL8Oq4xwwsey/PiaIxjkDQx92
Gj3XUSDnGEs=
=kUav
-----END PGP SIGNATURE-----
Merge tag 'x86-cleanups-2024-01-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 cleanups from Ingo Molnar:
- Change global variables to local
- Add missing kernel-doc function parameter descriptions
- Remove unused parameter from a macro
- Remove obsolete Kconfig entry
- Fix comments
- Fix typos, mostly scripted, manually reviewed
and a micro-optimization got misplaced as a cleanup:
- Micro-optimize the asm code in secondary_startup_64_no_verify()
* tag 'x86-cleanups-2024-01-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
arch/x86: Fix typos
x86/head_64: Use TESTB instead of TESTL in secondary_startup_64_no_verify()
x86/docs: Remove reference to syscall trampoline in PTI
x86/Kconfig: Remove obsolete config X86_32_SMP
x86/io: Remove the unused 'bw' parameter from the BUILDIO() macro
x86/mtrr: Document missing function parameters in kernel-doc
x86/setup: Make relocated_ramdisk a local variable of relocate_initrd()
- Replace magic numbers in GDT descriptor definitions & handling:
- Introduce symbolic names via macros for descriptor types/fields/flags,
and then use these symbolic names.
- Clean up definitions a bit, such as GDT_ENTRY_INIT()
- Fix/clean up details that became visibly inconsistent after the
symbol-based code was introduced:
- Unify accessed flag handling
- Set the D/B size flag consistently & according to the HW specification
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmWb1hQRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1j7/xAAk2L36n4ZpWqm4oWxMyRDTd713qT+Jzg5
txVr7i8el5RZ8D9WQejzOWemJ/XgirkWuRSFfY+UzZZ+7lmioXWDFLz6PFpyMOwM
OG2yyAAZIQzFhe1LxhurN5G7F+q9Nl/CcDnRWv+FUKs829GOGlaqirKfKwgIFzgw
GUT0eT5gPfIfEFnzMUrix9+KNIZU6s6Lgg5tVpavNw6bKEimg8Mtn4J45voFbg98
l/PU0RaC28vePTMdoBLv/ZZUHS9K/UdGUq600guzk/Zh2plqxGtqjXtaM5/8giPO
JbgQCTP0LaCC0oZf0SAbUW7+TiKD03CP6QzleYyLdhTjLz1CCOswgz0upPjfa/5j
nhgsMCk/CEhr6nMwhJiRNhOrFFFJWX00bpcQ2tyyVO156Q+CfhJxuk1XafeQ5MSs
K4l97IVc/vPQSEZy5Nh6GdGmWljYxP3Ku4eHRkyVGVYf8YyaD2l6CimaRcdPNqqL
w9GLwczLfEzkomfdxsyppwu364O91Lpv6gi6AY4Tj7Hgo1ZMm9ucUO3AsTUz5EJJ
aiqX6o7puIJJkU2O3hKviTzNwTojjjicD/Az66re3lHZBND+luP7LjVY2wIomH5i
bFPVdSCDmCdTrsTkO3f87FHbCVrbNllKLGChvJ66vyC2nJeIM8UHJ/MxYJsGRhqk
87vU1DWYPrQ=
=Y0pg
-----END PGP SIGNATURE-----
Merge tag 'x86-asm-2024-01-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 asm updates from Ingo Molnar:
"Replace magic numbers in GDT descriptor definitions & handling:
- Introduce symbolic names via macros for descriptor
types/fields/flags, and then use these symbolic names.
- Clean up definitions a bit, such as GDT_ENTRY_INIT()
- Fix/clean up details that became visibly inconsistent after the
symbol-based code was introduced:
- Unify accessed flag handling
- Set the D/B size flag consistently & according to the HW
specification"
* tag 'x86-asm-2024-01-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/asm: Add DB flag to 32-bit percpu GDT entry
x86/asm: Always set A (accessed) flag in GDT descriptors
x86/asm: Replace magic numbers in GDT descriptors, script-generated change
x86/asm: Replace magic numbers in GDT descriptors, preparations
x86/asm: Provide new infrastructure for GDT descriptors
solution which allows for more timely detection and reporting of
errors
- Start a documentation section which will hold down relevant
RAS features description and how they should be used
- Add new AMD error bank types
- Slim down and remove error type descriptions from the kernel side of
error decoding to rasdaemon which can be used from now on to decode
hw errors on AMD
- Mark pages containing uncorrectable errors as poison so that kdump can
avoid them and thus not cause another panic
- The usual cleanups and fixlets
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmWamqkACgkQEsHwGGHe
VUrkMw/+Jv//z5pKFMXF2GlI/Xefh8pXDB57D22B1T6zNJm+Qq1b58U44WpxoU24
b9gPtkgxjzA3JwoQG+cBDkCSs1xLV3McUS0UoEqQ28QvFN+mzxYKu4ura3F+rcZG
PiCT4gPEYgZirYl0PKYeypnBPq3Krx/RYeTUE0vQ9HtmeBCmH71X1egyB4TqFbFU
7ui9aITLAnLBwgO6On1qviGPvJoEDGAsQ656XoY0Js+8dUeYwAI4qpaaUwtXUo1D
ARGzss55/qTRo2G+IkDDhbJ8e8G+eX22oa0n1tdhNFBqfYAN6gM+t4NiFlNnn+18
nlbaSZfDygciE8DVjEnkVIrRJtkq7uj0dk7LvnqEI2y7J0LybHojC3hDKrqYLK3o
PRgfPwmykOCwZQldGFYbShvmY8KoEQc/V9OWi/+A/M/uTJsForQmHn78Z2YkO9kG
K6VaLuYszSCqz47wf56pHBwtMrivEPmkcxaz9ErkK90vM/NmV7kfLl+R8IK8apNJ
cJkuLBjfgGpBP+AlpXGl9OE0lRJK5MCbEBlbGBBl58REBaB4DNkM4QHItrUSRR82
ADLcfLAIRWT8UwwXieDbWF1jb+4L+IJnXCKGCCQ7+eYxcFI9V9TABD2B8io+Dzvz
evZwLCPKmjuPc2CMcDu/eUdBKLNTn3QAoN/NLcVmzJ23lguQW/M=
=o3UM
-----END PGP SIGNATURE-----
Merge tag 'ras_core_for_v6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 RAS updates from Borislav Petkov:
- Convert the hw error storm handling into a finer-grained, per-bank
solution which allows for more timely detection and reporting of
errors
- Start a documentation section which will hold down relevant RAS
features description and how they should be used
- Add new AMD error bank types
- Slim down and remove error type descriptions from the kernel side of
error decoding to rasdaemon which can be used from now on to decode
hw errors on AMD
- Mark pages containing uncorrectable errors as poison so that kdump
can avoid them and thus not cause another panic
- The usual cleanups and fixlets
* tag 'ras_core_for_v6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mce: Handle Intel threshold interrupt storms
x86/mce: Add per-bank CMCI storm mitigation
x86/mce: Remove old CMCI storm mitigation code
Documentation: Begin a RAS section
x86/MCE/AMD: Add new MA_LLC, USR_DP, and USR_CP bank types
EDAC/mce_amd: Remove SMCA Extended Error code descriptions
x86/mce/amd, EDAC/mce_amd: Move long names to decoder module
x86/mce/inject: Clear test status value
x86/mce: Remove redundant check from mce_device_create()
x86/mce: Mark fatal MCE's page as poison to avoid panic in the kdump kernel
and use them everywhere instead of ad-hoc family/model checks. Drop an
ancient AMD errata checking facility as a result
- Fix a fragile initcall ordering in intel_epb
- Do not issue the MFENCE+LFENCE barrier for the TSC deadline and X2APIC
MSRs on AMD as it is not needed there
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmWalaEACgkQEsHwGGHe
VUrDYg/+LjqjJv3OcVZZkx9WVds0kmBCajrf9JxRYgiSTIpiL/usH0QOms8FjHQ6
tYcukj+RJDk2nP5ho3Vs1WNA0mvU0nxC+99u0Ph4zugSMSl0XGOA+YxxTBPXmDGB
1IxH9IloMFPhwDoQ4/ear0IvjIrfE4ESV2Dafe45WzVSdG7/0ijurisaH1kPYraP
wzuNn142Tk0eicaam30sdThXZraO9Paz5MOYbpYEAU4lxNtdH85sQa+Xk0tqJcjD
IwEcQJLE6n3r8t/lNMIlhAsmOVGrD5WltDH9HvEmKT4mzTumSc9DLu3YHHRWyx2K
TMpRYHlVuvGJkJV3CYXi8fhTsV6uMsHEe1+xZ/Rf0iQzOG25v+zen8WK4REWOr/o
VmprG3j7LkEFeeH3CqSOtVSbYmxFILQb6pAbzSlI907b5C6PaEYuudjVXuX01urN
IG3krWHGMJ3AWKDV2Z3hW1TYtbLJyqKPNhqcBJiOuWyCe8cQXfKQBTpP5HuAEZEd
UXc4QpStMvuPqxyQhlPSTAtY7L/UVhBH8oHoXPYiBmcCo7VtJYW6HH9z1ISUc1av
FgKdkpx6vaJiXlD/wI/B5T1oViWQ8udhHpit99rhKl623e7WC2rdguAOVDLn/YIe
cZB+R05yknBWOavH0kcuz9R9xYKMSBcEsRnBKmeOg9R+tTK/7BM=
=afTN
-----END PGP SIGNATURE-----
Merge tag 'x86_cpu_for_v6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 cpu feature updates from Borislav Petkov:
- Add synthetic X86_FEATURE flags for the different AMD Zen generations
and use them everywhere instead of ad-hoc family/model checks. Drop
an ancient AMD errata checking facility as a result
- Fix a fragile initcall ordering in intel_epb
- Do not issue the MFENCE+LFENCE barrier for the TSC deadline and
X2APIC MSRs on AMD as it is not needed there
* tag 'x86_cpu_for_v6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/CPU/AMD: Add X86_FEATURE_ZEN1
x86/CPU/AMD: Drop now unused CPU erratum checking function
x86/CPU/AMD: Get rid of amd_erratum_1485[]
x86/CPU/AMD: Get rid of amd_erratum_400[]
x86/CPU/AMD: Get rid of amd_erratum_383[]
x86/CPU/AMD: Get rid of amd_erratum_1054[]
x86/CPU/AMD: Move the DIV0 bug detection to the Zen1 init function
x86/CPU/AMD: Move Zenbleed check to the Zen2 init function
x86/CPU/AMD: Rename init_amd_zn() to init_amd_zen_common()
x86/CPU/AMD: Call the spectral chicken in the Zen2 init function
x86/CPU/AMD: Move erratum 1076 fix into the Zen1 init function
x86/CPU/AMD: Move the Zen3 BTC_NO detection to the Zen3 init function
x86/CPU/AMD: Carve out the erratum 1386 fix
x86/CPU/AMD: Add ZenX generations flags
x86/cpu/intel_epb: Don't rely on link order
x86/barrier: Do not serialize MSR accesses on AMD
- Move the SEV C-bit verification to the BSP as it needs to happen only
once and not on every AP
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmWbIXUACgkQEsHwGGHe
VUocDQ//cxfdJuF+Srww/WgpXSc2EmT6OAlSX8R1S89PD5PPDWJq3U8dE7gScZnH
bjcFj7CMkyFHe4yPkDljdS4+7Zrt2ilbzuojdnCjByd4aL5fCsZJCjWlVeaR/0ay
3lkr+fX3GtAl2njOBv9Fc0E7qh/E8zHRNcIRIS6Lz+LY5ziNqU9IGZruESQKkR2B
GokNKqWfqFhJaifEe4F+AdWRbZaEZlli6Gttn7T+QPPRk9LPCaxqkHT0BbIbUqnr
eMdvPZucWZiqNe1C9zxKeRyzqc2nPXbvi1QkCundXrCYwRuF4+sQiyM5yEMmStdl
wQ/kfbLJFFS96RZ408UNrvSJHYulFRb5uNj0M6lIc2KlQt2PCPhz2+MsipjWMYEP
7qz5KGkiSyosTZ+ZT0jw1G++KtRcoG/brXFKXGuu9I2yi0OMOtNdCXNH9SyTJDT3
d6nk5t2xloPH8qLfAb9BK7uVINSSlu/f/3nuoKqlHPSYcZoK7kt0BW4DJNb8+iYO
XhdmCpvYIfMvyaLjFYypimE7AnqgafUS/zVQCfnIF6BJveGj6AI/Xv0yubwErbuK
ijcrz2x9LlwgoAGMClePFWwhmRfg1M3cRRD4VekH4kIC6pcrP2Wk3Dl2VgaojEdg
F4qE+RB1x4AO5wkiYLtHACPZQ7i7yzLLGOhBTxFjgXz0EBMwTYU=
=zKez
-----END PGP SIGNATURE-----
Merge tag 'x86_sev_for_v6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 SEV updates from Borislav Petkov:
- Convert the sev-guest plaform ->remove callback to return void
- Move the SEV C-bit verification to the BSP as it needs to happen only
once and not on every AP
* tag 'x86_sev_for_v6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
virt: sev-guest: Convert to platform remove callback returning void
x86/sev: Do the C-bit verification only on the BSP
infrastructure and remove the former
- Misc other improvements
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmWbIJ0ACgkQEsHwGGHe
VUrOSg//duslGRzbLkrjijYmlb9C9yjdf+7611biRNExxlOUbQvgMX8O5/qnvfhk
ddqcyRJlg6GeooMI3ch25KRqdUbcgI5Zq1ftG7hFiX94gIiWhiHibYxtmnNGsDuy
sq4R3nxqZxpTh2Seajh9AfXS4uINu1wsedLxuSyNiCBk2/BB7zqC3aPCyAtjJDdT
tpPN+dx5s/rmPkdCcdFxwAgSxsohLjZgktdZIUCbVM+blbghDgPV1wjhCUimxUYx
2woqS0zgOZb6eRcZgdXkuaNvOqODOwIZo3iebr6XSIb49YCC/KhuUlDDr/ZZGV3a
CeMR42sa2CEdl0YHsnHEm8LswpUVqL6FBfFnC+TP0bAGRC+L4pNEaS6eQk+ktWf/
+k6bvuBhuPgZG5KdxgQiJB8wogDxCXqQtI7MNLI0IuBS7GC2hjHb35WB6Qb8i+EB
ffyIJYhK0shpVzo9H1efosmLfO+i2RPn5YZiC+S27MVqGbYEkW6713hr56vp58mK
UaxXWoTnfRR9O5Hyf0xEf/qjZFO2zpmQRgXIM2ECzLxn7GNvsPtcHsYJmLJyeUXf
749glVHz8Ue9HsDTCQB35Z9yRT4cVuz2FclwEt51Qlta8QtX8MRaL/l+F3FOrHoo
CPgo/fzSav3Z/XpQpjbC7vI886azQivuX5i5J9LhO5IDc8O3OMs=
=Q5U5
-----END PGP SIGNATURE-----
Merge tag 'x86_paravirt_for_v6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 paravirt updates from Borislav Petkov:
- Replace the paravirt patching functionality using the alternatives
infrastructure and remove the former
- Misc other improvements
* tag 'x86_paravirt_for_v6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/alternative: Correct feature bit debug output
x86/paravirt: Remove no longer needed paravirt patching code
x86/paravirt: Switch mixed paravirt/alternative calls to alternatives
x86/alternative: Add indirect call patching
x86/paravirt: Move some functions and defines to alternative.c
x86/paravirt: Introduce ALT_NOT_XEN
x86/paravirt: Make the struct paravirt_patch_site packed
x86/paravirt: Use relative reference for the original instruction offset
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmWZaMsACgkQEsHwGGHe
VUpiig//cDmMbajaeInAVVEilmJnw+YYxKCpvgOGdUnL+dAxBJ2SehVeEGZLiMRs
79o5X2In+tdTjXRH1oTxWJWSMx9MW4aGbDbgZDAJoPigdZYkFoWE2p81t5n11LNt
xjMNcnGMbns1aNCdBqGlg0baQ/xJavdFPndziq7kUMS/f3QLRm8Jp7MA+zUkDgCf
DPuJcyFVYq7ffmc9VuVEsrh6yNREOj88Ek1tatDobbOd6LA/a3N/aOjosEwCS9W1
qBVcMSaNt34ySzHV4sd/Uw8Dj3LXmxe2a51O+nsy+AjjkmHPYU9F0jSK4UqZgI3X
+NPl9o8eNz8ACnK9oF/OLogoK3R9ED8vsok0r8OdvziSpi0XXaAJFGJCIYYcwJWb
5aJTVtwKCoOhdV80NIcwjY/QJLTji4LKK46w+HaHy8wNQhAWY3Kt3P4XAn7ERTh6
vBjeeFQtJLmE+KKYcNC90VGf/efztN9AdtYxwdb8JkkFZwOyc7M2MPV74YvQI0iz
O0FCf2PWbQ0hKXwsVvfwn5bZoNXyIcvRc+rGpYi0NaGaiUcq53JY00NateOyhG1Z
ebN30AWeivqW4Mwz3izKddmeOwniWpCbbJHR8E3KGZH/Uy6S9F6UQY14y0NhucKq
RWTwFs5+YWB6rsmJRtUPgkX18hRVl5AjURcBXK7WbR0cONntXHs=
=0Ewr
-----END PGP SIGNATURE-----
Merge tag 'x86_microcode_for_v6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 microcode updates from Borislav Petkov:
- Correct minor issues after the microcode revision reporting
sanitization
* tag 'x86_microcode_for_v6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/microcode/intel: Set new revision only after a successful update
x86/microcode/intel: Remove redundant microcode late updated message
kvm_guest_cpu_offline() tries to disable kvmclock regardless if it is
present in the VM. It leads to write to a MSR that doesn't exist on some
configurations, namely in TDX guest:
unchecked MSR access error: WRMSR to 0x12 (tried to write 0x0000000000000000)
at rIP: 0xffffffff8110687c (kvmclock_disable+0x1c/0x30)
kvmclock enabling is gated by CLOCKSOURCE and CLOCKSOURCE2 KVM paravirt
features.
Do not disable kvmclock if it was not enabled.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Fixes: c02027b574 ("x86/kvm: Disable kvmclock on all CPUs on shutdown")
Reviewed-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Wanpeng Li <wanpengli@tencent.com>
Cc: stable@vger.kernel.org
Message-Id: <20231205004510.27164-6-kirill.shutemov@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Use SZ_1M macro instead of hardcoded 1<<20 to make code more readable.
Link: https://lkml.kernel.org/r/20240102144905.110047-3-ytcoode@gmail.com
Signed-off-by: Yuntao Wang <ytcoode@gmail.com>
Acked-by: Baoquan He <bhe@redhat.com>
Cc: Borislav Petkov (AMD) <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Hari Bathini <hbathini@linux.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Sourabh Jain <sourabhjain@linux.ibm.com>
Cc: Takashi Iwai <tiwai@suse.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "crash: Some cleanups and fixes", v2.
This patchset includes two cleanups and one fix.
This patch (of 3):
The image parameter is no longer in use, remove it. Also, tidy up the
code formatting.
Link: https://lkml.kernel.org/r/20240102144905.110047-1-ytcoode@gmail.com
Link: https://lkml.kernel.org/r/20240102144905.110047-2-ytcoode@gmail.com
Signed-off-by: Yuntao Wang <ytcoode@gmail.com>
Acked-by: Baoquan He <bhe@redhat.com>
Cc: Borislav Petkov (AMD) <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Hari Bathini <hbathini@linux.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Sourabh Jain <sourabhjain@linux.ibm.com>
Cc: Takashi Iwai <tiwai@suse.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
In
https://lore.kernel.org/r/20231206110636.GBZXBVvCWj2IDjVk4c@fat_crate.local
I wanted to adjust the alternative patching debug output to the new
changes introduced by
da0fe6e68e ("x86/alternative: Add indirect call patching")
but removed the '*' which denotes the ->x86_capability word. The correct
output should be, for example:
[ 0.230071] SMP alternatives: feat: 11*32+15, old: (entry_SYSCALL_64_after_hwframe+0x5a/0x77 (ffffffff81c000c2) len: 16), repl: (ffffffff89ae896a, len: 5) flags: 0x0
while the incorrect one says "... 1132+15" currently.
Add back the '*'.
Fixes: da0fe6e68e ("x86/alternative: Add indirect call patching")
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20231206110636.GBZXBVvCWj2IDjVk4c@fat_crate.local
kernel_ident_mapping_init() takes an exclusive memory range [pstart, pend)
where pend is not included in the range, while res represents an inclusive
memory range [start, end] where end is considered part of the range.
Passing [start, end] rather than [start, end+1) to
kernel_ident_mapping_init() may result in the identity mapping for the
end address not being set up.
For example, when res->start is equal to res->end,
kernel_ident_mapping_init() will not establish any identity mapping.
Similarly, when the value of res->end is a multiple of 2M and the page
table maps 2M pages, kernel_ident_mapping_init() will also not set up
identity mapping for res->end.
Therefore, passing res->end directly to kernel_ident_mapping_init() is
incorrect, the correct end address should be `res->end + 1`.
Link: https://lkml.kernel.org/r/20231221101702.20956-1-ytcoode@gmail.com
Signed-off-by: Yuntao Wang <ytcoode@gmail.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Borislav Petkov (AMD) <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Simon Horman <horms@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
kexec_dprintk() expects the last argument to be kbuf.memsz, but the actual
argument being passed is kbuf.bufsz.
Although these two values are currently equal, it is better to pass the
correct one, in case these two values become different in the future.
Link: https://lkml.kernel.org/r/20231220154105.215610-1-ytcoode@gmail.com
Signed-off-by: Yuntao Wang <ytcoode@gmail.com>
Cc: Baoquan He <bhe@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
When detecting an error, the current code uses kexec_dprintk() to output
log message. This is not quite appropriate as kexec_dprintk() is mainly
used for outputting debugging messages, rather than error messages.
Replace kexec_dprintk() with pr_err(). This also makes the output method
for this error log align with the output method for other error logs in
this function.
Additionally, the last return statement in set_page_address() is
unnecessary, remove it.
Link: https://lkml.kernel.org/r/20231220030124.149160-1-ytcoode@gmail.com
Signed-off-by: Yuntao Wang <ytcoode@gmail.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Borislav Petkov (AMD) <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
We're trying to get sched.h down to more or less just types only, not
code - rseq can live in its own header.
This helps us kill the dependency on preempt.h in sched.h.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The expression `mstart + resource_size(res) - 1` is actually equivalent to
`res->end`, simplify the logic of this function to improve readability.
Link: https://lkml.kernel.org/r/20231212150506.31711-1-ytcoode@gmail.com
Signed-off-by: Yuntao Wang <ytcoode@gmail.com>
Acked-by: Baoquan He <bhe@redhat.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Simon Horman <horms@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Then when specifying '-d' for kexec_file_load interface, loaded locations
of kernel/initrd/cmdline etc can be printed out to help debug.
Here replace pr_debug() with the newly added kexec_dprintk() in kexec_file
loading related codes.
And also print out e820 memmap passed to 2nd kernel just as kexec_load
interface has been doing.
Link: https://lkml.kernel.org/r/20231213055747.61826-4-bhe@redhat.com
Signed-off-by: Baoquan He <bhe@redhat.com>
Cc: Conor Dooley <conor@kernel.org>
Cc: Joe Perches <joe@perches.com>
Cc: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The D/B size flag for the 32-bit percpu GDT entry was not set.
The Intel manual (vol 3, section 3.4.5) only specifies the meaning of
this flag for three cases:
1) code segments used for %cs -- doesn't apply here
2) stack segments used for %ss -- doesn't apply
3) expand-down data segments -- but we don't have the expand-down flag
set, so it also doesn't apply here
The flag likely doesn't do anything here, although the manual does also
say: "This flag should always be set to 1 for 32-bit code and data
segments [...]" so we should probably do it anyway.
Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20231219151200.2878271-6-vegard.nossum@oracle.com
We have no known use for having the CPU track whether GDT descriptors
have been accessed or not.
Simplify the code by adding the flag to the common flags and removing
it everywhere else.
Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20231219151200.2878271-5-vegard.nossum@oracle.com
Actually replace the numeric values by the new symbolic values.
I used this to find all the existing users of the GDT_ENTRY*() macros:
$ git grep -P 'GDT_ENTRY(_INIT)?\('
Some of the lines will exceed 80 characters, but some of them will be
shorter again in the next couple of patches.
Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20231219151200.2878271-4-vegard.nossum@oracle.com
We'd like to replace all the magic numbers in various GDT descriptors
with new, semantically meaningful, symbolic values.
In order to be able to verify that the change doesn't cause any actual
changes to the compiled binary code, I've split the change into two
patches:
- Part 1 (this commit): everything _but_ actually replacing the numbers
- Part 2 (the following commit): _only_ replacing the numbers
The reason we need this split for verification is that including new
headers causes some spurious changes to the object files, mostly line
number changes in the debug info but occasionally other subtle codegen
changes.
Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20231219151200.2878271-3-vegard.nossum@oracle.com
This function is now called from a few places that are no __init_or_module,
resulting a link time warning:
WARNING: modpost: vmlinux: section mismatch in reference: patch_dest+0x8a (section: .text) -> apply_relocation (section: .init.text)
Remove the annotation here.
[ mingo: Also sync up add_nop() with these changes. ]
Fixes: 17bce3b2ae ("x86/callthunks: Handle %rip-relative relocations in call thunk template")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Uros Bizjak <ubizjak@gmail.com>
Link: https://lore.kernel.org/r/20231204072856.1033621-1-arnd@kernel.org
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE+soXsSLHKoYyzcli6rmadz2vbToFAmWAz2EACgkQ6rmadz2v
bToqrw/9EwroZCc8GEHOKAlb/fzrMvn92rLo0ZW/cGN84QJPnx4zM6Zo0+fgLaaN
oqqztwMUwdzGC3uX3FfVXaaLKbJ/MeHeL9BXFZNW8zkRHciw4R7kIBhOdPnHyET7
uT+rQ4xPe1Mt7e9PjepKlSL5mEsxWfBkdUgsdn19Z2Vjdfr9mZMhYWYMJGcfTCD1
TwxHKBPhq5fN3IsshmMBB8IrRp1HStUKb65MgZ4dI22LJXxTsFkx5XMFXcmuqvkH
NhKj8jDcPEEh31bYcb6aG2Z4onw5F2lquygjk1Qyy5cyw45m/ipJKAXKdAyvJG+R
VZCWOET/9wbRwFSK5wxwihCuKghFiofK52i2PcGtXZh0PCouyZZneSJOKM0yVWKO
BvuJBxK4ETRnQyN6ZxhuJiEXG3/YMBBhyR2TX1LntVK9ct/k7qFVzATG49J39/sR
SYMbptBRj4a5oMJ1qn0nFVEDFkg0jTnTDNnsEpcz60Ayt6EsJ1XosO5yz2huf861
xgRMTKMseyG1/uV45tQ8ZPzbSPpBxjUi9Dl3coYsIm1a+y6clWUXcarONY5KVrpS
CR98DuFgl+E7dXuisd/Kz2p2KxxSPq8nytsmLlgOvrUqhwiXqB+TKN8EHgIapVOt
l1A5LrzXFTcGlT9MlaWBqEIy83Bu1nqQqbxrAFOE0k8A5jomXaw=
=stU2
-----END PGP SIGNATURE-----
Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Alexei Starovoitov says:
====================
pull-request: bpf-next 2023-12-18
This PR is larger than usual and contains changes in various parts
of the kernel.
The main changes are:
1) Fix kCFI bugs in BPF, from Peter Zijlstra.
End result: all forms of indirect calls from BPF into kernel
and from kernel into BPF work with CFI enabled. This allows BPF
to work with CONFIG_FINEIBT=y.
2) Introduce BPF token object, from Andrii Nakryiko.
It adds an ability to delegate a subset of BPF features from privileged
daemon (e.g., systemd) through special mount options for userns-bound
BPF FS to a trusted unprivileged application. The design accommodates
suggestions from Christian Brauner and Paul Moore.
Example:
$ sudo mkdir -p /sys/fs/bpf/token
$ sudo mount -t bpf bpffs /sys/fs/bpf/token \
-o delegate_cmds=prog_load:MAP_CREATE \
-o delegate_progs=kprobe \
-o delegate_attachs=xdp
3) Various verifier improvements and fixes, from Andrii Nakryiko, Andrei Matei.
- Complete precision tracking support for register spills
- Fix verification of possibly-zero-sized stack accesses
- Fix access to uninit stack slots
- Track aligned STACK_ZERO cases as imprecise spilled registers.
It improves the verifier "instructions processed" metric from single
digit to 50-60% for some programs.
- Fix verifier retval logic
4) Support for VLAN tag in XDP hints, from Larysa Zaremba.
5) Allocate BPF trampoline via bpf_prog_pack mechanism, from Song Liu.
End result: better memory utilization and lower I$ miss for calls to BPF
via BPF trampoline.
6) Fix race between BPF prog accessing inner map and parallel delete,
from Hou Tao.
7) Add bpf_xdp_get_xfrm_state() kfunc, from Daniel Xu.
It allows BPF interact with IPSEC infra. The intent is to support
software RSS (via XDP) for the upcoming ipsec pcpu work.
Experiments on AWS demonstrate single tunnel pcpu ipsec reaching
line rate on 100G ENA nics.
8) Expand bpf_cgrp_storage to support cgroup1 non-attach, from Yafang Shao.
9) BPF file verification via fsverity, from Song Liu.
It allows BPF progs get fsverity digest.
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (164 commits)
bpf: Ensure precise is reset to false in __mark_reg_const_zero()
selftests/bpf: Add more uprobe multi fail tests
bpf: Fail uprobe multi link with negative offset
selftests/bpf: Test the release of map btf
s390/bpf: Fix indirect trampoline generation
selftests/bpf: Temporarily disable dummy_struct_ops test on s390
x86/cfi,bpf: Fix bpf_exception_cb() signature
bpf: Fix dtor CFI
cfi: Add CFI_NOSEAL()
x86/cfi,bpf: Fix bpf_struct_ops CFI
x86/cfi,bpf: Fix bpf_callback_t CFI
x86/cfi,bpf: Fix BPF JIT call
cfi: Flip headers
selftests/bpf: Add test for abnormal cnt during multi-kprobe attachment
selftests/bpf: Don't use libbpf_get_error() in kprobe_multi_test
selftests/bpf: Add test for abnormal cnt during multi-uprobe attachment
bpf: Limit the number of kprobes when attaching program to multiple kprobes
bpf: Limit the number of uprobes when attaching program to multiple uprobes
bpf: xdp: Register generic_kfunc_set with XDP programs
selftests/bpf: utilize string values for delegate_xxx mount options
...
====================
Link: https://lore.kernel.org/r/20231219000520.34178-1-alexei.starovoitov@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The recent fix to ignore invalid x2APIC entries inadvertently broke
systems with creative MADT APIC tables. The affected systems have APIC
MADT tables where all entries have invalid APIC IDs (0xFF), which means
they register exactly zero CPUs.
But the condition to ignore the entries of APIC IDs < 255 in the X2APIC
MADT table is solely based on the count of MADT APIC table entries.
As a consequence, the affected machines enumerate no secondary CPUs at
all because the APIC table has entries and therefore the X2APIC table
entries with APIC IDs < 255 are ignored.
Change the condition so that the APIC table preference for APIC IDs <
255 only becomes effective when the APIC table has valid APIC ID
entries.
IOW, an APIC table full of invalid APIC IDs is considered to be empty
which in consequence enables the X2APIC table entries with a APIC ID
< 255 and restores the expected behaviour.
Fixes: ec9aedb2aa ("x86/acpi: Ignore invalid x2APIC entries")
Reported-by: John Sperbeck <jsperbeck@google.com>
Reported-by: Andres Freund <andres@anarazel.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/169953729188.3135.6804572126118798018.tip-bot2@tip-bot2
Specs don't say anything about UIP being cleared within 10ms. They
only say that UIP won't occur for another 244uS. If a long NMI occurs
while UIP is still updating it might not be possible to get valid
data in 10ms.
This has been observed in the wild that around s2idle some calls can
take up to 480ms before UIP is clear.
Adjust callers from outside an interrupt context to wait for up to a
1s instead of 10ms.
Cc: <stable@vger.kernel.org> # 6.1.y
Fixes: ec5895c0f2 ("rtc: mc146818-lib: extract mc146818_avoid_UIP")
Reported-by: Carsten Hatger <xmb8dsv4@gmail.com>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=217626
Tested-by: Mateusz Jończyk <mat.jonczyk@o2.pl>
Reviewed-by: Mateusz Jończyk <mat.jonczyk@o2.pl>
Acked-by: Mateusz Jończyk <mat.jonczyk@o2.pl>
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Link: https://lore.kernel.org/r/20231128053653.101798-5-mario.limonciello@amd.com
Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
The UIP timeout is hardcoded to 10ms for all RTC reads, but in some
contexts this might not be enough time. Add a timeout parameter to
mc146818_get_time() and mc146818_get_time_callback().
If UIP timeout is configured by caller to be >=100 ms and a call
takes this long, log a warning.
Make all callers use 10ms to ensure no functional changes.
Cc: <stable@vger.kernel.org> # 6.1.y
Fixes: ec5895c0f2 ("rtc: mc146818-lib: extract mc146818_avoid_UIP")
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Tested-by: Mateusz Jończyk <mat.jonczyk@o2.pl>
Reviewed-by: Mateusz Jończyk <mat.jonczyk@o2.pl>
Acked-by: Mateusz Jończyk <mat.jonczyk@o2.pl>
Link: https://lore.kernel.org/r/20231128053653.101798-4-mario.limonciello@amd.com
Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
BPF struct_ops uses __arch_prepare_bpf_trampoline() to write
trampolines for indirect function calls. These tramplines much have
matching CFI.
In order to obtain the correct CFI hash for the various methods, add a
matching structure that contains stub functions, the compiler will
generate correct CFI which we can pilfer for the trampolines.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20231215092707.566977112@infradead.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Where the main BPF program is expected to match bpf_func_t,
sub-programs are expected to match bpf_callback_t.
This fixes things like:
tools/testing/selftests/bpf/progs/bloom_filter_bench.c:
bpf_for_each_map_elem(&array_map, bloom_callback, &data, 0);
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20231215092707.451956710@infradead.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
The current BPF call convention is __nocfi, except when it calls !JIT things,
then it calls regular C functions.
It so happens that with FineIBT the __nocfi and C calling conventions are
incompatible. Specifically __nocfi will call at func+0, while FineIBT will have
endbr-poison there, which is not a valid indirect target. Causing #CP.
Notably this only triggers on IBT enabled hardware, which is probably why this
hasn't been reported (also, most people will have JIT on anyway).
Implement proper CFI prologues for the BPF JIT codegen and drop __nocfi for
x86.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20231215092707.345270396@infradead.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Normal include order is that linux/foo.h should include asm/foo.h, CFI has it
the wrong way around.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Sami Tolvanen <samitolvanen@google.com>
Link: https://lore.kernel.org/r/20231215092707.231038174@infradead.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
apply_alternatives() treats alternatives with the ALT_FLAG_NOT flag set
special as it optimizes the existing NOPs in place.
Unfortunately, this happens with interrupts enabled and does not provide any
form of core synchronization.
So an interrupt hitting in the middle of the update and using the affected code
path will observe a half updated NOP and crash and burn. The following
3 NOP sequence was observed to expose this crash halfway reliably under QEMU
32bit:
0x90 0x90 0x90
which is replaced by the optimized 3 byte NOP:
0x8d 0x76 0x00
So an interrupt can observe:
1) 0x90 0x90 0x90 nop nop nop
2) 0x8d 0x90 0x90 undefined
3) 0x8d 0x76 0x90 lea -0x70(%esi),%esi
4) 0x8d 0x76 0x00 lea 0x0(%esi),%esi
Where only #1 and #4 are true NOPs. The same problem exists for 64bit obviously.
Disable interrupts around this NOP optimization and invoke sync_core()
before re-enabling them.
Fixes: 270a69c448 ("x86/alternative: Support relocations in alternatives")
Reported-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/ZT6narvE%2BLxX%2B7Be@windriver.com
text_poke_early() does:
local_irq_save(flags);
memcpy(addr, opcode, len);
local_irq_restore(flags);
sync_core();
That's not really correct because the synchronization should happen before
interrupts are re-enabled to ensure that a pending interrupt observes the
complete update of the opcodes.
It's not entirely clear whether the interrupt entry provides enough
serialization already, but moving the sync_core() invocation into interrupt
disabled region does no harm and is obviously correct.
Fixes: 6fffacb303 ("x86/alternatives, jumplabel: Use text_poke_early() before mm_init()")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <stable@kernel.org>
Link: https://lore.kernel.org/r/ZT6narvE%2BLxX%2B7Be@windriver.com
Chris reported that a Dell PowerEdge T340 system stopped to boot when upgrading
to a kernel which contains the parallel hotplug changes. Disabling parallel
hotplug on the kernel command line makes it boot again.
It turns out that the Dell BIOS has x2APIC enabled and the boot CPU comes up in
X2APIC mode, but the APs come up inconsistently in xAPIC mode.
Parallel hotplug requires that the upcoming CPU reads out its APIC ID from the
local APIC in order to map it to the Linux CPU number.
In this particular case the readout on the APs uses the MMIO mapped registers
because the BIOS failed to enable x2APIC mode. That readout results in a page
fault because the kernel does not have the APIC MMIO space mapped when X2APIC
mode was enabled by the BIOS on the boot CPU and the kernel switched to X2APIC
mode early. That page fault can't be handled on the upcoming CPU that early and
results in a silent boot failure.
If parallel hotplug is disabled the system boots because in that case the APIC
ID read is not required as the Linux CPU number is provided to the AP in the
smpboot control word. When the kernel uses x2APIC mode then the APs are
switched to x2APIC mode too slightly later in the bringup process, but there is
no reason to do it that late.
Cure the BIOS bogosity by checking in the parallel bootup path whether the
kernel uses x2APIC mode and if so switching over the APs to x2APIC mode before
the APIC ID readout.
Fixes: 0c7ffa32db ("x86/smpboot/64: Implement arch_cpuhp_init_parallel_bringup() and enable it")
Reported-by: Chris Lindee <chris.lindee@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Ashok Raj <ashok.raj@intel.com>
Tested-by: Chris Lindee <chris.lindee@gmail.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/CA%2B2tU59853R49EaU_tyvOZuOTDdcU0RshGyydccp9R1NX9bEeQ@mail.gmail.com
Add an Intel specific hook into machine_check_poll() to keep track of
per-CPU, per-bank corrected error logs (with a stub for the
CONFIG_MCE_INTEL=n case).
When a storm is observed the rate of interrupts is reduced by setting
a large threshold value for this bank in IA32_MCi_CTL2. This bank is
added to the bitmap of banks for this CPU to poll. The polling rate is
increased to once per second.
When a storm ends reset the threshold in IA32_MCi_CTL2 back to 1, remove
the bank from the bitmap for polling, and change the polling rate back
to the default.
If a CPU with banks in storm mode is taken offline, the new CPU that
inherits ownership of those banks takes over management of storm(s) in
the inherited bank(s).
The cmci_discover() function was already very large. These changes
pushed it well over the top. Refactor with three helper functions to
bring it back under control.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20231115195450.12963-4-tony.luck@intel.com
This is the core functionality to track CMCI storms at the machine check
bank granularity. Subsequent patches will add the vendor specific hooks
to supply input to the storm detection and take actions on the start/end
of a storm.
machine_check_poll() is called both by the CMCI interrupt code, and for
periodic polls from a timer. Add a hook in this routine to maintain
a bitmap history for each bank showing whether the bank logged an
corrected error or not each time it is polled.
In normal operation the interval between polls of these banks determines
how far to shift the history. The 64 bit width corresponds to about one
second.
When a storm is observed a CPU vendor specific action is taken to reduce
or stop CMCI from the bank that is the source of the storm. The bank is
added to the bitmap of banks for this CPU to poll. The polling rate is
increased to once per second. During a storm each bit in the history
indicates the status of the bank each time it is polled. Thus the
history covers just over a minute.
Declare a storm for that bank if the number of corrected interrupts seen
in that history is above some threshold (defined as 5 in this series,
could be tuned later if there is data to suggest a better value).
A storm on a bank ends if enough consecutive polls of the bank show no
corrected errors (defined as 30, may also change). That calls the CPU
vendor specific function to revert to normal operational mode, and
changes the polling rate back to the default.
[ bp: Massage. ]
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20231115195450.12963-3-tony.luck@intel.com
When a "storm" of corrected machine check interrupts (CMCI) is detected
this code mitigates by disabling CMCI interrupt signalling from all of
the banks owned by the CPU that saw the storm.
There are problems with this approach:
1) It is very coarse grained. In all likelihood only one of the banks
was generating the interrupts, but CMCI is disabled for all. This
means Linux may delay seeing and processing errors logged from other
banks.
2) Although CMCI stands for Corrected Machine Check Interrupt, it is
also used to signal when an uncorrected error is logged. This is
a problem because these errors should be handled in a timely manner.
Delete all this code in preparation for a finer grained solution.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Yazen Ghannam <yazen.ghannam@amd.com>
Tested-by: Yazen Ghannam <yazen.ghannam@amd.com>
Link: https://lore.kernel.org/r/20231115195450.12963-2-tony.luck@intel.com
There's no need to do it on every AP.
The C-bit value read on the BSP and also verified there, is used
everywhere from now on.
No functional changes - just a bit faster booting APs.
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/r/20231130132601.10317-1-bp@alien8.de
There is no need to use TESTL when checking the least-significant bit
with a TEST instruction. Use TESTB, which is three bytes shorter:
f6 05 00 00 00 00 01 testb $0x1,0x0(%rip)
vs:
f7 05 00 00 00 00 01 testl $0x1,0x0(%rip)
00 00 00
for the same effect.
No functional changes intended.
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20231109201032.4439-1-ubizjak@gmail.com
The first few generations of TDX hardware have an erratum. Triggering
it in Linux requires some kind of kernel bug involving relatively exotic
memory writes to TDX private memory and will manifest via
spurious-looking machine checks when reading the affected memory.
Make an effort to detect these TDX-induced machine checks and spit out
a new blurb to dmesg so folks do not think their hardware is failing.
== Background ==
Virtually all kernel memory accesses operations happen in full
cachelines. In practice, writing a "byte" of memory usually reads a 64
byte cacheline of memory, modifies it, then writes the whole line back.
Those operations do not trigger this problem.
This problem is triggered by "partial" writes where a write transaction
of less than cacheline lands at the memory controller. The CPU does
these via non-temporal write instructions (like MOVNTI), or through
UC/WC memory mappings. The issue can also be triggered away from the
CPU by devices doing partial writes via DMA.
== Problem ==
A partial write to a TDX private memory cacheline will silently "poison"
the line. Subsequent reads will consume the poison and generate a
machine check. According to the TDX hardware spec, neither of these
things should have happened.
To add insult to injury, the Linux machine code will present these as a
literal "Hardware error" when they were, in fact, a software-triggered
issue.
== Solution ==
In the end, this issue is hard to trigger. Rather than do something
rash (and incomplete) like unmap TDX private memory from the direct map,
improve the machine check handler.
Currently, the #MC handler doesn't distinguish whether the memory is
TDX private memory or not but just dump, for instance, below message:
[...] mce: [Hardware Error]: CPU 147: Machine Check Exception: f Bank 1: bd80000000100134
[...] mce: [Hardware Error]: RIP 10:<ffffffffadb69870> {__tlb_remove_page_size+0x10/0xa0}
...
[...] mce: [Hardware Error]: Run the above through 'mcelog --ascii'
[...] mce: [Hardware Error]: Machine check: Data load in unrecoverable area of kernel
[...] Kernel panic - not syncing: Fatal local machine check
Which says "Hardware Error" and "Data load in unrecoverable area of
kernel".
Ideally, it's better for the log to say "software bug around TDX private
memory" instead of "Hardware Error". But in reality the real hardware
memory error can happen, and sadly such software-triggered #MC cannot be
distinguished from the real hardware error. Also, the error message is
used by userspace tool 'mcelog' to parse, so changing the output may
break userspace.
So keep the "Hardware Error". The "Data load in unrecoverable area of
kernel" is also helpful, so keep it too.
Instead of modifying above error log, improve the error log by printing
additional TDX related message to make the log like:
...
[...] mce: [Hardware Error]: Machine check: Data load in unrecoverable area of kernel
[...] mce: [Hardware Error]: Machine Check: TDX private memory error. Possible kernel bug.
Adding this additional message requires determination of whether the
memory page is TDX private memory. There is no existing infrastructure
to do that. Add an interface to query the TDX module to fill this gap.
== Impact ==
This issue requires some kind of kernel bug to trigger.
TDX private memory should never be mapped UC/WC. A partial write
originating from these mappings would require *two* bugs, first mapping
the wrong page, then writing the wrong memory. It would also be
detectable using traditional memory corruption techniques like
DEBUG_PAGEALLOC.
MOVNTI (and friends) could cause this issue with something like a simple
buffer overrun or use-after-free on the direct map. It should also be
detectable with normal debug techniques.
The one place where this might get nasty would be if the CPU read data
then wrote back the same data. That would trigger this problem but
would not, for instance, set off mechanisms like slab redzoning because
it doesn't actually corrupt data.
With an IOMMU at least, the DMA exposure is similar to the UC/WC issue.
TDX private memory would first need to be incorrectly mapped into the
I/O space and then a later DMA to that mapping would actually cause the
poisoning event.
[ dhansen: changelog tweaks ]
Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Yuan Yao <yuan.yao@intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/all/20231208170740.53979-18-dave.hansen%40intel.com
Add a synthetic feature flag specifically for first generation Zen
machines. There's need to have a generic flag for all Zen generations so
make X86_FEATURE_ZEN be that flag.
Fixes: 30fa92832f ("x86/CPU/AMD: Add ZenX generations flags")
Suggested-by: Brian Gerst <brgerst@gmail.com>
Suggested-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/dc3835e3-0731-4230-bbb9-336bbe3d042b@amd.com
mm_get_enqcmd_pasid() should be used by architecture code and closely
related to learn the PASID value that the x86 ENQCMD operation should
use for the mm.
For the moment SMMUv3 uses this without any connection to ENQCMD, it
will be cleaned up similar to how the prior patch made VT-d use the
PASID argument of set_dev_pasid().
The motivation is to replace mm->pasid with an iommu private data
structure that is introduced in a later patch.
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Tina Zhang <tina.zhang@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20231027000525.1278806-4-tina.zhang@intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Linus suggested that the kconfig here is confusing:
https://lore.kernel.org/all/CAHk-=wgUiAtiszwseM1p2fCJ+sC4XWQ+YN4TanFhUgvUqjr9Xw@mail.gmail.com/
Let's break it into three kconfigs controlling distinct things:
- CONFIG_IOMMU_MM_DATA controls if the mm_struct has the additional
fields for the IOMMU. Currently only PASID, but later patches store
a struct iommu_mm_data *
- CONFIG_ARCH_HAS_CPU_PASID controls if the arch needs the scheduling bit
for keeping track of the ENQCMD instruction. x86 will select this if
IOMMU_SVA is enabled
- IOMMU_SVA controls if the IOMMU core compiles in the SVA support code
for iommu driver use and the IOMMU exported API
This way ARM will not enable CONFIG_ARCH_HAS_CPU_PASID
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20231027000525.1278806-2-tina.zhang@intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Use current_top_of_stack() helper in sync_regs() and vc_switch_off_ist()
instead of open-coding the reading of the top_of_stack percpu variable
explicitly.
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20231204210320.114429-2-ubizjak@gmail.com
Now that paravirt is using the alternatives patching infrastructure,
remove the paravirt patching code.
Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20231210062138.2417-6-jgross@suse.com
Instead of stacking alternative and paravirt patching, use the new
ALT_FLAG_CALL flag to switch those mixed calls to pure alternative
handling.
Eliminate the need to be careful regarding the sequence of alternative
and paravirt patching.
[ bp: Touch up commit message. ]
Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20231210062138.2417-5-jgross@suse.com
In order to prepare replacing of paravirt patching with alternative
patching, add the capability to replace an indirect call with a direct
one.
This is done via a new flag ALT_FLAG_CALL as the target of the CALL
instruction needs to be evaluated using the value of the location
addressed by the indirect call.
For convenience, add a macro for a default CALL instruction. In case it
is being used without the new flag being set, it will result in a BUG()
when being executed. As in most cases, the feature used will be
X86_FEATURE_ALWAYS so add another macro ALT_CALL_ALWAYS usable for the
flags parameter of the ALTERNATIVE macros.
For a complete replacement, handle the special cases of calling a nop
function and an indirect call of NULL the same way as paravirt does.
[ bp: Massage commit message, fixup the debug output and clarify flow
more. ]
Co-developed-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20231210062138.2417-4-jgross@suse.com
As a preparation for replacing paravirt patching completely by
alternative patching, move some backend functions and #defines to
the alternatives code and header.
Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20231129133332.31043-3-jgross@suse.com
callback so that the callback runs only on AMD
- Make sure SEV-ES protocol negotiation happens only once and on the BSP
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmV1l4MACgkQEsHwGGHe
VUoESQ//WxIdMb/ATNtBRXEM3GRzd/X2he7kXRGVSUeJZMxEOnpVEmo4NtAzXwI7
E2yOk4EJ6QlN+p6/eea5X2FJ5IkpiDKj1jVvU6NxTJiF2I5t1XUjdcXiv05Is+bP
5Cx1ysaznMQYvZUler15NkbLCZvVX5/1V9FthnLpS1d1cvem0NKoH+hVUQHkdHbo
/QNzEy+dyii8mp+t6D4QEfjq2Ab7bRbx1FufJcvHy9tFt3u8JsuWI/kHLm5Jk3VN
n7Efynluw1wJC3l/18QzrxUCL5LVwrs2LxC0GH42dq9vBp0EJclgsBOh+43cm+Ey
ffowP3uJxayngdigRA4ANpfbJL7GHjEtoYlxk/ooI4l2ccNtSdCXn86UVI9q3BiT
60j/qV9r/o0Wl46/BVrT6SXhHlcKm4YchUa53NbOHcU3hBpUuM78WAcqs+63voMq
kFqXQVihlJWYYytyFKw6Ng7O6SHR/SMC3uSygG0q2KebiUlVMZSheJin2Od+shhC
skD0z2EUzUGuNICFHuxwA2uu6jr/KvkEwDfjczwdE2haUfZLCnOfTjxZmqBbuzJ6
l7lfGjBCpGGtGettjrM8WAuHnL6MzNDI2jVDV3K6bzqQVzk9wjFQjIG9X7c8XjN/
E2IyQxNuiIKVj3GfFOVM9lgJ4C8qwNDa2A46Kad2dKHioiFfMuM=
=uIo2
-----END PGP SIGNATURE-----
Merge tag 'x86_urgent_for_v6.7_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Borislav Petkov:
- Add a forgotten CPU vendor check in the AMD microcode post-loading
callback so that the callback runs only on AMD
- Make sure SEV-ES protocol negotiation happens only once and on the
BSP
* tag 'x86_urgent_for_v6.7_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/CPU/AMD: Check vendor in the AMD microcode callback
x86/sev: Fix kernel crash due to late update to read-only ghcb_version
Start to transit out the "multi-steps" to initialize the TDX module.
TDX provides increased levels of memory confidentiality and integrity.
This requires special hardware support for features like memory
encryption and storage of memory integrity checksums. Not all memory
satisfies these requirements.
As a result, TDX introduced the concept of a "Convertible Memory Region"
(CMR). During boot, the firmware builds a list of all of the memory
ranges which can provide the TDX security guarantees. The list of these
ranges is available to the kernel by querying the TDX module.
CMRs tell the kernel which memory is TDX compatible. The kernel needs
to build a list of memory regions (out of CMRs) as "TDX-usable" memory
and pass them to the TDX module. Once this is done, those "TDX-usable"
memory regions are fixed during module's lifetime.
To keep things simple, assume that all TDX-protected memory will come
from the page allocator. Make sure all pages in the page allocator
*are* TDX-usable memory.
As TDX-usable memory is a fixed configuration, take a snapshot of the
memory configuration from memblocks at the time of module initialization
(memblocks are modified on memory hotplug). This snapshot is used to
enable TDX support for *this* memory configuration only. Use a memory
hotplug notifier to ensure that no other RAM can be added outside of
this configuration.
This approach requires all memblock memory regions at the time of module
initialization to be TDX convertible memory to work, otherwise module
initialization will fail in a later SEAMCALL when passing those regions
to the module. This approach works when all boot-time "system RAM" is
TDX convertible memory and no non-TDX-convertible memory is hot-added
to the core-mm before module initialization.
For instance, on the first generation of TDX machines, both CXL memory
and NVDIMM are not TDX convertible memory. Using kmem driver to hot-add
any CXL memory or NVDIMM to the core-mm before module initialization
will result in failure to initialize the module. The SEAMCALL error
code will be available in the dmesg to help user to understand the
failure.
Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Link: https://lore.kernel.org/all/20231208170740.53979-7-dave.hansen%40intel.com
Intel Trust Domain Extensions (TDX) protects guest VMs from malicious
host and certain physical attacks. A CPU-attested software module
called 'the TDX module' runs inside a new isolated memory range as a
trusted hypervisor to manage and run protected VMs.
Pre-TDX Intel hardware has support for a memory encryption architecture
called MKTME. The memory encryption hardware underpinning MKTME is also
used for Intel TDX. TDX ends up "stealing" some of the physical address
space from the MKTME architecture for crypto-protection to VMs. The
BIOS is responsible for partitioning the "KeyID" space between legacy
MKTME and TDX. The KeyIDs reserved for TDX are called 'TDX private
KeyIDs' or 'TDX KeyIDs' for short.
During machine boot, TDX microcode verifies that the BIOS programmed TDX
private KeyIDs consistently and correctly programmed across all CPU
packages. The MSRs are locked in this state after verification. This
is why MSR_IA32_MKTME_KEYID_PARTITIONING gets used for TDX enumeration:
it indicates not just that the hardware supports TDX, but that all the
boot-time security checks passed.
The TDX module is expected to be loaded by the BIOS when it enables TDX,
but the kernel needs to properly initialize it before it can be used to
create and run any TDX guests. The TDX module will be initialized by
the KVM subsystem when KVM wants to use TDX.
Detect platform TDX support by detecting TDX private KeyIDs.
The TDX module itself requires one TDX KeyID as the 'TDX global KeyID'
to protect its metadata. Each TDX guest also needs a TDX KeyID for its
own protection. Just use the first TDX KeyID as the global KeyID and
leave the rest for TDX guests. If no TDX KeyID is left for TDX guests,
disable TDX as initializing the TDX module alone is useless.
[ dhansen: add X86_FEATURE, replace helper function ]
Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Link: https://lore.kernel.org/all/20231208170740.53979-1-dave.hansen%40intel.com
There is no real reason to have a separate ASM entry point implementation
for the legacy INT 0x80 syscall emulation on 64-bit.
IDTENTRY provides all the functionality needed with the only difference
that it does not:
- save the syscall number (AX) into pt_regs::orig_ax
- set pt_regs::ax to -ENOSYS
Both can be done safely in the C code of an IDTENTRY before invoking any of
the syscall related functions which depend on this convention.
Aside of ASM code reduction this prepares for detecting and handling a
local APIC injected vector 0x80.
[ kirill.shutemov: More verbose comments ]
Suggested-by: Linus Torvalds <torvalds@linuxfoundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de>
Cc: <stable@vger.kernel.org> # v6.0+
Convert x86 to use the arch_cpu_is_hotpluggable() helper rather than
arch_register_cpu().
Reviewed-by: Gavin Shan <gshan@redhat.com>
Signed-off-by: "Russell King (Oracle)" <rmk+kernel@armlinux.org.uk>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/E1r5R3w-00Cszy-6k@rmk-PC.armlinux.org.uk
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Since the x86 version of arch_unregister_cpu() is the same as the weak
version, drop the x86 specific version.
Reviewed-by: Gavin Shan <gshan@redhat.com>
Signed-off-by: "Russell King (Oracle)" <rmk+kernel@armlinux.org.uk>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/E1r5R3r-00Cszs-2R@rmk-PC.armlinux.org.uk
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Now that GENERIC_CPU_DEVICES calls arch_register_cpu(), which can be
overridden by the arch code, switch over to this to allow common code
to choose when the register_cpu() call is made.
x86's struct cpus come from struct x86_cpu, which has no other members
or users. Remove this and use the version defined by common code.
This is an intermediate step to the logic being moved to drivers/acpi,
where GENERIC_CPU_DEVICES will do the work when booting with acpi=off.
This patch also has the effect of moving the registration of CPUs from
subsys to driver core initialisation, prior to any initcalls running.
----
Changes since RFC:
* Fixed the second copy of arch_register_cpu() used for non-hotplug
Changes since RFC v2:
* Remove duplicate of the weak generic arch_register_cpu(), spotted
by Jonathan Cameron. Add note about initialisation order change.
Changes since RFC v3:
* Adapt to removal of EXPORT_SYMBOL()s
Signed-off-by: James Morse <james.morse@arm.com>
Reviewed-by: Gavin Shan <gshan@redhat.com>
Signed-off-by: "Russell King (Oracle)" <rmk+kernel@armlinux.org.uk>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/E1r5R3l-00Cszm-UA@rmk-PC.armlinux.org.uk
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
arch_register_cpu() and arch_unregister_cpu() are not used by anything
that can be a module - they are used by drivers/base/cpu.c and
drivers/acpi/acpi_processor.c, neither of which can be a module.
Remove the exports.
Reviewed-by: Gavin Shan <gshan@redhat.com>
Signed-off-by: "Russell King (Oracle)" <rmk+kernel@armlinux.org.uk>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/E1r5R2r-00Csyh-7B@rmk-PC.armlinux.org.uk
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
intel_epb_init() is called as a subsys_initcall() to register cpuhp
callbacks. The callbacks make use of get_cpu_device() which will return
NULL unless register_cpu() has been called. register_cpu() is called
from topology_init(), which is also a subsys_initcall().
This is fragile. Moving the register_cpu() to a different
subsys_initcall() leads to a NULL dereference during boot.
Make intel_epb_init() a late_initcall(), user-space can't provide a
policy before this point anyway.
Signed-off-by: James Morse <james.morse@arm.com>
Reviewed-by: Gavin Shan <gshan@redhat.com>
Signed-off-by: "Russell King (Oracle)" <rmk+kernel@armlinux.org.uk>
Acked-by: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/E1r5R2m-00Csyb-2S@rmk-PC.armlinux.org.uk
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This was meant to be done only when early microcode got updated
successfully. Move it into the if-branch.
Also, make sure the current revision is read unconditionally and only
once.
Fixes: 080990aa33 ("x86/microcode: Rework early revisions reporting")
Reported-by: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: Ashok Raj <ashok.raj@intel.com>
Link: https://lore.kernel.org/r/ZWjVt5dNRjbcvlzR@a4bf019067fa.jf.intel.com
Commit in Fixes added an AMD-specific microcode callback. However, it
didn't check the CPU vendor the kernel runs on explicitly.
The only reason the Zenbleed check in it didn't run on other x86 vendors
hardware was pure coincidental luck:
if (!cpu_has_amd_erratum(c, amd_zenbleed))
return;
gives true on other vendors because they don't have those families and
models.
However, with the removal of the cpu_has_amd_erratum() in
05f5f73936 ("x86/CPU/AMD: Drop now unused CPU erratum checking function")
that coincidental condition is gone, leading to the zenbleed check
getting executed on other vendors too.
Add the explicit vendor check for the whole callback as it should've
been done in the first place.
Fixes: 522b1d6921 ("x86/cpu/amd: Add a Zenbleed fix")
Cc: <stable@kernel.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20231201184226.16749-1-bp@alien8.de
After successful update, the late loading routine prints an update
summary similar to:
microcode: load: updated on 128 primary CPUs with 128 siblings
microcode: revision: 0x21000170 -> 0x21000190
Remove the redundant message in the Intel side of the driver.
[ bp: Massage commit message. ]
Signed-off-by: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/ZWjYhedNfhAUmt0k@a4bf019067fa.jf.intel.com
Use atomic_try_cmpxchg() instead of atomic_cmpxchg(*ptr, old, new) == old.
X86 CMPXCHG instruction returns success in ZF flag, so this change saves a
compare after the CMPXCHG.
Tested by building a native Fedora-38 kernel and rebooting
a 12-way SMP system using "shutdown -r" command some 100 times.
No functional change intended.
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20231123203605.3474745-2-ubizjak@gmail.com
Improve code generation in native_stop_other_cpus() a tiny bit:
smp_processor_id() accesses a per-CPU variable, so the compiler
is not able to move the call after the early exit on its own.
Also rename the "cpu" variable to a more descriptive "this_cpu", and
use 'cpu' as a separate iterator variable later in the function.
No functional change intended.
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20231123203605.3474745-1-ubizjak@gmail.com
This is a "nice-to-have" change with minor code generation benefits:
- Instruction with %rip-relative address operand is one byte shorter than
its absolute address counterpart,
- it is also compatible with position independent executable (-fpie) builds,
- it is also consistent with what the compiler emits by default when
a symbol is accessed.
No functional changes intended.
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Rafael J. Wysocki <rafael@kernel.org>
Link: https://lore.kernel.org/r/20231103104900.409470-1-ubizjak@gmail.com
Contrary to alternatives, relocations are currently not supported in
call thunk templates. Re-use the existing infrastructure from
alternative.c to allow %rip-relative relocations when copying call
thunk template from its storage location.
The patch allows unification of ASM_INCREMENT_CALL_DEPTH, which already
uses PER_CPU_VAR macro, with INCREMENT_CALL_DEPTH, used in call thunk
template, which is currently limited to use absolute address.
Reuse existing relocation infrastructure from alternative.c.,
as suggested by Peter Zijlstra.
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20231105213731.1878100-3-ubizjak@gmail.com
A write-access violation page fault kernel crash was observed while running
cpuhotplug LTP testcases on SEV-ES enabled systems. The crash was
observed during hotplug, after the CPU was offlined and the process
was migrated to different CPU. setup_ghcb() is called again which
tries to update ghcb_version in sev_es_negotiate_protocol(). Ideally this
is a read_only variable which is initialised during booting.
Trying to write it results in a pagefault:
BUG: unable to handle page fault for address: ffffffffba556e70
#PF: supervisor write access in kernel mode
#PF: error_code(0x0003) - permissions violation
[ ...]
Call Trace:
<TASK>
? __die_body.cold+0x1a/0x1f
? __die+0x2a/0x35
? page_fault_oops+0x10c/0x270
? setup_ghcb+0x71/0x100
? __x86_return_thunk+0x5/0x6
? search_exception_tables+0x60/0x70
? __x86_return_thunk+0x5/0x6
? fixup_exception+0x27/0x320
? kernelmode_fixup_or_oops+0xa2/0x120
? __bad_area_nosemaphore+0x16a/0x1b0
? kernel_exc_vmm_communication+0x60/0xb0
? bad_area_nosemaphore+0x16/0x20
? do_kern_addr_fault+0x7a/0x90
? exc_page_fault+0xbd/0x160
? asm_exc_page_fault+0x27/0x30
? setup_ghcb+0x71/0x100
? setup_ghcb+0xe/0x100
cpu_init_exception_handling+0x1b9/0x1f0
The fix is to call sev_es_negotiate_protocol() only in the BSP boot phase,
and it only needs to be done once in any case.
[ mingo: Refined the changelog. ]
Fixes: 95d33bfaa3 ("x86/sev: Register GHCB memory when SEV-SNP is active")
Suggested-by: Tom Lendacky <thomas.lendacky@amd.com>
Co-developed-by: Bo Gan <bo.gan@broadcom.com>
Signed-off-by: Bo Gan <bo.gan@broadcom.com>
Signed-off-by: Ashwin Dayanand Kamat <ashwin.kamat@broadcom.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/r/1701254429-18250-1-git-send-email-kashwindayan@vmware.com
Setting X86_BUG_AMD_E400 in init_amd() is early enough.
No functional changes.
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
Link: http://lore.kernel.org/r/20231120104152.13740-12-bp@alien8.de
Set it in init_amd_gh() unconditionally as that is the F10h init
function.
No functional changes.
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
Link: http://lore.kernel.org/r/20231120104152.13740-11-bp@alien8.de
Add X86_FEATURE flags for each Zen generation. They should be used from
now on instead of checking f/m/s.
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Link: http://lore.kernel.org/r/20231120104152.13740-2-bp@alien8.de
The long names of the SMCA banks are only used by the MCE decoder
module.
Move them out of the arch code and into the decoder module.
[ bp: Name the long names array "smca_long_names", drop local ptr in
decode_smca_error(), constify arrays. ]
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20231118193248.1296798-5-yazen.ghannam@amd.com
and remove the driver version announcement to avoid version
confusion when distros backport fixes.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmVjFKgRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1h/7hAAs5hL1KvrfdW0VpAW91MbX6mtIe7Emc8T
LCiBJtl9UngRdASUC9CGrcIZ5JIps0702gAq0qPVzk5zKxC22ySWsqMZybask+eF
d6E7amMtF+KX0wiCZSuC66StCKA08JfrUXgxvYHnxDjNqERYFmVr1QabGL1IN5lZ
KUrVUyvN8VOnzypOiQ98lXGWDJYwaV7t+IzMMh7mT5OUkoo09e6tFm7IF+NWg5xe
NYCcvZqyo0Ipld7HOjlGHYG+blFkDxJpfTby5UevZybXsPd00cxBzDSR9zs1sYeG
Kt6cgwDhfewfcM1QFVvNV/SDVsPp1BlVvMUa6Xa3vtsnWCit8zQqMbkYYWUaTcIh
yUJvtzh/xZQtxaQ8Z8SbI7EhUBOFJXoHWV9JoEe3gWsWA5thu+4iOCh/P8C2ON3n
6kLSgNQ4GAylH34MWoS84t2Jxv7XmNZljR/78ucRQrJ1JJIEA+r4sJ9hK9btxqf2
0n86StHuwtNXSQwEhDcacqUpFPLZ65Za1Y9AXc69CDuiwj3DvTVBMwjEOBbnGTrZ
dL9QOYG5gkklOx4o5ePj7RoLrzz/j6dj6idmu8FxZZ4q+QB9vvL2lRHusJnUEloE
yxR7WWOB/kyUZT7FriLHRuEP7yQNRSvLs6U7b8uXCHiGcAb2mN0fFi2m/BcJGS4A
hHA0t9WyNBM=
=eYQ4
-----END PGP SIGNATURE-----
Merge tag 'x86-urgent-2023-11-26' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 microcode fixes from Ingo Molnar:
"Fix/enhance x86 microcode version reporting: fix the bootup log spam,
and remove the driver version announcement to avoid version confusion
when distros backport fixes"
* tag 'x86-urgent-2023-11-26' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/microcode: Rework early revisions reporting
x86/microcode: Remove the driver announcement and version
intel_epb_init() is called as a subsys_initcall() to register cpuhp
callbacks. The callbacks make use of get_cpu_device() which will return
NULL unless register_cpu() has been called. register_cpu() is called
from topology_init(), which is also a subsys_initcall().
This is fragile. Moving the register_cpu() to a different
subsys_initcall() leads to a NULL dereference during boot.
Make intel_epb_init() a late_initcall(), user-space can't provide a
policy before this point anyway.
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Gavin Shan <gshan@redhat.com>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
some architectures run into a -Wmissing-prototypes warning
for trap_init()
arch/microblaze/kernel/traps.c:21:6: warning: no previous prototype for 'trap_init' [-Wmissing-prototypes]
Include the right header to avoid this consistently, removing
the extra declarations on m68k and x86 that were added as local
workarounds already.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
AMD systems generally allow MCA "simulation" where MCA registers can be
written with valid data and the full MCA handling flow can be tested by
software.
However, the platform on Scalable MCA systems, can prevent software from
writing data to the MCA registers. There is no architectural way to
determine this configuration. Therefore, the MCE injection module will
check for this behavior by writing and reading back a test status value.
This is done during module init, and the check can run on any CPU with
any valid MCA bank.
If MCA_STATUS writes are ignored by the platform, then there are no side
effects on the hardware state.
If the writes are not ignored, then the test status value will remain in
the hardware MCA_STATUS register. It is likely that the value will not
be overwritten by hardware or software, since the tested CPU and bank
are arbitrary. Therefore, the user may see a spurious, synthetic MCA
error reported whenever MCA is polled for this CPU.
Clear the test value immediately after writing it. It is very unlikely
that a valid MCA error is logged by hardware during the test. Errors
that cause an #MC won't be affected.
Fixes: 891e465a1b ("x86/mce: Check whether writes to MCA_STATUS are getting ignored")
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20231118193248.1296798-2-yazen.ghannam@amd.com
-----BEGIN PGP SIGNATURE-----
iQFHBAABCgAxFiEEIbPD0id6easf0xsudhRwX5BBoF4FAmVdgqYTHHdlaS5saXVA
a2VybmVsLm9yZwAKCRB2FHBfkEGgXhBsCACzUGLF3vOQdrmgTMymzaaOzfLJtvNW
oQ34FwMJMOAyJ6FxM12IJPHA2j+azl9CPjQc5O6F2CBcF8hVj2mDIINQIi+4wpV5
FQv445g2KFml/+AJr/1waz1GmhHtr1rfu7B7NX6tPUtOpxKN7AHAQXWYmHnwK8BJ
5Mh2a/7Lphjin4M1FWCeBTj0JtqF1oVAW2L9jsjqogq1JV0a51DIFutROtaPSC/4
ssTLM5Rqpnw8Z1GWVYD2PObIW4A+h1LV1tNGOIoGW6mX56mPU+KmVA7tTKr8Je/i
z3Jk8bZXFyLvPW2+KNJacbldKNcfwAFpReffNz/FO3R16Stq9Ta1mcE2
=wXju
-----END PGP SIGNATURE-----
Merge tag 'hyperv-fixes-signed-20231121' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux
Pull hyperv fixes from Wei Liu:
- One fix for the KVP daemon (Ani Sinha)
- Fix for the detection of E820_TYPE_PRAM in a Gen2 VM (Saurabh Sengar)
- Micro-optimization for hv_nmi_unknown() (Uros Bizjak)
* tag 'hyperv-fixes-signed-20231121' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux:
x86/hyperv: Use atomic_try_cmpxchg() to micro-optimize hv_nmi_unknown()
x86/hyperv: Fix the detection of E820_TYPE_PRAM in a Gen2 VM
hv/hv_kvp_daemon: Some small fixes for handling NM keyfiles
This field is set to APIC_DELIVERY_MODE_FIXED in all cases, and is read
exactly once. Fold the constant in uv_program_mmr() and drop the field.
Searching for the origin of the stale HyperV comment reveals commit
a31e58e129 ("x86/apic: Switch all APICs to Fixed delivery mode") which
notes:
As a consequence of this change, the apic::irq_delivery_mode field is
now pointless, but this needs to be cleaned up in a separate patch.
6 years is long enough for this technical debt to have survived.
[ bp: Fold in
https://lore.kernel.org/r/20231121123034.1442059-1-andrew.cooper3@citrix.com
]
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Steve Wahl <steve.wahl@hpe.com>
Link: https://lore.kernel.org/r/20231102-x86-apic-v1-1-bf049a2a0ed6@citrix.com
The AMD side of the loader issues the microcode revision for each
logical thread on the system, which can become really noisy on huge
machines. And doing that doesn't make a whole lot of sense - the
microcode revision is already in /proc/cpuinfo.
So in case one is interested in the theoretical support of mixed silicon
steppings on AMD, one can check there.
What is also missing on the AMD side - something which people have
requested before - is showing the microcode revision the CPU had
*before* the early update.
So abstract that up in the main code and have the BSP on each vendor
provide those revision numbers.
Then, dump them only once on driver init.
On Intel, do not dump the patch date - it is not needed.
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/CAHk-=wg=%2B8rceshMkB4VnKxmRccVLtBLPBawnewZuuqyx5U=3A@mail.gmail.com