Drop 32-bit "support" from __svm_sev_es_vcpu_run(), as SEV/SEV-ES is firmly
64-bit only. The "support" was purely the result of bad copy+paste from
__svm_vcpu_run(), which in turn was slightly less bad copy+paste from
__vmx_vcpu_run().
Opportunistically convert to unadulterated register accesses so that it's
easier (but still not easy) to follow which registers hold what arguments,
and when.
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/r/20240223204233.3337324-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Compile (and link) __svm_sev_es_vcpu_run() if and only if SEV support is
actually enabled. This will allow dropping non-existent 32-bit "support"
from __svm_sev_es_vcpu_run() without causing undue confusion.
Intentionally don't provide a stub (but keep the declaration), as any sane
compiler, even with things like KASAN enabled, should eliminate the call
to __svm_sev_es_vcpu_run() since sev_es_guest() unconditionally returns
"false" if CONFIG_KVM_AMD_SEV=n.
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/r/20240223204233.3337324-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
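As a rough user-space illustration of the property the changelog above relies
on (hypothetical names, built with gcc -O2; this is not KVM code): when the
guard folds to a constant "false", the compiler drops the guarded call and its
relocation, so the undefined function links fine without a stub.
<--- C sketch --->
#include <stdbool.h>
#include <stdio.h>

/* Declared but never defined, standing in for __svm_sev_es_vcpu_run(). */
void run_sev_es_guest(void);

#ifdef CONFIG_KVM_AMD_SEV_DEMO
bool is_sev_es_guest(void);			/* real hardware check elsewhere */
#else
static inline bool is_sev_es_guest(void)	/* stands in for sev_es_guest() */
{
	return false;
}
#endif

int main(void)
{
	if (is_sev_es_guest())		/* constant false: call is eliminated */
		run_sev_es_guest();	/* ...so no undefined reference remains */
	puts("non-SEV-ES path");
	return 0;
}
<--- C sketch --->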
Unconditionally create a stack frame in __svm_vcpu_run() to play nice with
unwinding via frame pointers, at least until the point where RBP is loaded
with the guest's value. Don't bother conditioning the code on
CONFIG_FRAME_POINTER=y, as RBP needs to be saved and restored anyways (due
to it being clobbered with the guest's value); omitting the "MOV RSP, RBP"
is not worth the extra #ifdef.
Creating a stack frame will allow removing the OBJECT_FILES_NON_STANDARD
tag from vmenter.S once __svm_sev_es_vcpu_run() is fixed to not stomp all
over RBP for no reason.
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/r/20240223204233.3337324-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Remove KVM's unnecessary zeroing of memory when allocating the pages array
in sev_pin_memory() via __vmalloc(), as the array is only used to hold
kernel pointers. The kmalloc() path for "small" regions doesn't zero the
array, and if KVM leaks state and/or accesses uninitialized data, then the
kernel has bigger problems.
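For illustration only, a small user-space analogue of that reasoning (plain
malloc() vs. calloc(), not the kernel's allocators): zero-initialization buys
nothing when every slot is written before it is ever read.
<--- C sketch --->
#include <stdio.h>
#include <stdlib.h>

#define NPAGES 8

int main(void)
{
	/* No zeroing needed: each slot is assigned before any read. */
	void **pages = malloc(NPAGES * sizeof(*pages));
	int i;

	if (!pages)
		return 1;
	for (i = 0; i < NPAGES; i++)
		pages[i] = &pages[i];	/* stand-in for a pinned page pointer */
	for (i = 0; i < NPAGES; i++)
		printf("page[%d] = %p\n", i, pages[i]);
	free(pages);
	return 0;
}
<--- C sketch --->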
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Link: https://lore.kernel.org/r/c7619a3d3cbb36463531a7c73ccbde9db587986c.1710004509.git.christophe.jaillet@wanadoo.fr
[sean: massage changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Merge tag 'perf-urgent-2024-04-07' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 perf fix from Ingo Molnar:
"Fix a combined PEBS events bug on x86 Intel CPUs"
* tag 'perf-urgent-2024-04-07' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86/intel/ds: Don't clear ->pebs_data_cfg for the last PEBS event
srso_alias_untrain_ret() is special code, even if it is a dummy
which is called in the !SRSO case, so annotate it like its real
counterpart, to address the following objtool splat:
vmlinux.o: warning: objtool: .export_symbol+0x2b290: data relocation to !ENDBR: srso_alias_untrain_ret+0x0
Fixes: 4535e1a417 ("x86/bugs: Fix the SRSO mitigation on Zen3/4")
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20240405144637.17908-1-bp@kernel.org
We want to fix:
0e11073247 ("x86/retpoline: Do the necessary fixup to the Zen3/4 srso return thunk for !SRSO")
So merge in Linus's latest into x86/urgent to have it available.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Merge tag 'mm-hotfixes-stable-2024-04-05-11-30' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morton:
"8 hotfixes, 3 are cc:stable
There are a couple of fixups for this cycle's vmalloc changes and one
for the stackdepot changes. And a fix for a very old x86 PAT issue
which can cause a warning splat"
* tag 'mm-hotfixes-stable-2024-04-05-11-30' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
stackdepot: rename pool_index to pool_index_plus_1
x86/mm/pat: fix VM_PAT handling in COW mappings
MAINTAINERS: change vmware.com addresses to broadcom.com
selftests/mm: include strings.h for ffsl
mm: vmalloc: fix lockdep warning
mm: vmalloc: bail out early in find_vmap_area() if vmap is not init
init: open output files from cpio unpacking with O_LARGEFILE
mm/secretmem: fix GUP-fast succeeding on secretmem folios
PAT handling won't do the right thing in COW mappings: the first PTE (or,
in fact, all PTEs) can be replaced during write faults to point at anon
folios. Reliably recovering the correct PFN and cachemode using
follow_phys() from PTEs will not work in COW mappings.
Using follow_phys(), we might just get the address+protection of the anon
folio (which is very wrong), or fail on swap/nonswap entries, failing
follow_phys() and triggering a WARN_ON_ONCE() in untrack_pfn() and
track_pfn_copy(), not properly calling free_pfn_range().
In free_pfn_range(), we either wouldn't call memtype_free() or would call
it with the wrong range, possibly leaking memory.
To fix that, let's update follow_phys() to refuse returning anon folios,
and fall back to using the stored PFN inside vma->vm_pgoff for COW mappings
if we run into that.
We will now properly handle untrack_pfn() with COW mappings, where we
don't need the cachemode. We'll have to fail fork()->track_pfn_copy() if
the first page was replaced by an anon folio, though: we'd have to store
the cachemode in the VMA to make this work, likely growing the VMA size.
For now, let's keep it simple and let track_pfn_copy() just fail in that
case: it would have failed in the past with swap/nonswap entries already,
and it would have done the wrong thing with anon folios.
Simple reproducer to trigger the WARN_ON_ONCE() in untrack_pfn():
<--- C reproducer --->
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>
#include <liburing.h>
int main(void)
{
	struct io_uring_params p = {};
	int ring_fd;
	size_t size;
	char *map;

	ring_fd = io_uring_setup(1, &p);
	if (ring_fd < 0) {
		perror("io_uring_setup");
		return 1;
	}
	size = p.sq_off.array + p.sq_entries * sizeof(unsigned);

	/* Map the submission queue ring MAP_PRIVATE */
	map = mmap(0, size, PROT_READ | PROT_WRITE, MAP_PRIVATE,
		   ring_fd, IORING_OFF_SQ_RING);
	if (map == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/* We have at least one page. Let's COW it. */
	*map = 0;
	pause();
	return 0;
}
<--- C reproducer --->
On a system with 16 GiB RAM and swap configured:
# ./iouring &
# memhog 16G
# killall iouring
[ 301.552930] ------------[ cut here ]------------
[ 301.553285] WARNING: CPU: 7 PID: 1402 at arch/x86/mm/pat/memtype.c:1060 untrack_pfn+0xf4/0x100
[ 301.553989] Modules linked in: binfmt_misc nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_g
[ 301.558232] CPU: 7 PID: 1402 Comm: iouring Not tainted 6.7.5-100.fc38.x86_64 #1
[ 301.558772] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebu4
[ 301.559569] RIP: 0010:untrack_pfn+0xf4/0x100
[ 301.559893] Code: 75 c4 eb cf 48 8b 43 10 8b a8 e8 00 00 00 3b 6b 28 74 b8 48 8b 7b 30 e8 ea 1a f7 000
[ 301.561189] RSP: 0018:ffffba2c0377fab8 EFLAGS: 00010282
[ 301.561590] RAX: 00000000ffffffea RBX: ffff9208c8ce9cc0 RCX: 000000010455e047
[ 301.562105] RDX: 07fffffff0eb1e0a RSI: 0000000000000000 RDI: ffff9208c391d200
[ 301.562628] RBP: 0000000000000000 R08: ffffba2c0377fab8 R09: 0000000000000000
[ 301.563145] R10: ffff9208d2292d50 R11: 0000000000000002 R12: 00007fea890e0000
[ 301.563669] R13: 0000000000000000 R14: ffffba2c0377fc08 R15: 0000000000000000
[ 301.564186] FS: 0000000000000000(0000) GS:ffff920c2fbc0000(0000) knlGS:0000000000000000
[ 301.564773] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 301.565197] CR2: 00007fea88ee8a20 CR3: 00000001033a8000 CR4: 0000000000750ef0
[ 301.565725] PKRU: 55555554
[ 301.565944] Call Trace:
[ 301.566148] <TASK>
[ 301.566325] ? untrack_pfn+0xf4/0x100
[ 301.566618] ? __warn+0x81/0x130
[ 301.566876] ? untrack_pfn+0xf4/0x100
[ 301.567163] ? report_bug+0x171/0x1a0
[ 301.567466] ? handle_bug+0x3c/0x80
[ 301.567743] ? exc_invalid_op+0x17/0x70
[ 301.568038] ? asm_exc_invalid_op+0x1a/0x20
[ 301.568363] ? untrack_pfn+0xf4/0x100
[ 301.568660] ? untrack_pfn+0x65/0x100
[ 301.568947] unmap_single_vma+0xa6/0xe0
[ 301.569247] unmap_vmas+0xb5/0x190
[ 301.569532] exit_mmap+0xec/0x340
[ 301.569801] __mmput+0x3e/0x130
[ 301.570051] do_exit+0x305/0xaf0
...
Link: https://lkml.kernel.org/r/20240403212131.929421-3-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reported-by: Wupeng Ma <mawupeng1@huawei.com>
Closes: https://lkml.kernel.org/r/20240227122814.3781907-1-mawupeng1@huawei.com
Fixes: b1a86e15dc ("x86, pat: remove the dependency on 'vm_pgoff' in track/untrack pfn vma routines")
Fixes: 5899329b19 ("x86: PAT: implement track/untrack of pfnmap regions for x86 - v3")
Acked-by: Ingo Molnar <mingo@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Add CPUID_LNX_5 to track cpufeatures' word 21, and add the appropriate
compile-time assert in KVM to prevent direct lookups on the features in
CPUID_LNX_5. KVM uses X86_FEATURE_* flags to manage guest CPUID, and so
must translate features that are scattered by Linux from the Linux-defined
bit to the hardware-defined bit, i.e. should never try to directly access
scattered features in guest CPUID.
Opportunistically add NR_CPUID_WORDS to enum cpuid_leafs, along with a
compile-time assert in KVM's CPUID infrastructure to ensure that future
additions update cpuid_leafs along with NCAPINTS.
No functional change intended.
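For reference, a minimal stand-alone sketch of that compile-time assertion
idea (the _DEMO names are made up; the kernel uses BUILD_BUG_ON(), which
boils down to the same mechanism):
<--- C sketch --->
#include <assert.h>

#define NCAPINTS_DEMO 22	/* number of feature words, for illustration */

enum cpuid_leafs_demo {
	CPUID_WORD_0,
	/* ... one enumerator per feature word ... */
	CPUID_LNX_5_DEMO = 21,
	NR_CPUID_WORDS_DEMO,
};

/* Build breaks if NCAPINTS_DEMO is bumped without adding an enumerator. */
static_assert(NR_CPUID_WORDS_DEMO == NCAPINTS_DEMO,
	      "cpuid_leafs must be updated along with NCAPINTS");

int main(void)
{
	return 0;
}
<--- C sketch --->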
Fixes: 7f274e609f ("x86/cpufeatures: Add new word for scattered features")
Cc: Sandipan Das <sandipan.das@amd.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Merge tag 'net-6.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Including fixes from netfilter, bluetooth and bpf.
Fairly usual collection of driver and core fixes. The large selftest
accompanying one of the fixes is also becoming a common occurrence.
Current release - regressions:
- ipv6: fix infinite recursion in fib6_dump_done()
- net/rds: fix possible null-deref in newly added error path
Current release - new code bugs:
- net: do not consume a full cacheline for system_page_pool
- bpf: fix bpf_arena-related file descriptor leaks in the verifier
- drv: ice: fix freeing uninitialized pointers, fixing misuse of the
newfangled __free() auto-cleanup
Previous releases - regressions:
- x86/bpf: fixes the BPF JIT with retbleed=stuff
- xen-netfront: add missing skb_mark_for_recycle, fix page pool
accounting leaks, revealed by recently added explicit warning
- tcp: fix bind() regression for v6-only wildcard and v4-mapped-v6
non-wildcard addresses
- Bluetooth:
- replace "hci_qca: Set BDA quirk bit if fwnode exists in DT" with
better workarounds to un-break some buggy Qualcomm devices
- set conn encrypted before conn establishes, fix re-connecting to
some headsets which use slightly unusual sequence of msgs
- mptcp:
- prevent BPF accessing lowat from a subflow socket
- don't account accept() of non-MPC client as fallback to TCP
- drv: mana: fix Rx DMA datasize and skb_over_panic
- drv: i40e: fix VF MAC filter removal
Previous releases - always broken:
- gro: various fixes related to UDP tunnels - netns crossing
problems, incorrect checksum conversions, and incorrect packet
transformations which may lead to panics
- bpf: support deferring bpf_link dealloc to after RCU grace period
- nf_tables:
- release batch on table validation from abort path
- release mutex after nft_gc_seq_end from abort path
- flush pending destroy work before exit_net release
- drv: r8169: skip DASH fw status checks when DASH is disabled"
* tag 'net-6.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (81 commits)
netfilter: validate user input for expected length
net/sched: act_skbmod: prevent kernel-infoleak
net: usb: ax88179_178a: avoid the interface always configured as random address
net: dsa: sja1105: Fix parameters order in sja1110_pcs_mdio_write_c45()
net: ravb: Always update error counters
net: ravb: Always process TX descriptor ring
netfilter: nf_tables: discard table flag update with pending basechain deletion
netfilter: nf_tables: Fix potential data-race in __nft_flowtable_type_get()
netfilter: nf_tables: reject new basechain after table flag update
netfilter: nf_tables: flush pending destroy work before exit_net release
netfilter: nf_tables: release mutex after nft_gc_seq_end from abort path
netfilter: nf_tables: release batch on table validation from abort path
Revert "tg3: Remove residual error handling in tg3_suspend"
tg3: Remove residual error handling in tg3_suspend
net: mana: Fix Rx DMA datasize and skb_over_panic
net/sched: fix lockdep splat in qdisc_tree_reduce_backlog()
net: phy: micrel: lan8814: Fix when enabling/disabling 1-step timestamping
net: stmmac: fix rx queue priority assignment
net: txgbe: fix i2c dev name cannot match clkdev
net: fec: Set mac_managed_pm during probe
...
Modifying an MCA bank's MCA_CTL bits, which control which error types are
reported, is done over
/sys/devices/system/machinecheck/
├── machinecheck0
│ ├── bank0
│ ├── bank1
│ ├── bank10
│ ├── bank11
...
sysfs nodes by writing the new bit mask of events to enable.
When the write is accepted, the kernel deletes all current timers and
reinits all banks.
Doing that in parallel can lead to initializing a timer which is already
armed and in the timer wheel, i.e., in use already:
ODEBUG: init active (active state 0) object: ffff888063a28000 object
type: timer_list hint: mce_timer_fn+0x0/0x240 arch/x86/kernel/cpu/mce/core.c:2642
WARNING: CPU: 0 PID: 8120 at lib/debugobjects.c:514
debug_print_object+0x1a0/0x2a0 lib/debugobjects.c:514
Fix that by grabbing the sysfs mutex as the rest of the MCA sysfs code
does.
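The idea of the fix, shown as a user-space sketch (pthreads stand in for the
sysfs writers; none of this is the real MCE code): every "tear down and
re-initialize" sequence runs under one mutex, so a second writer can never
re-init a timer that is still live.
<--- C sketch --->
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t reinit_lock = PTHREAD_MUTEX_INITIALIZER;
static int timer_armed;	/* stand-in for a per-CPU MCE polling timer */

static void write_bank_ctl(unsigned long new_mask)
{
	pthread_mutex_lock(&reinit_lock);	/* the added serialization */
	timer_armed = 0;			/* "delete current timer"   */
	/* ... reprogram MCA_CTL with new_mask, reinit the bank ...    */
	timer_armed = 1;			/* "arm the timer again"    */
	pthread_mutex_unlock(&reinit_lock);
	(void)new_mask;
}

static void *writer(void *arg)
{
	int i;

	for (i = 0; i < 1000; i++)
		write_bank_ctl((unsigned long)(long)arg);
	return NULL;
}

int main(void)
{
	pthread_t a, b;

	pthread_create(&a, NULL, writer, (void *)0x1);
	pthread_create(&b, NULL, writer, (void *)0x2);
	pthread_join(a, NULL);
	pthread_join(b, NULL);
	puts("serialized re-init done");
	return 0;
}
<--- C sketch --->
(Build with -lpthread.)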
Reported-by: Yue Sun <samsun1006219@gmail.com>
Reported-by: xingwei lee <xrivendell7@gmail.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Cc: <stable@kernel.org>
Link: https://lore.kernel.org/r/CAEkJfYNiENwQY8yV1LYJ9LjJs%2Bx_-PqMv98gKig55=2vbzffRw@mail.gmail.com
The host SNP worthiness can be determined later, after alternatives have
been patched, in snp_rmptable_init() depending on cmdline options like
iommu=pt which is incompatible with SNP, for example.
Which means that one cannot use X86_FEATURE_SEV_SNP and will need to
have a special flag for that control.
Use that newly added CC_ATTR_HOST_SEV_SNP in the appropriate places.
Move kdump_sev_callback() to its rightful place, while at it.
Fixes: 216d106c7f ("x86/sev: Add SEV-SNP host initialization support")
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Tested-by: Srikanth Aithal <sraithal@amd.com>
Link: https://lore.kernel.org/r/20240327154317.29909-6-bp@alien8.de
Add functionality to set and/or clear different attributes of the
machine as a confidential computing platform. Add the first one too:
whether the machine is running as a host for SEV-SNP guests.
Fixes: 216d106c7f ("x86/sev: Add SEV-SNP host initialization support")
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Tested-by: Srikanth Aithal <sraithal@amd.com>
Link: https://lore.kernel.org/r/20240327154317.29909-5-bp@alien8.de
The functionality to load SEV-SNP guests by the host will soon rely on
cc_platform* helpers because the cpu_feature* API with the early
patching is insufficient when SNP support needs to be disabled late.
Therefore, pull that functionality in.
Fixes: 216d106c7f ("x86/sev: Add SEV-SNP host initialization support")
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Tested-by: Srikanth Aithal <sraithal@amd.com>
Link: https://lore.kernel.org/r/20240327154317.29909-4-bp@alien8.de
There are few uses of CoCo that don't rely on working cryptography and
hence a working RNG. Unfortunately, the CoCo threat model means that the
VM host cannot be trusted and may actively work against guests to
extract secrets or manipulate computation. Since a malicious host can
modify or observe nearly all inputs to guests, the only remaining source
of entropy for CoCo guests is RDRAND.
If RDRAND is broken -- due to CPU hardware fault -- the RNG as a whole
is meant to gracefully continue on gathering entropy from other sources,
but since there aren't other sources on CoCo, this is catastrophic.
This is mostly a concern at boot time when initially seeding the RNG, as
after that the consequences of a broken RDRAND are much more
theoretical.
So, try at boot to seed the RNG using 256 bits of RDRAND output. If this
fails, panic(). This will also trigger if the system is booted without
RDRAND, as RDRAND is essential for a safe CoCo boot.
Add this deliberately to be "just a CoCo x86 driver feature" and not
part of the RNG itself. Many device drivers and platforms have some
desire to contribute something to the RNG, and add_device_randomness()
is specifically meant for this purpose.
Any driver can call it with seed data of any quality, or even garbage
quality, and it can only possibly make the quality of the RNG better or
have no effect, but can never make it worse.
Rather than trying to build something into the core of the RNG, consider
the particular CoCo issue just a CoCo issue, and therefore separate it
all out into driver (well, arch/platform) code.
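A user-space approximation of that boot-time policy (this is not the kernel
driver; it only mirrors the "256 bits of RDRAND or give up" rule, with
add_device_randomness() replaced by a printout):
<--- C sketch --->
#include <immintrin.h>
#include <stdio.h>
#include <stdlib.h>

#define SEED_WORDS	4	/* 4 x 64 bits = 256 bits */
#define RDRAND_RETRIES	10

__attribute__((target("rdrnd")))
static int get_seed(unsigned long long seed[SEED_WORDS])
{
	int i, try, ok;

	for (i = 0; i < SEED_WORDS; i++) {
		for (ok = 0, try = 0; try < RDRAND_RETRIES && !ok; try++)
			ok = _rdrand64_step(&seed[i]);
		if (!ok)
			return -1;
	}
	return 0;
}

int main(void)
{
	unsigned long long seed[SEED_WORDS];

	if (get_seed(seed)) {
		/* The CoCo driver panics here; a process can only abort. */
		fprintf(stderr, "RDRAND failed, refusing to continue\n");
		abort();
	}
	/* Stand-in for add_device_randomness(seed, sizeof(seed)). */
	printf("seeded with %zu bytes from RDRAND\n", sizeof(seed));
	return 0;
}
<--- C sketch --->
(Requires an x86-64 CPU that actually implements RDRAND.)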
[ bp: Massage commit message. ]
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Elena Reshetova <elena.reshetova@intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20240326160735.73531-1-Jason@zx2c4.com
The __vmalloc_start_set declaration is in a header that is not included
in numa_32.c in current linux-next:
arch/x86/mm/numa_32.c: In function 'initmem_init':
arch/x86/mm/numa_32.c:57:9: error: '__vmalloc_start_set' undeclared (first use in this function)
57 | __vmalloc_start_set = true;
| ^~~~~~~~~~~~~~~~~~~
arch/x86/mm/numa_32.c:57:9: note: each undeclared identifier is reported only once for each function it appears in
Add an explicit #include.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20240403202344.3463169-1-arnd@kernel.org
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM fixes from Paolo Bonzini:
"ARM:
- Ensure perf events programmed to count during guest execution are
actually enabled before entering the guest in the nVHE
configuration
- Restore out-of-range handler for stage-2 translation faults
- Several fixes to stage-2 TLB invalidations to avoid stale
translations, possibly including partial walk caches
- Fix early handling of architectural VHE-only systems to ensure E2H
is appropriately set
- Correct a format specifier warning in the arch_timer selftest
- Make the KVM banner message correctly handle all of the possible
configurations
RISC-V:
- Remove redundant semicolon in num_isa_ext_regs()
- Fix APLIC setipnum_le/be write emulation
- Fix APLIC in_clrip[x] read emulation
x86:
- Fix a bug in KVM_SET_CPUID{2,} where KVM looks at the wrong CPUID
entries (old vs. new) and ultimately neglects to clear PV_UNHALT
from vCPUs with HLT-exiting disabled
- Documentation fixes for SEV
- Fix compat ABI for KVM_MEMORY_ENCRYPT_OP
- Fix a 14-year-old goof in a declaration shared by host and guest;
the enabled field used by Linux when running as a guest pushes the
size of "struct kvm_vcpu_pv_apf_data" from 64 to 68 bytes. This is
really inconsequential because KVM never consumes anything beyond
the first 64 bytes, but the resulting struct does not match the
documentation
Selftests:
- Fix spelling mistake in arch_timer selftest"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (25 commits)
KVM: arm64: Rationalise KVM banner output
arm64: Fix early handling of FEAT_E2H0 not being implemented
KVM: arm64: Ensure target address is granule-aligned for range TLBI
KVM: arm64: Use TLBI_TTL_UNKNOWN in __kvm_tlb_flush_vmid_range()
KVM: arm64: Don't pass a TLBI level hint when zapping table entries
KVM: arm64: Don't defer TLB invalidation when zapping table entries
KVM: selftests: Fix __GUEST_ASSERT() format warnings in ARM's arch timer test
KVM: arm64: Fix out-of-IPA space translation fault handling
KVM: arm64: Fix host-programmed guest events in nVHE
RISC-V: KVM: Fix APLIC in_clrip[x] read emulation
RISC-V: KVM: Fix APLIC setipnum_le/be write emulation
RISC-V: KVM: Remove second semicolon
KVM: selftests: Fix spelling mistake "trigged" -> "triggered"
Documentation: kvm/sev: clarify usage of KVM_MEMORY_ENCRYPT_OP
Documentation: kvm/sev: separate description of firmware
KVM: SEV: fix compat ABI for KVM_MEMORY_ENCRYPT_OP
KVM: selftests: Check that PV_UNHALT is cleared when HLT exiting is disabled
KVM: x86: Use actual kvm_cpuid.base for clearing KVM_FEATURE_PV_UNHALT
KVM: x86: Introduce __kvm_get_hypervisor_cpuid() helper
KVM: SVM: Return -EINVAL instead of -EBUSY on attempt to re-init SEV/SEV-ES
...
The srso_alias_untrain_ret() dummy thunk in the !CONFIG_MITIGATION_SRSO
case is there only for the alternative in CALL_UNTRAIN_RET to have
a symbol to resolve.
However, testing with kernels which don't have CONFIG_MITIGATION_SRSO
enabled leads to the warning in patch_return() firing:
missing return thunk: srso_alias_untrain_ret+0x0/0x10-0x0: eb 0e 66 66 2e
WARNING: CPU: 0 PID: 0 at arch/x86/kernel/alternative.c:826 apply_returns (arch/x86/kernel/alternative.c:826
Put in a plain "ret" there so that gcc doesn't put a return thunk in
its place, which is special and gets checked.
In addition:
ERROR: modpost: "srso_alias_untrain_ret" [arch/x86/kvm/kvm-amd.ko] undefined!
make[2]: *** [scripts/Makefile.modpost:145: Module.symvers] Error 1
make[1]: *** [/usr/src/linux-6.8.3/Makefile:1873: modpost] Error 2
make: *** [Makefile:240: __sub-make] Error 2
since !SRSO builds would use the dummy return thunk as reported by
petr.pisar@atlas.cz, https://bugzilla.kernel.org/show_bug.cgi?id=218679.
Reported-by: kernel test robot <oliver.sang@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202404020901.da75a60f-oliver.sang@intel.com
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/all/202404020901.da75a60f-oliver.sang@intel.com/
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The MSR_PEBS_DATA_CFG MSR is used to configure which data groups
should be generated into a PEBS record, and it's shared among all counters.
If there are different configurations among counters, perf combines all the
configurations.
The first perf command as below requires a complete PEBS record
(including memory info, GPRs, XMMs, and LBRs). The second perf command
only requires a basic group. However, after the second perf command is
running, the MSR_PEBS_DATA_CFG register is cleared. Only a basic group is
generated in a PEBS record, which is wrong. The information required
by the first perf command is lost.
$ perf record --intr-regs=AX,SP,XMM0 -a -C 8 -b -W -d -c 100000003 -o /dev/null -e cpu/event=0xd0,umask=0x81/upp &
$ sleep 5
$ perf record --per-thread -c 1 -e cycles:pp --no-timestamp --no-tid taskset -c 8 ./noploop 1000
The first PEBS event is a system-wide PEBS event. The second PEBS event
is a per-thread event. When the thread is scheduled out, the
intel_pmu_pebs_del() function is invoked to update the PEBS state.
Since the system-wide event is still available, the cpuc->n_pebs is 1.
The cpuc->pebs_data_cfg is cleared. The data configuration for the
system-wide PEBS event is lost.
The (cpuc->n_pebs == 1) check was introduced in commit:
b6a32f023f ("perf/x86: Fix PEBS threshold initialization")
At that time, it indeed didn't hurt whether the state was updated
during the removal, because only the threshold is updated.
The calculation of the threshold takes the last PEBS event into
account.
However, since commit:
b752ea0c28 ("perf/x86/intel/ds: Flush PEBS DS when changing PEBS_DATA_CFG")
we delay the threshold update, and clear the PEBS data config, which triggers
the bug.
The PEBS data config update scope should not be shrunk during removal.
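A toy model of that bookkeeping (not the real perf code; the flag names are
invented) which shows why clearing the union on removal breaks the surviving
event, while simply not shrinking it keeps the record complete:
<--- C sketch --->
#include <stdio.h>

#define PEBS_GPRS	0x1
#define PEBS_XMMS	0x2
#define PEBS_LBRS	0x4

static unsigned int pebs_data_cfg;	/* union of all active requests */
static int n_pebs;

static void pebs_add(unsigned int cfg)
{
	n_pebs++;
	pebs_data_cfg |= cfg;
}

static void pebs_del_buggy(void)
{
	n_pebs--;
	pebs_data_cfg = 0;	/* clobbers the surviving event's groups */
}

static void pebs_del_fixed(void)
{
	n_pebs--;		/* keep the union; never shrink on removal */
}

int main(void)
{
	pebs_add(PEBS_GPRS | PEBS_XMMS | PEBS_LBRS);	/* system-wide event */
	pebs_add(0);					/* basic per-thread event */
	pebs_del_buggy();
	printf("buggy: cfg=0x%x, %d event left\n", pebs_data_cfg, n_pebs);

	pebs_data_cfg = 0;
	n_pebs = 0;
	pebs_add(PEBS_GPRS | PEBS_XMMS | PEBS_LBRS);
	pebs_add(0);
	pebs_del_fixed();
	printf("fixed: cfg=0x%x, %d event left\n", pebs_data_cfg, n_pebs);
	return 0;
}
<--- C sketch --->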
[ mingo: Improved the changelog & comments. ]
Fixes: b752ea0c28 ("perf/x86/intel/ds: Flush PEBS DS when changing PEBS_DATA_CFG")
Reported-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20240401133320.703971-1-kan.liang@linux.intel.com
Tony encountered this OOPS when the last CPU of a domain goes
offline while running a kernel built with CONFIG_NO_HZ_FULL:
BUG: kernel NULL pointer dereference, address: 0000000000000000
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0
Oops: 0000 [#1] PREEMPT SMP NOPTI
...
RIP: 0010:__find_nth_andnot_bit+0x66/0x110
...
Call Trace:
<TASK>
? __die()
? page_fault_oops()
? exc_page_fault()
? asm_exc_page_fault()
cpumask_any_housekeeping()
mbm_setup_overflow_handler()
resctrl_offline_cpu()
resctrl_arch_offline_cpu()
cpuhp_invoke_callback()
cpuhp_thread_fun()
smpboot_thread_fn()
kthread()
ret_from_fork()
ret_from_fork_asm()
</TASK>
The NULL pointer dereference is encountered while searching for another
online CPU in the domain (of which there are none) that can be used to
run the MBM overflow handler.
Because the kernel is configured with CONFIG_NO_HZ_FULL the search for
another CPU (in its effort to prefer those CPUs that aren't marked
nohz_full) consults the mask representing the nohz_full CPUs,
tick_nohz_full_mask. On a kernel with CONFIG_CPUMASK_OFFSTACK=y
tick_nohz_full_mask is not allocated unless the kernel is booted with
the "nohz_full=" parameter and because of that any access to
tick_nohz_full_mask needs to be guarded with tick_nohz_full_enabled().
Replace the IS_ENABLED(CONFIG_NO_HZ_FULL) with tick_nohz_full_enabled().
The latter ensures tick_nohz_full_mask can be accessed safely and can be
used whether the kernel is built with CONFIG_NO_HZ_FULL enabled or not.
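As a stand-alone illustration of that difference (hypothetical names, a
64-bit-word bitmap instead of a cpumask): a compile-time IS_ENABLED()-style
check still dereferences a mask that may never have been allocated, while the
runtime check bails out first.
<--- C sketch --->
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define CONFIG_NO_HZ_FULL_DEMO 1		/* built in ...                  */
static unsigned long *nohz_full_mask;		/* ... but maybe never allocated */

static int nohz_full_enabled_demo(void)
{
	return nohz_full_mask != NULL;		/* runtime check, like the fix */
}

static int cpu_is_nohz_full(int cpu)
{
#if CONFIG_NO_HZ_FULL_DEMO			/* compile-time check alone: unsafe */
	if (!nohz_full_enabled_demo())		/* the fix: bail out here           */
		return 0;
	return !!(nohz_full_mask[cpu / 64] & (1UL << (cpu % 64)));
#else
	return 0;
#endif
}

int main(int argc, char **argv)
{
	/* Only "booting with nohz_full=" allocates the mask. */
	if (argc > 1 && !strcmp(argv[1], "nohz_full"))
		nohz_full_mask = calloc(4, sizeof(unsigned long));

	printf("cpu 3 nohz_full: %d\n", cpu_is_nohz_full(3));
	return 0;
}
<--- C sketch --->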
[ Use Ingo's suggestion that combines the two NO_HZ checks into one. ]
Fixes: a4846aaf39 ("x86/resctrl: Add cpumask_any_housekeeping() for limbo/overflow")
Reported-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Babu Moger <babu.moger@amd.com>
Link: https://lore.kernel.org/r/ff8dfc8d3dcb04b236d523d1e0de13d2ef585223.1711993956.git.reinette.chatre@intel.com
Closes: https://lore.kernel.org/lkml/ZgIFT5gZgIQ9A9G7@agluck-desk3/
Merge tag 'kvmarm-fixes-6.9-1' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD
KVM/arm64 fixes for 6.9, part #1
- Ensure perf events programmed to count during guest execution
are actually enabled before entering the guest in the nVHE
configuration.
- Restore out-of-range handler for stage-2 translation faults.
- Several fixes to stage-2 TLB invalidations to avoid stale
translations, possibly including partial walk caches.
- Fix early handling of architectural VHE-only systems to ensure E2H is
appropriately set.
- Correct a format specifier warning in the arch_timer selftest.
- Make the KVM banner message correctly handle all of the possible
configurations.
The commit:
59bec00ace ("x86/percpu: Introduce %rip-relative addressing to PER_CPU_VAR()")
made PER_CPU_VAR() use rip-relative addressing, hence the
INCREMENT_CALL_DEPTH macro and skl_call_thunk_template got rip-relative
asm code inside of them. A follow-up commit:
17bce3b2ae ("x86/callthunks: Handle %rip-relative relocations in call thunk template")
changed x86_call_depth_emit_accounting() to use apply_relocation(),
but mistakenly assumed that the code is being patched in-place (where
the destination of the relocation matches the address of the code),
using *pprog as the destination ip. This is not true for the call depth
accounting emitted by the BPF JIT, so the calculated address was wrong;
JIT-ed BPF progs on kernels with call depth tracking got broken and
usually caused a page fault.
Pass the destination IP when the BPF JIT emits call depth accounting.
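For reference, the displacement arithmetic in question, as a stand-alone
sketch (addresses made up; this is not the JIT code): a near CALL's rel32 is
"target - (ip + 5)", so computing it against the temporary build buffer
instead of the final destination IP yields the wrong displacement.
<--- C sketch --->
#include <stdint.h>
#include <stdio.h>

static int32_t call_rel32(uint64_t ip_of_call, uint64_t target)
{
	/* 5 = length of a CALL rel32 instruction */
	return (int32_t)(target - (ip_of_call + 5));
}

int main(void)
{
	uint64_t buf_addr = 0x00007f0000001000ULL;	/* where the code is assembled */
	uint64_t dest_ip  = 0xffffffffa0002000ULL;	/* where it will actually run  */
	uint64_t target   = 0xffffffffa0001000ULL;	/* call depth accounting thunk */

	printf("rel32 vs. buffer address:  0x%08x (wrong)\n",
	       (uint32_t)call_rel32(buf_addr, target));
	printf("rel32 vs. destination IP:  0x%08x (correct)\n",
	       (uint32_t)call_rel32(dest_ip, target));
	return 0;
}
<--- C sketch --->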
Fixes: 17bce3b2ae ("x86/callthunks: Handle %rip-relative relocations in call thunk template")
Signed-off-by: Joan Bruguera Micó <joanbrugueram@gmail.com>
Reviewed-by: Uros Bizjak <ubizjak@gmail.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/r/20240401185821.224068-3-ubizjak@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Adjust the IP passed to `emit_patch` so it calculates the correct offset
for the CALL instruction if `x86_call_depth_emit_accounting` emits code.
Otherwise we will skip some instructions and most likely crash.
Fixes: b2e9dfe54b ("x86/bpf: Emit call depth accounting if required")
Link: https://lore.kernel.org/lkml/20230105214922.250473-1-joanbrugueram@gmail.com/
Co-developed-by: Joan Bruguera Micó <joanbrugueram@gmail.com>
Signed-off-by: Joan Bruguera Micó <joanbrugueram@gmail.com>
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/r/20240401185821.224068-2-ubizjak@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Merge tag 'perf_urgent_for_v6.9_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 perf fixes from Borislav Petkov:
- Define the correct set of default hw events on AMD Zen4
- Use the correct stalled cycles PMCs on AMD Zen2 and newer
- Fix detection of the LBR freeze feature on AMD
* tag 'perf_urgent_for_v6.9_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86/amd/core: Define a proper ref-cycles event for Zen 4 and later
perf/x86/amd/core: Update and fix stalled-cycles-* events for Zen 2 and later
perf/x86/amd/lbr: Use freeze based on availability
x86/cpufeatures: Add new word for scattered features
Merge tag 'x86_urgent_for_v6.9_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Borislav Petkov:
- Make sure single object builds in arch/x86/virt/ ala
make ... arch/x86/virt/vmx/tdx/seamcall.o
work again
- Do not do ROM range scans and memory validation when the kernel is
running as a SEV-SNP guest as those can get problematic and, before
that, are not really needed in such a guest
- Exclude the build-time generated vdso-image-x32.o object from objtool
validation and in particular the return sites in there due to a
warning which fires when an unpatched return thunk is being used
- Improve the NMI CPUs stall message to show additional information
about the state of each CPU wrt the NMI handler
- Enable gcc named address spaces support only on !KCSAN configs due to
compiler options incompatibility
- Revert a change which was trying to use GB pages for mapping regions
only when the regions would be large enough but that change lead to
kexec failing
- A documentation fixlet
* tag 'x86_urgent_for_v6.9_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/build: Use obj-y to descend into arch/x86/virt/
x86/sev: Skip ROM range scans and validation for SEV-SNP guests
x86/vdso: Fix rethunk patching for vdso-image-x32.o too
x86/nmi: Upgrade NMI backtrace stall checks & messages
x86/percpu: Disable named address spaces for KCSAN
Revert "x86/mm/ident_map: Use gbpages only where full GB page should be mapped."
Documentation/x86: Fix title underline length
Commit c33621b4c5 ("x86/virt/tdx: Wire up basic SEAMCALL functions")
introduced a new instance of core-y instead of the standardized obj-y
syntax.
X86 Makefiles descend into subdirectories of arch/x86/virt inconsistently;
into arch/x86/virt/ via core-y defined in arch/x86/Makefile, but into
arch/x86/virt/svm/ via obj-y defined in arch/x86/Kbuild.
This is problematic when you build a single object in parallel because
multiple threads attempt to build the same file.
$ make -j$(nproc) arch/x86/virt/vmx/tdx/seamcall.o
[ snip ]
AS arch/x86/virt/vmx/tdx/seamcall.o
AS arch/x86/virt/vmx/tdx/seamcall.o
fixdep: error opening file: arch/x86/virt/vmx/tdx/.seamcall.o.d: No such file or directory
make[4]: *** [scripts/Makefile.build:362: arch/x86/virt/vmx/tdx/seamcall.o] Error 2
Use the obj-y syntax, as it works correctly.
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20240330060554.18524-1-masahiroy@kernel.org
The original version of the mitigation would patch in the calls to the
untraining routines directly. That is, the alternative() in UNTRAIN_RET
will patch in the CALL to srso_alias_untrain_ret() directly.
However, even if commit e7c25c441e ("x86/cpu: Cleanup the untrain
mess") meant well in trying to clean up the situation, due to micro-
architectural reasons, the untraining routine srso_alias_untrain_ret()
must be the target of a CALL instruction and not of a JMP instruction as
it is done now.
Reshuffle the alternative macros to accomplish that.
Fixes: e7c25c441e ("x86/cpu: Cleanup the untrain mess")
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Merge tag 'efi-fixes-for-v6.9-3' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi
Pull EFI fixes from Ard Biesheuvel:
"These address all the outstanding EFI/x86 boot related regressions:
- Revert to the old initrd memory allocation soft limit of INT_MAX,
which was dropped inadvertently
- Ensure that startup_32() is entered with a valid boot_params
pointer when using the new EFI mixed mode protocol
- Fix a compiler warning introduced by a fix from the previous pull"
* tag 'efi-fixes-for-v6.9-3' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi:
x86/efistub: Reinstate soft limit for initrd loading
efi/libstub: Cast away type warning in use of max()
x86/efistub: Add missing boot_params for mixed mode compat entry
There are regression reports[1][2] that the crashkernel region on x86_64
sometimes can't be added into the iomem tree. This causes kdump loading
to fail later.
This happened after commit 4a693ce65b ("kdump: defer the insertion of
crashkernel resources") was merged.
Even though these reported issues proved to be related to another
component and were merely exposed after the above commit was applied,
keep adding crashk_res and crashk_low_res into iomem early as before,
because that early adding has always been there on x86_64 and has been
working very well. For the safety of kdump, change it back.
Here, add a macro HAVE_ARCH_ADD_CRASH_RES_TO_IOMEM_EARLY so that only
architectures defining the macro get the early adding of
crashk_res/_low_res into iomem. Then define
HAVE_ARCH_ADD_CRASH_RES_TO_IOMEM_EARLY on x86 to enable it.
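A stand-alone sketch of that opt-in pattern (the names mirror the changelog,
but this is not the kernel code): generic reservation code performs the early
insertion only when the architecture defines the macro.
<--- C sketch --->
#include <stdio.h>

/* On x86 this would come from an arch header. */
#define HAVE_ARCH_ADD_CRASH_RES_TO_IOMEM_EARLY

static void insert_resource_demo(const char *name)
{
	printf("inserted \"%s\" into iomem\n", name);
}

static void reserve_crashkernel_demo(void)
{
	/* ... reserve the crashkernel memory range ... */
#ifdef HAVE_ARCH_ADD_CRASH_RES_TO_IOMEM_EARLY
	/* Opted-in architectures add the resources right away. */
	insert_resource_demo("Crash kernel");
	insert_resource_demo("Crash kernel (low)");
#endif
	/* Everyone else keeps the deferred insertion from 4a693ce65b. */
}

int main(void)
{
	reserve_crashkernel_demo();
	return 0;
}
<--- C sketch --->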
Note: In reserve_crashkernel_low(), there's a remnant of crashk_low_res
handling which was mistakenly added back in commit 85fcde402d ("kexec:
split crashkernel reservation code out from crash_core.c").
[1]
[PATCH V2] x86/kexec: do not update E820 kexec table for setup_data
https://lore.kernel.org/all/Zfv8iCL6CT2JqLIC@darkstar.users.ipa.redhat.com/T/#u
[2]
Question about Address Range Validation in Crash Kernel Allocation
https://lore.kernel.org/all/4eeac1f733584855965a2ea62fa4da58@huawei.com/T/#u
Link: https://lkml.kernel.org/r/ZgDYemRQ2jxjLkq+@MiWiFi-R3L-srv
Fixes: 4a693ce65b ("kdump: defer the insertion of crashkernel resources")
Signed-off-by: Baoquan He <bhe@redhat.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Huacai Chen <chenhuacai@loongson.cn>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jiri Bohac <jbohac@suse.cz>
Cc: Li Huafei <lihuafei1@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
SEV-SNP requires encrypted memory to be validated before access.
Because the ROM memory range is not part of the e820 table, it is not
pre-validated by the BIOS. Therefore, if a SEV-SNP guest kernel wishes
to access this range, the guest must first validate the range.
The current SEV-SNP code does indeed scan the ROM range during early
boot and thus attempts to validate the ROM range in probe_roms().
However, this behavior is neither sufficient nor necessary for the
following reasons:
* With regards to sufficiency, if EFI_CONFIG_TABLES are not enabled and
CONFIG_DMI_SCAN_MACHINE_NON_EFI_FALLBACK is set, the kernel will
attempt to access the memory at SMBIOS_ENTRY_POINT_SCAN_START (which
falls in the ROM range) prior to validation.
For example, Project Oak Stage 0 provides a minimal guest firmware
that currently meets these configuration conditions, meaning guests
booting atop Oak Stage 0 firmware encounter a problematic call chain
during dmi_setup() -> dmi_scan_machine() that results in a crash
during boot if SEV-SNP is enabled.
* With regards to necessity, SEV-SNP guests generally read garbage
(which changes across boots) from the ROM range, meaning these scans
are unnecessary. The guest reads garbage because the legacy ROM range
is unencrypted data but is accessed via an encrypted PMD during early
boot (where the PMD is marked as encrypted due to potentially mapping
actually-encrypted data in other PMD-contained ranges).
In one exceptional case, EISA probing treats the ROM range as
unencrypted data, which is inconsistent with other probing.
Continuing to allow SEV-SNP guests to use garbage and to inconsistently
classify ROM range encryption status can trigger undesirable behavior.
For instance, if garbage bytes appear to be a valid signature, memory
may be unnecessarily reserved for the ROM range. Future code or other
use cases may result in more problematic (arbitrary) behavior that
should be avoided.
While one solution would be to overhaul the early PMD mapping to always
treat the ROM region of the PMD as unencrypted, SEV-SNP guests do not
currently rely on data from the ROM region during early boot (and even
if they did, they would be mostly relying on garbage data anyways).
As a simpler solution, skip the ROM range scans (and the otherwise-
necessary range validation) during SEV-SNP guest early boot. The
potential SEV-SNP guest crash due to lack of ROM range validation is
thus avoided by simply not accessing the ROM range.
In most cases, skip the scans by overriding problematic x86_init
functions during sme_early_init() to SNP-safe variants, which can be
likened to x86_init overrides done for other platforms (ex: Xen); such
overrides also avoid the spread of cc_platform_has() checks throughout
the tree.
In the exceptional EISA case, still use cc_platform_has() for the
simplest change, given (1) checks for guest type (ex: Xen domain status)
are already performed here, and (2) these checks occur in a subsys
initcall instead of an x86_init function.
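The override technique, reduced to a user-space sketch (hypothetical struct
and field names, not the real x86_init layout): platform init swaps a
callback for a no-op variant once, and every caller stays unchanged instead
of growing a cc_platform_has() check.
<--- C sketch --->
#include <stdbool.h>
#include <stdio.h>

struct x86_init_ops_demo {
	void (*probe_roms)(void);
};

static void probe_roms_default(void)
{
	printf("scanning legacy ROM range\n");
}

static void probe_roms_snp_noop(void)
{
	/* SEV-SNP guest: skip the scan; the range is unvalidated garbage. */
}

static struct x86_init_ops_demo x86_init_demo = {
	.probe_roms = probe_roms_default,
};

static void sme_early_init_demo(bool snp_guest)
{
	if (snp_guest)
		x86_init_demo.probe_roms = probe_roms_snp_noop;
}

int main(void)
{
	sme_early_init_demo(true);	/* pretend this is an SNP guest */
	x86_init_demo.probe_roms();	/* callers stay unchanged */
	return 0;
}
<--- C sketch --->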
[ bp: Massage commit message, remove "we"s. ]
Fixes: 9704c07bf9 ("x86/kernel: Validate ROM memory before accessing when SEV-SNP is active")
Signed-off-by: Kevin Loughlin <kevinloughlin@google.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Cc: <stable@kernel.org>
Link: https://lore.kernel.org/r/20240313121546.2964854-1-kevinloughlin@google.com
The commit to improve NMI stall debuggability:
344da544f1 ("x86/nmi: Print reasons why backtrace NMIs are ignored")
... has shown value, but widespread use has also identified a few
opportunities for improvement.
The systems have (as usual) shown far more creativity than that commit's
author, demonstrating yet again that failing CPUs can do whatever they want.
In addition, the current message format is less friendly than one might
like to those attempting to use these messages to identify failing CPUs.
Therefore, separately flag CPUs that, during the full time that the
stack-backtrace request was waiting, were always in an NMI handler,
were never in an NMI handler, or exited one NMI handler.
Also, split the message identifying the CPU and the time since that CPU's
last NMI-related activity so that a single line identifies the CPU without
any other variable information, greatly reducing the processing overhead
required to identify repeat-offender CPUs.
Co-developed-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/ab4d70c8-c874-42dc-b206-643018922393@paulmck-laptop
Add the "ref-cycles" event for AMD processors based on Zen 4 and later
microarchitectures. The backing event is based on PMCx120 which counts
cycles not in halt state in P0 frequency (same as MPERF).
Signed-off-by: Sandipan Das <sandipan.das@amd.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Ian Rogers <irogers@google.com>
Link: https://lore.kernel.org/r/089155f19f7c7e65aeb1caa727a882e2ca9b8b04.1711352180.git.sandipan.das@amd.com
AMD processors based on Zen 2 and later microarchitectures do not
support PMCx087 (instruction pipe stalls) which is used as the backing
event for "stalled-cycles-frontend" and "stalled-cycles-backend".
Use PMCx0A9 (cycles where micro-op queue is empty) instead to count
frontend stalls and remove the entry for backend stalls since there
is no direct replacement.
Signed-off-by: Sandipan Das <sandipan.das@amd.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Ian Rogers <irogers@google.com>
Fixes: 3fe3331bb2 ("perf/x86/amd: Add event map for AMD Family 17h")
Link: https://lore.kernel.org/r/03d7fc8fa2a28f9be732116009025bdec1b3ec97.1711352180.git.sandipan.das@amd.com
The pure EFI stub entry point does not take a struct boot_params from
the boot loader, but creates it from scratch, and populates only the
fields that still have meaning in this context (command line, initrd
base and size, etc)
The original mixed mode implementation used the EFI handover protocol
instead, where the boot loader (i.e., GRUB) populates a boot_params
struct and passes it to a special Linux specific EFI entry point that
takes the boot_params pointer as its third argument.
When the new mixed mode implementation was introduced, using a special
32-bit PE entrypoint in the 64-bit kernel, it adopted the pure approach,
and relied on the EFI stub to create the struct boot_params. This is
preferred because it makes the bootloader side much easier to implement,
as it does not need any x86-specific knowledge on how struct boot_params
and struct setup_header are put together. This mixed mode implementation
was adopted by systemd-boot version 252 and later.
When commit
e2ab9eab32 ("x86/boot/compressed: Move 32-bit entrypoint code into .text section")
refactored this code and moved it out of head_64.S, the fact that ESI
was populated with the address of the base of the image was overlooked,
and to simplify the code flow, ESI is now zeroed and stored to memory
unconditionally in shared code, so that the NULL-ness of that variable
can still be used later to determine which mixed mode boot protocol is
in use.
With ESI pointing to the base of the image, it can serve as a struct
boot_params pointer for startup_32(), which only accesses the init_data
and kernel_alignment fields (and the scratch field as a temporary
stack). Zeroing ESI means that those accesses produce garbage now, even
though things appear to work if the first page of memory happens to be
zeroed, and the region right before LOAD_PHYSICAL_ADDR (== 16 MiB)
happens to be free.
The solution is to pass a special, temporary struct boot_params to
startup_32() via ESI, one that is sufficient for getting it to create
the page tables correctly and is discarded right after. This involves
setting a minimal alignment of 4k, only to get the statically allocated
page tables line up correctly, and setting init_size to the executable
image size (_end - startup_32). This ensures that the page tables are
covered by the static footprint of the PE image.
Given that EFI boot no longer calls the decompressor and no longer pads
the image to permit the decompressor to execute in place, the same
temporary struct boot_params should be used in the EFI handover protocol
based mixed mode implementation as well, to prevent the page tables from
being placed outside of allocated memory.
Fixes: e2ab9eab32 ("x86/boot/compressed: Move 32-bit entrypoint code into .text section")
Cc: <stable@kernel.org> # v6.1+
Closes: https://lore.kernel.org/all/20240321150510.GI8211@craftyguy.net/
Reported-by: Clayton Craft <clayton@craftyguy.net>
Tested-by: Clayton Craft <clayton@craftyguy.net>
Tested-by: Hans de Goede <hdegoede@redhat.com>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
-fsanitize=thread (KCSAN) is at the moment incompatible
with named address spaces in a similar way to KASAN -
see GCC PR sanitizer/111736:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111736
Disable named address spaces when KCSAN is enabled.
Reported-by: kernel test robot <oliver.sang@intel.com>
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20240325110128.615933-1-ubizjak@gmail.com
This reverts commit d794734c9b.
While the original change tried to fix a bug, it also unintentionally broke
existing systems; see the regressions reported at:
https://lore.kernel.org/all/3a1b9909-45ac-4f97-ad68-d16ef1ce99db@pavinjoseph.com/
Since d794734c9b was also marked for -stable, let's back it out before
causing more damage.
Note that due to another upstream change the revert was not 100% automatic:
0a845e0f63 mm/treewide: replace pud_large() with pud_leaf()
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: <stable@vger.kernel.org>
Cc: Russ Anderson <rja@hpe.com>
Cc: Steve Wahl <steve.wahl@hpe.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/all/3a1b9909-45ac-4f97-ad68-d16ef1ce99db@pavinjoseph.com/
Fixes: d794734c9b ("x86/mm/ident_map: Use gbpages only where full GB page should be mapped.")
Currently, the LBR code assumes that LBR Freeze is supported on all processors
when X86_FEATURE_AMD_LBR_V2 is available i.e. CPUID leaf 0x80000022[EAX]
bit 1 is set. This is incorrect as the availability of the feature is
additionally dependent on CPUID leaf 0x80000022[EAX] bit 2 being set,
which may not be set for all Zen 4 processors.
Define a new feature bit for LBR and PMC freeze and set the freeze enable bit
(FLBRI) in DebugCtl (MSR 0x1d9) conditionally.
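A hedged sketch of that conditional enable; the feature flag and constant
names here follow the changelog wording and may not match the final patch
exactly:

  static void amd_lbr_freeze_enable(void)
  {
          u64 dbg_ctl;

          /* Only set FLBRI if CPUID leaf 0x80000022[EAX] bit 2 is present */
          if (!cpu_feature_enabled(X86_FEATURE_AMD_LBR_PMC_FREEZE))
                  return;

          rdmsrl(MSR_IA32_DEBUGCTLMSR, dbg_ctl);
          wrmsrl(MSR_IA32_DEBUGCTLMSR, dbg_ctl | DEBUGCTLMSR_FREEZE_LBRS_ON_PMI);
  }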
It should still be possible to use LBR without freeze for profile-guided
optimization of user programs by using a user-only branch filter during
profiling. When the user-only filter is enabled, branches are no longer
recorded after the transition to CPL 0 upon PMI arrival. When branch
entries are read in the PMI handler, the branch stack does not change.
E.g.
$ perf record -j any,u -e ex_ret_brn_tkn ./workload
Since the feature bit is visible under flags in /proc/cpuinfo, it can be
used to determine the feasibility of use-cases which require LBR Freeze
to be supported by the hardware such as profile-guided optimization of
kernels.
Fixes: ca5b7c0d96 ("perf/x86/amd/lbr: Add LbrExtV2 branch record support")
Signed-off-by: Sandipan Das <sandipan.das@amd.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/69a453c97cfd11c6f2584b19f937fe6df741510f.1711091584.git.sandipan.das@amd.com
- Fix logic that is supposed to prevent placement of the kernel image
below LOAD_PHYSICAL_ADDR
- Use the firmware stack in the EFI stub when running in mixed mode
- Clear BSS only once when using mixed mode
- Check efi.get_variable() function pointer for NULL before trying to
call it
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQQQm/3uucuRGn1Dmh0wbglWLn0tXAUCZgCRgwAKCRAwbglWLn0t
XHozAP9jLdeGs1ReYZAn+W0QtW/SJHJznoPiHcktdNKG4rNX3QD9G3URu0f4jKCG
yvjw8qHM1pC2cihXXjABjf7gL7g6LAE=
=cNP7
-----END PGP SIGNATURE-----
Merge tag 'efi-fixes-for-v6.9-2' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi
Pull EFI fixes from Ard Biesheuvel:
- Fix logic that is supposed to prevent placement of the kernel image
below LOAD_PHYSICAL_ADDR
- Use the firmware stack in the EFI stub when running in mixed mode
- Clear BSS only once when using mixed mode
- Check efi.get_variable() function pointer for NULL before trying to
call it
* tag 'efi-fixes-for-v6.9-2' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi:
efi: fix panic in kdump kernel
x86/efistub: Don't clear BSS twice in mixed mode
x86/efistub: Call mixed mode boot services on the firmware's stack
efi/libstub: fix efi_random_alloc() to allocate memory at alloc_min or higher address
- Ensure that the encryption mask at boot is properly propagated on
5-level page tables, otherwise the PGD entry is incorrectly set to
non-encrypted, which causes system crashes during boot.
- Undo the deferred 5-level page table setup as it cannot work with
memory encryption enabled.
- Prevent inconsistent XFD state on CPU hotplug, where the MSR is reset
to the default value but the cached variable is not, so subsequent
comparisons might yield the wrong result and, as a consequence, prevent
the MSR from being updated.
- Register the local APIC address only once in the MPPARSE enumeration to
prevent triggering the related WARN_ONs() in the APIC and topology code.
- Handle the case where no APIC is found gracefully by registering a fake
APIC in the topology code. That makes all related topology functions
work correctly and does not affect the actual APIC driver code at all.
- Don't evaluate logical IDs during early boot as the local APIC IDs are
not yet enumerated and the invoked function returns an error
code. Nothing requires the logical IDs before the final CPUID
enumeration takes place, which happens after the local APIC enumeration.
- Cure the fallout of the per CPU rework on UP which misplaced the
copying of boot_cpu_data to per CPU data so that the final update to
boot_cpu_data got lost which caused inconsistent state and boot
crashes.
- Use copy_from_kernel_nofault() in the kprobes setup as there is no
guarantee that the address can be safely accessed.
- Reorder struct members in struct saved_context to work around another
kmemleak false positive
- Remove the buggy code which tries to update the E820 kexec table for
setup_data as that is never passed to the kexec kernel.
- Update the resource control documentation to use the proper units.
- Fix a Kconfig warning observed with tinyconfig
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmYAUH4THHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoXzREAC/HVB7yzUEbjbh7dyYRBEgFU19bcyC
JKf9HVmEHj03HstUxF1dxguUhwfHVPNTWpjmy/fRwxqgM9JG+QpV6T4DIldWqchv
AUYFrQBMvql8hTKxRa/Ny75d2IqKPgEEGUuyU+ZHAzEEPwhKrbtVRDPuEiMxpd5I
9B1Pya4EzUyOv1UhPIg7PRoya1msimBZ0mCw4In6ri6xVRm1uC3Ln4LZPylxn96l
f77rz5UToUw0gfgDaezF0z4ml1phGEdSX0Z3hhD0PX12wbJGEdvPzL0qTgEq72Ad
AeLmHx4K8z2zoHMHK7iTEwjoplQxGsWLoezh22cVEEJX0dtzHz6R0ftBCa6uzATJ
C8FF1oDDHAhTL94YmVSTZHr6AdJ6LwgYHO3zXZUhxuB7PNXAT4FmT0zgU1fU3sC1
U/1mIFdgOEUOlGll2Ra5uTUKc0K/dc+yC9dcbz37Kwj3KlfqTN+5BWocjySkHomr
gcv37aU1TJGSC/D1lYWTDWGKVbbP5lk+KIGICT5SBKn0METa/wOo8dE6+T1kIwvS
t2QTlJdzilLcWGVQ8GiNjjRxFtRKY5i9Shi4K+wUvCee4/XJzRrpxrCEY8w/qceV
hc3kfUIon3TCv8+rnlSuNRZBvmFhXMYwMt0gQv4YywB+aOITKTzbGUOazLtRNKAH
lFCnBRS55AB8mg==
=WyQ2
-----END PGP SIGNATURE-----
Merge tag 'x86-urgent-2024-03-24' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Thomas Gleixner:
- Ensure that the encryption mask at boot is properly propagated on
5-level page tables, otherwise the PGD entry is incorrectly set to
non-encrypted, which causes system crashes during boot.
- Undo the deferred 5-level page table setup as it cannot work with
memory encryption enabled.
- Prevent inconsistent XFD state on CPU hotplug, where the MSR is reset
to the default value but the cached variable is not, so subsequent
comparisons might yield the wrong result and, as a consequence, prevent
the MSR from being updated.
- Register the local APIC address only once in the MPPARSE enumeration
to prevent triggering the related WARN_ONs() in the APIC and topology
code.
- Handle the case where no APIC is found gracefully by registering a
fake APIC in the topology code. That makes all related topology
functions work correctly and does not affect the actual APIC driver
code at all.
- Don't evaluate logical IDs during early boot as the local APIC IDs
are not yet enumerated and the invoked function returns an error
code. Nothing requires the logical IDs before the final CPUID
enumeration takes place, which happens after the local APIC enumeration.
- Cure the fallout of the per CPU rework on UP which misplaced the
copying of boot_cpu_data to per CPU data so that the final update to
boot_cpu_data got lost which caused inconsistent state and boot
crashes.
- Use copy_from_kernel_nofault() in the kprobes setup as there is no
guarantee that the address can be safely accessed.
- Reorder struct members in struct saved_context to work around another
kmemleak false positive
- Remove the buggy code which tries to update the E820 kexec table for
setup_data as that is never passed to the kexec kernel.
- Update the resource control documentation to use the proper units.
- Fix a Kconfig warning observed with tinyconfig
* tag 'x86-urgent-2024-03-24' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/boot/64: Move 5-level paging global variable assignments back
x86/boot/64: Apply encryption mask to 5-level pagetable update
x86/cpu: Add model number for another Intel Arrow Lake mobile processor
x86/fpu: Keep xfd_state in sync with MSR_IA32_XFD
Documentation/x86: Document that resctrl bandwidth control units are MiB
x86/mpparse: Register APIC address only once
x86/topology: Handle the !APIC case gracefully
x86/topology: Don't evaluate logical IDs during early boot
x86/cpu: Ensure that CPU info updates are propagated on UP
kprobes/x86: Use copy_from_kernel_nofault() to read from unsafe address
x86/pm: Work around false positive kmemleak report in msr_build_context()
x86/kexec: Do not update E820 kexec table for setup_data
x86/config: Fix warning for 'make ARCH=x86_64 tinyconfig'
Normally, the EFI stub calls into the EFI boot services using the stack
that was live when the stub was entered. According to the UEFI spec,
this stack needs to be at least 128k in size. This might seem large, but
all asynchronous processing and event handling in EFI runs from the same
stack, so quite a lot of space may be used in practice.
In mixed mode, the situation is a bit different: the bootloader calls
the 32-bit EFI stub entry point, which calls the decompressor's 32-bit
entry point, where the boot stack is set up, using a fixed allocation
of 16k. This stack is still in use when the EFI stub is started in
64-bit mode, and so all calls back into the EFI firmware will be using
the decompressor's limited boot stack.
Due to the placement of the boot stack right after the boot heap, any
stack overruns have gone unnoticed. However, commit
5c4feadb0011983b ("x86/decompressor: Move global symbol references to C code")
moved the definition of the boot heap into C code, and now the boot
stack is placed right at the base of BSS, where any overruns will
corrupt the end of the .data section.
While it would be possible to work around this by increasing the size of
the boot stack, doing so would affect all x86 systems, and mixed mode
systems are a tiny (and shrinking) fraction of the x86 installed base.
So instead, record the firmware stack pointer value when entering from
the 32-bit firmware, and switch to this stack every time an EFI boot
service call is made.
Cc: <stable@kernel.org> # v6.1+
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Commit 63bed96604 ("x86/startup_64: Defer assignment of 5-level paging
global variables") moved assignment of 5-level global variables to later
in the boot in order to avoid having to use RIP relative addressing in
order to set them. However, when running with 5-level paging and SME
active (mem_encrypt=on), the variables are needed as part of the page
table setup needed to encrypt the kernel (using pgd_none(), p4d_offset(),
etc.). Since the variables haven't been set, the page table manipulation
is done as if 4-level paging is active, causing the system to crash on
boot.
While only a subset of the assignments that were moved need to be set
early, move all of the assignments back into check_la57_support() so that
these assignments aren't spread between two locations. Instead of just
reverting the fix, this uses the new RIP_REL_REF() macro when assigning
the variables.
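For illustration, the consolidated block in check_la57_support() amounts
to assignments of this shape (abridged; the exact variable list is in the
patch):

  RIP_REL_REF(__pgtable_l5_enabled) = 1;
  RIP_REL_REF(pgdir_shift)          = 52;
  RIP_REL_REF(ptrs_per_p4d)         = 512;
  RIP_REL_REF(page_offset_base)     = __PAGE_OFFSET_BASE_L5;
  RIP_REL_REF(vmalloc_base)         = __VMALLOC_BASE_L5;
  RIP_REL_REF(vmemmap_base)         = __VMEMMAP_BASE_L5;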
Fixes: 63bed96604 ("x86/startup_64: Defer assignment of 5-level paging global variables")
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
Link: https://lore.kernel.org/r/2ca419f4d0de719926fd82353f6751f717590a86.1711122067.git.thomas.lendacky@amd.com
When running with 5-level page tables, the kernel mapping PGD entry is
updated to point to the P4D table. The assignment uses _PAGE_TABLE_NOENC,
which, when SME is active (mem_encrypt=on), results in a page table
entry without the encryption mask set, causing the system to crash on
boot.
Change the assignment to use _PAGE_TABLE instead of _PAGE_TABLE_NOENC so
that the encryption mask is set for the PGD entry.
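In effect, the early PGD update changes along these lines (illustrative;
the exact expression in the boot code may differ). _PAGE_TABLE is
_PAGE_TABLE_NOENC plus _PAGE_ENC, so the SME encryption mask ends up in
the entry:

  /* was: ... = (pgdval_t)p4d | _PAGE_TABLE_NOENC; */
  pgd[pgd_index(__START_KERNEL_map)] = (pgdval_t)p4d | _PAGE_TABLE;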
Fixes: 533568e06b ("x86/boot/64: Use RIP_REL_REF() to access early_top_pgt[]")
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
Link: https://lore.kernel.org/r/8f20345cda7dbba2cf748b286e1bc00816fe649a.1711122067.git.thomas.lendacky@amd.com
Commit 672365477a ("x86/fpu: Update XFD state where required") and
commit 8bf26758ca ("x86/fpu: Add XFD state to fpstate") introduced a
per CPU variable xfd_state to keep the MSR_IA32_XFD value cached, in
order to avoid unnecessary writes to the MSR.
On CPU hotplug, MSR_IA32_XFD is reset to init_fpstate.xfd, which
wipes out any stale state. But the per CPU cached xfd value is not
reset, which brings them out of sync.
As a consequence a subsequent xfd_update_state() might fail to update
the MSR which in turn can result in XRSTOR raising a #NM in kernel
space, which crashes the kernel.
To fix this, introduce xfd_set_state() to write xfd_state together
with MSR_IA32_XFD, and use it in all places that set MSR_IA32_XFD.
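A minimal sketch of such a helper, assuming the per-CPU cache keeps its
current name:

  static void xfd_set_state(u64 xfd)
  {
          wrmsrl(MSR_IA32_XFD, xfd);
          __this_cpu_write(xfd_state, xfd);
  }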
Fixes: 672365477a ("x86/fpu: Update XFD state where required")
Signed-off-by: Adamos Ttofari <attofari@amazon.de>
Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20240322230439.456571-1-chang.seok.bae@intel.com
Closes: https://lore.kernel.org/lkml/20230511152818.13839-1-attofari@amazon.de
The APIC address is registered twice. First during the early detection and
afterwards when actually scanning the table for APIC IDs. The APIC and
topology core warn about the second attempt.
Restrict it to the early detection call.
Fixes: 81287ad65d ("x86/apic: Sanitize APIC address setup")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Link: https://lore.kernel.org/r/20240322185305.297774848@linutronix.de
If there is no local APIC enumerated and registered then the topology
bitmaps are empty. Therefore, topology_init_possible_cpus() will die with
a division by zero exception.
Prevent this by registering a fake APIC id to populate the topology
bitmap. This also allows all topology query interfaces to be used
unconditionally. It does not affect the actual APIC code because either
the local APIC address was not registered or no local APIC could be
detected.
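A hedged sketch of that fallback; the guard and helper signature below
are assumptions rather than the literal patch:

  /* Nothing was enumerated: register a fake boot CPU APIC so that the
   * topology bitmaps are never empty. */
  if (!topo_info.nr_assigned_cpus)
          topology_register_apic(0, CPU_ACPIID_INVALID, true);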
Fixes: f1f758a805 ("x86/topology: Add a mechanism to track topology via APIC IDs")
Reported-by: Guenter Roeck <linux@roeck-us.net>
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Link: https://lore.kernel.org/r/20240322185305.242709302@linutronix.de