linux

mirror of https://github.com/torvalds/linux.git synced 2024-11-29 15:41:36 +00:00

Author	SHA1	Message	Date
Tiwei Bie	3144013e48	um: Fix the -Wmissing-prototypes warning for get_thread_reg The get_thread_reg function is defined in the user code, and is called by the kernel code. It should be declared in a shared header. Fixes: `dbba7f704a` ("um: stop polluting the namespace with registers.h contents") Signed-off-by: Tiwei Bie <tiwei.btw@antgroup.com> Signed-off-by: Richard Weinberger <richard@nod.at>	2024-04-30 14:15:17 +02:00
Tiwei Bie	323ced9669	um: Fix -Wmissing-prototypes warnings for (rt_)sigreturn Use SYSCALL_DEFINE0 to define (rt_)sigreturn. This will address below -Wmissing-prototypes warnings: arch/x86/um/signal.c:453:6: warning: no previous prototype for ‘sys_sigreturn’ [-Wmissing-prototypes] arch/x86/um/signal.c:560:6: warning: no previous prototype for ‘sys_rt_sigreturn’ [-Wmissing-prototypes] Signed-off-by: Tiwei Bie <tiwei.btw@antgroup.com> Signed-off-by: Richard Weinberger <richard@nod.at>	2024-04-30 14:13:38 +02:00
Thomas Gleixner	720a22fd6c	x86/apic: Don't access the APIC when disabling x2APIC With 'iommu=off' on the kernel command line and x2APIC enabled by the BIOS the code which disables the x2APIC triggers an unchecked MSR access error: RDMSR from 0x802 at rIP: 0xffffffff94079992 (native_apic_msr_read+0x12/0x50) This is happens because default_acpi_madt_oem_check() selects an x2APIC driver before the x2APIC is disabled. When the x2APIC is disabled because interrupt remapping cannot be enabled due to 'iommu=off' on the command line, x2apic_disable() invokes apic_set_fixmap() which in turn tries to read the APIC ID. This triggers the MSR warning because x2APIC is disabled, but the APIC driver is still x2APIC based. Prevent that by adding an argument to apic_set_fixmap() which makes the APIC ID read out conditional and set it to false from the x2APIC disable path. That's correct as the APIC ID has already been read out during early discovery. Fixes: `d10a904435` ("x86/apic: Consolidate boot_cpu_physical_apicid initialization sites") Reported-by: Adrian Huang <ahuang12@lenovo.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Signed-off-by: Ingo Molnar <mingo@kernel.org> Tested-by: Adrian Huang <ahuang12@lenovo.com> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/875xw5t6r7.ffs@tglx	2024-04-30 07:51:34 +02:00
Jacob Pan	be9be07b22	iommu/vt-d: Make posted MSI an opt-in command line option Add a command line opt-in option for posted MSI if CONFIG_X86_POSTED_MSI=y. Also introduce a helper function for testing if posted MSI is supported on the platform. Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240423174114.526704-12-jacob.jun.pan@linux.intel.com	2024-04-30 00:54:43 +02:00
Jacob Pan	ce0a928711	x86/irq: Extend checks for pending vectors to posted interrupts During interrupt affinity change, it is possible to have interrupts delivered to the old CPU after the affinity has changed to the new one. To prevent lost interrupts, local APIC IRR is checked on the old CPU. Similar checks must be done for posted MSIs given the same reason. Consider the following scenario: Device system agent iommu memory CPU/LAPIC 1 FEEX_XXXX 2 Interrupt request 3 Fetch IRTE -> 4 ->Atomic Swap PID.PIR(vec) Push to Global Observable(GO) 5 if (ON) done; else 6 send a notification -> * ON: outstanding notification, 1 will suppress new notifications If the affinity change happens between 3 and 5 in the IOMMU, the old CPU's posted interrupt request (PIR) could have the pending bit set for the vector being moved. Add a helper function to check individual vector status. Then use the helper to check for pending interrupts on the source CPU's PID. Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240423174114.526704-11-jacob.jun.pan@linux.intel.com	2024-04-30 00:54:43 +02:00
Jacob Pan	fef05a078b	x86/irq: Factor out common code for checking pending interrupts Use a common function for checking pending interrupt vector in APIC IRR instead of duplicated open coding them. Additional checks for posted MSI vectors can then be contained in this function. Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240423174114.526704-10-jacob.jun.pan@linux.intel.com	2024-04-30 00:54:43 +02:00
Jacob Pan	1b03d82ba1	x86/irq: Install posted MSI notification handler All MSI vectors are multiplexed into a single notification vector when posted MSI is enabled. It is the responsibility of the notification vector handler to demultiplex MSI vectors. In the handler the MSI vector handlers are dispatched without IDT delivery for each pending MSI interrupt. For example, the interrupt flow will change as follows: (3 MSIs of different vectors arrive in a a high frequency burst) BEFORE: interrupt(MSI) irq_enter() handler() /* EOI / irq_exit() process_softirq() interrupt(MSI) irq_enter() handler() / EOI / irq_exit() process_softirq() interrupt(MSI) irq_enter() handler() / EOI / irq_exit() process_softirq() AFTER: interrupt / Posted MSI notification vector */ irq_enter() atomic_xchg(PIR) handler() handler() handler() pi_clear_on() apic_eoi() irq_exit() process_softirq() Except for the leading MSI, CPU notifications are skipped/coalesced. For MSIs which arrive at a low frequency, the demultiplexing loop does not wait for more interrupts to coalesce. Therefore, there's no additional latency other than the processing time. Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240423174114.526704-9-jacob.jun.pan@linux.intel.com	2024-04-30 00:54:42 +02:00
Jacob Pan	6087c7f36a	x86/irq: Factor out handler invocation from common_interrupt() Prepare for calling external interrupt handlers directly from the posted MSI demultiplexing loop. Extract the common code from common_interrupt() to avoid code duplication. Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240423174114.526704-8-jacob.jun.pan@linux.intel.com	2024-04-30 00:54:42 +02:00
Jacob Pan	43650dcf6d	x86/irq: Set up per host CPU posted interrupt descriptors To support posted MSIs, create a posted interrupt descriptor (PID) for each host CPU. Later on, when setting up interrupt affinity, the IOMMU's interrupt remapping table entry (IRTE) will point to the physical address of the matching CPU's PID. Each PID is initialized with the owner CPU's physical APICID as the destination. Originally-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240423174114.526704-7-jacob.jun.pan@linux.intel.com	2024-04-30 00:54:42 +02:00
Jacob Pan	f5a3562ec9	x86/irq: Reserve a per CPU IDT vector for posted MSIs When posted MSI is enabled, all device MSIs are multiplexed into a single notification vector. MSI handlers will be de-multiplexed at run-time by system software without IDT delivery. Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240423174114.526704-6-jacob.jun.pan@linux.intel.com	2024-04-30 00:54:42 +02:00
Jacob Pan	7fec07fd21	x86/irq: Add a Kconfig option for posted MSI This option will be used to support delivering MSIs as posted interrupts. Interrupt remapping is required. Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240423174114.526704-5-jacob.jun.pan@linux.intel.com	2024-04-30 00:54:42 +02:00
Jacob Pan	2254808b53	x86/irq: Remove bitfields in posted interrupt descriptor Mixture of bitfields and types is weird and really not intuitive, remove bitfields and use typed data exclusively. Bitfields often result in inferior machine code. Suggested-by: Sean Christopherson <seanjc@google.com> Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240423174114.526704-4-jacob.jun.pan@linux.intel.com Link: https://lore.kernel.org/all/20240404101735.402feec8@jacob-builder/T/#mf66e34a82a48f4d8e2926b5581eff59a122de53a	2024-04-30 00:54:42 +02:00
Jacob Pan	4ec8fd0371	x86/irq: Unionize PID.PIR for 64bit access w/o casting Make the PIR field into u64 such that atomic xchg64 can be used without ugly casting. Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240423174114.526704-3-jacob.jun.pan@linux.intel.com	2024-04-30 00:54:42 +02:00
Jacob Pan	699f67512f	KVM: VMX: Move posted interrupt descriptor out of VMX code To prepare native usage of posted interrupts, move the PID declarations out of VMX code such that they can be shared. Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/r/20240423174114.526704-2-jacob.jun.pan@linux.intel.com	2024-04-30 00:54:42 +02:00
Daniel J Blueman	455f9075f1	x86/tsc: Trust initial offset in architectural TSC-adjust MSRs When the BIOS configures the architectural TSC-adjust MSRs on secondary sockets to correct a constant inter-chassis offset, after Linux brings the cores online, the TSC sync check later resets the core-local MSR to 0, triggering HPET fallback and leading to performance loss. Fix this by unconditionally using the initial adjust values read from the MSRs. Trusting the initial offsets in this architectural mechanism is a better approach than special-casing workarounds for specific platforms. Signed-off-by: Daniel J Blueman <daniel@quora.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Steffen Persvold <sp@numascale.com> Reviewed-by: James Cleverdon <james.cleverdon.external@eviden.com> Reviewed-by: Dimitri Sivanich <sivanich@hpe.com> Reviewed-by: Prarit Bhargava <prarit@redhat.com> Link: https://lore.kernel.org/r/20240419085146.175665-1-daniel@quora.org	2024-04-29 23:27:16 +02:00
Jakub Kicinski	89de2db193	bpf-next-for-netdev -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZi9+AAAKCRDbK58LschI g0nEAP487m7L0nLVriC2oIOWsi29tklW3etm6DO7gmGRGIHgrgEAnMyV1xBj3bGj v6jJwDcybCym1hLx+1x1JCZ4eoAFswE= =xbna -----END PGP SIGNATURE----- Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Daniel Borkmann says: ==================== pull-request: bpf-next 2024-04-29 We've added 147 non-merge commits during the last 32 day(s) which contain a total of 158 files changed, 9400 insertions(+), 2213 deletions(-). The main changes are: 1) Add an internal-only BPF per-CPU instruction for resolving per-CPU memory addresses and implement support in x86 BPF JIT. This allows inlining per-CPU array and hashmap lookups and the bpf_get_smp_processor_id() helper, from Andrii Nakryiko. 2) Add BPF link support for sk_msg and sk_skb programs, from Yonghong Song. 3) Optimize x86 BPF JIT's emit_mov_imm64, and add support for various atomics in bpf_arena which can be JITed as a single x86 instruction, from Alexei Starovoitov. 4) Add support for passing mark with bpf_fib_lookup helper, from Anton Protopopov. 5) Add a new bpf_wq API for deferring events and refactor sleepable bpf_timer code to keep common code where possible, from Benjamin Tissoires. 6) Fix BPF_PROG_TEST_RUN infra with regards to bpf_dummy_struct_ops programs to check when NULL is passed for non-NULLable parameters, from Eduard Zingerman. 7) Harden the BPF verifier's and/or/xor value tracking, from Harishankar Vishwanathan. 8) Introduce crypto kfuncs to make BPF programs able to utilize the kernel crypto subsystem, from Vadim Fedorenko. 9) Various improvements to the BPF instruction set standardization doc, from Dave Thaler. 10) Extend libbpf APIs to partially consume items from the BPF ringbuffer, from Andrea Righi. 11) Bigger batch of BPF selftests refactoring to use common network helpers and to drop duplicate code, from Geliang Tang. 12) Support bpf_tail_call_static() helper for BPF programs with GCC 13, from Jose E. Marchesi. 13) Add bpf_preempt_{disable,enable}() kfuncs in order to allow a BPF program to have code sections where preemption is disabled, from Kumar Kartikeya Dwivedi. 14) Allow invoking BPF kfuncs from BPF_PROG_TYPE_SYSCALL programs, from David Vernet. 15) Extend the BPF verifier to allow different input maps for a given bpf_for_each_map_elem() helper call in a BPF program, from Philo Lu. 16) Add support for PROBE_MEM32 and bpf_addr_space_cast instructions for riscv64 and arm64 JITs to enable BPF Arena, from Puranjay Mohan. 17) Shut up a false-positive KMSAN splat in interpreter mode by unpoison the stack memory, from Martin KaFai Lau. 18) Improve xsk selftest coverage with new tests on maximum and minimum hardware ring size configurations, from Tushar Vyavahare. 19) Various ReST man pages fixes as well as documentation and bash completion improvements for bpftool, from Rameez Rehman & Quentin Monnet. 20) Fix libbpf with regards to dumping subsequent char arrays, from Quentin Deslandes. * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (147 commits) bpf, docs: Clarify PC use in instruction-set.rst bpf_helpers.h: Define bpf_tail_call_static when building with GCC bpf, docs: Add introduction for use in the ISA Internet Draft selftests/bpf: extend BPF_SOCK_OPS_RTT_CB test for srtt and mrtt_us bpf: add mrtt and srtt as BPF_SOCK_OPS_RTT_CB args selftests/bpf: dummy_st_ops should reject 0 for non-nullable params bpf: check bpf_dummy_struct_ops program params for test runs selftests/bpf: do not pass NULL for non-nullable params in dummy_st_ops selftests/bpf: adjust dummy_st_ops_success to detect additional error bpf: mark bpf_dummy_struct_ops.test_1 parameter as nullable selftests/bpf: Add ring_buffer__consume_n test. bpf: Add bpf_guard_preempt() convenience macro selftests: bpf: crypto: add benchmark for crypto functions selftests: bpf: crypto skcipher algo selftests bpf: crypto: add skcipher to bpf crypto bpf: make common crypto API for TC/XDP programs bpf: update the comment for BTF_FIELDS_MAX selftests/bpf: Fix wq test. selftests/bpf: Use make_sockaddr in test_sock_addr selftests/bpf: Use connect_to_addr in test_sock_addr ... ==================== Link: https://lore.kernel.org/r/20240429131657.19423-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-04-29 13:12:19 -07:00
Ashish Kalra	400fea4b96	x86/sev: Add callback to apply RMP table fixups for kexec Handle cases where the RMP table placement in the BIOS is not 2M aligned and the kexec-ed kernel could try to allocate from within that chunk which then causes a fatal RMP fault. The kexec failure is illustrated below: SEV-SNP: RMP table physical range [0x0000007ffe800000 - 0x000000807f0fffff] BIOS-provided physical RAM map: BIOS-e820: [mem 0x0000000000000000-0x000000000008efff] usable BIOS-e820: [mem 0x000000000008f000-0x000000000008ffff] ACPI NVS ... BIOS-e820: [mem 0x0000004080000000-0x0000007ffe7fffff] usable BIOS-e820: [mem 0x0000007ffe800000-0x000000807f0fffff] reserved BIOS-e820: [mem 0x000000807f100000-0x000000807f1fefff] usable As seen here in the e820 memory map, the end range of the RMP table is not aligned to 2MB and not reserved but it is usable as RAM. Subsequently, kexec -s (KEXEC_FILE_LOAD syscall) loads it's purgatory code and boot_param, command line and other setup data into this RAM region as seen in the kexec logs below, which leads to fatal RMP fault during kexec boot. Loaded purgatory at 0x807f1fa000 Loaded boot_param, command line and misc at 0x807f1f8000 bufsz=0x1350 memsz=0x2000 Loaded 64bit kernel at 0x7ffae00000 bufsz=0xd06200 memsz=0x3894000 Loaded initrd at 0x7ff6c89000 bufsz=0x4176014 memsz=0x4176014 E820 memmap: 0000000000000000-000000000008efff (1) 000000000008f000-000000000008ffff (4) 0000000000090000-000000000009ffff (1) ... 0000004080000000-0000007ffe7fffff (1) 0000007ffe800000-000000807f0fffff (2) 000000807f100000-000000807f1fefff (1) 000000807f1ff000-000000807fffffff (2) nr_segments = 4 segment[0]: buf=0x00000000e626d1a2 bufsz=0x4000 mem=0x807f1fa000 memsz=0x5000 segment[1]: buf=0x0000000029c67bd6 bufsz=0x1350 mem=0x807f1f8000 memsz=0x2000 segment[2]: buf=0x0000000045c60183 bufsz=0xd06200 mem=0x7ffae00000 memsz=0x3894000 segment[3]: buf=0x000000006e54f08d bufsz=0x4176014 mem=0x7ff6c89000 memsz=0x4177000 kexec_file_load: type:0, start:0x807f1fa150 head:0x1184d0002 flags:0x0 Check if RMP table start and end physical range in the e820 tables are not aligned to 2MB and in that case map this range to reserved in all the three e820 tables. [ bp: Massage. ] Fixes: `c3b86e61b7` ("x86/cpufeatures: Enable/unmask SEV-SNP CPU feature") Signed-off-by: Ashish Kalra <ashish.kalra@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/df6e995ff88565262c2c7c69964883ff8aa6fc30.1714090302.git.ashish.kalra@amd.com	2024-04-29 11:21:09 +02:00
Ashish Kalra	d6d85ac15c	x86/e820: Add a new e820 table update helper Add a new API helper e820__range_update_table() with which to update an arbitrary e820 table. Move all current users of e820__range_update_kexec() to this new helper. [ bp: Massage. ] Signed-off-by: Ashish Kalra <ashish.kalra@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/b726af213ad55053f8a7a1e793b01bb3f1ca9dd5.1714090302.git.ashish.kalra@amd.com	2024-04-29 11:15:31 +02:00
Tony Luck	2eda374e88	x86/mm: Switch to new Intel CPU model defines New CPU #defines encode vendor and family as well as model. [ dhansen: vertically align 0's in invlpg_miss_ids[] ] Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/all/20240424181518.41946-1-tony.luck%40intel.com	2024-04-29 10:31:36 +02:00
Tony Luck	d9b6886cd7	x86/tsc_msr: Switch to new Intel CPU model defines New CPU #defines encode vendor and family as well as model. Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/all/20240424181518.41927-1-tony.luck%40intel.com	2024-04-29 10:31:34 +02:00
Tony Luck	f21b075b67	x86/tsc: Switch to new Intel CPU model defines New CPU #defines encode vendor and family as well as model. Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/all/20240424181517.41907-1-tony.luck%40intel.com	2024-04-29 10:31:32 +02:00
Tony Luck	4db64279bc	x86/cpu: Switch to new Intel CPU model defines New CPU #defines encode vendor and family as well as model. [ dhansen: vertically align macro and remove stray subject / ] Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/all/20240424181516.41887-1-tony.luck%40intel.com	2024-04-29 10:31:30 +02:00
Tony Luck	db99675e43	x86/resctrl: Switch to new Intel CPU model defines New CPU #defines encode vendor and family as well as model. [ bp: Squash two resctrl patches into one. ] Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/all/20240424181514.41848-1-tony.luck%40intel.com	2024-04-29 10:31:28 +02:00
Tony Luck	375a756448	x86/microcode/intel: Switch to new Intel CPU model defines New CPU #defines encode vendor and family as well as model. Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/all/20240424181513.41829-1-tony.luck%40intel.com	2024-04-29 10:31:26 +02:00
Tony Luck	4a5f2dd162	x86/mce: Switch to new Intel CPU model defines New CPU #defines encode vendor and family as well as model. [ bp: Squash three mce patches into one, fold in fix: https://lore.kernel.org/r/20240429022051.63360-1-tony.luck@intel.com ] Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/all/20240424181511.41772-1-tony.luck%40intel.com	2024-04-29 10:31:23 +02:00
Tony Luck	c73cd37221	x86/cpu: Switch to new Intel CPU model defines New CPU #defines encode vendor and family as well as model. Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/all/20240424181511.41753-1-tony.luck%40intel.com	2024-04-29 10:31:19 +02:00
Tony Luck	fe3edc524d	x86/cpu/intel_epb: Switch to new Intel CPU model defines New CPU #defines encode vendor and family as well as model. Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/all/20240424181510.41733-1-tony.luck%40intel.com	2024-04-29 10:31:16 +02:00
Tony Luck	d32bc2111c	x86/aperfmperf: Switch to new Intel CPU model defines New CPU #defines encode vendor and family as well as model. Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/all/20240424181505.41654-1-tony.luck%40intel.com	2024-04-29 10:31:12 +02:00
Tony Luck	8fb5f44e5d	x86/apic: Switch to new Intel CPU model defines New CPU #defines encode vendor and family as well as model. Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/all/20240424181504.41634-1-tony.luck%40intel.com	2024-04-29 10:31:08 +02:00
Tony Luck	add69475de	perf/x86/msr: Switch to new Intel CPU model defines New CPU #defines encode vendor and family as well as model. Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/all/20240424181503.41614-1-tony.luck%40intel.com	2024-04-29 10:31:04 +02:00
Tony Luck	9828a1cff4	perf/x86/intel/uncore: Switch to new Intel CPU model defines New CPU #defines encode vendor and family as well as model. [ bp: Squash three uncore patches into one. ] Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/all/20240424181501.41557-1-tony.luck%40intel.com	2024-04-29 10:30:39 +02:00
Jakub Kicinski	b2ff42c6d3	bpf-for-netdev -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZiwdfQAKCRDbK58LschI g1oqAP9mjayeIHCfYMQZa2eevy1PmVlgdNdFdMDWZFS/pHv9cgD/ZdmGzbUDKCAQ Y/KiTajitZw3kxtHX45v8/Ugtlsh9Qg= =Ewiw -----END PGP SIGNATURE----- Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf Daniel Borkmann says: ==================== pull-request: bpf 2024-04-26 We've added 12 non-merge commits during the last 22 day(s) which contain a total of 14 files changed, 168 insertions(+), 72 deletions(-). The main changes are: 1) Fix BPF_PROBE_MEM in verifier and JIT to skip loads from vsyscall page, from Puranjay Mohan. 2) Fix a crash in XDP with devmap broadcast redirect when the latter map is in process of being torn down, from Toke Høiland-Jørgensen. 3) Fix arm64 and riscv64 BPF JITs to properly clear start time for BPF program runtime stats, from Xu Kuohai. 4) Fix a sockmap KCSAN-reported data race in sk_psock_skb_ingress_enqueue, from Jason Xing. 5) Fix BPF verifier error message in resolve_pseudo_ldimm64, from Anton Protopopov. 6) Fix missing DEBUG_INFO_BTF_MODULES Kconfig menu item, from Andrii Nakryiko. * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: selftests/bpf: Test PROBE_MEM of VSYSCALL_ADDR on x86-64 bpf, x86: Fix PROBE_MEM runtime load check bpf: verifier: prevent userspace memory access xdp: use flags field to disambiguate broadcast redirect arm32, bpf: Reimplement sign-extension mov instruction riscv, bpf: Fix incorrect runtime stats bpf, arm64: Fix incorrect runtime stats bpf: Fix a verifier verbose message bpf, skmsg: Fix NULL pointer dereference in sk_psock_skb_ingress_enqueue MAINTAINERS: bpf: Add Lehui and Puranjay as riscv64 reviewers MAINTAINERS: Update email address for Puranjay Mohan bpf, kconfig: Fix DEBUG_INFO_BTF_MODULES Kconfig definition ==================== Link: https://lore.kernel.org/r/20240426224248.26197-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-04-26 17:36:53 -07:00
Puranjay Mohan	b599d7d26d	bpf, x86: Fix PROBE_MEM runtime load check When a load is marked PROBE_MEM - e.g. due to PTR_UNTRUSTED access - the address being loaded from is not necessarily valid. The BPF jit sets up exception handlers for each such load which catch page faults and 0 out the destination register. If the address for the load is outside kernel address space, the load will escape the exception handling and crash the kernel. To prevent this from happening, the emits some instruction to verify that addr is > end of userspace addresses. x86 has a legacy vsyscall ABI where a page at address 0xffffffffff600000 is mapped with user accessible permissions. The addresses in this page are considered userspace addresses by the fault handler. Therefore, a BPF program accessing this page will crash the kernel. This patch fixes the runtime checks to also check that the PROBE_MEM address is below VSYSCALL_ADDR. Example BPF program: SEC("fentry/tcp_v4_connect") int BPF_PROG(fentry_tcp_v4_connect, struct sock sk) { (volatile unsigned long )&sk->sk_tsq_flags; return 0; } BPF Assembly: 0: (79) r1 = (u64 )(r1 +0) 1: (79) r1 = (u64 *)(r1 +344) 2: (b7) r0 = 0 3: (95) exit x86-64 JIT ========== BEFORE AFTER ------ ----- 0: nopl 0x0(%rax,%rax,1) 0: nopl 0x0(%rax,%rax,1) 5: xchg %ax,%ax 5: xchg %ax,%ax 7: push %rbp 7: push %rbp 8: mov %rsp,%rbp 8: mov %rsp,%rbp b: mov 0x0(%rdi),%rdi b: mov 0x0(%rdi),%rdi ------------------------------------------------------------------------------- f: movabs $0x100000000000000,%r11 f: movabs $0xffffffffff600000,%r10 19: add $0x2a0,%rdi 19: mov %rdi,%r11 20: cmp %r11,%rdi 1c: add $0x2a0,%r11 23: jae 0x0000000000000029 23: sub %r10,%r11 25: xor %edi,%edi 26: movabs $0x100000000a00000,%r10 27: jmp 0x000000000000002d 30: cmp %r10,%r11 29: mov 0x0(%rdi),%rdi 33: ja 0x0000000000000039 --------------------------------\ 35: xor %edi,%edi 2d: xor %eax,%eax \ 37: jmp 0x0000000000000040 2f: leave \ 39: mov 0x2a0(%rdi),%rdi 30: ret \-------------------------------------------- 40: xor %eax,%eax 42: leave 43: ret Signed-off-by: Puranjay Mohan <puranjay@kernel.org> Link: https://lore.kernel.org/r/20240424100210.11982-3-puranjay@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2024-04-26 09:45:18 -07:00
Puranjay Mohan	66e13b615a	bpf: verifier: prevent userspace memory access With BPF_PROBE_MEM, BPF allows de-referencing an untrusted pointer. To thwart invalid memory accesses, the JITs add an exception table entry for all such accesses. But in case the src_reg + offset is a userspace address, the BPF program might read that memory if the user has mapped it. Make the verifier add guard instructions around such memory accesses and skip the load if the address falls into the userspace region. The JITs need to implement bpf_arch_uaddress_limit() to define where the userspace addresses end for that architecture or TASK_SIZE is taken as default. The implementation is as follows: REG_AX = SRC_REG if(offset) REG_AX += offset; REG_AX >>= 32; if (REG_AX <= (uaddress_limit >> 32)) DST_REG = 0; else DST_REG = (size )(SRC_REG + offset); Comparing just the upper 32 bits of the load address with the upper 32 bits of uaddress_limit implies that the values are being aligned down to a 4GB boundary before comparison. The above means that all loads with address <= uaddress_limit + 4GB are skipped. This is acceptable because there is a large hole (much larger than 4GB) between userspace and kernel space memory, therefore a correctly functioning BPF program should not access this 4GB memory above the userspace. Let's analyze what this patch does to the following fentry program dereferencing an untrusted pointer: SEC("fentry/tcp_v4_connect") int BPF_PROG(fentry_tcp_v4_connect, struct sock sk) { (volatile long )sk; return 0; } BPF Program before \| BPF Program after ------------------ \| ----------------- 0: (79) r1 = (u64 )(r1 +0) 0: (79) r1 = (u64 )(r1 +0) ----------------------------------------------------------------------- 1: (79) r1 = (u64 )(r1 +0) --\ 1: (bf) r11 = r1 ----------------------------\ \ 2: (77) r11 >>= 32 2: (b7) r0 = 0 \ \ 3: (b5) if r11 <= 0x8000 goto pc+2 3: (95) exit \ \-> 4: (79) r1 = (u64 *)(r1 +0) \ 5: (05) goto pc+1 \ 6: (b7) r1 = 0 \-------------------------------------- 7: (b7) r0 = 0 8: (95) exit As you can see from above, in the best case (off=0), 5 extra instructions are emitted. Now, we analyze the same program after it has gone through the JITs of ARM64 and RISC-V architectures. We follow the single load instruction that has the untrusted pointer and see what instrumentation has been added around it. x86-64 JIT ========== JIT's Instrumentation (upstream) --------------------- 0: nopl 0x0(%rax,%rax,1) 5: xchg %ax,%ax 7: push %rbp 8: mov %rsp,%rbp b: mov 0x0(%rdi),%rdi --------------------------------- f: movabs $0x800000000000,%r11 19: cmp %r11,%rdi 1c: jb 0x000000000000002a 1e: mov %rdi,%r11 21: add $0x0,%r11 28: jae 0x000000000000002e 2a: xor %edi,%edi 2c: jmp 0x0000000000000032 2e: mov 0x0(%rdi),%rdi --------------------------------- 32: xor %eax,%eax 34: leave 35: ret The x86-64 JIT already emits some instructions to protect against user memory access. This patch doesn't make any changes for the x86-64 JIT. ARM64 JIT ========= No Intrumentation Verifier's Instrumentation (upstream) (This patch) ----------------- -------------------------- 0: add x9, x30, #0x0 0: add x9, x30, #0x0 4: nop 4: nop 8: paciasp 8: paciasp c: stp x29, x30, [sp, #-16]! c: stp x29, x30, [sp, #-16]! 10: mov x29, sp 10: mov x29, sp 14: stp x19, x20, [sp, #-16]! 14: stp x19, x20, [sp, #-16]! 18: stp x21, x22, [sp, #-16]! 18: stp x21, x22, [sp, #-16]! 1c: stp x25, x26, [sp, #-16]! 1c: stp x25, x26, [sp, #-16]! 20: stp x27, x28, [sp, #-16]! 20: stp x27, x28, [sp, #-16]! 24: mov x25, sp 24: mov x25, sp 28: mov x26, #0x0 28: mov x26, #0x0 2c: sub x27, x25, #0x0 2c: sub x27, x25, #0x0 30: sub sp, sp, #0x0 30: sub sp, sp, #0x0 34: ldr x0, [x0] 34: ldr x0, [x0] -------------------------------------------------------------------------------- 38: ldr x0, [x0] ----------\ 38: add x9, x0, #0x0 -----------------------------------\\ 3c: lsr x9, x9, #32 3c: mov x7, #0x0 \\ 40: cmp x9, #0x10, lsl #12 40: mov sp, sp \\ 44: b.ls 0x0000000000000050 44: ldp x27, x28, [sp], #16 \\--> 48: ldr x0, [x0] 48: ldp x25, x26, [sp], #16 \ 4c: b 0x0000000000000054 4c: ldp x21, x22, [sp], #16 \ 50: mov x0, #0x0 50: ldp x19, x20, [sp], #16 \--------------------------------------- 54: ldp x29, x30, [sp], #16 54: mov x7, #0x0 58: add x0, x7, #0x0 58: mov sp, sp 5c: autiasp 5c: ldp x27, x28, [sp], #16 60: ret 60: ldp x25, x26, [sp], #16 64: nop 64: ldp x21, x22, [sp], #16 68: ldr x10, 0x0000000000000070 68: ldp x19, x20, [sp], #16 6c: br x10 6c: ldp x29, x30, [sp], #16 70: add x0, x7, #0x0 74: autiasp 78: ret 7c: nop 80: ldr x10, 0x0000000000000088 84: br x10 There are 6 extra instructions added in ARM64 in the best case. This will become 7 in the worst case (off != 0). RISC-V JIT (RISCV_ISA_C Disabled) ========== No Intrumentation Verifier's Instrumentation (upstream) (This patch) ----------------- -------------------------- 0: nop 0: nop 4: nop 4: nop 8: li a6, 33 8: li a6, 33 c: addi sp, sp, -16 c: addi sp, sp, -16 10: sd s0, 8(sp) 10: sd s0, 8(sp) 14: addi s0, sp, 16 14: addi s0, sp, 16 18: ld a0, 0(a0) 18: ld a0, 0(a0) --------------------------------------------------------------- 1c: ld a0, 0(a0) --\ 1c: mv t0, a0 --------------------------\ \ 20: srli t0, t0, 32 20: li a5, 0 \ \ 24: lui t1, 4096 24: ld s0, 8(sp) \ \ 28: sext.w t1, t1 28: addi sp, sp, 16 \ \ 2c: bgeu t1, t0, 12 2c: sext.w a0, a5 \ \--> 30: ld a0, 0(a0) 30: ret \ 34: j 8 \ 38: li a0, 0 \------------------------------ 3c: li a5, 0 40: ld s0, 8(sp) 44: addi sp, sp, 16 48: sext.w a0, a5 4c: ret There are 7 extra instructions added in RISC-V. Fixes: `8008342853` ("bpf, arm64: Add BPF exception tables") Reported-by: Breno Leitao <leitao@debian.org> Suggested-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Ilya Leoshkevich <iii@linux.ibm.com> Signed-off-by: Puranjay Mohan <puranjay12@gmail.com> Link: https://lore.kernel.org/r/20240424100210.11982-2-puranjay@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2024-04-26 09:45:18 -07:00
Eric Biggers	ed265f7fd9	crypto: x86/aes-gcm - simplify GCM hash subkey derivation Remove a redundant expansion of the AES key, and use rodata for zeroes. Also rename rfc4106_set_hash_subkey() to aes_gcm_derive_hash_subkey() because it's used for both versions of AES-GCM, not just RFC4106. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2024-04-26 17:26:10 +08:00
Eric Biggers	a0bbb1c187	crypto: x86/aes-gcm - delete unused GCM assembly code Delete aesni_gcm_enc() and aesni_gcm_dec() because they are unused. Only the incremental AES-GCM functions (aesni_gcm_init(), aesni_gcm_enc_update(), aesni_gcm_finalize()) are actually used. This saves 17 KB of object code. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2024-04-26 17:26:10 +08:00
Eric Biggers	6a80586474	crypto: x86/aes-xts - simplify loop in xts_crypt_slowpath() Since the total length processed by the loop in xts_crypt_slowpath() is a multiple of AES_BLOCK_SIZE, just round the length down to AES_BLOCK_SIZE even on the last step. This doesn't change behavior, as the last step will process a multiple of AES_BLOCK_SIZE regardless. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2024-04-26 17:26:10 +08:00
Alexander Potapenko	61b258b0d2	x86: call instrumentation hooks from copy_mc.c Memory accesses in copy_mc_to_kernel() and copy_mc_to_user() are performed by assembly routines and are invisible to KASAN, KCSAN, and KMSAN. Add hooks from instrumentation.h to tell the tools these functions have memcpy/copy_from_user semantics. The call to copy_mc_fragile() in copy_mc_fragile_handle_tail() is left intact, because the latter is only called from the assembly implementation of copy_mc_fragile(), so the memory accesses in it are covered by the instrumentation in copy_mc_to_kernel() and copy_mc_to_user(). Link: https://lore.kernel.org/all/3b7dbd88-0861-4638-b2d2-911c97a4cadf@I-love.SAKURA.ne.jp/ Link: https://lkml.kernel.org/r/20240320101851.2589698-3-glider@google.com Signed-off-by: Alexander Potapenko <glider@google.com> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Marco Elver <elver@google.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-04-25 21:07:02 -07:00
David Hildenbrand	25176ad09c	mm/treewide: rename CONFIG_HAVE_FAST_GUP to CONFIG_HAVE_GUP_FAST Nowadays, we call it "GUP-fast", the external interface includes functions like "get_user_pages_fast()", and we renamed all internal functions to reflect that as well. Let's make the config option reflect that. Link: https://lkml.kernel.org/r/20240402125516.223131-3-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Cc: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-04-25 20:56:41 -07:00
Kefeng Wang	bc7996c864	x86: mm: accelerate pagefault when badaccess The access_error() of vma is already checked under per-VMA lock, if it is a bad access, directly handle error, no need to retry with mmap_lock again. In order to release the correct lock, pass the mm_struct into bad_area_access_error(). If mm is NULL, release vma lock, or release mmap_lock. Since the page faut is handled under per-VMA lock, count it as a vma lock event with VMA_LOCK_SUCCESS. Link: https://lkml.kernel.org/r/20240403083805.1818160-8-wangkefeng.wang@huawei.com Reviewed-by: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-04-25 20:56:40 -07:00
Rick Edgecombe	c44357c2e7	x86/mm: care about shadow stack guard gap during placement When memory is being placed, mmap() will take care to respect the guard gaps of certain types of memory (VM_SHADOWSTACK, VM_GROWSUP and VM_GROWSDOWN). In order to ensure guard gaps between mappings, mmap() needs to consider two things: 1. That the new mapping isn't placed in an any existing mappings guard gaps. 2. That the new mapping isn't placed such that any existing mappings are not in its guard gaps. The longstanding behavior of mmap() is to ensure 1, but not take any care around 2. So for example, if there is a PAGE_SIZE free area, and a mmap() with a PAGE_SIZE size, and a type that has a guard gap is being placed, mmap() may place the shadow stack in the PAGE_SIZE free area. Then the mapping that is supposed to have a guard gap will not have a gap to the adjacent VMA. Now that the vm_flags is passed into the arch get_unmapped_area()'s, and vm_unmapped_area() is ready to consider it, have VM_SHADOW_STACK's get guard gap consideration for scenario 2. Link: https://lkml.kernel.org/r/20240326021656.202649-14-rick.p.edgecombe@intel.com Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Aneesh Kumar K.V <aneesh.kumar@kernel.org> Cc: Borislav Petkov (AMD) <bp@alien8.de> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Deepak Gupta <debug@rivosinc.com> Cc: Guo Ren <guoren@kernel.org> Cc: Helge Deller <deller@gmx.de> Cc: H. Peter Anvin (Intel) <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com> Cc: Kees Cook <keescook@chromium.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Mark Brown <broonie@kernel.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Naveen N. Rao <naveen.n.rao@linux.ibm.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-04-25 20:56:28 -07:00
Rick Edgecombe	c5ecd8eb8c	x86/mm: implement HAVE_ARCH_UNMAPPED_AREA_VMFLAGS When memory is being placed, mmap() will take care to respect the guard gaps of certain types of memory (VM_SHADOWSTACK, VM_GROWSUP and VM_GROWSDOWN). In order to ensure guard gaps between mappings, mmap() needs to consider two things: 1. That the new mapping isn't placed in an any existing mappings guard gaps. 2. That the new mapping isn't placed such that any existing mappings are not in its guard gaps. The longstanding behavior of mmap() is to ensure 1, but not take any care around 2. So for example, if there is a PAGE_SIZE free area, and a mmap() with a PAGE_SIZE size, and a type that has a guard gap is being placed, mmap() may place the shadow stack in the PAGE_SIZE free area. Then the mapping that is supposed to have a guard gap will not have a gap to the adjacent VMA. Add x86 arch implementations of arch_get_unmapped_area_vmflags/_topdown() so future changes can allow the guard gap of type of vma being placed to be taken into account. This will be used for shadow stack memory. Link: https://lkml.kernel.org/r/20240326021656.202649-13-rick.p.edgecombe@intel.com Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Aneesh Kumar K.V <aneesh.kumar@kernel.org> Cc: Borislav Petkov (AMD) <bp@alien8.de> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Deepak Gupta <debug@rivosinc.com> Cc: Guo Ren <guoren@kernel.org> Cc: Helge Deller <deller@gmx.de> Cc: H. Peter Anvin (Intel) <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com> Cc: Kees Cook <keescook@chromium.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Mark Brown <broonie@kernel.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Naveen N. Rao <naveen.n.rao@linux.ibm.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-04-25 20:56:28 -07:00
Rick Edgecombe	b80fa3cbb7	treewide: use initializer for struct vm_unmapped_area_info Future changes will need to add a new member to struct vm_unmapped_area_info. This would cause trouble for any call site that doesn't initialize the struct. Currently every caller sets each member manually, so if new ones are added they will be uninitialized and the core code parsing the struct will see garbage in the new member. It could be possible to initialize the new member manually to 0 at each call site. This and a couple other options were discussed. Having some struct vm_unmapped_area_info instances not zero initialized will put those sites at risk of feeding garbage into vm_unmapped_area(), if the convention is to zero initialize the struct and any new field addition missed a call site that initializes each field manually. So it is useful to do things similar across the kernel. The consensus (see links) was that in general the best way to accomplish taking into account both code cleanliness and minimizing the chance of introducing bugs, was to do C99 static initialization. As in: struct vm_unmapped_area_info info = {}; With this method of initialization, the whole struct will be zero initialized, and any statements setting fields to zero will be unneeded. The change should not leave cleanup at the call sides. While iterating though the possible solutions a few archs kindly acked other variations that still zero initialized the struct. These sites have been modified in previous changes using the pattern acked by the respective arch. So to be reduce the chance of bugs via uninitialized fields, perform a tree wide change using the consensus for the best general way to do this change. Use C99 static initializing to zero the struct and remove and statements that simply set members to zero. Link: https://lkml.kernel.org/r/20240326021656.202649-11-rick.p.edgecombe@intel.com Link: https://lore.kernel.org/lkml/202402280912.33AEE7A9CF@keescook/#t Link: https://lore.kernel.org/lkml/j7bfvig3gew3qruouxrh7z7ehjjafrgkbcmg6tcghhfh3rhmzi@wzlcoecgy5rs/ Link: https://lore.kernel.org/lkml/ec3e377a-c0a0-4dd3-9cb9-96517e54d17e@csgroup.eu/ Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Reviewed-by: Kees Cook <keescook@chromium.org> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Aneesh Kumar K.V <aneesh.kumar@kernel.org> Cc: Borislav Petkov (AMD) <bp@alien8.de> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Deepak Gupta <debug@rivosinc.com> Cc: Guo Ren <guoren@kernel.org> Cc: Helge Deller <deller@gmx.de> Cc: H. Peter Anvin (Intel) <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Mark Brown <broonie@kernel.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Naveen N. Rao <naveen.n.rao@linux.ibm.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-04-25 20:56:27 -07:00
Rick Edgecombe	529ce23a76	mm: switch mm->get_unmapped_area() to a flag The mm_struct contains a function pointer *get_unmapped_area(), which is set to either arch_get_unmapped_area() or arch_get_unmapped_area_topdown() during the initialization of the mm. Since the function pointer only ever points to two functions that are named the same across all arch's, a function pointer is not really required. In addition future changes will want to add versions of the functions that take additional arguments. So to save a pointers worth of bytes in mm_struct, and prevent adding additional function pointers to mm_struct in future changes, remove it and keep the information about which get_unmapped_area() to use in a flag. Add the new flag to MMF_INIT_MASK so it doesn't get clobbered on fork by mmf_init_flags(). Most MM flags get clobbered on fork. In the pre-existing behavior mm->get_unmapped_area() would get copied to the new mm in dup_mm(), so not clobbering the flag preserves the existing behavior around inheriting the topdown-ness. Introduce a helper, mm_get_unmapped_area(), to easily convert code that refers to the old function pointer to instead select and call either arch_get_unmapped_area() or arch_get_unmapped_area_topdown() based on the flag. Then drop the mm->get_unmapped_area() function pointer. Leave the get_unmapped_area() pointer in struct file_operations alone. The main purpose of this change is to reorganize in preparation for future changes, but it also converts the calls of mm->get_unmapped_area() from indirect branches into a direct ones. The stress-ng bigheap benchmark calls realloc a lot, which calls through get_unmapped_area() in the kernel. On x86, the change yielded a ~1% improvement there on a retpoline config. In testing a few x86 configs, removing the pointer unfortunately didn't result in any actual size reductions in the compiled layout of mm_struct. But depending on compiler or arch alignment requirements, the change could shrink the size of mm_struct. Link: https://lkml.kernel.org/r/20240326021656.202649-3-rick.p.edgecombe@intel.com Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Aneesh Kumar K.V <aneesh.kumar@kernel.org> Cc: Borislav Petkov (AMD) <bp@alien8.de> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Deepak Gupta <debug@rivosinc.com> Cc: Guo Ren <guoren@kernel.org> Cc: Helge Deller <deller@gmx.de> Cc: H. Peter Anvin (Intel) <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com> Cc: Kees Cook <keescook@chromium.org> Cc: Mark Brown <broonie@kernel.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Naveen N. Rao <naveen.n.rao@linux.ibm.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-04-25 20:56:25 -07:00
Peter Xu	35a76f5c08	mm/arch: provide pud_pfn() fallback The comment in the code explains the reasons. We took a different approach comparing to pmd_pfn() by providing a fallback function. Another option is to provide some lower level config options (compare to HUGETLB_PAGE or THP) to identify which layer an arch can support for such huge mappings. However that can be an overkill. [peterx@redhat.com: fix loongson defconfig] Link: https://lkml.kernel.org/r/20240403013249.1418299-4-peterx@redhat.com Link: https://lkml.kernel.org/r/20240327152332.950956-6-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Tested-by: Ryan Roberts <ryan.roberts@arm.com> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andrew Jones <andrew.jones@linux.dev> Cc: Aneesh Kumar K.V (IBM) <aneesh.kumar@kernel.org> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Christoph Hellwig <hch@infradead.org> Cc: David Hildenbrand <david@redhat.com> Cc: James Houghton <jthoughton@google.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Kirill A. Shutemov <kirill@shutemov.name> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Muchun Song <muchun.song@linux.dev> Cc: Rik van Riel <riel@surriel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-04-25 20:56:21 -07:00
Christoph Hellwig	5b34b76cb0	mm: move follow_phys to arch/x86/mm/pat/memtype.c follow_phys is only used by two callers in arch/x86/mm/pat/memtype.c. Move it there and hardcode the two arguments that get the same values passed by both callers. [david@redhat.com: conflict resolutions] Link: https://lkml.kernel.org/r/20240403212131.929421-4-david@redhat.com Link: https://lkml.kernel.org/r/20240324234542.2038726-4-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Fei Li <fei1.li@intel.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Nathan Chancellor <nathan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-04-25 20:56:12 -07:00
Baoquan He	fdb022f6e9	x86: remove unneeded memblock_find_dma_reserve() Patch series "mm/mm_init.c: refactor free_area_init_core()". In function free_area_init_core(), the code calculating zone->managed_pages and the subtracting dma_reserve from DMA zone looks very confusing. From git history, the code calculating zone->managed_pages was for zone->present_pages originally. The early rough assignment is for optimize zone's pcp and water mark setting. Later, managed_pages was introduced into zone to represent the number of managed pages by buddy. Now, zone->managed_pages is zeroed out and reset in mem_init() when calling memblock_free_all(). zone's pcp and wmark setting relying on actual zone->managed_pages are done later than mem_init() invocation. So we don't need rush to early calculate and set zone->managed_pages, just set it as zone->present_pages, will adjust it in mem_init(). And also add a new function calc_nr_kernel_pages() to count up free but not reserved pages in memblock, then assign it to nr_all_pages and nr_kernel_pages after memmap pages are allocated. This patch (of 6): Variable dma_reserve and its usage was introduced in commit `0e0b864e06` ("[PATCH] Account for memmap and optionally the kernel image as holes"). Its original purpose was to accounting for the reserved pages in DMA zone to make DMA zone's watermarks calculation more accurate on x86. However, currently there's zone->managed_pages to account for all available pages for buddy, zone->present_pages to account for all present physical pages in zone. What is more important, on x86, calculating and setting the zone->managed_pages is a temporary move, all zone's managed_pages will be zeroed out and reset to the actual value according to how many pages are added to buddy allocator in mem_init(). Before mem_init(), no buddy alloction is requested. And zone's pcp and watermark setting are all done after mem_init(). So, no need to worry about the DMA zone's setting accuracy during free_area_init(). Hence, remove memblock_find_dma_reserve() to stop calculating and setting dma_reserve. Link: https://lkml.kernel.org/r/20240325145646.1044760-1-bhe@redhat.com Link: https://lkml.kernel.org/r/20240325145646.1044760-2-bhe@redhat.com Signed-off-by: Baoquan He <bhe@redhat.com> Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-04-25 20:56:10 -07:00
Suren Baghdasaryan	8a2f118787	change alloc_pages name in dma_map_ops to avoid name conflicts After redefining alloc_pages, all uses of that name are being replaced. Change the conflicting names to prevent preprocessor from replacing them when it's not intended. Link: https://lkml.kernel.org/r/20240321163705.3067592-18-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Tested-by: Kees Cook <keescook@chromium.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alex Gaynor <alex.gaynor@gmail.com> Cc: Alice Ryhl <aliceryhl@google.com> Cc: Andreas Hindborg <a.hindborg@samsung.com> Cc: Benno Lossin <benno.lossin@proton.me> Cc: "Björn Roy Baron" <bjorn3_gh@protonmail.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Christoph Lameter <cl@linux.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Gary Guo <gary@garyguo.net> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wedson Almeida Filho <wedsonaf@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-04-25 20:55:53 -07:00
Kent Overstreet	0069455bcb	fix missing vmalloc.h includes Patch series "Memory allocation profiling", v6. Overview: Low overhead [1] per-callsite memory allocation profiling. Not just for debug kernels, overhead low enough to be deployed in production. Example output: root@moria-kvm:~# sort -rn /proc/allocinfo 127664128 31168 mm/page_ext.c:270 func:alloc_page_ext 56373248 4737 mm/slub.c:2259 func:alloc_slab_page 14880768 3633 mm/readahead.c:247 func:page_cache_ra_unbounded 14417920 3520 mm/mm_init.c:2530 func:alloc_large_system_hash 13377536 234 block/blk-mq.c:3421 func:blk_mq_alloc_rqs 11718656 2861 mm/filemap.c:1919 func:__filemap_get_folio 9192960 2800 kernel/fork.c:307 func:alloc_thread_stack_node 4206592 4 net/netfilter/nf_conntrack_core.c:2567 func:nf_ct_alloc_hashtable 4136960 1010 drivers/staging/ctagmod/ctagmod.c:20 [ctagmod] func:ctagmod_start 3940352 962 mm/memory.c:4214 func:alloc_anon_folio 2894464 22613 fs/kernfs/dir.c:615 func:__kernfs_new_node ... Usage: kconfig options: - CONFIG_MEM_ALLOC_PROFILING - CONFIG_MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT - CONFIG_MEM_ALLOC_PROFILING_DEBUG adds warnings for allocations that weren't accounted because of a missing annotation sysctl: /proc/sys/vm/mem_profiling Runtime info: /proc/allocinfo Notes: [1]: Overhead To measure the overhead we are comparing the following configurations: (1) Baseline with CONFIG_MEMCG_KMEM=n (2) Disabled by default (CONFIG_MEM_ALLOC_PROFILING=y && CONFIG_MEM_ALLOC_PROFILING_BY_DEFAULT=n) (3) Enabled by default (CONFIG_MEM_ALLOC_PROFILING=y && CONFIG_MEM_ALLOC_PROFILING_BY_DEFAULT=y) (4) Enabled at runtime (CONFIG_MEM_ALLOC_PROFILING=y && CONFIG_MEM_ALLOC_PROFILING_BY_DEFAULT=n && /proc/sys/vm/mem_profiling=1) (5) Baseline with CONFIG_MEMCG_KMEM=y && allocating with __GFP_ACCOUNT (6) Disabled by default (CONFIG_MEM_ALLOC_PROFILING=y && CONFIG_MEM_ALLOC_PROFILING_BY_DEFAULT=n) && CONFIG_MEMCG_KMEM=y (7) Enabled by default (CONFIG_MEM_ALLOC_PROFILING=y && CONFIG_MEM_ALLOC_PROFILING_BY_DEFAULT=y) && CONFIG_MEMCG_KMEM=y Performance overhead: To evaluate performance we implemented an in-kernel test executing multiple get_free_page/free_page and kmalloc/kfree calls with allocation sizes growing from 8 to 240 bytes with CPU frequency set to max and CPU affinity set to a specific CPU to minimize the noise. Below are results from running the test on Ubuntu 22.04.2 LTS with 6.8.0-rc1 kernel on 56 core Intel Xeon: kmalloc pgalloc (1 baseline) 6.764s 16.902s (2 default disabled) 6.793s (+0.43%) 17.007s (+0.62%) (3 default enabled) 7.197s (+6.40%) 23.666s (+40.02%) (4 runtime enabled) 7.405s (+9.48%) 23.901s (+41.41%) (5 memcg) 13.388s (+97.94%) 48.460s (+186.71%) (6 def disabled+memcg) 13.332s (+97.10%) 48.105s (+184.61%) (7 def enabled+memcg) 13.446s (+98.78%) 54.963s (+225.18%) Memory overhead: Kernel size: text data bss dec diff (1) 26515311 18890222 17018880 62424413 (2) 26524728 19423818 16740352 62688898 264485 (3) 26524724 19423818 16740352 62688894 264481 (4) 26524728 19423818 16740352 62688898 264485 (5) 26541782 18964374 16957440 62463596 39183 Memory consumption on a 56 core Intel CPU with 125GB of memory: Code tags: 192 kB PageExts: 262144 kB (256MB) SlabExts: 9876 kB (9.6MB) PcpuExts: 512 kB (0.5MB) Total overhead is 0.2% of total memory. Benchmarks: Hackbench tests run 100 times: hackbench -s 512 -l 200 -g 15 -f 25 -P baseline disabled profiling enabled profiling avg 0.3543 0.3559 (+0.0016) 0.3566 (+0.0023) stdev 0.0137 0.0188 0.0077 hackbench -l 10000 baseline disabled profiling enabled profiling avg 6.4218 6.4306 (+0.0088) 6.5077 (+0.0859) stdev 0.0933 0.0286 0.0489 stress-ng tests: stress-ng --class memory --seq 4 -t 60 stress-ng --class cpu --seq 4 -t 60 Results posted at: https://evilpiepirate.org/~kent/memalloc_prof_v4_stress-ng/ [2] https://lore.kernel.org/all/20240306182440.2003814-1-surenb@google.com/ This patch (of 37): The next patch drops vmalloc.h from a system header in order to fix a circular dependency; this adds it to all the files that were pulling it in implicitly. [kent.overstreet@linux.dev: fix arch/alpha/lib/memcpy.c] Link: https://lkml.kernel.org/r/20240327002152.3339937-1-kent.overstreet@linux.dev [surenb@google.com: fix arch/x86/mm/numa_32.c] Link: https://lkml.kernel.org/r/20240402180933.1663992-1-surenb@google.com [kent.overstreet@linux.dev: a few places were depending on sizes.h] Link: https://lkml.kernel.org/r/20240404034744.1664840-1-kent.overstreet@linux.dev [arnd@arndb.de: fix mm/kasan/hw_tags.c] Link: https://lkml.kernel.org/r/20240404124435.3121534-1-arnd@kernel.org [surenb@google.com: fix arc build] Link: https://lkml.kernel.org/r/20240405225115.431056-1-surenb@google.com Link: https://lkml.kernel.org/r/20240321163705.3067592-1-surenb@google.com Link: https://lkml.kernel.org/r/20240321163705.3067592-2-surenb@google.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> Signed-off-by: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com> Tested-by: Kees Cook <keescook@chromium.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alex Gaynor <alex.gaynor@gmail.com> Cc: Alice Ryhl <aliceryhl@google.com> Cc: Andreas Hindborg <a.hindborg@samsung.com> Cc: Benno Lossin <benno.lossin@proton.me> Cc: "Björn Roy Baron" <bjorn3_gh@protonmail.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Christoph Lameter <cl@linux.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Gary Guo <gary@garyguo.net> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wedson Almeida Filho <wedsonaf@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-04-25 20:55:49 -07:00
Peter Xu	9636f055da	mm/treewide: remove pXd_huge() This API is not used anymore, drop it for the whole tree. Link: https://lkml.kernel.org/r/20240318200404.448346-13-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andreas Larsson <andreas@gaisler.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Bjorn Andersson <andersson@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David S. Miller <davem@davemloft.net> Cc: Fabio Estevam <festevam@denx.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Konrad Dybcio <konrad.dybcio@linaro.org> Cc: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Cc: Lucas Stach <l.stach@pengutronix.de> Cc: Mark Salter <msalter@redhat.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Russell King <linux@armlinux.org.uk> Cc: Shawn Guo <shawnguo@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-04-25 20:55:47 -07:00
Peter Xu	1965e933dd	mm/treewide: replace pXd_huge() with pXd_leaf() Now after we're sure all pXd_huge() definitions are the same as pXd_leaf(), reuse it. Luckily, pXd_huge() isn't widely used. Link: https://lkml.kernel.org/r/20240318200404.448346-12-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andreas Larsson <andreas@gaisler.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Bjorn Andersson <andersson@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David S. Miller <davem@davemloft.net> Cc: Fabio Estevam <festevam@denx.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Konrad Dybcio <konrad.dybcio@linaro.org> Cc: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Cc: Lucas Stach <l.stach@pengutronix.de> Cc: Mark Salter <msalter@redhat.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Russell King <linux@armlinux.org.uk> Cc: Shawn Guo <shawnguo@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-04-25 20:55:46 -07:00
Peter Xu	d0973cb9b4	mm/x86: change pXd_huge() behavior to exclude swap entries This patch partly reverts below commits: `3a194f3f8a` ("mm/hugetlb: make pud_huge() and follow_huge_pud() aware of non-present pud entry") `cbef8478be` ("mm/hugetlb: pmd_huge() returns true for non-present hugepage") Right now, pXd_huge() definition across kernel is unclear. We have two groups that think differently on swap entries: - x86/sparc: Allow pXd_huge() to accept swap entries - all the rest: Doesn't allow pXd_huge() to accept swap entries This is so confusing. Since the sparc helpers seem to be added in 2016, which is after x86's (2015), so sparc could have followed a trend. x86 proposed such swap handling in 2015 to resolve hugetlb swap entries hit in GUP, but now GUP guards swap entries with !pXd_present() in all layers so we should be safe. We should define this API properly, one way or another, rather than keep them defined differently across archs. Gut feeling tells me that pXd_huge() shouldn't include swap entries, and it turns out that I am not the only one thinking so, the question was raised when the current pmd_huge() for x86 was proposed by Ville Syrjälä: https://lore.kernel.org/all/Y2WQ7I4LXh8iUIRd@intel.com/ I might also be missing something obvious, but why is it even necessary to treat PRESENT==0+PSE==0 as a huge entry? It is also questioned when Jason Gunthorpe reviewed the other patchset on swap entry handlings: https://lore.kernel.org/all/20240221125753.GQ13330@nvidia.com/ Revert its meaning back to original. It shouldn't have any functional change as we should be ready with guards on !pXd_present() explicitly everywhere. Note that I also dropped the "#if CONFIG_PGTABLE_LEVELS > 2", it was there probably because it was breaking things when `3a194f3f8a` was proposed, according to the report here: https://lore.kernel.org/all/Y2LYXItKQyaJTv8j@intel.com/ Now we shouldn't need that. Instead of reverting to _PAGE_PSE raw check, leverage pXd_leaf(). Link: https://lkml.kernel.org/r/20240318200404.448346-5-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andreas Larsson <andreas@gaisler.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Bjorn Andersson <andersson@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: David S. Miller <davem@davemloft.net> Cc: Fabio Estevam <festevam@denx.de> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Konrad Dybcio <konrad.dybcio@linaro.org> Cc: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Cc: Lucas Stach <l.stach@pengutronix.de> Cc: Mark Salter <msalter@redhat.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Russell King <linux@armlinux.org.uk> Cc: Shawn Guo <shawnguo@kernel.org> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-04-25 20:55:45 -07:00
Jakub Kicinski	2bd87951de	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Cross-merge networking fixes after downstream PR. Conflicts: drivers/net/ethernet/ti/icssg/icssg_prueth.c net/mac80211/chan.c `89884459a0` ("wifi: mac80211: fix idle calculation with multi-link") `87f5500285` ("wifi: mac80211: simplify ieee80211_assign_link_chanctx()") https://lore.kernel.org/all/20240422105623.7b1fbda2@canb.auug.org.au/ net/unix/garbage.c `1971d13ffa` ("af_unix: Suppress false-positive lockdep splat for spin_lock() in __unix_gc().") `4090fa373f` ("af_unix: Replace garbage collection algorithm.") drivers/net/ethernet/ti/icssg/icssg_prueth.c drivers/net/ethernet/ti/icssg/icssg_common.c `4dcd0e83ea` ("net: ti: icssg-prueth: Fix signedness bug in prueth_init_rx_chns()") `e2dc7bfd67` ("net: ti: icssg-prueth: Move common functions into a separate file") No adjacent changes. Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-04-25 12:41:37 -07:00
Tony Luck	a7011b852a	perf/x86/intel/pt: Switch to new Intel CPU model defines New CPU #defines encode vendor and family as well as model. Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/all/20240424181500.41538-1-tony.luck%40intel.com	2024-04-25 09:04:32 -07:00
Tony Luck	0011a51d73	perf/x86/lbr: Switch to new Intel CPU model defines New CPU #defines encode vendor and family as well as model. Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/all/20240424181500.41519-1-tony.luck%40intel.com	2024-04-25 09:04:32 -07:00
Tony Luck	5ee800945a	perf/x86/intel/cstate: Switch to new Intel CPU model defines New CPU #defines encode vendor and family as well as model. Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/all/20240424181459.41500-1-tony.luck%40intel.com	2024-04-25 09:04:32 -07:00
Tom Lendacky	e2f4c8c319	x86/sev: Make the VMPL0 checking more straight forward Currently, the enforce_vmpl0() function uses a set argument when modifying the VMPL1 permissions used to test for VMPL0. If the guest is not running at VMPL0, the guest self-terminates. The function is just a wrapper for a fixed RMPADJUST function. Eliminate the function and perform the RMPADJUST directly. No functional change. Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/ed01ddf04bfb475596b24b634fd26cffaa85173a.1713974291.git.thomas.lendacky@amd.com	2024-04-25 16:14:25 +02:00
Tom Lendacky	88ed43d32b	x86/sev: Rename snp_init() in boot/compressed/sev.c The snp_init() function in boot/compressed/sev.c is local to that file, is not called from outside of the file and is independent of the snp_init() function in kernel/sev.c. Change the name to better differentiate when each function is used. Move the renamed snp_init() and related functions up in the file to avoid having to add a forward declaration and make the function static. No functional change. Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/afda29585c2724b9698003f24cefa77eb35f4ffb.1713974291.git.thomas.lendacky@amd.com	2024-04-25 16:14:25 +02:00
Tom Lendacky	1e52550729	x86/sev: Shorten struct name snp_secrets_page_layout to snp_secrets_page Ending a struct name with "layout" is a little redundant, so shorten the snp_secrets_page_layout name to just snp_secrets_page. No functional change. [ bp: Rename the local pointer to "secrets" too for more clarity. ] Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/bc8d58302c6ab66c3beeab50cce3ec2c6bd72d6c.1713974291.git.thomas.lendacky@amd.com	2024-04-25 16:13:51 +02:00
Sean Christopherson	ce0abef6a1	cpu: Ignore "mitigations" kernel parameter if CPU_MITIGATIONS=n Explicitly disallow enabling mitigations at runtime for kernels that were built with CONFIG_CPU_MITIGATIONS=n, as some architectures may omit code entirely if mitigations are disabled at compile time. E.g. on x86, a large pile of Kconfigs are buried behind CPU_MITIGATIONS, and trying to provide sane behavior for retroactively enabling mitigations is extremely difficult, bordering on impossible. E.g. page table isolation and call depth tracking require build-time support, BHI mitigations will still be off without additional kernel parameters, etc. [ bp: Touchups. ] Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Acked-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20240420000556.2645001-3-seanjc@google.com	2024-04-25 15:47:39 +02:00
Sean Christopherson	fe42754b94	cpu: Re-enable CPU mitigations by default for !X86 architectures Rename x86's to CPU_MITIGATIONS, define it in generic code, and force it on for all architectures exception x86. A recent commit to turn mitigations off by default if SPECULATION_MITIGATIONS=n kinda sorta missed that "cpu_mitigations" is completely generic, whereas SPECULATION_MITIGATIONS is x86-specific. Rename x86's SPECULATIVE_MITIGATIONS instead of keeping both and have it select CPU_MITIGATIONS, as having two configs for the same thing is unnecessary and confusing. This will also allow x86 to use the knob to manage mitigations that aren't strictly related to speculative execution. Use another Kconfig to communicate to common code that CPU_MITIGATIONS is already defined instead of having x86's menu depend on the common CPU_MITIGATIONS. This allows keeping a single point of contact for all of x86's mitigations, and it's not clear that other architectures want to allow disabling mitigations at compile-time. Fixes: `f337a6a21e` ("x86/cpu: Actually turn off mitigations by default for SPECULATION_MITIGATIONS=n") Closes: https://lkml.kernel.org/r/20240413115324.53303a68%40canb.auug.org.au Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Reported-by: Michael Ellerman <mpe@ellerman.id.au> Reported-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Acked-by: Josh Poimboeuf <jpoimboe@kernel.org> Acked-by: Borislav Petkov (AMD) <bp@alien8.de> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20240420000556.2645001-2-seanjc@google.com	2024-04-25 15:47:35 +02:00
Tony Luck	b24e466abf	x86/bugs: Switch to new Intel CPU model defines New CPU #defines encode vendor and family as well as model. Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Acked-by: Josh Poimboeuf <jpoimboe@kernel.org> Link: https://lore.kernel.org/r/20240424181507.41693-1-tony.luck@intel.com	2024-04-25 12:42:13 +02:00
Tony Luck	8a28b02202	x86/bugs: Switch to new Intel CPU model defines New CPU #defines encode vendor and family as well as model. Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Acked-by: Josh Poimboeuf <jpoimboe@kernel.org> Link: https://lore.kernel.org/r/20240424181506.41673-1-tony.luck@intel.com	2024-04-25 12:27:25 +02:00
Kirill A. Shutemov	a0a8d15a79	x86/tdx: Preserve shared bit on mprotect() The TDX guest platform takes one bit from the physical address to indicate if the page is shared (accessible by VMM). This bit is not part of the physical_mask and is not preserved during mprotect(). As a result, the 'shared' bit is lost during mprotect() on shared mappings. _COMMON_PAGE_CHG_MASK specifies which PTE bits need to be preserved during modification. AMD includes 'sme_me_mask' in the define to preserve the 'encrypt' bit. To cover both Intel and AMD cases, include 'cc_mask' in _COMMON_PAGE_CHG_MASK instead of 'sme_me_mask'. Reported-and-tested-by: Chris Oo <cho@microsoft.com> Fixes: `41394e33f3` ("x86/tdx: Extend the confidential computing API to support TDX guests") Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/all/20240424082035.4092071-1-kirill.shutemov%40linux.intel.com	2024-04-24 08:11:43 -07:00
David Kaplan	b53c6bd5d2	x86/cpu: Fix check for RDPKRU in __show_regs() cpu_feature_enabled(X86_FEATURE_OSPKE) does not necessarily reflect whether CR4.PKE is set on the CPU. In particular, they may differ on non-BSP CPUs before setup_pku() is executed. In this scenario, RDPKRU will #UD causing the system to hang. Fix by checking CR4 for PKE enablement which is always correct for the current CPU. The scenario happens by inserting a WARN* before setup_pku() in identiy_cpu() or some other diagnostic which would lead to calling __show_regs(). [ bp: Massage commit message. ] Signed-off-by: David Kaplan <david.kaplan@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20240421191728.32239-1-bp@kernel.org	2024-04-24 14:30:21 +02:00
Haifeng Xu	931be446c6	x86/resctrl: Add tracepoint for llc_occupancy tracking In our production environment, after removing monitor groups, those unused RMIDs get stuck in the limbo list forever because their llc_occupancy is always larger than the threshold. But the unused RMIDs can be successfully freed by turning up the threshold. In order to know how much the threshold should be, perf can be used to acquire the llc_occupancy of RMIDs in each rdt domain. Instead of using perf tool to track llc_occupancy and filter the log manually, it is more convenient for users to use tracepoint to do this work. So add a new tracepoint that shows the llc_occupancy of busy RMIDs when scanning the limbo list. Suggested-by: Reinette Chatre <reinette.chatre@intel.com> Suggested-by: James Morse <james.morse@arm.com> Signed-off-by: Haifeng Xu <haifeng.xu@shopee.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: James Morse <james.morse@arm.com> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Link: https://lore.kernel.org/r/20240408092303.26413-3-haifeng.xu@shopee.com	2024-04-24 14:24:48 +02:00
Haifeng Xu	8773922948	x86/resctrl: Rename pseudo_lock_event.h to trace.h Now only the pseudo-locking part uses tracepoints to do event tracking, but other parts of resctrl may need new tracepoints. It is unnecessary to create separate header files and define CREATE_TRACE_POINTS in different c files which fragments the resctrl tracing. Therefore, give the resctrl tracepoint header file a generic name to support its use for tracepoints that are not specific to pseudo-locking. No functional change. Suggested-by: Reinette Chatre <reinette.chatre@intel.com> Signed-off-by: Haifeng Xu <haifeng.xu@shopee.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Link: https://lore.kernel.org/r/20240408092303.26413-2-haifeng.xu@shopee.com	2024-04-24 14:21:52 +02:00
Wenkuan Wang	2718a7fdf2	x86/CPU/AMD: Add models 0x10-0x1f to the Zen5 range Add some more Zen5 models. Fixes: `3e4147f33f` ("x86/CPU/AMD: Add X86_FEATURE_ZEN5") Signed-off-by: Wenkuan Wang <Wenkuan.Wang@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20240423144111.1362-1-bp@kernel.org	2024-04-24 14:05:25 +02:00
Tony Luck	bd4955d4bc	x86/resctrl: Simplify call convention for MSR update functions The per-resource MSR update functions cat_wrmsr(), mba_wrmsr_intel(), and mba_wrmsr_amd() all take three arguments: (struct rdt_domain d, struct msr_param m, struct rdt_resource *r) struct msr_param contains pointers to both struct rdt_resource and struct rdt_domain, thus only struct msr_param is necessary. Pass struct msr_param as a single parameter. Clean up formatting and fix some fir tree declaration ordering. No functional change. Suggested-by: Reinette Chatre <reinette.chatre@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Reviewed-by: Babu Moger <babu.moger@amd.com> Tested-by: Maciej Wieczor-Retman <maciej.wieczor-retman@intel.com> Link: https://lore.kernel.org/r/20240308213846.77075-3-tony.luck@intel.com	2024-04-24 13:47:00 +02:00
Tony Luck	e3ca96e479	x86/resctrl: Pass domain to target CPU reset_all_ctrls() and resctrl_arch_update_domains() use on_each_cpu_mask() to call rdt_ctrl_update() on potentially one CPU from each domain. But this means rdt_ctrl_update() needs to figure out which domain to apply changes to. Doing so requires a search of all domains in a resource, which can only be done safely if cpus_lock is held. Both callers do hold this lock, but there isn't a way for a function called on another CPU via IPI to verify this. Commit `c0d848fcb0` ("x86/resctrl: Remove lockdep annotation that triggers false positive") removed the incorrect assertions. Add the target domain to the msr_param structure and call rdt_ctrl_update() for each domain separately using smp_call_function_single(). This means that rdt_ctrl_update() doesn't need to search for the domain and get_domain_from_cpu() can safely assert that the cpus_lock is held since the remaining callers do not use IPI. Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Reviewed-by: James Morse <james.morse@arm.com> Reviewed-by: Babu Moger <babu.moger@amd.com> Tested-by: Maciej Wieczor-Retman <maciej.wieczor-retman@intel.com> Link: https://lore.kernel.org/r/20240308213846.77075-2-tony.luck@intel.com	2024-04-24 13:41:41 +02:00
Uros Bizjak	532453e7aa	locking/pvqspinlock/x86: Use _Q_LOCKED_VAL in PV_UNLOCK_ASM macro Use _Q_LOCKED_VAL instead of hardcoded $0x1 in PV_UNLOCK_ASM macro. No functional changes intended. Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Boqun Feng <boqun.feng@gmail.com> Cc: Waiman Long <longman@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/20240422151752.53997-1-ubizjak@gmail.com	2024-04-24 11:48:08 +02:00
Uros Bizjak	94af3a04e3	locking/qspinlock/x86: Micro-optimize virt_spin_lock() Optimize virt_spin_lock() to use simpler and faster: atomic_try_cmpxchg(ptr, &val, new) instead of: atomic_cmpxchg(ptr, val, new) == val The x86 CMPXCHG instruction returns success in the ZF flag, so this change saves a compare after the CMPXCHG. Also optimize retry loop a bit. atomic_try_cmpxchg() fails iff &lock->val != 0, so there is no need to load and compare the lock value again - cpu_relax() can be unconditinally called in this case. This allows us to generate optimized: 1f: ba 01 00 00 00 mov $0x1,%edx 24: 8b 03 mov (%rbx),%eax 26: 85 c0 test %eax,%eax 28: 75 63 jne 8d <...> 2a: f0 0f b1 13 lock cmpxchg %edx,(%rbx) 2e: 75 5d jne 8d <...> ... 8d: f3 90 pause 8f: eb 93 jmp 24 <...> instead of: 1f: ba 01 00 00 00 mov $0x1,%edx 24: 8b 03 mov (%rbx),%eax 26: 85 c0 test %eax,%eax 28: 75 13 jne 3d <...> 2a: f0 0f b1 13 lock cmpxchg %edx,(%rbx) 2e: 85 c0 test %eax,%eax 30: 75 f2 jne 24 <...> ... 3d: f3 90 pause 3f: eb e3 jmp 24 <...> Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Waiman Long <longman@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/20240422120054.199092-1-ubizjak@gmail.com	2024-04-24 11:46:28 +02:00
Uros Bizjak	33eb8ab4ec	locking/atomic/x86: Merge __arch{,_try}_cmpxchg64_emu_local() with __arch{,_try}_cmpxchg64_emu() Macros __arch{,_try}_cmpxchg64_emu() are almost identical to their local variants __arch{,_try}_cmpxchg64_emu_local(), differing only by lock prefixes. Merge these two macros by introducing additional macro parameters to pass lock location and lock prefix from their respective static inline functions. No functional change intended. Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Link: https://lore.kernel.org/r/20240417175830.161561-1-ubizjak@gmail.com	2024-04-24 11:45:13 +02:00
Sourabh Jain	79365026f8	crash: add a new kexec flag for hotplug support Commit `a72bbec70d` ("crash: hotplug support for kexec_load()") introduced a new kexec flag, `KEXEC_UPDATE_ELFCOREHDR`. Kexec tool uses this flag to indicate to the kernel that it is safe to modify the elfcorehdr of the kdump image loaded using the kexec_load system call. However, it is possible that architectures may need to update kexec segments other then elfcorehdr. For example, FDT (Flatten Device Tree) on PowerPC. Introducing a new kexec flag for every new kexec segment may not be a good solution. Hence, a generic kexec flag bit, `KEXEC_CRASH_HOTPLUG_SUPPORT`, is introduced to share the CPU/Memory hotplug support intent between the kexec tool and the kernel for the kexec_load system call. Now we have two kexec flags that enables crash hotplug support for kexec_load system call. First is KEXEC_UPDATE_ELFCOREHDR (only used in x86), and second is KEXEC_CRASH_HOTPLUG_SUPPORT (for all architectures). To simplify the process of finding and reporting the crash hotplug support the following changes are introduced. 1. Define arch specific function to process the kexec flags and determine crash hotplug support 2. Rename the @update_elfcorehdr member of struct kimage to @hotplug_support and populate it for both kexec_load and kexec_file_load syscalls, because architecture can update more than one kexec segment 3. Let generic function crash_check_hotplug_support report hotplug support for loaded kdump image based on value of @hotplug_support To bring the x86 crash hotplug support in line with the above points, the following changes have been made: - Introduce the arch_crash_hotplug_support function to process kexec flags and determine crash hotplug support - Remove the arch_crash_hotplug_[cpu\|memory]_support functions Signed-off-by: Sourabh Jain <sourabhjain@linux.ibm.com> Acked-by: Baoquan He <bhe@redhat.com> Acked-by: Hari Bathini <hbathini@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20240326055413.186534-3-sourabhjain@linux.ibm.com	2024-04-23 14:59:01 +10:00
Sourabh Jain	118005713e	crash: forward memory_notify arg to arch crash hotplug handler In the event of memory hotplug or online/offline events, the crash memory hotplug notifier `crash_memhp_notifier()` receives a `memory_notify` object but doesn't forward that object to the generic and architecture-specific crash hotplug handler. The `memory_notify` object contains the starting PFN (Page Frame Number) and the number of pages in the hot-removed memory. This information is necessary for architectures like PowerPC to update/recreate the kdump image, specifically `elfcorehdr`. So update the function signature of `crash_handle_hotplug_event()` and `arch_crash_handle_hotplug_event()` to accept the `memory_notify` object as an argument from crash memory hotplug notifier. Since no such object is available in the case of CPU hotplug event, the crash CPU hotplug notifier `crash_cpuhp_online()` passes NULL to the crash hotplug handler. Signed-off-by: Sourabh Jain <sourabhjain@linux.ibm.com> Acked-by: Baoquan He <bhe@redhat.com> Acked-by: Hari Bathini <hbathini@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20240326055413.186534-2-sourabhjain@linux.ibm.com	2024-04-23 14:59:01 +10:00
Johannes Berg	158a6b914c	um: signal: move pid variable where needed We have W=1 warnings on 64-bit because the pid is only used in branches on 32-bit; move it inside to get rid of the warnings. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Reviewed-by: Tiwei Bie <tiwei.btw@antgroup.com> Signed-off-by: Richard Weinberger <richard@nod.at>	2024-04-22 22:25:44 +02:00
Jason Gunthorpe	20516d6e51	x86: Stop using weak symbols for __iowrite32_copy() Start switching iomap_copy routines over to use #define and arch provided inline/macro functions instead of weak symbols. Inline functions allow more compiler optimization and this is often a driver hot path. x86 has the only weak implementation for __iowrite32_copy(), so replace it with a static inline containing the same single instruction inline assembly. The compiler will generate the "mov edx,ecx" in a more optimal way. Remove iomap_copy_64.S Link: https://lore.kernel.org/r/1-v3-1893cd8b9369+1925-mlx5_arm_wc_jgg@nvidia.com Acked-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>	2024-04-22 17:11:19 -03:00
Tiwei Bie	49ff7d8712	um: Fix -Wmissing-prototypes warnings for __warp_* and foo These functions are not called explicitly. Let's just workaround the -Wmissing-prototypes warnings by declaring them locally similar to what was done in arch/x86/kernel/asm-offsets_32.c. This will address below -Wmissing-prototypes warnings: ./arch/x86/um/shared/sysdep/kernel-offsets.h:9:6: warning: no previous prototype for ‘foo’ [-Wmissing-prototypes] arch/um/os-Linux/main.c:187:7: warning: no previous prototype for ‘__wrap_malloc’ [-Wmissing-prototypes] arch/um/os-Linux/main.c:208:7: warning: no previous prototype for ‘__wrap_calloc’ [-Wmissing-prototypes] arch/um/os-Linux/main.c:222:6: warning: no previous prototype for ‘__wrap_free’ [-Wmissing-prototypes] arch/x86/um/user-offsets.c:17:6: warning: no previous prototype for ‘foo’ [-Wmissing-prototypes] Signed-off-by: Tiwei Bie <tiwei.btw@antgroup.com> Signed-off-by: Richard Weinberger <richard@nod.at>	2024-04-22 21:58:48 +02:00
Tiwei Bie	a4b4382f3e	um: Move declarations to proper headers This will address below -Wmissing-prototypes warnings: arch/um/kernel/initrd.c:18:12: warning: no previous prototype for ‘read_initrd’ [-Wmissing-prototypes] arch/um/kernel/um_arch.c:408:19: warning: no previous prototype for ‘read_initrd’ [-Wmissing-prototypes] arch/um/os-Linux/start_up.c:301:12: warning: no previous prototype for ‘parse_iomem’ [-Wmissing-prototypes] arch/x86/um/ptrace_32.c:15:6: warning: no previous prototype for ‘arch_switch_to’ [-Wmissing-prototypes] arch/x86/um/ptrace_32.c:101:5: warning: no previous prototype for ‘poke_user’ [-Wmissing-prototypes] arch/x86/um/ptrace_32.c:153:5: warning: no previous prototype for ‘peek_user’ [-Wmissing-prototypes] arch/x86/um/ptrace_64.c:111:5: warning: no previous prototype for ‘poke_user’ [-Wmissing-prototypes] arch/x86/um/ptrace_64.c:171:5: warning: no previous prototype for ‘peek_user’ [-Wmissing-prototypes] arch/x86/um/syscalls_64.c:48:6: warning: no previous prototype for ‘arch_switch_to’ [-Wmissing-prototypes] arch/x86/um/tls_32.c:184:5: warning: no previous prototype for ‘arch_switch_tls’ [-Wmissing-prototypes] Signed-off-by: Tiwei Bie <tiwei.btw@antgroup.com> Signed-off-by: Richard Weinberger <richard@nod.at>	2024-04-22 21:58:48 +02:00
Tiwei Bie	9ffc6724a3	um: Add missing headers This will address below -Wmissing-prototypes warnings: arch/um/kernel/mem.c:202:8: warning: no previous prototype for ‘pgd_alloc’ [-Wmissing-prototypes] arch/um/kernel/mem.c:215:7: warning: no previous prototype for ‘uml_kmalloc’ [-Wmissing-prototypes] arch/um/kernel/process.c:207:6: warning: no previous prototype for ‘arch_cpu_idle’ [-Wmissing-prototypes] arch/um/kernel/process.c:328:15: warning: no previous prototype for ‘arch_align_stack’ [-Wmissing-prototypes] arch/um/kernel/reboot.c:45:6: warning: no previous prototype for ‘machine_restart’ [-Wmissing-prototypes] arch/um/kernel/reboot.c:51:6: warning: no previous prototype for ‘machine_power_off’ [-Wmissing-prototypes] arch/um/kernel/reboot.c:57:6: warning: no previous prototype for ‘machine_halt’ [-Wmissing-prototypes] arch/um/kernel/skas/mmu.c:17:5: warning: no previous prototype for ‘init_new_context’ [-Wmissing-prototypes] arch/um/kernel/skas/mmu.c:60:6: warning: no previous prototype for ‘destroy_context’ [-Wmissing-prototypes] arch/um/kernel/skas/process.c:36:12: warning: no previous prototype for ‘start_uml’ [-Wmissing-prototypes] arch/um/kernel/time.c:807:15: warning: no previous prototype for ‘calibrate_delay_is_known’ [-Wmissing-prototypes] arch/um/kernel/tlb.c:594:6: warning: no previous prototype for ‘force_flush_all’ [-Wmissing-prototypes] arch/x86/um/bugs_32.c:22:6: warning: no previous prototype for ‘arch_check_bugs’ [-Wmissing-prototypes] arch/x86/um/bugs_32.c:44:6: warning: no previous prototype for ‘arch_examine_signal’ [-Wmissing-prototypes] arch/x86/um/bugs_64.c:9:6: warning: no previous prototype for ‘arch_check_bugs’ [-Wmissing-prototypes] arch/x86/um/bugs_64.c:13:6: warning: no previous prototype for ‘arch_examine_signal’ [-Wmissing-prototypes] arch/x86/um/elfcore.c:10:12: warning: no previous prototype for ‘elf_core_extra_phdrs’ [-Wmissing-prototypes] arch/x86/um/elfcore.c:15:5: warning: no previous prototype for ‘elf_core_write_extra_phdrs’ [-Wmissing-prototypes] arch/x86/um/elfcore.c:42:5: warning: no previous prototype for ‘elf_core_write_extra_data’ [-Wmissing-prototypes] arch/x86/um/elfcore.c:63:8: warning: no previous prototype for ‘elf_core_extra_data_size’ [-Wmissing-prototypes] arch/x86/um/fault.c:18:5: warning: no previous prototype for ‘arch_fixup’ [-Wmissing-prototypes] arch/x86/um/os-Linux/mcontext.c:7:6: warning: no previous prototype for ‘get_regs_from_mc’ [-Wmissing-prototypes] arch/x86/um/os-Linux/tls.c:22:6: warning: no previous prototype for ‘check_host_supports_tls’ [-Wmissing-prototypes] Signed-off-by: Tiwei Bie <tiwei.btw@antgroup.com> Signed-off-by: Richard Weinberger <richard@nod.at>	2024-04-22 21:46:20 +02:00
Tiwei Bie	53471c5749	um: Make local functions and variables static This will also fix the warnings like: warning: no previous prototype for ‘fork_handler’ [-Wmissing-prototypes] 140 \| void fork_handler(void) \| ^~~~~~~~~~~~ Signed-off-by: Tiwei Bie <tiwei.btw@antgroup.com> Signed-off-by: Richard Weinberger <richard@nod.at>	2024-04-22 21:43:03 +02:00
Tom Lendacky	e70316d17f	x86/sev: Check for MWAITX and MONITORX opcodes in the #VC handler The MWAITX and MONITORX instructions generate the same #VC error code as the MWAIT and MONITOR instructions, respectively. Update the #VC handler opcode checking to also support the MWAITX and MONITORX opcodes. Fixes: `e3ef461af3` ("x86/sev: Harden #VC instruction emulation somewhat") Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/453d5a7cfb4b9fe818b6fb67f93ae25468bc9e23.1713793161.git.thomas.lendacky@amd.com	2024-04-22 18:38:28 +02:00
Tony Luck	f055b6260e	x86/cpu/vfm: Update arch/x86/include/asm/intel-family.h New CPU #defines encode vendor and family as well as model. Update the example usage comment in arch/x86/kernel/cpu/match.c Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240416211941.9369-4-tony.luck@intel.com	2024-04-22 11:44:00 +02:00
Tony Luck	e6dfdc2e89	x86/cpu/vfm: Add new macros to work with (vendor/family/model) values To avoid adding a slew of new macros for each new Intel CPU family switch over from providing CPU model number #defines to a new scheme that encodes vendor, family, and model in a single number. [ bp: s/casted/cast/g ] Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240416211941.9369-3-tony.luck@intel.com	2024-04-22 11:43:55 +02:00
Tony Luck	a9d0adce69	x86/cpu/vfm: Add/initialize x86_vfm field to struct cpuinfo_x86 Refactor struct cpuinfo_x86 so that the vendor, family, and model fields are overlaid in a union with a 32-bit field that combines all three (together with a one byte reserved field in the upper byte). This will make it easy, cheap, and reliable to check all three values at once. See https://lore.kernel.org/r/Zgr6kT8oULbnmEXx@agluck-desk3 for why the ordering is (low-to-high bits): (vendor, family, model) [ bp: Move comments over the line, add the backstory about the particular order of the fields. ] Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240416211941.9369-2-tony.luck@intel.com	2024-04-22 11:43:48 +02:00
Masahiro Yamada	71d99ea47f	x86/Kconfig: Merge the two CONFIG_X86_EXTENDED_PLATFORM entries There are two menu entries for X86_EXTENDED_PLATFORM, one for X86_32 and the other for X86_64. These entries are nearly identical, with the only difference being the platform list in the help message. While this structure was intended by commit `8425091ff8` ("x86: improve the help text of X86_EXTENDED_PLATFORM"), there is no need to duplicate the entire config entry. Instead, provide a little more clarification in the help message. [ bp: Massage. ] Signed-off-by: Masahiro Yamada <masahiroy@kernel.org> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20240204100719.42574-1-masahiroy@kernel.org	2024-04-21 18:41:40 +02:00
Linus Torvalds	3b68086599	- Add a missing memory barrier in the concurrency ID mm switching -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmYk064ACgkQEsHwGGHe VUo+Gg/9GvZD7InmNHQx+TZORbCphljcx110JxB0wh6l1hTdqy57VbLjZtnztCVt EkFGku/5DEDLQVDD1VHvR6w8w3rayqBAhbGApnJfVGEDJfqyYBGXG4Nhx6qsJD5F YBtgk+HZjZ7G5kmGlokgNKXmh4NadUwADbkBaV+iSXBQs+Teld1THJFylO7ju+rZ jUttExna3YhsiHdcRwBhlxFwHm0SXsd591n9slTxNoIRg19SVAvfkYmcoL2W70c0 qtHtaVS37sHInFuVzI53W82hNfGLLgPt+vkQEtiG00CwR4956I2LkorsfAG3JwAA CPlbgO3ypwWSJjSdRrvYkjn7pw9JtlMQMpVh+ypaMgQ1zucb6FpDEzHnAA4lbgNW u31ALXlqhUcnKC/3HZC+vPQosRNSyBP2rWtNNdMZ3YKrW4vU/5zII+4foZj9kv3u irH27Suh6aKMc1y6THNDZmwcgJIFRtvr3K1fm7WwM0z5qz5uNq3sOW5RIQTTL3ti +lW7pK1NqUHZhTsOs09+kt/M7vfxyrMqk5qma18kgdgkaRCwCSJLcPGICVCM5WmD mnLsbs15PIKdN4cwfqeC0KzWr/xroPn1VHiWgU9zZeXcEVss0+UDcMbopQERsXH0 6AJV5A0/H/PVfDFKs5aTdwzQXB3LKG2AGdGoX7KpN8vtHf5TFt4= =YLBT -----END PGP SIGNATURE----- Merge tag 'sched_urgent_for_v6.9_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fix from Borislav Petkov: - Add a missing memory barrier in the concurrency ID mm switching * tag 'sched_urgent_for_v6.9_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched: Add missing memory barrier in switch_mm_cid	2024-04-21 09:39:36 -07:00
Linus Torvalds	d07a0b861d	- Fix CPU feature dependencies of GFNI, VAES, and VPCLMULQDQ - Print the correct error code when FRED reports a bad event type - Add a FRED-specific INT80 handler without the special dances that need to happen in the current one - Enable the using-the-default-return-thunk-but-you-should-not warning only on configs which actually enable those special return thunks - Check the proper feature flags when selecting BHI retpoline mitigation -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmYk0HUACgkQEsHwGGHe VUr7LQ//SZAd9Gcww794BP54TelXnrOWv7A5VZrzt1W3vknQCavKVdsY4seUVkT/ mSzg0GPXMUNf5SA11y41k9ZxpuBp3J1rF/NG/x0tkB7mIxP6Lmuseaun9gQQd00e 9YzyK9Da9J9svQvWFd3cPDHiQT6vwz/05f2iyc7xVgSqnatRs/xl/BYZkJm6xl9w nSOjD4XEy5pydP1W0mH4/g+rWG5rBTzhPkbehX+aRFhC+fSEYv0g9TJpMwWs82vv RJwbUaSxHBtycZ9x77WoBJOvtEVISO35JQuJfYzK82MUiZS5pND4dpzjyF/0mQjQ kfVTOqWnE1jtDX+QaqCmwEVrGfEGw2aratQDXIcahqhpNUX269X4XuFSH5vhVF+V yblOx8zX4og4/g0mQ9ijddc2Jr9fbGPNAZwFQztiK+uGaua/S6E/smtte0YOyV5D qZD1DXtDg4WjjxRY2tPQTuIEBGicb6xDY7rby7kXzAwIgZO/rcVL3/kWDMzBffIK In0Khlu33dezzvpQVWgkQcoK/mGVywJyyOxpYwuiLRZVfarMt304/i7FbCGSAf+4 hGVdl+gfa8+j2vjveQe3a2QeCfjo1XkYwVJYJdzWmTnoU90mAzQbeFm3u+didEqE kj1ObMlKNbrsv70M805Vmequ5VwKTxPyHcbxI+tjKGS/TC8Gy+0= =53Ih -----END PGP SIGNATURE----- Merge tag 'x86_urgent_for_v6.9_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Borislav Petkov: - Fix CPU feature dependencies of GFNI, VAES, and VPCLMULQDQ - Print the correct error code when FRED reports a bad event type - Add a FRED-specific INT80 handler without the special dances that need to happen in the current one - Enable the using-the-default-return-thunk-but-you-should-not warning only on configs which actually enable those special return thunks - Check the proper feature flags when selecting BHI retpoline mitigation * tag 'x86_urgent_for_v6.9_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/cpufeatures: Fix dependencies for GFNI, VAES, and VPCLMULQDQ x86/fred: Fix incorrect error code printout in fred_bad_type() x86/fred: Fix INT80 emulation for FRED x86/retpolines: Enable the default thunk warning only on relevant configs x86/bugs: Fix BHI retpoline check	2024-04-21 09:36:12 -07:00
Linus Torvalds	817772266d	* Clean up SVM's enter/exit assembly code so that it can be compiled without OBJECT_FILES_NON_STANDARD. This fixes a warning "Unpatched return thunk in use. This should not happen!" when running KVM selftests. * Fix a mostly benign bug in the gfn_to_pfn_cache infrastructure where KVM would allow userspace to refresh the cache with a bogus GPA. The bug has existed for quite some time, but was exposed by a new sanity check added in 6.9 (to ensure a cache is either GPA-based or HVA-based). * Drop an unused param from gfn_to_pfn_cache_invalidate_start() that got left behind during a 6.9 cleanup. * Fix a math goof in x86's hugepage logic for KVM_SET_MEMORY_ATTRIBUTES that results in an array overflow (detected by KASAN). * Fix a bug where KVM incorrectly clears root_role.direct when userspace sets guest CPUID. * Fix a dirty logging bug in the where KVM fails to write-protect SPTEs used by a nested guest, if KVM is using Page-Modification Logging and the nested hypervisor is NOT using EPT. x86 PMU: * Drop support for virtualizing adaptive PEBS, as KVM's implementation is architecturally broken without an obvious/easy path forward, and because exposing adaptive PEBS can leak host LBRs to the guest, i.e. can leak host kernel addresses to the guest. * Set the enable bits for general purpose counters in PERF_GLOBAL_CTRL at RESET time, as done by both Intel and AMD processors. * Disable LBR virtualization on CPUs that don't support LBR callstacks, as KVM unconditionally uses PERF_SAMPLE_BRANCH_CALL_STACK when creating the perf event, and would fail on such CPUs. Tests: * Fix a flaw in the max_guest_memory selftest that results in it exhausting the supply of ucall structures when run with more than 256 vCPUs. * Mark KVM_MEM_READONLY as supported for RISC-V in set_memory_region_test. -----BEGIN PGP SIGNATURE----- iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmYjdqcUHHBib256aW5p QHJlZGhhdC5jb20ACgkQv/vSX3jHroPNRAgAh1AdKBAWnq9bFN2Np1kSAcRAk3bs REDq/0iD1T9TvIwEmE1lHaRuqvCSO15WW+DKvbs7TS8zA0DyY7X/x8sIIy5YzZ5C bQ+JXiqk55OAj0sPskBpCvE5qEreuU8qAit57+8OseKWs57EICvJjrfsRnHlmIub pgGas3I42LjIgsuZRr2kjv+GrvaiikW+wWK6sq3CvPzTtHV196d26AK5l4NOoLkY 0FTbBIYUSJ7wxs92xuTed5mZ7JFZdsa5DVMXF5MRZ9W6g2vZCLbqCNRddRhSAsl0 gKmqZkuPTB7AnGQbJ2h/aKFT0ydsguzqbbKq62sK7ft5f1CUlbp9luDC9w== =99rq -----END PGP SIGNATURE----- Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm Pull kvm fixes from Paolo Bonzini: "This is a bit on the large side, mostly due to two changes: - Changes to disable some broken PMU virtualization (see below for details under "x86 PMU") - Clean up SVM's enter/exit assembly code so that it can be compiled without OBJECT_FILES_NON_STANDARD. This fixes a warning "Unpatched return thunk in use. This should not happen!" when running KVM selftests. Everything else is small bugfixes and selftest changes: - Fix a mostly benign bug in the gfn_to_pfn_cache infrastructure where KVM would allow userspace to refresh the cache with a bogus GPA. The bug has existed for quite some time, but was exposed by a new sanity check added in 6.9 (to ensure a cache is either GPA-based or HVA-based). - Drop an unused param from gfn_to_pfn_cache_invalidate_start() that got left behind during a 6.9 cleanup. - Fix a math goof in x86's hugepage logic for KVM_SET_MEMORY_ATTRIBUTES that results in an array overflow (detected by KASAN). - Fix a bug where KVM incorrectly clears root_role.direct when userspace sets guest CPUID. - Fix a dirty logging bug in the where KVM fails to write-protect SPTEs used by a nested guest, if KVM is using Page-Modification Logging and the nested hypervisor is NOT using EPT. x86 PMU: - Drop support for virtualizing adaptive PEBS, as KVM's implementation is architecturally broken without an obvious/easy path forward, and because exposing adaptive PEBS can leak host LBRs to the guest, i.e. can leak host kernel addresses to the guest. - Set the enable bits for general purpose counters in PERF_GLOBAL_CTRL at RESET time, as done by both Intel and AMD processors. - Disable LBR virtualization on CPUs that don't support LBR callstacks, as KVM unconditionally uses PERF_SAMPLE_BRANCH_CALL_STACK when creating the perf event, and would fail on such CPUs. Tests: - Fix a flaw in the max_guest_memory selftest that results in it exhausting the supply of ucall structures when run with more than 256 vCPUs. - Mark KVM_MEM_READONLY as supported for RISC-V in set_memory_region_test" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (30 commits) KVM: Drop unused @may_block param from gfn_to_pfn_cache_invalidate_start() KVM: selftests: Add coverage of EPT-disabled to vmx_dirty_log_test KVM: x86/mmu: Fix and clarify comments about clearing D-bit vs. write-protecting KVM: x86/mmu: Remove function comments above clear_dirty_{gfn_range,pt_masked}() KVM: x86/mmu: Write-protect L2 SPTEs in TDP MMU when clearing dirty status KVM: x86/mmu: Precisely invalidate MMU root_role during CPUID update KVM: VMX: Disable LBR virtualization if the CPU doesn't support LBR callstacks perf/x86/intel: Expose existence of callback support to KVM KVM: VMX: Snapshot LBR capabilities during module initialization KVM: x86/pmu: Do not mask LVTPC when handling a PMI on AMD platforms KVM: x86: Snapshot if a vCPU's vendor model is AMD vs. Intel compatible KVM: x86: Stop compiling vmenter.S with OBJECT_FILES_NON_STANDARD KVM: SVM: Create a stack frame in __svm_sev_es_vcpu_run() KVM: SVM: Save/restore args across SEV-ES VMRUN via host save area KVM: SVM: Save/restore non-volatile GPRs in SEV-ES VMRUN via host save area KVM: SVM: Clobber RAX instead of RBX when discarding spec_ctrl_intercepted KVM: SVM: Drop 32-bit "support" from __svm_sev_es_vcpu_run() KVM: SVM: Wrap __svm_sev_es_vcpu_run() with #ifdef CONFIG_KVM_AMD_SEV KVM: SVM: Create a stack frame in __svm_vcpu_run() for unwinding KVM: SVM: Remove a useless zeroing of allocated memory ...	2024-04-20 11:10:51 -07:00
Ard Biesheuvel	cba786af84	x86/purgatory: Switch to the position-independent small code model On x86, the ordinary, position dependent small and kernel code models only support placement of the executable in 32-bit addressable memory, due to the use of 32-bit signed immediates to generate references to global variables. For the kernel, this implies that all global variables must reside in the top 2 GiB of the kernel virtual address space, where the implicit address bits 63:32 are equal to sign bit 31. This means the kernel code model is not suitable for other bare metal executables such as the kexec purgatory, which can be placed arbitrarily in the physical address space, where its address may no longer be representable as a sign extended 32-bit quantity. For this reason, commit `e16c2983fb` ("x86/purgatory: Change compiler flags from -mcmodel=kernel to -mcmodel=large to fix kexec relocation errors") switched to the large code model, which uses 64-bit immediates for all symbol references, including function calls, in order to avoid relying on any assumptions regarding proximity of symbols in the final executable. The large code model is rarely used, clunky and the least likely to operate in a similar fashion when comparing GCC and Clang, so it is best avoided. This is especially true now that Clang 18 has started to emit executable code in two separate sections (.text and .ltext), which triggers an issue in the kexec loading code at runtime. The SUSE bugzilla fixes tag points to gcc 13 having issues with the large model too and that perhaps the large model should simply not be used at all. Instead, use the position independent small code model, which makes no assumptions about placement but only about proximity, where all referenced symbols must be within -/+ 2 GiB, i.e., in range for a RIP-relative reference. Use hidden visibility to suppress the use of a GOT, which carries absolute addresses that are not covered by static ELF relocations, and is therefore incompatible with the kexec loader's relocation logic. [ bp: Massage commit message. ] Fixes: `e16c2983fb` ("x86/purgatory: Change compiler flags from -mcmodel=kernel to -mcmodel=large to fix kexec relocation errors") Fixes: https://bugzilla.suse.com/show_bug.cgi?id=1211853 Closes: https://github.com/ClangBuiltLinux/linux/issues/2016 Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Nathan Chancellor <nathan@kernel.org> Reviewed-by: Fangrui Song <maskray@google.com> Acked-by: Nick Desaulniers <ndesaulniers@google.com> Tested-by: Nathan Chancellor <nathan@kernel.org> Link: https://lore.kernel.org/all/20240417-x86-fix-kexec-with-llvm-18-v1-0-5383121e8fb7@kernel.org/	2024-04-20 18:54:59 +02:00
Isaku Yamahata	8131cf5b4f	KVM: VMX: Introduce test mode related to EPT violation VE To support TDX, KVM is enhanced to operate with #VE. For TDX, KVM uses the suppress #VE bit in EPT entries selectively, in order to be able to trap non-present conditions. However, #VE isn't used for VMX and it's a bug if it happens. To be defensive and test that VMX case isn't broken introduce an option ept_violation_ve_test and when it's set, BUG the vm. Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Message-Id: <d6db6ba836605c0412e166359ba5c46a63c22f86.1705965635.git.isaku.yamahata@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2024-04-19 12:15:21 -04:00
Paolo Bonzini	fb29541ead	KVM, x86: add architectural support code for #VE Dump the contents of the #VE info data structure and assert that #VE does not happen, but do not yet do anything with it. No functional change intended, separated for clarity only. Extracted from a patch by Isaku Yamahata <isaku.yamahata@intel.com>. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2024-04-19 12:15:20 -04:00
Sean Christopherson	949019b982	KVM: x86/mmu: Track shadow MMIO value on a per-VM basis TDX will use a different shadow PTE entry value for MMIO from VMX. Add a member to kvm_arch and track value for MMIO per-VM instead of a global variable. By using the per-VM EPT entry value for MMIO, the existing VMX logic is kept working. Introduce a separate setter function so that guest TD can use a different value later. Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Message-Id: <229a18434e5d83f45b1fcd7bf1544d79db1becb6.1705965635.git.isaku.yamahata@intel.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2024-04-19 12:15:20 -04:00
Isaku Yamahata	7fa5e29291	KVM: x86/mmu: Add Suppress VE bit to EPT shadow_mmio_mask/shadow_present_mask To make use of the same value of shadow_mmio_mask and shadow_present_mask for TDX and VMX, add Suppress-VE bit to shadow_mmio_mask and shadow_present_mask so that they can be common for both VMX and TDX. TDX will require shadow_mmio_mask and shadow_present_mask to include VMX_SUPPRESS_VE for shared GPA so that EPT violation is triggered for shared GPA. For VMX, VMX_SUPPRESS_VE doesn't matter for MMIO because the spte value is defined so as to cause EPT misconfig. Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Message-Id: <97cc616b3563cd8277be91aaeb3e14bce23c3649.1705965635.git.isaku.yamahata@intel.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2024-04-19 12:15:19 -04:00
Sean Christopherson	7f01cab849	KVM: x86/mmu: Allow non-zero value for non-present SPTE and removed SPTE For TD guest, the current way to emulate MMIO doesn't work any more, as KVM is not able to access the private memory of TD guest and do the emulation. Instead, TD guest expects to receive #VE when it accesses the MMIO and then it can explicitly make hypercall to KVM to get the expected information. To achieve this, the TDX module always enables "EPT-violation #VE" in the VMCS control. And accordingly, for the MMIO spte for the shared GPA, 1. KVM needs to set "suppress #VE" bit for the non-present SPTE so that EPT violation happens on TD accessing MMIO range. 2. On EPT violation, KVM sets the MMIO spte to clear "suppress #VE" bit so the TD guest can receive the #VE instead of EPT misconfiguration unlike VMX case. For the shared GPA that is not populated yet, EPT violation need to be triggered when TD guest accesses such shared GPA. The non-present SPTE value for shared GPA should set "suppress #VE" bit. Add "suppress #VE" bit (bit 63) to SHADOW_NONPRESENT_VALUE and REMOVED_SPTE. Unconditionally set the "suppress #VE" bit (which is bit 63) for both AMD and Intel as: 1) AMD hardware doesn't use this bit when present bit is off; 2) for normal VMX guest, KVM never enables the "EPT-violation #VE" in VMCS control and "suppress #VE" bit is ignored by hardware. Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Message-Id: <a99cb866897c7083430dce7f24c63b17d7121134.1705965635.git.isaku.yamahata@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2024-04-19 12:15:19 -04:00
Sean Christopherson	d8fa2031fa	KVM: x86/mmu: Replace hardcoded value 0 for the initial value for SPTE The TDX support will need the "suppress #VE" bit (bit 63) set as the initial value for SPTE. To reduce code change size, introduce a new macro SHADOW_NONPRESENT_VALUE for the initial value for the shadow page table entry (SPTE) and replace hard-coded value 0 for it. Initialize shadow page tables with their value. The plan is to unconditionally set the "suppress #VE" bit for both AMD and Intel as: 1) AMD hardware uses the bit 63 as NX for present SPTE and ignored for non-present SPTE; 2) for conventional VMX guests, KVM never enables the "EPT-violation #VE" in VMCS control and "suppress #VE" bit is ignored by hardware. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Message-Id: <acdf09bf60cad12c495005bf3495c54f6b3069c9.1705965635.git.isaku.yamahata@intel.com> [Remove unnecessary CONFIG_X86_64 check. - Paolo] Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2024-04-19 12:15:18 -04:00
Paolo Bonzini	a96cb3bf39	Merge x86 bugfixes from Linux 6.9-rc3 Pull fix for SEV-SNP late disable bugs. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2024-04-19 09:02:22 -04:00
Eric Biggers	543ea178fb	crypto: x86/aes-xts - optimize size of instructions operating on lengths x86_64 has the "interesting" property that the instruction size is generally a bit shorter for instructions that operate on the 32-bit (or less) part of registers, or registers that are in the original set of 8. This patch adjusts the AES-XTS code to take advantage of that property by changing the LEN parameter from size_t to unsigned int (which is all that's needed and is what the non-AVX implementation uses) and using the %eax register for KEYLEN. This decreases the size of aes-xts-avx-x86_64.o by 1.2%. Note that changing the kmovq to kmovd was going to be needed anyway to make the AVX10/256 code really work on CPUs that don't support 512-bit vectors (since the AVX10 spec says that 64-bit opmask instructions will only be supported on processors that support 512-bit vectors). Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2024-04-19 18:54:19 +08:00
Eric Biggers	e619723a85	crypto: x86/aes-xts - eliminate a few more instructions - For conditionally subtracting 16 from LEN when decrypting a message whose length isn't a multiple of 16, use the cmovnz instruction. - Fold the addition of 4*VL to LEN into the sub of VL or 16 from LEN. - Remove an unnecessary test instruction. This results in slightly shorter code, both source and binary. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2024-04-19 18:54:19 +08:00
Eric Biggers	2717e01fc3	crypto: x86/aes-xts - handle AES-128 and AES-192 more efficiently Decrease the amount of code specific to the different AES variants by "right-aligning" the sequence of round keys, and for AES-128 and AES-192 just skipping irrelevant rounds at the beginning. This shrinks the size of aes-xts-avx-x86_64.o by 13.3%, and it improves the efficiency of AES-128 and AES-192. The tradeoff is that for AES-256 some additional not-taken conditional jumps are now executed. But these are predicted well and are cheap on x86. Note that the ARMv8 CE based AES-XTS implementation uses a similar strategy to handle the different AES variants. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2024-04-19 18:54:19 +08:00
Eric Biggers	ea9459ef36	crypto: x86/aesni-xts - deduplicate aesni_xts_enc() and aesni_xts_dec() Since aesni_xts_enc() and aesni_xts_dec() are very similar, generate them from a macro that's passed an argument enc=1 or enc=0. This reduces the length of aesni-intel_asm.S by 112 lines while still producing the exact same object file in both 32-bit and 64-bit mode. Signed-off-by: Eric Biggers <ebiggers@google.com> Reviewed-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2024-04-19 18:54:19 +08:00
Eric Biggers	1d27e1f5c8	crypto: x86/aes-xts - handle CTS encryption more efficiently When encrypting a message whose length isn't a multiple of 16 bytes, encrypt the last full block in the main loop. This works because only decryption uses the last two tweaks in reverse order, not encryption. This improves the performance of decrypting messages whose length isn't a multiple of the AES block length, shrinks the size of aes-xts-avx-x86_64.o by 5.0%, and eliminates two instructions (a test and a not-taken conditional jump) when encrypting a message whose length is a multiple of the AES block length. While it's not super useful to optimize for ciphertext stealing given that it's rarely needed in practice, the other two benefits mentioned above make this optimization worthwhile. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2024-04-19 18:54:19 +08:00
Eric Biggers	7daba20cc7	crypto: x86/sha256-ni - simplify do_4rounds Instead of loading the message words into both MSG and \m0 and then adding the round constants to MSG, load the message words into \m0 and the round constants into MSG and then add \m0 to MSG. This shortens the source code slightly. It changes the instructions slightly, but it doesn't affect binary code size and doesn't seem to affect performance. Suggested-by: Stefan Kanthak <stefan.kanthak@nexgo.de> Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2024-04-19 18:54:18 +08:00
Eric Biggers	59e62b20ac	crypto: x86/sha256-ni - optimize code size - Load the SHA-256 round constants relative to a pointer that points into the middle of the constants rather than to the beginning. Since x86 instructions use signed offsets, this decreases the instruction length required to access some of the later round constants. - Use punpcklqdq or punpckhqdq instead of longer instructions such as pshufd, pblendw, and palignr. This doesn't harm performance. The end result is that sha256_ni_transform shrinks from 839 bytes to 791 bytes, with no loss in performance. Suggested-by: Stefan Kanthak <stefan.kanthak@nexgo.de> Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2024-04-19 18:54:18 +08:00
Eric Biggers	1b5ddb067d	crypto: x86/sha256-ni - rename some register aliases MSGTMP[0-3] are used to hold the message schedule and are not temporary registers per se. MSGTMP4 is used as a temporary register for several different purposes and isn't really related to MSGTMP[0-3]. Rename them to MSG[0-3] and TMP accordingly. Also add a comment that clarifies what MSG is. Suggested-by: Stefan Kanthak <stefan.kanthak@nexgo.de> Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2024-04-19 18:54:18 +08:00
Eric Biggers	ffaec34b0f	crypto: x86/sha256-ni - convert to use rounds macros To avoid source code duplication, do the SHA-256 rounds using macros. This reduces the length of sha256_ni_asm.S by 153 lines while still producing the exact same object file. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2024-04-19 18:54:18 +08:00
Eric Biggers	b924ecd305	crypto: x86/aes-xts - access round keys using single-byte offsets Access the AES round keys using offsets -716 through 716, instead of 016 through 1416. This allows VEX-encoded instructions to address all round keys using 1-byte offsets, whereas before some needed 4-byte offsets. This decreases the code size of aes-xts-avx-x86_64.o by 4.2%. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2024-04-19 18:54:18 +08:00
Jakub Kicinski	41e3ddb291	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Cross-merge networking fixes after downstream PR. Conflicts: include/trace/events/rpcgss.h `386f4a7379` ("trace: events: cleanup deprecated strncpy uses") `a4833e3aba` ("SUNRPC: Fix rpcgss_context trace event acceptor field") Adjacent changes: drivers/net/ethernet/intel/ice/ice_tc_lib.c `2cca35f5dd` ("ice: Fix checking for unsupported keys on non-tunnel device") `784feaa65d` ("ice: Add support for PFCP hardware offload in switchdev") Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-04-18 13:12:24 -07:00
Alexei Starovoitov	462e5e2a59	bpf: Fix JIT of is_mov_percpu_addr instruction. The codegen for is_mov_percpu_addr instruction works for rax/r8 registers only. Fix it to generate proper x86 byte code for other registers. Fixes: `7bdbf74463` ("bpf: add special internal-only MOV instruction to resolve per-CPU addrs") Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20240417214406.15788-1-alexei.starovoitov@gmail.com	2024-04-18 09:03:21 -07:00
Eric Biggers	9543f6e266	x86/cpufeatures: Fix dependencies for GFNI, VAES, and VPCLMULQDQ Fix cpuid_deps[] to list the correct dependencies for GFNI, VAES, and VPCLMULQDQ. These features don't depend on AVX512, and there exist CPUs that support these features but not AVX512. GFNI actually doesn't even depend on AVX. This prevents GFNI from being unnecessarily disabled if AVX is disabled to mitigate the GDS vulnerability. This also prevents all three features from being unnecessarily disabled if AVX512VL (or its dependency AVX512F) were to be disabled, but it looks like there isn't any case where this happens anyway. Fixes: `c128dbfa0f` ("x86/cpufeatures: Enable new SSE/AVX/AVX512 CPU features") Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/r/20240417060434.47101-1-ebiggers@kernel.org	2024-04-18 17:27:52 +02:00
Hou Wenlong	a4b37f5033	x86/fred: Fix incorrect error code printout in fred_bad_type() regs->orig_ax has been set to -1 on entry so in the printout, fred_bad_type() should use the passed parameter error_code. Fixes: `14619d912b` ("x86/fred: FRED entry/exit and dispatch code") Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Acked-by: H. Peter Anvin (Intel) <hpa@zytor.com> Link: https://lore.kernel.org/r/b2a8f0a41449d25240e314a2ddfbf6549511fb04.1713353612.git.houwenlong.hwl@antgroup.com	2024-04-18 10:47:17 +02:00
Xin Li (Intel)	32f5f73b79	x86/fred: Fix INT80 emulation for FRED Add a FRED-specific INT80 handler and document why it differs from the current one. Eventually, the common bits will be unified once FRED hw is available and it turns out that no further changes are needed but for now, keep the handlers separate for everyone's sanity's sake. [ bp: Zap duplicated commit message, massage. ] Fixes: `55617fb991` ("x86/entry: Do not allow external 0x80 interrupts") Suggested-by: H. Peter Anvin (Intel) <hpa@zytor.com> Signed-off-by: Xin Li (Intel) <xin@zytor.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20240417174731.4189592-1-xin@zytor.com	2024-04-18 10:37:11 +02:00
Borislav Petkov (AMD)	6376306add	x86/retpolines: Enable the default thunk warning only on relevant configs The using-default-thunk warning check makes sense only with configurations which actually enable the special return thunks. Otherwise, it fires on unrelated 32-bit configs on which the special return thunks won't even work (they're 64-bit only) and, what is more, those configs even go off into the weeds when booting in the alternatives patching code, leading to a dead machine. Fixes: `4461438a84` ("x86/retpoline: Ensure default return thunk isn't used at runtime") Reported-by: Klara Modin <klarasmodin@gmail.com> Reported-by: Erhard Furtner <erhard_f@mailbox.org> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Tested-by: Klara Modin <klarasmodin@gmail.com> Link: https://lore.kernel.org/r/78e0d19c-b77a-4169-a80f-2eef91f4a1d6@gmail.com Link: https://lore.kernel.org/r/20240413024956.488d474e@yea	2024-04-17 18:02:05 +02:00
Paolo Bonzini	44ecfa3e5f	Merge branch 'svm' of https://github.com/kvm-x86/linux into HEAD Clean up SVM's enter/exit assembly code so that it can be compiled without OBJECT_FILES_NON_STANDARD. The "standard" __svm_vcpu_run() can't be made 100% bulletproof, as RBP isn't restored on #VMEXIT, but that's also the case for __vmx_vcpu_run(), and getting "close enough" is better than not even trying. As for SEV-ES, after yet another refresher on swap types, I realized KVM can simply let the hardware restore registers after #VMEXIT, all that's missing is storing the current values to the host save area (they are swap type B). This should provide 100% accuracy when using stack frames for unwinding, and requires less assembly. In between, build the SEV-ES code iff CONFIG_KVM_AMD_SEV=y, and yank out "support" for 32-bit kernels in __svm_sev_es_vcpu_run, which was unnecessarily polluting the code for a configuration that is disabled at build time. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2024-04-17 11:44:37 -04:00
Paolo Bonzini	1c3bed8006	KVM fixes for 6.9-rcN: - Fix a mostly benign bug in the gfn_to_pfn_cache infrastructure where KVM would allow userspace to refresh the cache with a bogus GPA. The bug has existed for quite some time, but was exposed by a new sanity check added in 6.9 (to ensure a cache is either GPA-based or HVA-based). - Drop an unused param from gfn_to_pfn_cache_invalidate_start() that got left behind during a 6.9 cleanup. - Disable support for virtualizing adaptive PEBS, as KVM's implementation is architecturally broken and can leak host LBRs to the guest. - Fix a bug where KVM neglects to set the enable bits for general purpose counters in PERF_GLOBAL_CTRL when initializing the virtual PMU. Both Intel and AMD architectures require the bits to be set at RESET in order for v2 PMUs to be backwards compatible with software that was written for v1 PMUs, i.e. for software that will never manually set the global enables. - Disable LBR virtualization on CPUs that don't support LBR callstacks, as KVM unconditionally uses PERF_SAMPLE_BRANCH_CALL_STACK when creating the virtual LBR perf event, i.e. KVM will always fail to create LBR events on such CPUs. - Fix a math goof in x86's hugepage logic for KVM_SET_MEMORY_ATTRIBUTES that results in an array overflow (detected by KASAN). - Fix a flaw in the max_guest_memory selftest that results in it exhausting the supply of ucall structures when run with more than 256 vCPUs. - Mark KVM_MEM_READONLY as supported for RISC-V in set_memory_region_test. - Fix a bug where KVM incorrectly thinks a TDP MMU root is an indirect shadow root due KVM unnecessarily clobbering root_role.direct when userspace sets guest CPUID. - Fix a dirty logging bug in the where KVM fails to write-protect TDP MMU SPTEs used for L2 if Page-Modification Logging is enabled for L1 and the L1 hypervisor is NOT using EPT (if nEPT is enabled, KVM doesn't use the TDP MMU to run L2). For simplicity, KVM always disables PML when running L2, but the TDP MMU wasn't accounting for root-specific conditions that force write- protect based dirty logging. -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmYYRoUACgkQOlYIJqCj N/2sDQ/8Dgd8lvzHieVZaWRCXzvtrmqZqxr08NTHJo4yqXiPxUd5z3lC1s6mSSQc RHAD21A6JstSdz6O6p3Y+koYws8YTVAZNhlBCiRnVyNuopEs+EVmUQQI5YfQiVFO 0dX7aWRUlPH7q4OQVFhI7/owLahsuzvYCEFInWQt+586oQCpkPiiRRKF48d+n/Ba fuY2jYxmxI72lMoSVFE/ZSh23lKyhpyiJW/qMCBv2jbNFR8tkbrQkcuBMaHJ6Z7d f/7sJ4T5SA4VH+4fwctONqepAGk1jLcfZFl/21Peyf2Ieh/Oy1d1+MOmVgbpdUZR WE9pVsktoDMH4tMSgNI7uOgVIh43/mDVIoYwYnfrKFjoASGWpFJV7UOf87X2soVi MHxjYKc9PXkaG8Kua1jM0VB2jo7LKFtSoHjFBHLeKJa9Y2CS1eE8y0iWarZufEtA tlt6KUqOdICzB8lbNWLwRtB9jp3V/LYWRJ+YqL3QKiN9kpTB79qH+mIOjhzunASV RfkT8No76dCoTgX1e/qhElmWJ0OBB0zhtmELxHxGCH5AUZG4JgebyomsqkZaUAeM DMgMb3nZMiijW94n8xQCGVEJ1SHL3L70DtNFej3udY6Q49c6RDsoppkMSlO3D90r ratTwHhMc5KTk51zDW+DRmVgbBZwyhDfVK2KKJi37PbObfbJyIY= =0hRN -----END PGP SIGNATURE----- Merge tag 'kvm-x86-fixes-6.9-rcN' of https://github.com/kvm-x86/linux into HEAD - Fix a mostly benign bug in the gfn_to_pfn_cache infrastructure where KVM would allow userspace to refresh the cache with a bogus GPA. The bug has existed for quite some time, but was exposed by a new sanity check added in 6.9 (to ensure a cache is either GPA-based or HVA-based). - Drop an unused param from gfn_to_pfn_cache_invalidate_start() that got left behind during a 6.9 cleanup. - Disable support for virtualizing adaptive PEBS, as KVM's implementation is architecturally broken and can leak host LBRs to the guest. - Fix a bug where KVM neglects to set the enable bits for general purpose counters in PERF_GLOBAL_CTRL when initializing the virtual PMU. Both Intel and AMD architectures require the bits to be set at RESET in order for v2 PMUs to be backwards compatible with software that was written for v1 PMUs, i.e. for software that will never manually set the global enables. - Disable LBR virtualization on CPUs that don't support LBR callstacks, as KVM unconditionally uses PERF_SAMPLE_BRANCH_CALL_STACK when creating the virtual LBR perf event, i.e. KVM will always fail to create LBR events on such CPUs. - Fix a math goof in x86's hugepage logic for KVM_SET_MEMORY_ATTRIBUTES that results in an array overflow (detected by KASAN). - Fix a flaw in the max_guest_memory selftest that results in it exhausting the supply of ucall structures when run with more than 256 vCPUs. - Mark KVM_MEM_READONLY as supported for RISC-V in set_memory_region_test. - Fix a bug where KVM incorrectly thinks a TDP MMU root is an indirect shadow root due KVM unnecessarily clobbering root_role.direct when userspace sets guest CPUID. - Fix a dirty logging bug in the where KVM fails to write-protect TDP MMU SPTEs used for L2 if Page-Modification Logging is enabled for L1 and the L1 hypervisor is NOT using EPT (if nEPT is enabled, KVM doesn't use the TDP MMU to run L2). For simplicity, KVM always disables PML when running L2, but the TDP MMU wasn't accounting for root-specific conditions that force write- protect based dirty logging.	2024-04-16 12:50:21 -04:00
Mathieu Desnoyers	fe90f3967b	sched: Add missing memory barrier in switch_mm_cid Many architectures' switch_mm() (e.g. arm64) do not have an smp_mb() which the core scheduler code has depended upon since commit: commit `223baf9d17` ("sched: Fix performance regression introduced by mm_cid") If switch_mm() doesn't call smp_mb(), sched_mm_cid_remote_clear() can unset the actively used cid when it fails to observe active task after it sets lazy_put. There is a memory barrier between storing to rq->curr and _return to userspace_ (as required by membarrier), but the rseq mm_cid has stricter requirements: the barrier needs to be issued between store to rq->curr and switch_mm_cid(), which happens earlier than: - spin_unlock(), - switch_to(). So it's fine when the architecture switch_mm() happens to have that barrier already, but less so when the architecture only provides the full barrier in switch_to() or spin_unlock(). It is a bug in the rseq switch_mm_cid() implementation. All architectures that don't have memory barriers in switch_mm(), but rather have the full barrier either in finish_lock_switch() or switch_to() have them too late for the needs of switch_mm_cid(). Introduce a new smp_mb__after_switch_mm(), defined as smp_mb() in the generic barrier.h header, and use it in switch_mm_cid() for scheduler transitions where switch_mm() is expected to provide a memory barrier. Architectures can override smp_mb__after_switch_mm() if their switch_mm() implementation provides an implicit memory barrier. Override it with a no-op on x86 which implicitly provide this memory barrier by writing to CR3. Fixes: `223baf9d17` ("sched: Fix performance regression introduced by mm_cid") Reported-by: levi.yun <yeoreum.yun@arm.com> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> # for arm64 Acked-by: Dave Hansen <dave.hansen@linux.intel.com> # for x86 Cc: <stable@vger.kernel.org> # 6.4.x Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/20240415152114.59122-2-mathieu.desnoyers@efficios.com	2024-04-16 13:59:45 +02:00
Uros Bizjak	d26e46f6bf	locking/atomic/x86: Introduce arch_try_cmpxchg64_local() Introduce arch_try_cmpxchg64_local() for 64-bit and 32-bit targets to improve code using cmpxchg64_local(). On 64-bit targets, the generated assembly improves from: 3e28: 31 c0 xor %eax,%eax 3e2a: 4d 0f b1 7d 00 cmpxchg %r15,0x0(%r13) 3e2f: 48 85 c0 test %rax,%rax 3e32: 0f 85 9f 00 00 00 jne 3ed7 <...> to: 3e28: 31 c0 xor %eax,%eax 3e2a: 4d 0f b1 7d 00 cmpxchg %r15,0x0(%r13) 3e2f: 0f 85 9f 00 00 00 jne 3ed4 <...> where a TEST instruction after CMPXCHG is saved. The improvements for 32-bit targets are even more noticeable, because double-word compare after CMPXCHG8B gets eliminated. Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Waiman Long <longman@redhat.com> Link: https://lore.kernel.org/r/20240414161257.49145-1-ubizjak@gmail.com	2024-04-14 22:40:54 +02:00
Ingo Molnar	d0331aa978	Merge branch 'linus' into perf/core, to pick up perf/urgent fixes Pick up perf/urgent fixes that are upstream already, but not yet in the perf/core development branch. Signed-off-by: Ingo Molnar <mingo@kernel.org>	2024-04-14 22:25:18 +02:00
Juergen Gross	5bc8b0f5da	x86/pat: Fix W^X violation false-positives when running as Xen PV guest When running as Xen PV guest in some cases W^X violation WARN()s have been observed. Those WARN()s are produced by verify_rwx(), which looks into the PTE to verify that writable kernel pages have the NX bit set in order to avoid code modifications of the kernel by rogue code. As the NX bits of all levels of translation entries are or-ed and the RW bits of all levels are and-ed, looking just into the PTE isn't enough for the decision that a writable page is executable, too. When running as a Xen PV guest, the direct map PMDs and kernel high map PMDs share the same set of PTEs. Xen kernel initialization will set the NX bit in the direct map PMD entries, and not the shared PTEs. Fixes: `652c5bf380` ("x86/mm: Refuse W^X violations") Reported-by: Jason Andryuk <jandryuk@gmail.com> Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20240412151258.9171-5-jgross@suse.com	2024-04-14 22:16:30 +02:00
Juergen Gross	02eac06b82	x86/pat: Restructure _lookup_address_cpa() Modify _lookup_address_cpa() to no longer use lookup_address(), but only lookup_address_in_pgd(). This is done in preparation of using lookup_address_in_pgd_attr(). No functional change intended. Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20240412151258.9171-4-jgross@suse.com	2024-04-14 22:16:28 +02:00
Juergen Gross	d29dc5177b	x86/mm: Use lookup_address_in_pgd_attr() in show_fault_oops() Fix show_fault_oops() to not only look at the leaf PTE for detecting NX violations, but to use the effective NX value returned by lookup_address_in_pgd_attr() instead. Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20240412151258.9171-3-jgross@suse.com	2024-04-14 22:16:27 +02:00
Juergen Gross	ceb647b4b5	x86/pat: Introduce lookup_address_in_pgd_attr() Add lookup_address_in_pgd_attr() doing the same as the already existing lookup_address_in_pgd(), but returning the effective settings of the NX and RW bits of all walked page table levels, too. This will be needed in order to match hardware behavior when looking for effective access rights, especially for detecting writable code pages. In order to avoid code duplication, let lookup_address_in_pgd() call lookup_address_in_pgd_attr() with dummy parameters. Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20240412151258.9171-2-jgross@suse.com	2024-04-14 22:16:26 +02:00
Linus Torvalds	27fd80851d	Misc x86 fixes: - Follow up fixes for the BHI mitigations code. - Fix !SPECULATION_MITIGATIONS bug not turning off mitigations as expected. - Work around an APIC emulation bug when the kernel is built with Clang and run as a SEV guest. - Follow up x86 topology fixes. Note that there's minor cleanups included in the BHI fixes, which we'd normally delay to the next merge window, but the BHI mitigations code is new and will be backported widely, so we thought it would be better to have a unified codebase at this stage. (Let me know if that assumption is wrong and I'll rebase it.) Signed-off-by: Ingo Molnar <mingo@kernel.org> -----BEGIN PGP SIGNATURE----- iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmYbmmkRHG1pbmdvQGtl cm5lbC5vcmcACgkQEnMQ0APhK1iArg//R2VYSVgAfUf4cN3rTm8RuX8U/8V4A1C/ FklDWziEWTw8DhQCidiIt7ETuMjvOf7HFzVTfLP+oOiBxHGMCEjxtf74mHRaeZ+f avPvwOfp3mrIdn/h2+t7pSYxZDf310EDvarliIXq0z2hmNjmQjKXo3dyWNiWJe9i LFzST8dRyR0Tg4MjLuY2g9ZEauRHpY6aAVk9UrRi7uaiLyoHnLBASn1DCL1pmMhT cqKfHhilkeUMYwTXbDTu1iIQwBHqcpCmUp2h6VPpxPkTxJXwzo0E4lVuFVu7tzUM yrMZrNxhtC5Cg9NVF56AqZocKjutlJfWXnmuvpc+dM7z1dF/EwFbx2ZM+3PDuw8Z uODs6bVzYlhzWxTMb9Obsp3RvHe1B7ZCFCZ8uo3G9lMFXqu047UwfuwqnZw4YpX6 CDEKz3zLbV4s64HPlvbju3CpX+m9CXhg5cR6HfCJiwYCytb1bxZU0ottnPnYxVCb 9Th7wi2f1sCZvtQ/T8aZpF7FhYe7abt/CDvlDoDxiRg1f+Z2nduznfDJH81nn1KS duZEicG0O+iYi9OCPVBTVutRxSWA6D8D4F0VN7cDRr4QHyXpOFJ1KsZyHBbZo9I6 ubp9RiFNaocgdWiWVgEpvvmlYpVPDnR38oVHNPpNAyIYHu3mlCoS76znZTNX8WfJ GlvHqOp+EM0= =2ig7 -----END PGP SIGNATURE----- Merge tag 'x86-urgent-2024-04-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull misc x86 fixes from Ingo Molnar: - Follow up fixes for the BHI mitigations code - Fix !SPECULATION_MITIGATIONS bug not turning off mitigations as expected - Work around an APIC emulation bug when the kernel is built with Clang and run as a SEV guest - Follow up x86 topology fixes * tag 'x86-urgent-2024-04-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/cpu/amd: Move TOPOEXT enablement into the topology parser x86/cpu/amd: Make the NODEID_MSR union actually work x86/cpu/amd: Make the CPUID 0x80000008 parser correct x86/bugs: Replace CONFIG_SPECTRE_BHI_{ON,OFF} with CONFIG_MITIGATION_SPECTRE_BHI x86/bugs: Remove CONFIG_BHI_MITIGATION_AUTO and spectre_bhi=auto x86/bugs: Clarify that syscall hardening isn't a BHI mitigation x86/bugs: Fix BHI handling of RRSBA x86/bugs: Rename various 'ia32_cap' variables to 'x86_arch_cap_msr' x86/bugs: Cache the value of MSR_IA32_ARCH_CAPABILITIES x86/bugs: Fix BHI documentation x86/cpu: Actually turn off mitigations by default for SPECULATION_MITIGATIONS=n x86/topology: Don't update cpu_possible_map in topo_set_cpuids() x86/bugs: Fix return type of spectre_bhi_state() x86/apic: Force native_apic_mem_read() to use the MOV instruction	2024-04-14 10:48:51 -07:00
Linus Torvalds	a1505c47e7	Fix the x86 PMU multi-counter code returning invalid data in certain circumstances. Signed-off-by: Ingo Molnar <mingo@kernel.org> -----BEGIN PGP SIGNATURE----- iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmYbkE8RHG1pbmdvQGtl cm5lbC5vcmcACgkQEnMQ0APhK1gZURAAjtG4+vnthjGFIC9HSKPFa95aOhAd0tCE NfW8HDTuA1zhK0MoKU2cQkiE/1LVcrSS60UomYUhwB7Hrdx+HCz81W9CMrwyzkY1 0MhiUiq2cSYo0FsWflzipOoFffyY2THrlimqILQmhzo5JlbYA8WNTxAWs6Th651e an5GZJvPq98LhhWEbULiiT+2GGYi4CtbVtZvo+FyCJvYcgxF70EeJ/jvxBUax4yW 23r+uxruo3kuiJoBAs3+OBWt/C+Ij/YF9tKqOW1XdXpPxYAbGKtYw2Ck0EIkSVHS kilU0ig4TMRCUmUXXSqWnTxheZBF7eXwu9cMKdnqydLjazuO5M8uh+yiZEuA/UFA I5YGBmuYe/Q+CaZmobFtRyvRZZE3kF1xTxyyw2vJMDjbW1FOppXPrdFrnbav5/h0 pDxBz7H5f6L/Egyi173cRY8dzcKTjP0adtczb1M5Q6BxnuCO4I0P7RUIg41tabxg YuojtmIGAjhMXVXmC9VUabVDzB2o5vYXbZ4xTM6uG5XG7/2IX0nOJidwhjotw+1D 6LK4U1kcaY8kqolYiUt5zyEc96CPvFwM7w8592Ohxs5arROYEgz0QnjGnmRu8OsP yV5U8lz+aoX2Cwh6n06XJB5GJqWX5BzkJ+D9q/mRqPRL8/BDscD2WcHkKdRE89T6 YkecmTrHYD0= =L86s -----END PGP SIGNATURE----- Merge tag 'perf-urgent-2024-04-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf event fix from Ingo Molnar: "Fix the x86 PMU multi-counter code returning invalid data in certain circumstances" * tag 'perf-urgent-2024-04-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf/x86: Fix out of range data	2024-04-14 10:26:27 -07:00
Josh Poimboeuf	69129794d9	x86/bugs: Fix BHI retpoline check Confusingly, X86_FEATURE_RETPOLINE doesn't mean retpolines are enabled, as it also includes the original "AMD retpoline" which isn't a retpoline at all. Also replace cpu_feature_enabled() with boot_cpu_has() because this is before alternatives are patched and cpu_feature_enabled()'s fallback path is slower than plain old boot_cpu_has(). Fixes: `ec9404e40e` ("x86/bhi: Add BHI mitigation knob") Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/ad3807424a3953f0323c011a643405619f2a4927.1712944776.git.jpoimboe@kernel.org	2024-04-14 11:10:05 +02:00
Li RongQing	90167e9658	x86/sev: Take NUMA node into account when allocating memory for per-CPU SEV data per-CPU SEV data is dominantly accessed from their own local CPUs, so allocate them node-local to improve performance. Signed-off-by: Li RongQing <lirongqing@baidu.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Nikunj A Dadhania <nikunj@amd.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Link: https://lore.kernel.org/r/20240412030130.49704-1-lirongqing@baidu.com Signed-off-by: Ingo Molnar <mingo@kernel.org>	2024-04-12 12:12:11 +02:00
Ingo Molnar	21f546a43a	Merge branch 'x86/urgent' into x86/cpu, to resolve conflict There's a new conflict between this commit pending in x86/cpu: `63edbaa48a` x86/cpu/topology: Add support for the AMD 0x80000026 leaf And these fixes in x86/urgent: `c064b536a8` x86/cpu/amd: Make the NODEID_MSR union actually work `1b3108f689` x86/cpu/amd: Make the CPUID 0x80000008 parser correct Resolve them. Conflicts: arch/x86/kernel/cpu/topology_amd.c Signed-off-by: Ingo Molnar <mingo@kernel.org>	2024-04-12 12:11:45 +02:00
Thomas Gleixner	7211274fe0	x86/cpu/amd: Move TOPOEXT enablement into the topology parser The topology rework missed that early_init_amd() tries to re-enable the Topology Extensions when the BIOS disabled them. The new parser is invoked before early_init_amd() so the re-enable attempt happens too late. Move it into the AMD specific topology parser code where it belongs. Fixes: `f7fb3b2dd9` ("x86/cpu: Provide an AMD/HYGON specific topology parser") Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/878r1j260l.ffs@tglx	2024-04-12 12:05:54 +02:00
Thomas Gleixner	c064b536a8	x86/cpu/amd: Make the NODEID_MSR union actually work A system with NODEID_MSR was reported to crash during early boot without any output. The reason is that the union which is used for accessing the bitfields in the MSR is written wrongly and the resulting executable code accesses the wrong part of the MSR data. As a consequence a later division by that value results in 0 and that result is used for another division as divisor, which obviously does not work well. The magic world of C, unions and bitfields: union { u64 bita : 3, bitb : 3; u64 all; } x; x.all = foo(); a = x.bita; b = x.bitb; results in the effective executable code of: a = b = x.bita; because bita and bitb are treated as union members and therefore both end up at bit offset 0. Wrapping the bitfield into an anonymous struct: union { struct { u64 bita : 3, bitb : 3; }; u64 all; } x; works like expected. Rework the NODEID_MSR union in exactly that way to cure the problem. Fixes: `f7fb3b2dd9` ("x86/cpu: Provide an AMD/HYGON specific topology parser") Reported-by: "kernelci.org bot" <bot@kernelci.org> Reported-by: Laura Nao <laura.nao@collabora.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Laura Nao <laura.nao@collabora.com> Link: https://lore.kernel.org/r/20240410194311.596282919@linutronix.de Closes: https://lore.kernel.org/all/20240322175210.124416-1-laura.nao@collabora.com/	2024-04-12 12:05:54 +02:00
Thomas Gleixner	1b3108f689	x86/cpu/amd: Make the CPUID 0x80000008 parser correct CPUID 0x80000008 ECX.cpu_nthreads describes the number of threads in the package. The parser uses this value to initialize the SMT domain level. That's wrong because cpu_nthreads does not describe the number of threads per physical core. So this needs to set the CORE domain level and let the later parsers set the SMT shift if available. Preset the SMT domain level with the assumption of one thread per core, which is correct ifrt here are no other CPUID leafs to parse, and propagate cpu_nthreads and the core level APIC bitwidth into the CORE domain. Fixes: `f7fb3b2dd9` ("x86/cpu: Provide an AMD/HYGON specific topology parser") Reported-by: "kernelci.org bot" <bot@kernelci.org> Reported-by: Laura Nao <laura.nao@collabora.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Laura Nao <laura.nao@collabora.com> Link: https://lore.kernel.org/r/20240410194311.535206450@linutronix.de	2024-04-12 12:05:54 +02:00
Josh Poimboeuf	4f511739c5	x86/bugs: Replace CONFIG_SPECTRE_BHI_{ON,OFF} with CONFIG_MITIGATION_SPECTRE_BHI For consistency with the other CONFIG_MITIGATION_* options, replace the CONFIG_SPECTRE_BHI_{ON,OFF} options with a single CONFIG_MITIGATION_SPECTRE_BHI option. [ mingo: Fix ] Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Sean Christopherson <seanjc@google.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Nikolay Borisov <nik.borisov@suse.com> Link: https://lore.kernel.org/r/3833812ea63e7fdbe36bf8b932e63f70d18e2a2a.1712813475.git.jpoimboe@kernel.org	2024-04-12 12:05:54 +02:00
Josh Poimboeuf	36d4fe147c	x86/bugs: Remove CONFIG_BHI_MITIGATION_AUTO and spectre_bhi=auto Unlike most other mitigations' "auto" options, spectre_bhi=auto only mitigates newer systems, which is confusing and not particularly useful. Remove it. Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Nikolay Borisov <nik.borisov@suse.com> Cc: Sean Christopherson <seanjc@google.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/412e9dc87971b622bbbaf64740ebc1f140bff343.1712813475.git.jpoimboe@kernel.org	2024-04-12 12:05:54 +02:00
Uros Bizjak	9109566612	locking/pvqspinlock/x86: Remove redundant CMP after CMPXCHG in __raw_callee_save___pv_queued_spin_unlock() x86 CMPXCHG instruction returns success in the ZF flag. Remove redundant CMP instruction after CMPXCHG that performs the same check. Also update the function comment to mention the modern version of the equivalent C code. Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/20240412083908.282802-1-ubizjak@gmail.com	2024-04-12 11:42:39 +02:00
Paolo Bonzini	1ab157ce57	KVM: SEV: use u64_to_user_ptr throughout Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2024-04-12 04:42:25 -04:00
Sean Christopherson	2325a21ac1	KVM: VMX: Modify NMI and INTR handlers to take intr_info as function argument TDX uses different ABI to get information about VM exit. Pass intr_info to the NMI and INTR handlers instead of pulling it from vcpu_vmx in preparation for sharing the bulk of the handlers with TDX. When the guest TD exits to VMM, RAX holds status and exit reason, RCX holds exit qualification etc rather than the VMCS fields because VMM doesn't have access to the VMCS. The eventual code will be VMX: - get exit reason, intr_info, exit_qualification, and etc from VMCS - call NMI/INTR handlers (common code) TDX: - get exit reason, intr_info, exit_qualification, and etc from guest registers - call NMI/INTR handlers (common code) Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <0396a9ae70d293c9d0b060349dae385a8a4fbcec.1705965635.git.isaku.yamahata@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2024-04-12 04:42:24 -04:00
Paolo Bonzini	5f18c642ff	KVM: VMX: Move out vmx_x86_ops to 'main.c' to dispatch VMX and TDX KVM accesses Virtual Machine Control Structure (VMCS) with VMX instructions to operate on VM. TDX doesn't allow VMM to operate VMCS directly. Instead, TDX has its own data structures, and TDX SEAMCALL APIs for VMM to indirectly operate those data structures. This means we must have a TDX version of kvm_x86_ops. The existing global struct kvm_x86_ops already defines an interface which can be adapted to TDX, but kvm_x86_ops is a system-wide, not per-VM structure. To allow VMX to coexist with TDs, the kvm_x86_ops callbacks will have wrappers "if (tdx) tdx_op() else vmx_op()" to pick VMX or TDX at run time. To split the runtime switch, the VMX implementation, and the TDX implementation, add main.c, and move out the vmx_x86_ops hooks in preparation for adding TDX. Use 'vt' for the naming scheme as a nod to VT-x and as a concatenation of VmxTdx. The eventually converted code will look like this: vmx.c: vmx_op() { ... } VMX initialization tdx.c: tdx_op() { ... } TDX initialization x86_ops.h: vmx_op(); tdx_op(); main.c: static vt_op() { if (tdx) tdx_op() else vmx_op() } static struct kvm_x86_ops vt_x86_ops = { .op = vt_op, initialization functions call both VMX and TDX initialization Opportunistically, fix the name inconsistency from vmx_create_vcpu() and vmx_free_vcpu() to vmx_vcpu_create() and vmx_vcpu_free(). Co-developed-by: Xiaoyao Li <xiaoyao.li@intel.com> Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Reviewed-by: Yuan Yao <yuan.yao@intel.com> Message-Id: <e603c317587f933a9d1bee8728c84e4935849c16.1705965634.git.isaku.yamahata@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2024-04-12 04:42:24 -04:00
Sean Christopherson	e913ef159f	KVM: x86: Split core of hypercall emulation to helper function By necessity, TDX will use a different register ABI for hypercalls. Break out the core functionality so that it may be reused for TDX. Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Message-Id: <5134caa55ac3dec33fb2addb5545b52b3b52db02.1705965635.git.isaku.yamahata@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2024-04-12 04:42:23 -04:00
Paolo Bonzini	f9cecb3c50	Merge branch 'kvm-sev-init2' into HEAD The idea that no parameter would ever be necessary when enabling SEV or SEV-ES for a VM was decidedly optimistic. The first source of variability that was encountered is the desired set of VMSA features, as that affects the measurement of the VM's initial state and cannot be changed arbitrarily by the hypervisor. This series adds all the APIs that are needed to customize the features, with room for future enhancements: - a new /dev/kvm device attribute to retrieve the set of supported features (right now, only debug swap) - a new sub-operation for KVM_MEM_ENCRYPT_OP that can take a struct, replacing the existing KVM_SEV_INIT and KVM_SEV_ES_INIT It then puts the new op to work by including the VMSA features as a field of the The existing KVM_SEV_INIT and KVM_SEV_ES_INIT use the full set of supported VMSA features for backwards compatibility; but I am considering also making them use zero as the feature mask, and will gladly adjust the patches if so requested. In order to avoid creating two new KVM_MEM_ENCRYPT_OPs, I decided that I could as well make SEV and SEV-ES use VM types. This allows SEV-SNP to reuse the KVM_SEV_INIT2 ioctl. And while at it, KVM_SEV_INIT2 also includes two bugfixes. First of all, SEV-ES VM, when created with the new VM type instead of KVM_SEV_ES_INIT, reject KVM_GET_REGS/KVM_SET_REGS and friends on the vCPU file descriptor once the VMSA has been encrypted... which is how the API should have always behaved. Second, they also synchronize the FPU and AVX state. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2024-04-12 04:41:34 -04:00
Eric Biggers	751fb2528c	crypto: x86/aes-xts - make non-AVX implementation use new glue code Make the non-AVX implementation of AES-XTS (xts-aes-aesni) use the new glue code that was introduced for the AVX implementations of AES-XTS. This reduces code size, and it improves the performance of xts-aes-aesni due to the optimization for messages that don't span page boundaries. This required moving the new glue functions higher up in the file and allowing the IV encryption function to be specified by the caller. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2024-04-12 15:07:53 +08:00
Eric Biggers	6a24fdfe1e	crypto: x86/sha512-avx2 - add missing vzeroupper Since sha512_transform_rorx() uses ymm registers, execute vzeroupper before returning from it. This is necessary to avoid reducing the performance of SSE code. Fixes: `e01d69cb01` ("crypto: sha512 - Optimized SHA512 x86_64 assembly routine using AVX instructions.") Signed-off-by: Eric Biggers <ebiggers@google.com> Acked-by: Tim Chen <tim.c.chen@linux.intel.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2024-04-12 15:07:52 +08:00
Eric Biggers	57ce8a4e16	crypto: x86/sha256-avx2 - add missing vzeroupper Since sha256_transform_rorx() uses ymm registers, execute vzeroupper before returning from it. This is necessary to avoid reducing the performance of SSE code. Fixes: `d34a460092` ("crypto: sha256 - Optimized sha256 x86_64 routine using AVX2's RORX instructions") Signed-off-by: Eric Biggers <ebiggers@google.com> Acked-by: Tim Chen <tim.c.chen@linux.intel.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2024-04-12 15:07:52 +08:00
Eric Biggers	4ad096cca9	crypto: x86/nh-avx2 - add missing vzeroupper Since nh_avx2() uses ymm registers, execute vzeroupper before returning from it. This is necessary to avoid reducing the performance of SSE code. Fixes: `0f961f9f67` ("crypto: x86/nhpoly1305 - add AVX2 accelerated NHPoly1305") Signed-off-by: Eric Biggers <ebiggers@google.com> Acked-by: Tim Chen <tim.c.chen@linux.intel.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2024-04-12 15:07:52 +08:00
Linus Torvalds	52e5070f60	hyperv-fixes for v6.9-rc4 -----BEGIN PGP SIGNATURE----- iQFHBAABCgAxFiEEIbPD0id6easf0xsudhRwX5BBoF4FAmYYYPkTHHdlaS5saXVA a2VybmVsLm9yZwAKCRB2FHBfkEGgXnxhB/4/8c7lFT53VbujFmVA5sNvpP5Ji5Xg ERhVID7tzDyaVRPr+tpPJIW0Oj/t34SH9seoTYCBCM/UABWe/Gxceg2JaoOzIx+l LHi73T4BBaqExiXbCCFj8N7gLO5P4Xz6ZZRgwHws1KmMXsiWYmYsbv36eSv9x6qK +z/n6p9/ubKFNj2/vsvfiGmY0XHayD3NM4Y4toMbYE/tuRT8uZ7D5sqWdRf+UhW/ goRDA5qppeSfuaQu2LNVoz1e6wRmeJFv8OHgaPvQqAjTRLzPwwss28HICmKc8gh3 HDDUUJCHSs1XItSGDFip6rIFso5X/ZHO0d6pV75hOKCisd7lV0qH6NIZ =k62H -----END PGP SIGNATURE----- Merge tag 'hyperv-fixes-signed-20240411' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux Pull hyperv fixes from Wei Liu: - Some cosmetic changes (Erni Sri Satya Vennela, Li Zhijian) - Introduce hv_numa_node_to_pxm_info() (Nuno Das Neves) - Fix KVP daemon to handle IPv4 and IPv6 combination for keyfile format (Shradha Gupta) - Avoid freeing decrypted memory in a confidential VM (Rick Edgecombe and Michael Kelley) * tag 'hyperv-fixes-signed-20240411' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux: Drivers: hv: vmbus: Don't free ring buffers that couldn't be re-encrypted uio_hv_generic: Don't free decrypted memory hv_netvsc: Don't free decrypted memory Drivers: hv: vmbus: Track decrypted status in vmbus_gpadl Drivers: hv: vmbus: Leak pages if set_memory_encrypted() fails hv/hv_kvp_daemon: Handle IPv4 and Ipv6 combination for keyfile format hv: vmbus: Convert sprintf() family to sysfs_emit() family mshyperv: Introduce hv_numa_node_to_pxm_info() x86/hyperv: Cosmetic changes for hv_apic.c	2024-04-11 16:23:56 -07:00
Jakub Kicinski	94426ed213	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Cross-merge networking fixes after downstream PR. Conflicts: net/unix/garbage.c `47d8ac011f` ("af_unix: Fix garbage collector racing against connect()") `4090fa373f` ("af_unix: Replace garbage collection algorithm.") Adjacent changes: drivers/net/ethernet/broadcom/bnxt/bnxt.c `faa12ca245` ("bnxt_en: Reset PTP tx_avail after possible firmware reset") `b3d0083caf` ("bnxt_en: Support RSS contexts in ethtool .{get\|set}_rxfh()") drivers/net/ethernet/broadcom/bnxt/bnxt_ulp.c `7ac10c7d72` ("bnxt_en: Fix possible memory leak in bnxt_rdma_aux_device_init()") `194fad5b27` ("bnxt_en: Refactor bnxt_rdma_aux_device_init/uninit functions") drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c `958f56e483` ("net/mlx5e: Un-expose functions in en.h") `49e6c93870` ("net/mlx5e: RSS, Block XOR hash with over 128 channels") Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-04-11 14:23:47 -07:00
David Matlack	b1a8d2b02b	KVM: x86/mmu: Fix and clarify comments about clearing D-bit vs. write-protecting Drop the "If AD bits are enabled/disabled" verbiage from the comments above kvm_tdp_mmu_clear_dirty_{slot,pt_masked}() since TDP MMU SPTEs may need to be write-protected even when A/D bits are enabled. i.e. These comments aren't technically correct. No functional change intended. Signed-off-by: David Matlack <dmatlack@google.com> Link: https://lore.kernel.org/r/20240315230541.1635322-4-dmatlack@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-04-11 12:58:51 -07:00
David Matlack	feac19aa4c	KVM: x86/mmu: Remove function comments above clear_dirty_{gfn_range,pt_masked}() Drop the comments above clear_dirty_gfn_range() and clear_dirty_pt_masked(), since each is word-for-word identical to the comment above their parent function. Leave the comment on the parent functions since they are APIs called by the KVM/x86 MMU. No functional change intended. Signed-off-by: David Matlack <dmatlack@google.com> Link: https://lore.kernel.org/r/20240315230541.1635322-3-dmatlack@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-04-11 12:58:50 -07:00
David Matlack	2673dfb591	KVM: x86/mmu: Write-protect L2 SPTEs in TDP MMU when clearing dirty status Check kvm_mmu_page_ad_need_write_protect() when deciding whether to write-protect or clear D-bits on TDP MMU SPTEs, so that the TDP MMU accounts for any role-specific reasons for disabling D-bit dirty logging. Specifically, TDP MMU SPTEs must be write-protected when the TDP MMU is being used to run an L2 (i.e. L1 has disabled EPT) and PML is enabled. KVM always disables PML when running L2, even when L1 and L2 GPAs are in the some domain, so failing to write-protect TDP MMU SPTEs will cause writes made by L2 to not be reflected in the dirty log. Reported-by: syzbot+900d58a45dcaab9e4821@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=900d58a45dcaab9e4821 Fixes: `5982a53926` ("KVM: x86/mmu: Use kvm_ad_enabled() to determine if TDP MMU SPTEs need wrprot") Cc: stable@vger.kernel.org Cc: Vipin Sharma <vipinsh@google.com> Cc: Sean Christopherson <seanjc@google.com> Signed-off-by: David Matlack <dmatlack@google.com> Link: https://lore.kernel.org/r/20240315230541.1635322-2-dmatlack@google.com [sean: massage shortlog and changelog, tweak ternary op formatting] Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-04-11 12:58:50 -07:00
Sean Christopherson	1bc26cb909	KVM: x86/mmu: Precisely invalidate MMU root_role during CPUID update Set kvm_mmu_page_role.invalid to mark the various MMU root_roles invalid during CPUID update in order to force a refresh, instead of zeroing out the entire role. This fixes a bug where kvm_mmu_free_roots() incorrectly thinks a root is indirect, i.e. not a TDP MMU, due to "direct" being zeroed, which in turn causes KVM to take mmu_lock for write instead of read. Note, paving over the entire role was largely unintentional, commit `7a458f0e1b` ("KVM: x86/mmu: remove extended bits from mmu_role, rename field") simply missed that "invalid" could be set. Fixes: `576a15de8d` ("KVM: x86/mmu: Free TDP MMU roots while holding mmy_lock for read") Reported-by: syzbot+dc308fcfcd53f987de73@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/0000000000009b38080614c49bdb@google.com Cc: Phi Nguyen <phind.uet@gmail.com> Link: https://lore.kernel.org/r/20240408231115.1387279-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-04-11 12:58:49 -07:00
Sean Christopherson	bb9dc85908	KVM: VMX: Disable LBR virtualization if the CPU doesn't support LBR callstacks Disable LBR virtualization if the CPU doesn't support callstacks, which were introduced in HSW (see commit `e9d7f7cd97` ("perf/x86/intel: Add basic Haswell LBR call stack support"), as KVM unconditionally configures the perf LBR event with PERF_SAMPLE_BRANCH_CALL_STACK, i.e. LBR virtualization always fails on pre-HSW CPUs. Simply disable LBR support on such CPUs, as it has never worked, i.e. there is no risk of breaking an existing setup, and figuring out a way to performantly context switch LBRs on old CPUs is not worth the effort. Fixes: `be635e34c2` ("KVM: vmx/pmu: Expose LBR_FMT in the MSR_IA32_PERF_CAPABILITIES") Cc: Mingwei Zhang <mizhang@google.com> Cc: Jim Mattson <jmattson@google.com> Tested-by: Mingwei Zhang <mizhang@google.com> Link: https://lore.kernel.org/r/20240307011344.835640-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-04-11 12:58:48 -07:00
Sean Christopherson	0d0b608650	perf/x86/intel: Expose existence of callback support to KVM Add a "has_callstack" field to the x86_pmu_lbr structure used to pass information to KVM, and set it accordingly in x86_perf_get_lbr(). KVM will use has_callstack to avoid trying to create perf LBR events with PERF_SAMPLE_BRANCH_CALL_STACK on CPUs that don't support callstacks. Reviewed-by: Mingwei Zhang <mizhang@google.com> Link: https://lore.kernel.org/r/20240307011344.835640-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-04-11 12:58:47 -07:00
Sean Christopherson	447112d7ed	KVM: VMX: Snapshot LBR capabilities during module initialization Snapshot VMX's LBR capabilities once during module initialization instead of calling into perf every time a vCPU reconfigures its vPMU. This will allow massaging the LBR capabilities, e.g. if the CPU doesn't support callstacks, without having to remember to update multiple locations. Opportunistically tag vmx_get_perf_capabilities() with __init, as it's only called from vmx_set_cpu_caps(). Reviewed-by: Mingwei Zhang <mizhang@google.com> Link: https://lore.kernel.org/r/20240307011344.835640-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-04-11 12:58:46 -07:00
Paolo Bonzini	f3b65bbaed	KVM: delete .change_pte MMU notifier callback The .change_pte() MMU notifier callback was intended as an optimization. The original point of it was that KSM could tell KVM to flip its secondary PTE to a new location without having to first zap it. At the time there was also an .invalidate_page() callback; both of them were not bracketed by calls to mmu_notifier_invalidate_range_{start,end}(), and .invalidate_page() also doubled as a fallback implementation of .change_pte(). Later on, however, both callbacks were changed to occur within an invalidate_range_start/end() block. In the case of .change_pte(), commit `6bdb913f0a` ("mm: wrap calls to set_pte_at_notify with invalidate_range_start and invalidate_range_end", 2012-10-09) did so to remove the fallback from .invalidate_page() to .change_pte() and allow sleepable .invalidate_page() hooks. This however made KVM's usage of the .change_pte() callback completely moot, because KVM unmaps the sPTEs during .invalidate_range_start() and therefore .change_pte() has no hope of finding a sPTE to change. Drop the generic KVM code that dispatches to kvm_set_spte_gfn(), as well as all the architecture specific implementations. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Acked-by: Anup Patel <anup@brainfault.org> Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc) Reviewed-by: Bibo Mao <maobibo@loongson.cn> Message-ID: <20240405115815.3226315-2-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2024-04-11 13:18:27 -04:00
Paolo Bonzini	4dd5ecacb9	KVM: SEV: allow SEV-ES DebugSwap again The DebugSwap feature of SEV-ES provides a way for confidential guests to use data breakpoints. Its status is record in VMSA, and therefore attestation signatures depend on whether it is enabled or not. In order to avoid invalidating the signatures depending on the host machine, it was disabled by default (see commit `5abf6dceb0`, "SEV: disable SEV-ES DebugSwap by default", 2024-03-09). However, we now have a new API to create SEV VMs that allows enabling DebugSwap based on what the user tells KVM to do, and we also changed the legacy KVM_SEV_ES_INIT API to never enable DebugSwap. It is therefore possible to re-enable the feature without breaking compatibility with kernels that pre-date the introduction of DebugSwap, so go ahead. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20240404121327.3107131-14-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2024-04-11 13:08:26 -04:00
Paolo Bonzini	4f5defae70	KVM: SEV: introduce KVM_SEV_INIT2 operation The idea that no parameter would ever be necessary when enabling SEV or SEV-ES for a VM was decidedly optimistic. In fact, in some sense it's already a parameter whether SEV or SEV-ES is desired. Another possible source of variability is the desired set of VMSA features, as that affects the measurement of the VM's initial state and cannot be changed arbitrarily by the hypervisor. Create a new sub-operation for KVM_MEMORY_ENCRYPT_OP that can take a struct, and put the new op to work by including the VMSA features as a field of the struct. The existing KVM_SEV_INIT and KVM_SEV_ES_INIT use the full set of supported VMSA features for backwards compatibility. The struct also includes the usual bells and whistles for future extensibility: a flags field that must be zero for now, and some padding at the end. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20240404121327.3107131-13-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2024-04-11 13:08:25 -04:00
Paolo Bonzini	eb4441864e	KVM: SEV: sync FPU and AVX state at LAUNCH_UPDATE_VMSA time SEV-ES allows passing custom contents for x87, SSE and AVX state into the VMSA. Allow userspace to do that with the usual KVM_SET_XSAVE API and only mark FPU contents as confidential after it has been copied and encrypted into the VMSA. Since the XSAVE state for AVX is the first, it does not need the compacted-state handling of get_xsave_addr(). However, there are other parts of XSAVE state in the VMSA that currently are not handled, and the validation logic of get_xsave_addr() is pointless to duplicate in KVM, so move get_xsave_addr() to public FPU API; it is really just a facility to operate on XSAVE state and does not expose any internal details of arch/x86/kernel/fpu. Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20240404121327.3107131-12-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2024-04-11 13:08:25 -04:00
Paolo Bonzini	26c44aa9e0	KVM: SEV: define VM types for SEV and SEV-ES Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20240404121327.3107131-11-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2024-04-11 13:08:25 -04:00
Paolo Bonzini	4ebb105e6c	KVM: SEV: introduce to_kvm_sev_info Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20240404121327.3107131-10-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2024-04-11 13:08:24 -04:00
Paolo Bonzini	2a955c4db1	KVM: x86: Add supported_vm_types to kvm_caps This simplifies the implementation of KVM_CHECK_EXTENSION(KVM_CAP_VM_TYPES), and also allows the vendor module to specify which VM types are supported. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20240404121327.3107131-9-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2024-04-11 13:08:24 -04:00
Paolo Bonzini	517987e3fb	KVM: x86: add fields to struct kvm_arch for CoCo features Some VM types have characteristics in common; in fact, the only use of VM types right now is kvm_arch_has_private_mem and it assumes that _all_ nonzero VM types have private memory. We will soon introduce a VM type for SEV and SEV-ES VMs, and at that point we will have two special characteristics of confidential VMs that depend on the VM type: not just if memory is private, but also whether guest state is protected. For the latter we have kvm->arch.guest_state_protected, which is only set on a fully initialized VM. For VM types with protected guest state, we can actually fix a problem in the SEV-ES implementation, where ioctls to set registers do not cause an error even if the VM has been initialized and the guest state encrypted. Make sure that when using VM types that will become an error. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20240209183743.22030-7-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com> Message-ID: <20240404121327.3107131-8-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2024-04-11 13:08:23 -04:00
Paolo Bonzini	605bbdc12b	KVM: SEV: store VMSA features in kvm_sev_info Right now, the set of features that are stored in the VMSA upon initialization is fixed and depends on the module parameters for kvm-amd.ko. However, the hypervisor cannot really change it at will because the feature word has to match between the hypervisor and whatever computes a measurement of the VMSA for attestation purposes. Add a field to kvm_sev_info that holds the set of features to be stored in the VMSA; and query it instead of referring to the module parameters. Because KVM_SEV_INIT and KVM_SEV_ES_INIT accept no parameters, this does not yet introduce any functional change, but it paves the way for an API that allows customization of the features per-VM. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20240209183743.22030-6-pbonzini@redhat.com> Reviewed-by: Michael Roth <michael.roth@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20240404121327.3107131-7-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2024-04-11 13:08:23 -04:00
Paolo Bonzini	ac5c48027b	KVM: SEV: publish supported VMSA features Compute the set of features to be stored in the VMSA when KVM is initialized; move it from there into kvm_sev_info when SEV is initialized, and then into the initial VMSA. The new variable can then be used to return the set of supported features to userspace, via the KVM_GET_DEVICE_ATTR ioctl. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com> Message-ID: <20240404121327.3107131-6-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2024-04-11 13:08:22 -04:00
Paolo Bonzini	546d714b08	KVM: introduce new vendor op for KVM_GET_DEVICE_ATTR Allow vendor modules to provide their own attributes on /dev/kvm. To avoid proliferation of vendor ops, implement KVM_HAS_DEVICE_ATTR and KVM_GET_DEVICE_ATTR in terms of the same function. You're not supposed to use KVM_GET_DEVICE_ATTR to do complicated computations, especially on /dev/kvm. Reviewed-by: Michael Roth <michael.roth@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com> Message-ID: <20240404121327.3107131-5-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2024-04-11 13:08:22 -04:00
Paolo Bonzini	8d2aec3b2d	KVM: x86: use u64_to_user_ptr() There is no danger to the kernel if 32-bit userspace provides a 64-bit value that has the high bits set, but for whatever reason happens to resolve to an address that has something mapped there. KVM uses the checked version of get_user() and put_user(), so any faults are caught properly. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20240404121327.3107131-4-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2024-04-11 13:08:22 -04:00
Paolo Bonzini	0d7bf5e5b0	KVM: SVM: Compile sev.c if and only if CONFIG_KVM_AMD_SEV=y Stop compiling sev.c when CONFIG_KVM_AMD_SEV=n, as the number of #ifdefs in sev.c is getting ridiculous, and having #ifdefs inside of SEV helpers is quite confusing. To minimize #ifdefs in code flows, #ifdef away only the kvm_x86_ops hooks and the #VMGEXIT handler. Stubs are also restricted to functions that check sev_enabled and to the destruction functions sev_free_cpu() and sev_vm_destroy(), where the style of their callers is to leave checks to the callers. Most call sites instead rely on dead code elimination to take care of functions that are guarded with sev_guest() or sev_es_guest(). Signed-off-by: Sean Christopherson <seanjc@google.com> Co-developed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20240404121327.3107131-3-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2024-04-11 13:08:21 -04:00
Sean Christopherson	1ff3c89032	KVM: SVM: Invert handling of SEV and SEV_ES feature flags Leave SEV and SEV_ES '0' in kvm_cpu_caps by default, and instead set them in sev_set_cpu_caps() if SEV and SEV-ES support are fully enabled. Aside from the fact that sev_set_cpu_caps() is wildly misleading when it clears capabilities, this will allow compiling out sev.c without falsely advertising SEV/SEV-ES support in KVM_GET_SUPPORTED_CPUID. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Michael Roth <michael.roth@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20240404121327.3107131-2-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2024-04-11 13:08:21 -04:00
Sandipan Das	49ff3b4aec	KVM: x86/pmu: Do not mask LVTPC when handling a PMI on AMD platforms On AMD and Hygon platforms, the local APIC does not automatically set the mask bit of the LVTPC register when handling a PMI and there is no need to clear it in the kernel's PMI handler. For guests, the mask bit is currently set by kvm_apic_local_deliver() and unless it is cleared by the guest kernel's PMI handler, PMIs stop arriving and break use-cases like sampling with perf record. This does not affect non-PerfMonV2 guests because PMIs are handled in the guest kernel by x86_pmu_handle_irq() which always clears the LVTPC mask bit irrespective of the vendor. Before: $ perf record -e cycles:u true [ perf record: Woken up 1 times to write data ] [ perf record: Captured and wrote 0.001 MB perf.data (1 samples) ] After: $ perf record -e cycles:u true [ perf record: Woken up 1 times to write data ] [ perf record: Captured and wrote 0.002 MB perf.data (19 samples) ] Fixes: `a16eb25b09` ("KVM: x86: Mask LVTPC when handling a PMI") Cc: stable@vger.kernel.org Signed-off-by: Sandipan Das <sandipan.das@amd.com> Reviewed-by: Jim Mattson <jmattson@google.com> [sean: use is_intel_compatible instead of !is_amd_or_hygon()] Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20240405235603.1173076-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2024-04-11 12:58:59 -04:00
Sean Christopherson	fd706c9b16	KVM: x86: Snapshot if a vCPU's vendor model is AMD vs. Intel compatible Add kvm_vcpu_arch.is_amd_compatible to cache if a vCPU's vendor model is compatible with AMD, i.e. if the vCPU vendor is AMD or Hygon, along with helpers to check if a vCPU is compatible AMD vs. Intel. To handle Intel vs. AMD behavior related to masking the LVTPC entry, KVM will need to check for vendor compatibility on every PMI injection, i.e. querying for AMD will soon be a moderately hot path. Note! This subtly (or maybe not-so-subtly) makes "Intel compatible" KVM's default behavior, both if userspace omits (or never sets) CPUID 0x0 and if userspace sets a completely unknown vendor. One could argue that KVM should treat such vCPUs as not being compatible with Intel or AMD, but that would add useless complexity to KVM. KVM needs to do something in the face of vendor specific behavior, and so unless KVM conjured up a magic third option, choosing to treat unknown vendors as neither Intel nor AMD means that checks on AMD compatibility would yield Intel behavior, and checks for Intel compatibility would yield AMD behavior. And that's far worse as it would effectively yield random behavior depending on whether KVM checked for AMD vs. Intel vs. !AMD vs. !Intel. And practically speaking, all x86 CPUs follow either Intel or AMD architecture, i.e. "supporting" an unknown third architecture adds no value. Deliberately don't convert any of the existing guest_cpuid_is_intel() checks, as the Intel side of things is messier due to some flows explicitly checking for exactly vendor==Intel, versus some flows assuming anything that isn't "AMD compatible" gets Intel behavior. The Intel code will be cleaned up in the future. Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20240405235603.1173076-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2024-04-11 12:58:56 -04:00
Ard Biesheuvel	a0025f587c	x86/boot/64: Clear most of CR4 in startup_64(), except PAE, MCE and LA57 The early 64-bit boot code must be entered with a 1:1 mapping of the bootable image, but it cannot operate without a 1:1 mapping of all the assets in memory that it accesses, and therefore, it creates such mappings for all known assets upfront, and additional ones on demand when a page fault happens on a memory address. These mappings are created with the global bit G set, as the flags used to create page table descriptors are based on __PAGE_KERNEL_LARGE_EXEC defined by the core kernel, even though the context where these mappings are used is very different. This means that the TLB maintenance carried out by the decompressor is not sufficient if it is entered with CR4.PGE enabled, which has been observed to happen with the stage0 bootloader of project Oak. While this is a dubious practice if no global mappings are being used to begin with, the decompressor is clearly at fault here for creating global mappings and not performing the appropriate TLB maintenance. Since commit: `f97b67a773` ("x86/decompressor: Only call the trampoline when changing paging levels") CR4 is no longer modified by the decompressor if no change in the number of paging levels is needed. Before that, CR4 would always be set to a consistent value with PGE cleared. So let's reinstate a simplified version of the original logic to put CR4 into a known state, and preserve the PAE, MCE and LA57 bits, none of which can be modified freely at this point (PAE and LA57 cannot be changed while running in long mode, and MCE cannot be cleared when running under some hypervisors). This effectively clears PGE and works around the project Oak bug. Fixes: `f97b67a773` ("x86/decompressor: Only call the trampoline when ...") Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Link: https://lore.kernel.org/r/20240410151354.506098-2-ardb+git@google.com	2024-04-11 15:37:17 +02:00
Ingo Molnar	5b9b2e6baf	Linux 6.9-rc3 -----BEGIN PGP SIGNATURE----- iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmYTAJYeHHRvcnZhbGRz QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiG2bEH/jOBXd0ZCz+s9+F4 TbSvDEE8UjitQdEJ5WSBY9CEvFI8OuVQr23gYPUn+gfgqLX0Vsp8HfxL6bBP5Tj6 DSzAkwF/mvIfa6VLFmO1GmvyhYtmWkmbM825tNqKHSNTBc9cCLH3H+780wNtTMwQ VkB8O3KS/wZBGKSbFfiXW+fT3SkWIMLtdBAaox+vcxHXpiluXxSbxANRD5kTbdG0 UAW9S4+3A0jNk/KeXEvJDqkf7C3ASsjtNPbK+gFDfOXxdNYFTC2IUf93rL61VB4s C2rtUklcLE8gFDtvkQ8JtGWmDj4pWPEDIyhICKlzP/aKCjXcNzLaoM0GJQYJS+PN aNevw24= =318J -----END PGP SIGNATURE----- Merge tag 'v6.9-rc3' into x86/boot, to pick up fixes before queueing up more changes Signed-off-by: Ingo Molnar <mingo@kernel.org>	2024-04-11 15:36:23 +02:00
Josh Poimboeuf	5f882f3b0a	x86/bugs: Clarify that syscall hardening isn't a BHI mitigation While syscall hardening helps prevent some BHI attacks, there's still other low-hanging fruit remaining. Don't classify it as a mitigation and make it clear that the system may still be vulnerable if it doesn't have a HW or SW mitigation enabled. Fixes: `ec9404e40e` ("x86/bhi: Add BHI mitigation knob") Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/r/b5951dae3fdee7f1520d5136a27be3bdfe95f88b.1712813475.git.jpoimboe@kernel.org	2024-04-11 10:30:33 +02:00
Josh Poimboeuf	1cea8a280d	x86/bugs: Fix BHI handling of RRSBA The ARCH_CAP_RRSBA check isn't correct: RRSBA may have already been disabled by the Spectre v2 mitigation (or can otherwise be disabled by the BHI mitigation itself if needed). In that case retpolines are fine. Fixes: `ec9404e40e` ("x86/bhi: Add BHI mitigation knob") Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/r/6f56f13da34a0834b69163467449be7f58f253dc.1712813475.git.jpoimboe@kernel.org	2024-04-11 10:30:33 +02:00
Ingo Molnar	d0485730d2	x86/bugs: Rename various 'ia32_cap' variables to 'x86_arch_cap_msr' So we are using the 'ia32_cap' value in a number of places, which got its name from MSR_IA32_ARCH_CAPABILITIES MSR register. But there's very little 'IA32' about it - this isn't 32-bit only code, nor does it originate from there, it's just a historic quirk that many Intel MSR names are prefixed with IA32_. This is already clear from the helper method around the MSR: x86_read_arch_cap_msr(), which doesn't have the IA32 prefix. So rename 'ia32_cap' to 'x86_arch_cap_msr' to be consistent with its role and with the naming of the helper function. Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Nikolay Borisov <nik.borisov@suse.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/r/9592a18a814368e75f8f4b9d74d3883aa4fd1eaf.1712813475.git.jpoimboe@kernel.org	2024-04-11 10:30:33 +02:00
Josh Poimboeuf	cb2db5bb04	x86/bugs: Cache the value of MSR_IA32_ARCH_CAPABILITIES There's no need to keep reading MSR_IA32_ARCH_CAPABILITIES over and over. It's even read in the BHI sysfs function which is a big no-no. Just read it once and cache it. Fixes: `ec9404e40e` ("x86/bhi: Add BHI mitigation knob") Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Nikolay Borisov <nik.borisov@suse.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/r/9592a18a814368e75f8f4b9d74d3883aa4fd1eaf.1712813475.git.jpoimboe@kernel.org	2024-04-11 10:30:33 +02:00
Thomas Gleixner	a9025cd1c6	x86/topology: Don't update cpu_possible_map in topo_set_cpuids() topo_set_cpuids() updates cpu_present_map and cpu_possible map. It is invoked during enumeration and "physical hotplug" operations. In the latter case this results in a kernel crash because cpu_possible_map is marked read only after init completes. There is no reason to update cpu_possible_map in that function. During enumeration cpu_possible_map is not relevant and gets fully initialized after enumeration completed. On "physical hotplug" the bit is already set because the kernel allows only CPUs to be plugged which have been enumerated and associated to a CPU number during early boot. Remove the bogus update of cpu_possible_map. Fixes: `0e53e7b656` ("x86/cpu/topology: Sanitize the APIC admission logic") Reported-by: Jonathan Cameron <Jonathan.Cameron@Huawei.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/87ttkc6kwx.ffs@tglx	2024-04-10 15:31:38 +02:00
Uros Bizjak	21689e4bfb	locking/atomic/x86: Define arch_atomic_sub() family using arch_atomic_add() functions There is no need to implement arch_atomic_sub() family of inline functions, corresponding macros can be directly implemented using arch_atomic_add() inlines with negated argument. No functional changes intended. Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/20240410062957.322614-4-ubizjak@gmail.com	2024-04-10 15:04:55 +02:00
Uros Bizjak	95ece48165	locking/atomic/x86: Rewrite x86_32 arch_atomic64_{,fetch}_{and,or,xor}() functions Rewrite x86_32 arch_atomic64_{,fetch}_{and,or,xor}() functions to use arch_atomic64_try_cmpxchg(). This implementation avoids one extra trip through the CMPXCHG loop. The value preload before the cmpxchg loop does not need to be atomic. Use arch_atomic64_read_nonatomic(v) to load the value from atomic_t location in a non-atomic way. The generated code improves from: 1917d5: 31 c9 xor %ecx,%ecx 1917d7: 31 db xor %ebx,%ebx 1917d9: 89 4c 24 3c mov %ecx,0x3c(%esp) 1917dd: 8b 74 24 24 mov 0x24(%esp),%esi 1917e1: 89 c8 mov %ecx,%eax 1917e3: 89 5c 24 34 mov %ebx,0x34(%esp) 1917e7: 8b 7c 24 28 mov 0x28(%esp),%edi 1917eb: 21 ce and %ecx,%esi 1917ed: 89 74 24 4c mov %esi,0x4c(%esp) 1917f1: 21 df and %ebx,%edi 1917f3: 89 de mov %ebx,%esi 1917f5: 89 7c 24 50 mov %edi,0x50(%esp) 1917f9: 8b 54 24 4c mov 0x4c(%esp),%edx 1917fd: 8b 7c 24 2c mov 0x2c(%esp),%edi 191801: 8b 4c 24 50 mov 0x50(%esp),%ecx 191805: 89 d3 mov %edx,%ebx 191807: 89 f2 mov %esi,%edx 191809: f0 0f c7 0f lock cmpxchg8b (%edi) 19180d: 89 c1 mov %eax,%ecx 19180f: 8b 74 24 34 mov 0x34(%esp),%esi 191813: 89 d3 mov %edx,%ebx 191815: 89 44 24 4c mov %eax,0x4c(%esp) 191819: 8b 44 24 3c mov 0x3c(%esp),%eax 19181d: 89 df mov %ebx,%edi 19181f: 89 54 24 44 mov %edx,0x44(%esp) 191823: 89 ca mov %ecx,%edx 191825: 31 de xor %ebx,%esi 191827: 31 c8 xor %ecx,%eax 191829: 09 f0 or %esi,%eax 19182b: 75 ac jne 1917d9 <...> to: 1912ba: 8b 06 mov (%esi),%eax 1912bc: 8b 56 04 mov 0x4(%esi),%edx 1912bf: 89 44 24 3c mov %eax,0x3c(%esp) 1912c3: 89 c1 mov %eax,%ecx 1912c5: 23 4c 24 34 and 0x34(%esp),%ecx 1912c9: 89 d3 mov %edx,%ebx 1912cb: 23 5c 24 38 and 0x38(%esp),%ebx 1912cf: 89 54 24 40 mov %edx,0x40(%esp) 1912d3: 89 4c 24 2c mov %ecx,0x2c(%esp) 1912d7: 89 5c 24 30 mov %ebx,0x30(%esp) 1912db: 8b 5c 24 2c mov 0x2c(%esp),%ebx 1912df: 8b 4c 24 30 mov 0x30(%esp),%ecx 1912e3: f0 0f c7 0e lock cmpxchg8b (%esi) 1912e7: 0f 85 f3 02 00 00 jne 1915e0 <...> Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/20240410062957.322614-3-ubizjak@gmail.com	2024-04-10 15:04:55 +02:00
Uros Bizjak	e73c4e34a0	locking/atomic/x86: Introduce arch_atomic64_read_nonatomic() to x86_32 Introduce arch_atomic64_read_nonatomic() for 32-bit targets to load the value from atomic64_t location in a non-atomic way. This function is intended to be used in cases where a subsequent atomic operation will handle the torn value, and can be used to prime the first iteration of unconditional try_cmpxchg() loops. Suggested-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/20240410062957.322614-2-ubizjak@gmail.com	2024-04-10 15:04:54 +02:00
Uros Bizjak	276b893049	locking/atomic/x86: Introduce arch_atomic64_try_cmpxchg() to x86_32 Introduce arch_atomic64_try_cmpxchg() for 32-bit targets to use optimized target specific implementation instead of a generic one. This implementation eliminates dual-word compare after cmpxchg8b instruction and improves generated asm code from: 2273: f0 0f c7 0f lock cmpxchg8b (%edi) 2277: 8b 74 24 2c mov 0x2c(%esp),%esi 227b: 89 d3 mov %edx,%ebx 227d: 89 c2 mov %eax,%edx 227f: 89 5c 24 10 mov %ebx,0x10(%esp) 2283: 8b 7c 24 30 mov 0x30(%esp),%edi 2287: 89 44 24 1c mov %eax,0x1c(%esp) 228b: 31 f2 xor %esi,%edx 228d: 89 d0 mov %edx,%eax 228f: 89 da mov %ebx,%edx 2291: 31 fa xor %edi,%edx 2293: 09 d0 or %edx,%eax 2295: 0f 85 a5 00 00 00 jne 2340 <...> to: 2270: f0 0f c7 0f lock cmpxchg8b (%edi) 2274: 0f 85 a6 00 00 00 jne 2320 <...> Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/20240410062957.322614-1-ubizjak@gmail.com	2024-04-10 15:04:54 +02:00
Uwe Kleine-König	801549ed6a	x86/platform/olpc-xo1-sci: Convert to platform remove callback returning void The .remove() callback for a platform driver returns an int which makes many driver authors wrongly assume it's possible to do error handling by returning an error code. However the value returned is ignored (apart from emitting a warning) and this typically results in resource leaks. To improve here there is a quest to make the remove callback return void. In the first step of this quest all drivers are converted to .remove_new(), which already returns void. Eventually after all drivers are converted, .remove_new() will be renamed to .remove(). Trivially convert this driver from always returning zero in the remove callback to the void returning variant. Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/63c9d1e6b07296916e4218c63f59a2dd6c6b6b16.1712732665.git.u.kleine-koenig@pengutronix.de	2024-04-10 14:59:31 +02:00
Uwe Kleine-König	c741ccfdc0	x86/platform/olpc-x01-pm: Convert to platform remove callback returning void The .remove() callback for a platform driver returns an int which makes many driver authors wrongly assume it's possible to do error handling by returning an error code. However the value returned is ignored (apart from emitting a warning) and this typically results in resource leaks. To improve here there is a quest to make the remove callback return void. In the first step of this quest all drivers are converted to .remove_new(), which already returns void. Eventually after all drivers are converted, .remove_new() will be renamed to .remove(). Trivially convert this driver from always returning zero in the remove callback to the void returning variant. Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/c7d669d8b0c994c77fb4c3bc7bec78aeb8659c74.1712732665.git.u.kleine-koenig@pengutronix.de	2024-04-10 14:59:31 +02:00
Uwe Kleine-König	bcc0403eb4	x86/platform/iris: Convert to platform remove callback returning void The .remove() callback for a platform driver returns an int which makes many driver authors wrongly assume it's possible to do error handling by returning an error code. However the value returned is ignored (apart from emitting a warning) and this typically results in resource leaks. To improve here there is a quest to make the remove callback return void. In the first step of this quest all drivers are converted to .remove_new(), which already returns void. Eventually after all drivers are converted, .remove_new() will be renamed to .remove(). Trivially convert this driver from always returning zero in the remove callback to the void returning variant. Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/adb9b0aca77d7aa8e54d0a664a4ee290834a60a1.1712732665.git.u.kleine-koenig@pengutronix.de	2024-04-10 14:59:30 +02:00
Zhang Rui	acf68d98ca	perf/x86/rapl: Add support for Intel Lunar Lake Lunar Lake RAPL support is the same as previous Sky Lake. Add Lunar Lake model for RAPL. Signed-off-by: Zhang Rui <rui.zhang@intel.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20240410124554.448987-2-rui.zhang@intel.com	2024-04-10 14:48:18 +02:00
Zhang Rui	fb70fe74be	perf/x86/rapl: Add support for Intel Arrow Lake Arrow Lake RAPL support is the same as previous Sky Lake. Add Arrow Lake model for RAPL. Signed-off-by: Zhang Rui <rui.zhang@intel.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20240410124554.448987-1-rui.zhang@intel.com	2024-04-10 14:48:18 +02:00
Daniel Sneddon	04f4230e2f	x86/bugs: Fix return type of spectre_bhi_state() The definition of spectre_bhi_state() incorrectly returns a const char * const. This causes the a compiler warning when building with W=1: warning: type qualifiers ignored on function return type [-Wignored-qualifiers] 2812 \| static const char * const spectre_bhi_state(void) Remove the const qualifier from the pointer. Fixes: `ec9404e40e` ("x86/bhi: Add BHI mitigation knob") Reported-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Daniel Sneddon <daniel.sneddon@linux.intel.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/20240409230806.1545822-1-daniel.sneddon@linux.intel.com	2024-04-10 07:05:04 +02:00
Ingo Molnar	a40d2525ea	Merge branch 'linus' into x86/urgent, to pick up dependent commits Prepare to fix aspects of the new BHI code. Signed-off-by: Ingo Molnar <mingo@kernel.org>	2024-04-10 07:04:04 +02:00
Ingo Molnar	dbbe13a6f6	x86/cpu: Improve readability of per-CPU cpumask initialization code In smp_prepare_cpus_common() and x2apic_prepare_cpu(): - use 'cpu' instead of 'i' - use 'node' instead of 'n' - use vertical alignment to improve readability - better structure basic blocks - reduce col80 checkpatch damage Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: linux-kernel@vger.kernel.org	2024-04-10 07:02:33 +02:00
Li RongQing	e0a9ac192f	x86/cpu: Take NUMA node into account when allocating per-CPU cpumasks per-CPU cpumasks are dominantly accessed from their own local CPUs, so allocate them node-local to improve performance. [ mingo: Rewrote the changelog. ] Signed-off-by: Li RongQing <lirongqing@baidu.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20240410030114.6201-1-lirongqing@baidu.com	2024-04-10 06:55:31 +02:00
Namhyung Kim	dec8ced871	perf/x86: Fix out of range data On x86 each struct cpu_hw_events maintains a table for counter assignment but it missed to update one for the deleted event in x86_pmu_del(). This can make perf_clear_dirty_counters() reset used counter if it's called before event scheduling or enabling. Then it would return out of range data which doesn't make sense. The following code can reproduce the problem. $ cat repro.c #include <pthread.h> #include <stdio.h> #include <stdlib.h> #include <unistd.h> #include <linux/perf_event.h> #include <sys/ioctl.h> #include <sys/mman.h> #include <sys/syscall.h> struct perf_event_attr attr = { .type = PERF_TYPE_HARDWARE, .config = PERF_COUNT_HW_CPU_CYCLES, .disabled = 1, }; void worker(void arg) { int cpu = (long)arg; int fd1 = syscall(SYS_perf_event_open, &attr, -1, cpu, -1, 0); int fd2 = syscall(SYS_perf_event_open, &attr, -1, cpu, -1, 0); void p; do { ioctl(fd1, PERF_EVENT_IOC_ENABLE, 0); p = mmap(NULL, 4096, PROT_READ, MAP_SHARED, fd1, 0); ioctl(fd2, PERF_EVENT_IOC_ENABLE, 0); ioctl(fd2, PERF_EVENT_IOC_DISABLE, 0); munmap(p, 4096); ioctl(fd1, PERF_EVENT_IOC_DISABLE, 0); } while (1); return NULL; } int main(void) { int i; int n = sysconf(_SC_NPROCESSORS_ONLN); pthread_t th = calloc(n, sizeof(th)); for (i = 0; i < n; i++) pthread_create(&th[i], NULL, worker, (void )(long)i); for (i = 0; i < n; i++) pthread_join(th[i], NULL); free(th); return 0; } And you can see the out of range data using perf stat like this. Probably it'd be easier to see on a large machine. $ gcc -o repro repro.c -pthread $ ./repro & $ sudo perf stat -A -I 1000 2>&1 \| awk '{ if (length($3) > 15) print }' 1.001028462 CPU6 196,719,295,683,763 cycles # 194290.996 GHz (71.54%) 1.001028462 CPU3 396,077,485,787,730 branch-misses # 15804359784.80% of all branches (71.07%) 1.001028462 CPU17 197,608,350,727,877 branch-misses # 14594186554.56% of all branches (71.22%) 2.020064073 CPU4 198,372,472,612,140 cycles # 194681.113 GHz (70.95%) 2.020064073 CPU6 199,419,277,896,696 cycles # 195720.007 GHz (70.57%) 2.020064073 CPU20 198,147,174,025,639 cycles # 194474.654 GHz (71.03%) 2.020064073 CPU20 198,421,240,580,145 stalled-cycles-frontend # 100.14% frontend cycles idle (70.93%) 3.037443155 CPU4 197,382,689,923,416 cycles # 194043.065 GHz (71.30%) 3.037443155 CPU20 196,324,797,879,414 cycles # 193003.773 GHz (71.69%) 3.037443155 CPU5 197,679,956,608,205 stalled-cycles-backend # 1315606428.66% backend cycles idle (71.19%) 3.037443155 CPU5 198,571,860,474,851 instructions # 13215422.58 insn per cycle It should move the contents in the cpuc->assign as well. Fixes: `5471eea5d3` ("perf/x86: Reset the dirty counter to prevent the leak for an RDPMC task") Signed-off-by: Namhyung Kim <namhyung@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Kan Liang <kan.liang@linux.intel.com> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20240306061003.1894224-1-namhyung@kernel.org	2024-04-10 06:12:01 +02:00
Li RongQing	a952d608f0	KVM: Use vfree for memory allocated by vcalloc()/__vcalloc() commit 37b2a6510a48("KVM: use __vcalloc for very large allocations") replaced kvzalloc()/kvcalloc() with vcalloc(), but didn't replace kvfree() with vfree(). Signed-off-by: Li RongQing <lirongqing@baidu.com> Link: https://lore.kernel.org/r/20240131012357.53563-1-lirongqing@baidu.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-04-09 12:18:38 -07:00
Gerd Hoffmann	b628cb523c	KVM: x86: Advertise max mappable GPA in CPUID.0x80000008.GuestPhysBits Use the GuestPhysBits field in CPUID.0x80000008 to communicate the max mappable GPA to userspace, i.e. the max GPA that is addressable by the CPU itself. Typically this is identical to the max effective GPA, except in the case where the CPU supports MAXPHYADDR > 48 but does not support 5-level TDP (the CPU consults bits 51:48 of the GPA only when walking the fifth level TDP page table entry). Enumerating the max mappable GPA via CPUID will allow guest firmware to map resources like PCI bars in the highest possible address space, while ensuring that the GPA is addressable by the CPU. Without precise knowledge about the max mappable GPA, the guest must assume that 5-level paging is unsupported and thus restrict its mappings to the lower 48 bits. Advertise the max mappable GPA via KVM_GET_SUPPORTED_CPUID as userspace doesn't have easy access to whether or not 5-level paging is supported, and to play nice with userspace VMMs that reflect the supported CPUID directly into the guest. AMD's APM (3.35) defines GuestPhysBits (EAX[23:16]) as: Maximum guest physical address size in bits. This number applies only to guests using nested paging. When this field is zero, refer to the PhysAddrSize field for the maximum guest physical address size. Tom Lendacky confirmed that the purpose of GuestPhysBits is software use and KVM can use it as described above. Real hardware always returns zero. Leave GuestPhysBits as '0' when TDP is disabled in order to comply with the APM's statement that GuestPhysBits "applies only to guest using nested paging". As above, guest firmware will likely create suboptimal mappings, but that is a very minor issue and not a functional concern. Signed-off-by: Gerd Hoffmann <kraxel@redhat.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Link: https://lore.kernel.org/r/20240313125844.912415-3-kraxel@redhat.com [sean: massage changelog] Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-04-09 12:18:37 -07:00
Gerd Hoffmann	6f5c960062	KVM: x86: Don't advertise guest.MAXPHYADDR as host.MAXPHYADDR in CPUID Drop KVM's propagation of GuestPhysBits (CPUID leaf 80000008, EAX[23:16]) to HostPhysBits (same leaf, EAX[7:0]) when advertising the address widths to userspace via KVM_GET_SUPPORTED_CPUID. Per AMD, GuestPhysBits is intended for software use, and physical CPUs do not set that field. I.e. GuestPhysBits will be non-zero if and only if KVM is running as a nested hypervisor, and in that case, GuestPhysBits is NOT guaranteed to capture the CPU's effective MAXPHYADDR when running with TDP enabled. E.g. KVM will soon use GuestPhysBits to communicate the CPU's maximum addressable guest physical address, which would result in KVM under- reporting PhysBits when running as an L1 on a CPU with MAXPHYADDR=52, but without 5-level paging. Signed-off-by: Gerd Hoffmann <kraxel@redhat.com> Cc: stable@vger.kernel.org Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Link: https://lore.kernel.org/r/20240313125844.912415-2-kraxel@redhat.com [sean: rewrite changelog with --verbose, Cc stable@] Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-04-09 12:18:22 -07:00
Sean Christopherson	23ffe4bbf8	KVM: nVMX: Add a sanity check that nested PML Full stems from EPT Violations Add a WARN_ON_ONCE() sanity check to verify that a nested PML Full VM-Exit is only synthesized when the original VM-Exit from L2 was an EPT Violation. While KVM can fallthrough to kvm_mmu_do_page_fault() if an EPT Misconfig occurs on a stale MMIO SPTE, KVM should not treat the access as a write (there isn't enough information to know what the access was), i.e. KVM should never try to insert a PML entry in that case. Link: https://lore.kernel.org/r/20240209221700.393189-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-04-09 10:24:36 -07:00
Sean Christopherson	a946607868	KVM: x86: Move nEPT exit_qualification field from kvm_vcpu_arch to x86_exception Move the exit_qualification field that is used to track information about in-flight nEPT violations from "struct kvm_vcpu_arch" to "x86_exception", i.e. associate the information with the actual nEPT violation instead of the vCPU. To handle bits that are pulled from vmcs.EXIT_QUALIFICATION, i.e. that are propagated from the "original" EPT violation VM-Exit, simply grab them from the VMCS on-demand when injecting a nEPT Violation or a PML Full VM-exit. Aside from being ugly, having an exit_qualification field in kvm_vcpu_arch is outright dangerous, e.g. see commit `d7f0a00e43` ("KVM: VMX: Report up-to-date exit qualification to userspace"). Opportunstically add a comment to call out that PML Full and EPT Violation VM-Exits use the same bit to report NMI blocking information. Link: https://lore.kernel.org/r/20240209221700.393189-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-04-09 10:24:36 -07:00
Sean Christopherson	0c47651403	KVM: nVMX: Clear EXIT_QUALIFICATION when injecting an EPT Misconfig Explicitly clear the EXIT_QUALIFCATION field when injecting an EPT misconfig into L1, as required by the VMX architecture. Per the SDM: This field is saved for VM exits due to the following causes: debug exceptions; page-fault exceptions; start-up IPIs (SIPIs); system-management interrupts (SMIs) that arrive immediately after the execution of I/O instructions; task switches; INVEPT; INVLPG; INVPCID; INVVPID; LGDT; LIDT; LLDT; LTR; SGDT; SIDT; SLDT; STR; VMCLEAR; VMPTRLD; VMPTRST; VMREAD; VMWRITE; VMXON; WBINVD; WBNOINVD; XRSTORS; XSAVES; control-register accesses; MOV DR; I/O instructions; MWAIT; accesses to the APIC-access page; EPT violations; EOI virtualization; APIC-write emulation; page-modification log full; SPP-related events; and instruction timeout. For all other VM exits, this field is cleared. Generating EXIT_QUALIFICATION from vcpu->arch.exit_qualification is wrong for all (two) paths that lead to nested_ept_inject_page_fault(). For EPT violations (the common case), vcpu->arch.exit_qualification will have been set by handle_ept_violation() to vmcs02.EXIT_QUALIFICATION, i.e. contains the information of a EPT violation and thus is likely non-zero. For an EPT misconfig, which can reach FNAME(walk_addr_generic) and thus inject a nEPT misconfig if KVM created an MMIO SPTE that became stale, vcpu->arch.exit_qualification will hold the information from the last EPT violation VM-Exit, as vcpu->arch.exit_qualification is _only_ written by handle_ept_violation(). Fixes: `4704d0befb` ("KVM: nVMX: Exiting from L2 to L1") Link: https://lore.kernel.org/r/20240209221700.393189-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-04-09 10:24:36 -07:00
Alexei Starovoitov	d503a04f8b	bpf: Add support for certain atomics in bpf_arena to x86 JIT Support atomics in bpf_arena that can be JITed as a single x86 instruction. Instructions that are JITed as loops are not supported at the moment, since they require more complex extable and loop logic. JITs can choose to do smarter things with bpf_jit_supports_insn(). Like arm64 may decide to support all bpf atomics instructions when emit_lse_atomic is available and none in ll_sc mode. bpf_jit_supports_percpu_insn(), bpf_jit_supports_ptr_xchg() and other such callbacks can be replaced with bpf_jit_supports_insn() in the future. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20240405231134.17274-1-alexei.starovoitov@gmail.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>	2024-04-09 10:24:26 -07:00
Sean Christopherson	27ca867042	KVM: x86: Stop compiling vmenter.S with OBJECT_FILES_NON_STANDARD Stop compiling vmenter.S with OBJECT_FILES_NON_STANDARD to skip objtool's stack validation now that __svm_vcpu_run() and __svm_sev_es_vcpu_run() create stack frames (though the former's effectiveness is dubious). Note, due to a quirk in how OBJECT_FILES_NON_STANDARD was handled by the build system prior to commit `bf48d9b756` ("kbuild: change tool coverage variables to take the path relative to $(obj)"), vmx/vmenter.S got lumped in with svm/vmenter.S. __vmx_vcpu_run() already plays nice with frame pointers, i.e. it was collateral damage when commit `7f4b5cde24` ("kvm: Disable objtool frame pointer checking for vmenter.S") added the OBJECT_FILES_NON_STANDARD hack-a-fix. Link: https://lore.kernel.org/all/20240217055504.2059803-1-masahiroy@kernel.org Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Link: https://lore.kernel.org/r/20240223204233.3337324-9-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-04-09 10:21:44 -07:00
Sean Christopherson	4367a75887	KVM: SVM: Create a stack frame in __svm_sev_es_vcpu_run() Now that KVM uses the host save area to context switch RBP, i.e. preserves RBP for the entirety of __svm_sev_es_vcpu_run(), create a stack frame using the standared FRAME_{BEGIN,END} macros. Note, __svm_sev_es_vcpu_run() is subtly not a leaf function as it can call into ibpb_feature() via UNTRAIN_RET_VM. Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Link: https://lore.kernel.org/r/20240223204233.3337324-8-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-04-09 10:21:10 -07:00
Sean Christopherson	adac42bf42	KVM: SVM: Save/restore args across SEV-ES VMRUN via host save area Use the host save area to preserve volatile registers that are used in __svm_sev_es_vcpu_run() to access function parameters after #VMEXIT. Like saving/restoring non-volatile registers, there's no reason not to take advantage of hardware restoring registers on #VMEXIT, as doing so shaves a few instructions and the save area is going to be accessed no matter what. Converting all register save/restore code to use the host save area also make it easier to follow the SEV-ES VMRUN flow in its entirety, as opposed to having a mix of stack-based versus host save area save/restore. Add a parameter to RESTORE_HOST_SPEC_CTRL_BODY so that the SEV-ES path doesn't need to write @spec_ctrl_intercepted to memory just to play nice with the common macro. Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Link: https://lore.kernel.org/r/20240223204233.3337324-7-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-04-09 10:21:10 -07:00
Sean Christopherson	c92be2fd8e	KVM: SVM: Save/restore non-volatile GPRs in SEV-ES VMRUN via host save area Use the host save area to save/restore non-volatile (callee-saved) registers in __svm_sev_es_vcpu_run() to take advantage of hardware loading all registers from the save area on #VMEXIT. KVM still needs to save the registers it wants restored, but the loads are handled automatically by hardware. Aside from less assembly code, letting hardware do the restoration means stack frames are preserved for the entirety of __svm_sev_es_vcpu_run(). Opportunistically add a comment to call out why @svm needs to be saved across VMRUN->#VMEXIT, as it's not easy to decipher that from the macro hell. Cc: Tom Lendacky <thomas.lendacky@amd.com> Cc: Michael Roth <michael.roth@amd.com> Cc: Alexey Kardashevskiy <aik@amd.com> Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Link: https://lore.kernel.org/r/20240223204233.3337324-6-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-04-09 10:20:29 -07:00
Sean Christopherson	87e8e360a0	KVM: SVM: Clobber RAX instead of RBX when discarding spec_ctrl_intercepted POP @spec_ctrl_intercepted into RAX instead of RBX when discarding it from the stack so that __svm_sev_es_vcpu_run() doesn't modify any non-volatile registers. __svm_sev_es_vcpu_run() doesn't return a value, and RAX is already are clobbered multiple times in the #VMEXIT path. This will allowing using the host save area to save/restore non-volatile registers in __svm_sev_es_vcpu_run(). Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Link: https://lore.kernel.org/r/20240223204233.3337324-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-04-09 10:20:29 -07:00
Sean Christopherson	331282fdb1	KVM: SVM: Drop 32-bit "support" from __svm_sev_es_vcpu_run() Drop 32-bit "support" from __svm_sev_es_vcpu_run(), as SEV/SEV-ES firmly 64-bit only. The "support" was purely the result of bad copy+paste from __svm_vcpu_run(), which in turn was slightly less bad copy+paste from __vmx_vcpu_run(). Opportunistically convert to unadulterated register accesses so that it's easier (but still not easy) to follow which registers hold what arguments, and when. Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Link: https://lore.kernel.org/r/20240223204233.3337324-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-04-09 10:20:29 -07:00
Sean Christopherson	7774c8f32e	KVM: SVM: Wrap __svm_sev_es_vcpu_run() with #ifdef CONFIG_KVM_AMD_SEV Compile (and link) __svm_sev_es_vcpu_run() if and only if SEV support is actually enabled. This will allow dropping non-existent 32-bit "support" from __svm_sev_es_vcpu_run() without causing undue confusion. Intentionally don't provide a stub (but keep the declaration), as any sane compiler, even with things like KASAN enabled, should eliminate the call to __svm_sev_es_vcpu_run() since sev_es_guest() unconditionally returns "false" if CONFIG_KVM_AMD_SEV=n. Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Link: https://lore.kernel.org/r/20240223204233.3337324-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-04-09 10:20:28 -07:00
Sean Christopherson	19597a71a0	KVM: SVM: Create a stack frame in __svm_vcpu_run() for unwinding Unconditionally create a stack frame in __svm_vcpu_run() to play nice with unwinding via frame pointers, at least until the point where RBP is loaded with the guest's value. Don't bother conditioning the code on CONFIG_FRAME_POINTER=y, as RBP needs to be saved and restored anyways (due to it being clobbered with the guest's value); omitting the "MOV RSP, RBP" is not worth the extra #ifdef. Creating a stack frame will allow removing the OBJECT_FILES_NON_STANDARD tag from vmenter.S once __svm_sev_es_vcpu_run() is fixed to not stomp all over RBP for no reason. Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Link: https://lore.kernel.org/r/20240223204233.3337324-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-04-09 10:20:28 -07:00
Christophe JAILLET	4710e4fc3e	KVM: SVM: Remove a useless zeroing of allocated memory Remove KVM's unnecessary zeroing of memory when allocating the pages array in sev_pin_memory() via __vmalloc(), as the array is only used to hold kernel pointers. The kmalloc() path for "small" regions doesn't zero the array, and if KVM leaks state and/or accesses uninitialized data, then the kernel has bigger problems. Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Link: https://lore.kernel.org/r/c7619a3d3cbb36463531a7c73ccbde9db587986c.1710004509.git.christophe.jaillet@wanadoo.fr [sean: massage changelog] Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-04-09 10:15:30 -07:00
David Matlack	aca48556c5	KVM: x86/mmu: Process atomically-zapped SPTEs after TLB flush When zapping TDP MMU SPTEs under read-lock, processes zapped SPTEs after flushing TLBs and after replacing the special REMOVED_SPTE with '0'. When zapping an SPTE that points to a page table, processing SPTEs after flushing TLBs minimizes contention on the child SPTEs (e.g. vCPUs won't hit write-protection faults via stale, read-only child SPTEs), and processing after replacing REMOVED_SPTE with '0' minimizes the amount of time vCPUs will be blocked by the REMOVED_SPTE. Processing SPTEs after setting the SPTE to '0', i.e. in parallel with the SPTE potentially being replacing with a new SPTE, is safe because KVM does not depend on completing the processing before a new SPTE is installed, and the processing is done on a subset of the page tables that is disconnected from the root, and thus unreachable by other tasks (after the TLB flush). KVM already relies on similar logic, as kvm_mmu_zap_all_fast() can result in KVM processing all SPTEs in a given root after vCPUs create mappings in a new root. In VMs with a large (400+) number of vCPUs, it can take KVM multiple seconds to process a 1GiB region mapped with 4KiB entries, e.g. when disabling dirty logging in a VM backed by 1GiB HugeTLB. During those seconds, if a vCPU accesses the 1GiB region being zapped it will be stalled until KVM finishes processing the SPTE and replaces the REMOVED_SPTE with 0. Re-ordering the processing does speed up the atomic-zaps somewhat, but the main benefit is avoiding blocking vCPU threads. Before: $ ./dirty_log_perf_test -s anonymous_hugetlb_1gb -v 416 -b 1G -e ... Disabling dirty logging time: 509.765146313s $ ./funclatency -m tdp_mmu_zap_spte_atomic msec : count distribution 0 -> 1 : 0 \| \| 2 -> 3 : 0 \| \| 4 -> 7 : 0 \| \| 8 -> 15 : 0 \| \| 16 -> 31 : 0 \| \| 32 -> 63 : 0 \| \| 64 -> 127 : 0 \| \| 128 -> 255 : 8 \| \| 256 -> 511 : 68 \|************** \| 512 -> 1023 : 129 \|****************************** \| 1024 -> 2047 : 151 \|************************************\| 2048 -> 4095 : 60 \|*********** \| After: $ ./dirty_log_perf_test -s anonymous_hugetlb_1gb -v 416 -b 1G -e ... Disabling dirty logging time: 336.516838548s $ ./funclatency -m tdp_mmu_zap_spte_atomic msec : count distribution 0 -> 1 : 0 \| \| 2 -> 3 : 0 \| \| 4 -> 7 : 0 \| \| 8 -> 15 : 0 \| \| 16 -> 31 : 0 \| \| 32 -> 63 : 0 \| \| 64 -> 127 : 0 \| \| 128 -> 255 : 12 \| \| 256 -> 511 : 166 \|**************************************\| 512 -> 1023 : 101 \|******************** \| 1024 -> 2047 : 137 \|******************************* \| Note, KVM's processing of collapsible SPTEs is still extremely slow and can be improved. For example, a significant amount of time is spent calling kvm_set_pfn_{accessed,dirty}() for every last-level SPTE, even when processing SPTEs that all map the same folio. But avoiding blocking vCPUs and contending SPTEs is valuable regardless of how fast KVM can process collapsible SPTEs. Link: https://lore.kernel.org/all/20240320005024.3216282-1-seanjc@google.com Cc: Vipin Sharma <vipinsh@google.com> Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: David Matlack <dmatlack@google.com> Reviewed-by: Vipin Sharma <vipinsh@google.com> Link: https://lore.kernel.org/r/20240307194059.1357377-1-dmatlack@google.com [sean: massage changelog] Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-04-09 09:37:47 -07:00
Borislav Petkov (AMD)	05d277c9a9	x86/alternatives: Sort local vars in apply_alternatives() In a reverse x-mas tree. No functional changes. Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20240130105941.19707-5-bp@alien8.de	2024-04-09 18:16:57 +02:00
Borislav Petkov (AMD)	c3a3cb5c3d	x86/alternatives: Optimize optimize_nops() Return early if NOPs have already been optimized. Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20240130105941.19707-4-bp@alien8.de	2024-04-09 18:15:03 +02:00
Borislav Petkov (AMD)	da8f9cf7e7	x86/alternatives: Get rid of __optimize_nops() There's no need to carve out bits of the NOP optimization functionality and look at JMP opcodes - simply do one more NOPs optimization pass at the end of patching. A lot simpler code. Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20240130105941.19707-3-bp@alien8.de	2024-04-09 18:12:53 +02:00
Borislav Petkov (AMD)	f796c75837	x86/alternatives: Use a temporary buffer when optimizing NOPs Instead of optimizing NOPs in-place, use a temporary buffer like the usual alternatives patching flow does. This obviates the need to grab locks when patching, see `6778977590` ("x86/alternatives: Disable interrupts and sync when optimizing NOPs in place") While at it, add nomenclature definitions clarifying and simplifying the naming of function-local variables in the alternatives code. Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20240130105941.19707-2-bp@alien8.de	2024-04-09 18:08:11 +02:00
Borislav Petkov (AMD)	ee8962082a	x86/alternatives: Catch late X86_FEATURE modifiers After alternatives have been patched, changes to the X86_FEATURE flags won't take effect and could potentially even be wrong. Warn about it. This is something which has been long overdue. Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Signed-off-by: Ingo Molnar <mingo@kernel.org> Tested-by: Srikanth Aithal <sraithal@amd.com> Link: https://lore.kernel.org/r/20240327154317.29909-3-bp@alien8.de	2024-04-09 18:03:53 +02:00
Borislav Petkov (AMD)	a0c8cf9780	x86/alternatives: Remove a superfluous newline in _static_cpu_has() No functional changes. Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20240327154317.29909-2-bp@alien8.de	2024-04-09 17:59:10 +02:00
Pawan Gupta	53bc516ade	x86/msr: Move ARCH_CAP_XAPIC_DISABLE bit definition to its rightful place The ARCH_CAP_XAPIC_DISABLE bit of MSR_IA32_ARCH_CAP is not in the correct sorted order. Move it where it belongs. No functional change. [ bp: Massage commit message. ] Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/243317ff6c8db307b7701a45f71e5c21da80194b.1705632532.git.pawan.kumar.gupta@linux.intel.com	2024-04-09 17:39:54 +02:00
Lai Jiangshan	b767fe5de0	x86/entry: Merge thunk_64.S and thunk_32.S into thunk.S The code in thunk_64.S and thunk_32.S are exactly the same except for the comments. Merge them in to thunk.S. And since thunk_32.S was originated from thunk_64.S, the new merged thunk.S is actually renamed from thunk_64.S. Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20240407090558.3395-9-jiangshanlai@gmail.com	2024-04-09 09:57:56 +02:00
Uros Bizjak	aef95dac9c	locking/atomic/x86: Introduce arch_try_cmpxchg64() for !CONFIG_X86_CMPXCHG64 Commit: `6d12c8d308` ("percpu: Wire up cmpxchg128") improved emulated cmpxchg8b_emu() library function to return success/failure in a ZF flag. Define arch_try_cmpxchg64() for !CONFIG_X86_CMPXCHG64 targets to override the generic archy_try_cmpxchg() with an optimized target specific implementation that handles ZF flag. The assembly code at the call sites improves from: bf56d: e8 fc ff ff ff call cmpxchg8b_emu bf572: 8b 74 24 28 mov 0x28(%esp),%esi bf576: 89 c3 mov %eax,%ebx bf578: 89 d1 mov %edx,%ecx bf57a: 8b 7c 24 2c mov 0x2c(%esp),%edi bf57e: 89 f0 mov %esi,%eax bf580: 89 fa mov %edi,%edx bf582: 31 d8 xor %ebx,%eax bf584: 31 ca xor %ecx,%edx bf586: 09 d0 or %edx,%eax bf588: 0f 84 e3 01 00 00 je bf771 <...> to: bf572: e8 fc ff ff ff call cmpxchg8b_emu bf577: 0f 84 b6 01 00 00 je bf733 <...> Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Link: https://lore.kernel.org/r/20240408091547.90111-4-ubizjak@gmail.com	2024-04-09 09:51:03 +02:00
Uros Bizjak	7016cc5def	locking/atomic/x86: Modernize x86_32 arch_{,try_}_cmpxchg64{,_local}() Commit: `b23e139d0b` ("arch: Introduce arch_{,try_}_cmpxchg128{,_local}()") introduced arch_{,try_}_cmpxchg128{,_local}() for x86_64 targets. Modernize existing x86_32 arch_{,try_}_cmpxchg64{,_local}() definitions to follow the same structure as the definitions introduced by the above commit. No functional changes intended. Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Link: https://lore.kernel.org/r/20240408091547.90111-3-ubizjak@gmail.com	2024-04-09 09:51:03 +02:00
Uros Bizjak	929ad065ba	locking/atomic/x86: Correct the definition of __arch_try_cmpxchg128() Correct the definition of __arch_try_cmpxchg128(), introduced by: `b23e139d0b` ("arch: Introduce arch_{,try_}_cmpxchg128{,_local}()") Fixes: `b23e139d0b` ("arch: Introduce arch_{,try_}_cmpxchg128{,_local}()") Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Link: https://lore.kernel.org/r/20240408091547.90111-2-ubizjak@gmail.com	2024-04-09 09:51:02 +02:00
Ingo Molnar	d1eec383a8	Linux 6.9-rc3 -----BEGIN PGP SIGNATURE----- iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmYTAJYeHHRvcnZhbGRz QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiG2bEH/jOBXd0ZCz+s9+F4 TbSvDEE8UjitQdEJ5WSBY9CEvFI8OuVQr23gYPUn+gfgqLX0Vsp8HfxL6bBP5Tj6 DSzAkwF/mvIfa6VLFmO1GmvyhYtmWkmbM825tNqKHSNTBc9cCLH3H+780wNtTMwQ VkB8O3KS/wZBGKSbFfiXW+fT3SkWIMLtdBAaox+vcxHXpiluXxSbxANRD5kTbdG0 UAW9S4+3A0jNk/KeXEvJDqkf7C3ASsjtNPbK+gFDfOXxdNYFTC2IUf93rL61VB4s C2rtUklcLE8gFDtvkQ8JtGWmDj4pWPEDIyhICKlzP/aKCjXcNzLaoM0GJQYJS+PN aNevw24= =318J -----END PGP SIGNATURE----- Merge tag 'v6.9-rc3' into locking/core, to pick up fixes Signed-off-by: Ingo Molnar <mingo@kernel.org>	2024-04-09 09:48:09 +02:00
Tony Luck	7911f145de	x86/mce: Implement recovery for errors in TDX/SEAM non-root mode Machine check SMIs (MSMI) signaled during SEAM operation (typically inside TDX guests), on a system with Intel eMCA enabled, might eventually be reported to the kernel #MC handler with the saved RIP on the stack pointing to the instruction in kernel code after the SEAMCALL instruction that entered the SEAM operation. Linux currently says that is a fatal error and shuts down. There is a new bit in IA32_MCG_STATUS that, when set to 1, indicates that the machine check didn't originally occur at that saved RIP, but during SEAM non-root operation. Add new entries to the severity table to detect this for both data load and instruction fetch that set the severity to "AR" (action required). Increase the width of the mcgmask/mcgres fields in "struct severity" from unsigned char to unsigned short since the new bit is in position 12. Action required for these errors is just mark the page as poisoned and return from the machine check handler. HW ABI notes: ============= The SEAM_NR bit in IA32_MCG_STATUS hasn't yet made it into the Intel Software Developers' Manual. But it is described in section 16.5.2 of "Intel(R) Trust Domain Extensions (Intel(R) TDX) Module Base Architecture Specification" downloadable from: https://cdrdv2.intel.com/v1/dl/getContent/733575 Backport notes: =============== Little value in backporting this patch to stable or LTS kernels as this is only relevant with support for TDX, which I assume won't be backported. But for anyone taking this to v6.1 or older, you also need commit: `a51cbd0d86` ("x86/mce: Use severity table to handle uncorrected errors in kernel") Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20240408180944.44638-1-tony.luck@intel.com	2024-04-09 09:30:36 +02:00
Ingo Molnar	0e6ebfd163	Linux 6.9-rc3 -----BEGIN PGP SIGNATURE----- iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmYTAJYeHHRvcnZhbGRz QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiG2bEH/jOBXd0ZCz+s9+F4 TbSvDEE8UjitQdEJ5WSBY9CEvFI8OuVQr23gYPUn+gfgqLX0Vsp8HfxL6bBP5Tj6 DSzAkwF/mvIfa6VLFmO1GmvyhYtmWkmbM825tNqKHSNTBc9cCLH3H+780wNtTMwQ VkB8O3KS/wZBGKSbFfiXW+fT3SkWIMLtdBAaox+vcxHXpiluXxSbxANRD5kTbdG0 UAW9S4+3A0jNk/KeXEvJDqkf7C3ASsjtNPbK+gFDfOXxdNYFTC2IUf93rL61VB4s C2rtUklcLE8gFDtvkQ8JtGWmDj4pWPEDIyhICKlzP/aKCjXcNzLaoM0GJQYJS+PN aNevw24= =318J -----END PGP SIGNATURE----- Merge tag 'v6.9-rc3' into x86/cpu, to pick up fixes Signed-off-by: Ingo Molnar <mingo@kernel.org>	2024-04-09 09:28:41 +02:00
Linus Torvalds	2bb69f5fc7	x86 mitigations for the native BHI hardware vulnerabilty: Branch History Injection (BHI) attacks may allow a malicious application to influence indirect branch prediction in kernel by poisoning the branch history. eIBRS isolates indirect branch targets in ring0. The BHB can still influence the choice of indirect branch predictor entry, and although branch predictor entries are isolated between modes when eIBRS is enabled, the BHB itself is not isolated between modes. Add mitigations against it either with the help of microcode or with software sequences for the affected CPUs. -----BEGIN PGP SIGNATURE----- iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmYUKPMTHHRnbHhAbGlu dXRyb25peC5kZQAKCRCmGPVMDXSYofT8EACJJix+GzGUcJjOvfWFZcxwziY152hO 5XSzHOZZL6oz5Yk/Rye/S9RVTN7aDjn1CEvI0cD/ULxaTP869sS9dDdUcHhEJ//5 6hjqWsWiKc1QmLjBy3Pcb97GZHQXM5a9D1f6jXnJD+0FMLbQHpzSEBit0H4tv/TC 75myGgYihvUbhN9/bL10M5fz+UADU42nChvPWDMr9ukljjCqa46tPTmKUIAW5TWj /xsyf+Nk+4kZpdaidKGhpof6KCV2rNeevvzUGN8Pv5y13iAmvlyplqTcQ6dlubnZ CuDX5Ji9spNF9WmhKpLgy5N+Ocb64oVHov98N2zw1sT1N8XOYcSM0fBj7SQIFURs L7T4jBZS+1c3ZGJPPFWIaGjV8w1ZMhelglwJxjY7ZgRD6fK3mwRx/ks54J8H4HjE FbirXaZLeKlscDIOKtnxxKoIGwpdGwLKQYi/wEw7F9NhCLSj9wMia+j3uYIUEEHr 6xEiYEtyjcV3ocxagH7eiHyrasOKG64vjx2h1XodusBA2Wrvgm/jXlchUu+wb6B4 LiiZJt+DmOdQ1h5j3r2rt3hw7+nWa7kyq34qfN6NSUCHiedp6q7BClueSaKiOCGk RoNibNiS+CqaxwGxj/RGuvajEJeEMCsLuCxzT3aeaDBsqscW6Ka/HkGA76Tpb5nJ E3JyjYE7AlG4rw== =W0W3 -----END PGP SIGNATURE----- Merge tag 'nativebhi' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 mitigations from Thomas Gleixner: "Mitigations for the native BHI hardware vulnerabilty: Branch History Injection (BHI) attacks may allow a malicious application to influence indirect branch prediction in kernel by poisoning the branch history. eIBRS isolates indirect branch targets in ring0. The BHB can still influence the choice of indirect branch predictor entry, and although branch predictor entries are isolated between modes when eIBRS is enabled, the BHB itself is not isolated between modes. Add mitigations against it either with the help of microcode or with software sequences for the affected CPUs" [ This also ends up enabling the full mitigation by default despite the system call hardening, because apparently there are other indirect calls that are still sufficiently reachable, and the 'auto' case just isn't hardened enough. We'll have some more inevitable tweaking in the future - Linus ] * tag 'nativebhi' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: KVM: x86: Add BHI_NO x86/bhi: Mitigate KVM by default x86/bhi: Add BHI mitigation knob x86/bhi: Enumerate Branch History Injection (BHI) bug x86/bhi: Define SPEC_CTRL_BHI_DIS_S x86/bhi: Add support for clearing branch history at syscall entry x86/syscall: Don't force use of indirect calls for system calls x86/bugs: Change commas to semicolons in 'spectre_v2' sysfs file	2024-04-08 20:07:51 -07:00
Tao Su	7f2817ef52	KVM: VMX: Ignore MKTME KeyID bits when intercepting #PF for allow_smaller_maxphyaddr Use the raw/true host.MAXPHYADDR when deciding whether or not KVM must intercept #PFs when allow_smaller_maxphyaddr is enabled, as any adjustments the kernel makes to boot_cpu_data.x86_phys_bits to account for MKTME KeyID bits do not apply to the guest physical address space. I.e. the KeyID are off-limits for host physical addresses, but are not reserved for GPAs as far as hardware is concerned. Signed-off-by: Tao Su <tao1.su@linux.intel.com> Link: https://lore.kernel.org/r/20240319031111.495006-1-tao1.su@linux.intel.com [sean: massage changelog] Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-04-08 14:22:10 -07:00
Sean Christopherson	de120e1d69	KVM: x86/pmu: Set enable bits for GP counters in PERF_GLOBAL_CTRL at "RESET" Set the enable bits for general purpose counters in IA32_PERF_GLOBAL_CTRL when refreshing the PMU to emulate the MSR's architecturally defined post-RESET behavior. Per Intel's SDM: IA32_PERF_GLOBAL_CTRL: Sets bits n-1:0 and clears the upper bits. and Where "n" is the number of general-purpose counters available in the processor. AMD also documents this behavior for PerfMonV2 CPUs in one of AMD's many PPRs. Do not set any PERF_GLOBAL_CTRL bits if there are no general purpose counters, although a literal reading of the SDM would require the CPU to set either bits 63:0 or 31:0. The intent of the behavior is to globally enable all GP counters; honor the intent, if not the letter of the law. Leaving PERF_GLOBAL_CTRL '0' effectively breaks PMU usage in guests that haven't been updated to work with PMUs that support PERF_GLOBAL_CTRL. This bug was recently exposed when KVM added supported for AMD's PerfMonV2, i.e. when KVM started exposing a vPMU with PERF_GLOBAL_CTRL to guest software that only knew how to program v1 PMUs (that don't support PERF_GLOBAL_CTRL). Failure to emulate the post-RESET behavior results in such guests unknowingly leaving all general purpose counters globally disabled (the entire reason the post-RESET value sets the GP counter enable bits is to maintain backwards compatibility). The bug has likely gone unnoticed because PERF_GLOBAL_CTRL has been supported on Intel CPUs for as long as KVM has existed, i.e. hardly anyone is running guest software that isn't aware of PERF_GLOBAL_CTRL on Intel PMUs. And because up until v6.0, KVM _did_ emulate the behavior for Intel CPUs, although the old behavior was likely dumb luck. Because (a) that old code was also broken in its own way (the history of this code is a comedy of errors), and (b) PERF_GLOBAL_CTRL was documented as having a value of '0' post-RESET in all SDMs before March 2023. Initial vPMU support in commit `f5132b0138` ("KVM: Expose a version 2 architectural PMU to a guests") almost got it right (again likely by dumb luck), but for some reason only set the bits if the guest PMU was advertised as v1: if (pmu->version == 1) { pmu->global_ctrl = (1 << pmu->nr_arch_gp_counters) - 1; return; } Commit `f19a0c2c2e` ("KVM: PMU emulation: GLOBAL_CTRL MSR should be enabled on reset") then tried to remedy that goof, presumably because guest PMUs were leaving PERF_GLOBAL_CTRL '0', i.e. weren't enabling counters. pmu->global_ctrl = ((1 << pmu->nr_arch_gp_counters) - 1) \| (((1ull << pmu->nr_arch_fixed_counters) - 1) << X86_PMC_IDX_FIXED); pmu->global_ctrl_mask = ~pmu->global_ctrl; That was KVM's behavior up until commit `c49467a45f` ("KVM: x86/pmu: Don't overwrite the pmu->global_ctrl when refreshing") removed everything. However, it did so based on the behavior defined by the SDM , which at the time stated that "Global Perf Counter Controls" is '0' at Power-Up and RESET. But then the March 2023 SDM (325462-079US), stealthily changed its "IA-32 and Intel 64 Processor States Following Power-up, Reset, or INIT" table to say: IA32_PERF_GLOBAL_CTRL: Sets bits n-1:0 and clears the upper bits. Note, kvm_pmu_refresh() can be invoked multiple times, i.e. it's not a "pure" RESET flow. But it can only be called prior to the first KVM_RUN, i.e. the guest will only ever observe the final value. Note #2, KVM has always cleared global_ctrl during refresh (see commit `f5132b0138` ("KVM: Expose a version 2 architectural PMU to a guests")), i.e. there is no danger of breaking existing setups by clobbering a value set by userspace. Reported-by: Babu Moger <babu.moger@amd.com> Cc: Sandipan Das <sandipan.das@amd.com> Cc: Like Xu <like.xu.linux@gmail.com> Cc: Mingwei Zhang <mizhang@google.com> Cc: Dapeng Mi <dapeng1.mi@linux.intel.com> Cc: stable@vger.kernel.org Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Tested-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Link: https://lore.kernel.org/r/20240309013641.1413400-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-04-08 13:20:27 -07:00
Rick Edgecombe	992b54bd08	KVM: x86/mmu: x86: Don't overflow lpage_info when checking attributes Fix KVM_SET_MEMORY_ATTRIBUTES to not overflow lpage_info array and trigger KASAN splat, as seen in the private_mem_conversions_test selftest. When memory attributes are set on a GFN range, that range will have specific properties applied to the TDP. A huge page cannot be used when the attributes are inconsistent, so they are disabled for those the specific huge pages. For internal KVM reasons, huge pages are also not allowed to span adjacent memslots regardless of whether the backing memory could be mapped as huge. What GFNs support which huge page sizes is tracked by an array of arrays 'lpage_info' on the memslot, of ‘kvm_lpage_info’ structs. Each index of lpage_info contains a vmalloc allocated array of these for a specific supported page size. The kvm_lpage_info denotes whether a specific huge page (GFN and page size) on the memslot is supported. These arrays include indices for unaligned head and tail huge pages. Preventing huge pages from spanning adjacent memslot is covered by incrementing the count in head and tail kvm_lpage_info when the memslot is allocated, but disallowing huge pages for memory that has mixed attributes has to be done in a more complicated way. During the KVM_SET_MEMORY_ATTRIBUTES ioctl KVM updates lpage_info for each memslot in the range that has mismatched attributes. KVM does this a memslot at a time, and marks a special bit, KVM_LPAGE_MIXED_FLAG, in the kvm_lpage_info for any huge page. This bit is essentially a permanently elevated count. So huge pages will not be mapped for the GFN at that page size if the count is elevated in either case: a huge head or tail page unaligned to the memslot or if KVM_LPAGE_MIXED_FLAG is set because it has mixed attributes. To determine whether a huge page has consistent attributes, the KVM_SET_MEMORY_ATTRIBUTES operation checks an xarray to make sure it consistently has the incoming attribute. Since level - 1 huge pages are aligned to level huge pages, it employs an optimization. As long as the level - 1 huge pages are checked first, it can just check these and assume that if each level - 1 huge page contained within the level sized huge page is not mixed, then the level size huge page is not mixed. This optimization happens in the helper hugepage_has_attrs(). Unfortunately, although the kvm_lpage_info array representing page size 'level' will contain an entry for an unaligned tail page of size level, the array for level - 1 will not contain an entry for each GFN at page size level. The level - 1 array will only contain an index for any unaligned region covered by level - 1 huge page size, which can be a smaller region. So this causes the optimization to overflow the level - 1 kvm_lpage_info and perform a vmalloc out of bounds read. In some cases of head and tail pages where an overflow could happen, callers skip the operation completely as KVM_LPAGE_MIXED_FLAG is not required to prevent huge pages as discussed earlier. But for memslots that are smaller than the 1GB page size, it does call hugepage_has_attrs(). In this case the huge page is both the head and tail page. The issue can be observed simply by compiling the kernel with CONFIG_KASAN_VMALLOC and running the selftest “private_mem_conversions_test”, which produces the output like the following: BUG: KASAN: vmalloc-out-of-bounds in hugepage_has_attrs+0x7e/0x110 Read of size 4 at addr ffffc900000a3008 by task private_mem_con/169 Call Trace: dump_stack_lvl print_report ? __virt_addr_valid ? hugepage_has_attrs ? hugepage_has_attrs kasan_report ? hugepage_has_attrs hugepage_has_attrs kvm_arch_post_set_memory_attributes kvm_vm_ioctl It is a little ambiguous whether the unaligned head page (in the bug case also the tail page) should be expected to have KVM_LPAGE_MIXED_FLAG set. It is not functionally required, as the unaligned head/tail pages will already have their kvm_lpage_info count incremented. The comments imply not setting it on unaligned head pages is intentional, so fix the callers to skip trying to set KVM_LPAGE_MIXED_FLAG in this case, and in doing so not call hugepage_has_attrs(). Cc: stable@vger.kernel.org Fixes: `90b4fe1798` ("KVM: x86: Disallow hugepages when memory attributes are mixed") Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Reviewed-by: Chao Peng <chao.p.peng@linux.intel.com> Link: https://lore.kernel.org/r/20240314212902.2762507-1-rick.p.edgecombe@intel.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-04-08 13:20:26 -07:00
Sean Christopherson	9e985cbf29	KVM: x86/pmu: Disable support for adaptive PEBS Drop support for virtualizing adaptive PEBS, as KVM's implementation is architecturally broken without an obvious/easy path forward, and because exposing adaptive PEBS can leak host LBRs to the guest, i.e. can leak host kernel addresses to the guest. Bug #1 is that KVM doesn't account for the upper 32 bits of IA32_FIXED_CTR_CTRL when (re)programming fixed counters, e.g fixed_ctrl_field() drops the upper bits, reprogram_fixed_counters() stores local variables as u8s and truncates the upper bits too, etc. Bug #2 is that, because KVM _always_ sets precise_ip to a non-zero value for PEBS events, perf will _always_ generate an adaptive record, even if the guest requested a basic record. Note, KVM will also enable adaptive PEBS in individual counter, even if adaptive PEBS isn't exposed to the guest, but this is benign as MSR_PEBS_DATA_CFG is guaranteed to be zero, i.e. the guest will only ever see Basic records. Bug #3 is in perf. intel_pmu_disable_fixed() doesn't clear the upper bits either, i.e. leaves ICL_FIXED_0_ADAPTIVE set, and intel_pmu_enable_fixed() effectively doesn't clear ICL_FIXED_0_ADAPTIVE either. I.e. perf _always_ enables ADAPTIVE counters, regardless of what KVM requests. Bug #4 is that adaptive PEBS might effectively bypass event filters set by the host, as "Updated Memory Access Info Group" records information that might be disallowed by userspace via KVM_SET_PMU_EVENT_FILTER. Bug #5 is that KVM doesn't ensure LBR MSRs hold guest values (or at least zeros) when entering a vCPU with adaptive PEBS, which allows the guest to read host LBRs, i.e. host RIPs/addresses, by enabling "LBR Entries" records. Disable adaptive PEBS support as an immediate fix due to the severity of the LBR leak in particular, and because fixing all of the bugs will be non-trivial, e.g. not suitable for backporting to stable kernels. Note! This will break live migration, but trying to make KVM play nice with live migration would be quite complicated, wouldn't be guaranteed to work (i.e. KVM might still kill/confuse the guest), and it's not clear that there are any publicly available VMMs that support adaptive PEBS, let alone live migrate VMs that support adaptive PEBS, e.g. QEMU doesn't support PEBS in any capacity. Link: https://lore.kernel.org/all/20240306230153.786365-1-seanjc@google.com Link: https://lore.kernel.org/all/ZeepGjHCeSfadANM@google.com Fixes: `c59a1f106f` ("KVM: x86/pmu: Add IA32_PEBS_ENABLE MSR emulation for extended PEBS") Cc: stable@vger.kernel.org Cc: Like Xu <like.xu.linux@gmail.com> Cc: Mingwei Zhang <mizhang@google.com> Cc: Zhenyu Wang <zhenyuw@linux.intel.com> Cc: Zhang Xiong <xiong.y.zhang@intel.com> Cc: Lv Zhiyuan <zhiyuan.lv@intel.com> Cc: Dapeng Mi <dapeng1.mi@intel.com> Cc: Jim Mattson <jmattson@google.com> Acked-by: Like Xu <likexu@tencent.com> Link: https://lore.kernel.org/r/20240307005833.827147-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2024-04-08 13:20:25 -07:00
Daniel Sneddon	ed2e8d49b5	KVM: x86: Add BHI_NO Intel processors that aren't vulnerable to BHI will set MSR_IA32_ARCH_CAPABILITIES[BHI_NO] = 1;. Guests may use this BHI_NO bit to determine if they need to implement BHI mitigations or not. Allow this bit to be passed to the guests. Signed-off-by: Daniel Sneddon <daniel.sneddon@linux.intel.com> Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Daniel Sneddon <daniel.sneddon@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com> Reviewed-by: Josh Poimboeuf <jpoimboe@kernel.org>	2024-04-08 19:27:06 +02:00
Pawan Gupta	95a6ccbdc7	x86/bhi: Mitigate KVM by default BHI mitigation mode spectre_bhi=auto does not deploy the software mitigation by default. In a cloud environment, it is a likely scenario where userspace is trusted but the guests are not trusted. Deploying system wide mitigation in such cases is not desirable. Update the auto mode to unconditionally mitigate against malicious guests. Deploy the software sequence at VMexit in auto mode also, when hardware mitigation is not available. Unlike the force =on mode, software sequence is not deployed at syscalls in auto mode. Suggested-by: Alexandre Chartre <alexandre.chartre@oracle.com> Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Daniel Sneddon <daniel.sneddon@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com> Reviewed-by: Josh Poimboeuf <jpoimboe@kernel.org>	2024-04-08 19:27:06 +02:00
Pawan Gupta	ec9404e40e	x86/bhi: Add BHI mitigation knob Branch history clearing software sequences and hardware control BHI_DIS_S were defined to mitigate Branch History Injection (BHI). Add cmdline spectre_bhi={on\|off\|auto} to control BHI mitigation: auto - Deploy the hardware mitigation BHI_DIS_S, if available. on - Deploy the hardware mitigation BHI_DIS_S, if available, otherwise deploy the software sequence at syscall entry and VMexit. off - Turn off BHI mitigation. The default is auto mode which does not deploy the software sequence mitigation. This is because of the hardening done in the syscall dispatch path, which is the likely target of BHI. Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Daniel Sneddon <daniel.sneddon@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com> Reviewed-by: Josh Poimboeuf <jpoimboe@kernel.org>	2024-04-08 19:27:05 +02:00
Pawan Gupta	be482ff950	x86/bhi: Enumerate Branch History Injection (BHI) bug Mitigation for BHI is selected based on the bug enumeration. Add bits needed to enumerate BHI bug. Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Daniel Sneddon <daniel.sneddon@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com> Reviewed-by: Josh Poimboeuf <jpoimboe@kernel.org>	2024-04-08 19:27:05 +02:00
Daniel Sneddon	0f4a837615	x86/bhi: Define SPEC_CTRL_BHI_DIS_S Newer processors supports a hardware control BHI_DIS_S to mitigate Branch History Injection (BHI). Setting BHI_DIS_S protects the kernel from userspace BHI attacks without having to manually overwrite the branch history. Define MSR_SPEC_CTRL bit BHI_DIS_S and its enumeration CPUID.BHI_CTRL. Mitigation is enabled later. Signed-off-by: Daniel Sneddon <daniel.sneddon@linux.intel.com> Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Daniel Sneddon <daniel.sneddon@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com> Reviewed-by: Josh Poimboeuf <jpoimboe@kernel.org>	2024-04-08 19:27:05 +02:00
Pawan Gupta	7390db8aea	x86/bhi: Add support for clearing branch history at syscall entry Branch History Injection (BHI) attacks may allow a malicious application to influence indirect branch prediction in kernel by poisoning the branch history. eIBRS isolates indirect branch targets in ring0. The BHB can still influence the choice of indirect branch predictor entry, and although branch predictor entries are isolated between modes when eIBRS is enabled, the BHB itself is not isolated between modes. Alder Lake and new processors supports a hardware control BHI_DIS_S to mitigate BHI. For older processors Intel has released a software sequence to clear the branch history on parts that don't support BHI_DIS_S. Add support to execute the software sequence at syscall entry and VMexit to overwrite the branch history. For now, branch history is not cleared at interrupt entry, as malicious applications are not believed to have sufficient control over the registers, since previous register state is cleared at interrupt entry. Researchers continue to poke at this area and it may become necessary to clear at interrupt entry as well in the future. This mitigation is only defined here. It is enabled later. Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Co-developed-by: Daniel Sneddon <daniel.sneddon@linux.intel.com> Signed-off-by: Daniel Sneddon <daniel.sneddon@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com> Reviewed-by: Josh Poimboeuf <jpoimboe@kernel.org>	2024-04-08 19:27:05 +02:00
Linus Torvalds	1e3ad78334	x86/syscall: Don't force use of indirect calls for system calls Make <asm/syscall.h> build a switch statement instead, and the compiler can either decide to generate an indirect jump, or - more likely these days due to mitigations - just a series of conditional branches. Yes, the conditional branches also have branch prediction, but the branch prediction is much more controlled, in that it just causes speculatively running the wrong system call (harmless), rather than speculatively running possibly wrong random less controlled code gadgets. This doesn't mitigate other indirect calls, but the system call indirection is the first and most easily triggered case. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Daniel Sneddon <daniel.sneddon@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Josh Poimboeuf <jpoimboe@kernel.org>	2024-04-08 19:27:05 +02:00
Josh Poimboeuf	0cd01ac5dc	x86/bugs: Change commas to semicolons in 'spectre_v2' sysfs file Change the format of the 'spectre_v2' vulnerabilities sysfs file slightly by converting the commas to semicolons, so that mitigations for future variants can be grouped together and separated by commas. Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org> Signed-off-by: Daniel Sneddon <daniel.sneddon@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2024-04-08 19:27:05 +02:00
Arnd Bergmann	e0ca9353a8	x86/math-emu: Fix function cast warnings clang-16 warns about casting function pointers with incompatible prototypes. The x86 math-emu code does this in a number of places to call some trivial functions that need no arguments: arch/x86/math-emu/fpu_etc.c:124:14: error: cast from 'void ()(void)' to 'FUNC_ST0' \ (aka 'void ()(struct fpu__reg , unsigned char)') converts to incompatible function \ type [-Werror,-Wcast-function-type-strict] 124 \| fchs, fabs, (FUNC_ST0) FPU_illegal, (FUNC_ST0) FPU_illegal, \| ^~~~~~~~~~~~~~~~~~~~~~ arch/x86/math-emu/fpu_trig.c:1634:19: error: cast from 'void ()(void)' to 'FUNC_ST0' \ (aka 'void ()(struct fpu__reg , unsigned char)') converts to incompatible function \ type [-Werror,-Wcast-function-type-strict] 1634 \| fxtract, fprem1, (FUNC_ST0) fdecstp, (FUNC_ST0) fincstp \| ^~~~~~~~~~~~~~~~~~ arch/x86/math-emu/reg_constant.c:112:53: error: cast from 'void ()(void)' to 'FUNC_RC' \ (aka 'void ()(int)') converts to incompatible function \ type [-Werror,-Wcast-function-type-strict] 112 \| fld1, fldl2t, fldl2e, fldpi, fldlg2, fldln2, fldz, (FUNC_RC) FPU_illegal Change the fdecstp() and fincstp() functions to actually have the correct prototypes based on the caller, and add wrappers around FPU_illegal() for adapting those. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/lkml/20240213095631.454543-1-arnd@kernel.org	2024-04-08 16:06:22 +02:00
Adam Dunlap	5ce344beac	x86/apic: Force native_apic_mem_read() to use the MOV instruction When done from a virtual machine, instructions that touch APIC memory must be emulated. By convention, MMIO accesses are typically performed via io.h helpers such as readl() or writeq() to simplify instruction emulation/decoding (ex: in KVM hosts and SEV guests) [0]. Currently, native_apic_mem_read() does not follow this convention, allowing the compiler to emit instructions other than the MOV instruction generated by readl(). In particular, when the kernel is compiled with clang and run as a SEV-ES or SEV-SNP guest, the compiler would emit a TESTL instruction which is not supported by the SEV-ES emulator, causing a boot failure in that environment. It is likely the same problem would happen in a TDX guest as that uses the same instruction emulator as SEV-ES. To make sure all emulators can emulate APIC memory reads via MOV, use the readl() function in native_apic_mem_read(). It is expected that any emulator would support MOV in any addressing mode as it is the most generic and is what is usually emitted currently. The TESTL instruction is emitted when native_apic_mem_read() is inlined into apic_mem_wait_icr_idle(). The emulator comes from insn_decode_mmio() in arch/x86/lib/insn-eval.c. It's not worth it to extend insn_decode_mmio() to support more instructions since, in theory, the compiler could choose to output nearly any instruction for such reads which would bloat the emulator beyond reason. [0] https://lore.kernel.org/all/20220405232939.73860-12-kirill.shutemov@linux.intel.com/ [ bp: Massage commit message, fix typos. ] Signed-off-by: Adam Dunlap <acdunlap@google.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Ard Biesheuvel <ardb@kernel.org> Tested-by: Kevin Loughlin <kevinloughlin@google.com> Cc: <stable@vger.kernel.org> Link: https://lore.kernel.org/r/20240318230927.2191933-1-acdunlap@google.com	2024-04-08 15:37:57 +02:00
Adrian Hunter	7e90ffb716	x86/vdso: Make delta calculation overflow safe Kernel timekeeping is designed to keep the change in cycles (since the last timer interrupt) below max_cycles, which prevents multiplication overflow when converting cycles to nanoseconds. However, if timer interrupts stop, the calculation will eventually overflow. Add protection against that. Select GENERIC_VDSO_OVERFLOW_PROTECT so that max_cycles is made available in the VDSO data page. Check against max_cycles, falling back to a slower higher precision calculation. Take advantage of the opportunity to move masking and negative motion check into the slow path. The result is a calculation that has similar performance as before. Newer machines showed performance benefit, whereas older Skylake-based hardware such as Intel Kaby Lake was seen <1% worse. Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Adrian Hunter <adrian.hunter@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240325064023.2997-9-adrian.hunter@intel.com	2024-04-08 15:03:07 +02:00
Adrian Hunter	5b26ef660a	vdso: Consolidate nanoseconds calculation Consolidate nanoseconds calculation to simplify and reduce code duplication. Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Adrian Hunter <adrian.hunter@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240325064023.2997-3-adrian.hunter@intel.com	2024-04-08 15:03:06 +02:00
Linus Torvalds	9fe30842a9	Miscellaneous x86 fixes: - Fix MCE timer reinit locking - Fix/improve CoCo guest random entropy pool init - Fix SEV-SNP late disable bugs - Fix false positive objtool build warning - Fix header dependency bug - Fix resctrl CPU offlining bug Signed-off-by: Ingo Molnar <mingo@kernel.org> -----BEGIN PGP SIGNATURE----- iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmYSVU8RHG1pbmdvQGtl cm5lbC5vcmcACgkQEnMQ0APhK1gUxA//cbhkLCyHnkM7XWPRhEMD5ukhys7rppeD 4UdRrMSwQPwZSjUVWPtY/0wL6MHerVq6ghZC483EAuwKXF2y2O3ug3MKAJP0jyaM Ap3sZ+RCZljNUg2Rs35LAIUMdRRrGBzWgi8rnxF7JYT9F11GftjNnzyTTKORJXA0 NQ7633ZBlIeDk21OOkdQq3GkwaYEbUwH69AdOtxKeyT1jOOoWbF6agt/nIw/mdgv UkHEVhw4ySvG5Gcj+h77XiWmY2nz/+iDJ763UerMACMlL2niKG9h3Q1ASFeSFDN+ TQejy/uHRjuHCCG83ebtKd921z6128e2p5g3wjAn7gXBx8ERb8+3CRxOalPrkXC2 OhcR74lxDqent2OupXqjp1ndLXwKCpnNYUiR/PIhFNjUAE+JQJ4wDTZ6VxL841sb t/U6V35/8SIQ52wMv48P2SlaYzUhZHDl4AZ0AVqrRNKAWLH0sOQneEMo+dYRnfwu L3uRCLkzs/r0g7dyJjzYdAbilFnIDdCBPzyf//gPBG31XJMZUmYdtQ8vqj8ZDfYd qVq98nowRaL16IHwHvoRiP7tBOvX+mV7ariTpASfIPz5mD9dbOsBYJ+N4KSeV5t4 9oOxwzoaEg9Z9erGfB52aG7Bwzi4XS4wfjrIxqqUXO1bpYFYsmvQe6+Ptgp3QMot OGzpalHZous= =CIt3 -----END PGP SIGNATURE----- Merge tag 'x86-urgent-2024-04-07' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Ingo Molnar: - Fix MCE timer reinit locking - Fix/improve CoCo guest random entropy pool init - Fix SEV-SNP late disable bugs - Fix false positive objtool build warning - Fix header dependency bug - Fix resctrl CPU offlining bug * tag 'x86-urgent-2024-04-07' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/retpoline: Add NOENDBR annotation to the SRSO dummy return thunk x86/mce: Make sure to grab mce_sysfs_mutex in set_bank() x86/CPU/AMD: Track SNP host status with cc_platform_*() x86/cc: Add cc_platform_set/_clear() helpers x86/kvm/Kconfig: Have KVM_AMD_SEV select ARCH_HAS_CC_PLATFORM x86/coco: Require seeding RNG with RDRAND on CoCo systems x86/numa/32: Include missing <asm/pgtable_areas.h> x86/resctrl: Fix uninitialized memory read when last CPU of domain goes offline	2024-04-07 09:33:21 -07:00
Linus Torvalds	e2948effa9	Fix a combined PEBS events bug on x86 Intel CPUs. Signed-off-by: Ingo Molnar <mingo@kernel.org> -----BEGIN PGP SIGNATURE----- iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmYSRmYRHG1pbmdvQGtl cm5lbC5vcmcACgkQEnMQ0APhK1gGLg/9GQS5B/kub1ydjoFx2sCuHe5Zl9vLXpLg 515yOSKRWnpzjDJaMlAJg9MHiNlcR2Df5rcnVLaKpQSomajFQBKpsJkX4skqAUm3 wuuCwP0Wc+XJFNDzHGb5yqWaCo06XI80BnZyGHDHDD6jJ5fypzKVyJERnGBpLwjO h2YTomtLP5j5Gq7em2d+A9pVRpKwcBOCB8K2sBnJRlPNVR190MWTOm1wEQ+vYQeR x8nIx7WaaA6SInqWVGkhloasTmeWEH89Q2wjCltZpYFnRiEa1yS/VHVT6ZQKrpOy +mBqr92tXdxT2Y+8LNAMUg5PgRVMbZoY+Glin0Q0N4Cg92BZIl8NX8wtb3oacYgd XhiRyRWaw8JDCC2mEmhlEa01M2Y7PXtcBjvOVQwoZLS/711Zyf+fHjyX4FUG8Vcb T0PgaQoterlVnN4H2uWq8Za8ubjI0TW0nRBw2oQKlSv/5ldJ2IKJsQdsbl1q4wQr TtYJY2bq5Hrn+qlFZi6jFB2KvBOUV3molXlZAPJ0Nr/Y9mkMBRVcq6ufrAunpgUB l62Ls61HHZ9+hVNIIpM8/p/rTYjeVilA7vjHGiCJFcsclvPNBBkobtbpri/ioE0t 4+pH60LsMBIwAhOAKlJ6Jzf4LaWEikJfDPpj8yMKixGOxDT542rUN7A3NwdP2H6b 2fV3Nyr7sEw= =v7jY -----END PGP SIGNATURE----- Merge tag 'perf-urgent-2024-04-07' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 perf fix from Ingo Molnar: "Fix a combined PEBS events bug on x86 Intel CPUs" * tag 'perf-urgent-2024-04-07' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf/x86/intel/ds: Don't clear ->pebs_data_cfg for the last PEBS event	2024-04-07 09:14:46 -07:00
Borislav Petkov (AMD)	b377c66ae3	x86/retpoline: Add NOENDBR annotation to the SRSO dummy return thunk srso_alias_untrain_ret() is special code, even if it is a dummy which is called in the !SRSO case, so annotate it like its real counterpart, to address the following objtool splat: vmlinux.o: warning: objtool: .export_symbol+0x2b290: data relocation to !ENDBR: srso_alias_untrain_ret+0x0 Fixes: `4535e1a417` ("x86/bugs: Fix the SRSO mitigation on Zen3/4") Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/20240405144637.17908-1-bp@kernel.org	2024-04-06 13:01:50 +02:00
Ingo Molnar	5f2ca44ed2	Merge branch 'linus' into x86/urgent, to pick up dependent commit We want to fix: `0e11073247` ("x86/retpoline: Do the necessary fixup to the Zen3/4 srso return thunk for !SRSO") So merge in Linus's latest into x86/urgent to have it available. Signed-off-by: Ingo Molnar <mingo@kernel.org>	2024-04-06 13:00:32 +02:00
Uros Bizjak	93cfa544cf	x86/percpu: Introduce raw_cpu_read_long() to reduce ifdeffery Introduce raw_cpu_read_long() macro to slightly reduce ifdeffery in <asm/percpu.h>. No functional changes intended. Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/20240404094218.448963-3-ubizjak@gmail.com	2024-04-06 12:42:17 +02:00
Uros Bizjak	a3f8a3a2cf	x86/percpu: Rewrite x86_this_cpu_test_bit() and friends as macros Rewrite the whole family of x86_this_cpu_test_bit() functions as macros, so standard __my_cpu_var() and raw_cpu_read() macros can be used on percpu variables. This approach considerably simplifies implementation of functions and also introduces standard checks on accessed percpu variables. No functional changes intended. Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/20240404094218.448963-2-ubizjak@gmail.com	2024-04-06 12:42:17 +02:00
Uros Bizjak	4c3677c077	x86/percpu: Fix x86_this_cpu_variable_test_bit() asm template Fix x86_this_cpu_variable_test_bit(), which is implemented with an incorrect asm template, where argument 2 (count argument) is considered a percpu variable. However, x86_this_cpu_test_bit() is currently used exclusively with constant bit number argument, so the called x86_this_cpu_variable_test_bit() function is never instantiated. The fix introduces named assembler operands to prevent this kind of error. Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Link: https://lore.kernel.org/r/20240404094218.448963-1-ubizjak@gmail.com	2024-04-06 12:42:17 +02:00
Borislav Petkov (AMD)	3287c22957	x86/microcode/AMD: Remove unused PATCH_MAX_SIZE macro Orphaned after `05e91e7211` ("x86/microcode/AMD: Rip out static buffers") No functional changes. Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>	2024-04-05 23:33:08 +02:00
Arnd Bergmann	9e11fc78e2	x86/microcode/AMD: Avoid -Wformat warning with clang-15 Older versions of clang show a warning for amd.c after a fix for a gcc warning: arch/x86/kernel/cpu/microcode/amd.c:478:47: error: format specifies type \ 'unsigned char' but the argument has type 'u16' (aka 'unsigned short') [-Werror,-Wformat] "amd-ucode/microcode_amd_fam%02hhxh.bin", family); ~~~~~~ ^~~~~~ %02hx In clang-16 and higher, this warning is disabled by default, but clang-15 is still supported, and it's trivial to avoid by adapting the types according to the range of the passed data and the format string. [ bp: Massage commit message. ] Fixes: `2e9064facc` ("x86/microcode/amd: Fix snprintf() format string warning in W=1 build") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20240405204919.1003409-1-arnd@kernel.org	2024-04-05 23:26:18 +02:00
Linus Torvalds	af709adfaa	8 hotfixes, 3 are cc:stable There are a couple of fixups for this cycle's vmalloc changes and one for the stackdepot changes. And a fix for a very old x86 PAT issue which can cause a warning splat. -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZhBEXAAKCRDdBJ7gKXxA ju9fAQCxPdqApKQ49IAJpUtMcRJmI594dmB+/CfrFgiS+GaQcwEA1+2SidI9fQWT R/fcKrRr4+zlgQw0T0aSDR1HBLUPxw0= =rkxT -----END PGP SIGNATURE----- Merge tag 'mm-hotfixes-stable-2024-04-05-11-30' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc fixes from Andrew Morton: "8 hotfixes, 3 are cc:stable There are a couple of fixups for this cycle's vmalloc changes and one for the stackdepot changes. And a fix for a very old x86 PAT issue which can cause a warning splat" * tag 'mm-hotfixes-stable-2024-04-05-11-30' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: stackdepot: rename pool_index to pool_index_plus_1 x86/mm/pat: fix VM_PAT handling in COW mappings MAINTAINERS: change vmware.com addresses to broadcom.com selftests/mm: include strings.h for ffsl mm: vmalloc: fix lockdep warning mm: vmalloc: bail out early in find_vmap_area() if vmap is not init init: open output files from cpio unpacking with O_LARGEFILE mm/secretmem: fix GUP-fast succeeding on secretmem folios	2024-04-05 13:30:01 -07:00
David Hildenbrand	04c35ab3bd	x86/mm/pat: fix VM_PAT handling in COW mappings PAT handling won't do the right thing in COW mappings: the first PTE (or, in fact, all PTEs) can be replaced during write faults to point at anon folios. Reliably recovering the correct PFN and cachemode using follow_phys() from PTEs will not work in COW mappings. Using follow_phys(), we might just get the address+protection of the anon folio (which is very wrong), or fail on swap/nonswap entries, failing follow_phys() and triggering a WARN_ON_ONCE() in untrack_pfn() and track_pfn_copy(), not properly calling free_pfn_range(). In free_pfn_range(), we either wouldn't call memtype_free() or would call it with the wrong range, possibly leaking memory. To fix that, let's update follow_phys() to refuse returning anon folios, and fallback to using the stored PFN inside vma->vm_pgoff for COW mappings if we run into that. We will now properly handle untrack_pfn() with COW mappings, where we don't need the cachemode. We'll have to fail fork()->track_pfn_copy() if the first page was replaced by an anon folio, though: we'd have to store the cachemode in the VMA to make this work, likely growing the VMA size. For now, lets keep it simple and let track_pfn_copy() just fail in that case: it would have failed in the past with swap/nonswap entries already, and it would have done the wrong thing with anon folios. Simple reproducer to trigger the WARN_ON_ONCE() in untrack_pfn(): <--- C reproducer ---> #include <stdio.h> #include <sys/mman.h> #include <unistd.h> #include <liburing.h> int main(void) { struct io_uring_params p = {}; int ring_fd; size_t size; char map; ring_fd = io_uring_setup(1, &p); if (ring_fd < 0) { perror("io_uring_setup"); return 1; } size = p.sq_off.array + p.sq_entries sizeof(unsigned); /* Map the submission queue ring MAP_PRIVATE / map = mmap(0, size, PROT_READ \| PROT_WRITE, MAP_PRIVATE, ring_fd, IORING_OFF_SQ_RING); if (map == MAP_FAILED) { perror("mmap"); return 1; } / We have at least one page. Let's COW it. / map = 0; pause(); return 0; } <--- C reproducer ---> On a system with 16 GiB RAM and swap configured: # ./iouring & # memhog 16G # killall iouring [ 301.552930] ------------[ cut here ]------------ [ 301.553285] WARNING: CPU: 7 PID: 1402 at arch/x86/mm/pat/memtype.c:1060 untrack_pfn+0xf4/0x100 [ 301.553989] Modules linked in: binfmt_misc nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_g [ 301.558232] CPU: 7 PID: 1402 Comm: iouring Not tainted 6.7.5-100.fc38.x86_64 #1 [ 301.558772] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebu4 [ 301.559569] RIP: 0010:untrack_pfn+0xf4/0x100 [ 301.559893] Code: 75 c4 eb cf 48 8b 43 10 8b a8 e8 00 00 00 3b 6b 28 74 b8 48 8b 7b 30 e8 ea 1a f7 000 [ 301.561189] RSP: 0018:ffffba2c0377fab8 EFLAGS: 00010282 [ 301.561590] RAX: 00000000ffffffea RBX: ffff9208c8ce9cc0 RCX: 000000010455e047 [ 301.562105] RDX: 07fffffff0eb1e0a RSI: 0000000000000000 RDI: ffff9208c391d200 [ 301.562628] RBP: 0000000000000000 R08: ffffba2c0377fab8 R09: 0000000000000000 [ 301.563145] R10: ffff9208d2292d50 R11: 0000000000000002 R12: 00007fea890e0000 [ 301.563669] R13: 0000000000000000 R14: ffffba2c0377fc08 R15: 0000000000000000 [ 301.564186] FS: 0000000000000000(0000) GS:ffff920c2fbc0000(0000) knlGS:0000000000000000 [ 301.564773] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 301.565197] CR2: 00007fea88ee8a20 CR3: 00000001033a8000 CR4: 0000000000750ef0 [ 301.565725] PKRU: 55555554 [ 301.565944] Call Trace: [ 301.566148] <TASK> [ 301.566325] ? untrack_pfn+0xf4/0x100 [ 301.566618] ? __warn+0x81/0x130 [ 301.566876] ? untrack_pfn+0xf4/0x100 [ 301.567163] ? report_bug+0x171/0x1a0 [ 301.567466] ? handle_bug+0x3c/0x80 [ 301.567743] ? exc_invalid_op+0x17/0x70 [ 301.568038] ? asm_exc_invalid_op+0x1a/0x20 [ 301.568363] ? untrack_pfn+0xf4/0x100 [ 301.568660] ? untrack_pfn+0x65/0x100 [ 301.568947] unmap_single_vma+0xa6/0xe0 [ 301.569247] unmap_vmas+0xb5/0x190 [ 301.569532] exit_mmap+0xec/0x340 [ 301.569801] __mmput+0x3e/0x130 [ 301.570051] do_exit+0x305/0xaf0 ... Link: https://lkml.kernel.org/r/20240403212131.929421-3-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reported-by: Wupeng Ma <mawupeng1@huawei.com> Closes: https://lkml.kernel.org/r/20240227122814.3781907-1-mawupeng1@huawei.com Fixes: `b1a86e15dc` ("x86, pat: remove the dependency on 'vm_pgoff' in track/untrack pfn vma routines") Fixes: `5899329b19` ("x86: PAT: implement track/untrack of pfnmap regions for x86 - v3") Acked-by: Ingo Molnar <mingo@kernel.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Borislav Petkov <bp@alien8.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-04-05 11:21:31 -07:00
Eric Biggers	aa2197f566	crypto: x86/aes-xts - wire up VAES + AVX10/512 implementation Add an AES-XTS implementation "xts-aes-vaes-avx10_512" for x86_64 CPUs with the VAES, VPCLMULQDQ, and either AVX10/512 or AVX512BW + AVX512VL extensions. This implementation uses zmm registers to operate on four AES blocks at a time. The assembly code is instantiated using a macro so that most of the source code is shared with other implementations. To avoid downclocking on older Intel CPU models, an exclusion list is used to prevent this 512-bit implementation from being used by default on some CPU models. They will use xts-aes-vaes-avx10_256 instead. For now, this exclusion list is simply coded into aesni-intel_glue.c. It may make sense to eventually move it into a more central location. xts-aes-vaes-avx10_512 is slightly faster than xts-aes-vaes-avx10_256 on some current CPUs. E.g., on AMD Zen 4, AES-256-XTS decryption throughput increases by 13% with 4096-byte inputs, or 14% with 512-byte inputs. On Intel Sapphire Rapids, AES-256-XTS decryption throughput increases by 2% with 4096-byte inputs, or 3% with 512-byte inputs. Future CPUs may provide stronger 512-bit support, in which case a larger benefit should be seen. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2024-04-05 15:46:33 +08:00
Eric Biggers	ee63fea005	crypto: x86/aes-xts - wire up VAES + AVX10/256 implementation Add an AES-XTS implementation "xts-aes-vaes-avx10_256" for x86_64 CPUs with the VAES, VPCLMULQDQ, and either AVX10/256 or AVX512BW + AVX512VL extensions. This implementation avoids using zmm registers, instead using ymm registers to operate on two AES blocks at a time. The assembly code is instantiated using a macro so that most of the source code is shared with other implementations. This is the optimal implementation on CPUs that support VAES and AVX512 but where the zmm registers should not be used due to downclocking effects, for example Intel's Ice Lake. It should also be the optimal implementation on future CPUs that support AVX10/256 but not AVX10/512. The performance is slightly better than that of xts-aes-vaes-avx2, which uses the same 256-bit vector length, due to factors such as being able to use ymm16-ymm31 to cache the AES round keys, and being able to use the vpternlogd instruction to do XORs more efficiently. For example, on Ice Lake, the throughput of decrypting 4096-byte messages with AES-256-XTS is 6.6% higher with xts-aes-vaes-avx10_256 than with xts-aes-vaes-avx2. While this is a small improvement, it is straightforward to provide this implementation (xts-aes-vaes-avx10_256) as long as we are providing xts-aes-vaes-avx2 and xts-aes-vaes-avx10_512 anyway, due to the way the _aes_xts_crypt macro is structured. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2024-04-05 15:46:33 +08:00
Eric Biggers	e787060bdf	crypto: x86/aes-xts - wire up VAES + AVX2 implementation Add an AES-XTS implementation "xts-aes-vaes-avx2" for x86_64 CPUs with the VAES, VPCLMULQDQ, and AVX2 extensions, but not AVX512 or AVX10. This implementation uses ymm registers to operate on two AES blocks at a time. The assembly code is instantiated using a macro so that most of the source code is shared with other implementations. This is the optimal implementation on AMD Zen 3. It should also be the optimal implementation on Intel Alder Lake, which similarly supports VAES but not AVX512. Comparing to xts-aes-aesni-avx on Zen 3, xts-aes-vaes-avx2 provides 70% higher AES-256-XTS decryption throughput with 4096-byte messages, or 23% higher with 512-byte messages. A large improvement is also seen with CPUs that do support AVX512 (e.g., 98% higher AES-256-XTS decryption throughput on Ice Lake with 4096-byte messages), though the following patches add AVX512 optimized implementations to get a bit more performance on those CPUs. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2024-04-05 15:46:33 +08:00

... 3 4 5 6 7 ...

46619 Commits