Clear enable_mmio_caching if hardware can't support MMIO caching and use
the dedicated flag to detect if MMIO caching is enabled instead of
assuming shadow_mmio_value==0 means MMIO caching is disabled. TDX will
use a zero value even when caching is enabled, and is_mmio_spte() isn't
so hot that it needs to avoid an extra memory access, i.e. there's no
reason to be super clever. And the clever approach may not even be more
performant, e.g. gcc-11 lands the extra check on a non-zero value inline,
but puts the enable_mmio_caching out-of-line, i.e. avoids the few extra
uops for non-MMIO SPTEs.
Cc: Isaku Yamahata <isaku.yamahata@intel.com>
Cc: Kai Huang <kai.huang@intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220420002747.3287931-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When determining whether or not a SPTE needs to have SME/SEV's memory
encryption flag set, do the moderately expensive host MMIO pfn check if
and only if the memory encryption mask is non-zero.
Note, KVM could further optimize the host MMIO checks by making a single
call to kvm_is_mmio_pfn(), but the tdp_enabled path (for EPT's memtype
handling) will likely be split out to a separate flow[*]. At that point,
a better approach would be to shove the call to kvm_is_mmio_pfn() into
VMX code so that AMD+NPT without SME doesn't get hit with an unnecessary
lookup.
[*] https://lkml.kernel.org/r/20220321224358.1305530-3-bgardon@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220415004909.2216670-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The TSC_AUX virtualization feature allows AMD SEV-ES guests to securely use
TSC_AUX (auxiliary time stamp counter data) in the RDTSCP and RDPID
instructions. The TSC_AUX value is set using the WRMSR instruction to the
TSC_AUX MSR (0xC0000103). It is read by the RDMSR, RDTSCP and RDPID
instructions. If the read/write of the TSC_AUX MSR is intercepted, then
RDTSCP and RDPID must also be intercepted when TSC_AUX virtualization
is present. However, the RDPID instruction can't be intercepted. This means
that when TSC_AUX virtualization is present, RDTSCP and TSC_AUX MSR
read/write must not be intercepted for SEV-ES (or SEV-SNP) guests.
Signed-off-by: Babu Moger <babu.moger@amd.com>
Message-Id: <165040164424.1399644.13833277687385156344.stgit@bmoger-ubuntu>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The TSC_AUX Virtualization feature allows AMD SEV-ES guests to securely use
TSC_AUX (auxiliary time stamp counter data) MSR in RDTSCP and RDPID
instructions.
The TSC_AUX MSR is typically initialized to APIC ID or another unique
identifier so that software can quickly associate returned TSC value
with the logical processor.
Add the feature bit and also include it in the kvm for detection.
Signed-off-by: Babu Moger <babu.moger@amd.com>
Acked-by: Borislav Petkov <bp@suse.de>
Message-Id: <165040157111.1399644.6123821125319995316.stgit@bmoger-ubuntu>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Fixes for (relatively) old bugs, to be merged in both the -rc and next
development trees.
The merge reconciles the ABI fixes for KVM_EXIT_SYSTEM_EVENT between
5.18 and commit c24a950ec7 ("KVM, SEV: Add KVM_EXIT_SHUTDOWN metadata
for SEV-ES", 2022-04-13).
Drop lookup_address_in_mm() now that KVM is providing it's own variant
of lookup_address_in_pgd() that is safe for use with user addresses, e.g.
guards against page tables being torn down. A variant that provides a
non-init mm is inherently dangerous and flawed, as the only reason to use
an mm other than init_mm is to walk a userspace mapping, and
lookup_address_in_pgd() does not play nice with userspace mappings, e.g.
doesn't disable IRQs to block TLB shootdowns and doesn't use READ_ONCE()
to ensure an upper level entry isn't converted to a huge page between
checking the PAGE_SIZE bit and grabbing the address of the next level
down.
This reverts commit 13c72c060f.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <YmwIi3bXr/1yhYV/@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Fixes for (relatively) old bugs, to be merged in both the -rc and next
development trees:
* Fix potential races when walking host page table
* Fix bad user ABI for KVM_EXIT_SYSTEM_EVENT
* Fix shadow page table leak when KVM runs nested
KVM uses lookup_address_in_mm() to detect the hugepage size that the host
uses to map a pfn. The function suffers from several issues:
- no usage of READ_ONCE(*). This allows multiple dereference of the same
page table entry. The TOCTOU problem because of that may cause KVM to
incorrectly treat a newly generated leaf entry as a nonleaf one, and
dereference the content by using its pfn value.
- the information returned does not match what KVM needs; for non-present
entries it returns the level at which the walk was terminated, as long
as the entry is not 'none'. KVM needs level information of only 'present'
entries, otherwise it may regard a non-present PXE entry as a present
large page mapping.
- the function is not safe for mappings that can be torn down, because it
does not disable IRQs and because it returns a PTE pointer which is never
safe to dereference after the function returns.
So implement the logic for walking host page tables directly in KVM, and
stop using lookup_address_in_mm().
Cc: Sean Christopherson <seanjc@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
Message-Id: <20220429031757.2042406-1-mizhang@google.com>
[Inline in host_pfn_mapping_level, ensure no semantic change for its
callers. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When KVM_EXIT_SYSTEM_EVENT was introduced, it included a flags
member that at the time was unused. Unfortunately this extensibility
mechanism has several issues:
- x86 is not writing the member, so it would not be possible to use it
on x86 except for new events
- the member is not aligned to 64 bits, so the definition of the
uAPI struct is incorrect for 32- on 64-bit userspace. This is a
problem for RISC-V, which supports CONFIG_KVM_COMPAT, but fortunately
usage of flags was only introduced in 5.18.
Since padding has to be introduced, place a new field in there
that tells if the flags field is valid. To allow further extensibility,
in fact, change flags to an array of 16 values, and store how many
of the values are valid. The availability of the new ndata field
is tied to a system capability; all architectures are changed to
fill in the field.
To avoid breaking compilation of userspace that was using the flags
field, provide a userspace-only union to overlap flags with data[0].
The new field is placed at the same offset for both 32- and 64-bit
userspace.
Cc: Will Deacon <will@kernel.org>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Peter Gonda <pgonda@google.com>
Cc: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reported-by: kernel test robot <lkp@intel.com>
Message-Id: <20220422103013.34832-1-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Disallow memslots and MMIO SPTEs whose gpa range would exceed the host's
MAXPHYADDR, i.e. don't create SPTEs for gfns that exceed host.MAXPHYADDR.
The TDP MMU bounds its zapping based on host.MAXPHYADDR, and so if the
guest, possibly with help from userspace, manages to coerce KVM into
creating a SPTE for an "impossible" gfn, KVM will leak the associated
shadow pages (page tables):
WARNING: CPU: 10 PID: 1122 at arch/x86/kvm/mmu/tdp_mmu.c:57
kvm_mmu_uninit_tdp_mmu+0x4b/0x60 [kvm]
Modules linked in: kvm_intel kvm irqbypass
CPU: 10 PID: 1122 Comm: set_memory_regi Tainted: G W 5.18.0-rc1+ #293
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
RIP: 0010:kvm_mmu_uninit_tdp_mmu+0x4b/0x60 [kvm]
Call Trace:
<TASK>
kvm_arch_destroy_vm+0x130/0x1b0 [kvm]
kvm_destroy_vm+0x162/0x2d0 [kvm]
kvm_vm_release+0x1d/0x30 [kvm]
__fput+0x82/0x240
task_work_run+0x5b/0x90
exit_to_user_mode_prepare+0xd2/0xe0
syscall_exit_to_user_mode+0x1d/0x40
entry_SYSCALL_64_after_hwframe+0x44/0xae
</TASK>
On bare metal, encountering an impossible gpa in the page fault path is
well and truly impossible, barring CPU bugs, as the CPU will signal #PF
during the gva=>gpa translation (or a similar failure when stuffing a
physical address into e.g. the VMCS/VMCB). But if KVM is running as a VM
itself, the MAXPHYADDR enumerated to KVM may not be the actual MAXPHYADDR
of the underlying hardware, in which case the hardware will not fault on
the illegal-from-KVM's-perspective gpa.
Alternatively, KVM could continue allowing the dodgy behavior and simply
zap the max possible range. But, for hosts with MAXPHYADDR < 52, that's
a (minor) waste of cycles, and more importantly, KVM can't reasonably
support impossible memslots when running on bare metal (or with an
accurate MAXPHYADDR as a VM). Note, limiting the overhead by checking if
KVM is running as a guest is not a safe option as the host isn't required
to announce itself to the guest in any way, e.g. doesn't need to set the
HYPERVISOR CPUID bit.
A second alternative to disallowing the memslot behavior would be to
disallow creating a VM with guest.MAXPHYADDR > host.MAXPHYADDR. That
restriction is undesirable as there are legitimate use cases for doing
so, e.g. using the highest host.MAXPHYADDR out of a pool of heterogeneous
systems so that VMs can be migrated between hosts with different
MAXPHYADDRs without running afoul of the allow_smaller_maxphyaddr mess.
Note that any guest.MAXPHYADDR is valid with shadow paging, and it is
even useful in order to test KVM with MAXPHYADDR=52 (i.e. without
any reserved physical address bits).
The now common kvm_mmu_max_gfn() is inclusive instead of exclusive.
The memslot and TDP MMU code want an exclusive value, but the name
implies the returned value is inclusive, and the MMIO path needs an
inclusive check.
Fixes: faaf05b00a ("kvm: x86/mmu: Support zapping SPTEs in the TDP MMU")
Fixes: 524a1e4e38 ("KVM: x86/mmu: Don't leak non-leaf SPTEs when zapping all SPTEs")
Cc: stable@vger.kernel.org
Cc: Maxim Levitsky <mlevitsk@redhat.com>
Cc: Ben Gardon <bgardon@google.com>
Cc: David Matlack <dmatlack@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220428233416.2446833-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM/arm64 fixes for 5.18, take #2
- Take care of faults occuring between the PARange and
IPA range by injecting an exception
- Fix S2 faults taken from a host EL0 in protected mode
- Work around Oops caused by a PMU access from a 32bit
guest when PMU has been created. This is a temporary
bodge until we fix it for good.
For reasons best known to the author, gss-proxy does not implement a
NULL procedure, and returns RPC_PROC_UNAVAIL. However we still want to
ensure that we connect to the service at setup time.
So add a quirk-flag specially for this case.
Fixes: 1d658336b0 ("SUNRPC: Add RPC based upcall mechanism for RPCGSS auth")
Cc: stable@vger.kernel.org
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Merge series from Ranjani Sridharan <ranjani.sridharan@linux.intel.com>:
This series introduces IPC abstraction for FW loading in the SOF driver
in preparation for supporting the new IPC version in the SOF firmware.
KASAN report null-ptr-deref as follows:
KASAN: null-ptr-deref in range [0x0000000000000008-0x000000000000000f]
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1.1 04/01/2014
RIP: 0010:ipmi_unregister_smi+0x7d/0xd50 drivers/char/ipmi/ipmi_msghandler.c:3680
Call Trace:
ipmi_ipmb_remove+0x138/0x1a0 drivers/char/ipmi/ipmi_ipmb.c:443
ipmi_ipmb_probe+0x409/0xda1 drivers/char/ipmi/ipmi_ipmb.c:548
i2c_device_probe+0x959/0xac0 drivers/i2c/i2c-core-base.c:563
really_probe+0x3f3/0xa70 drivers/base/dd.c:541
In ipmi_ipmb_probe(), 'iidev->intf' is not set before
ipmi_register_smi() success. And in the error handling case,
ipmi_ipmb_remove() is called to release resources, ipmi_unregister_smi()
is called without check 'iidev->intf', this will cause KASAN
null-ptr-deref issue.
General kernel style is to allow NULL to be passed into unregister
calls, so fix it that way. This allows a NULL check to be removed in
other code.
Fixes: 57c9e3c9a3 ("ipmi:ipmi_ipmb: Unregister the SMI on remove")
Reported-by: Hulk Robot <hulkci@huawei.com>
Cc: stable@vger.kernel.org # v5.17+
Cc: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: Corey Minyard <cminyard@mvista.com>
A chunk was dropped when the code handling send messages was rewritten.
Those messages shouldn't be processed normally, they are just an
indication that the message was successfully sent and the timers should
be started for the real response that should be coming later.
Add back in the missing chunk to just discard the message and go on.
Fixes: 059747c245 ("ipmi: Add support for IPMB direct messages")
Reported-by: Joe Wiese <jwiese@rackspace.com>
Cc: stable@vger.kernel.org # v5.16+
Signed-off-by: Corey Minyard <cminyard@mvista.com>
Tested-by: Joe Wiese <jwiese@rackspace.com>
In case the DTB provided by the bootloader/BootROM is before the kernel
image or outside /memory, we won't be able to access it through the
linear mapping, and get a segfault on setup_arch(). Currently OpenSBI
relocates DTB but that's not always the case (e.g. if FW_JUMP_FDT_ADDR
is not specified), and it's also not the most portable approach since
the default FW_JUMP_FDT_ADDR of the generic platform relocates the DTB
at a specific offset that may not be available. To avoid this situation
copy DTB so that it's visible through the linear mapping.
Signed-off-by: Nick Kossifidis <mick@ics.forth.gr>
Link: https://lore.kernel.org/r/20220322132839.3653682-1-mick@ics.forth.gr
Tested-by: Conor Dooley <conor.dooley@microchip.com>
Fixes: f105aa940e ("riscv: add BUILTIN_DTB support for MMU-enabled targets")
Cc: stable@vger.kernel.org
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
The pvpanic driver relies on panic notifiers to execute a callback
on panic event. Such function is executed in atomic context - the
panic function disables local IRQs, preemption and all other CPUs
that aren't running the panic code.
With that said, it's dangerous to use regular spinlocks in such path,
as introduced by commit b3c0f87746 ("misc/pvpanic: probe multiple instances").
This patch fixes that by replacing regular spinlocks with the trylock
safer approach.
It also fixes an old comment (about a long gone framebuffer code) and
the notifier priority - we should execute hypervisor notifiers early,
deferring this way the panic action to the hypervisor, as expected by
the users that are setting up pvpanic.
Fixes: b3c0f87746 ("misc/pvpanic: probe multiple instances")
Cc: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Cc: Mihai Carabas <mihai.carabas@oracle.com>
Cc: Shile Zhang <shile.zhang@linux.alibaba.com>
Cc: Wang ShaoBo <bobo.shaobowang@huawei.com>
Cc: zhenwei pi <pizhenwei@bytedance.com>
Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
Link: https://lore.kernel.org/r/20220427224924.592546-6-gpiccoli@igalia.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Fix the CONFIGs around register_sysfs_loader(),
unregister_sysfs_loader(), register_firmware_config_sysctl(), and
unregister_firmware_config_sysctl(). The full definitions of the
register_sysfs_loader() and unregister_sysfs_loader() functions should
be used whenever CONFIG_FW_LOADER_SYSFS is defined. The
register_firmware_config_sysctl() and unregister_firmware_config_sysctl()
functions should be stubbed out unless CONFIG_FW_LOADER_USER_HELPER
CONFIG_SYSCTL are both defined.
Signed-off-by: Russ Weight <russell.h.weight@intel.com>
Link: https://lore.kernel.org/r/20220426200356.126085-2-russell.h.weight@intel.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
ARM: tegra: Default configuration fixes for v5.18
This contains two updates to the default configuration needed because of
a Kconfig symbol name change. This fixes a failure that was detected in
the NVIDIA automated test farm.
* tag 'tegra-for-5.18-arm-defconfig-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tegra/linux:
ARM: config: multi v7: Enable NVIDIA Tegra video decoder driver
ARM: tegra_defconfig: Update CONFIG_TEGRA_VDE option
Link: https://lore.kernel.org/r/20220429080626.494150-1-thierry.reding@gmail.com
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Fix return value in RSB bus driver
* tag 'sunxi-fixes-for-5.18-1' of git://git.kernel.org/pub/scm/linux/kernel/git/sunxi/linux:
bus: sunxi-rsb: Fix the return value of sunxi_rsb_device_create()
Link: https://lore.kernel.org/r/Ymbkd+/dDmRJz66w@kista.localdomain
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Fix the discrepancy between the two places we check for the CP0 counter
erratum in along with the incorrect comparison of the R4400 revision
number against 0x30 which matches none and consistently consider all
R4000 and R4400 processors affected, as documented in processor errata
publications[1][2][3], following the mapping between CP0 PRId register
values and processor models:
PRId | Processor Model
---------+--------------------
00000422 | R4000 Revision 2.2
00000430 | R4000 Revision 3.0
00000440 | R4400 Revision 1.0
00000450 | R4400 Revision 2.0
00000460 | R4400 Revision 3.0
No other revision of either processor has ever been spotted.
Contrary to what has been stated in commit ce202cbb9e ("[MIPS] Assume
R4000/R4400 newer than 3.0 don't have the mfc0 count bug") marking the
CP0 counter as buggy does not preclude it from being used as either a
clock event or a clock source device. It just cannot be used as both at
a time, because in that case clock event interrupts will be occasionally
lost, and the use as a clock event device takes precedence.
Compare against 0x4ff in `can_use_mips_counter' so that a single machine
instruction is produced.
References:
[1] "MIPS R4000PC/SC Errata, Processor Revision 2.2 and 3.0", MIPS
Technologies Inc., May 10, 1994, Erratum 53, p.13
[2] "MIPS R4400PC/SC Errata, Processor Revision 1.0", MIPS Technologies
Inc., February 9, 1994, Erratum 21, p.4
[3] "MIPS R4400PC/SC Errata, Processor Revision 2.0 & 3.0", MIPS
Technologies Inc., January 24, 1995, Erratum 14, p.3
Signed-off-by: Maciej W. Rozycki <macro@orcam.me.uk>
Fixes: ce202cbb9e ("[MIPS] Assume R4000/R4400 newer than 3.0 don't have the mfc0 count bug")
Cc: stable@vger.kernel.org # v2.6.24+
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
The ffa core driver currently assigns its own driver information
to individual ffa device driver_data which is wrong. Firstly, it leaks
this core driver information to individual ffa_device and hence to
ffa_driver. Secondly the ffa_device driver_data is for use by individual
ffa_driver and not for this core driver managing all those devices.
Link: https://lore.kernel.org/r/20220429113946.2087145-2-sudeep.holla@arm.com
Fixes: d0c0bce831 ("firmware: arm_ffa: Setup in-kernel users of FFA partitions")
Signed-off-by: Sudeep Holla <sudeep.holla@arm.com>
While we pass uuid_null intentionally to ffa_partition_probe in
ffa_setup_partitions to get the count of the partitions, it must not be
uuid_null in ffa_partition_info_get which is used by the ffa_drivers
to fetch the specific partition info passing the UUID of the partition.
Fix ffa_partition_info_get by passing the received uuid down to
ffa_partition_probe so that the correct partition information is fetched.
Link: https://lore.kernel.org/r/20220429113946.2087145-1-sudeep.holla@arm.com
Fixes: d0c0bce831 ("firmware: arm_ffa: Setup in-kernel users of FFA partitions")
Reported-by: Arunachalam Ganapathy <arunachalam.ganapathy@arm.com>
Signed-off-by: Sudeep Holla <sudeep.holla@arm.com>
Per toshiba,tc358767.yaml DT binding document, port@2 the output (e)DP
port is optional. In case this port is not described in DT, the bridge
driver operates in DPI-to-DP mode. The drm_of_find_panel_or_bridge()
call in tc_probe_edp_bridge_endpoint() returns -ENODEV in case port@2
is not present in DT and this specific return value is incorrectly
propagated outside of tc_probe_edp_bridge_endpoint() function. All
other error values must be propagated and are propagated correctly.
Return 0 in case the port@2 is missing instead, that reinstates the
original behavior before the commit this patch fixes.
Fixes: 8478095a8c ("drm/bridge: tc358767: Move (e)DP bridge endpoint parsing into dedicated function")
Signed-off-by: Marek Vasut <marex@denx.de>
Cc: Jonas Karlman <jonas@kwiboo.se>
Cc: Laurent Pinchart <Laurent.pinchart@ideasonboard.com>
Cc: Lucas Stach <l.stach@pengutronix.de>
Cc: Marek Vasut <marex@denx.de>
Cc: Maxime Ripard <maxime@cerno.tech>
Cc: Neil Armstrong <narmstrong@baylibre.com>
Cc: Robert Foss <robert.foss@linaro.org>
Cc: Sam Ravnborg <sam@ravnborg.org>
Reviewed-by: Lucas Stach <l.stach@pengutronix.de>
Link: https://patchwork.freedesktop.org/patch/msgid/20220428213132.447890-1-marex@denx.de
When the rpcbind server closes the socket, we need to ensure that the
socket is closed by the kernel as soon as feasible, so add a
sk_state_change callback to trigger this close.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
When a XEN_HVM guest uses the XEN PIRQ/Eventchannel mechanism, then
PCI/MSI[-X] masking is solely controlled by the hypervisor, but contrary to
XEN_PV guests this does not disable PCI/MSI[-X] masking in the PCI/MSI
layer.
This can lead to a situation where the PCI/MSI layer masks an MSI[-X]
interrupt and the hypervisor grants the write despite the fact that it
already requested the interrupt. As a consequence interrupt delivery on the
affected device is not happening ever.
Set pci_msi_ignore_mask to prevent that like it's done for XEN_PV guests
already.
Fixes: 809f9267bb ("xen: map MSIs into pirqs")
Reported-by: Jeremi Piotrowski <jpiotrowski@linux.microsoft.com>
Reported-by: Dusty Mabe <dustymabe@redhat.com>
Reported-by: Salvatore Bonaccorso <carnil@debian.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Noah Meyerhans <noahm@debian.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/87tuaduxj5.ffs@tglx