// SPDX-License-Identifier: GPL-2.0-only
/*
 * Copyright (C) 1995 Linus Torvalds
 *
 * This file contains the setup_arch() code, which handles the architecture-dependent
 * parts of early kernel initialization.
 */
#include <linux/acpi.h>
#include <linux/console.h>
#include <linux/cpu.h>
#include <linux/crash_dump.h>
#include <linux/dma-map-ops.h>
#include <linux/efi.h>
#include <linux/ima.h>
#include <linux/init_ohci1394_dma.h>
#include <linux/initrd.h>
#include <linux/iscsi_ibft.h>
#include <linux/memblock.h>
#include <linux/panic_notifier.h>
#include <linux/pci.h>
#include <linux/root_dev.h>
#include <linux/hugetlb.h>
#include <linux/tboot.h>
#include <linux/usb/xhci-dbgp.h>
#include <linux/static_call.h>
#include <linux/swiotlb.h>
#include <linux/random.h>

#include <uapi/linux/mount.h>

#include <xen/xen.h>

#include <asm/apic.h>
#include <asm/efi.h>
#include <asm/numa.h>
#include <asm/bios_ebda.h>
#include <asm/bugs.h>
#include <asm/cacheinfo.h>
#include <asm/coco.h>
#include <asm/cpu.h>
#include <asm/efi.h>
#include <asm/gart.h>
#include <asm/hypervisor.h>
#include <asm/io_apic.h>
#include <asm/kasan.h>
#include <asm/kaslr.h>
#include <asm/mce.h>
#include <asm/memtype.h>
#include <asm/mtrr.h>
#include <asm/realmode.h>
#include <asm/olpc_ofw.h>
#include <asm/pci-direct.h>
#include <asm/prom.h>
#include <asm/proto.h>
#include <asm/thermal.h>
#include <asm/unwind.h>
#include <asm/vsyscall.h>
#include <linux/vmalloc.h>

/*
 * max_low_pfn_mapped: highest directly mapped pfn < 4 GB
 * max_pfn_mapped:     highest directly mapped pfn > 4 GB
 *
 * The direct mapping only covers E820_TYPE_RAM regions, so the ranges and
 * gaps are represented by pfn_mapped[].
 */
unsigned long max_low_pfn_mapped;
unsigned long max_pfn_mapped;

#ifdef CONFIG_DMI
RESERVE_BRK(dmi_alloc, 65536);
#endif

unsigned long _brk_start = (unsigned long)__brk_base;
unsigned long _brk_end = (unsigned long)__brk_base;
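
/*
 * Note (from the original brk-allocator changelog): the brk region lives in
 * the bss ELF segment, delimited by __brk_base and __brk_limit, past
 * __bss_stop, so it is mapped by the same code that maps the kernel and
 * bootloaders keep it free. extend_brk() below hands out pieces of it to
 * very early boot code; whatever remains unused ultimately goes back to the
 * general kernel memory pool.
 */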

struct boot_params boot_params;

/*
 * These are the four main kernel memory regions; we put them into
 * the resource tree so that kdump tools and other debugging tools
 * can recover them:
 */

static struct resource rodata_resource = {
	.name	= "Kernel rodata",
	.start	= 0,
	.end	= 0,
	.flags	= IORESOURCE_BUSY | IORESOURCE_SYSTEM_RAM
};

static struct resource data_resource = {
	.name	= "Kernel data",
	.start	= 0,
	.end	= 0,
	.flags	= IORESOURCE_BUSY | IORESOURCE_SYSTEM_RAM
};

static struct resource code_resource = {
	.name	= "Kernel code",
	.start	= 0,
	.end	= 0,
	.flags	= IORESOURCE_BUSY | IORESOURCE_SYSTEM_RAM
};

static struct resource bss_resource = {
	.name	= "Kernel bss",
	.start	= 0,
	.end	= 0,
	.flags	= IORESOURCE_BUSY | IORESOURCE_SYSTEM_RAM
};

#ifdef CONFIG_X86_32
/* CPU data as detected by the assembly code in head_32.S */
struct cpuinfo_x86 new_cpu_data;

struct apm_info apm_info;
EXPORT_SYMBOL(apm_info);

#if defined(CONFIG_X86_SPEEDSTEP_SMI) || \
	defined(CONFIG_X86_SPEEDSTEP_SMI_MODULE)
struct ist_info ist_info;
EXPORT_SYMBOL(ist_info);
#else
struct ist_info ist_info;
#endif

#endif

struct cpuinfo_x86 boot_cpu_data __read_mostly;
EXPORT_SYMBOL(boot_cpu_data);

#if !defined(CONFIG_X86_PAE) || defined(CONFIG_X86_64)
__visible unsigned long mmu_cr4_features __ro_after_init;
#else
__visible unsigned long mmu_cr4_features __ro_after_init = X86_CR4_PAE;
#endif
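
/*
 * Rationale for the split above: on a 32-bit PAE build CR4.PAE must remain
 * set for paging to work at all, so it belongs in the default mask; on every
 * other configuration the mask starts out empty and is filled in during CPU
 * setup.
 */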

#ifdef CONFIG_IMA
static phys_addr_t ima_kexec_buffer_phys;
static size_t ima_kexec_buffer_size;
#endif

/* Boot loader ID and version as integers, for the benefit of proc_dointvec */
int bootloader_type, bootloader_version;

/*
 * Setup options
 */
struct screen_info screen_info;
EXPORT_SYMBOL(screen_info);
struct edid_info edid_info;
EXPORT_SYMBOL_GPL(edid_info);

extern int root_mountflags;

unsigned long saved_video_mode;

#define RAMDISK_IMAGE_START_MASK	0x07FF
#define RAMDISK_PROMPT_FLAG		0x8000
#define RAMDISK_LOAD_FLAG		0x4000

static char __initdata command_line[COMMAND_LINE_SIZE];
#ifdef CONFIG_CMDLINE_BOOL
char builtin_cmdline[COMMAND_LINE_SIZE] = CONFIG_CMDLINE;
bool builtin_cmdline_added __ro_after_init;
#endif

#if defined(CONFIG_EDD) || defined(CONFIG_EDD_MODULE)
struct edd edd;
#ifdef CONFIG_EDD_MODULE
EXPORT_SYMBOL(edd);
#endif
/**
 * copy_edd() - Copy the BIOS EDD information
 *              from boot_params into a safe place.
 */
static inline void __init copy_edd(void)
{
	memcpy(edd.mbr_signature, boot_params.edd_mbr_sig_buffer,
	       sizeof(edd.mbr_signature));
	memcpy(edd.edd_info, boot_params.eddbuf, sizeof(edd.edd_info));
	edd.mbr_signature_nr = boot_params.edd_mbr_sig_buf_entries;
	edd.edd_info_nr = boot_params.eddbuf_entries;
}
#else
static inline void __init copy_edd(void)
{
}
#endif

void * __init extend_brk(size_t size, size_t align)
{
	size_t mask = align - 1;
	void *ret;

	BUG_ON(_brk_start == 0);
	BUG_ON(align & mask);

	_brk_end = (_brk_end + mask) & ~mask;
	BUG_ON((char *)(_brk_end + size) > __brk_limit);

	ret = (void *)_brk_end;
	_brk_end += size;

	memset(ret, 0, size);

	return ret;
}
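
/*
 * A minimal usage sketch (hypothetical caller, not part of this file):
 * early boot code that needs a small dynamic buffer before the page
 * allocator exists could do
 *
 *	static u32 *early_buf;
 *	early_buf = extend_brk(SZ_4K, PAGE_SIZE);
 *
 * The returned memory is zeroed, and the request must happen before
 * reserve_brk() runs; after that _brk_start is cleared and the first
 * BUG_ON() above fires for any further caller.
 */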

#ifdef CONFIG_X86_32
static void __init cleanup_highmap(void)
{
}
#endif

static void __init reserve_brk(void)
{
	if (_brk_end > _brk_start)
		memblock_reserve(__pa_symbol(_brk_start),
				 _brk_end - _brk_start);

	/* Mark brk area as locked down and no longer taking any new allocations */
	_brk_start = 0;
}

#ifdef CONFIG_BLK_DEV_INITRD

static u64 __init get_ramdisk_image(void)
{
	u64 ramdisk_image = boot_params.hdr.ramdisk_image;

	ramdisk_image |= (u64)boot_params.ext_ramdisk_image << 32;

	if (ramdisk_image == 0)
		ramdisk_image = phys_initrd_start;

	return ramdisk_image;
}

static u64 __init get_ramdisk_size(void)
{
	u64 ramdisk_size = boot_params.hdr.ramdisk_size;

	ramdisk_size |= (u64)boot_params.ext_ramdisk_size << 32;

	if (ramdisk_size == 0)
		ramdisk_size = phys_initrd_size;

	return ramdisk_size;
}
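
/*
 * The phys_initrd_start/phys_initrd_size fallbacks above come from the
 * "initrdmem=ss[KMG],nn[KMG]" command line option, which points at an
 * initrd placed at a raw physical address (typically in flash with no
 * firmware file system around it). Values handed over by the bootloader
 * in boot_params take precedence; the command line is deliberately the
 * last choice tried.
 */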

static void __init relocate_initrd(void)
{
	/* Assume only end is not page aligned */
	u64 ramdisk_image = get_ramdisk_image();
	u64 ramdisk_size  = get_ramdisk_size();
	u64 area_size     = PAGE_ALIGN(ramdisk_size);

	/* We need to move the initrd down into directly mapped mem */
	u64 relocated_ramdisk = memblock_phys_alloc_range(area_size, PAGE_SIZE, 0,
							  PFN_PHYS(max_pfn_mapped));
	if (!relocated_ramdisk)
		panic("Cannot find place for new RAMDISK of size %lld\n",
		      ramdisk_size);

	initrd_start = relocated_ramdisk + PAGE_OFFSET;
	initrd_end = initrd_start + ramdisk_size;
	printk(KERN_INFO "Allocated new RAMDISK: [mem %#010llx-%#010llx]\n",
	       relocated_ramdisk, relocated_ramdisk + ramdisk_size - 1);

	copy_from_early_mem((void *)initrd_start, ramdisk_image, ramdisk_size);

	printk(KERN_INFO "Move RAMDISK from [mem %#010llx-%#010llx] to"
	       " [mem %#010llx-%#010llx]\n",
	       ramdisk_image, ramdisk_image + ramdisk_size - 1,
	       relocated_ramdisk, relocated_ramdisk + ramdisk_size - 1);
}
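
/*
 * Note on alignment (from the "free pages on page boundary" fix in this
 * function's history): the relocation target is PAGE_ALIGN()ed so that
 * freeing the initrd later releases only whole pages and cannot clobber
 * data sharing the last page, while initrd_end itself is deliberately
 * left unaligned so as not to confuse the decompressor.
 */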

static void __init early_reserve_initrd(void)
{
	/* Assume only end is not page aligned */
	u64 ramdisk_image = get_ramdisk_image();
	u64 ramdisk_size  = get_ramdisk_size();
	u64 ramdisk_end   = PAGE_ALIGN(ramdisk_image + ramdisk_size);

	if (!boot_params.hdr.type_of_loader ||
	    !ramdisk_image || !ramdisk_size)
		return;		/* No initrd provided by bootloader */

	memblock_reserve(ramdisk_image, ramdisk_end - ramdisk_image);
}

static void __init reserve_initrd(void)
{
	/* Assume only end is not page aligned */
	u64 ramdisk_image = get_ramdisk_image();
	u64 ramdisk_size  = get_ramdisk_size();
	u64 ramdisk_end   = PAGE_ALIGN(ramdisk_image + ramdisk_size);

	if (!boot_params.hdr.type_of_loader ||
	    !ramdisk_image || !ramdisk_size)
		return;		/* No initrd provided by bootloader */

	initrd_start = 0;

	printk(KERN_INFO "RAMDISK: [mem %#010llx-%#010llx]\n", ramdisk_image,
	       ramdisk_end - 1);

	if (pfn_range_is_mapped(PFN_DOWN(ramdisk_image),
				PFN_DOWN(ramdisk_end))) {
		/* All are mapped, easy case */
		initrd_start = ramdisk_image + PAGE_OFFSET;
		initrd_end = initrd_start + ramdisk_size;
		return;
	}

	relocate_initrd();

	memblock_phys_free(ramdisk_image, ramdisk_end - ramdisk_image);
}

#else
static void __init early_reserve_initrd(void)
{
}
static void __init reserve_initrd(void)
{
}
#endif /* CONFIG_BLK_DEV_INITRD */

static void __init add_early_ima_buffer(u64 phys_addr)
{
#ifdef CONFIG_IMA
	struct ima_setup_data *data;

	data = early_memremap(phys_addr + sizeof(struct setup_data), sizeof(*data));
	if (!data) {
		pr_warn("setup: failed to memremap ima_setup_data entry\n");
		return;
	}

	if (data->size) {
		memblock_reserve(data->addr, data->size);
		ima_kexec_buffer_phys = data->addr;
		ima_kexec_buffer_size = data->size;
	}

	early_memunmap(data, sizeof(*data));
#else
	pr_warn("Passed IMA kexec data, but CONFIG_IMA not set. Ignoring.\n");
#endif
}

#if defined(CONFIG_HAVE_IMA_KEXEC) && !defined(CONFIG_OF_FLATTREE)
int __init ima_free_kexec_buffer(void)
{
	if (!ima_kexec_buffer_size)
		return -ENOENT;

	memblock_free_late(ima_kexec_buffer_phys,
			   ima_kexec_buffer_size);

	ima_kexec_buffer_phys = 0;
	ima_kexec_buffer_size = 0;

	return 0;
}

int __init ima_get_kexec_buffer(void **addr, size_t *size)
{
	if (!ima_kexec_buffer_size)
		return -ENOENT;

	*addr = __va(ima_kexec_buffer_phys);
	*size = ima_kexec_buffer_size;

	return 0;
}
#endif
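
/*
 * The buffer handled above carries the IMA measurement list across kexec:
 * the previous kernel reserves it and describes it with a SETUP_IMA
 * setup_data entry, and the new kernel either imports it via
 * ima_get_kexec_buffer() or hands the memory back via
 * ima_free_kexec_buffer().
 */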

static void __init parse_setup_data(void)
{
	struct setup_data *data;
	u64 pa_data, pa_next;

	pa_data = boot_params.hdr.setup_data;
	while (pa_data) {
		u32 data_len, data_type;

		data = early_memremap(pa_data, sizeof(*data));
		data_len = data->len + sizeof(struct setup_data);
		data_type = data->type;
		pa_next = data->next;
		early_memunmap(data, sizeof(*data));

		switch (data_type) {
		case SETUP_E820_EXT:
			e820__memory_setup_extended(pa_data, data_len);
			break;
		case SETUP_DTB:
			add_dtb(pa_data);
			break;
		case SETUP_EFI:
			parse_efi_setup(pa_data, data_len);
			break;
		case SETUP_IMA:
			add_early_ima_buffer(pa_data);
			break;
		case SETUP_RNG_SEED:
			data = early_memremap(pa_data, data_len);
			add_bootloader_randomness(data->data, data->len);
			/* Zero seed for forward secrecy. */
			memzero_explicit(data->data, data->len);
			/* Zero length in case we find ourselves back here by accident. */
			memzero_explicit(&data->len, sizeof(data->len));
			early_memunmap(data, data_len);
			break;
		default:
			break;
		}
		pa_data = pa_next;
	}
}
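
/*
 * For reference, the list walked above is a singly linked chain of
 * physically addressed nodes handed over by the bootloader (layout from
 * arch/x86/include/uapi/asm/bootparam.h):
 *
 *	struct setup_data {
 *		__u64 next;	- physical address of the next node, 0 ends the chain
 *		__u32 type;	- SETUP_E820_EXT, SETUP_DTB, SETUP_EFI, SETUP_IMA, ...
 *		__u32 len;	- length of the payload that follows
 *		__u8  data[];
 *	};
 *
 * Each node is early_memremap()'d only long enough to read its header,
 * since the full direct mapping of RAM is not available this early.
 */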

static void __init memblock_x86_reserve_range_setup_data(void)
{
	struct setup_indirect *indirect;
	struct setup_data *data;
	u64 pa_data, pa_next;
	u32 len;

	pa_data = boot_params.hdr.setup_data;
	while (pa_data) {
		data = early_memremap(pa_data, sizeof(*data));
		if (!data) {
			pr_warn("setup: failed to memremap setup_data entry\n");
			return;
		}

		len = sizeof(*data);
		pa_next = data->next;

		memblock_reserve(pa_data, sizeof(*data) + data->len);

		if (data->type == SETUP_INDIRECT) {
			len += data->len;
			early_memunmap(data, sizeof(*data));
			data = early_memremap(pa_data, len);
			if (!data) {
				pr_warn("setup: failed to memremap indirect setup_data\n");
				return;
			}

			indirect = (struct setup_indirect *)data->data;

			if (indirect->type != SETUP_INDIRECT)
				memblock_reserve(indirect->addr, indirect->len);
		}

		pa_data = pa_next;
		early_memunmap(data, len);
	}
}

static void __init arch_reserve_crashkernel(void)
{
	unsigned long long crash_base, crash_size, low_size = 0;
	char *cmdline = boot_command_line;
	bool high = false;
	int ret;

	if (!IS_ENABLED(CONFIG_CRASH_RESERVE))
		return;

	ret = parse_crashkernel(cmdline, memblock_phys_mem_size(),
				&crash_size, &crash_base,
				&low_size, &high);
	if (ret)
		return;

	if (xen_pv_domain()) {
		pr_info("Ignoring crashkernel for a Xen PV domain\n");
		return;
	}

	reserve_crashkernel_generic(cmdline, crash_size, crash_base,
				    low_size, high);
}
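
/*
 * Typical crashkernel= forms accepted by parse_crashkernel() above (see
 * Documentation/admin-guide/kernel-parameters.txt for the full grammar):
 *
 *	crashkernel=512M	- reserve 512M, base chosen automatically
 *	crashkernel=512M@64M	- reserve 512M starting at physical 64M
 *	crashkernel=1G,high	- prefer memory above 4G; an optional
 *				  ",low" variant sizes the extra DMA
 *				  window below 4G
 */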

static struct resource standard_io_resources[] = {
	{ .name = "dma1", .start = 0x00, .end = 0x1f,
		.flags = IORESOURCE_BUSY | IORESOURCE_IO },
	{ .name = "pic1", .start = 0x20, .end = 0x21,
		.flags = IORESOURCE_BUSY | IORESOURCE_IO },
	{ .name = "timer0", .start = 0x40, .end = 0x43,
		.flags = IORESOURCE_BUSY | IORESOURCE_IO },
	{ .name = "timer1", .start = 0x50, .end = 0x53,
		.flags = IORESOURCE_BUSY | IORESOURCE_IO },
	{ .name = "keyboard", .start = 0x60, .end = 0x60,
		.flags = IORESOURCE_BUSY | IORESOURCE_IO },
	{ .name = "keyboard", .start = 0x64, .end = 0x64,
		.flags = IORESOURCE_BUSY | IORESOURCE_IO },
	{ .name = "dma page reg", .start = 0x80, .end = 0x8f,
		.flags = IORESOURCE_BUSY | IORESOURCE_IO },
	{ .name = "pic2", .start = 0xa0, .end = 0xa1,
		.flags = IORESOURCE_BUSY | IORESOURCE_IO },
	{ .name = "dma2", .start = 0xc0, .end = 0xdf,
		.flags = IORESOURCE_BUSY | IORESOURCE_IO },
	{ .name = "fpu", .start = 0xf0, .end = 0xff,
		.flags = IORESOURCE_BUSY | IORESOURCE_IO }
};

void __init reserve_standard_io_resources(void)
{
	int i;

	/* request I/O space for devices used on all i[345]86 PCs */
	for (i = 0; i < ARRAY_SIZE(standard_io_resources); i++)
		request_resource(&ioport_resource, &standard_io_resources[i]);
}

static bool __init snb_gfx_workaround_needed(void)
{
#ifdef CONFIG_PCI
	int i;
	u16 vendor, devid;
	static const __initconst u16 snb_ids[] = {
		0x0102,
		0x0112,
		0x0122,
		0x0106,
		0x0116,
		0x0126,
		0x010a,
	};

	/* Assume no if something weird is going on with PCI */
	if (!early_pci_allowed())
		return false;

	vendor = read_pci_config_16(0, 2, 0, PCI_VENDOR_ID);
	if (vendor != 0x8086)
		return false;

	devid = read_pci_config_16(0, 2, 0, PCI_DEVICE_ID);
	for (i = 0; i < ARRAY_SIZE(snb_ids); i++)
		if (devid == snb_ids[i])
			return true;
#endif

	return false;
}

/*
 * Sandy Bridge graphics has trouble with certain ranges, exclude
 * them from allocation.
 */
static void __init trim_snb_memory(void)
{
	static const __initconst unsigned long bad_pages[] = {
		0x20050000,
		0x20110000,
		0x20130000,
		0x20138000,
		0x40004000,
	};
	int i;

	if (!snb_gfx_workaround_needed())
		return;

	printk(KERN_DEBUG "reserving inaccessible SNB gfx pages\n");

	/*
	 * SandyBridge integrated graphics devices have a bug that prevents
	 * them from accessing certain memory ranges, namely anything below
	 * 1M and in the pages listed in bad_pages[] above.
	 *
	 * To avoid these pages being ever accessed by SNB gfx devices reserve
	 * bad_pages that have not already been reserved at boot time.
	 * All memory below the 1 MB mark is anyway reserved later during
	 * setup_arch(), so there is no need to reserve it here.
	 */
	for (i = 0; i < ARRAY_SIZE(bad_pages); i++) {
		if (memblock_reserve(bad_pages[i], PAGE_SIZE))
			printk(KERN_WARNING "failed to reserve 0x%08lx\n",
			       bad_pages[i]);
	}
}

static void __init trim_bios_range(void)
{
	/*
	 * A special case is the first 4Kb of memory; this is a BIOS-owned
	 * area, not kernel RAM, but generally not listed as such in the
	 * E820 table.
	 *
	 * This typically reserves additional memory (64KiB by default)
	 * since some BIOSes are known to corrupt low memory. See the
	 * Kconfig help text for X86_RESERVE_LOW.
	 */
	e820__range_update(0, PAGE_SIZE, E820_TYPE_RAM, E820_TYPE_RESERVED);

	/*
	 * Special case: some BIOSes report the PC BIOS area
	 * (640Kb -> 1Mb) as RAM even though it is not. Take it out.
	 */
	e820__range_remove(BIOS_BEGIN, BIOS_END - BIOS_BEGIN, E820_TYPE_RAM, 1);

	e820__update_table(e820_table);
}
2013-01-24 20:19:45 +00:00
|
|
|
/* called before trim_bios_range() to spare extra sanitize */
|
|
|
|
static void __init e820_add_kernel_range(void)
|
|
|
|
{
|
|
|
|
u64 start = __pa_symbol(_text);
|
|
|
|
u64 size = __pa_symbol(_end) - start;
|
|
|
|
|
|
|
|
/*
|
2017-01-28 16:09:33 +00:00
|
|
|
* Complain if .text .data and .bss are not marked as E820_TYPE_RAM and
|
2013-01-24 20:19:45 +00:00
|
|
|
* attempt to fix it by adding the range. We may have a confused BIOS,
|
|
|
|
* or the user may have used memmap=exactmap or memmap=xxM$yyM to
|
|
|
|
* exclude the kernel range. If we really are running on top of non-RAM,
|
|
|
|
* we will crash later anyway.
|
|
|
|
*/
|
2017-01-28 16:09:33 +00:00
|
|
|
if (e820__mapped_all(start, start + size, E820_TYPE_RAM))
|
2013-01-24 20:19:45 +00:00
|
|
|
return;
|
|
|
|
|
2017-01-28 16:09:33 +00:00
|
|
|
pr_warn(".text .data .bss are not marked as E820_TYPE_RAM!\n");
|
|
|
|
e820__range_remove(start, size, E820_TYPE_RAM, 0);
|
|
|
|
e820__range_add(start, size, E820_TYPE_RAM);
|
2013-01-24 20:19:45 +00:00
|
|
|
}
|
|
|
|
|
2021-03-02 10:04:05 +00:00
|
|
|
static void __init early_reserve_memory(void)
|
2013-02-14 22:02:52 +00:00
|
|
|
{
|
2021-03-02 10:04:05 +00:00
|
|
|
/*
|
|
|
|
* Reserve the memory occupied by the kernel between _text and
|
|
|
|
* __end_of_kernel_reserve symbols. Any kernel sections after the
|
|
|
|
* __end_of_kernel_reserve symbol must be explicitly reserved with a
|
|
|
|
* separate memblock_reserve() or they will be discarded.
|
|
|
|
*/
|
|
|
|
memblock_reserve(__pa_symbol(_text),
|
|
|
|
(unsigned long)__end_of_kernel_reserve - (unsigned long)_text);
|
|
|
|
|
|
|
|
/*
|
2021-03-02 10:04:06 +00:00
|
|
|
* The first 4 KB of memory is a BIOS-owned area, but generally it is
|
|
|
|
* not listed as such in the E820 table.
|
|
|
|
*
|
2021-06-01 07:53:52 +00:00
|
|
|
* Reserve the first 64K of memory since some BIOSes are known to
|
|
|
|
* corrupt low memory. After the real mode trampoline is allocated, the
|
|
|
|
* rest of the memory below 640 KB is reserved.
|
2021-03-02 10:04:06 +00:00
|
|
|
*
|
|
|
|
* In addition, make sure page 0 is always reserved because on
|
|
|
|
* systems with L1TF its contents can be leaked to user processes.
|
2021-03-02 10:04:05 +00:00
|
|
|
*/
|
2021-06-01 07:53:52 +00:00
|
|
|
memblock_reserve(0, SZ_64K);
|
2021-03-02 10:04:05 +00:00
|
|
|
|
|
|
|
early_reserve_initrd();
|
|
|
|
|
|
|
|
memblock_x86_reserve_range_setup_data();
|
|
|
|
|
|
|
|
reserve_bios_regions();
|
2021-06-01 07:53:52 +00:00
|
|
|
trim_snb_memory();
|
2013-02-14 22:02:52 +00:00
|
|
|
}
|
2021-03-02 10:04:05 +00:00
|
|
|
|
2013-10-11 00:18:17 +00:00
|
|
|
/*
|
|
|
|
* Dump out kernel offset information on panic.
|
|
|
|
*/
|
|
|
|
static int
|
|
|
|
dump_kernel_offset(struct notifier_block *self, unsigned long v, void *p)
|
|
|
|
{
|
2015-04-01 10:49:52 +00:00
|
|
|
if (kaslr_enabled()) {
|
|
|
|
pr_emerg("Kernel Offset: 0x%lx from 0x%lx (relocation range: 0x%lx-0x%lx)\n",
|
2015-04-27 11:17:19 +00:00
|
|
|
kaslr_offset(),
|
2015-04-01 10:49:52 +00:00
|
|
|
__START_KERNEL,
|
|
|
|
__START_KERNEL_map,
|
|
|
|
MODULES_VADDR-1);
|
|
|
|
} else {
|
|
|
|
pr_emerg("Kernel Offset: disabled\n");
|
|
|
|
}
|
2013-10-11 00:18:17 +00:00
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
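For context, a hedged sketch of how such a handler is typically hooked onto
the panic notifier chain; the kernel_offset_notifier and
register_kernel_offset_dumper names follow the usual pattern but are shown
here as an illustration:

	static struct notifier_block kernel_offset_notifier = {
		.notifier_call = dump_kernel_offset
	};

	static int __init register_kernel_offset_dumper(void)
	{
		atomic_notifier_chain_register(&panic_notifier_list,
					       &kernel_offset_notifier);
		return 0;
	}
	__initcall(register_kernel_offset_dumper);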
|
|
|
|
|
2022-01-27 11:56:25 +00:00
|
|
|
void x86_configure_nx(void)
|
|
|
|
{
|
|
|
|
if (boot_cpu_has(X86_FEATURE_NX))
|
|
|
|
__supported_pte_mask |= _PAGE_NX;
|
|
|
|
else
|
|
|
|
__supported_pte_mask &= ~_PAGE_NX;
|
|
|
|
}
|
|
|
|
|
|
|
|
static void __init x86_report_nx(void)
|
|
|
|
{
|
|
|
|
if (!boot_cpu_has(X86_FEATURE_NX)) {
|
|
|
|
printk(KERN_NOTICE "Notice: NX (Execute Disable) protection "
|
|
|
|
"missing in CPU!\n");
|
|
|
|
} else {
|
|
|
|
#if defined(CONFIG_X86_64) || defined(CONFIG_X86_PAE)
|
|
|
|
printk(KERN_INFO "NX (Execute Disable) protection: active\n");
|
|
|
|
#else
|
|
|
|
/* 32bit non-PAE kernel, NX cannot be used */
|
|
|
|
printk(KERN_NOTICE "Notice: NX (Execute Disable) protection "
|
|
|
|
"cannot be enabled: non-PAE kernel!\n");
|
|
|
|
#endif
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2005-04-16 22:20:36 +00:00
|
|
|
/*
|
|
|
|
* Determine if we were loaded by an EFI loader. If so, then we have also been
|
|
|
|
* passed the efi memmap, systab, etc., so we should use these data structures
|
|
|
|
* for initialization. Note, the efi init code path is determined by the
|
|
|
|
* global efi_enabled. This allows the same kernel image to be used on existing
|
|
|
|
* systems (with a traditional BIOS) as well as on EFI systems.
|
|
|
|
*/
|
2008-06-26 00:52:35 +00:00
|
|
|
/*
|
|
|
|
* setup_arch - architecture-specific boot-time initializations
|
|
|
|
*
|
|
|
|
* Note: On x86_64, fixmaps are ready for use even before this is called.
|
|
|
|
*/
|
|
|
|
|
2005-04-16 22:20:36 +00:00
|
|
|
void __init setup_arch(char **cmdline_p)
|
|
|
|
{
|
2008-06-26 00:52:35 +00:00
|
|
|
#ifdef CONFIG_X86_32
|
2005-04-16 22:20:36 +00:00
|
|
|
memcpy(&boot_cpu_data, &new_cpu_data, sizeof(new_cpu_data));
|
2010-08-28 13:58:33 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* copy kernel address range established so far and switch
|
|
|
|
* to the proper swapper page table
|
|
|
|
*/
|
|
|
|
clone_pgd_range(swapper_pg_dir + KERNEL_PGD_BOUNDARY,
|
|
|
|
initial_page_table + KERNEL_PGD_BOUNDARY,
|
|
|
|
KERNEL_PGD_PTRS);
|
|
|
|
|
|
|
|
load_cr3(swapper_pg_dir);
|
2014-10-07 00:19:48 +00:00
|
|
|
/*
|
|
|
|
* Note: Quark X1000 CPUs advertise PGE incorrectly and require
|
|
|
|
* a cr3 based tlb flush, so the following __flush_tlb_all()
|
2019-11-18 15:03:39 +00:00
|
|
|
* will not flush anything because the CPU quirk which clears
|
2014-10-07 00:19:48 +00:00
|
|
|
* X86_FEATURE_PGE has not been invoked yet. Though due to the
|
|
|
|
* load_cr3() above the TLB has been flushed already. The
|
|
|
|
* quirk is invoked before subsequent calls to __flush_tlb_all()
|
|
|
|
* so proper operation is guaranteed.
|
|
|
|
*/
|
2010-08-28 13:58:33 +00:00
|
|
|
__flush_tlb_all();
|
2008-06-26 00:52:35 +00:00
|
|
|
#else
|
|
|
|
printk(KERN_INFO "Command line: %s\n", boot_command_line);
|
2018-02-14 11:16:54 +00:00
|
|
|
boot_cpu_data.x86_phys_bits = MAX_PHYSMEM_BITS;
|
2008-06-26 00:52:35 +00:00
|
|
|
#endif
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2024-03-28 15:42:12 +00:00
|
|
|
#ifdef CONFIG_CMDLINE_BOOL
|
|
|
|
#ifdef CONFIG_CMDLINE_OVERRIDE
|
|
|
|
strscpy(boot_command_line, builtin_cmdline, COMMAND_LINE_SIZE);
|
|
|
|
#else
|
|
|
|
if (builtin_cmdline[0]) {
|
|
|
|
/* append boot loader cmdline to builtin */
|
|
|
|
strlcat(builtin_cmdline, " ", COMMAND_LINE_SIZE);
|
|
|
|
strlcat(builtin_cmdline, boot_command_line, COMMAND_LINE_SIZE);
|
|
|
|
strscpy(boot_command_line, builtin_cmdline, COMMAND_LINE_SIZE);
|
|
|
|
}
|
|
|
|
#endif
|
2024-04-08 17:46:03 +00:00
|
|
|
builtin_cmdline_added = true;
|
2024-03-28 15:42:12 +00:00
|
|
|
#endif
|
|
|
|
|
|
|
|
strscpy(command_line, boot_command_line, COMMAND_LINE_SIZE);
|
|
|
|
*cmdline_p = command_line;
|
|
|
|
|
2010-08-23 21:49:11 +00:00
|
|
|
/*
|
|
|
|
* If we have OLPC OFW, we might end up relocating the fixmap due to
|
|
|
|
* reserve_top(), so do this before touching the ioremap area.
|
|
|
|
*/
|
2010-06-18 21:46:53 +00:00
|
|
|
olpc_ofw_detect();
|
|
|
|
|
2017-08-28 06:47:50 +00:00
|
|
|
idt_setup_early_traps();
|
2008-07-21 23:49:54 +00:00
|
|
|
early_cpu_init();
|
2018-07-19 20:55:28 +00:00
|
|
|
jump_label_init();
|
2020-08-18 13:57:51 +00:00
|
|
|
static_call_init();
|
2008-06-30 03:02:44 +00:00
|
|
|
early_ioremap_init();
|
|
|
|
|
2010-06-18 21:46:53 +00:00
|
|
|
setup_olpc_ofw_pgd();
|
|
|
|
|
2007-10-16 00:13:22 +00:00
|
|
|
ROOT_DEV = old_decode_dev(boot_params.hdr.root_dev);
|
|
|
|
screen_info = boot_params.screen_info;
|
|
|
|
edid_info = boot_params.edid_info;
|
2008-06-26 00:52:35 +00:00
|
|
|
#ifdef CONFIG_X86_32
|
2007-10-16 00:13:22 +00:00
|
|
|
apm_info.bios = boot_params.apm_bios_info;
|
|
|
|
ist_info = boot_params.ist_info;
|
2008-06-26 00:52:35 +00:00
|
|
|
#endif
|
|
|
|
saved_video_mode = boot_params.hdr.vid_mode;
|
2007-10-16 00:13:22 +00:00
|
|
|
bootloader_type = boot_params.hdr.type_of_loader;
|
2009-05-07 23:54:11 +00:00
|
|
|
if ((bootloader_type >> 4) == 0xe) {
|
|
|
|
bootloader_type &= 0xf;
|
|
|
|
bootloader_type |= (boot_params.hdr.ext_loader_type+0x10) << 4;
|
|
|
|
}
|
|
|
|
bootloader_version = bootloader_type & 0xf;
|
|
|
|
bootloader_version |= boot_params.hdr.ext_loader_ver << 4;
|
2005-04-16 22:20:36 +00:00
|
|
|
|
|
|
|
#ifdef CONFIG_BLK_DEV_RAM
|
2007-10-16 00:13:22 +00:00
|
|
|
rd_image_start = boot_params.hdr.ram_size & RAMDISK_IMAGE_START_MASK;
|
2005-04-16 22:20:36 +00:00
|
|
|
#endif
|
2008-06-24 02:53:33 +00:00
|
|
|
#ifdef CONFIG_EFI
|
|
|
|
if (!strncmp((char *)&boot_params.efi_info.efi_loader_signature,
|
2014-06-30 17:53:03 +00:00
|
|
|
EFI32_LOADER_SIGNATURE, 4)) {
|
2014-01-15 13:21:22 +00:00
|
|
|
set_bit(EFI_BOOT, &efi.flags);
|
2012-02-12 21:24:29 +00:00
|
|
|
} else if (!strncmp((char *)&boot_params.efi_info.efi_loader_signature,
|
2014-06-30 17:53:03 +00:00
|
|
|
EFI64_LOADER_SIGNATURE, 4)) {
|
2014-01-15 13:21:22 +00:00
|
|
|
set_bit(EFI_BOOT, &efi.flags);
|
|
|
|
set_bit(EFI_64BIT, &efi.flags);
|
2008-06-24 02:53:33 +00:00
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
2009-08-20 11:04:10 +00:00
|
|
|
x86_init.oem.arch_setup();
|
2008-01-30 12:31:19 +00:00
|
|
|
|
2021-09-20 12:04:21 +00:00
|
|
|
/*
|
|
|
|
* Do some memory reservations *before* memory is added to memblock, so
|
|
|
|
* memblock allocations won't overwrite them.
|
|
|
|
*
|
|
|
|
* After this point, everything still needed from the boot loader or
|
|
|
|
* firmware or kernel text should be early reserved or marked not RAM in
|
|
|
|
* e820. All other memory is fair game.
|
|
|
|
*
|
|
|
|
* This call needs to happen before e820__memory_setup() which calls the
|
|
|
|
* xen_memory_setup() on Xen dom0 which relies on the fact that those
|
|
|
|
* early reservations have happened already.
|
|
|
|
*/
|
|
|
|
early_reserve_memory();
|
|
|
|
|
2010-10-26 21:41:49 +00:00
|
|
|
iomem_resource.end = (1ULL << boot_cpu_data.x86_phys_bits) - 1;
|
2017-01-28 08:58:49 +00:00
|
|
|
e820__memory_setup();
|
2008-06-30 23:20:54 +00:00
|
|
|
parse_setup_data();
|
|
|
|
|
2005-04-16 22:20:36 +00:00
|
|
|
copy_edd();
|
|
|
|
|
2007-10-16 00:13:22 +00:00
|
|
|
if (!boot_params.hdr.root_flags)
|
2005-04-16 22:20:36 +00:00
|
|
|
root_mountflags &= ~MS_RDONLY;
|
2021-07-08 01:09:03 +00:00
|
|
|
setup_initial_init_mm(_text, _etext, _edata, (void *)_brk_end);
|
x86, mpx: On-demand kernel allocation of bounds tables
This is really the meat of the MPX patch set. If there is one patch to
review in the entire series, this is the one. There is a new ABI here
and this kernel code also interacts with userspace memory in a
relatively unusual manner. (small FAQ below).
Long Description:
This patch adds two prctl() commands to enable or disable the
management of bounds tables in kernel, including on-demand kernel
allocation (See the patch "on-demand kernel allocation of bounds tables")
and cleanup (See the patch "cleanup unused bound tables"). Applications
do not strictly need the kernel to manage bounds tables and we expect
some applications to use MPX without taking advantage of this kernel
support. This means the kernel can not simply infer whether an application
needs bounds table management from the MPX registers. The prctl() is an
explicit signal from userspace.
PR_MPX_ENABLE_MANAGEMENT is meant to be a signal from userspace to
require kernel's help in managing bounds tables.
PR_MPX_DISABLE_MANAGEMENT is the opposite, meaning that userspace doesn't
want the kernel's help any more. With PR_MPX_DISABLE_MANAGEMENT, the kernel
won't allocate and free bounds tables even if the CPU supports MPX.
PR_MPX_ENABLE_MANAGEMENT will fetch the base address of the bounds
directory out of a userspace register (bndcfgu) and then cache it into
a new field (->bd_addr) in the 'mm_struct'. PR_MPX_DISABLE_MANAGEMENT
will set "bd_addr" to an invalid address. Using this scheme, we can
use "bd_addr" to determine whether the management of bounds tables in
kernel is enabled.
Also, the only way to access that bndcfgu register is via an xsaves,
which can be expensive. Caching "bd_addr" like this also helps reduce
the cost of those xsaves when doing table cleanup at munmap() time.
Unfortunately, we can not apply this optimization to #BR fault time
because we need an xsave to get the value of BNDSTATUS.
==== Why does the hardware even have these Bounds Tables? ====
MPX only has 4 hardware registers for storing bounds information.
If MPX-enabled code needs more than these 4 registers, it needs to
spill them somewhere. It has two special instructions for this
which allow the bounds to be moved between the bounds registers
and some new "bounds tables".
They are conceptually similar to a page fault and will be raised by
the MPX hardware both during bounds violations and when the tables
are not present. This patch handles those #BR exceptions for
not-present tables by carving the space out of the normal processes
address space (essentially calling the new mmap() interface introduced
earlier in this patch set.) and then pointing the bounds-directory
over to it.
The tables *need* to be accessed and controlled by userspace because
the instructions for moving bounds in and out of them are extremely
frequent. They potentially happen every time a register pointing to
memory is dereferenced. Any direct kernel involvement (like a syscall)
to access the tables would obviously destroy performance.
==== Why not do this in userspace? ====
This patch is obviously doing this allocation in the kernel.
However, MPX does not strictly *require* anything in the kernel.
It can theoretically be done completely from userspace. Here are
a few ways this *could* be done. I don't think any of them are
practical in the real-world, but here they are.
Q: Can virtual space simply be reserved for the bounds tables so
that we never have to allocate them?
A: As noted earlier, these tables are *HUGE*. An X-GB virtual
area needs 4*X GB of virtual space, plus 2GB for the bounds
directory. If we were to preallocate them for the 128TB of
user virtual address space, we would need to reserve 512TB+2GB,
which is larger than the entire virtual address space today.
This means they can not be reserved ahead of time. Also, a
single process's pre-populated bounds directory consumes 2GB
of virtual *AND* physical memory. IOW, it's completely
infeasible to prepopulate bounds directories.
Q: Can we preallocate bounds table space at the same time memory
is allocated which might contain pointers that might eventually
need bounds tables?
A: This would work if we could hook the site of each and every
memory allocation syscall. This can be done for small,
constrained applications. But, it isn't practical at a larger
scale since a given app has no way of controlling how all the
parts of the app might allocate memory (think libraries). The
kernel is really the only place to intercept these calls.
Q: Could a bounds fault be handed to userspace and the tables
allocated there in a signal handler instead of in the kernel?
A: (thanks to tglx) mmap() is not on the list of safe async
handler functions and even if mmap() would work it still
requires locking or nasty tricks to keep track of the
allocation state there.
Having ruled out all of the userspace-only approaches for managing
bounds tables that we could think of, we create them on demand in
the kernel.
Based-on-patch-by: Qiaowei Ren <qiaowei.ren@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: linux-mm@kvack.org
Cc: linux-mips@linux-mips.org
Cc: Dave Hansen <dave@sr71.net>
Link: http://lkml.kernel.org/r/20141114151829.AD4310DE@viggo.jf.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
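A minimal userspace sketch of the opt-in/opt-out flow described above. The
prctl constants are assumed to carry their historical uapi values (43/44);
on kernels without MPX support the calls simply fail and the program keeps
running unmanaged:

	#include <stdio.h>
	#include <sys/prctl.h>

	#ifndef PR_MPX_ENABLE_MANAGEMENT
	#define PR_MPX_ENABLE_MANAGEMENT	43	/* historical uapi value */
	#define PR_MPX_DISABLE_MANAGEMENT	44	/* historical uapi value */
	#endif

	int main(void)
	{
		/* Explicit opt-in: ask the kernel to manage bounds tables. */
		if (prctl(PR_MPX_ENABLE_MANAGEMENT, 0, 0, 0, 0))
			perror("PR_MPX_ENABLE_MANAGEMENT");

		/* ... run MPX-instrumented code here ... */

		/* Opt out: the kernel stops allocating/freeing tables. */
		if (prctl(PR_MPX_DISABLE_MANAGEMENT, 0, 0, 0, 0))
			perror("PR_MPX_DISABLE_MANAGEMENT");

		return 0;
	}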
2014-11-14 15:18:29 +00:00
|
|
|
|
2016-04-14 18:18:57 +00:00
|
|
|
code_resource.start = __pa_symbol(_text);
|
|
|
|
code_resource.end = __pa_symbol(_etext)-1;
|
2019-10-29 21:13:50 +00:00
|
|
|
rodata_resource.start = __pa_symbol(__start_rodata);
|
|
|
|
rodata_resource.end = __pa_symbol(__end_rodata)-1;
|
|
|
|
data_resource.start = __pa_symbol(_sdata);
|
2016-04-14 18:18:57 +00:00
|
|
|
data_resource.end = __pa_symbol(_edata)-1;
|
|
|
|
bss_resource.start = __pa_symbol(__bss_start);
|
|
|
|
bss_resource.end = __pa_symbol(__bss_stop)-1;
|
|
|
|
|
2021-12-13 11:27:56 +00:00
|
|
|
/*
|
|
|
|
* x86_configure_nx() is called before parse_early_param() to detect
|
|
|
|
* whether hardware doesn't support NX (so that the early EHCI debug
|
2022-01-27 11:56:25 +00:00
|
|
|
* console setup can safely call set_fixmap()).
|
2021-12-13 11:27:56 +00:00
|
|
|
*/
|
|
|
|
x86_configure_nx();
|
|
|
|
|
|
|
|
parse_early_param();
|
|
|
|
|
2021-12-13 11:27:57 +00:00
|
|
|
if (efi_enabled(EFI_BOOT))
|
|
|
|
efi_memblock_x86_reserve_range();
|
|
|
|
|
mm: remove x86-only restriction of movable_node
In commit c5320926e370 ("mem-hotplug: introduce movable_node boot
option"), the memblock allocation direction is changed to bottom-up and
then back to top-down like this:
1. memblock_set_bottom_up(true), called by cmdline_parse_movable_node().
2. memblock_set_bottom_up(false), called by x86's numa_init().
Even though (1) occurs in generic mm code, it is wrapped by #ifdef
CONFIG_MOVABLE_NODE, which depends on X86_64.
This means that when we extend CONFIG_MOVABLE_NODE to non-x86 arches,
things will be unbalanced. (1) will happen for them, but (2) will not.
This toggle was added in the first place because x86 has a delay between
adding memblocks and marking them as hotpluggable. Since other arches
do this marking either immediately or not at all, they do not require
the bottom-up toggle.
So, resolve things by moving (1) from cmdline_parse_movable_node() to
x86's setup_arch(), immediately after the movable_node parameter has
been parsed.
Link: http://lkml.kernel.org/r/1479160961-25840-3-git-send-email-arbab@linux.vnet.ibm.com
Signed-off-by: Reza Arbab <arbab@linux.vnet.ibm.com>
Acked-by: Balbir Singh <bsingharora@gmail.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Alistair Popple <apopple@au1.ibm.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Bharata B Rao <bharata@linux.vnet.ibm.com>
Cc: Frank Rowand <frowand.list@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Rob Herring <robh+dt@kernel.org>
Cc: Stewart Smith <stewart@linux.vnet.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-13 00:42:55 +00:00
|
|
|
#ifdef CONFIG_MEMORY_HOTPLUG
|
|
|
|
/*
|
|
|
|
* Memory used by the kernel cannot be hot-removed because Linux
|
|
|
|
* cannot migrate the kernel pages. When memory hotplug is
|
|
|
|
* enabled, we should prevent memblock from allocating memory
|
|
|
|
* for the kernel.
|
|
|
|
*
|
|
|
|
* ACPI SRAT records all hotpluggable memory ranges. But before
|
|
|
|
* SRAT is parsed, we don't know about it.
|
|
|
|
*
|
|
|
|
* The kernel image is loaded into memory at very early time. We
|
|
|
|
* cannot prevent this anyway. So on NUMA system, we set any
|
|
|
|
* node the kernel resides in as un-hotpluggable.
|
|
|
|
*
|
|
|
|
* Since on modern servers, one node could have double-digit
|
|
|
|
* gigabytes memory, we can assume the memory around the kernel
|
|
|
|
* image is also un-hotpluggable. So before SRAT is parsed, just
|
|
|
|
* allocate memory near the kernel image to try our best to keep
|
|
|
|
* the kernel away from hotpluggable memory.
|
|
|
|
*/
|
|
|
|
if (movable_node_is_enabled())
|
|
|
|
memblock_set_bottom_up(true);
|
|
|
|
#endif
|
|
|
|
|
2009-11-13 23:28:17 +00:00
|
|
|
x86_report_nx();
|
2008-09-11 23:42:00 +00:00
|
|
|
|
2023-08-08 22:04:19 +00:00
|
|
|
apic_setup_apic_calls();
|
|
|
|
|
2008-06-26 00:52:35 +00:00
|
|
|
if (acpi_mps_check()) {
|
2008-06-23 20:19:22 +00:00
|
|
|
#ifdef CONFIG_X86_LOCAL_APIC
|
2023-08-08 22:03:40 +00:00
|
|
|
apic_is_disabled = true;
|
2008-06-23 20:19:22 +00:00
|
|
|
#endif
|
2008-07-21 18:21:43 +00:00
|
|
|
setup_clear_cpu_cap(X86_FEATURE_APIC);
|
2008-06-20 23:11:20 +00:00
|
|
|
}
|
|
|
|
|
2017-01-28 21:27:28 +00:00
|
|
|
e820__reserve_setup_data();
|
2017-01-28 12:37:17 +00:00
|
|
|
e820__finish_early_params();
|
2006-09-26 08:52:32 +00:00
|
|
|
|
2012-11-14 09:42:35 +00:00
|
|
|
if (efi_enabled(EFI_BOOT))
|
2009-03-04 02:55:31 +00:00
|
|
|
efi_init();
|
|
|
|
|
2023-06-05 10:28:40 +00:00
|
|
|
reserve_ibft_region();
|
x86/sev: Skip ROM range scans and validation for SEV-SNP guests
SEV-SNP requires encrypted memory to be validated before access.
Because the ROM memory range is not part of the e820 table, it is not
pre-validated by the BIOS. Therefore, if a SEV-SNP guest kernel wishes
to access this range, the guest must first validate the range.
The current SEV-SNP code does indeed scan the ROM range during early
boot and thus attempts to validate the ROM range in probe_roms().
However, this behavior is neither sufficient nor necessary for the
following reasons:
* With regards to sufficiency, if EFI_CONFIG_TABLES are not enabled and
CONFIG_DMI_SCAN_MACHINE_NON_EFI_FALLBACK is set, the kernel will
attempt to access the memory at SMBIOS_ENTRY_POINT_SCAN_START (which
falls in the ROM range) prior to validation.
For example, Project Oak Stage 0 provides a minimal guest firmware
that currently meets these configuration conditions, meaning guests
booting atop Oak Stage 0 firmware encounter a problematic call chain
during dmi_setup() -> dmi_scan_machine() that results in a crash
during boot if SEV-SNP is enabled.
* With regards to necessity, SEV-SNP guests generally read garbage
(which changes across boots) from the ROM range, meaning these scans
are unnecessary. The guest reads garbage because the legacy ROM range
is unencrypted data but is accessed via an encrypted PMD during early
boot (where the PMD is marked as encrypted due to potentially mapping
actually-encrypted data in other PMD-contained ranges).
In one exceptional case, EISA probing treats the ROM range as
unencrypted data, which is inconsistent with other probing.
Continuing to allow SEV-SNP guests to use garbage and to inconsistently
classify ROM range encryption status can trigger undesirable behavior.
For instance, if garbage bytes appear to be a valid signature, memory
may be unnecessarily reserved for the ROM range. Future code or other
use cases may result in more problematic (arbitrary) behavior that
should be avoided.
While one solution would be to overhaul the early PMD mapping to always
treat the ROM region of the PMD as unencrypted, SEV-SNP guests do not
currently rely on data from the ROM region during early boot (and even
if they did, they would be mostly relying on garbage data anyways).
As a simpler solution, skip the ROM range scans (and the otherwise-
necessary range validation) during SEV-SNP guest early boot. The
potential SEV-SNP guest crash due to lack of ROM range validation is
thus avoided by simply not accessing the ROM range.
In most cases, skip the scans by overriding problematic x86_init
functions during sme_early_init() to SNP-safe variants, which can be
likened to x86_init overrides done for other platforms (ex: Xen); such
overrides also avoid the spread of cc_platform_has() checks throughout
the tree.
In the exceptional EISA case, still use cc_platform_has() for the
simplest change, given (1) checks for guest type (ex: Xen domain status)
are already performed here, and (2) these checks occur in a subsys
initcall instead of an x86_init function.
[ bp: Massage commit message, remove "we"s. ]
Fixes: 9704c07bf9f7 ("x86/kernel: Validate ROM memory before accessing when SEV-SNP is active")
Signed-off-by: Kevin Loughlin <kevinloughlin@google.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Cc: <stable@kernel.org>
Link: https://lore.kernel.org/r/20240313121546.2964854-1-kevinloughlin@google.com
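A hedged sketch of the x86_init override pattern the message describes: the
ROM probe hook is pointed at an SNP-safe no-op from sme_early_init(). The
snp_probe_roms_noop name is an illustrative stand-in, not necessarily the
identifier used in the tree:

	static void __init snp_probe_roms_noop(void)
	{
		/* Skip the legacy ROM range scan entirely under SEV-SNP. */
	}

	void __init sme_early_init(void)
	{
		/* ... existing SME/SNP early setup ... */

		if (cc_platform_has(CC_ATTR_GUEST_SEV_SNP))
			x86_init.resources.probe_roms = snp_probe_roms_noop;
	}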
2024-03-13 12:15:46 +00:00
|
|
|
x86_init.resources.dmi_setup();
|
2008-09-22 09:52:26 +00:00
|
|
|
|
2008-10-27 17:41:46 +00:00
|
|
|
/*
|
|
|
|
* VMware detection requires dmi to be available, so this
|
2019-03-28 19:34:28 +00:00
|
|
|
* needs to be done after dmi_setup(), for the boot CPU.
|
2023-05-02 12:09:18 +00:00
|
|
|
* For some guest types (Xen PV, SEV-SNP, TDX) it is required to be
|
|
|
|
* called before cache_bp_init() for setting up MTRR state.
|
2008-10-27 17:41:46 +00:00
|
|
|
*/
|
2009-08-20 15:06:25 +00:00
|
|
|
init_hypervisor_platform();
|
2008-10-27 17:41:46 +00:00
|
|
|
|
2018-07-19 20:55:38 +00:00
|
|
|
tsc_early_init();
|
2009-08-19 12:43:56 +00:00
|
|
|
x86_init.resources.probe_roms();
|
2008-06-16 20:03:31 +00:00
|
|
|
|
2016-04-14 18:18:57 +00:00
|
|
|
/* after parse_early_param(), so it can be debugged */
|
|
|
|
insert_resource(&iomem_resource, &code_resource);
|
2019-10-29 21:13:50 +00:00
|
|
|
insert_resource(&iomem_resource, &rodata_resource);
|
2016-04-14 18:18:57 +00:00
|
|
|
insert_resource(&iomem_resource, &data_resource);
|
|
|
|
insert_resource(&iomem_resource, &bss_resource);
|
|
|
|
|
2013-01-24 20:19:45 +00:00
|
|
|
e820_add_kernel_range();
|
2010-01-22 03:21:04 +00:00
|
|
|
trim_bios_range();
|
2008-06-26 00:52:35 +00:00
|
|
|
#ifdef CONFIG_X86_32
|
2008-06-16 23:11:08 +00:00
|
|
|
if (ppro_with_ram_bug()) {
|
2017-01-28 16:09:33 +00:00
|
|
|
e820__range_update(0x70000000ULL, 0x40000ULL, E820_TYPE_RAM,
|
|
|
|
E820_TYPE_RESERVED);
|
2017-01-28 17:00:35 +00:00
|
|
|
e820__update_table(e820_table);
|
2008-06-16 23:11:08 +00:00
|
|
|
printk(KERN_INFO "fixed physical RAM map:\n");
|
2017-01-28 13:24:02 +00:00
|
|
|
e820__print_table("bad_ppro");
|
2008-06-16 23:11:08 +00:00
|
|
|
}
|
2008-06-26 00:52:35 +00:00
|
|
|
#else
|
|
|
|
early_gart_iommu_check();
|
|
|
|
#endif
|
2008-06-16 23:11:08 +00:00
|
|
|
|
2008-06-04 02:35:04 +00:00
|
|
|
/*
|
|
|
|
* partially used pages are not usable - thus
|
|
|
|
* we are rounding upwards:
|
|
|
|
*/
|
2017-01-28 21:52:16 +00:00
|
|
|
max_pfn = e820__end_of_ram_pfn();
|
2008-06-04 02:35:04 +00:00
|
|
|
|
2008-01-30 12:33:32 +00:00
|
|
|
/* update e820 for memory not covered by WB MTRRs */
|
2022-11-02 07:47:10 +00:00
|
|
|
cache_bp_init();
|
2008-07-09 01:56:38 +00:00
|
|
|
if (mtrr_trim_uncached_memory(max_pfn))
|
2017-01-28 21:52:16 +00:00
|
|
|
max_pfn = e820__end_of_ram_pfn();
|
2008-03-23 07:16:49 +00:00
|
|
|
|
2015-12-04 13:07:05 +00:00
|
|
|
max_possible_pfn = max_pfn;
|
|
|
|
|
2016-08-09 17:11:04 +00:00
|
|
|
/*
|
|
|
|
* Define random base addresses for memory sections after max_pfn is
|
|
|
|
* defined and before each memory section base is used.
|
|
|
|
*/
|
|
|
|
kernel_randomize_memory();
|
|
|
|
|
2008-06-26 00:52:35 +00:00
|
|
|
#ifdef CONFIG_X86_32
|
2008-06-24 19:18:14 +00:00
|
|
|
/* max_low_pfn gets updated here */
|
2008-06-23 10:05:30 +00:00
|
|
|
find_low_pfn_range();
|
2008-06-26 00:52:35 +00:00
|
|
|
#else
|
2009-02-17 01:29:58 +00:00
|
|
|
check_x2apic();
|
2008-06-26 00:52:35 +00:00
|
|
|
|
|
|
|
/* How many end-of-memory variables you have, grandma! */
|
|
|
|
/* need this before calling reserve_initrd */
|
2008-07-11 03:38:26 +00:00
|
|
|
if (max_pfn > (1UL<<(32 - PAGE_SHIFT)))
|
2017-01-28 21:52:16 +00:00
|
|
|
max_low_pfn = e820__end_of_low_ram_pfn();
|
2008-07-11 03:38:26 +00:00
|
|
|
else
|
|
|
|
max_low_pfn = max_pfn;
|
|
|
|
|
2008-06-26 00:52:35 +00:00
|
|
|
high_memory = (void *)__va(max_pfn * PAGE_SIZE - 1) + 1;
|
2008-09-07 08:51:32 +00:00
|
|
|
#endif
|
|
|
|
|
2024-02-13 21:05:02 +00:00
|
|
|
/* Find and reserve MPTABLE area */
|
|
|
|
x86_init.mpparse.find_mptable();
|
2009-12-10 21:07:22 +00:00
|
|
|
|
2012-11-17 03:38:58 +00:00
|
|
|
early_alloc_pgt_buf();
|
|
|
|
|
2010-08-25 20:39:17 +00:00
|
|
|
/*
|
x86/boot/e820: Rename memblock_x86_fill() to e820__memblock_setup() and improve the explanations
So memblock_x86_fill() is another E820 code misnomer:
- nothing in its name tells us that it's part of the E820 subsystem ...
- The 'fill' wording is ambiguous and doesn't tell us whether it's a single
entry or some process - while the _real_ purpose of the function is hidden,
which is to do a complete setup of the (platform independent) memblock regions.
So rename it accordingly, to e820__memblock_setup().
Also translate this incomprehensible and misleading comment:
/*
* EFI may have more than 128 entries
* We are safe to enable resizing, beause memblock_x86_fill()
* is rather later for x86
*/
memblock_allow_resize();
The worst aspect of this comment isn't even the sloppy typos, but that it
casually mentions a '128' number with no explanation, which leads one to
assume that it is related to the well-known limit of a maximum
of 128 E820 entries passed via legacy bootloaders.
But no, the _real_ meaning of 128 here is that of the memblock subsystem,
which also happens to have a 128-entry limit for very early memblock
regions (which is unrelated to E820), via INIT_MEMBLOCK_REGIONS ...
So change the comment to a more comprehensible version:
/*
* The bootstrap memblock region count maximum is 128 entries
* (INIT_MEMBLOCK_REGIONS), but EFI might pass us more E820 entries
* than that - so allow memblock resizing.
*
* This is safe, because this call happens pretty late during x86 setup,
* so we know about reserved memory regions already. (This is important
* so that memblock resizing does no stomp over reserved areas.)
*/
memblock_allow_resize();
No change in functionality.
Cc: Alex Thorlton <athorlton@sgi.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Huang, Ying <ying.huang@intel.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul Jackson <pj@sgi.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
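A simplified sketch of what the renamed function does, per the description
above; the real function also copes with reserved-kernel entries and the
current memblock limit, which this sketch omits:

	void __init e820__memblock_setup(void)
	{
		int i;

		memblock_allow_resize();

		for (i = 0; i < e820_table->nr_entries; i++) {
			struct e820_entry *entry = &e820_table->entries[i];

			/* Only usable RAM becomes a memblock region. */
			if (entry->type != E820_TYPE_RAM)
				continue;

			memblock_add(entry->addr, entry->size);
		}
	}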
2017-01-28 10:37:42 +00:00
|
|
|
* Need to conclude brk before e820__memblock_setup(), since
|
2021-03-11 08:39:19 +00:00
|
|
|
* it could use memblock_find_in_range() and thus overlap with the
|
|
|
|
* brk area.
|
2010-08-25 20:39:17 +00:00
|
|
|
*/
|
|
|
|
reserve_brk();
|
|
|
|
|
2011-02-18 11:30:30 +00:00
|
|
|
cleanup_highmap();
|
|
|
|
|
2013-08-14 03:44:04 +00:00
|
|
|
memblock_set_current_limit(ISA_END_ADDRESS);
|
2017-01-28 10:37:42 +00:00
|
|
|
e820__memblock_setup();
|
2010-08-25 20:39:17 +00:00
|
|
|
|
2020-12-10 01:25:15 +00:00
|
|
|
/*
|
|
|
|
* Needs to run after memblock setup because it needs the physical
|
|
|
|
* memory size.
|
|
|
|
*/
|
2023-10-10 14:52:19 +00:00
|
|
|
mem_encrypt_setup_arch();
|
x86/coco: Require seeding RNG with RDRAND on CoCo systems
There are few uses of CoCo that don't rely on working cryptography and
hence a working RNG. Unfortunately, the CoCo threat model means that the
VM host cannot be trusted and may actively work against guests to
extract secrets or manipulate computation. Since a malicious host can
modify or observe nearly all inputs to guests, the only remaining source
of entropy for CoCo guests is RDRAND.
If RDRAND is broken -- due to CPU hardware fault -- the RNG as a whole
is meant to gracefully continue on gathering entropy from other sources,
but since there aren't other sources on CoCo, this is catastrophic.
This is mostly a concern at boot time when initially seeding the RNG, as
after that the consequences of a broken RDRAND are much more
theoretical.
So, try at boot to seed the RNG using 256 bits of RDRAND output. If this
fails, panic(). This will also trigger if the system is booted without
RDRAND, as RDRAND is essential for a safe CoCo boot.
Add this deliberately to be "just a CoCo x86 driver feature" and not
part of the RNG itself. Many device drivers and platforms have some
desire to contribute something to the RNG, and add_device_randomness()
is specifically meant for this purpose.
Any driver can call it with seed data of any quality, or even garbage
quality, and it can only possibly make the quality of the RNG better or
have no effect, but can never make it worse.
Rather than trying to build something into the core of the RNG, consider
the particular CoCo issue just a CoCo issue, and therefore separate it
all out into driver (well, arch/platform) code.
[ bp: Massage commit message. ]
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Elena Reshetova <elena.reshetova@intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20240326160735.73531-1-Jason@zx2c4.com
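A hedged sketch of the boot-time seeding described above. The gate
condition and helper usage are simplified, but the shape is: pull 256 bits
out of RDRAND, hand them to add_device_randomness(), and panic() on a hard
failure:

	void __init cc_random_init(void)
	{
		/* 256 bits of RDRAND output. */
		unsigned long rng_seed[32 / sizeof(unsigned long)];
		int i;

		if (!cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT))
			return;

		for (i = 0; i < ARRAY_SIZE(rng_seed); i++) {
			/*
			 * rdrand_long() retries internally; a hard failure
			 * means the only trustworthy entropy source on this
			 * platform is broken, so give up loudly.
			 */
			if (!rdrand_long(&rng_seed[i]))
				panic("RDRAND is defective.");
		}

		add_device_randomness(rng_seed, sizeof(rng_seed));
		memzero_explicit(rng_seed, sizeof(rng_seed));
	}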
2024-03-26 16:07:35 +00:00
|
|
|
cc_random_init();
|
2020-12-10 01:25:15 +00:00
|
|
|
|
2019-11-07 01:43:05 +00:00
|
|
|
efi_find_mirror();
|
|
|
|
efi_esrt_init();
|
2020-09-05 01:31:05 +00:00
|
|
|
efi_mokvar_table_init();
|
2016-08-10 09:29:13 +00:00
|
|
|
|
2019-11-07 01:43:05 +00:00
|
|
|
/*
|
|
|
|
* The EFI specification says that boot service code won't be
|
|
|
|
* called after ExitBootServices(). This is, in fact, a lie.
|
|
|
|
*/
|
|
|
|
efi_reserve_boot_services();
|
x86, efi: Retain boot service code until after switching to virtual mode
UEFI stands for "Unified Extensible Firmware Interface", where "Firmware"
is an ancient African word meaning "Why do something right when you can
do it so wrong that children will weep and brave adults will cower before
you", and "UEI" is Celtic for "We missed DOS so we burned it into your
ROMs". The UEFI specification provides for runtime services (ie, another
way for the operating system to be forced to depend on the firmware) and
we rely on these for certain trivial tasks such as setting up the
bootloader. But some hardware fails to work if we attempt to use these
runtime services from physical mode, and so we have to switch into virtual
mode. So far so dreadful.
The specification makes it clear that the operating system is free to do
whatever it wants with boot services code after ExitBootServices() has been
called. SetVirtualAddressMap() can't be called until ExitBootServices() has
been. So, obviously, a whole bunch of EFI implementations call into boot
services code when we do that. Since we've been charmingly naive and
trusted that the specification may be somehow relevant to the real world,
we've already stuffed a picture of a penguin or something in that address
space. And just to make things more entertaining, we've also marked it
non-executable.
This patch allocates the boot services regions during EFI init and makes
sure that they're executable. Then, after SetVirtualAddressMap(), it
discards them and everyone lives happily ever after. Except for the ones
who have to work on EFI, who live sad lives haunted by the knowledge that
someone's eventually going to write yet another firmware specification.
[ hpa: adding this to urgent with a stable tag since it fixes currently-broken
hardware. However, I do not know what the dependencies are and so I do
not know which -stable versions this may be a candidate for. ]
Signed-off-by: Matthew Garrett <mjg@redhat.com>
Link: http://lkml.kernel.org/r/1306331593-28715-1-git-send-email-mjg@redhat.com
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: <stable@kernel.org>
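A simplified sketch of the reservation pass: walk the EFI memory map and
keep boot services code/data regions away from the allocator until
SetVirtualAddressMap() has run. The real function carries extra quirk
handling that this sketch omits:

	void __init efi_reserve_boot_services(void)
	{
		efi_memory_desc_t *md;

		for_each_efi_memory_desc(md) {
			u64 start = md->phys_addr;
			u64 size = md->num_pages << EFI_PAGE_SHIFT;

			if (md->type != EFI_BOOT_SERVICES_CODE &&
			    md->type != EFI_BOOT_SERVICES_DATA)
				continue;

			memblock_reserve(start, size);
		}
	}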
2011-05-25 13:53:13 +00:00
|
|
|
|
2010-08-25 20:39:17 +00:00
|
|
|
/* preallocate 4k for mptable mpc */
|
2017-01-28 12:46:28 +00:00
|
|
|
e820__memblock_alloc_reserved_mpc_new();
|
2010-08-25 20:39:17 +00:00
|
|
|
|
|
|
|
#ifdef CONFIG_X86_CHECK_BIOS_CORRUPTION
|
|
|
|
setup_bios_corruption_check();
|
|
|
|
#endif
|
|
|
|
|
2013-01-24 20:19:54 +00:00
|
|
|
#ifdef CONFIG_X86_32
|
2012-05-29 22:06:29 +00:00
|
|
|
printk(KERN_DEBUG "initial memory mapped: [mem 0x00000000-%#010lx]\n",
|
|
|
|
(max_pfn_mapped<<PAGE_SHIFT) - 1);
|
2013-01-24 20:19:54 +00:00
|
|
|
#endif
|
2010-08-25 20:39:17 +00:00
|
|
|
|
2021-04-13 18:08:39 +00:00
|
|
|
/*
|
2021-06-08 20:17:10 +00:00
|
|
|
* Find free memory for the real mode trampoline and place it there. If
|
|
|
|
* there is not enough free memory under 1M, on EFI-enabled systems
|
|
|
|
* there will be an additional attempt to reclaim the memory for the real
|
|
|
|
* mode trampoline at efi_free_boot_services().
|
2021-06-01 07:53:52 +00:00
|
|
|
*
|
2021-06-08 20:17:10 +00:00
|
|
|
* Unconditionally reserve the entire first 1M of RAM because BIOSes
|
|
|
|
* are known to corrupt low memory and several hundred kilobytes are not
|
|
|
|
* worth the complexity of detecting what memory gets clobbered. Windows does the
|
|
|
|
* same thing for very similar reasons.
|
|
|
|
*
|
|
|
|
* Moreover, on machines with SandyBridge graphics or in setups that use
|
|
|
|
* crashkernel the entire 1M is reserved anyway.
|
2023-12-08 17:07:27 +00:00
|
|
|
*
|
|
|
|
* Note that TDX host kernels also require the first 1MB to be reserved.
|
2021-04-13 18:08:39 +00:00
|
|
|
*/
|
2022-11-23 11:45:23 +00:00
|
|
|
x86_platform.realmode_reserve();
|
2012-11-14 20:43:31 +00:00
|
|
|
|
2012-11-17 03:38:41 +00:00
|
|
|
init_mem_mapping();
|
2011-10-20 21:15:26 +00:00
|
|
|
|
2024-07-09 15:40:48 +00:00
|
|
|
/*
|
|
|
|
* init_mem_mapping() relies on the early IDT page fault handling.
|
|
|
|
* Now either enable FRED or install the real page fault handler
|
|
|
|
* for 64-bit in the IDT.
|
|
|
|
*/
|
|
|
|
cpu_init_replace_early_idt();
|
2011-10-20 21:15:26 +00:00
|
|
|
|
2016-08-10 09:29:14 +00:00
|
|
|
/*
|
|
|
|
* Update mmu_cr4_features (and, indirectly, trampoline_cr4_features)
|
|
|
|
* with the current CR4 value. This may not be necessary, but
|
|
|
|
* auditing all the early-boot CR4 manipulation would be needed to
|
|
|
|
* rule it out.
|
2017-09-11 00:48:27 +00:00
|
|
|
*
|
|
|
|
* Mask off features that don't work outside long mode (just
|
|
|
|
* PCIDE for now).
|
2016-08-10 09:29:14 +00:00
|
|
|
*/
|
2017-09-11 00:48:27 +00:00
|
|
|
mmu_cr4_features = __read_cr4() & ~X86_CR4_PCIDE;
|
2016-08-10 09:29:14 +00:00
|
|
|
|
2014-01-28 01:06:50 +00:00
|
|
|
memblock_set_current_limit(get_max_mapped());
|
2008-06-24 19:18:14 +00:00
|
|
|
|
2008-06-26 04:51:28 +00:00
|
|
|
/*
|
|
|
|
* NOTE: On x86-32, only from this point on, fixmaps are ready for use.
|
|
|
|
*/
|
|
|
|
|
|
|
|
#ifdef CONFIG_PROVIDE_OHCI1394_DMA_INIT
|
|
|
|
if (init_ohci1394_dma_early)
|
|
|
|
init_ohci1394_dma_on_all_controllers();
|
|
|
|
#endif
|
2011-05-25 00:13:20 +00:00
|
|
|
/* Allocate bigger log buffer */
|
|
|
|
setup_log_buf(1);
|
2008-06-26 04:51:28 +00:00
|
|
|
|
2017-02-06 11:22:45 +00:00
|
|
|
if (efi_enabled(EFI_BOOT)) {
|
|
|
|
switch (boot_params.secure_boot) {
|
|
|
|
case efi_secureboot_mode_disabled:
|
|
|
|
pr_info("Secure boot disabled\n");
|
|
|
|
break;
|
|
|
|
case efi_secureboot_mode_enabled:
|
|
|
|
pr_info("Secure boot enabled\n");
|
|
|
|
break;
|
|
|
|
default:
|
|
|
|
pr_info("Secure boot could not be determined\n");
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2008-06-23 10:05:30 +00:00
|
|
|
reserve_initrd();
|
|
|
|
|
2016-06-20 10:56:10 +00:00
|
|
|
acpi_table_upgrade();
|
2021-04-13 14:01:00 +00:00
|
|
|
/* Look for ACPI tables and reserve memory occupied by them. */
|
|
|
|
acpi_boot_table_init();
|
2012-09-30 22:23:54 +00:00
|
|
|
|
2008-06-26 00:52:35 +00:00
|
|
|
vsmp_init();
|
|
|
|
|
2008-06-17 22:41:45 +00:00
|
|
|
io_delay_init();
|
|
|
|
|
2017-08-01 12:10:41 +00:00
|
|
|
early_platform_quirks();
|
|
|
|
|
2024-02-13 21:05:16 +00:00
|
|
|
/* Some platforms need the APIC registered for NUMA configuration */
|
x86, ACPI, mm: Revert movablemem_map support
Tim found:
WARNING: at arch/x86/kernel/smpboot.c:324 topology_sane.isra.2+0x6f/0x80()
Hardware name: S2600CP
sched: CPU #1's llc-sibling CPU #0 is not on the same node! [node: 1 != 0]. Ignoring dependency.
smpboot: Booting Node 1, Processors #1
Modules linked in:
Pid: 0, comm: swapper/1 Not tainted 3.9.0-0-generic #1
Call Trace:
set_cpu_sibling_map+0x279/0x449
start_secondary+0x11d/0x1e5
Don Morris reproduced on a HP z620 workstation, and bisected it to
commit e8d195525809 ("acpi, memory-hotplug: parse SRAT before memblock
is ready")
It turns out movable_map has some problems, and it breaks several things:
1. numa_init is called several times, NOT just for srat, so those
nodes_clear(numa_nodes_parsed)
memset(&numa_meminfo, 0, sizeof(numa_meminfo))
calls can not simply be removed: the fallback sequence (numaq, srat, amd,
dummy) has to be considered and the fallback path kept working.
2. It simply split acpi_numa_init into early_parse_srat:
a. early_parse_srat is NOT called for ia64, so ia64 breaks.
b. the for (i = 0; i < MAX_LOCAL_APIC; i++)
set_apicid_to_node(i, NUMA_NO_NODE)
loop is still left in numa_init, so it just clears the result from
early_parse_srat; it should be moved before that.
c. it breaks ACPI_TABLE_OVERIDE, as the ACPI table scan is moved
earlier, before the override from the initrd is settled.
3. That patch TITLE is totally misleading: there is NO x86 in the title,
but it changes critical x86 code, so the x86 maintainers did not
pay attention and the problem was not found early. Those patches really
should have been routed via tip/x86/mm.
4. After that commit, the following ranges can not use movable RAM:
a. real_mode code.... well..funny, legacy Node0 [0,1M) could be hot-removed?
b. initrd... it will be freed after booting, so it could be on movable...
c. crashkernel for kdump...: looks like we can not put the kdump kernel
above 4G anymore.
d. init_mem_mapping: can not put page tables high anymore.
e. initmem_init: vmemmap can not be placed high on the local node anymore.
That is not good.
If a node is hotpluggable, memory-related ranges like the page tables and
vmemmap could live on that node without problems, and should be on that
node.
We have workaround patches that could fix some of the problems, but some
can not be fixed.
So just remove that offending commit and related ones including:
f7210e6c4ac7 ("mm/memblock.c: use CONFIG_HAVE_MEMBLOCK_NODE_MAP to
protect movablecore_map in memblock_overlaps_region().")
01a178a94e8e ("acpi, memory-hotplug: support getting hotplug info from
SRAT")
27168d38fa20 ("acpi, memory-hotplug: extend movablemem_map ranges to
the end of node")
e8d195525809 ("acpi, memory-hotplug: parse SRAT before memblock is
ready")
fb06bc8e5f42 ("page_alloc: bootmem limit with movablecore_map")
42f47e27e761 ("page_alloc: make movablemem_map have higher priority")
6981ec31146c ("page_alloc: introduce zone_movable_limit[] to keep
movable limit for nodes")
34b71f1e04fc ("page_alloc: add movable_memmap kernel parameter")
4d59a75125d5 ("x86: get pg_data_t's memory from other node")
Later we should have patches that make sure the kernel puts page tables
and vmemmap on local node RAM instead of pushing them down to node0. We
also need to find a way to put other kernel-used RAM on local node RAM.
Reported-by: Tim Gardner <tim.gardner@canonical.com>
Reported-by: Don Morris <don.morris@hp.com>
Bisected-by: Don Morris <don.morris@hp.com>
Tested-by: Don Morris <don.morris@hp.com>
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Thomas Renninger <trenn@suse.de>
Cc: Tejun Heo <tj@kernel.org>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-03-01 22:51:27 +00:00
|
|
|
early_acpi_boot_init();
|
2024-02-13 21:05:16 +00:00
|
|
|
x86_init.mpparse.early_parse_smp_cfg();
|
2013-03-01 22:51:27 +00:00
|
|
|
|
2023-08-25 07:47:36 +00:00
|
|
|
x86_flattree_get_config();
|
|
|
|
|
2011-02-16 11:13:06 +00:00
|
|
|
initmem_init();
|
2014-10-24 09:00:34 +00:00
|
|
|
dma_contiguous_reserve(max_pfn_mapped << PAGE_SHIFT);
|
2013-11-12 23:08:07 +00:00
|
|
|
|
2020-04-10 21:32:45 +00:00
|
|
|
if (boot_cpu_has(X86_FEATURE_GBPAGES))
|
|
|
|
hugetlb_cma_reserve(PUD_SHIFT - PAGE_SHIFT);
|
|
|
|
|
2013-11-12 23:08:07 +00:00
|
|
|
/*
|
|
|
|
* Reserve memory for crash kernel after SRAT is parsed so that it
|
|
|
|
* won't consume hotpluggable memory.
|
|
|
|
*/
|
2023-09-14 03:31:39 +00:00
|
|
|
arch_reserve_crashkernel();
|
2013-11-12 23:08:07 +00:00
|
|
|
|
2017-09-11 18:51:11 +00:00
|
|
|
if (!early_xdbc_setup_hardware())
|
|
|
|
early_xdbc_register_console();
|
|
|
|
|
2012-08-21 20:22:38 +00:00
|
|
|
x86_init.paging.pagetable_init();
|
x86: early boot debugging via FireWire (ohci1394_dma=early)
This patch adds a new configuration option, which adds support for a new
early_param which gets checked in arch/x86/kernel/setup_{32,64}.c:setup_arch()
to decide whether OHCI-1394 FireWire controllers should be initialized and
enabled for physical DMA access to allow remote debugging of early problems
like issues in ACPI or other subsystems which are executed very early.
If the config option is not enabled, no code is changed, and if the boot
parameter is not given, no new code is executed, and independent of that,
all new code is freed after boot, so the config option can even be enabled
in standard, non-debug kernels.
With specialized tools, it is then possible to get debugging information
from machines which have no serial ports (notebooks) such as the printk
buffer contents, or any data which can be referenced from global pointers,
if it is stored below the 4GB limit, and even memory dumps of the physical
RAM region below the 4GB limit can be taken without any cooperation from the
CPU of the host, so it does not matter if the machine has crashed early.
In the extreme, even kernel debuggers can be accessed in this way. I wrote
a small kgdb module and an accompanying gdb stub for FireWire which allows
gdb to talk to kgdb using remote memory reads and writes over FireWire.
A version of the gdb stub for FireWire is able to read all global data
from a system which is running a normal kernel without any kernel debugger,
without any interruption or support of the system's CPU. That way, e.g. the
task struct and so on can be read and even manipulated when the physical DMA
access is granted.
A HOWTO is included in this patch, in Documentation/debugging-via-ohci1394.txt
and I've put a copy online at
ftp://ftp.suse.de/private/bk/firewire/docs/debugging-via-ohci1394.txt
It also has links to all the tools which are available to make use of it;
another copy of it is online at:
ftp://ftp.suse.de/private/bk/firewire/kernel/ohci1394_dma_early-v2.diff
Signed-off-by: Bernhard Kaindl <bk@suse.de>
Tested-by: Thomas Renninger <trenn@suse.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-01-30 12:34:11 +00:00
|
|
|
|
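The early_param hook referred to above follows the standard pattern sketched below; the flag name is an assumption for illustration.

/*
 * A minimal sketch of the early_param pattern described above: the
 * boot parameter is parsed in setup_arch(), long before ordinary
 * driver initialization. The flag name is an assumption.
 */
static int init_ohci1394_dma_early;

static int __init setup_ohci1394_dma(char *opt)
{
	/* "ohci1394_dma=early" on the command line enables early init. */
	if (!strcmp(opt, "early"))
		init_ohci1394_dma_early = 1;
	return 0;
}
early_param("ohci1394_dma", setup_ohci1394_dma);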
2015-02-13 22:39:25 +00:00
|
|
|
kasan_init();
|
|
|
|
|
2017-05-09 00:09:10 +00:00
|
|
|
/*
|
2018-02-28 20:14:26 +00:00
|
|
|
* Sync back kernel address range.
|
|
|
|
*
|
|
|
|
* FIXME: Can the later sync in setup_cpu_entry_areas() replace
|
|
|
|
* this call?
|
2017-05-09 00:09:10 +00:00
|
|
|
*/
|
2018-02-28 20:14:26 +00:00
|
|
|
sync_initial_page_table();
|
2017-05-09 00:09:10 +00:00
|
|
|
|
x86, intel_txt: Intel TXT boot support
This patch adds kernel configuration and boot support for Intel Trusted
Execution Technology (Intel TXT).
Intel's technology for safer computing, Intel Trusted Execution
Technology (Intel TXT), defines platform-level enhancements that
provide the building blocks for creating trusted platforms.
Intel TXT was formerly known by the code name LaGrande Technology (LT).
Intel TXT in Brief:
o Provides dynamic root of trust for measurement (DRTM)
o Data protection in case of improper shutdown
o Measurement and verification of launched environment
Intel TXT is part of the vPro(TM) brand and is also available on some
non-vPro systems. It is currently available on desktop systems based on
the Q35, X38, Q45, and Q43 Express chipsets (e.g. Dell Optiplex 755, HP
dc7800, etc.) and mobile systems based on the GM45, PM45, and GS45
Express chipsets.
For more information, see http://www.intel.com/technology/security/.
This site also has a link to the Intel TXT MLE Developers Manual, which
has been updated for the new released platforms.
A much more complete description of how these patches support TXT, how to
configure a system for it, etc. is in the Documentation/intel_txt.txt file
in this patch.
This patch provides the TXT support routines for complete functionality,
documentation for TXT support and for the changes to the boot_params structure,
and boot detection of a TXT launch. Attempts to shutdown (reboot, Sx) the system
will result in platform resets; subsequent patches will support these shutdown modes
properly.
Documentation/intel_txt.txt | 210 +++++++++++++++++++++
Documentation/x86/zero-page.txt | 1
arch/x86/include/asm/bootparam.h | 3
arch/x86/include/asm/fixmap.h | 3
arch/x86/include/asm/tboot.h | 197 ++++++++++++++++++++
arch/x86/kernel/Makefile | 1
arch/x86/kernel/setup.c | 4
arch/x86/kernel/tboot.c | 379 +++++++++++++++++++++++++++++++++++++++
security/Kconfig | 30 +++
9 files changed, 827 insertions(+), 1 deletion(-)
Signed-off-by: Joseph Cihula <joseph.cihula@intel.com>
Signed-off-by: Shane Wang <shane.wang@intel.com>
Signed-off-by: Gang Wei <gang.wei@intel.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2009-07-01 02:30:59 +00:00
|
|
|
tboot_probe();
|
|
|
|
|
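Boot detection of a TXT launch, mentioned above, keys off a tboot address handed over in boot_params; the sketch below shows only that check, with the validation of the shared page elided and the function name hypothetical.

/*
 * A rough sketch of the TXT launch detection described above: tboot
 * passes the physical address of its shared page via boot_params, and
 * a zero address means "no measured launch". The validation of the
 * mapped page (UUID/version checks) is elided; see
 * arch/x86/kernel/tboot.c for the real code.
 */
void __init tboot_probe_sketch(void)
{
	if (!boot_params.tboot_addr)
		return;	/* not launched through Intel TXT */

	/* Map and validate the tboot shared page here. */
}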
2008-06-26 00:52:35 +00:00
|
|
|
map_vsyscall();
|
|
|
|
|
2023-08-08 22:04:01 +00:00
|
|
|
x86_32_probe_apic();
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2007-10-19 18:35:03 +00:00
|
|
|
early_quirks();
|
2006-06-08 07:43:38 +00:00
|
|
|
|
2024-02-13 21:05:53 +00:00
|
|
|
topology_apply_cmdline_limits_early();
|
|
|
|
|
2008-06-24 02:55:05 +00:00
|
|
|
/*
|
2024-02-13 21:05:14 +00:00
|
|
|
* Parse SMP configuration. Try ACPI first and then the platform
|
|
|
|
* specific parser.
|
2008-06-24 02:55:05 +00:00
|
|
|
*/
|
2005-04-16 22:20:36 +00:00
|
|
|
acpi_boot_init();
|
2024-02-13 21:05:14 +00:00
|
|
|
x86_init.mpparse.parse_smp_cfg();
|
2008-06-26 00:52:35 +00:00
|
|
|
|
2024-02-13 21:05:53 +00:00
|
|
|
/* Last opportunity to detect and map the local APIC */
|
x86/smpboot: Init apic mapping before usage
The recent changes, which forced the registration of the boot cpu on UP
systems, which do not have ACPI tables, have been fixed for systems w/o
local APIC, but left wreckage behind for systems which have neither ACPI nor
mptables, but where the CPU has an APIC, e.g. virtualbox.
The boot process crashes in prefill_possible_map() as it wants to register
the boot cpu, which needs to access the local apic, but the local APIC is
not yet mapped.
There is no reason why init_apic_mappings() can't be invoked before
prefill_possible_map(). So instead of playing another silly early mapping
game, as the ACPI/mptables code does, we just move init_apic_mappings()
before the call to prefill_possible_map().
In hindsight, I should have noticed that combination earlier.
Sorry for the churn (also in stable)!
Fixes: ff8560512b8d ("x86/boot/smp: Don't try to poke disabled/non-existent APIC")
Reported-and-debugged-by: Michal Necasek <michal.necasek@oracle.com>
Reported-and-tested-by: Wolfgang Bauer <wbauer@tmo.at>
Cc: prarit@redhat.com
Cc: ville.syrjala@linux.intel.com
Cc: michael.thayer@oracle.com
Cc: knut.osmundsen@oracle.com
Cc: frank.mehnert@oracle.com
Cc: Borislav Petkov <bp@alien8.de>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1610282114380.5053@nanos
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-10-29 11:42:42 +00:00
|
|
|
init_apic_mappings();
|
|
|
|
|
2024-02-13 21:05:53 +00:00
|
|
|
topology_init_possible_cpus();
|
2008-08-20 03:50:02 +00:00
|
|
|
|
2008-07-03 01:53:44 +00:00
|
|
|
init_cpu_to_node();
|
2020-09-30 14:05:43 +00:00
|
|
|
init_gi_nodes();
|
2008-07-03 01:53:44 +00:00
|
|
|
|
2015-04-24 11:57:48 +00:00
|
|
|
io_apic_init_mappings();
|
2008-08-20 03:50:52 +00:00
|
|
|
|
2017-11-09 13:27:38 +00:00
|
|
|
x86_init.hyper.guest_late_init();
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2017-01-28 21:41:14 +00:00
|
|
|
e820__reserve_resources();
|
2018-09-21 06:26:24 +00:00
|
|
|
e820__register_nosave_regions(max_pfn);
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2009-08-19 12:55:50 +00:00
|
|
|
x86_init.resources.reserve_resources();
|
2008-06-16 20:03:31 +00:00
|
|
|
|
2017-01-28 13:16:38 +00:00
|
|
|
e820__setup_pci_gap();
|
2008-06-16 20:03:31 +00:00
|
|
|
|
2005-04-16 22:20:36 +00:00
|
|
|
#ifdef CONFIG_VT
|
|
|
|
#if defined(CONFIG_VGA_CONSOLE)
|
2012-11-14 09:42:35 +00:00
|
|
|
if (!efi_enabled(EFI_BOOT) || (efi_mem_type(0xa0000) != EFI_CONVENTIONAL_MEMORY))
|
2023-10-09 21:18:41 +00:00
|
|
|
vgacon_register_screen(&screen_info);
|
2005-04-16 22:20:36 +00:00
|
|
|
#endif
|
|
|
|
#endif
|
2009-08-20 11:19:57 +00:00
|
|
|
x86_init.oem.banner();
|
2009-11-10 01:38:24 +00:00
|
|
|
|
2011-02-14 16:13:31 +00:00
|
|
|
x86_init.timers.wallclock_init();
|
|
|
|
|
x86/thermal: Fix LVT thermal setup for SMI delivery mode
There are machines out there with added value crap^WBIOS which provide an
SMI handler for the local APIC thermal sensor interrupt. Out of reset,
the BSP on those machines has something like 0x200 in that APIC register
(timestamps left in because this whole issue is timing sensitive):
[ 0.033858] read lvtthmr: 0x330, val: 0x200
which means:
- bit 16 - the interrupt mask bit is clear and thus that interrupt is enabled
- bits [10:8] have 010b which means SMI delivery mode.
Now, later during boot, when the kernel programs the local APIC, it
soft-disables it temporarily through the spurious vector register:
setup_local_APIC:
...
/*
* If this comes from kexec/kcrash the APIC might be enabled in
* SPIV. Soft disable it before doing further initialization.
*/
value = apic_read(APIC_SPIV);
value &= ~APIC_SPIV_APIC_ENABLED;
apic_write(APIC_SPIV, value);
which means (from the SDM):
"10.4.7.2 Local APIC State After It Has Been Software Disabled
...
* The mask bits for all the LVT entries are set. Attempts to reset these
bits will be ignored."
And this happens too:
[ 0.124111] APIC: Switch to symmetric I/O mode setup
[ 0.124117] lvtthmr 0x200 before write 0xf to APIC 0xf0
[ 0.124118] lvtthmr 0x10200 after write 0xf to APIC 0xf0
This results in CPU 0 soft lockups depending on the placement in time
when the APIC soft-disable happens. Those soft lockups are not 100%
reproducible and the reason for that can only be speculated as no one
tells you what SMM does. Likely, the SMM code gets confused because the
APIC is disabled and the thermal interrupt doesn't fire at all,
leading to CPU 0 being stuck in SMM forever...
Now, before
4f432e8bb15b ("x86/mce: Get rid of mcheck_intel_therm_init()")
due to how the APIC_LVTTHMR was read before APIC initialization in
mcheck_intel_therm_init(), it would read the value with the mask bit 16
clear and then intel_init_thermal() would replicate it onto the APs and
all would be peachy - the thermal interrupt would remain enabled.
But that commit moved that reading to a later moment in
intel_init_thermal(), resulting in reading APIC_LVTTHMR on the BSP too
late and with its interrupt mask bit set.
Thus, revert back to the old behavior of reading the thermal LVT
register before the APIC gets initialized.
Fixes: 4f432e8bb15b ("x86/mce: Get rid of mcheck_intel_therm_init()")
Reported-by: James Feeney <james@nurealm.net>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: <stable@vger.kernel.org>
Cc: Zhang Rui <rui.zhang@intel.com>
Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Link: https://lkml.kernel.org/r/YKIqDdFNaXYd39wz@zn.tnic
2021-05-27 09:02:26 +00:00
|
|
|
/*
|
|
|
|
* This needs to run before setup_local_APIC() which soft-disables the
|
|
|
|
* local APIC temporarily and that masks the thermal LVT interrupt,
|
|
|
|
* leading to softlockups on machines which have configured SMI
|
|
|
|
* interrupt delivery.
|
|
|
|
*/
|
|
|
|
therm_lvt_init();
|
|
|
|
|
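What the fix boils down to can be sketched in a few lines; the names below are illustrative, assuming only that the thermal LVT entry is read and cached before setup_local_APIC() runs.

/*
 * A minimal sketch of the fix described above: cache the thermal LVT
 * entry while the firmware-programmed value (e.g. SMI delivery mode,
 * mask bit clear) is still readable, so the thermal init code can
 * replicate it onto the APs later. Names are illustrative; the real
 * code lives in the Intel thermal driver.
 */
static u32 lvtthmr_init __read_mostly;

void __init therm_lvt_init_sketch(void)
{
	lvtthmr_init = apic_read(APIC_LVTTHMR);
}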
2009-11-10 01:38:24 +00:00
|
|
|
mcheck_init();
|
2010-09-17 15:08:51 +00:00
|
|
|
|
jiffies: Remove compile time assumptions about CLOCK_TICK_RATE
CLOCK_TICK_RATE is used to accurately calculate exactly how long
a tick will be at a given HZ.
This is useful, because while we'd expect NSEC_PER_SEC/HZ,
the underlying hardware will have some granularity limit,
so we won't be able to have exactly HZ ticks per second.
This slight error can cause timekeeping quality problems
when using the jiffies or other jiffies-driven clocksources.
Thus we currently use the compile-time CLOCK_TICK_RATE value to
generate SHIFTED_HZ and NSEC_PER_JIFFIES, which we then use
to adjust the jiffies clocksource to correct this error.
Unfortunately though, since CLOCK_TICK_RATE is a compile-time
value, and the jiffies clocksource is registered very
early during boot, there are a number of cases where there
are different possible hardware timers that have different
tick rates. This causes problems in cases like ARM where
there are numerous different types of hardware, each having
their own compile-time CLOCK_TICK_RATE, making it hard to
accurately support different hardware with a single kernel.
For the most part, this doesn't matter all that much, as not
too many systems actually utilize the jiffies or jiffies-driven
clocksource. Usually there are other highres clocksources
whose granularity error is negligible.
Even so, we have some complicated calculations that we do
everywhere to handle these edge cases.
This patch removes the compile-time SHIFTED_HZ value, and
introduces a register_refined_jiffies() function. This results
in the default jiffies clock being assumed to be a perfect HZ
freq, and allows architectures that care about jiffies accuracy
to call register_refined_jiffies() with the tick rate, specified
dynamically at boot.
This allows us, where necessary, to not have a compile-time
CLOCK_TICK_RATE constant, simplifies the jiffies code, and
still provides a way to have an accurate jiffies clock.
NOTE: Since this patch does not add register_refined_jiffies()
calls for every arch, it may cause time quality regressions
in some cases. It's likely these will not be noticeable, but
if they are an issue, adding the following to the end of
setup_arch() should resolve the regression:
register_refined_jiffies(CLOCK_TICK_RATE)
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <john.stultz@linaro.org>
2012-09-04 16:42:27 +00:00
|
|
|
register_refined_jiffies(CLOCK_TICK_RATE);
|
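The granularity error described above is easy to quantify; the userspace sketch below reproduces the rounding register_refined_jiffies() performs, assuming the x86 PIT rate of 1193182 Hz and HZ=250 as example inputs.

/*
 * A userspace sketch of the granularity error that
 * register_refined_jiffies() corrects, using the x86 PIT rate and
 * HZ=250 as example inputs.
 */
#include <stdio.h>

int main(void)
{
	long cycles_per_second = 1193182;	/* x86 CLOCK_TICK_RATE */
	long hz = 250;				/* CONFIG_HZ */

	/* The timer can only be programmed in whole cycles per tick... */
	long cycles_per_jiffy = (cycles_per_second + hz / 2) / hz;

	/* ...so the real tick differs slightly from NSEC_PER_SEC/HZ. */
	double real_tick_ns = cycles_per_jiffy * 1e9 / cycles_per_second;

	printf("ideal tick: %.3f ns, real tick: %.3f ns\n",
	       1e9 / hz, real_tick_ns);
	return 0;
}

With these inputs the real tick comes out roughly 227 ns longer than the ideal 4 ms, an error of about 57 ppm, which is exactly the kind of drift the refined jiffies clocksource absorbs.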
2012-10-24 17:00:44 +00:00
|
|
|
|
|
|
|
#ifdef CONFIG_EFI
|
2014-03-04 16:02:17 +00:00
|
|
|
if (efi_enabled(EFI_BOOT))
|
|
|
|
efi_apply_memmap_quirks();
|
2012-10-24 17:00:44 +00:00
|
|
|
#endif
|
2017-07-24 23:36:57 +00:00
|
|
|
|
|
|
|
unwind_init();
|
2005-04-16 22:20:36 +00:00
|
|
|
}
|
2008-09-16 07:29:09 +00:00
|
|
|
|
2009-02-17 22:12:48 +00:00
|
|
|
#ifdef CONFIG_X86_32
|
|
|
|
|
2009-08-19 12:55:50 +00:00
|
|
|
static struct resource video_ram_resource = {
|
|
|
|
.name = "Video RAM area",
|
|
|
|
.start = 0xa0000,
|
|
|
|
.end = 0xbffff,
|
|
|
|
.flags = IORESOURCE_BUSY | IORESOURCE_MEM
|
2009-02-17 22:12:48 +00:00
|
|
|
};
|
|
|
|
|
2009-08-19 12:55:50 +00:00
|
|
|
void __init i386_reserve_resources(void)
|
2009-02-17 22:12:48 +00:00
|
|
|
{
|
2009-08-19 12:55:50 +00:00
|
|
|
request_resource(&iomem_resource, &video_ram_resource);
|
|
|
|
reserve_standard_io_resources();
|
2009-02-17 22:12:48 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
#endif /* CONFIG_X86_32 */
|
2013-10-11 00:18:17 +00:00
|
|
|
|
|
|
|
static struct notifier_block kernel_offset_notifier = {
|
|
|
|
.notifier_call = dump_kernel_offset
|
|
|
|
};
|
|
|
|
|
|
|
|
static int __init register_kernel_offset_dumper(void)
|
|
|
|
{
|
|
|
|
atomic_notifier_chain_register(&panic_notifier_list,
|
|
|
|
&kernel_offset_notifier);
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
__initcall(register_kernel_offset_dumper);
|
2024-03-18 18:12:45 +00:00
|
|
|
|
|
|
|
#ifdef CONFIG_HOTPLUG_CPU
|
|
|
|
bool arch_cpu_is_hotpluggable(int cpu)
|
|
|
|
{
|
|
|
|
return cpu > 0;
|
|
|
|
}
|
|
|
|
#endif /* CONFIG_HOTPLUG_CPU */
|