The memblock current limit value is used to limit early boot memory
allocations below max low memory address by default, as the kernel can
access only to the low memory.
Hence, set memblock current limit value to the max mapped low memory
address instead of max mapped memory address.
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@ti.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Grygorii Strashko <grygorii.strashko@ti.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Paul Walmsley <paul@pwsan.com>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Tony Lindgren <tony@atomide.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Here's the big driver core and sysfs patch set for 3.14-rc1.
There's a lot of work here moving sysfs logic out into a "kernfs" to
allow other subsystems to also have a virtual filesystem with the same
attributes of sysfs (handle device disconnect, dynamic creation /
removal as needed / unneeded, etc. This is primarily being done for
the cgroups filesystem, but the goal is to also move debugfs to it when
it is ready, solving all of the known issues in that filesystem as well.
The code isn't completed yet, but all should be stable now (there is a
big section that was reverted due to problems found when testing.)
There's also some other smaller fixes, and a driver core addition that
allows for a "collection" of objects, that the DRM people will be using
soon (it's in this tree to make merges after -rc1 easier.)
All of this has been in linux-next with no reported issues.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iEYEABECAAYFAlLdh0cACgkQMUfUDdst+ylv4QCfeDKDgLo4LsaBIIrFSxLoH/c7
UUsAoMPRwA0h8wy+BQcJAg4H4J4maKj3
=0pc0
-----END PGP SIGNATURE-----
Merge tag 'driver-core-3.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core / sysfs patches from Greg KH:
"Here's the big driver core and sysfs patch set for 3.14-rc1.
There's a lot of work here moving sysfs logic out into a "kernfs" to
allow other subsystems to also have a virtual filesystem with the same
attributes of sysfs (handle device disconnect, dynamic creation /
removal as needed / unneeded, etc)
This is primarily being done for the cgroups filesystem, but the goal
is to also move debugfs to it when it is ready, solving all of the
known issues in that filesystem as well. The code isn't completed
yet, but all should be stable now (there is a big section that was
reverted due to problems found when testing)
There's also some other smaller fixes, and a driver core addition that
allows for a "collection" of objects, that the DRM people will be
using soon (it's in this tree to make merges after -rc1 easier)
All of this has been in linux-next with no reported issues"
* tag 'driver-core-3.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (113 commits)
kernfs: associate a new kernfs_node with its parent on creation
kernfs: add struct dentry declaration in kernfs.h
kernfs: fix get_active failure handling in kernfs_seq_*()
Revert "kernfs: fix get_active failure handling in kernfs_seq_*()"
Revert "kernfs: replace kernfs_node->u.completion with kernfs_root->deactivate_waitq"
Revert "kernfs: remove KERNFS_ACTIVE_REF and add kernfs_lockdep()"
Revert "kernfs: remove KERNFS_REMOVED"
Revert "kernfs: restructure removal path to fix possible premature return"
Revert "kernfs: invoke kernfs_unmap_bin_file() directly from __kernfs_remove()"
Revert "kernfs: remove kernfs_addrm_cxt"
Revert "kernfs: make kernfs_get_active() block if the node is deactivated but not removed"
Revert "kernfs: implement kernfs_{de|re}activate[_self]()"
Revert "kernfs, sysfs, driver-core: implement kernfs_remove_self() and its wrappers"
Revert "pci: use device_remove_file_self() instead of device_schedule_callback()"
Revert "scsi: use device_remove_file_self() instead of device_schedule_callback()"
Revert "s390: use device_remove_file_self() instead of device_schedule_callback()"
Revert "sysfs, driver-core: remove unused {sysfs|device}_schedule_callback_owner()"
Revert "kernfs: remove unnecessary NULL check in __kernfs_remove()"
kernfs: remove unnecessary NULL check in __kernfs_remove()
drivers/base: provide an infrastructure for componentised subsystems
...
Pull x86 cpufeature and mpx updates from Peter Anvin:
"This includes the basic infrastructure for MPX (Memory Protection
Extensions) support, but does not include MPX support itself. It is,
however, a prerequisite for KVM support for MPX, which I believe will
be pushed later this merge window by the KVM team.
This includes moving the functionality in
futex_atomic_cmpxchg_inatomic() into a new function in uaccess.h so it
can be reused - this will be used by the final MPX patches.
The actual MPX functionality (map management and so on) will be pushed
in a future merge window, when ready"
* 'x86/mpx' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/intel/mpx: Remove unused LWP structure
x86, mpx: Add MPX related opcodes to the x86 opcode map
x86: replace futex_atomic_cmpxchg_inatomic() with user_atomic_cmpxchg_inatomic
x86: add user_atomic_cmpxchg_inatomic at uaccess.h
x86, xsave: Support eager-only xsave features, add MPX support
x86, cpufeature: Define the Intel MPX feature flag
Pull x86 kernel address space randomization support from Peter Anvin:
"This enables kernel address space randomization for x86"
* 'x86-kaslr-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86, kaslr: Clarify RANDOMIZE_BASE_MAX_OFFSET
x86, kaslr: Remove unused including <linux/version.h>
x86, kaslr: Use char array to gain sizeof sanity
x86, kaslr: Add a circular multiply for better bit diffusion
x86, kaslr: Mix entropy sources together as needed
x86/relocs: Add percpu fixup for GNU ld 2.23
x86, boot: Rename get_flags() and check_flags() to *_cpuflags()
x86, kaslr: Raise the maximum virtual address to -1 GiB on x86_64
x86, kaslr: Report kernel offset on panic
x86, kaslr: Select random position from e820 maps
x86, kaslr: Provide randomness functions
x86, kaslr: Return location from decompress_kernel
x86, boot: Move CPU flags out of cpucheck
x86, relocs: Add more per-cpu gold special cases
Pull leftover x86 fixes from Ingo Molnar:
"Two leftover fixes that did not make it into v3.13"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86: Add check for number of available vectors before CPU down
x86, cpu, amd: Add workaround for family 16h, erratum 793
Pull x86 RAS changes from Ingo Molnar:
- SCI reporting for other error types not only correctable ones
- GHES cleanups
- Add the functionality to override error reporting agents as some
machines are sporting a new extended error logging capability which,
if done properly in the BIOS, makes a corresponding EDAC module
redundant
- PCIe AER tracepoint severity levels fix
- Error path correction for the mce device init
- MCE timer fix
- Add more flexibility to the error injection (EINJ) debugfs interface
* 'x86-ras-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86, mce: Fix mce_start_timer semantics
ACPI, APEI, GHES: Cleanup ghes memory error handling
ACPI, APEI: Cleanup alignment-aware accesses
ACPI, APEI, GHES: Do not report only correctable errors with SCI
ACPI, APEI, EINJ: Changes to the ACPI/APEI/EINJ debugfs interface
ACPI, eMCA: Combine eMCA/EDAC event reporting priority
EDAC, sb_edac: Modify H/W event reporting policy
EDAC: Add an edac_report parameter to EDAC
PCI, AER: Fix severity usage in aer trace event
x86, mce: Call put_device on device_register failure
Pull x86 microcode loader updates from Ingo Molnar:
"There are two main changes in this tree:
- AMD microcode early loading fixes
- some microcode loader source files reorganization"
* 'x86-microcode-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86, microcode: Move to a proper location
x86, microcode, AMD: Fix early ucode loading
x86, microcode: Share native MSR accessing variants
x86, ramdisk: Export relocated ramdisk VA
Pull x86 EFI changes from Ingo Molnar:
"This consists of two main parts:
- New static EFI runtime services virtual mapping layout which is
groundwork for kexec support on EFI (Borislav Petkov)
- EFI kexec support itself (Dave Young)"
* 'x86-efi-kexec-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (24 commits)
x86/efi: parse_efi_setup() build fix
x86: ksysfs.c build fix
x86/efi: Delete superfluous global variables
x86: Reserve setup_data ranges late after parsing memmap cmdline
x86: Export x86 boot_params to sysfs
x86: Add xloadflags bit for EFI runtime support on kexec
x86/efi: Pass necessary EFI data for kexec via setup_data
efi: Export EFI runtime memory mapping to sysfs
efi: Export more EFI table variables to sysfs
x86/efi: Cleanup efi_enter_virtual_mode() function
x86/efi: Fix off-by-one bug in EFI Boot Services reservation
x86/efi: Add a wrapper function efi_map_region_fixed()
x86/efi: Remove unused variables in __map_region()
x86/efi: Check krealloc return value
x86/efi: Runtime services virtual mapping
x86/mm/cpa: Map in an arbitrary pgd
x86/mm/pageattr: Add last levels of error path
x86/mm/pageattr: Add a PUD error unwinding path
x86/mm/pageattr: Add a PTE pagetable populating function
x86/mm/pageattr: Add a PMD pagetable populating function
...
Pull x86 TLB detection update from Ingo Molnar:
"A single change that extends our TLB cache size detection+reporting
code"
* 'x86-cpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86, cpu: Detect more TLB configuration
Pull x86 cleanups from Ingo Molnar:
"Misc cleanups"
* 'x86-cleanups-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86, cpu, amd: Fix a shadowed variable situation
um, x86: Fix vDSO build
x86: Delete non-required instances of include <linux/init.h>
x86, realmode: Pointer walk cleanups, pull out invariant use of __pa()
x86/traps: Clean up error exception handler definitions
Pull scheduler changes from Ingo Molnar:
- Add the initial implementation of SCHED_DEADLINE support: a real-time
scheduling policy where tasks that meet their deadlines and
periodically execute their instances in less than their runtime quota
see real-time scheduling and won't miss any of their deadlines.
Tasks that go over their quota get delayed (Available to privileged
users for now)
- Clean up and fix preempt_enable_no_resched() abuse all around the
tree
- Do sched_clock() performance optimizations on x86 and elsewhere
- Fix and improve auto-NUMA balancing
- Fix and clean up the idle loop
- Apply various cleanups and fixes
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (60 commits)
sched: Fix __sched_setscheduler() nice test
sched: Move SCHED_RESET_ON_FORK into attr::sched_flags
sched: Fix up attr::sched_priority warning
sched: Fix up scheduler syscall LTP fails
sched: Preserve the nice level over sched_setscheduler() and sched_setparam() calls
sched/core: Fix htmldocs warnings
sched/deadline: No need to check p if dl_se is valid
sched/deadline: Remove unused variables
sched/deadline: Fix sparse static warnings
m68k: Fix build warning in mac_via.h
sched, thermal: Clean up preempt_enable_no_resched() abuse
sched, net: Fixup busy_loop_us_clock()
sched, net: Clean up preempt_enable_no_resched() abuse
sched/preempt: Fix up missed PREEMPT_NEED_RESCHED folding
sched/preempt, locking: Rework local_bh_{dis,en}able()
sched/clock, x86: Avoid a runtime condition in native_sched_clock()
sched/clock: Fix up clear_sched_clock_stable()
sched/clock, x86: Use a static_key for sched_clock_stable
sched/clock: Remove local_irq_disable() from the clocks
sched/clock, x86: Rewrite cyc2ns() to avoid the need to disable IRQs
...
Pull perf updates from Ingo Molnar:
"Kernel side changes:
- Add Intel RAPL energy counter support (Stephane Eranian)
- Clean up uprobes (Oleg Nesterov)
- Optimize ring-buffer writes (Peter Zijlstra)
Tooling side changes, user visible:
- 'perf diff':
- Add column colouring improvements (Ramkumar Ramachandra)
- 'perf kvm':
- Add guest related improvements, including allowing to specify a
directory with guest specific /proc information (Dongsheng Yang)
- Add shell completion support (Ramkumar Ramachandra)
- Add '-v' option (Dongsheng Yang)
- Support --guestmount (Dongsheng Yang)
- 'perf probe':
- Support showing source code, asking for variables to be collected
at probe time and other 'perf probe' operations that use DWARF
information.
This supports only binaries with debugging information at this
time, detached debuginfo (aka debuginfo packages) support should
come in later patches (Masami Hiramatsu)
- 'perf record':
- Rename --no-delay option to --no-buffering, better reflecting its
purpose and freeing up '--delay' to take the place of
'--initial-delay', so that 'record' and 'stat' are consistent
(Arnaldo Carvalho de Melo)
- Default the -t/--thread option to no inheritance (Adrian Hunter)
- Make per-cpu mmaps the default (Adrian Hunter)
- 'perf report':
- Improve callchain processing performance (Frederic Weisbecker)
- Retain bfd reference to lookup source line numbers, greatly
optimizing, among other use cases, 'perf report -s srcline'
(Adrian Hunter)
- Improve callchain processing performance even more (Namhyung Kim)
- Add a perf.data file header window in the 'perf report' TUI,
associated with the 'i' hotkey, providing a counterpart to the
--header option in the stdio UI (Namhyung Kim)
- 'perf script':
- Add an option in 'perf script' to print the source line number
(Adrian Hunter)
- Add --header/--header-only options to 'script' and 'report', the
default is not tho show the header info, but as this has been the
default for some time, leave a single line explaining how to
obtain that information (Jiri Olsa)
- Add options to show comm, fork, exit and mmap PERF_RECORD_ events
(Namhyung Kim)
- Print callchains and symbols if they exist (David Ahern)
- 'perf timechart'
- Add backtrace support to CPU info
- Print pid along the name
- Add support for CPU topology
- Add new option --highlight'ing threads, be it by name or, if a
numeric value is provided, that run more than given duration
(Stanislav Fomichev)
- 'perf top':
- Make 'perf top -g' refer to callchains, for consistency with
other tools (David Ahern)
- 'perf trace':
- Handle old kernels where the "raw_syscalls" tracepoints were
called plain "syscalls" (David Ahern)
- Remove thread summary coloring, by Pekka Enberg.
- Honour -m option in 'trace', the tool was offering the option to
set the mmap size, but wasn't using it when doing the actual mmap
on the events file descriptors (Jiri Olsa)
- generic:
- Backport libtraceevent plugin support (trace-cmd repository, with
plugins for jbd2, hrtimer, kmem, kvm, mac80211, sched_switch,
function, xen, scsi, cfg80211 (Jiri Olsa)
- Print session information only if --stdio is given (Namhyung Kim)
Tooling side changes, developer visible (plumbing):
- Improve 'perf probe' exit path, release resources (Masami
Hiramatsu)
- Improve libtraceevent plugins exit path, allowing the registering
of an unregister handler to be called at exit time (Namhyung Kim)
- Add an alias to the build test makefile (make -C tools/perf
build-test) (Namhyung Kim)
- Get rid of die() and friends (good riddance!) in libtraceevent
(Namhyung Kim)
- Fix cross build problems related to pkgconfig and CROSS_COMPILE not
being propagated to the feature tests, leading to features being
tested in the host and then being enabled on the target (Mark
Rutland)
- Improve forked workload error reporting by sending the errno in the
signal data queueing integer field, using sigqueue and by doing the
signal setup in the evlist methods, removing open coded equivalents
in various tools (Arnaldo Carvalho de Melo)
- Do more auto exit cleanup chores in the 'evlist' destructor, so
that the tools don't have to all do that sequence (Arnaldo Carvalho
de Melo)
- Pack 'struct perf_session_env' and 'struct trace' (Arnaldo Carvalho
de Melo)
- Add test for building detached source tarballs (Arnaldo Carvalho de
Melo)
- Move some header files (tools/perf/ to tools/include/ to make them
available to other tools/ dwelling codebases (Namhyung Kim)
- Move logic to warn about kptr_restrict'ed kernels to separate
function in 'report' (Arnaldo Carvalho de Melo)
- Move hist browser selection code to separate function (Arnaldo
Carvalho de Melo)
- Move histogram entries collapsing to separate function (Arnaldo
Carvalho de Melo)
- Introduce evlist__for_each() & friends (Arnaldo Carvalho de Melo)
- Automate setup of FEATURE_CHECK_(C|LD)FLAGS-all variables (Jiri
Olsa)
- Move arch setup into seprate Makefile (Jiri Olsa)
- Make libtraceevent install target quieter (Jiri Olsa)
- Make tests/make output more compact (Jiri Olsa)
- Ignore generated files in feature-checks (Chunwei Chen)
- Introduce pevent_filter_strerror() in libtraceevent, similar in
purpose to libc's strerror() function (Namhyung Kim)
- Use perf_data_file methods to write output file in 'record' and
'inject' (Jiri Olsa)
- Use pr_*() functions where applicable in 'report' (Namhyumg Kim)
- Add 'machine' 'addr_location' struct to have full picture (machine,
thread, map, symbol, addr) for a (partially) resolved address,
reducing function signatures (Arnaldo Carvalho de Melo)
- Reduce code duplication in the histogram entry creation/insertion
(Arnaldo Carvalho de Melo)
- Auto allocate annotation histogram data structures (Arnaldo
Carvalho de Melo)
- No need to test against NULL before calling free, also set freed
memory in struct pointers to NULL, to help fixing use after free
bugs (Arnaldo Carvalho de Melo)
- Rename some struct DSO binary_type related members and methods, to
clarify its purpose and need for differentiation (symtab_type, ie
one is about the files .text, CFI, etc, i.e. its binary contents,
and the other is about where the symbol table came from (Arnaldo
Carvalho de Melo)
- Convert to new topic libraries, starting with an API one (sysfs,
debugfs, etc), renaming liblk in the process (Borislav Petkov)
- Get rid of some more panic() like error handling in libtraceevent.
(Namhyung Kim)
- Get rid of panic() like calls in libtraceevent (Namyung Kim)
- Start carving out symbol parsing routines (perf, just moving
routines to topic files in tools/lib/symbol/, tools that want to
use it need to integrate it directly, ie no
tools/lib/symbol/Makefile is provided (Arnaldo Carvalho de Melo)
- Assorted refactoring patches, moving code around and adding utility
evlist methods that will be used in the IPT patchset (Adrian
Hunter)
- Assorted mmap_pages handling fixes (Adrian Hunter)
- Several man pages typo fixes (Dongsheng Yang)
- Get rid of several die() calls in libtraceevent (Namhyung Kim)
- Use basename() in a more robust way, to avoid problems related to
different system library implementations for that function
(Stephane Eranian)
- Remove open coded management of short_name_allocated member (Adrian
Hunter)
- Several cleanups in the "dso" methods, constifying some parameters
and renaming some fields to clarify its purpose (Arnaldo Carvalho
de Melo)
- Add per-feature check flags, fixing libunwind related build
problems on some architectures (Jean Pihet)
- Do not disable source line lookup just because of one failure.
(Adrian Hunter)
- Several 'perf kvm' man page corrections (Dongsheng Yang)
- Correct the message in feature-libnuma checking, swowing the right
devel package names for various distros (Dongsheng Yang)
- Polish 'readn()' function and introduce its counterpart,
'writen()' (Jiri Olsa)
- Start moving timechart state from global variables to a 'perf_tool'
derived 'timechart' struct (Arnaldo Carvalho de Melo)
... and lots of fixes and improvements I forgot to list"
* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (282 commits)
perf tools: Remove unnecessary callchain cursor state restore on unmatch
perf callchain: Spare double comparison of callchain first entry
perf tools: Do proper comm override error handling
perf symbols: Export elf_section_by_name and reuse
perf probe: Release all dynamically allocated parameters
perf probe: Release allocated probe_trace_event if failed
perf tools: Add 'build-test' make target
tools lib traceevent: Unregister handler when xen plugin is unloaded
tools lib traceevent: Unregister handler when scsi plugin is unloaded
tools lib traceevent: Unregister handler when jbd2 plugin is is unloaded
tools lib traceevent: Unregister handler when cfg80211 plugin is unloaded
tools lib traceevent: Unregister handler when mac80211 plugin is unloaded
tools lib traceevent: Unregister handler when sched_switch plugin is unloaded
tools lib traceevent: Unregister handler when kvm plugin is unloaded
tools lib traceevent: Unregister handler when kmem plugin is unloaded
tools lib traceevent: Unregister handler when hrtimer plugin is unloaded
tools lib traceevent: Unregister handler when function plugin is unloaded
tools lib traceevent: Add pevent_unregister_print_function()
tools lib traceevent: Add pevent_unregister_event_handler()
tools lib traceevent: fix pointer-integer size mismatch
...
Pull IRQ changes from Ingo Molnar:
"The only change in this cycle is a CPU hotplug related spurious
warning fix"
* 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/irq: Fix kbuild warning in smp_irq_move_cleanup_interrupt()
x86/irq: Fix do_IRQ() interrupt warning for cpu hotplug retriggered irqs
If we aren't going to use the local APIC anyway, we obviously don't
care about its timer frequency.
Link: http://lkml.kernel.org/r/tip-rgm7xmg7k6qnjlw3ynkcjsmh@git.kernel.org
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Cc: Bin Gao <bin.gao@intel.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
On AMD family 10h we see following error messages while waking up from
S3 for all non-boot CPUs leading to a failed IBS initialization:
Enabling non-boot CPUs ...
smpboot: Booting Node 0 Processor 1 APIC 0x1
[Firmware Bug]: cpu 1, try to use APIC500 (LVT offset 0) for vector 0x400, but the register is already in use for vector 0xf9 on another cpu
perf: IBS APIC setup failed on cpu #1
process: Switch to broadcast mode on CPU1
CPU1 is up
...
ACPI: Waking up from system sleep state S3
Reason for this is that during suspend the LVT offset for the IBS
vector gets lost and needs to be reinialized while resuming.
The offset is read from the IBSCTL msr. On family 10h the offset needs
to be 1 as offset 0 is used for the MCE threshold interrupt, but
firmware assings it for IBS to 0 too. The kernel needs to reprogram
the vector. The msr is a readonly node msr, but a new value can be
written via pci config space access. The reinitialization is
implemented for family 10h in setup_ibs_ctl() which is forced during
IBS setup.
This patch fixes IBS setup after waking up from S3 by adding
resume/supend hooks for the boot cpu which does the offset
reinitialization.
Marking it as stable to let distros pick up this fix.
Signed-off-by: Robert Richter <rric@kernel.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: <stable@vger.kernel.org> v3.2..
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1389797849-5565-1-git-send-email-rric.net@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
On SoCs that have the calibration MSRs available, either there is no
PIT, HPET or PMTIMER to calibrate against, or the PIT/HPET/PMTIMER is
driven from the same clock as the TSC, so calibration is redundant and
just slows down the boot.
TSC rate is caculated by this formula:
<maximum core-clock to bus-clock ratio> * <maximum resolved frequency>
The ratio and the resolved frequency ID can be obtained from MSR.
See Intel 64 and IA-32 System Programming Guid section 16.12 and 30.11.5
for details.
Signed-off-by: Bin Gao <bin.gao@intel.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Link: http://lkml.kernel.org/n/tip-rgm7xmg7k6qnjlw3ynkcjsmh@git.kernel.org
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=64791
When a cpu is downed on a system, the irqs on the cpu are assigned to
other cpus. It is possible, however, that when a cpu is downed there
aren't enough free vectors on the remaining cpus to account for the
vectors from the cpu that is being downed.
This results in an interesting "overflow" condition where irqs are
"assigned" to a CPU but are not handled.
For example, when downing cpus on a 1-64 logical processor system:
<snip>
[ 232.021745] smpboot: CPU 61 is now offline
[ 238.480275] smpboot: CPU 62 is now offline
[ 245.991080] ------------[ cut here ]------------
[ 245.996270] WARNING: CPU: 0 PID: 0 at net/sched/sch_generic.c:264 dev_watchdog+0x246/0x250()
[ 246.005688] NETDEV WATCHDOG: p786p1 (ixgbe): transmit queue 0 timed out
[ 246.013070] Modules linked in: lockd sunrpc iTCO_wdt iTCO_vendor_support sb_edac ixgbe microcode e1000e pcspkr joydev edac_core lpc_ich ioatdma ptp mdio mfd_core i2c_i801 dca pps_core i2c_core wmi acpi_cpufreq isci libsas scsi_transport_sas
[ 246.037633] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 3.12.0+ #14
[ 246.044451] Hardware name: Intel Corporation S4600LH ........../SVRBD-ROW_T, BIOS SE5C600.86B.01.08.0003.022620131521 02/26/2013
[ 246.057371] 0000000000000009 ffff88081fa03d40 ffffffff8164fbf6 ffff88081fa0ee48
[ 246.065728] ffff88081fa03d90 ffff88081fa03d80 ffffffff81054ecc ffff88081fa13040
[ 246.074073] 0000000000000000 ffff88200cce0000 0000000000000040 0000000000000000
[ 246.082430] Call Trace:
[ 246.085174] <IRQ> [<ffffffff8164fbf6>] dump_stack+0x46/0x58
[ 246.091633] [<ffffffff81054ecc>] warn_slowpath_common+0x8c/0xc0
[ 246.098352] [<ffffffff81054fb6>] warn_slowpath_fmt+0x46/0x50
[ 246.104786] [<ffffffff815710d6>] dev_watchdog+0x246/0x250
[ 246.110923] [<ffffffff81570e90>] ? dev_deactivate_queue.constprop.31+0x80/0x80
[ 246.119097] [<ffffffff8106092a>] call_timer_fn+0x3a/0x110
[ 246.125224] [<ffffffff8106280f>] ? update_process_times+0x6f/0x80
[ 246.132137] [<ffffffff81570e90>] ? dev_deactivate_queue.constprop.31+0x80/0x80
[ 246.140308] [<ffffffff81061db0>] run_timer_softirq+0x1f0/0x2a0
[ 246.146933] [<ffffffff81059a80>] __do_softirq+0xe0/0x220
[ 246.152976] [<ffffffff8165fedc>] call_softirq+0x1c/0x30
[ 246.158920] [<ffffffff810045f5>] do_softirq+0x55/0x90
[ 246.164670] [<ffffffff81059d35>] irq_exit+0xa5/0xb0
[ 246.170227] [<ffffffff8166062a>] smp_apic_timer_interrupt+0x4a/0x60
[ 246.177324] [<ffffffff8165f40a>] apic_timer_interrupt+0x6a/0x70
[ 246.184041] <EOI> [<ffffffff81505a1b>] ? cpuidle_enter_state+0x5b/0xe0
[ 246.191559] [<ffffffff81505a17>] ? cpuidle_enter_state+0x57/0xe0
[ 246.198374] [<ffffffff81505b5d>] cpuidle_idle_call+0xbd/0x200
[ 246.204900] [<ffffffff8100b7ae>] arch_cpu_idle+0xe/0x30
[ 246.210846] [<ffffffff810a47b0>] cpu_startup_entry+0xd0/0x250
[ 246.217371] [<ffffffff81646b47>] rest_init+0x77/0x80
[ 246.223028] [<ffffffff81d09e8e>] start_kernel+0x3ee/0x3fb
[ 246.229165] [<ffffffff81d0989f>] ? repair_env_string+0x5e/0x5e
[ 246.235787] [<ffffffff81d095a5>] x86_64_start_reservations+0x2a/0x2c
[ 246.242990] [<ffffffff81d0969f>] x86_64_start_kernel+0xf8/0xfc
[ 246.249610] ---[ end trace fb74fdef54d79039 ]---
[ 246.254807] ixgbe 0000:c2:00.0 p786p1: initiating reset due to tx timeout
[ 246.262489] ixgbe 0000:c2:00.0 p786p1: Reset adapter
Last login: Mon Nov 11 08:35:14 from 10.18.17.119
[root@(none) ~]# [ 246.792676] ixgbe 0000:c2:00.0 p786p1: detected SFP+: 5
[ 249.231598] ixgbe 0000:c2:00.0 p786p1: NIC Link is Up 10 Gbps, Flow Control: RX/TX
[ 246.792676] ixgbe 0000:c2:00.0 p786p1: detected SFP+: 5
[ 249.231598] ixgbe 0000:c2:00.0 p786p1: NIC Link is Up 10 Gbps, Flow Control: RX/TX
(last lines keep repeating. ixgbe driver is dead until module reload.)
If the downed cpu has more vectors than are free on the remaining cpus on the
system, it is possible that some vectors are "orphaned" even though they are
assigned to a cpu. In this case, since the ixgbe driver had a watchdog, the
watchdog fired and notified that something was wrong.
This patch adds a function, check_vectors(), to compare the number of vectors
on the CPU going down and compares it to the number of vectors available on
the system. If there aren't enough vectors for the CPU to go down, an
error is returned and propogated back to userspace.
v2: Do not need to look at percpu irqs
v3: Need to check affinity to prevent counting of MSIs in IOAPIC Lowest
Priority Mode
v4: Additional changes suggested by Gong Chen.
v5/v6/v7/v8: Updated comment text
Signed-off-by: Prarit Bhargava <prarit@redhat.com>
Link: http://lkml.kernel.org/r/1389613861-3853-1-git-send-email-prarit@redhat.com
Reviewed-by: Gong Chen <gong.chen@linux.intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Seiji Aguchi <seiji.aguchi@hds.com>
Cc: Yang Zhang <yang.z.zhang@Intel.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Janet Morgan <janet.morgan@intel.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Ruiv Wang <ruiv.wang@gmail.com>
Cc: Gong Chen <gong.chen@linux.intel.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Cc: <stable@vger.kernel.org>
Make disabled_cpu_apicid static and read_mostly, and fix a couple of
typos.
Reported-by: Ingo Molnar <mingo@kernel.org>
Link: http://lkml.kernel.org/r/20140115182511.GA22737@gmail.com
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Cc: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
Add disable_cpu_apicid kernel parameter. To use this kernel parameter,
specify an initial APIC ID of the corresponding CPU you want to
disable.
This is mostly used for the kdump 2nd kernel to disable BSP to wake up
multiple CPUs without causing system reset or hang due to sending INIT
from AP to BSP.
Kdump users first figure out initial APIC ID of the BSP, CPU0 in the
1st kernel, for example from /proc/cpuinfo and then set up this kernel
parameter for the 2nd kernel using the obtained APIC ID.
However, doing this procedure at each boot time manually is awkward,
which should be automatically done by user-land service scripts, for
example, kexec-tools on fedora/RHEL distributions.
This design is more flexible than disabling BSP in kernel boot time
automatically in that in kernel boot time we have no choice but
referring to ACPI/MP table to obtain initial APIC ID for BSP, meaning
that the method is not applicable to the systems without such BIOS
tables.
One assumption behind this design is that users get initial APIC ID of
the BSP in still healthy state and so BSP is uniquely kept in
CPU0. Thus, through the kernel parameter, only one initial APIC ID can
be specified.
In a comparison with disabled_cpu_apicid, we use read_apic_id(), not
boot_cpu_physical_apicid, because on some platforms, the variable is
modified to the apicid reported as BSP through MP table and this
function is executed with the temporarily modified
boot_cpu_physical_apicid. As a result, disabled_cpu_apicid kernel
parameter doesn't work well for apicids of APs.
Fixing the wrong handling of boot_cpu_physical_apicid requires some
reviews and tests beyond some platforms and it could take some
time. The fix here is a kind of workaround to focus on the main topic
of this patch.
Signed-off-by: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
Link: http://lkml.kernel.org/r/20140115064458.1545.38775.stgit@localhost6.localdomain6
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Having u32 and struct cpuinfo_x86 * by the same name is not very smart,
although it was ok in this case due to the limited scope of u32 c and it
being used only once in there.
Fix this.
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: http://lkml.kernel.org/r/1389786735-16751-1-git-send-email-bp@alien8.de
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
This adds the workaround for erratum 793 as a precaution in case not
every BIOS implements it. This addresses CVE-2013-6885.
Erratum text:
[Revision Guide for AMD Family 16h Models 00h-0Fh Processors,
document 51810 Rev. 3.04 November 2013]
793 Specific Combination of Writes to Write Combined Memory Types and
Locked Instructions May Cause Core Hang
Description
Under a highly specific and detailed set of internal timing
conditions, a locked instruction may trigger a timing sequence whereby
the write to a write combined memory type is not flushed, causing the
locked instruction to stall indefinitely.
Potential Effect on System
Processor core hang.
Suggested Workaround
BIOS should set MSR
C001_1020[15] = 1b.
Fix Planned
No fix planned
[ hpa: updated description, fixed typo in MSR name ]
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: http://lkml.kernel.org/r/20140114230711.GS29865@pd.tnic
Tested-by: Aravind Gopalakrishnan <aravind.gopalakrishnan@amd.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Currently we do a read, a dummy write and a final read to fetch
the error code. The value from the final read is taken.
This is not the recommended way and leads to corrupted/lost ESR
values.
Intel(c) 64 and IA-32 Architectures Software Developer's Manual,
Combined Volumes 1, 2ABC, 3ABC, Section 10.5.3 states:
Before attempt to read from the ESR, software should first
write to it. (The value written does not affect the values read
subsequently; only zero may be written in x2APIC mode.) This
write clears any previously logged errors and updates the ESR
with any errors detected since the last write to the ESR.
This write also rearms the APIC error interrupt triggering
mechanism.
This patch removes the first read such that we are conform with
the manual.
On my (very old) Pentium MMX SMP system this patch fixes the
issue that APIC errors:
a) are not always reported and
b) are reported with false error numbers.
Signed-off-by: Richard Weinberger <richard@nod.at>
Cc: seiji.aguchi@hds.com
Cc: rientjes@google.com
Cc: konrad.wilk@oracle.com
Cc: bp@alien8.de
Cc: Yinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/1389685487-20872-1-git-send-email-richard@nod.at
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We've grown a bunch of microcode loader files all prefixed with
"microcode_". They should be under cpu/ because this is strictly
CPU-related functionality so do that and drop the prefix since they're
in their own directory now which gives that prefix. :)
While at it, drop MICROCODE_INTEL_LIB config item and stash the
functionality under CONFIG_MICROCODE_INTEL as it was its only user.
Signed-off-by: Borislav Petkov <bp@suse.de>
Tested-by: Aravind Gopalakrishnan <Aravind.Gopalakrishnan@amd.com>
The original idea to use the microcode cache for the APs doesn't pan out
because we do memory allocation there very early and with IRQs disabled
and we don't want to involve GFP_ATOMIC allocations. Not if it can be
helped.
Thus, extend the caching of the BSP patch approach to the APs and
iterate over the ucode in the initrd instead of using the cache. We
still save the relevant patches to it but later, right before we
jettison the initrd.
While at it, fix early ucode loading on 32-bit too.
Signed-off-by: Borislav Petkov <bp@suse.de>
Tested-by: Aravind Gopalakrishnan <Aravind.Gopalakrishnan@amd.com>
We want to use those in AMD's early loading path too. Also, add a
native_wrmsrl variant.
Signed-off-by: Borislav Petkov <bp@suse.de>
Tested-by: Aravind Gopalakrishnan <Aravind.Gopalakrishnan@amd.com>
The ramdisk can possibly get relocated if the whole image is not mapped.
And since we're going over it in the microcode loader and fishing out
the relevant microcode patches, we want access it at its new location.
Thus, export it.
Signed-off-by: Borislav Petkov <bp@suse.de>
Tested-by: Aravind Gopalakrishnan <Aravind.Gopalakrishnan@amd.com>
Use a ring-buffer like multi-version object structure which allows
always having a coherent object; we use this to avoid having to
disable IRQs while reading sched_clock() and avoids a problem when
getting an NMI while changing the cyc2ns data.
MAINLINE PRE POST
sched_clock_stable: 1 1 1
(cold) sched_clock: 329841 331312 257223
(cold) local_clock: 301773 310296 309889
(warm) sched_clock: 38375 38247 25280
(warm) local_clock: 100371 102713 85268
(warm) rdtsc: 27340 27289 24247
sched_clock_stable: 0 0 0
(cold) sched_clock: 382634 372706 301224
(cold) local_clock: 396890 399275 399870
(warm) sched_clock: 38194 38124 25630
(warm) local_clock: 143452 148698 129629
(warm) rdtsc: 27345 27365 24307
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/n/tip-s567in1e5ekq2nlyhn8f987r@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Fengguang Wu's 0day kernel build service reported the following build warning:
arch/x86/kernel/apic/io_apic.c:2211
smp_irq_move_cleanup_interrupt() warn: always true condition '(irq <= -1) => (0-u32max <= (-1))'
because irq is defined as an unsigned int instead of an int.
Fix this trivial error by redefining irq as a signed int. The
remaining consumers of the int are okay.
Signed-off-by: Prarit Bhargava <prarit@redhat.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Link: http://lkml.kernel.org/r/1389620420-7110-1-git-send-email-prarit@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
There are no __cycles_2_ns() users outside of arch/x86/kernel/tsc.c,
so move it there.
There are no cycles_2_ns() users.
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/n/tip-01lslnavfgo3kmbo4532zlcj@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
So mce_start_timer() has a 'cpu' argument which is supposed to mean to
start a timer on that cpu. However, the code currently starts a timer on
the *current* cpu the function runs on and causes the sanity-check in
mce_timer_fn to fire:
WARNING: CPU: 0 PID: 0 at arch/x86/kernel/cpu/mcheck/mce.c:1286 mce_timer_fn
because it is running on the wrong cpu.
This was triggered by Prarit Bhargava <prarit@redhat.com> by offlining
all the cpus in succession.
Then, we were fiddling with the CMCI storm settings when starting the
timer whereas there's no need for that - if there's storm happening
on this newly restarted cpu, we're going to be in normal CMCI mode
initially and then when the CMCI interrupt starts firing, we're going to
go to the polling mode with the timer real soon.
Signed-off-by: Borislav Petkov <bp@suse.de>
Tested-by: Prarit Bhargava <prarit@redhat.com>
Cc: Tony Luck <tony.luck@intel.com>
Reviewed-by: Chen, Gong <gong.chen@linux.intel.com>
Link: http://lkml.kernel.org/r/1387722156-5511-1-git-send-email-prarit@redhat.com
During heavy CPU-hotplug operations the following spurious kernel warnings
can trigger:
do_IRQ: No ... irq handler for vector (irq -1)
[ See: https://bugzilla.kernel.org/show_bug.cgi?id=64831 ]
When downing a cpu it is possible that there are unhandled irqs
left in the APIC IRR register. The following code path shows
how the problem can occur:
1. CPU 5 is to go down.
2. cpu_disable() on CPU 5 executes with interrupt flag cleared
by local_irq_save() via stop_machine().
3. IRQ 12 asserts on CPU 5, setting IRR but not ISR because
interrupt flag is cleared (CPU unabled to handle the irq)
4. IRQs are migrated off of CPU 5, and the vectors' irqs are set
to -1. 5. stop_machine() finishes cpu_disable()
6. cpu_die() for CPU 5 executes in normal context.
7. CPU 5 attempts to handle IRQ 12 because the IRR is set for
IRQ 12. The code attempts to find the vector's IRQ and cannot
because it has been set to -1. 8. do_IRQ() warning displays
warning about CPU 5 IRQ 12.
I added a debug printk to output which CPU & vector was
retriggered and discovered that that we are getting bogus
events. I see a 100% correlation between this debug printk in
fixup_irqs() and the do_IRQ() warning.
This patchset resolves this by adding definitions for
VECTOR_UNDEFINED(-1) and VECTOR_RETRIGGERED(-2) and modifying
the code to use them.
Fixes: https://bugzilla.kernel.org/show_bug.cgi?id=64831
Signed-off-by: Prarit Bhargava <prarit@redhat.com>
Reviewed-by: Rui Wang <rui.y.wang@intel.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Seiji Aguchi <seiji.aguchi@hds.com>
Cc: Yang Zhang <yang.z.zhang@Intel.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: janet.morgan@Intel.com
Cc: tony.luck@Intel.com
Cc: ruiv.wang@gmail.com
Link: http://lkml.kernel.org/r/1388938252-16627-1-git-send-email-prarit@redhat.com
[ Cleaned up the code a bit. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This patch adds support for the Intel RAPL energy counter
PP1 (Power Plane 1).
On client processors, it usually corresponds to the
energy consumption of the builtin graphic card. That
is why the sysfs event is called energy-gpu.
New event:
- name: power/energy-gpu/
- code: event=0x4
- unit: 2^-32 Joules
On processors without graphics, this should count 0.
The patch only enables this event on client processors.
Reviewed-by: Maria Dimakopoulou <maria.n.dimakopoulou@gmail.com>
Signed-off-by: Stephane Eranian <eranian@google.com>
Cc: ak@linux.intel.com
Cc: acme@redhat.com
Cc: jolsa@redhat.com
Cc: zheng.z.yan@intel.com
Cc: bp@alien8.de
Cc: vincent.weaver@maine.edu
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1389176153-3128-3-git-send-email-eranian@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Function tracing callbacks expect to have the ftrace_ops that registered it
passed to them, not the address of the variable that holds the ftrace_ops
that registered it.
Use a mov instead of a lea to store the ftrace_ops into the parameter
of the function tracing callback.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Link: http://lkml.kernel.org/r/20131113152004.459787f9@gandalf.local.home
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Cc: <stable@vger.kernel.org> # v3.8+
Current Intel SOC cores use a MailBox Interface (MBI) to provide access to
configuration registers on devices (called units) connected to the system
fabric. This is a support driver that implements access to this interface on
those platforms that can enumerate the device using PCI. Initial support is for
BayTrail, for which port definitons are provided. This is a requirement for
implementing platform specific features (e.g. RAPL driver requires this to
perform platform specific power management using the registers in PUNIT).
Dependant modules should select IOSF_MBI in their respective Kconfig
configuraiton. Serialized access is handled by all exported routines with
spinlocks.
The API includes 3 functions for access to unit registers:
int iosf_mbi_read(u8 port, u8 opcode, u32 offset, u32 *mdr)
int iosf_mbi_write(u8 port, u8 opcode, u32 offset, u32 mdr)
int iosf_mbi_modify(u8 port, u8 opcode, u32 offset, u32 mdr, u32 mask)
port: indicating the unit being accessed
opcode: the read or write port specific opcode
offset: the register offset within the port
mdr: the register data to be read, written, or modified
mask: bit locations in mdr to change
Returns nonzero on error
Note: GPU code handles access to the GFX unit. Therefore access to that unit
with this driver is disallowed to avoid conflicts.
Signed-off-by: David E. Box <david.e.box@linux.intel.com>
Link: http://lkml.kernel.org/r/1389216471-734-1-git-send-email-david.e.box@linux.intel.com
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
Cc: Matthew Garrett <mjg59@srcf.ucam.org>
None of these files are actually using any __init type directives
and hence don't need to include <linux/init.h>. Most are just a
left over from __devinit and __cpuinit removal, or simply due to
code getting copied from one driver to the next.
[ hpa: undid incorrect removal from arch/x86/kernel/head_32.S ]
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Link: http://lkml.kernel.org/r/1389054026-12947-1-git-send-email-paul.gortmaker@windriver.com
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
kbuild test robot report below error for randconfig:
arch/x86/kernel/ksysfs.c: In function 'get_setup_data_paddr':
arch/x86/kernel/ksysfs.c:81:3: error: implicit declaration of function 'ioremap_cache' [-Werror=implicit-function-declaration]
arch/x86/kernel/ksysfs.c:86:3: error: implicit declaration of function 'iounmap' [-Werror=implicit-function-declaration]
Fix it by including <asm/io.h> in ksysfs.c
Signed-off-by: Dave Young <dyoung@redhat.com>
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
Pull x86 fixes from Peter Anvin:
"There is a small EFI fix and a big power regression fix in this batch.
My queue also had a fix for downing a CPU when there are insufficient
number of IRQ vectors available, but I'm holding that one for now due
to recent bug reports"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/efi: Don't select EFI from certain special ACPI drivers
x86 idle: Repair large-server 50-watt idle-power regression
Currently e820_reserve_setup_data() is called before parsing early
params, it works in normal case. But for memmap=exactmap, the final
memory ranges are created after parsing memmap= cmdline params, so the
previous e820_reserve_setup_data() has no effect. For example,
setup_data ranges will still be marked as normal system ram, thus when
later sysfs driver ioremap them kernel will warn about mapping normal
ram.
This patch fix it by moving the e820_reserve_setup_data() callback after
parsing early params so they can be set as reserved ranges and later
ioremap will be fine with it.
Signed-off-by: Dave Young <dyoung@redhat.com>
Acked-by: Borislav Petkov <bp@suse.de>
Tested-by: Toshi Kani <toshi.kani@hp.com>
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
kexec-tools use boot_params for getting the 1st kernel hardware_subarch,
the kexec kernel EFI runtime support also needs to read the old efi_info
from boot_params. Currently it exists in debugfs which is not a good
place for such infomation. Per HPA, we should avoid "sploit debugfs".
In this patch /sys/kernel/boot_params are exported, also the setup_data is
exported as a subdirectory. kexec-tools is using debugfs for hardware_subarch
for a long time now so we're not removing it yet.
Structure is like below:
/sys/kernel/boot_params
|__ data /* boot_params in binary*/
|__ setup_data
| |__ 0 /* the first setup_data node */
| | |__ data /* setup_data node 0 in binary*/
| | |__ type /* setup_data type of setup_data node 0, hex string */
[snip]
|__ version /* boot protocal version (in hex, "0x" prefixed)*/
Signed-off-by: Dave Young <dyoung@redhat.com>
Acked-by: Borislav Petkov <bp@suse.de>
Tested-by: Toshi Kani <toshi.kani@hp.com>
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
Add a new setup_data type SETUP_EFI for kexec use. Passing the saved
fw_vendor, runtime, config tables and EFI runtime mappings.
When entering virtual mode, directly mapping the EFI runtime regions
which we passed in previously. And skip the step to call
SetVirtualAddressMap().
Specially for HP z420 workstation we need save the smbios physical
address. The kernel boot sequence proceeds in the following order.
Step 2 requires efi.smbios to be the physical address. However, I found
that on HP z420 EFI system table has a virtual address of SMBIOS in step
1. Hence, we need set it back to the physical address with the smbios
in efi_setup_data. (When it is still the physical address, it simply
sets the same value.)
1. efi_init() - Set efi.smbios from EFI system table
2. dmi_scan_machine() - Temporary map efi.smbios to access SMBIOS table
3. efi_enter_virtual_mode() - Map EFI ranges
Tested on ovmf+qemu, lenovo thinkpad, a dell laptop and an
HP z420 workstation.
Signed-off-by: Dave Young <dyoung@redhat.com>
Tested-by: Toshi Kani <toshi.kani@hp.com>
Signed-off-by: Matt Fleming <matt.fleming@intel.com>