linux

mirror of https://github.com/torvalds/linux.git synced 2024-11-08 05:01:48 +00:00

Author	SHA1	Message	Date
Rusty Russell	7e1941444f	lguest: remove remaining vmcall We switched back from using vmcall in `091ebf07a2` because it was unreliable under kvm, but I missed one (rarely-used) place. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>	2011-07-22 14:39:49 +09:30
Rusty Russell	5dea1c88ed	lguest: use a special 1:1 linear pagetable mode until first switch. The Host used to create some page tables for the Guest to use at the top of Guest memory; it would then tell the Guest where this was. In particular, it created linear mappings for 0 and 0xC0000000 addresses because lguest used to switch to its real page tables quite late in boot. However, since `d50d8fe19` Linux initialized boot page tables in head_32.S even before the "are we lguest?" boot jump. So, now we can simplify things: the Host pagetable code assumes 1:1 linear mapping until it first calls the LHCALL_NEW_PGTABLE hypercall, which we now do before we reach C code. This also means that the Host doesn't need to know anything about the Guest's PAGE_OFFSET. (Non-Linux guests might not even have such a thing). Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>	2011-07-22 14:39:48 +09:30
Andy Lutomirski	aafade242f	x86-64, vdso: Do not allocate memory for the vDSO We can map the vDSO straight from kernel data, saving a few page allocations. As an added bonus, the deleted code contained a memory leak. Signed-off-by: Andy Lutomirski <luto@mit.edu> Link: http://lkml.kernel.org/r/2c4ed5c2c2e93603790229e0c3403ae506ccc0cb.1311277573.git.luto@mit.edu Signed-off-by: H. Peter Anvin <hpa@zytor.com>	2011-07-21 13:41:53 -07:00
H. Peter Anvin	ae7bd11b47	clocksource: Change __ARCH_HAS_CLOCKSOURCE_DATA to a CONFIG option The machinery for __ARCH_HAS_CLOCKSOURCE_DATA assumed a file in asm-generic would be the default for architectures without their own file in asm/, but that is not how it works. Replace it with a Kconfig option instead. Link: http://lkml.kernel.org/r/4E288AA6.7090804@zytor.com Signed-off-by: H. Peter Anvin <hpa@zytor.com> Cc: Andy Lutomirski <luto@mit.edu> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Tony Luck <tony.luck@intel.com>	2011-07-21 13:34:05 -07:00
H. Peter Anvin	a536877e77	x86: Make Dell Latitude E6420 use reboot=pci Yet another variant of the Dell Latitude series which requires reboot=pci. From the E5420 bug report by Daniel J Blueman: > The E6420 is affected also (same platform, different casing and > features), which provides an external confirmation of the issue; I can > submit a patch for that later or include it if you prefer: > http://linux.koolsolutions.com/2009/08/04/howto-fix-linux-hangfreeze-during-reboots-and-restarts/ Reported-by: Daniel J Blueman <daniel.blueman@gmail.com> Signed-off-by: H. Peter Anvin <hpa@zytor.com> Cc: <stable@kernel.org>	2011-07-21 11:47:17 -07:00
Daniel J Blueman	b7798d28ec	x86: Make Dell Latitude E5420 use reboot=pci Rebooting on the Dell E5420 often hangs with the keyboard or ACPI methods, but is reliable via the PCI method. [ hpa: this was deferred because we believed for a long time that the recent reshuffling of the boot priorities in commit `660e34cebf` fixed this platform. Unfortunately that turned out to be incorrect. ] Signed-off-by: Daniel J Blueman <daniel.blueman@gmail.com> Link: http://lkml.kernel.org/r/1305248699-2347-1-git-send-email-daniel.blueman@gmail.com Signed-off-by: H. Peter Anvin <hpa@zytor.com> Cc: <stable@kernel.org>	2011-07-21 11:45:49 -07:00
Robert Richter	1ac2e6ca44	x86, perf: Make copy_from_user_nmi() a library function copy_from_user_nmi() is used in oprofile and perf. Moving it to other library functions like copy_from_user(). As this is x86 code for 32 and 64 bits, create a new file usercopy.c for unified code. Signed-off-by: Robert Richter <robert.richter@amd.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/20110607172413.GJ20052@erda.amd.com Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-07-21 20:41:57 +02:00
Cyrill Gorcunov	f53173e47d	x86, perf: P4 PMU - Fix typos in comments and style cleanup This patch: - fixes typos in comments and clarifies the text - renames obscure p4_event_alias::original and ::alter members to ::original and ::alternative as appropriate - drops parenthesis from the return of p4_get_alias_event() No functional changes. Reported-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org> Link: http://lkml.kernel.org/r/20110721160625.GX7492@sun Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-07-21 20:41:54 +02:00
Phil Carmody	497888cf69	treewide: fix potentially dangerous trailing ';' in #defined values/expressions All these are instances of #define NAME value; or #define NAME(params_opt) value; These of course fail to build when used in contexts like if(foo $OP NAME) while(bar $OP NAME) and may silently generate the wrong code in contexts such as foo = NAME + 1; /* foo = value; + 1; */ bar = NAME - 1; /* bar = value; - 1; */ baz = NAME & quux; /* baz = value; & quux; */ Reported on comp.lang.c, Message-ID: <ab0d55fe-25e5-482b-811e-c475aa6065c3@c29g2000yqd.googlegroups.com> Initial analysis of the dangers provided by Keith Thompson in that thread. There are many more instances of more complicated macros having unnecessary trailing semicolons, but this pile seems to be all of the cases of simple values suffering from the problem. (Thus things that are likely to be found in one of the contexts above, more complicated ones aren't.) Signed-off-by: Phil Carmody <ext-phil.2.carmody@nokia.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>	2011-07-21 14:10:00 +02:00
Huang Ying	050438ed5a	kexec, x86: Fix incorrect jump back address if not preserving context In kexec jump support, the jump back address is passed to the kexeced kernel via the function calling ABI, that is, the function call return address is the jump back entry. Furthermore, jump back entry == 0 should be used to signal that the jump back or preserve context is not enabled in the original kernel. But in the current implementation the stack position used for the function call return address is not cleared when context preservation is disabled. The patch fixes this bug. Reported-and-tested-by: Yin Kangkai <kangkai.yin@intel.com> Signed-off-by: Huang Ying <ying.huang@intel.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: <stable@kernel.org> Link: http://lkml.kernel.org/r/1310607277-25029-1-git-send-email-ying.huang@intel.com Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-07-21 11:19:28 +02:00
Alan Cox	43605ef188	x86, config: Introduce an INTEL_MID configuration We need to carve up the configuration between: - MID general - Moorestown specific - Medfield specific - Future devices As a base point create an INTEL_MID configuration property. We make the existing MRST configuration a sub-option. This means that the rest of the kernel config can still use X86_MRST checks without anything going backwards. After this is merged future patches will tidy up which devices are MID and which are X86_MRST, as well as add options for Medfield. Signed-off-by: Alan Cox <alan@linux.intel.com> Link: http://lkml.kernel.org/r/20110712164859.7642.84136.stgit@bob.linux.org.uk Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-07-21 10:35:14 +02:00
Sergei Shtylyov	38175051f8	x86, quirks: Use pci_dev->revision This code uses PCI_CLASS_REVISION instead of PCI_REVISION_ID, so it wasn't converted by commit `44c10138fd` ("PCI: Change all drivers to use pci_device->revision") before being moved to arch/x86/... Signed-off-by: Sergei Shtylyov <sshtylyov@ru.mvista.com> Cc: Jesse Barnes <jbarnes@virtuousgeek.org> Cc: Dave Jones <davej@redhat.com> Link: http://lkml.kernel.org/r/201107111901.39281.sshtylyov@ru.mvista.com Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-07-21 10:26:00 +02:00
Greg Dietsche	a6c23905ff	x86, smpboot: Mark the names[] array in __inquire_remote_apic() as const This array is read-only. Make it explicit by marking as const. Signed-off-by: Greg Dietsche <Gregory.Dietsche@cuw.edu> Link: http://lkml.kernel.org/r/1309482653-23648-1-git-send-email-Gregory.Dietsche@cuw.edu Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-07-21 10:04:51 +02:00
Jan Beulich	ef68c8f87e	x86: Serialize EFI time accesses on rtc_lock The EFI specification requires that callers of the time related runtime functions serialize with other CMOS accesses in the kernel, as the EFI time functions may choose to also use the legacy CMOS RTC. Besides fixing a latent bug, this is a prerequisite to safely enable the rtc-efi driver for x86, which ought to be preferred over rtc-cmos on all EFI platforms. Signed-off-by: Jan Beulich <jbeulich@novell.com> Acked-by: Matthew Garrett <mjg59@srcf.ucam.org> Cc: <mjg@redhat.com> Link: http://lkml.kernel.org/r/4E257E33020000780004E319@nat28.tlf.novell.com Signed-off-by: Ingo Molnar <mingo@elte.hu> Cc: Matthew Garrett <mjg@redhat.com>	2011-07-21 09:21:00 +02:00
Jan Beulich	ac619f4eba	x86: Serialize SMP bootup CMOS accesses on rtc_lock With CPU hotplug, there is a theoretical race between other CMOS (namely RTC) accesses and those done in the SMP secondary processor bringup path. I am unaware of the problem having been noticed by anyone in practice, but it would very likely be rather spurious and very hard to reproduce. So to be on the safe side, acquire rtc_lock around those accesses. Signed-off-by: Jan Beulich <jbeulich@novell.com> Cc: John Stultz <john.stultz@linaro.org> Link: http://lkml.kernel.org/r/4E257AE7020000780004E2FF@nat28.tlf.novell.com Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-07-21 09:20:59 +02:00
Jan Beulich	a750036f35	x86: Fix write lock scalability 64-bit issue With the write lock path simply subtracting RW_LOCK_BIAS there is, on large systems, the theoretical possibility of overflowing the 32-bit value that was used so far (namely if 128 or more CPUs manage to do the subtraction, but don't get to do the inverse addition in the failure path quickly enough). A first measure is to modify RW_LOCK_BIAS itself - with the new value chosen, it is good for up to 2048 CPUs each allowed to nest over 2048 times on the read path without causing an issue. Quite possibly it would even be sufficient to adjust the bias a little further, assuming that allowing for significantly less nesting would suffice. However, as the original value chosen allowed for even more nesting levels, to support more than 2048 CPUs (possible currently only for 64-bit kernels) the lock itself gets widened to 64 bits. Signed-off-by: Jan Beulich <jbeulich@novell.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/4E258E0D020000780004E3F0@nat28.tlf.novell.com Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-07-21 09:03:36 +02:00
Jan Beulich	a738669464	x86: Unify rwsem assembly implementation Rather than having two functionally identical implementations for 32- and 64-bit configurations, use the previously extended assembly abstractions to fold the two rwsem implementations into a shared one. Signed-off-by: Jan Beulich <jbeulich@novell.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/4E258DF3020000780004E3ED@nat28.tlf.novell.com Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-07-21 09:03:32 +02:00
Jan Beulich	4625cd6379	x86: Unify rwlock assembly implementation Rather than having two functionally identical implementations for 32- and 64-bit configurations, extend the existing assembly abstractions enough to fold the two rwlock implementations into a shared one. Signed-off-by: Jan Beulich <jbeulich@novell.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/4E258DD7020000780004E3EA@nat28.tlf.novell.com Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-07-21 09:03:31 +02:00
Linus Torvalds	919d25a710	Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: x86. reboot: Make Dell Latitude E6320 use reboot=pci x86, doc only: Correct real-mode kernel header offset for init_size x86: Disable AMD_NUMA for 32bit for now	2011-07-20 15:33:59 -07:00
Jeremy Fitzhardinge	2a6f6d0955	xen/multicall: move *idx fields to start of mc_buffer The CPU would prefer small offsets. Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>	2011-07-18 15:43:46 -07:00
Jeremy Fitzhardinge	eac303bf2e	xen/multicall: special-case singleton hypercalls Singleton calls seem to end up being pretty common, so just directly call the hypercall rather than going via multicall. Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>	2011-07-18 15:43:45 -07:00
Jeremy Fitzhardinge	4a7b005dbf	xen/multicalls: add unlikely around slowpath in __xen_mc_entry() Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>	2011-07-18 15:43:45 -07:00
Jeremy Fitzhardinge	ffc78767f2	xen/multicalls: disable MC_DEBUG It's useful - and probably should be a config - but it's very heavyweight, especially with the tracing stuff to help sort out problems. Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>	2011-07-18 15:43:28 -07:00
Jeremy Fitzhardinge	bc7fe1d977	xen/mmu: tune pgtable alloc/release Make sure the fastpath code is inlined. Batch the page permission change and the pin/unpin, and make sure that it can be batched with any adjacent set_pte/pmd/etc operations. Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>	2011-07-18 15:43:28 -07:00
Jeremy Fitzhardinge	dcf7435cfe	xen/mmu: use extend_args for more mmuext updates Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>	2011-07-18 15:43:27 -07:00
Jeremy Fitzhardinge	c8eed1719a	xen/trace: add tlb flush tracepoints Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>	2011-07-18 15:43:27 -07:00
Jeremy Fitzhardinge	ab78f7ad2c	xen/trace: add segment desc tracing Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>	2011-07-18 15:43:27 -07:00
Jeremy Fitzhardinge	5f94fb5b8e	xen/trace: add xen_pgd_(un)pin tracepoints Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>	2011-07-18 15:43:27 -07:00
Jeremy Fitzhardinge	c2ba050d2e	xen/trace: add ptpage alloc/release tracepoints Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>	2011-07-18 15:43:27 -07:00
Jeremy Fitzhardinge	8470880791	xen/trace: add mmu tracepoints Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>	2011-07-18 15:43:27 -07:00
Jeremy Fitzhardinge	c796f213a6	xen/trace: add multicall tracing Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>	2011-07-18 15:43:26 -07:00
Jeremy Fitzhardinge	f04e2ee41d	xen/trace: set up tracepoint skeleton Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>	2011-07-18 15:43:04 -07:00
Jeremy Fitzhardinge	84cdee76b1	xen/multicalls: remove debugfs stats Remove debugfs stats to make way for tracing. Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>	2011-07-18 15:43:04 -07:00
Borislav Petkov	8c400f6ce0	x86, vdso: Drop now wrong comment Now that `1b3f2a72bb` is in, it is very important that the below lying comment be removed! :-) Signed-off-by: Borislav Petkov <bp@alien8.de> Link: http://lkml.kernel.org/r/20110718191054.GA18359@liondog.tnic Acked-by: Andy Lutomirski <luto@mit.edu> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>	2011-07-18 12:29:50 -07:00
Len Brown	17edf2d79f	x86, intel, power: Correct the MSR_IA32_ENERGY_PERF_BIAS message Fix the printk_once() so that it actually prints (didn't print before due to a stray comma.) [ hpa: changed to an incremental patch and adjusted the description accordingly. ] Signed-off-by: Len Brown <len.brown@intel.com> Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1107151732480.18606@x980 Cc: <table@kernel.org> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>	2011-07-15 15:13:55 -07:00
Oleg Nesterov	73d382decc	x86: Kill handle_signal()->set_fs() handle_signal()->set_fs() has a nice comment which explains what set_fs() is, but it doesn't explain why it is needed and why it depends on CONFIG_X86_64. Afaics, the history of this confusion is: 1. I guess today nobody can explain why it was needed in arch/i386/kernel/signal.c, perhaps it was always wrong. This predates 2.4.0 kernel. 2. then it was copy-and-past'ed to the new x86_64 arch. 3. then it was removed from i386 (but not from x86_64) by `b93b6ca3` "i386: remove unnecessary code". 4. then it was reintroduced under CONFIG_X86_64 when x86 unified i386 and x86_64, because the patch above didn't touch x86_64. Remove it. ->addr_limit should be correct. Even if it was possible that it is wrong, it is too late to fix it after setup_rt_frame(). Linus commented in: http://lkml.kernel.org/r/alpine.LFD.0.999.0707170902570.19166@woody.linux-foundation.org ... about the equivalent bit from i386: Heh. I think it's entirely historical. Please realize that the whole reason that function is called "set_fs()" is that it literally used to set the %fs segment register, not "->addr_limit". So I think the "set_fs(USER_DS)" is there _only_ to match the other regs->xds = __USER_DS; regs->xes = __USER_DS; regs->xss = __USER_DS; regs->xcs = __USER_CS; things, and never mattered. And now it matters even less, and has been copied to all other architectures where it is just totally insane. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Link: http://lkml.kernel.org/r/20110710164424.GA20261@redhat.com Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>	2011-07-14 21:46:20 -07:00
Oleg Nesterov	9b42962074	x86, do_signal: Simplify the TS_RESTORE_SIGMASK logic 1. do_signal() looks at TS_RESTORE_SIGMASK and calculates the mask which should be stored in the signal frame, then it passes "oldset" to the callees, down to setup_rt_frame(). This is ugly, setup_rt_frame() can do this itself and nobody else needs this sigset_t. Move this code into setup_rt_frame. 2. do_signal() also clears TS_RESTORE_SIGMASK if handle_signal() succeeds. We can move this to setup_rt_frame() as well, this avoids the unnecessary checks and makes the logic more clear. 3. use set_current_blocked() instead of sigprocmask(SIG_SETMASK), sigprocmask() should be avoided. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Link: http://lkml.kernel.org/r/20110710182203.GA27979@redhat.com Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>	2011-07-14 21:22:11 -07:00
Oleg Nesterov	3982294b03	x86, signals: Convert the X86_32 code to use set_current_blocked() sys_sigsuspend() and sys_sigreturn() change ->blocked directly. This is not correct, see the changelog in `e6fa16ab` "signal: sigprocmask() should do retarget_shared_pending()" Change them to use set_current_blocked(). Signed-off-by: Oleg Nesterov <oleg@redhat.com> Link: http://lkml.kernel.org/r/20110710192727.GA31759@redhat.com Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>	2011-07-14 21:21:57 -07:00
Oleg Nesterov	905f29e2aa	x86, signals: Convert the IA32_EMULATION code to use set_current_blocked() sys32_sigsuspend() and sys32_*sigreturn() change ->blocked directly. This is not correct, see the changelog in `e6fa16ab` "signal: sigprocmask() should do retarget_shared_pending()" Change them to use set_current_blocked(). Signed-off-by: Oleg Nesterov <oleg@redhat.com> Link: http://lkml.kernel.org/r/20110710192724.GA31755@redhat.com Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>	2011-07-14 21:21:31 -07:00
Andy Lutomirski	98d0ac38ca	x86-64: Move vread_tsc and vread_hpet into the vDSO The vsyscall page now consists entirely of trap instructions. Cc: John Stultz <johnstul@us.ibm.com> Signed-off-by: Andy Lutomirski <luto@mit.edu> Link: http://lkml.kernel.org/r/637648f303f2ef93af93bae25186e9a1bea093f5.1310639973.git.luto@mit.edu Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>	2011-07-14 17:57:05 -07:00
H. Peter Anvin	4bb82178f5	x86, msr: Fix typo in ENERGY_PERF_BIAS_POWERSAVE Fix a trivial typo in the name of the constant ENERGY_PERF_BIAS_POWERSAVE. This didn't cause trouble because this constant is not currently used for anything. Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Cc: Len Brown <len.brown@intel.com> Link: http://lkml.kernel.org/r/tip-abe48b108247e9b90b4c6739662a2e5c765ed114@git.kernel.org	2011-07-14 14:58:44 -07:00
Cyrill Gorcunov	f912987097	perf, x86: P4 PMU - Introduce event alias feature Instead of hw_nmi_watchdog_set_attr() weak function and appropriate x86_pmu::hw_watchdog_set_attr() call we introduce an event alias mechanism which allows us to drop these routines completely and isolate quirks of Netburst architecture inside P4 PMU code only. The main idea remains the same though -- to allow nmi-watchdog and perf top run simultaneously. Note the aliasing mechanism applies to generic PERF_COUNT_HW_CPU_CYCLES event only because arbitrary event (say passed as RAW initially) might have some additional bits set inside ESCR register changing the behaviour of event and we can't guarantee anymore that alias event will give the same result. P.S. Huge thanks to Don and Steven for testing and early review. Acked-by: Don Zickus <dzickus@redhat.com> Tested-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org> CC: Ingo Molnar <mingo@elte.hu> CC: Peter Zijlstra <a.p.zijlstra@chello.nl> CC: Stephane Eranian <eranian@google.com> CC: Lin Ming <ming.m.lin@intel.com> CC: Arnaldo Carvalho de Melo <acme@redhat.com> CC: Frederic Weisbecker <fweisbec@gmail.com> Link: http://lkml.kernel.org/r/20110708201712.GS23657@sun Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2011-07-14 17:25:04 -04:00
Len Brown	abe48b1082	x86, intel, power: Initialize MSR_IA32_ENERGY_PERF_BIAS Since 2.6.36 (`23016bf0d2`), Linux prints the existence of "epb" in /proc/cpuinfo, Since 2.6.38 (`d5532ee7b4`), the x86_energy_perf_policy(8) utility has been available in-tree to update MSR_IA32_ENERGY_PERF_BIAS. However, the typical BIOS fails to initialize the MSR, presumably because this is handled by high-volume shrink-wrap operating systems... Linux distros, on the other hand, do not yet invoke x86_energy_perf_policy(8). As a result, WSM-EP, SNB, and later hardware from Intel will run in its default hardware power-on state (performance), which assumes that users care for performance at all costs and not for energy efficiency. While that is fine for performance benchmarks, the hardware's intended default operating point is "normal" mode... Initialize the MSR to the "normal" by default during kernel boot. x86_energy_perf_policy(8) is available to change the default after boot, should the user have a different preference. Signed-off-by: Len Brown <len.brown@intel.com> Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1107140051020.18606@x980 Acked-by: Rafael J. Wysocki <rjw@sisk.pl> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Cc: <stable@kernel.org>	2011-07-14 12:13:42 -07:00
David S. Miller	6a7ebdf2fd	Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 Conflicts: net/bluetooth/l2cap_core.c	2011-07-14 07:56:40 -07:00
Glauber Costa	095c0aa83e	sched: adjust scheduler cpu power for stolen time This patch makes update_rq_clock() aware of steal time. The mechanism of operation is not different from irq_time, and follows the same principles. This lives in a CONFIG option itself, and can be compiled out independently of the rest of steal time reporting. The effect of disabling it is that the scheduler will still report steal time (that cannot be disabled), but won't use this information for cpu power adjustments. Every time update_rq_clock_task() is invoked, we query information about how much time was stolen since last call, and feed it into sched_rt_avg_update(). Although steal time reporting in account_process_tick() keeps track of the last time we read the steal clock, in prev_steal_time, this patch does it independently using another field, prev_steal_time_rq. This is because otherwise, information about time accounted in update_process_tick() would never reach us in update_rq_clock(). Signed-off-by: Glauber Costa <glommer@redhat.com> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Tested-by: Eric B Munson <emunson@mgebm.net> CC: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com> CC: Anthony Liguori <aliguori@us.ibm.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-07-14 12:59:47 +03:00
Glauber Costa	3c404b578f	KVM guest: Add a pv_ops stub for steal time This patch adds a function pointer in one of the many paravirt_ops structs, to allow guests to register a steal time function. Besides a steal time function, we also declare two jump_labels. They will be used to allow the steal time code to be easily bypassed when not in use. Signed-off-by: Glauber Costa <glommer@redhat.com> Acked-by: Rik van Riel <riel@redhat.com> Tested-by: Eric B Munson <emunson@mgebm.net> CC: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com> CC: Peter Zijlstra <peterz@infradead.org> CC: Anthony Liguori <aliguori@us.ibm.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-07-14 12:59:44 +03:00
Glauber Costa	c9aaa8957f	KVM: Steal time implementation To implement steal time, we need the hypervisor to pass the guest information about how much time was spent running other processes outside the VM, while the vcpu had meaningful work to do - halt time does not count. This information is acquired through the run_delay field of delayacct/schedstats infrastructure, that counts time spent in a runqueue but not running. Steal time is a per-cpu information, so the traditional MSR-based infrastructure is used. A new msr, KVM_MSR_STEAL_TIME, holds the memory area address containing information about steal time. This patch contains the hypervisor part of the steal time infrastructure, and can be backported independently of the guest portion. [avi, yongjie: export delayacct_on, to avoid build failures in some configs] Signed-off-by: Glauber Costa <glommer@redhat.com> Tested-by: Eric B Munson <emunson@mgebm.net> CC: Rik van Riel <riel@redhat.com> CC: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com> CC: Peter Zijlstra <peterz@infradead.org> CC: Anthony Liguori <aliguori@us.ibm.com> Signed-off-by: Yongjie Ren <yongjie.ren@intel.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-07-14 12:59:14 +03:00
Andy Lutomirski	433bd805e5	clocksource: Replace vread with generic arch data The vread field was bloating struct clocksource everywhere except x86_64, and I want to change the way this works on x86_64, so let's split it out into per-arch data. Cc: x86@kernel.org Cc: Clemens Ladisch <clemens@ladisch.de> Cc: linux-ia64@vger.kernel.org Cc: Tony Luck <tony.luck@intel.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: John Stultz <johnstul@us.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andy Lutomirski <luto@mit.edu> Link: http://lkml.kernel.org/r/3ae5ec76a168eaaae63f08a2a1060b91aa0b7759.1310563276.git.luto@mit.edu Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>	2011-07-13 11:23:12 -07:00
Andy Lutomirski	7f79ad15f3	x86-64: Add --no-undefined to vDSO build This gives much nicer diagnostics when something goes wrong. It's supported at least as far back as binutils 2.15. Signed-off-by: Andy Lutomirski <luto@mit.edu> Link: http://lkml.kernel.org/r/de0b50920469ff6359c529526e7639fdd36fa83c.1310563276.git.luto@mit.edu Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>	2011-07-13 11:23:09 -07:00
Andy Lutomirski	1b3f2a72bb	x86-64: Allow alternative patching in the vDSO This code is short enough and different enough from the module loader that it's not worth trying to share anything. Signed-off-by: Andy Lutomirski <luto@mit.edu> Link: http://lkml.kernel.org/r/e73112e4381fff29e31b882c2d0856822edaea53.1310563276.git.luto@mit.edu Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>	2011-07-13 11:23:07 -07:00
Andy Lutomirski	59e97e4d6f	x86: Make alternative instruction pointers relative This saves a few bytes on x86-64 and means that future patches can apply alternatives to unrelocated code. Signed-off-by: Andy Lutomirski <luto@mit.edu> Link: http://lkml.kernel.org/r/ff64a6b9a1a3860ca4a7b8b6dc7b4754f9491cd7.1310563276.git.luto@mit.edu Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>	2011-07-13 11:22:56 -07:00
Andy Lutomirski	c9712944b2	x86-64: Improve vsyscall emulation CS and RIP handling Three fixes here: - Send SIGSEGV if called from compat code or with a funny CS. - Don't BUG on impossible addresses. - Add a missing local_irq_disable. This patch also removes an unused variable. Signed-off-by: Andy Lutomirski <luto@mit.edu> Link: http://lkml.kernel.org/r/6fb2b13ab39b743d1e4f466eef13425854912f7f.1310563276.git.luto@mit.edu Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>	2011-07-13 11:22:55 -07:00
Tejun Heo	1e01979c8f	x86, numa: Implement pfn -> nid mapping granularity check SPARSEMEM w/o VMEMMAP and DISCONTIGMEM, both used only on 32bit, use sections array to map pfn to nid which is limited in granularity. If NUMA nodes are laid out such that the mapping cannot be accurate, boot will fail triggering BUG_ON() in mminit_verify_page_links(). On 32bit, it's 512MiB w/ PAE and SPARSEMEM. This seems to have been granular enough until commit `2706a0bf7b` (x86, NUMA: Enable CONFIG_AMD_NUMA on 32bit too). Apparently, there is a machine which aligns NUMA nodes to 128MiB and has only AMD NUMA but not SRAT. This led to the following BUG_ON(). On node 0 totalpages: 2096615 DMA zone: 32 pages used for memmap DMA zone: 0 pages reserved DMA zone: 3927 pages, LIFO batch:0 Normal zone: 1740 pages used for memmap Normal zone: 220978 pages, LIFO batch:31 HighMem zone: 16405 pages used for memmap HighMem zone: 1853533 pages, LIFO batch:31 BUG: Int 6: CR2 (null) EDI (null) ESI 00000002 EBP 00000002 ESP c1543ecc EBX f2400000 EDX 00000006 ECX (null) EAX 00000001 err (null) EIP c16209aa CS 00000060 flg 00010002 Stack: f2400000 00220000 f7200800 c1620613 00220000 01000000 04400000 00238000 (null) f7200000 00000002 f7200b58 f7200800 c1620929 000375fe (null) f7200b80 c16395f0 00200a02 f7200a80 (null) 000375fe 00000002 (null) Pid: 0, comm: swapper Not tainted 2.6.39-rc5-00181-g2706a0b #17 Call Trace: [<c136b1e5>] ? early_fault+0x2e/0x2e [<c16209aa>] ? mminit_verify_page_links+0x12/0x42 [<c1620613>] ? memmap_init_zone+0xaf/0x10c [<c1620929>] ? free_area_init_node+0x2b9/0x2e3 [<c1607e99>] ? free_area_init_nodes+0x3f2/0x451 [<c1601d80>] ? paging_init+0x112/0x118 [<c15f578d>] ? setup_arch+0x791/0x82f [<c15f43d9>] ? start_kernel+0x6a/0x257 This patch implements node_map_pfn_alignment() which determines maximum internode alignment and update numa_register_memblks() to reject NUMA configuration if alignment exceeds the pfn -> nid mapping granularity of the memory model as determined by PAGES_PER_SECTION. This makes the problematic machine boot w/ flatmem by rejecting the NUMA config and provides protection against crazy NUMA configurations. Signed-off-by: Tejun Heo <tj@kernel.org> Link: http://lkml.kernel.org/r/20110712074534.GB2872@htj.dyndns.org LKML-Reference: <20110628174613.GP478@escobedo.osrc.amd.com> Reported-and-Tested-by: Hans Rosenfeld <hans.rosenfeld@amd.com> Cc: Conny Seidel <conny.seidel@amd.com> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>	2011-07-12 21:58:29 -07:00
Tejun Heo	d0ead15738	x86, mm: s/PAGES_PER_ELEMENT/PAGES_PER_SECTION/ DISCONTIGMEM on x86-32 implements pfn -> nid mapping similarly to SPARSEMEM; however, it calls each mapping unit ELEMENT instead of SECTION. This patch renames it to SECTION so that PAGES_PER_SECTION is valid for both DISCONTIGMEM and SPARSEMEM. This will be used by the next patch to implement mapping granularity check. This patch is trivial constant rename. Signed-off-by: Tejun Heo <tj@kernel.org> Link: http://lkml.kernel.org/r/20110712074422.GA2872@htj.dyndns.org Cc: Hans Rosenfeld <hans.rosenfeld@amd.com> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>	2011-07-12 21:58:11 -07:00
Maxime Ripard	3628c3f5c8	x86. reboot: Make Dell Latitude E6320 use reboot=pci The Dell Latitude E6320 doesn't reboot unless reboot=pci is set. Force it thanks to DMI. Signed-off-by: Maxime Ripard <maxime.ripard@free-electrons.com> Link: http://lkml.kernel.org/r/1309269451-4966-1-git-send-email-maxime.ripard@free-electrons.com Cc: Matthew Garrett <mjg@redhat.com> Signed-off-by: H. Peter Anvin <hpa@zytor.com>	2011-07-12 21:42:48 -07:00
Naga Chumbalkar	42f0efc5aa	x86, ioapic: Print IR_IO_APIC_route_entry when IR is enabled When IR (interrupt remapping) is enabled, print_IO_APIC() displays output according to legacy RTE (redirection table entry) definitions: NR Dst Mask Trig IRR Pol Stat Dmod Deli Vect: 00 00 1 0 0 0 0 0 0 00 01 00 0 0 0 0 0 0 0 01 02 00 0 0 0 0 0 0 0 02 03 00 1 0 0 0 0 0 0 03 04 00 1 0 0 0 0 0 0 04 05 00 1 0 0 0 0 0 0 05 06 00 1 0 0 0 0 0 0 06 ... The above output is as per Sec 3.2.4 of the IOAPIC datasheet: 82093AA I/O Advanced Programmable Interrupt Controller (IOAPIC): http://download.intel.com/design/chipsets/datashts/29056601.pdf Instead the output should display the fields as discussed in Sec 5.5.1 of the VT-d specification: (Intel Virtualization Technology for Directed I/O: http://download.intel.com/technology/computing/vptech/Intel(r)_VT_for_Direct_IO.pdf) After the fix: NR Indx Fmt Mask Trig IRR Pol Stat Indx2 Zero Vect: 00 0000 0 1 0 0 0 0 0 0 00 01 000F 1 0 0 0 0 0 0 0 01 02 0001 1 0 0 0 0 0 0 0 02 03 0002 1 1 0 0 0 0 0 0 03 04 0011 1 1 0 0 0 0 0 0 04 05 0004 1 1 0 0 0 0 0 0 05 06 0005 1 1 0 0 0 0 0 0 06 ... Signed-off-by: Naga Chumbalkar <nagananda.chumbalkar@hp.com> Link: http://lkml.kernel.org/r/20110712211658.2939.93123.sendpatchset@nchumbalkar.americas.cpqcorp.net Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>	2011-07-12 20:17:58 -07:00
Naga Chumbalkar	3040db92ee	x86, ioapic: Print IRTE when IR is enabled When "apic=debug" is used as a boot parameter, Linux prints the IOAPIC routing entries in "dmesg". Below is output from IOAPIC whose apic_id is 8: # dmesg \| grep "routing entry" IOAPIC[8]: Set routing entry (8-1 -> 0x31 -> IRQ 1 Mode:0 Active:0 Dest:0) IOAPIC[8]: Set routing entry (8-2 -> 0x30 -> IRQ 0 Mode:0 Active:0 Dest:0) IOAPIC[8]: Set routing entry (8-3 -> 0x33 -> IRQ 3 Mode:0 Active:0 Dest:0) ... Similarly, when IR (interrupt remapping) is enabled, and the IRTE (interrupt remapping table entry) is set up we should display it. After the fix: # dmesg \| grep IRTE IOAPIC[8]: Set IRTE entry (P:1 FPD:0 Dst_Mode:0 Redir_hint:1 Trig_Mode:0 Dlvry_Mode:0 Avail:0 Vector:31 Dest:00000000 SID:00F1 SQ:0 SVT:1) IOAPIC[8]: Set IRTE entry (P:1 FPD:0 Dst_Mode:0 Redir_hint:1 Trig_Mode:0 Dlvry_Mode:0 Avail:0 Vector:30 Dest:00000000 SID:00F1 SQ:0 SVT:1) IOAPIC[8]: Set IRTE entry (P:1 FPD:0 Dst_Mode:0 Redir_hint:1 Trig_Mode:0 Dlvry_Mode:0 Avail:0 Vector:33 Dest:00000000 SID:00F1 SQ:0 SVT:1) ... The IRTE is defined in Sec 9.5 of the Intel VT-d Specification. Signed-off-by: Naga Chumbalkar <nagananda.chumbalkar@hp.com> Link: http://lkml.kernel.org/r/20110712211704.2939.71291.sendpatchset@nchumbalkar.americas.cpqcorp.net Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>	2011-07-12 14:34:00 -07:00
Naga Chumbalkar	2597085228	x86, x2apic: Preserve high 32-bits of IA32_APIC_BASE MSR If there's no special reason to zero-out the "high" 32-bits of the IA32_APIC_BASE MSR, let's preserve it. The x2APIC Specification doesn't explicitly state any such requirement. (Sec 2.2 in: http://www.intel.com/Assets/PDF/manual/318148.pdf). Signed-off-by: Naga Chumbalkar <nagananda.chumbalkar@hp.com> Link: http://lkml.kernel.org/r/20110712055831.2498.78521.sendpatchset@nchumbalkar.americas.cpqcorp.net Reviewed-by: Cyrill Gorcunov <gorcunov@openvz.org> Reviewed-by: Suresh Siddha <suresh.b.siddha@intel.com> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>	2011-07-12 14:33:49 -07:00
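The idea of preserving the high half of the MSR can be illustrated with a small user-space model. The sketch below is not the kernel code: `fake_apic_base_msr`, `fake_rdmsr()` and `fake_wrmsr()` are stand-ins invented for the example (real code would use the kernel's MSR accessors). The only architectural fact relied on is that the x2APIC enable bit is bit 10 of IA32_APIC_BASE.

```c
#include <stdint.h>

#define X2APIC_ENABLE (1u << 10)   /* EXTD bit in IA32_APIC_BASE */

/* Stand-in for the real MSR (assumption: illustration only). */
uint64_t fake_apic_base_msr;

/* Hypothetical accessors that split/join the 64-bit value like rdmsr/wrmsr. */
void fake_rdmsr(uint32_t *lo, uint32_t *hi)
{
    *lo = (uint32_t)fake_apic_base_msr;
    *hi = (uint32_t)(fake_apic_base_msr >> 32);
}

void fake_wrmsr(uint32_t lo, uint32_t hi)
{
    fake_apic_base_msr = ((uint64_t)hi << 32) | lo;
}

/* Set the x2APIC enable bit while writing the high 32 bits back unchanged. */
void enable_x2apic_preserving_high(void)
{
    uint32_t lo, hi;

    fake_rdmsr(&lo, &hi);
    if (!(lo & X2APIC_ENABLE))
        fake_wrmsr(lo | X2APIC_ENABLE, hi);   /* hi is kept, not zeroed */
}
```

Passing `0` instead of `hi` in the write would reproduce the behaviour the commit moves away from.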
Christoph Lameter	688d3be815	percpu: Fixup __this_cpu_xchg* operations Somehow we got into a situation where the __this_cpu_xchg() operations were not defined in the same way as this_cpu_xchg() and friends. I had some build failures under 32 bit that were addressed by these fixes. Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2011-07-12 13:47:16 +02:00
Glauber Costa	9ddabbe72e	KVM: KVM Steal time guest/host interface To implement steal time, we need the hypervisor to pass the guest information about how much time was spent running other processes outside the VM. This is per-vcpu, and using the kvmclock structure for that is an abuse we decided not to make. In this patchset, I am introducing a new msr, KVM_MSR_STEAL_TIME, that holds the memory area address containing information about steal time. This patch contains the headers for it. I am keeping it separate to facilitate backports for people who want to backport the kernel part but not the hypervisor, or the other way around. Signed-off-by: Glauber Costa <glommer@redhat.com> Acked-by: Rik van Riel <riel@redhat.com> Tested-by: Eric B Munson <emunson@mgebm.net> CC: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com> CC: Peter Zijlstra <peterz@infradead.org> CC: Anthony Liguori <aliguori@us.ibm.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-07-12 13:17:03 +03:00
Glauber Costa	4b6b35f55c	KVM: Add constant to represent KVM MSRs enabled bit in guest/host interface This patch is simple, put in a different commit so it can be more easily shared between guest and hypervisor. It just defines a named constant to indicate the enable bit for KVM-specific MSRs. Signed-off-by: Glauber Costa <glommer@redhat.com> Acked-by: Rik van Riel <riel@redhat.com> Tested-by: Eric B Munson <emunson@mgebm.net> CC: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com> CC: Peter Zijlstra <peterz@infradead.org> CC: Anthony Liguori <aliguori@us.ibm.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-07-12 13:17:02 +03:00
Takuya Yoshikawa	3c8c652ae4	KVM: MMU: Introduce is_last_gpte() to clean up walk_addr_generic() Suggested by Ingo and Avi. Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-07-12 13:16:44 +03:00
Takuya Yoshikawa	92c1c1e85b	KVM: MMU: Rename the walk label in walk_addr_generic() The current name does not explain the meaning well. So give it a better name "retry_walk" to show that we are trying the walk again. This was suggested by Ingo Molnar. Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-07-12 13:16:43 +03:00
Takuya Yoshikawa	134291bf3c	KVM: MMU: Clean up the error handling of walk_addr_generic() Avoid two step jump to the error handling part. This eliminates the use of the variables present and rsvd_fault. We also use the const type qualifier to show that write/user/fetch_fault do not change in the function. Both of these were suggested by Ingo Molnar. Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-07-12 13:16:42 +03:00
Marcelo Tosatti	f8f7e5ee10	Revert "KVM: MMU: make kvm_mmu_reset_context() flush the guest TLB" This reverts commit bee931d31e588b8eb86b7edee32fac2d16930cd7. TLB flush should be done lazily during guest entry, in kvm_mmu_load(). Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-07-12 13:16:41 +03:00
Avi Kivity	45bd07b9d5	KVM: MMU: make kvm_mmu_reset_context() flush the guest TLB kvm_set_cr0() and kvm_set_cr4(), and possible other functions, assume that kvm_mmu_reset_context() flushes the guest TLB. However, it does not. Fix by flushing the tlb (and syncing the new root as well). Signed-off-by: Avi Kivity <avi@redhat.com>	2011-07-12 13:16:27 +03:00
Avi Kivity	411c588dfb	KVM: MMU: Adjust shadow paging to work when SMEP=1 and CR0.WP=0 When CR0.WP=0, we sometimes map user pages as kernel pages (to allow the kernel to write to them). Unfortunately this also allows the kernel to fetch from these pages, even if CR4.SMEP is set. Adjust for this by also setting NX on the spte in these circumstances. Signed-off-by: Avi Kivity <avi@redhat.com>	2011-07-12 13:16:26 +03:00
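The adjustment described in this commit can be sketched as a pure function over page-table bits. This is a simplified model under stated assumptions, not KVM's shadow-paging code: `shadow_pte_bits()` and its parameters are hypothetical, and only the x86 bit positions (P = bit 0, R/W = bit 1, U/S = bit 2, NX = bit 63) are taken as given.

```c
#include <stdbool.h>
#include <stdint.h>

#define PTE_PRESENT  (1ULL << 0)
#define PTE_WRITABLE (1ULL << 1)
#define PTE_USER     (1ULL << 2)
#define PTE_NX       (1ULL << 63)

/*
 * Hypothetical helper: compute shadow PTE bits for a guest page.
 * gpte_user / gpte_writable describe the guest's own mapping.
 */
uint64_t shadow_pte_bits(bool gpte_user, bool gpte_writable,
                         bool kernel_write_fault, bool cr0_wp, bool cr4_smep)
{
    uint64_t spte = PTE_PRESENT;

    if (gpte_user)
        spte |= PTE_USER;
    if (gpte_writable)
        spte |= PTE_WRITABLE;

    /* With CR0.WP=0 the guest kernel may write to a read-only page. */
    if (!gpte_writable && kernel_write_fault && !cr0_wp) {
        spte |= PTE_WRITABLE;
        if (gpte_user) {
            spte &= ~PTE_USER;        /* user page now mapped as kernel page */
            if (cr4_smep)
                spte |= PTE_NX;       /* keep the kernel from fetching it */
        }
    }
    return spte;
}
```

Without the `PTE_NX` step, clearing the user bit would let supervisor-mode code execute from a page that SMEP is supposed to protect.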
Yang, Wei	a01c8f9b4e	KVM: Enable ERMS feature support for KVM This patch exposes the ERMS feature to KVM guests. With ERMS (enhanced REP MOVSB/STOSB), the fast-string operations attempt to move as much of the data as possible with larger load/stores. Signed-off-by: Yang, Wei <wei.y.yang@intel.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-07-12 13:16:25 +03:00
Yang, Wei	176f61da82	KVM: Expose RDWRGSFS bit to KVM guests This patch exposes RDWRGSFS bit to KVM guests. Signed-off-by: Yang, Wei <wei.y.yang@intel.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-07-12 13:16:24 +03:00
Yang, Wei	74dc2b4ffe	KVM: Add RDWRGSFS support when setting CR4 This patch adds RDWRGSFS support when setting CR4. Signed-off-by: Yang, Wei <wei.y.yang@intel.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-07-12 13:16:23 +03:00
Yang, Wei	d9c3476d8a	KVM: Remove RDWRGSFS bit from CR4_RESERVED_BITS This patch removes RDWRGSFS bit from CR4_RESERVED_BITS. Signed-off-by: Yang, Wei <wei.y.yang@intel.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-07-12 13:16:22 +03:00
Yang, Wei Y	4a00efdf0c	KVM: Enable DRNG feature support for KVM This patch exposes DRNG feature to KVM guests. The RDRAND instruction can provide software with sequences of random numbers generated from white noise. Signed-off-by: Yang, Wei <wei.y.yang@intel.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-07-12 13:16:21 +03:00
Andre Przywara	02668b061d	KVM: fix XSAVE bit scanning (now properly) commit 123108f1c1aafd51d6a5c79cc04d7999dd88a930 tried to fix KVM's XSAVE valid feature scanning, but it was wrong. It was not considering the sparse nature of this bitfield, instead reading values from uninitialized members of the entries array. This patch now separates subleaf indices from KVM's array indices and fills the entry before querying its value. This fixes AVX support in KVM guests. Signed-off-by: Andre Przywara <andre.przywara@amd.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-07-12 13:16:20 +03:00
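The distinction between subleaf indices and array indices can be shown with a short sketch. It is an illustration only: `struct leaf_entry` and `fill_xsave_subleaves()` are hypothetical, and a plain feature bitmask stands in for the CPUID queries the real code performs. The point is that the output array is packed densely while the subleaf numbers are sparse, and that each entry is written before it is read.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical entry: records which CPUID subleaf it describes. */
struct leaf_entry {
    uint32_t index;
};

/*
 * Walk a sparse feature bitmask and fill a dense array of entries.
 * idx is the subleaf number, n is the array position; they diverge as soon
 * as one bit in the mask is clear. Starting at 2 is an assumption of this
 * sketch (bits 0 and 1 are treated as having no subleaf of their own).
 */
size_t fill_xsave_subleaves(uint64_t feature_mask,
                            struct leaf_entry *out, size_t max)
{
    size_t n = 0;
    unsigned int idx;

    for (idx = 2; idx < 64 && n < max; idx++) {
        if (!(feature_mask & (1ULL << idx)))
            continue;              /* absent feature: consume no array slot */
        out[n].index = idx;        /* fill the entry before anything reads it */
        n++;
    }
    return n;
}
```

Using `idx` as the array position would leave gaps of uninitialized entries, which is the kind of bug the commit describes.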
Yang, Wei Y	e57d4a356a	KVM: Add instruction fetch checking when walking guest page table This patch adds instruction fetch checking when walking guest page table, to implement SMEP when emulating instead of executing natively. Signed-off-by: Yang, Wei <wei.y.yang@intel.com> Signed-off-by: Shan, Haitao <haitao.shan@intel.com> Signed-off-by: Li, Xin <xin.li@intel.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-07-12 13:16:15 +03:00
Yang, Wei Y	611c120f74	KVM: Mask function7 ebx against host capability word9 This patch masks CPUID leaf 7 ebx against host capability word9. Signed-off-by: Yang, Wei <wei.y.yang@intel.com> Signed-off-by: Shan, Haitao <haitao.shan@intel.com> Signed-off-by: Li, Xin <xin.li@intel.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-07-12 13:16:14 +03:00
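Masking a CPUID register against what the host offers amounts to a bitwise AND, as in the sketch below. The function name and the particular set of features in `kvm_supported` are assumptions chosen for illustration; the bit positions used (FSGSBASE = bit 0, SMEP = bit 7, ERMS = bit 9 of CPUID leaf 7 EBX) are the architectural ones.

```c
#include <stdint.h>

#define F_FSGSBASE (1u << 0)
#define F_SMEP     (1u << 7)
#define F_ERMS     (1u << 9)

/*
 * Hypothetical helper: a feature reaches the guest only if the host CPU
 * reports it AND the hypervisor knows how to virtualize it.
 */
uint32_t guest_leaf7_ebx(uint32_t host_ebx)
{
    /* Illustrative list, not the exact list of any kernel version. */
    const uint32_t kvm_supported = F_FSGSBASE | F_SMEP | F_ERMS;

    return host_ebx & kvm_supported;
}
```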
Yang, Wei Y	c68b734fba	KVM: Add SMEP support when setting CR4 This patch adds SMEP handling when setting CR4. Signed-off-by: Yang, Wei <wei.y.yang@intel.com> Signed-off-by: Shan, Haitao <haitao.shan@intel.com> Signed-off-by: Li, Xin <xin.li@intel.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-07-12 13:16:13 +03:00
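A guest CR4 write involving SMEP can be validated along these lines. This is a minimal sketch, assuming a hypothetical `cr4_write_allowed()` helper and a caller-supplied reserved-bit mask; only the position of CR4.SMEP (bit 20) is taken from the architecture.

```c
#include <stdbool.h>
#include <stdint.h>

#define CR4_SMEP (1ULL << 20)

/*
 * Hypothetical check: refuse a CR4 value that sets any reserved bit, or that
 * sets SMEP when the guest's CPUID does not advertise the feature.
 */
bool cr4_write_allowed(uint64_t new_cr4, uint64_t reserved_bits,
                       bool guest_has_smep)
{
    if (new_cr4 & reserved_bits)
        return false;
    if ((new_cr4 & CR4_SMEP) && !guest_has_smep)
        return false;
    return true;
}
```

Once SMEP is removed from the reserved mask, the second test is what keeps a guest without the feature from turning the bit on.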
Yang, Wei Y	8d9c975fc5	KVM: Remove SMEP bit from CR4_RESERVED_BITS This patch removes SMEP bit from CR4_RESERVED_BITS. Signed-off-by: Yang, Wei <wei.y.yang@intel.com> Signed-off-by: Shan, Haitao <haitao.shan@intel.com> Signed-off-by: Li, Xin <xin.li@intel.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-07-12 13:16:12 +03:00
Nadav Har'El	509c75ea19	KVM: nVMX: Fix bug preventing more than two levels of nesting The nested VMX feature is supposed to fully emulate VMX for the guest. This (theoretically) not only allows it to run its own guests, but also to further emulate VMX for its own guests, and allow arbitrarily deep nesting. This patch fixes a bug (discovered by Kevin Tian) in handling a VMLAUNCH by L2, which prevented deeper nesting. Deeper nesting now works (I only actually tested L3), but is currently absurdly slow, to the point of being unusable. Signed-off-by: Nadav Har'El <nyh@il.ibm.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-07-12 13:16:11 +03:00
Avi Kivity	9dac77fa40	KVM: x86 emulator: fold decode_cache into x86_emulate_ctxt This saves a lot of pointless casts between x86_emulate_ctxt and decode_cache. Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-07-12 13:16:09 +03:00
Avi Kivity	36dd9bb5ce	KVM: x86 emulator: rename decode_cache::eip to _eip The name eip conflicts with a field of the same name in x86_emulate_ctxt, which we plan to fold decode_cache into. The name _eip is unfortunate, but what's really needed is a refactoring here, not a better name. Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-07-12 13:16:09 +03:00
Jan Kiszka	2e4ce7f574	KVM: VMX: Silence warning on 32-bit hosts a is unused now on CONFIG_X86_32. Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-07-12 13:16:08 +03:00
Takuya Yoshikawa	f411e6cdc2	KVM: x86 emulator: Use opcode::execute for CLI/STI(FA/FB) Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-07-12 13:16:07 +03:00
Takuya Yoshikawa	d06e03adcb	KVM: x86 emulator: Use opcode::execute for LOOP/JCXZ LOOP/LOOPcc : E0-E2 JCXZ/JECXZ/JRCXZ : E3 Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-07-12 13:16:06 +03:00
Takuya Yoshikawa	5c5df76b8b	KVM: x86 emulator: Clean up INT n/INTO/INT 3(CC/CD/CE) Call emulate_int() directly to avoid spaghetti goto's. Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-07-12 13:16:04 +03:00
Takuya Yoshikawa	1bd5f469b2	KVM: x86 emulator: Use opcode::execute for MOV(8C/8E) Different functions for those which take segment register operands. Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-07-12 13:16:03 +03:00
Takuya Yoshikawa	ebda02c2a5	KVM: x86 emulator: Use opcode::execute for RET(C3) Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-07-12 13:16:02 +03:00
Takuya Yoshikawa	e4f973ae91	KVM: x86 emulator: Use opcode::execute for XCHG(86/87) In addition, replace one "goto xchg" with an em_xchg() call. Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-07-12 13:16:01 +03:00
Takuya Yoshikawa	9f21ca599c	KVM: x86 emulator: Use opcode::execute for TEST(84/85, A8/A9) Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-07-12 13:16:00 +03:00
Takuya Yoshikawa	db5b0762f3	KVM: x86 emulator: Use opcode::execute for some instructions Move the following functions to the opcode tables: RET (Far return) : CB IRET : CF JMP (Jump far) : EA SYSCALL : 0F 05 CLTS : 0F 06 SYSENTER : 0F 34 SYSEXIT : 0F 35 Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-07-12 13:15:59 +03:00
Takuya Yoshikawa	e01991e71a	KVM: x86 emulator: Rename emulate_xxx() to em_xxx() The next patch will change these to be called by opcode::execute. Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-07-12 13:15:58 +03:00
Takuya Yoshikawa	9d74191ab1	KVM: x86 emulator: Use the pointers ctxt and c consistently We should use the local variables ctxt and c when the emulate_ctxt and decode appears many times. At least, we need to be consistent about how we use these in a function. Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-07-12 13:15:57 +03:00
Nadav Har'El	2844d84905	KVM: nVMX: Miscellaneous small corrections Small corrections of KVM (spelling, etc.) not directly related to nested VMX. Signed-off-by: Nadav Har'El <nyh@il.ibm.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-07-12 11:45:19 +03:00
Nadav Har'El	7b8050f570	KVM: nVMX: Add VMX to list of supported cpuid features If the "nested" module option is enabled, add the "VMX" CPU feature to the list of CPU features KVM advertises with the KVM_GET_SUPPORTED_CPUID ioctl. Qemu uses this ioctl, and intersects KVM's list with its own list of desired cpu features (depending on the -cpu option given to qemu) to determine the final list of features presented to the guest. Signed-off-by: Nadav Har'El <nyh@il.ibm.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-07-12 11:45:19 +03:00
Nadav Har'El	7991825b85	KVM: nVMX: Additional TSC-offset handling In the unlikely case that L1 does not capture MSR_IA32_TSC, L0 needs to emulate this MSR write by L2 by modifying vmcs02.tsc_offset. We also need to set vmcs12.tsc_offset, for this change to survive the next nested entry (see prepare_vmcs02()). Additionally, we also need to modify vmx_adjust_tsc_offset: The semantics of this function is that the TSC of all guests on this vcpu, L1 and possibly several L2s, need to be adjusted. To do this, we need to adjust vmcs01's tsc_offset (this offset will also apply to each L2s we enter). We can't set vmcs01 now, so we have to remember this adjustment and apply it when we later exit to L1. Signed-off-by: Nadav Har'El <nyh@il.ibm.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-07-12 11:45:19 +03:00
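The offset bookkeeping described here can be modelled with plain integer arithmetic. The sketch below is an assumption-laden illustration, not the KVM implementation: `struct nested_tsc` and the three functions are hypothetical, and it assumes the offset used while L2 runs is simply the sum of L0's offset for L1 and L1's offset for L2.

```c
#include <stdint.h>

/* Hypothetical state: L0's offset for L1, and L1's offset for L2. */
struct nested_tsc {
    uint64_t vmcs01_offset;
    uint64_t vmcs12_offset;
};

/* Offset applied while L2 runs (assumed to be the sum of both layers). */
uint64_t vmcs02_offset(const struct nested_tsc *t)
{
    return t->vmcs01_offset + t->vmcs12_offset;
}

/*
 * L2 writes the TSC and L1 does not intercept it: pick the total offset that
 * makes host_tsc + offset equal the new value, and record L1's share in
 * vmcs12 so that the change survives the next nested entry.
 */
void l2_write_tsc(struct nested_tsc *t, uint64_t host_tsc, uint64_t new_tsc)
{
    uint64_t total = new_tsc - host_tsc;

    t->vmcs12_offset = total - t->vmcs01_offset;
}

/* Adjust every guest on this vcpu: change L0's offset, which L2 inherits. */
void adjust_all_guests(struct nested_tsc *t, int64_t delta)
{
    t->vmcs01_offset += (uint64_t)delta;
}
```

Unsigned wraparound is intentional here: offsets may represent negative values in two's complement.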
Nadav Har'El	36cf24e01e	KVM: nVMX: Further fixes for lazy FPU loading KVM's "Lazy FPU loading" means that sometimes L0 needs to set CR0.TS, even if a guest didn't set it. Moreover, L0 must also trap CR0.TS changes and NM exceptions, even if we have a guest hypervisor (L1) who didn't want these traps. And of course, conversely: If L1 wanted to trap these events, we must let it, even if L0 is not interested in them. This patch fixes some existing KVM code (in update_exception_bitmap(), vmx_fpu_activate(), vmx_fpu_deactivate()) to do the correct merging of L0's and L1's needs. Note that handle_cr() was already fixed in the above patch, and that new code introduced in previous patches already handles CR0 correctly (see prepare_vmcs02(), prepare_vmcs12(), and nested_vmx_vmexit()). Signed-off-by: Nadav Har'El <nyh@il.ibm.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-07-12 11:45:18 +03:00
Nadav Har'El	eeadf9e755	KVM: nVMX: Handling of CR0 and CR4 modifying instructions When L2 tries to modify CR0 or CR4 (with mov or clts), and modifies a bit which L1 asked to shadow (via CR[04]_GUEST_HOST_MASK), we already do the right thing: we let L1 handle the trap (see nested_vmx_exit_handled_cr() in a previous patch). When L2 modifies bits that L1 doesn't care about, we let it think (via CR[04]_READ_SHADOW) that it did these modifications, while only changing (in GUEST_CR[04]) the bits that L0 doesn't shadow. This is needed for correct handling of CR0.TS for lazy FPU loading: L0 may want to leave TS on, while pretending to allow the guest to change it. Signed-off-by: Nadav Har'El <nyh@il.ibm.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-07-12 11:45:18 +03:00
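The role of the read shadow can be shown in one expression: for bits set in the guest/host mask the guest reads the shadow value, and for all other bits it reads the real register. The helper name below is hypothetical; the combining rule is the VMX behaviour the commit message relies on.

```c
#include <stdint.h>

#define CR0_TS (1ULL << 3)

/*
 * Hypothetical helper: the value a guest observes when it reads CR0 or CR4.
 * Masked bits come from the read shadow, the rest from the real register.
 */
uint64_t guest_visible_cr(uint64_t real, uint64_t read_shadow,
                          uint64_t guest_host_mask)
{
    return (real & ~guest_host_mask) | (read_shadow & guest_host_mask);
}
```

For example, with TS set in the real CR0, clear in the shadow, and `CR0_TS` in the mask, the guest reads TS as clear even though the hardware bit stays on for lazy FPU loading.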
Nadav Har'El	66c78ae40c	KVM: nVMX: Correct handling of idt vectoring info This patch adds correct handling of IDT_VECTORING_INFO_FIELD for the nested case. When a guest exits while delivering an interrupt or exception, we get this information in IDT_VECTORING_INFO_FIELD in the VMCS. When L2 exits to L1, there's nothing we need to do, because L1 will see this field in vmcs12, and handle it itself. However, when L2 exits and L0 handles the exit itself and plans to return to L2, L0 must inject this event to L2. In the normal non-nested case, the idt_vectoring_info case is discovered after the exit, and the decision to inject (though not the injection itself) is made at that point. However, in the nested case a decision of whether to return to L2 or L1 also happens during the injection phase (see the previous patches), so in the nested case we can only decide what to do about the idt_vectoring_info right after the injection, i.e., in the beginning of vmx_vcpu_run, which is the first time we know for sure if we're staying in L2. Therefore, when we exit L2 (is_guest_mode(vcpu)), we disable the regular vmx_complete_interrupts() code which queues the idt_vectoring_info for injection on next entry - because such injection would not be appropriate if we will decide to exit to L1. Rather, we just save the idt_vectoring_info and related fields in vmcs12 (which is a convenient place to save these fields). On the next entry in vmx_vcpu_run (after the injection phase, potentially exiting to L1 to inject an event requested by user space), if we find ourselves in L1 we don't need to do anything with those values we saved (as explained above). But if we find that we're in L2, or rather still at L2 (it's not nested_run_pending, meaning that this is the first round of L2 running after L1 having just launched it), we need to inject the event saved in those fields - by writing the appropriate VMCS fields. 
Signed-off-by: Nadav Har'El <nyh@il.ibm.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-07-12 11:45:18 +03:00
Nadav Har'El	0b6ac343fc	KVM: nVMX: Correct handling of exception injection Similar to the previous patch, but concerning injection of exceptions rather than external interrupts. Signed-off-by: Nadav Har'El <nyh@il.ibm.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-07-12 11:45:17 +03:00
Nadav Har'El	b6f1250edb	KVM: nVMX: Correct handling of interrupt injection The code in this patch correctly emulates external-interrupt injection while a nested guest L2 is running. Because of this code's relative un-obviousness, I include here a longer-than-usual justification for what it does - much longer than the code itself ;-) To understand how to correctly emulate interrupt injection while L2 is running, let's look first at what we need to emulate: What would things look like if the extra L0 hypervisor layer were removed, and instead of L0 injecting an interrupt, we had hardware delivering an interrupt? Now we have L1 running on bare metal with a guest L2, and the hardware generates an interrupt. Assuming that L1 set PIN_BASED_EXT_INTR_MASK to 1, and VM_EXIT_ACK_INTR_ON_EXIT to 0 (we'll revisit these assumptions below), what happens now is this: The processor exits from L2 to L1, with an external-interrupt exit reason but without an interrupt vector. L1 runs, with interrupts disabled, and it doesn't yet know what the interrupt was. Soon after, it enables interrupts and only at that moment, it gets the interrupt from the processor. When L1 is KVM, Linux handles this interrupt. Now we need exactly the same thing to happen when that L1->L2 system runs on top of L0, instead of real hardware. This is how we do this: When L0 wants to inject an interrupt, it needs to exit from L2 to L1, with external-interrupt exit reason (with an invalid interrupt vector), and run L1. Just like in the bare metal case, it likely can't deliver the interrupt to L1 now because L1 is running with interrupts disabled, in which case it turns on the interrupt window when running L1 after the exit. L1 will soon enable interrupts, and at that point L0 will gain control again and inject the interrupt to L1. Finally, there is an extra complication in the code: when nested_run_pending, we cannot return to L1 now, and must launch L2. 
We need to remember the interrupt we wanted to inject (and not clear it now), and do it on the next exit. The above explanation shows that the relative strangeness of the nested interrupt injection code in this patch, and the extra interrupt-window exit incurred, are in fact necessary for accurate emulation, and are not just an unoptimized implementation. Let's revisit now the two assumptions made above: If L1 turns off PIN_BASED_EXT_INTR_MASK (no hypervisor that I know does, by the way), things are simple: L0 may inject the interrupt directly to the L2 guest - using the normal code path that injects to any guest. We support this case in the code below. If L1 turns on VM_EXIT_ACK_INTR_ON_EXIT, things look very different from the description above: L1 expects to see an exit from L2 with the interrupt vector already filled in the exit information, and does not expect to be interrupted again with this interrupt. The current code does not (yet) support this case, so we do not allow the VM_EXIT_ACK_INTR_ON_EXIT exit-control to be turned on by L1. Signed-off-by: Nadav Har'El <nyh@il.ibm.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-07-12 11:45:17 +03:00
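The three outcomes discussed in this commit message can be condensed into a small decision function. This is a sketch of the reasoning only: the enum, the function name, and the boolean parameters are hypothetical and do not correspond to KVM's actual code structure.

```c
#include <stdbool.h>

/* Hypothetical outcomes for an interrupt arriving while L2 is running. */
enum inject_action {
    INJECT_TO_L2,   /* L1 does not intercept: normal injection path */
    EXIT_TO_L1,     /* emulate an external-interrupt exit from L2 to L1 */
    DEFER           /* keep the interrupt pending and launch L2 first */
};

enum inject_action on_external_interrupt(bool l1_ext_intr_mask,
                                         bool nested_run_pending)
{
    if (!l1_ext_intr_mask)
        return INJECT_TO_L2;
    if (nested_run_pending)
        return DEFER;       /* L1 just launched L2; we must run L2 now */
    return EXIT_TO_L1;
}
```

The `VM_EXIT_ACK_INTR_ON_EXIT` case is left out because the message states that it is not supported yet.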
Nadav Har'El	644d711aa0	KVM: nVMX: Deciding if L0 or L1 should handle an L2 exit This patch contains the logic of whether an L2 exit should be handled by L0 and then L2 should be resumed, or whether L1 should be run to handle this exit (using the nested_vmx_vmexit() function of the previous patch). The basic idea is to let L1 handle the exit only if it actually asked to trap this sort of event. For example, when L2 exits on a change to CR0, we check L1's CR0_GUEST_HOST_MASK to see if L1 expressed interest in any bit which changed; If it did, we exit to L1. But if it didn't it means that it is we (L0) that wished to trap this event, so we handle it ourselves. The next two patches add additional logic of what to do when an interrupt or exception is injected: Does L0 need to do it, should we exit to L1 to do it, or should we resume L2 and keep the exception to be injected later. We keep a new flag, "nested_run_pending", which can override the decision of which should run next, L1 or L2. nested_run_pending=1 means that we must run L2 next, not L1. This is necessary in particular when L1 did a VMLAUNCH of L2 and therefore expects L2 to be run (and perhaps be injected with an event it specified, etc.). Nested_run_pending is especially intended to avoid switching to L1 in the injection decision-point described above. Signed-off-by: Nadav Har'El <nyh@il.ibm.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-07-12 11:45:16 +03:00
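The CR0 example from this message can be written as a single test on the bits that changed. The helper below is a hypothetical sketch: it assumes the "old" value is the CR0 value L2 believes it has, and it ignores every exit reason other than a CR0 write.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: should L1 handle L2's write to CR0?
 * l1_mask is L1's CR0_GUEST_HOST_MASK; old_val is the CR0 value as seen by
 * L2 before the write; new_val is the value L2 is trying to load.
 */
bool l1_wants_cr0_exit(uint64_t l1_mask, uint64_t old_val, uint64_t new_val)
{
    uint64_t changed = old_val ^ new_val;

    return (l1_mask & changed) != 0;   /* L1 asked to own a changed bit */
}
```

When this returns false, the exit happened only because L0 wanted it, so L0 handles it and resumes L2.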
Nadav Har'El	7c1779384a	KVM: nVMX: vmcs12 checks on nested entry This patch adds a bunch of tests of the validity of the vmcs12 fields, according to what the VMX spec and our implementation allows. If fields we cannot (or don't want to) honor are discovered, an entry failure is emulated. According to the spec, there are two types of entry failures: If the problem was in vmcs12's host state or control fields, the VMLAUNCH instruction simply fails. But if a problem is found in the guest state, the behavior is more similar to that of an exit. Signed-off-by: Nadav Har'El <nyh@il.ibm.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-07-12 11:45:16 +03:00
Nadav Har'El	4704d0befb	KVM: nVMX: Exiting from L2 to L1 This patch implements nested_vmx_vmexit(), called when the nested L2 guest exits and we want to run its L1 parent and let it handle this exit. Note that this will not necessarily be called on every L2 exit. L0 may decide to handle a particular exit on its own, without L1's involvement; In that case, L0 will handle the exit, and resume running L2, without running L1 and without calling nested_vmx_vmexit(). The logic for deciding whether to handle a particular exit in L1 or in L0, i.e., whether to call nested_vmx_vmexit(), will appear in a separate patch below. Signed-off-by: Nadav Har'El <nyh@il.ibm.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-07-12 11:45:16 +03:00
Nadav Har'El	99e65e805d	KVM: nVMX: No need for handle_vmx_insn function any more Before nested VMX support, the exit handler for a guest executing a VMX instruction (vmclear, vmlaunch, vmptrld, vmptrst, vmread, vmresume, vmwrite, vmon, vmoff), was handle_vmx_insn(). This handler simply threw a #UD exception. Now that all these exit reasons are properly handled (and emulate the respective VMX instruction), nothing calls this dummy handler and it can be removed. Signed-off-by: Nadav Har'El <nyh@il.ibm.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-07-12 11:45:15 +03:00
Nadav Har'El	cd232ad02f	KVM: nVMX: Implement VMLAUNCH and VMRESUME Implement the VMLAUNCH and VMRESUME instructions, allowing a guest hypervisor to run its own guests. This patch does not include some of the necessary validity checks on vmcs12 fields before the entry. These will appear in a separate patch below. Signed-off-by: Nadav Har'El <nyh@il.ibm.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-07-12 11:45:15 +03:00
Nadav Har'El	fe3ef05c75	KVM: nVMX: Prepare vmcs02 from vmcs01 and vmcs12 This patch contains code to prepare the VMCS which can be used to actually run the L2 guest, vmcs02. prepare_vmcs02 appropriately merges the information in vmcs12 (the vmcs that L1 built for L2) and in vmcs01 (our desires for our own guests). Signed-off-by: Nadav Har'El <nyh@il.ibm.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-07-12 11:45:14 +03:00
Nadav Har'El	bf8179a011	KVM: nVMX: Move control field setup to functions Move some of the control field setup to common functions. These functions will also be needed for running L2 guests - L0's desires (expressed in these functions) will be appropriately merged with L1's desires. Signed-off-by: Nadav Har'El <nyh@il.ibm.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-07-12 11:45:14 +03:00
Nadav Har'El	a3a8ff8ebf	KVM: nVMX: Move host-state field setup to a function Move the setting of constant host-state fields (fields that do not change throughout the life of the guest) from vmx_vcpu_setup to a new common function vmx_set_constant_host_state(). This function will also be used to set the host state when running L2 guests. Signed-off-by: Nadav Har'El <nyh@il.ibm.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-07-12 11:45:14 +03:00
Nadav Har'El	49f705c532	KVM: nVMX: Implement VMREAD and VMWRITE Implement the VMREAD and VMWRITE instructions. With these instructions, L1 can read and write to the VMCS it is holding. The values are read or written to the fields of the vmcs12 structure introduced in a previous patch. Signed-off-by: Nadav Har'El <nyh@il.ibm.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-07-12 11:45:14 +03:00
Nadav Har'El	6a4d755060	KVM: nVMX: Implement VMPTRST This patch implements the VMPTRST instruction. Signed-off-by: Nadav Har'El <nyh@il.ibm.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-07-12 11:45:13 +03:00
Nadav Har'El	63846663ea	KVM: nVMX: Implement VMPTRLD This patch implements the VMPTRLD instruction. Signed-off-by: Nadav Har'El <nyh@il.ibm.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-07-12 11:45:12 +03:00
Nadav Har'El	27d6c86521	KVM: nVMX: Implement VMCLEAR This patch implements the VMCLEAR instruction. Signed-off-by: Nadav Har'El <nyh@il.ibm.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-07-12 11:45:12 +03:00
Nadav Har'El	0140caea3b	KVM: nVMX: Success/failure of VMX instructions. VMX instructions specify success or failure by setting certain RFLAGS bits. This patch contains common functions to do this, and they will be used in the following patches which emulate the various VMX instructions. Signed-off-by: Nadav Har'El <nyh@il.ibm.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-07-12 11:45:12 +03:00
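The RFLAGS conventions mentioned here can be sketched as three small functions. The function names are hypothetical; the convention itself follows the VMX instruction reference: success clears CF, PF, AF, ZF, SF and OF, failure without a current VMCS sets only CF, and failure with a current VMCS sets only ZF.

```c
#include <stdint.h>

#define FL_CF (1ULL << 0)
#define FL_PF (1ULL << 2)
#define FL_AF (1ULL << 4)
#define FL_ZF (1ULL << 6)
#define FL_SF (1ULL << 7)
#define FL_OF (1ULL << 11)
#define FL_ARITH (FL_CF | FL_PF | FL_AF | FL_ZF | FL_SF | FL_OF)

/* Success: all six arithmetic flags cleared. */
uint64_t vmx_succeed(uint64_t rflags)
{
    return rflags & ~FL_ARITH;
}

/* Failure with no current VMCS: CF set, the other five cleared. */
uint64_t vmx_fail_invalid(uint64_t rflags)
{
    return (rflags & ~FL_ARITH) | FL_CF;
}

/* Failure with a current VMCS: ZF set, the other five cleared. */
uint64_t vmx_fail_valid(uint64_t rflags)
{
    return (rflags & ~FL_ARITH) | FL_ZF;
}
```

In the valid-VMCS failure case the architecture also records an error number in the VM-instruction error field, which this sketch leaves out.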
Nadav Har'El	22bd035868	KVM: nVMX: Add VMCS fields to the vmcs12 In this patch we add to vmcs12 (the VMCS that L1 keeps for L2) all the standard VMCS fields. Later patches will enable L1 to read and write these fields using VMREAD/ VMWRITE, and they will be used during a VMLAUNCH/VMRESUME in preparing vmcs02, a hardware VMCS for running L2. Signed-off-by: Nadav Har'El <nyh@il.ibm.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-07-12 11:45:11 +03:00
Nadav Har'El	ff2f6fe961	KVM: nVMX: Introduce vmcs02: VMCS used to run L2 We saw in a previous patch that L1 controls its L2 guest with a vmcs12. L0 needs to create a real VMCS for running L2. We call that "vmcs02". A later patch will contain the code, prepare_vmcs02(), for filling the vmcs02 fields. This patch only contains code for allocating vmcs02. In this version, prepare_vmcs02() sets all of vmcs02's fields each time we enter from L1 to L2, so keeping just one vmcs02 for the vcpu is enough: It can be reused even when L1 runs multiple L2 guests. However, in future versions we'll probably want to add an optimization where vmcs02 fields that rarely change will not be set each time. For that, we may want to keep around several vmcs02s of L2 guests that have recently run, so that potentially we could run these L2s again more quickly because fewer vmwrites to vmcs02 will be needed. This patch adds to each vcpu a vmcs02 pool, vmx->nested.vmcs02_pool, which remembers the vmcs02s last used to run up to VMCS02_POOL_SIZE L2s. As explained above, in the current version we choose VMCS02_POOL_SIZE=1, i.e., one vmcs02 is allocated (and loaded onto the processor), and it is reused to enter any L2 guest. In the future, when prepare_vmcs02() is optimized not to set all fields every time, VMCS02_POOL_SIZE should be increased. Signed-off-by: Nadav Har'El <nyh@il.ibm.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-07-12 11:45:11 +03:00
Nadav Har'El	064aea7747	KVM: nVMX: Decoding memory operands of VMX instructions This patch includes a utility function for decoding pointer operands of VMX instructions issued by L1 (a guest hypervisor) Signed-off-by: Nadav Har'El <nyh@il.ibm.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-07-12 11:45:11 +03:00
Nadav Har'El	b87a51ae28	KVM: nVMX: Implement reading and writing of VMX MSRs When the guest can use VMX instructions (when the "nested" module option is on), it should also be able to read and write VMX MSRs, e.g., to query about VMX capabilities. This patch adds this support. Signed-off-by: Nadav Har'El <nyh@il.ibm.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-07-12 11:45:11 +03:00
Nadav Har'El	a9d30f33dd	KVM: nVMX: Introduce vmcs12: a VMCS structure for L1 An implementation of VMX needs to define a VMCS structure. This structure is kept in guest memory, but is opaque to the guest (who can only read or write it with VMX instructions). This patch starts to define the VMCS structure which our nested VMX implementation will present to L1. We call it "vmcs12", as it is the VMCS that L1 keeps for its L2 guest. We will add more content to this structure in later patches. This patch also adds the notion (as required by the VMX spec) of L1's "current VMCS", and finally includes utility functions for mapping the guest-allocated VMCSs in host memory. Signed-off-by: Nadav Har'El <nyh@il.ibm.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-07-12 11:45:10 +03:00
Nadav Har'El	5e1746d620	KVM: nVMX: Allow setting the VMXE bit in CR4 This patch allows the guest to enable the VMXE bit in CR4, which is a prerequisite to running VMXON. Whether to allow setting the VMXE bit now depends on the architecture (svm or vmx), so its checking has moved to kvm_x86_ops->set_cr4(). This function now returns an int: If kvm_x86_ops->set_cr4() returns 1, __kvm_set_cr4() will also return 1, and this will cause kvm_set_cr4() to throw a #GP. Turning on the VMXE bit is allowed only when the nested VMX feature is enabled, and turning it off is forbidden after a vmxon. Signed-off-by: Nadav Har'El <nyh@il.ibm.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-07-12 11:45:10 +03:00
Nadav Har'El	ec378aeef9	KVM: nVMX: Implement VMXON and VMXOFF This patch allows a guest to use the VMXON and VMXOFF instructions, and emulates them accordingly. Basically this amounts to checking some prerequisites, and then remembering whether the guest has enabled or disabled VMX operation. Signed-off-by: Nadav Har'El <nyh@il.ibm.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-07-12 11:45:09 +03:00
Nadav Har'El	801d342432	KVM: nVMX: Add "nested" module option to kvm_intel This patch adds to kvm_intel a module option "nested". This option controls whether the guest can use VMX instructions, i.e., whether we allow nested virtualization. A similar, but separate, option already exists for the SVM module. This option currently defaults to 0, meaning that nested VMX must be explicitly enabled by giving nested=1. When nested VMX matures, the default should probably be changed to enable nested VMX by default - just like nested SVM is currently enabled by default. Signed-off-by: Nadav Har'El <nyh@il.ibm.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-07-12 11:45:09 +03:00
Takuya Yoshikawa	b5c9ff731f	KVM: x86 emulator: Avoid clearing the whole decode_cache While tracing the emulator, we noticed that init_emulate_ctxt() sometimes took a bit longer than we expected. This patch mitigates the problem to some degree. By looking into the function, we soon notice that it clears the whole decode_cache whose size is about 2.5K bytes now. Furthermore, most of the bytes are taken for the two read_cache arrays, which are used only by a few instructions. Considering the fact that we are not assuming the cache arrays have been cleared when we store actual data, we do not need to clear the arrays: 2K bytes elimination. In addition, we can avoid clearing the fetch_cache and regs arrays. This patch changes the initialization not to clear the arrays. On our 64-bit host, init_emulate_ctxt() becomes 0.3 to 0.5us faster with this patch applied. Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> Cc: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-07-12 11:45:09 +03:00
Takuya Yoshikawa	adf52235b4	KVM: x86 emulator: Clean up init_emulate_ctxt() Use a local pointer to the emulate_ctxt for simplicity. Then, arrange the hard-to-read mode selection lines neatly. Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-07-12 11:45:08 +03:00
Jan Kiszka	d780592b99	KVM: Clean up error handling during VCPU creation So far kvm_arch_vcpu_setup is responsible for freeing the vcpu struct if it fails. Move this confusing responsibility back into the hands of kvm_vm_ioctl_create_vcpu. Only kvm_arch_vcpu_setup of x86 is affected, all other archs cannot fail. Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-07-12 11:45:08 +03:00
Nadav Har'El	d462b81923	KVM: VMX: Keep list of loaded VMCSs, instead of vcpus In VMX, before we bring down a CPU we must VMCLEAR all VMCSs loaded on it because (at least in theory) the processor might not have written all of its content back to memory. Since a patch from June 26, 2008, this is done using a per-cpu "vcpus_on_cpu" linked list of vcpus loaded on each CPU. The problem is that with nested VMX, we no longer have the concept of a vcpu being loaded on a cpu: A vcpu has multiple VMCSs (one for L1, a pool for L2s), and each of those may have been last loaded on a different cpu. So instead of linking the vcpus, we link the VMCSs, using a new structure loaded_vmcs. This structure contains the VMCS, and the information pertaining to its loading on a specific cpu (namely, the cpu number, and whether it was already launched on this cpu once). In nested we will also use the same structure to hold L2 VMCSs, and vmx->loaded_vmcs is a pointer to the currently active VMCS. Signed-off-by: Nadav Har'El <nyh@il.ibm.com> Acked-by: Kevin Tian <kevin.tian@intel.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-07-12 11:45:08 +03:00
Avi Kivity	24c82e576b	KVM: Sanitize cpuid Instead of blacklisting known-unsupported cpuid leaves, whitelist known- supported leaves. This is more conservative and prevents us from reporting features we don't support. Also whitelist a few more leaves while at it. Signed-off-by: Avi Kivity <avi@redhat.com> Acked-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-07-12 11:45:07 +03:00
Xiao Guangrong	bcdd9a93c5	KVM: MMU: cleanup for dropping parent pte Introduce drop_parent_pte to remove the rmap of parent pte and clear parent pte Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-07-12 11:45:07 +03:00
Xiao Guangrong	38e3b2b28c	KVM: MMU: cleanup for kvm_mmu_page_unlink_children Cleanup the same operation between kvm_mmu_page_unlink_children and mmu_pte_write_zap_pte Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-07-12 11:45:07 +03:00
Xiao Guangrong	67052b3508	KVM: MMU: remove the arithmetic of parent pte rmap Parent pte rmap and page rmap are very similar, so use the same arithmetic for them Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-07-12 11:45:07 +03:00
Xiao Guangrong	53c07b1878	KVM: MMU: abstract the operation of rmap Abstract the operation of rmap to spte_list, then we can use it for the reverse mapping of parent pte in the later patch Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-07-12 11:45:06 +03:00
Xiao Guangrong	1249b96e72	KVM: fix uninitialized warning Fix: warning: ‘cs_sel’ may be used uninitialized in this function warning: ‘ss_sel’ may be used uninitialized in this function Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-07-12 11:45:06 +03:00
Xiao Guangrong	8b0cedff04	KVM: use __copy_to_user/__clear_user to write guest page Simply use __copy_to_user/__clear_user to write guest page since we have already verified the user address when the memslot is set Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-07-12 11:45:03 +03:00
Xiao Guangrong	332b207d65	KVM: MMU: optimize pte write path if don't have protected sp Simply return from the kvm_mmu_pte_write path if no shadow page is write-protected, so that we avoid walking all shadow pages and holding mmu-lock. Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-07-12 11:45:02 +03:00
Avi Kivity	96304217a7	KVM: VMX: always_inline VMREADs vmcs_readl() and friends are really short, but gcc thinks they are long because of the out-of-line exception handlers. Mark them always_inline to clear the misunderstanding. Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-07-12 11:45:01 +03:00
Avi Kivity	5e520e6278	KVM: VMX: Move VMREAD cleanup to exception handler We clean up a failed VMREAD by clearing the output register. Do it in the exception handler instead of unconditionally. This is worthwhile since there are more than a hundred call sites. Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-07-12 11:45:00 +03:00
Takuya Yoshikawa	7b105ca290	KVM: x86 emulator: Stop passing ctxt->ops as arg of emul functions Dereference it in the actual users. This not only cleans up the emulator but also makes it easy to convert the old emulation functions to the new em_xxx() form later. Note: Remove some inline keywords to let the compiler decide inlining. Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-07-12 11:44:59 +03:00
Takuya Yoshikawa	ef5d75cc9a	KVM: x86 emulator: Stop passing ctxt->ops as arg of decode helpers Dereference it in the actual users: only do_insn_fetch_byte(). This is consistent with the way __linearize() dereferences it. Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-07-12 11:44:57 +03:00
Takuya Yoshikawa	67cbc90db5	KVM: x86 emulator: Place insn_fetch helpers together The two macros need special care to use: Assume rc, ctxt, ops and done exist outside of them. Can goto outside. Considering the fact that these are used only in decode functions, moving these right after do_insn_fetch() seems to be a right thing to improve the readability. We also rename do_fetch_insn_byte() to do_insn_fetch_byte() to be consistent. Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-07-12 11:44:56 +03:00
Benjamin Herrenschmidt	a63fdc5156	mm: Move definition of MIN_MEMORY_BLOCK_SIZE to a header The macro MIN_MEMORY_BLOCK_SIZE is currently defined twice in two .c files, and I need it in a third one to fix a powerpc bug, so let's first move it into a header Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Acked-by: Ingo Molnar <mingo@elte.hu>	2011-07-12 11:08:01 +10:00
Raghavendra D Prabhu	3c52b7bf69	xen:pvhvm: Modpost section mismatch fix Removing __init from check_platform_magic since it is called by xen_unplug_emulated_devices in non-init contexts (It probably gets inlined because of -finline-functions-called-once, removing __init is more to avoid mismatch being reported). Signed-off-by: Raghavendra D Prabhu <rprabhu@wnohang.net> Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>	2011-07-11 13:37:04 -04:00
Konrad Rzeszutek Wilk	97ffab1f14	xen/pci: Use 'acpi_gsi_to_irq' value unconditionally. In the past we would only use the function's value if the returned value was not equal to 'acpi_sci_override_gsi'. Meaning that the INT_SRV_OVR values for global and source irq were different. But it is OK to use the function's value even when the global and source irq are the same. Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>	2011-07-11 13:19:34 -04:00
Konrad Rzeszutek Wilk	78316ada22	xen/pci: Remove 'xen_allocate_pirq_gsi'. In the past (2.6.38) the 'xen_allocate_pirq_gsi' would allocate an entry in a Linux IRQ -> {XEN_IRQ, type, event, ..} array. All of that has been removed in 2.6.39 and the Xen IRQ subsystem uses a linked list that is populated when the call to 'xen_allocate_irq_gsi' (universally done from any of the xen_bind_* calls) is done. The 'xen_allocate_pirq_gsi' is a NOP and there is no need for it anymore so let's remove it. Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>	2011-07-11 13:19:33 -04:00
Konrad Rzeszutek Wilk	34b1d1269d	xen/pci: Retire unnecessary #ifdef CONFIG_ACPI The code paths that are guarded by CONFIG_XEN_DOM0 already depend on CONFIG_ACPI, so the extra #ifdef is not required. The earlier patch that added them in had done its job. Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>	2011-07-11 13:19:32 -04:00
Konrad Rzeszutek Wilk	9b6519db5e	xen/pci: Move the allocation of IRQs when there are no IOAPIC's to the end .. which means we can preset of NR_IRQS_LEGACY interrupts using the 'acpi_get_override_irq' API before this loop. This means that we can get the IRQ's polarity (and trigger) from either the ACPI (or MP); or use the default values. This fixes a bug where, if we did not have an IOAPIC, we would not have been able to preset the IRQ's polarity if the MP table existed. Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>	2011-07-11 13:19:31 -04:00
Konrad Rzeszutek Wilk	a0ee056709	xen/pci: Squash pci_xen_initial_domain and xen_setup_pirqs together. Since they are only called once and the rest of the pci_xen_* functions follow the same pattern of setup. Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>	2011-07-11 13:19:30 -04:00
Konrad Rzeszutek Wilk	ed89eb6396	xen/pci: Use the xen_register_pirq for HVM and initial domain users .. to cut down on the code duplication. Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>	2011-07-11 13:19:29 -04:00
Konrad Rzeszutek Wilk	30bd35edfd	xen/pci: In xen_register_pirq bind the GSI to the IRQ after the hypercall. Not before .. also that code segment starts looking like the HVM one. Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>	2011-07-11 13:19:28 -04:00
Konrad Rzeszutek Wilk	d92edd814e	xen/pci: Provide #ifdef CONFIG_ACPI to easy code squashing. In the past we would guard those code segments to be dependent on CONFIG_XEN_DOM0 (which depends on CONFIG_ACPI) so this patch is not strictly necessary. But the next patch will merge common HVM and initial domain code and we want to make sure the CONFIG_ACPI dependency is preserved - as HVM code does not depend on CONFIG_XEN_DOM0. Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>	2011-07-11 13:19:27 -04:00
Konrad Rzeszutek Wilk	996c34aee3	xen/pci: Update comments and fix empty spaces. Update the out-dated comment at the beginning of the file. Also provide the copyrights of folks who have been contributing to this code lately. Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>	2011-07-11 13:19:26 -04:00
Konrad Rzeszutek Wilk	fef6e26208	xen/pci: Shuffle code around. The file is hard to read. Move the code around so that the contents of it follows a uniform format: - setup GSIs - PV, HVM, and initial domain case - then MSI/MSI-x setup - PV, HVM and then initial domain case. - then MSI/MSI-x teardown - same order. - lastly, the __init functions in PV, HVM, and initial domain order. Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>	2011-07-11 13:19:25 -04:00
Naga Chumbalkar	7fece83235	x86, ioapic: Also print Dest field The code in setup_ioapic_irq() determines the Destination Field, so why not also include it in the debug printk output that gets displayed when the boot parameter "apic=debug" is used. Before the change, "dmesg" will show: IOAPIC[0]: Set routing entry (8-1 -> 0x31 -> IRQ 1 Mode:0 Active:0) IOAPIC[0]: Set routing entry (8-2 -> 0x30 -> IRQ 0 Mode:0 Active:0) IOAPIC[0]: Set routing entry (8-3 -> 0x33 -> IRQ 3 Mode:0 Active:0) ... After the change, you will see: IOAPIC[0]: Set routing entry (8-1 -> 0x31 -> IRQ 1 Mode:0 Active:0 Dest:0) IOAPIC[0]: Set routing entry (8-2 -> 0x30 -> IRQ 0 Mode:0 Active:0 Dest:0) IOAPIC[0]: Set routing entry (8-3 -> 0x33 -> IRQ 3 Mode:0 Active:0 Dest:0) ... Signed-off-by: Naga Chumbalkar <nagananda.chumbalkar@hp.com> Link: http://lkml.kernel.org/r/20110708184603.2734.91071.sendpatchset@nchumbalkar.americas.cpqcorp.net Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-07-11 16:31:05 +02:00
Naga Chumbalkar	bd6a46e087	x86, ioapic: Format clean up for IOAPIC output When IOAPIC data is displayed in "dmesg" with the help of the boot parameter "apic=debug" certain values are not formatted correctly wrt their size. In the "dmesg" snippet below, note that the output for "max redirection entries", and "IO APIC version" which are each defined to be just 8-bits long are displayed as 2 bytes in length. Similarly, "Dst" under the "IRQ redirection table" should only be 8-bits long. IO APIC #0...... ... ... .... register #01: 00170020 ....... : max redirection entries: 0017 ....... : PRQ implemented: 0 ....... : IO APIC version: 0020 ... ... .... IRQ redirection table: NR Dst Mask Trig IRR Pol Stat Dmod Deli Vect: 00 000 1 0 0 0 0 0 0 00 01 000 0 0 0 0 0 0 0 31 02 000 0 0 0 0 0 0 0 30 03 000 1 0 0 0 0 0 0 33 ... ... Do some formatting clean up, so you will see output like below: IO APIC #0...... ... ... .... register #01: 00170020 ....... : max redirection entries: 17 ....... : PRQ implemented: 0 ....... : IO APIC version: 20 ... ... .... IRQ redirection table: NR Dst Mask Trig IRR Pol Stat Dmod Deli Vect: 00 00 1 0 0 0 0 0 0 00 01 00 0 0 0 0 0 0 0 31 02 00 0 0 0 0 0 0 0 30 03 00 1 0 0 0 0 0 0 33 ... ... Signed-off-by: Naga Chumbalkar <nagananda.chumbalkar@hp.com> Link: http://lkml.kernel.org/r/20110708184557.2734.61830.sendpatchset@nchumbalkar.americas.cpqcorp.net Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-07-11 16:31:05 +02:00
Tejun Heo	5da0ef9a85	x86: Disable AMD_NUMA for 32bit for now Commit `2706a0bf7b` ("x86, NUMA: Enable CONFIG_AMD_NUMA on 32bit too") enabled AMD NUMA for 32bit too. Unfortunately, SPARSEMEM on 32bit had rather coarse (512MiB) addr->node mapping granularity due to lack of space in page->flags. This led to boot failure on certain AMD NUMA machines which had 128MiB alignment on nodes. Patches to properly detect this condition and reject NUMA configuration are posted[1] but deemed too pervasive for merge at this point (-rc6). Disable AMD NUMA for 32bit for now and re-enable once the detection logic is merged. [1] http://thread.gmane.org/gmane.linux.kernel/1161279/focus=1162583 Reported-by: Hans Rosenfeld <hans.rosenfeld@amd.com> Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Conny Seidel <conny.seidel@amd.com> Link: http://lkml.kernel.org/r/20110711083432.GC943@htj.dyndns.org Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-07-11 16:25:30 +02:00
Michael Witten	2dc98fd320	doc: Konfig: Documentation/power/{pm => apm-acpi}.txt Signed-off-by: Michael Witten <mfwitten@gmail.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>	2011-07-11 14:20:07 +02:00
Jiri Kosina	b7e9c223be	Merge branch 'master' into for-next Sync with Linus' tree to be able to apply pending patches that are based on newer code already present upstream.	2011-07-11 14:15:55 +02:00
Anupam Chanda	24a42bae68	x86, hyper: Change hypervisor detection order Detect Xen before HyperV because in Viridian compatibility mode Xen presents itself as HyperV. Move Xen to the top since it seems more likely that Xen would emulate VMware than vice versa. Signed-off-by: Anupam Chanda <achanda@nicira.com> Link: http://lkml.kernel.org/r/1310150570-26810-1-git-send-email-achanda@nicira.com Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com> Acked-by: Yaozu (Eddie) Dong <eddie.dong@intel.com> Reviewed-by: H. Peter Anvin <hpa@zytor.com> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>	2011-07-08 16:22:29 -07:00
Vivek Goyal	14cb6dcf0a	x86, boot: Wait for boot cpu to show up if nr_cpus limit is about to hit nr_cpus allows one to specify number of possible cpus in the system. Current assumption seems to be that first cpu to show up is boot cpu and this assumption will be broken in kdump scenario where we can be booting on a non boot cpu with nr_cpus=1. It might happen that first cpu we parse is not the cpu we boot on and later we ignore boot cpu. Though code later seems to recognize this anomaly and forcibly sets boot cpu in physical cpu map with following warning. if (!physid_isset(hard_smp_processor_id(), phys_cpu_present_map)) { printk(KERN_WARNING "weird, boot CPU (#%d) not listed by the BIOS.\n", hard_smp_processor_id()); physid_set(hard_smp_processor_id(), phys_cpu_present_map); } This patch waits for boot cpu to show up and starts ignoring the cpus once we have hit (nr_cpus - 1) number of cpus. So effectively we are reserving one slot out of nr_cpus for boot cpu explicitly. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Acked-by: Yinghai Lu <yinghai@kernel.org> Link: http://lkml.kernel.org/r/20110708171926.GF2930@redhat.com Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>	2011-07-08 15:33:35 -07:00
Naga Chumbalkar	ded1f6ab43	x86: print APIC data a little later during boot To view IOAPIC data you could boot with "apic=debug". When booting in such a way then the kernel will dump the IO-APIC's registers, for example: NR Dst Mask Trig IRR Pol Stat Dmod Deli Vect: 00 000 1 0 0 0 0 0 0 00 01 000 0 0 0 0 0 0 0 31 02 000 0 0 0 0 0 0 0 30 03 000 0 0 0 0 0 0 0 33 04 000 0 0 0 0 0 0 0 34 05 000 0 0 0 0 0 0 0 35 06 000 0 0 0 0 0 0 0 36 07 000 0 0 0 0 0 0 0 37 08 000 0 0 0 0 0 0 0 38 09 000 0 1 0 0 0 0 0 39 0a 000 0 0 0 0 0 0 0 3A 0b 000 0 0 0 0 0 0 0 3B 0c 000 0 0 0 0 0 0 0 3C 0d 000 0 0 0 0 0 0 0 3D 0e 000 0 0 0 0 0 0 0 3E 0f 000 0 0 0 0 0 0 0 3F 10 000 1 0 0 0 0 0 0 00 11 000 1 0 0 0 0 0 0 00 12 000 1 0 0 0 0 0 0 00 13 000 1 0 0 0 0 0 0 00 14 000 1 0 0 0 0 0 0 00 15 000 1 0 0 0 0 0 0 00 16 000 1 0 0 0 0 0 0 00 17 000 1 0 0 0 0 0 0 00 Delaying the call to print_ICs() gives better results: NR Dst Mask Trig IRR Pol Stat Dmod Deli Vect: 00 000 1 0 0 0 0 0 0 00 01 000 0 0 0 0 0 0 0 31 02 000 0 0 0 0 0 0 0 30 03 000 1 0 0 0 0 0 0 33 04 000 1 0 0 0 0 0 0 34 05 000 1 0 0 0 0 0 0 35 06 000 1 0 0 0 0 0 0 36 07 000 1 0 0 0 0 0 0 37 08 000 0 0 0 0 0 0 0 38 09 000 0 1 0 0 0 0 0 39 0a 000 1 0 0 0 0 0 0 3A 0b 000 1 0 0 0 0 0 0 3B 0c 000 0 0 0 0 0 0 0 3C 0d 000 1 0 0 0 0 0 0 3D 0e 000 1 0 0 0 0 0 0 3E 0f 000 1 0 0 0 0 0 0 3F 10 000 1 1 0 1 0 0 0 29 11 000 1 0 0 0 0 0 0 00 12 000 1 0 0 0 0 0 0 00 13 000 1 0 0 0 0 0 0 00 14 000 0 1 0 1 0 0 0 51 15 000 1 0 0 0 0 0 0 00 16 000 0 1 0 1 0 0 0 61 17 000 0 1 0 1 0 0 0 59 Notice that the entries beyond interrupt input signal 0x0f also get populated and arent just the hw-initialization default of all zeroes. Signed-off-by: Naga Chumbalkar <nagananda.chumbalkar@hp.com> Link: http://lkml.kernel.org/r/20110708083555.2598.42216.sendpatchset@nchumbalkar.americas.hpqcorp.net Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-07-08 13:20:14 +02:00
Linus Torvalds	075d9db131	Merge branch 'stable/bug.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen * 'stable/bug.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen: xen/pci: Move check for acpi_sci_override_gsi to xen_setup_acpi_sci.	2011-07-07 13:19:04 -07:00
Linus Torvalds	e55f1b1c00	Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: x86: Don't use the EFI reboot method by default x86, suspend: Restore MISC_ENABLE MSR in realmode wakeup x86, reboot: Acer Aspire One A110 reboot quirk x86-32, NUMA: Fix boot regression caused by NUMA init unification on highmem machines	2011-07-07 13:18:13 -07:00
Linus Torvalds	27a3b735b7	Merge branches 'core-urgent-for-linus', 'perf-urgent-for-linus' and 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: debugobjects: Fix boot crash when kmemleak and debugobjects enabled * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: jump_label: Fix jump_label update for modules oprofile, x86: Fix race in nmi handler while starting counters * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: sched: Disable (revert) SCHED_LOAD_SCALE increase sched, cgroups: Fix MIN_SHARES on 64-bit boxen	2011-07-07 13:17:45 -07:00
Steven Rostedt	e08fbb78f0	tracing, x86/irq: Do not trace arch_local_{,irq_}() functions I triggered a triple fault with gcc 4.5.1 because it did not honor the inline annotation to arch_local_save_flags() function and that function was added to the pool of functions traced by the function tracer. When preempt_schedule() called arch_local_save_flags() (called by irqs_disabled()), it was traced, but the first thing the function tracer does is disable preemption. When it enables preemption, the NEED_RESCHED flag will not have been cleared and the preemption check will trigger the call to preempt_schedule() again. Although the dynamic function tracer crashed immediately, the static version of the function tracer (CONFIG_DYNAMIC_FTRACE is not set) actually was able to show where the problem was. swapper-1 3.N.. 103885us : arch_local_save_flags <-preempt_schedule swapper-1 3.N.. 103886us : arch_local_save_flags <-preempt_schedule swapper-1 3.N.. 103886us : arch_local_save_flags <-preempt_schedule swapper-1 3.N.. 103887us : arch_local_save_flags <-preempt_schedule swapper-1 3.N.. 103887us : arch_local_save_flags <-preempt_schedule swapper-1 3.N.. 103888us : arch_local_save_flags <-preempt_schedule swapper-1 3.N.. 103888us : arch_local_save_flags <-preempt_schedule It went on for a while before it triple faulted with a corrupted stack. The arch_local_save_flags and arch_local_irq_* functions should not be traced. Even though they are marked as inline, gcc may still make them a function and enable tracing of them. The simple solution is to just mark them as notrace. I had to add the <linux/types.h> for this file to include the notrace tag. Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Link: http://lkml.kernel.org/r/20110702033852.733414762@goodmis.org Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-07-07 19:22:32 +02:00
Konrad Rzeszutek Wilk	ee339fe63a	xen/pci: Move check for acpi_sci_override_gsi to xen_setup_acpi_sci. Previously we would check for acpi_sci_override_gsi == gsi every time a PCI device was enabled. That works during early bootup, but later on it could lead to triggering unnecessarily the acpi_gsi_to_irq(..) lookup. The reason is that acpi_sci_override_gsi was declared in __initdata and after early bootup could contain bogus values. This patch moves the check for acpi_sci_override_gsi to the site where the ACPI SCI is preset. CC: stable@kernel.org Reported-by: Raghavendra D Prabhu <rprabhu@wnohang.net> Tested-by: Raghavendra D Prabhu <rprabhu@wnohang.net> [http://lists.xensource.com/archives/html/xen-devel/2011-07/msg00154.html] Suggested-by: Ian Campbell <ijc@hellion.org.uk> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>	2011-07-07 12:19:08 -04:00
Ingo Molnar	b395fb36d5	Merge branch 'iommu-3.1' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu into core/iommu	2011-07-07 12:58:28 +02:00
Matthew Garrett	f70e957cda	x86: Don't use the EFI reboot method by default Testing suggests that at least some Lenovos and some Intels will fail to reboot via EFI, attempting to jump to an unmapped physical address. In the long run we could handle this by providing a page table with a 1:1 mapping of physical addresses, but for now it's probably just easier to assume that ACPI or legacy methods will be present and reboot via those. Signed-off-by: Matthew Garrett <mjg@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Alan Cox <alan@linux.intel.com> Link: http://lkml.kernel.org/r/1309985557-15350-1-git-send-email-mjg@redhat.com Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-07-07 11:35:05 +02:00
Kees Cook	7a3136666b	x86, suspend: Restore MISC_ENABLE MSR in realmode wakeup Some BIOSes will reset the Intel MISC_ENABLE MSR (specifically the XD_DISABLE bit) when resuming from S3, which can interact poorly with `ebba638ae7`. In 32bit PAE mode, this can lead to a fault when EFER is restored by the kernel wakeup routines, due to it setting the NX bit for a CPU that (thanks to the BIOS reset) now incorrectly thinks it lacks the NX feature. (64bit is not affected because it uses a common CPU bring-up that specifically handles the XD_DISABLE bit.) The need for MISC_ENABLE being restored so early is specific to the S3 resume path. Normally, MISC_ENABLE is saved in save_processor_state(), but this happens after the resume header is created, so just reproduce the logic here. (acpi_suspend_lowlevel() creates the header, calls do_suspend_lowlevel, which calls save_processor_state(), so the saved processor context isn't available during resume header creation.) [ hpa: Consider for stable if OK in mainline ] Signed-off-by: Kees Cook <kees.cook@canonical.com> Link: http://lkml.kernel.org/r/20110707011034.GA8523@outflux.net Signed-off-by: H. Peter Anvin <hpa@zytor.com> Cc: Rafael J. Wysocki <rjw@sisk.pl> Cc: <stable@kernel.org> 2.6.38+	2011-07-06 20:09:34 -07:00
Daniel Drake	a0f30f592d	x86, olpc: Add XO-1.5 SCI driver Add a driver for the ACPI-based EC event interface found on the OLPC XO-1.5 laptop. This enables notification of battery/AC power events, and enables various devices to be used as wakeup sources through regular ACPI mechanisms. This driver can't be built as a module, because some drivers need to know at boot-time if SCI-based functionality is available via olpc_ec_wakeup_available(). Signed-off-by: Daniel Drake <dsd@laptop.org> Link: http://lkml.kernel.org/r/1309019658-1712-12-git-send-email-dsd@laptop.org Acked-by: Andres Salomon <dilinger@queued.net> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>	2011-07-06 14:44:43 -07:00
Daniel Drake	cfee95977b	x86, olpc: Add XO-1 RTC driver Add a driver to configure the XO-1 RTC via CS5536 MSRs, to be used as a system wakeup source via olpc-xo1-pm. Device detection is based on finding the relevant device tree node. Signed-off-by: Daniel Drake <dsd@laptop.org> Link: http://lkml.kernel.org/r/1309019658-1712-11-git-send-email-dsd@laptop.org Acked-by: Andres Salomon <dilinger@queued.net> Acked-by: Grant Likely <grant.likely@secretlab.ca> Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: devicetree-discuss@lists.ozlabs.org Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>	2011-07-06 14:44:42 -07:00
Daniel Drake	e1040ac693	x86, olpc-xo1-sci: Propagate power supply/battery events EC events indicate change in AC power connectivity, battery state of charge, battery error, battery presence, etc. Send notifications to the power supply subsystem when changes are detected. Signed-off-by: Daniel Drake <dsd@laptop.org> Link: http://lkml.kernel.org/r/1309019658-1712-10-git-send-email-dsd@laptop.org Acked-by: Andres Salomon <dilinger@queued.net> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>	2011-07-06 14:44:40 -07:00
Daniel Drake	2cf2baea10	x86, olpc-xo1-sci: Add lid switch functionality Configure the XO-1's lid switch GPIO to trigger an SCI interrupt, and correctly expose this input device which can be used as a wakeup source. Signed-off-by: Daniel Drake <dsd@laptop.org> Link: http://lkml.kernel.org/r/1309019658-1712-9-git-send-email-dsd@laptop.org Acked-by: Andres Salomon <dilinger@queued.net> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>	2011-07-06 14:44:39 -07:00
Daniel Drake	7bc74b3df7	x86, olpc-xo1-sci: Add GPE handler and ebook switch functionality The EC in the OLPC XO-1 delivers GPE events to provide various notifications. Add the basic code for GPE/EC event processing and enable the ebook switch, which can be used as a wakeup source. Signed-off-by: Daniel Drake <dsd@laptop.org> Link: http://lkml.kernel.org/r/1309019658-1712-8-git-send-email-dsd@laptop.org Acked-by: Andres Salomon <dilinger@queued.net> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>	2011-07-06 14:44:38 -07:00
Daniel Drake	bc4ecd5a5e	x86, olpc: EC SCI wakeup mask functionality Update the EC SCI masks with recent additions. Add functions to query SCI events and set the wakeup mask, to be used by followup patches. Add functions to tweak an event mask used to select certain EC events as a system wakeup source. Also add a function to determine if EC wakeup functionality is available, as this depends on child drivers (different for each laptop model) to configure the SCI interrupt. Signed-off-by: Daniel Drake <dsd@laptop.org> Link: http://lkml.kernel.org/r/1309019658-1712-7-git-send-email-dsd@laptop.org Acked-by: Andres Salomon <dilinger@queued.net> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>	2011-07-06 14:44:36 -07:00
Daniel Drake	7feda8e9f3	x86, olpc: Add XO-1 SCI driver and power button control The System Control Interrupt is used in the OLPC XO-1 to control various features of the laptop. Add the driver base and the power button functionality. This driver can't be built as a module, because functionality added in future patches means that some drivers need to know at boot-time whether SCI-based functionality is available. Signed-off-by: Daniel Drake <dsd@laptop.org> Link: http://lkml.kernel.org/r/1309019658-1712-6-git-send-email-dsd@laptop.org Acked-by: Andres Salomon <dilinger@queued.net> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>	2011-07-06 14:44:34 -07:00
Daniel Drake	97c4cb71c1	x86, olpc: Add XO-1 suspend/resume support Add code needed for basic suspend/resume of the XO-1 laptop. Based on earlier work by Jordan Crouse, Andres Salomon, and others. This patch incorporates all earlier feedback from Thomas Gleixner. To clarify a certain point (now more obvious in the code itself): On resume, OpenFirmware returns execution to Linux in protected mode with a kernel-compatible GDT already set up. The changes and simplifications suggested have all been included. Signed-off-by: Daniel Drake <dsd@laptop.org> Link: http://lkml.kernel.org/r/1309019658-1712-5-git-send-email-dsd@laptop.org Acked-by: Andres Salomon <dilinger@queued.net> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>	2011-07-06 14:44:32 -07:00
Daniel Drake	a3128588b3	x86, olpc: Rename olpc-xo1 to olpc-xo1-pm Based on earlier review comments, we'll no longer try to stick all of our XO-1 goodies in a single driver. We'll split it into a power management driver, and an EC/SCI driver. As a first step, rename olpc-xo1 to olpc-xo1-pm, and make it builtin instead of modular. Signed-off-by: Daniel Drake <dsd@laptop.org> Link: http://lkml.kernel.org/r/1309019658-1712-4-git-send-email-dsd@laptop.org Acked-by: Andres Salomon <dilinger@queued.net> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>	2011-07-06 14:44:28 -07:00
Daniel Drake	7a0d4fcf6d	x86, olpc: Move CS5536-related constants to cs5535.h Move these definitions into the relevant header file. This was requested in the review of the upcoming XO-1 suspend/resume code. Signed-off-by: Daniel Drake <dsd@laptop.org> Link: http://lkml.kernel.org/r/1309019658-1712-3-git-send-email-dsd@laptop.org Acked-by: Andres Salomon <dilinger@queued.net> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>	2011-07-06 14:44:23 -07:00
Daniel Drake	f70d8ef474	x86, olpc: Add missing elements to device tree In response to new device tree code in the kernel, OLPC will start using it for probing of certain devices. However, some firmware fixes are needed to put the devicetree into a usable state. Retain compatibility with old firmware by fixing up the device tree at boot-time if it does not contain the new nodes/properties that we need for probing. This is the same approach taken on PPC platforms. Signed-off-by: Daniel Drake <dsd@laptop.org> Link: http://lkml.kernel.org/r/1309019658-1712-2-git-send-email-dsd@laptop.org Acked-by: Grant Likely <grant.likely@secretlab.ca> Acked-by: Andres Salomon <dilinger@queued.net> Cc: devicetree-discuss@lists.ozlabs.org Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>	2011-07-06 14:44:19 -07:00
David S. Miller	e12fe68ce3	Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6	2011-07-05 23:23:37 -07:00
Peter Chubb	b49c78d482	x86, reboot: Acer Aspire One A110 reboot quirk Since git commit `660e34cebf` x86: reorder reboot method preferences, my Acer Aspire One hangs on reboot. It appears that its ACPI method for rebooting is broken. The attached patch adds a quirk so that the machine will reboot via the BIOS. [ hpa: verified that the ACPI control on this machine is just plain broken. ] Signed-off-by: Peter Chubb <peter.chubb@nicta.com.au> Link: http://lkml.kernel.org/r/w439iki5vl.wl%25peter@chubb.wattle.id.au Signed-off-by: H. Peter Anvin <hpa@zytor.com>	2011-07-05 19:43:23 -07:00
Jan Beulich	d80603c9d8	x86, efi: Properly pre-initialize table pointers Consumers of the table pointers in struct efi check for EFI_INVALID_TABLE_ADDR to determine validity, hence these pointers should all be pre-initialized to this value (rather than zero). Noticed by the discrepancy between efivars' systab sysfs entry showing all tables (and their pointers) despite the code intending to only display the valid ones. No other bad effects known, but having the various table parsing routines bogusly access physical address zero is certainly not very desirable (even though they're unlikely to find anything useful there). Signed-off-by: Jan Beulich <jbeulich@novell.com> Link: http://lkml.kernel.org/r/4E13100A020000780004C256@nat28.tlf.novell.com Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-07-05 13:40:34 +02:00
Ingo Molnar	931da6137e	Merge branch 'tip/perf/core-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-2.6-trace into perf/core	2011-07-05 11:55:43 +02:00
Ingo Molnar	9f8b6a6cf0	Merge branch 'core' of git://git.kernel.org/pub/scm/linux/kernel/git/rric/oprofile into perf/core	2011-07-04 12:27:28 +02:00
Ingo Molnar	729aa21ab8	Merge branch 'perf/stacktrace' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/random-tracing into perf/core	2011-07-03 20:39:40 +02:00
Frederic Weisbecker	a2bbe75089	x86: Don't use frame pointer to save old stack on irq entry rbp is used in SAVE_ARGS_IRQ to save the old stack pointer in order to restore it later in ret_from_intr. It is convenient because we save its value in the irq regs and it's easily restored using the leave instruction. However this is a kind of abuse of the frame pointer which role is to help unwinding the kernel by chaining frames together, each node following the return address to the previous frame. But although we are breaking the frame by changing the stack pointer, there is no preceding return address before the new frame. Hence using the frame pointer to link the two stacks breaks the stack unwinders that find a random value instead of a return address here. There is no workaround that can work in every case. We are using the fixup_bp_irq_link() function to dereference that abused frame pointer in the case of non nesting interrupt (which means stack changed). But that doesn't fix the case of interrupts that don't change the stack (but we still have the unconditional frame link), which is the case of hardirq interrupting softirq. We have no way to detect this transition so the frame irq link is considered as a real frame pointer and the return address is dereferenced but it is still a spurious one. There are two possible results of this: either the spurious return address, a random stack value, luckily belongs to the kernel text and then the unwinding can continue and we just have a weird entry in the stack trace. Or it doesn't belong to the kernel text and unwinding stops there. This is the reason why stacktraces (including perf callchains) on irqs that interrupted softirqs don't work very well. To solve this, we don't save the old stack pointer on rbp anymore but we save it to a scratch register that we push on the new stack and that we pop back later on irq return. 
This preserves the whole frame chain without spurious return addresses in the middle and drops the need for the horrid fixup_bp_irq_link() workaround. And finally irqs that interrupt softirq are sanely unwinded. Before: 99.81% perf [kernel.kallsyms] [k] perf_pending_event | --- perf_pending_event irq_work_run smp_irq_work_interrupt irq_work_interrupt | |--41.60%-- __read | | | |--99.90%-- create_worker | | bench_sched_messaging | | cmd_bench | | run_builtin | | main | | __libc_start_main | --0.10%-- [...] After: 1.64% swapper [kernel.kallsyms] [k] perf_pending_event | --- perf_pending_event irq_work_run smp_irq_work_interrupt irq_work_interrupt | |--95.00%-- arch_irq_work_raise | irq_work_queue | __perf_event_overflow | perf_swevent_overflow | perf_swevent_event | perf_tp_event | perf_trace_softirq | __do_softirq | call_softirq | do_softirq | irq_exit | | | |--73.68%-- smp_apic_timer_interrupt | | apic_timer_interrupt | | | | | |--96.43%-- amd_e400_idle | | | cpu_idle | | | start_secondary Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jan Beulich <JBeulich@novell.com>	2011-07-02 18:06:36 +02:00
Frederic Weisbecker	48ffee7d9e	x86: Remove useless unwinder backlink from irq regs saving The unwinder backlink in interrupt entry is very useless. It's actually not part of the stack frame chain and thus is never used. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jan Beulich <JBeulich@novell.com>	2011-07-02 18:06:21 +02:00
Frederic Weisbecker	3b99a3ef55	x86,64: Separate arg1 from rbp handling in SAVE_REGS_IRQ Just for clarity in the code. Have a first block that handles the frame pointer and a separate one that handles pt_regs pointer and its use. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jan Beulich <JBeulich@novell.com>	2011-07-02 18:05:46 +02:00
Frederic Weisbecker	1871853f7a	x86,64: Simplify save_regs() The save_regs function that saves the regs on low level irq entry is complicated because of the fact it changes its stack in the middle and also because it manipulates data allocated in the caller frame and accesses there are directly calculated from callee rsp value with the return address in the middle of the way. This complicates the static stack offsets calculation and require more dynamic ones. It also needs a save/restore of the function's return address. To simplify and optimize this, turn save_regs() into a macro. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jan Beulich <JBeulich@novell.com>	2011-07-02 18:05:31 +02:00
Frederic Weisbecker	47ce11a2b6	x86: Fetch stack from regs when possible in dump_trace() When regs are passed to dump_stack(), we fetch the frame pointer from the regs but the stack pointer is taken from the current frame. Thus the frame and stack pointers may not come from the same context. For example this can result in the unwinder to think the context is in irq, due to the current value of the stack, but the frame pointer coming from the regs points to a frame from another place. It then tries to fix up the irq link but ends up dereferencing a random frame pointer that doesn't belong to the irq stack: [ 9131.706906] ------------[ cut here ]------------ [ 9131.707003] WARNING: at arch/x86/kernel/dumpstack_64.c:129 dump_trace+0x2aa/0x330() [ 9131.707003] Hardware name: AMD690VM-FMH [ 9131.707003] Perf: bad frame pointer = 0000000000000005 in callchain [ 9131.707003] Modules linked in: [ 9131.707003] Pid: 1050, comm: perf Not tainted 3.0.0-rc3+ #181 [ 9131.707003] Call Trace: [ 9131.707003] <IRQ> [<ffffffff8104bd4a>] warn_slowpath_common+0x7a/0xb0 [ 9131.707003] [<ffffffff8104be21>] warn_slowpath_fmt+0x41/0x50 [ 9131.707003] [<ffffffff8178b873>] ? bad_to_user+0x6d/0x10be [ 9131.707003] [<ffffffff8100c2da>] dump_trace+0x2aa/0x330 [ 9131.707003] [<ffffffff810107d3>] ? native_sched_clock+0x13/0x50 [ 9131.707003] [<ffffffff8101b164>] perf_callchain_kernel+0x54/0x70 [ 9131.707003] [<ffffffff810d391f>] perf_prepare_sample+0x19f/0x2a0 [ 9131.707003] [<ffffffff810d546c>] __perf_event_overflow+0x16c/0x290 [ 9131.707003] [<ffffffff810d5430>] ? __perf_event_overflow+0x130/0x290 [ 9131.707003] [<ffffffff810107d3>] ? native_sched_clock+0x13/0x50 [ 9131.707003] [<ffffffff8100fbb9>] ? sched_clock+0x9/0x10 [ 9131.707003] [<ffffffff810752e5>] ? T.375+0x15/0x90 [ 9131.707003] [<ffffffff81084da4>] ? trace_hardirqs_on_caller+0x64/0x180 [ 9131.707003] [<ffffffff810817bd>] ? 
trace_hardirqs_off+0xd/0x10 [ 9131.707003] [<ffffffff810d5764>] perf_event_overflow+0x14/0x20 [ 9131.707003] [<ffffffff810d588c>] perf_swevent_hrtimer+0x11c/0x130 [ 9131.707003] [<ffffffff817821a1>] ? error_exit+0x51/0xb0 [ 9131.707003] [<ffffffff81072e93>] __run_hrtimer+0x83/0x1e0 [ 9131.707003] [<ffffffff810d5770>] ? perf_event_overflow+0x20/0x20 [ 9131.707003] [<ffffffff81073256>] hrtimer_interrupt+0x106/0x250 [ 9131.707003] [<ffffffff812a3bfd>] ? trace_hardirqs_off_thunk+0x3a/0x3c [ 9131.707003] [<ffffffff81024833>] smp_apic_timer_interrupt+0x53/0x90 [ 9131.707003] [<ffffffff81789053>] apic_timer_interrupt+0x13/0x20 [ 9131.707003] <EOI> [<ffffffff817821a1>] ? error_exit+0x51/0xb0 [ 9131.707003] [<ffffffff8178219c>] ? error_exit+0x4c/0xb0 [ 9131.707003] ---[ end trace b2560d4876709347 ]--- Fix this by simply taking the stack pointer from regs->sp when regs are provided. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Arnaldo Carvalho de Melo <acme@redhat.com>	2011-07-02 18:04:20 +02:00
Frederic Weisbecker	9e46294dad	x86: Save stack pointer in perf live regs savings In order to prepare for fetching the stack pointer from the regs when possible in dump_trace() instead of taking the local one, save the current stack pointer in perf live regs saving. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Arnaldo Carvalho de Melo <acme@redhat.com>	2011-07-02 18:04:03 +02:00
Sergei Shtylyov	50c31e4a24	x86, mtrr: Use pci_dev->revision This code uses PCI_CLASS_REVISION instead of PCI_REVISION_ID, so it wasn't converted by commit `44c10138fd` ("PCI: Change all drivers to use pci_device->revision") before being moved to arch/x86/... Do it now at last -- and save one level of indentation... Signed-off-by: Sergei Shtylyov <sshtylyov@ru.mvista.com> Cc: Suresh Siddha <suresh.b.siddha@intel.com> Link: http://lkml.kernel.org/r/201107012242.08347.sshtylyov@ru.mvista.com Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-07-02 11:10:07 +02:00
Linus Torvalds	c9e0b84545	Merge branch 'stable/bug.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen * 'stable/bug.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen: xen/pci: Use the INT_SRC_OVR IRQ (instead of GSI) to preset the ACPI SCI IRQ. xen/mmu: Fix for linker errors when CONFIG_SMP is not defined.	2011-07-01 13:25:56 -07:00
Tejun Heo	a26474e864	x86-32, NUMA: Fix boot regression caused by NUMA init unification on highmem machines During 32/64 NUMA init unification, commit `797390d855` ("x86-32, NUMA: use sparse_memory_present_with_active_regions()") made 32bit mm init call memory_present() automatically from active_regions instead of leaving it to each NUMA init path. This commit description is inaccurate - memory_present() calls aren't the same for flat and numaq. After the commit, memory_present() is only called for the intersection of e820 and NUMA layout. Before, on flatmem, memory_present() would be called from 0 to max_pfn. After, it would be called only on the areas that e820 indicates to be populated. This is how x86_64 works and should be okay as memmap is allowed to contain holes; however, x86_32 DISCONTIGMEM is missing early_pfn_valid(), which makes memmap_init_zone() assume that memmap doesn't contain any hole. This leads to the following oops if e820 map contains holes as it often does on machine with near or more 4GiB of memory by calling pfn_to_page() on a pfn which isn't mapped to a NUMA node, a reported by Conny Seidel: BUG: unable to handle kernel paging request at 000012b0 IP: [<c1aa13ce>] memmap_init_zone+0x6c/0xf2 pdpt = 0000000000000000 pde = f000eef3f000ee00 Oops: 0000 [#1] SMP last sysfs file: Modules linked in: Pid: 0, comm: swapper Not tainted 2.6.39-rc5-00164-g797390d #1 To Be Filled By O.E.M. 
To Be Filled By O.E.M./E350M1 EIP: 0060:[<c1aa13ce>] EFLAGS: 00010012 CPU: 0 EIP is at memmap_init_zone+0x6c/0xf2 EAX: 00000000 EBX: 000a8000 ECX: 000a7fff EDX: f2c00b80 ESI: 000a8000 EDI: f2c00800 EBP: c19ffe54 ESP: c19ffe34 DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 Process swapper (pid: 0, ti=c19fe000 task=c1a07f60 task.ti=c19fe000) Stack: 00000002 00000000 0023f000 00000000 10000000 00000a00 f2c00000 f2c00b58 c19ffeb0 c1a80f24 000375fe 00000000 f2c00800 00000800 00000100 00000030 c1abb768 0000003c 00000000 00000000 00000004 00207a02 f2c00800 000375fe Call Trace: [<c1a80f24>] free_area_init_node+0x358/0x385 [<c1a81384>] free_area_init_nodes+0x420/0x487 [<c1a79326>] paging_init+0x114/0x11b [<c1a6cb13>] setup_arch+0xb37/0xc0a [<c1a69554>] start_kernel+0x76/0x316 [<c1a690a8>] i386_start_kernel+0xa8/0xb0 This patch fixes the bug by defining early_pfn_valid() to be the same as pfn_valid() when DISCONTIGMEM. Reported-bisected-and-tested-by: Conny Seidel <conny.seidel@amd.com> Signed-off-by: Tejun Heo <tj@kernel.org> Cc: hans.rosenfeld@amd.com Cc: Christoph Lameter <cl@linux.com> Cc: Conny Seidel <conny.seidel@amd.com> Link: http://lkml.kernel.org/r/20110628094107.GB3386@htj.dyndns.org Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-07-01 13:38:51 +02:00
Avi Kivity	0af3ac1fdb	x86, perf: Add constraints for architectural PMU The v1 PMU does not have any fixed counters. Using the v2 constraints, which do have fixed counters, causes an additional choice to be present in the weight calculation, but not when actually scheduling the event, leading to an event being not scheduled at all. Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1309362157-6596-3-git-send-email-avi@redhat.com Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-07-01 11:06:39 +02:00
Avi Kivity	4dc0da8696	perf: Add context field to perf_event The perf_event overflow handler does not receive any caller-derived argument, so many callers need to resort to looking up the perf_event in their local data structure. This is ugly and doesn't scale if a single callback services many perf_events. Fix by adding a context parameter to perf_event_create_kernel_counter() (and derived hardware breakpoints APIs) and storing it in the perf_event. The field can be accessed from the callback as event->overflow_handler_context. All callers are updated. Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1309362157-6596-2-git-send-email-avi@redhat.com Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-07-01 11:06:38 +02:00
Peter Zijlstra	89d6c0b5bd	perf, arch: Add generic NODE cache events Add a NODE level to the generic cache events which is used to measure local vs remote memory accesses. Like all other cache events, an ACCESS is HIT+MISS, if there is no way to distinguish between reads and writes do reads only etc.. The below needs filling out for !x86 (which I filled out with unsupported events). I'm fairly sure ARM can leave it like that since it doesn't strike me as an architecture that even has NUMA support. SH might have something since it does appear to have some NUMA bits. Sparc64, PowerPC and MIPS certainly want a good look there since they clearly are NUMA capable. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: David Miller <davem@davemloft.net> Cc: Anton Blanchard <anton@samba.org> Cc: David Daney <ddaney@caviumnetworks.com> Cc: Deng-Cheng Zhu <dengcheng.zhu@gmail.com> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Will Deacon <will.deacon@arm.com> Cc: Robert Richter <robert.richter@amd.com> Cc: Stephane Eranian <eranian@google.com> Link: http://lkml.kernel.org/r/1303508226.4865.8.camel@laptop Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-07-01 11:06:38 +02:00
Peter Zijlstra	b79e8941fb	perf, intel: Try alternative OFFCORE encodings Since the OFFCORE registers are fully symmetric, try the other one when the specified one is already in use. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1306141897.18455.8.camel@twins Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-07-01 11:06:37 +02:00
Stephane Eranian	ee89cbc2d4	perf_events: Add Intel Sandy Bridge offcore_response low-level support This patch adds Intel Sandy Bridge offcore_response support by providing the low-level constraint table for those events. On Sandy Bridge, there are two offcore_response events. Each uses its own dedictated extra register. But those registers are NOT shared between sibling CPUs when HT is on unlike Nehalem/Westmere. They are always private to each CPU. But they still need to be controlled within an event group. All events within an event group must use the same value for the extra MSR. That's not controlled by the second patch in this series. Furthermore on Sandy Bridge, the offcore_response events have NO counter constraints contrary to what the official documentation indicates, so drop the events from the contraint table. Signed-off-by: Stephane Eranian <eranian@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/20110606145712.GA7304@quad Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-07-01 11:06:37 +02:00
Stephane Eranian	cd8a38d33e	perf_events: Fix validation of events using an extra reg The validate_group() function needs to validate events with extra shared regs. Within an event group, only events with the same value for the extra reg can co-exist. This was not checked by validate_group() because it was missing the shared_regs logic. This patch changes the allocation of the fake cpuc used for validation to also point to a fake shared_regs structure such that group events be properly testing. It modifies __intel_shared_reg_get_constraints() to use spin_lock_irqsave() to avoid lockdep issues. Signed-off-by: Stephane Eranian <eranian@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/20110606145708.GA7279@quad Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-07-01 11:06:36 +02:00
Stephane Eranian	efc9f05df2	perf_events: Update Intel extra regs shared constraints management This patch improves the code managing the extra shared registers used for offcore_response events on Intel Nehalem/Westmere. The idea is to use static allocation instead of dynamic allocation. This simplifies greatly the get and put constraint routines for those events. The patch also renames per_core to shared_regs because the same data structure gets used whether or not HT is on. When HT is off, those events still need to coordination because they use a extra MSR that has to be shared within an event group. Signed-off-by: Stephane Eranian <eranian@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/20110606145703.GA7258@quad Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-07-01 11:06:36 +02:00
Peter Zijlstra	a7ac67ea02	perf: Remove the perf_output_begin(.sample) argument Since only samples call perf_output_sample() its much saner (and more correct) to put the sample logic in there than in the perf_output_begin()/perf_output_end() pair. Saves a useless argument, reduces conditionals and shrinks struct perf_output_handle, win! Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/n/tip-2crpvsx3cqu67q3zqjbnlpsc@git.kernel.org Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-07-01 11:06:35 +02:00
Peter Zijlstra	a8b0ca17b8	perf: Remove the nmi parameter from the swevent and overflow interface The nmi parameter indicated if we could do wakeups from the current context, if not, we would set some state and self-IPI and let the resulting interrupt do the wakeup. For the various event classes: - hardware: nmi=0; PMI is in fact an NMI or we run irq_work_run from the PMI-tail (ARM etc.) - tracepoint: nmi=0; since tracepoint could be from NMI context. - software: nmi=[0,1]; some, like the schedule thing cannot perform wakeups, and hence need 0. As one can see, there is very little nmi=1 usage, and the down-side of not using it is that on some platforms some software events can have a jiffy delay in wakeup (when arch_irq_work_raise isn't implemented). The up-side however is that we can remove the nmi parameter and save a bunch of conditionals in fast paths. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Michael Cree <mcree@orcon.net.nz> Cc: Will Deacon <will.deacon@arm.com> Cc: Deng-Cheng Zhu <dengcheng.zhu@gmail.com> Cc: Anton Blanchard <anton@samba.org> Cc: Eric B Munson <emunson@mgebm.net> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Paul Mundt <lethal@linux-sh.org> Cc: David S. Miller <davem@davemloft.net> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Jason Wessel <jason.wessel@windriver.com> Cc: Don Zickus <dzickus@redhat.com> Link: http://lkml.kernel.org/n/tip-agjev8eu666tvknpb3iaj0fg@git.kernel.org Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-07-01 11:06:35 +02:00

13850 Commits