linux

mirror of https://github.com/torvalds/linux.git synced 2024-11-08 05:01:48 +00:00

Author	SHA1	Message	Date
Yinghai Lu	adbc742bf7	x86, kdump: Change crashkernel_high/low= to crashkernel=,high/low Per hpa, use crashkernel=X,high crashkernel=Y,low instead of crashkernel_hign=X crashkernel_low=Y. As that could be extensible. -v2: according to Vivek, change delimiter to ; -v3: let hign and low only handle simple form and it conforms to description in kernel-parameters.txt still keep crashkernel=X override any crashkernel=X,high crashkernel=Y,low -v4: update get_last_crashkernel returning and add more strict checking in parse_crashkernel_simple() found by HATAYAMA. -v5: Change delimiter back to , according to HPA. also separate parse_suffix from parse_simper according to vivek. so we can avoid @pos in that path. -v6: Tight the checking about crashkernel=X,highblahblah,high found by HTYAYAMA. Cc: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com> Signed-off-by: Yinghai Lu <yinghai@kernel.org> Link: http://lkml.kernel.org/r/1366089828-19692-5-git-send-email-yinghai@kernel.org Acked-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>	2013-04-17 12:35:33 -07:00
Yinghai Lu	55a20ee780	x86, kdump: Retore crashkernel= to allocate under 896M Vivek found old kexec-tools does not work new kernel anymore. So change back crashkernel= back to old behavoir, and add crashkernel_high= to let user decide if buffer could be above 4G, and also new kexec-tools will be needed. -v2: let crashkernel=X override crashkernel_high= update description about _high will be ignored by crashkernel=X -v3: update description about kernel-parameters.txt according to Vivek. Signed-off-by: Yinghai Lu <yinghai@kernel.org> Link: http://lkml.kernel.org/r/1366089828-19692-4-git-send-email-yinghai@kernel.org Acked-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>	2013-04-17 12:35:33 -07:00
Yinghai Lu	c729de8fce	x86, kdump: Set crashkernel_low automatically Chao said that kdump does does work well on his system on 3.8 without extra parameter, even iommu does not work with kdump. And now have to append crashkernel_low=Y in first kernel to make kdump work. We have now modified crashkernel=X to allocate memory beyong 4G (if available) and do not allocate low range for crashkernel if the user does not specify that with crashkernel_low=Y. This causes regression if iommu is not enabled. Without iommu, swiotlb needs to be setup in first 4G and there is no low memory available to second kernel. Set crashkernel_low automatically if the user does not specify that. For system that does support IOMMU with kdump properly, user could specify crashkernel_low=0 to save that 72M low ram. -v3: add swiotlb_size() according to Konrad. -v4: add comments what 8M is for according to hpa. also update more crashkernel_low= in kernel-parameters.txt -v5: update changelog according to Vivek. -v6: Change description about swiotlb referring according to HATAYAMA. Reported-by: WANG Chao <chaowang@redhat.com> Tested-by: WANG Chao <chaowang@redhat.com> Signed-off-by: Yinghai Lu <yinghai@kernel.org> Link: http://lkml.kernel.org/r/1366089828-19692-2-git-send-email-yinghai@kernel.org Acked-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>	2013-04-17 12:35:32 -07:00
Richard Weinberger	8c58bf3eec	x86,efi: Implement efi_no_storage_paranoia parameter Using this parameter one can disable the storage_size/2 check if he is really sure that the UEFI does sane gc and fulfills the spec. This parameter is useful if a devices uses more than 50% of the storage by default. The Intel DQSW67 desktop board is such a sucker for exmaple. Signed-off-by: Richard Weinberger <richard@nod.at> Signed-off-by: Matt Fleming <matt.fleming@intel.com>	2013-04-17 15:13:38 +01:00
Thomas Gleixner	d190e8195b	idle: Remove GENERIC_IDLE_LOOP config switch All archs are converted over. Remove the config switch and the fallback code. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2013-04-17 10:39:38 +02:00
H. Peter Anvin	c889ba801d	x86, relocs: Refactor the relocs tool to merge 32- and 64-bit ELF Refactor the relocs tool so that the same tool can handle 32- and 64-bit ELF. Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Cc: Kees Cook <keescook@chromium.org> Link: http://lkml.kernel.org/r/1365797627-20874-5-git-send-email-keescook@chromium.org	2013-04-16 16:02:58 -07:00
Kees Cook	17c961f770	x86, relocs: Build separate 32/64-bit tools Since the ELF structures and access macros change size based on 32 vs 64 bits, build a separate 32-bit relocs tool (for handling realmode and 32-bit relocations), and a 64-bit relocs tool (for handling 64-bit kernel relocations). Signed-off-by: Kees Cook <keescook@chromium.org> Link: http://lkml.kernel.org/r/1365797627-20874-5-git-send-email-keescook@chromium.org Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>	2013-04-16 15:22:01 -07:00
Kees Cook	946166af95	x86, relocs: Add 64-bit ELF support to relocs tool This adds the ability to process relocations from the 64-bit kernel ELF, if built with ELF_BITS=64 defined. The special case for the percpu area is handled, along with some other symbols specific to the 64-bit kernel. Based on work by Neill Clift and Michael Davidson. Signed-off-by: Kees Cook <keescook@chromium.org> Link: http://lkml.kernel.org/r/1365797627-20874-4-git-send-email-keescook@chromium.org Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>	2013-04-16 15:19:22 -07:00
Kees Cook	5d442e63d6	x86, relocs: Consolidate processing logic Instead of counting and then processing relocations, do it in a single pass. This splits the processing logic into separate functions for realmode and 32-bit (and paves the way for 64-bit). Also extracts helper functions when emitting relocations. Based on work by Neill Clift and Michael Davidson. Signed-off-by: Kees Cook <keescook@chromium.org> Link: http://lkml.kernel.org/r/1365797627-20874-3-git-send-email-keescook@chromium.org Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>	2013-04-16 15:19:13 -07:00
Kees Cook	bf11655cf2	x86, relocs: Generalize ELF structure names In preparation for making the reloc tool operate on 64-bit relocations, generalize the structure names for easy recompilation via #defines. Based on work by Neill Clift and Michael Davidson. Signed-off-by: Kees Cook <keescook@chromium.org> Link: http://lkml.kernel.org/r/1365797627-20874-2-git-send-email-keescook@chromium.org Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>	2013-04-16 15:19:06 -07:00
Gleb Natapov	f13882d84d	KVM: VMX: Fix check guest state validity if a guest is in VM86 mode If guest vcpu is in VM86 mode the vcpu state should be checked as if in real mode. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2013-04-16 18:34:19 -03:00
Paolo Bonzini	26539bd0e4	KVM: nVMX: check vmcs12 for valid activity state KVM does not use the activity state VMCS field, and does not support it in nested VMX either (the corresponding bits in the misc VMX feature MSR are zero). Fail entry if the activity state is set to anything but "active". Since the value will always be the same for L1 and L2, we do not need to read and write the corresponding VMCS field on L1/L2 transitions, either. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: Gleb Natapov <gleb@redhat.com> Reviewed-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2013-04-16 18:22:14 -03:00
Konrad Rzeszutek Wilk	b12abaa192	xen/smp: Unifiy some of the PVs and PVHVM offline CPU path The "xen_cpu_die" and "xen_hvm_cpu_die" are very similar. Lets coalesce them. Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>	2013-04-16 16:05:17 -04:00
Konrad Rzeszutek Wilk	27d8b207f0	xen/smp/pvhvm: Don't initialize IRQ_WORKER as we are using the native one. There is no need to use the PV version of the IRQ_WORKER mechanism as under PVHVM we are using the native version. The native version is using the SMP API. They just sit around unused: 69: 0 0 xen-percpu-ipi irqwork0 83: 0 0 xen-percpu-ipi irqwork1 Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>	2013-04-16 16:05:17 -04:00
Konrad Rzeszutek Wilk	70dd4998cb	xen/spinlock: Disable IRQ spinlock (PV) allocation on PVHVM See git commit `f10cd522c5` (xen: disable PV spinlocks on HVM) for details. But we did not disable it everywhere - which means that when we boot as PVHVM we end up allocating per-CPU irq line for spinlock. This fixes that. Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>	2013-04-16 16:05:16 -04:00
Konrad Rzeszutek Wilk	cb9c6f15f3	xen/spinlock: Check against default value of -1 for IRQ line. The default (uninitialized) value of the IRQ line is -1. Check if we already have allocated an spinlock interrupt line and if somebody is trying to do it again. Also set it to -1 when we offline the CPU. Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>	2013-04-16 16:05:15 -04:00
Konrad Rzeszutek Wilk	ef35a4e6d9	xen/time: Add default value of -1 for IRQ and check for that. If the timer interrupt has been de-init or is just now being initialized, the default value of -1 should be preset as interrupt line. Check for that and if something is odd WARN us. Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>	2013-04-16 16:05:14 -04:00
Konrad Rzeszutek Wilk	7918c92ae9	xen/time: Fix kasprintf splat when allocating timer%d IRQ line. When we online the CPU, we get this splat: smpboot: Booting Node 0 Processor 1 APIC 0x2 installing Xen timer for CPU 1 BUG: sleeping function called from invalid context at /home/konrad/ssd/konrad/linux/mm/slab.c:3179 in_atomic(): 1, irqs_disabled(): 0, pid: 0, name: swapper/1 Pid: 0, comm: swapper/1 Not tainted 3.9.0-rc6upstream-00001-g3884fad #1 Call Trace: [<ffffffff810c1fea>] __might_sleep+0xda/0x100 [<ffffffff81194617>] __kmalloc_track_caller+0x1e7/0x2c0 [<ffffffff81303758>] ? kasprintf+0x38/0x40 [<ffffffff813036eb>] kvasprintf+0x5b/0x90 [<ffffffff81303758>] kasprintf+0x38/0x40 [<ffffffff81044510>] xen_setup_timer+0x30/0xb0 [<ffffffff810445af>] xen_hvm_setup_cpu_clockevents+0x1f/0x30 [<ffffffff81666d0a>] start_secondary+0x19c/0x1a8 The solution to that is use kasprintf in the CPU hotplug path that 'online's the CPU. That is, do it in in xen_hvm_cpu_notify, and remove the call to in xen_hvm_setup_cpu_clockevents. Unfortunatly the later is not a good idea as the bootup path does not use xen_hvm_cpu_notify so we would end up never allocating timer%d interrupt lines when booting. As such add the check for atomic() to continue. CC: stable@vger.kernel.org Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>	2013-04-16 16:05:07 -04:00
Yang Zhang	5a71785dde	KVM: VMX: Use posted interrupt to deliver virtual interrupt If posted interrupt is avaliable, then uses it to inject virtual interrupt to guest. Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com> Reviewed-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2013-04-16 16:32:41 -03:00
Yang Zhang	a20ed54d6e	KVM: VMX: Add the deliver posted interrupt algorithm Only deliver the posted interrupt when target vcpu is running and there is no previous interrupt pending in pir. Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com> Reviewed-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2013-04-16 16:32:40 -03:00
Yang Zhang	cf9e65b773	KVM: Set TMR when programming ioapic entry We already know the trigger mode of a given interrupt when programming the ioapice entry. So it's not necessary to set it in each interrupt delivery. Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com> Reviewed-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2013-04-16 16:32:40 -03:00
Yang Zhang	3d81bc7e96	KVM: Call common update function when ioapic entry changed. Both TMR and EOI exit bitmap need to be updated when ioapic changed or vcpu's id/ldr/dfr changed. So use common function instead eoi exit bitmap specific function. Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com> Reviewed-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2013-04-16 16:32:40 -03:00
Yang Zhang	01e439be77	KVM: VMX: Check the posted interrupt capability Detect the posted interrupt feature. If it exists, then set it in vmcs_config. Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com> Reviewed-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2013-04-16 16:32:40 -03:00
Yang Zhang	d78f266483	KVM: VMX: Register a new IPI for posted interrupt Posted Interrupt feature requires a special IPI to deliver posted interrupt to guest. And it should has a high priority so the interrupt will not be blocked by others. Normally, the posted interrupt will be consumed by vcpu if target vcpu is running and transparent to OS. But in some cases, the interrupt will arrive when target vcpu is scheduled out. And host will see it. So we need to register a dump handler to handle it. Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com> Acked-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2013-04-16 16:32:39 -03:00
Yang Zhang	a547c6db4d	KVM: VMX: Enable acknowledge interupt on vmexit The "acknowledge interrupt on exit" feature controls processor behavior for external interrupt acknowledgement. When this control is set, the processor acknowledges the interrupt controller to acquire the interrupt vector on VM exit. After enabling this feature, an interrupt which arrived when target cpu is running in vmx non-root mode will be handled by vmx handler instead of handler in idt. Currently, vmx handler only fakes an interrupt stack and jump to idt table to let real handler to handle it. Further, we will recognize the interrupt and only delivery the interrupt which not belong to current vcpu through idt table. The interrupt which belonged to current vcpu will be handled inside vmx handler. This will reduce the interrupt handle cost of KVM. Also, interrupt enable logic is changed if this feature is turnning on: Before this patch, hypervior call local_irq_enable() to enable it directly. Now IF bit is set on interrupt stack frame, and will be enabled on a return from interrupt handler if exterrupt interrupt exists. If no external interrupt, still call local_irq_enable() to enable it. Refer to Intel SDM volum 3, chapter 33.2. Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com> Reviewed-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2013-04-16 16:32:39 -03:00
Konrad Rzeszutek Wilk	66ff0fe9e7	xen/smp/spinlock: Fix leakage of the spinlock interrupt line for every CPU online/offline While we don't use the spinlock interrupt line (see for details commit `f10cd522c5` - xen: disable PV spinlocks on HVM) - we should still do the proper init / deinit sequence. We did not do that correctly and for the CPU init for PVHVM guest we would allocate an interrupt line - but failed to deallocate the old interrupt line. This resulted in leakage of an irq_desc but more importantly this splat as we online an offlined CPU: genirq: Flags mismatch irq 71. 0002cc20 (spinlock1) vs. 0002cc20 (spinlock1) Pid: 2542, comm: init.late Not tainted 3.9.0-rc6upstream #1 Call Trace: [<ffffffff811156de>] __setup_irq+0x23e/0x4a0 [<ffffffff81194191>] ? kmem_cache_alloc_trace+0x221/0x250 [<ffffffff811161bb>] request_threaded_irq+0xfb/0x160 [<ffffffff8104c6f0>] ? xen_spin_trylock+0x20/0x20 [<ffffffff813a8423>] bind_ipi_to_irqhandler+0xa3/0x160 [<ffffffff81303758>] ? kasprintf+0x38/0x40 [<ffffffff8104c6f0>] ? xen_spin_trylock+0x20/0x20 [<ffffffff810cad35>] ? update_max_interval+0x15/0x40 [<ffffffff816605db>] xen_init_lock_cpu+0x3c/0x78 [<ffffffff81660029>] xen_hvm_cpu_notify+0x29/0x33 [<ffffffff81676bdd>] notifier_call_chain+0x4d/0x70 [<ffffffff810bb2a9>] __raw_notifier_call_chain+0x9/0x10 [<ffffffff8109402b>] __cpu_notify+0x1b/0x30 [<ffffffff8166834a>] _cpu_up+0xa0/0x14b [<ffffffff816684ce>] cpu_up+0xd9/0xec [<ffffffff8165f754>] store_online+0x94/0xd0 [<ffffffff8141d15b>] dev_attr_store+0x1b/0x20 [<ffffffff81218f44>] sysfs_write_file+0xf4/0x170 [<ffffffff811a2864>] vfs_write+0xb4/0x130 [<ffffffff811a302a>] sys_write+0x5a/0xa0 [<ffffffff8167ada9>] system_call_fastpath+0x16/0x1b cpu 1 spinlock event irq -16 smpboot: Booting Node 0 Processor 1 APIC 0x2 And if one looks at the /proc/interrupts right after offlining (CPU1): 70: 0 0 xen-percpu-ipi spinlock0 71: 0 0 xen-percpu-ipi spinlock1 77: 0 0 xen-percpu-ipi spinlock2 There is the oddity of the 'spinlock1' still being present. CC: stable@vger.kernel.org Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>	2013-04-16 15:11:56 -04:00
Konrad Rzeszutek Wilk	888b65b4bc	xen/smp: Fix leakage of timer interrupt line for every CPU online/offline. In the PVHVM path when we do CPU online/offline path we would leak the timer%d IRQ line everytime we do a offline event. The online path (xen_hvm_setup_cpu_clockevents via x86_cpuinit.setup_percpu_clockev) would allocate a new interrupt line for the timer%d. But we would still use the old interrupt line leading to: kernel BUG at /home/konrad/ssd/konrad/linux/kernel/hrtimer.c:1261! invalid opcode: 0000 [#1] SMP RIP: 0010:[<ffffffff810b9e21>] [<ffffffff810b9e21>] hrtimer_interrupt+0x261/0x270 .. snip.. <IRQ> [<ffffffff810445ef>] xen_timer_interrupt+0x2f/0x1b0 [<ffffffff81104825>] ? stop_machine_cpu_stop+0xb5/0xf0 [<ffffffff8111434c>] handle_irq_event_percpu+0x7c/0x240 [<ffffffff811175b9>] handle_percpu_irq+0x49/0x70 [<ffffffff813a74a3>] __xen_evtchn_do_upcall+0x1c3/0x2f0 [<ffffffff813a760a>] xen_evtchn_do_upcall+0x2a/0x40 [<ffffffff8167c26d>] xen_hvm_callback_vector+0x6d/0x80 <EOI> [<ffffffff81666d01>] ? start_secondary+0x193/0x1a8 [<ffffffff81666cfd>] ? start_secondary+0x18f/0x1a8 There is also the oddity (timer1) in the /proc/interrupts after offlining CPU1: 64: 1121 0 xen-percpu-virq timer0 78: 0 0 xen-percpu-virq timer1 84: 0 2483 xen-percpu-virq timer2 This patch fixes it. Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> CC: stable@vger.kernel.org	2013-04-16 15:11:54 -04:00
Jan Beulich	dec02dea1c	xen: drop tracking of IRQ vector For quite a few Xen versions, this wasn't the IRQ vector anymore anyway, and it is not being used by the kernel for anything. Hence drop the field from struct irq_info, and respective function parameters. Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>	2013-04-16 15:05:45 -04:00
David Vrabel	96f28bc66a	x86/xen: populate boot_params with EDD data During early setup of a dom0 kernel, populate boot_params with the Enhanced Disk Drive (EDD) and MBR signature data. This makes information on the BIOS boot device available in /sys/firmware/edd/. Signed-off-by: David Vrabel <david.vrabel@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>	2013-04-16 15:04:37 -04:00
Sergey Vlasov	3668011d4a	efi: Export efi_query_variable_store() for efivars.ko Fixes build with CONFIG_EFI_VARS=m which was broken after the commit "x86, efivars: firmware bug workarounds should be in platform code". Signed-off-by: Sergey Vlasov <vsu@altlinux.ru> Signed-off-by: Matt Fleming <matt.fleming@intel.com>	2013-04-16 17:34:07 +01:00
Sergey Vlasov	f6ce500262	x86/Kconfig: Make EFI select UCS2_STRING The commit "efi: Distinguish between "remaining space" and actually used space" added usage of ucs2_*() functions to arch/x86/platform/efi/efi.c, but the only thing which selected UCS2_STRING was EFI_VARS, which is technically optional and can be built as a module. Signed-off-by: Sergey Vlasov <vsu@altlinux.ru> Signed-off-by: Matt Fleming <matt.fleming@intel.com>	2013-04-16 17:31:08 +01:00
Stephane Eranian	f1923820c4	perf/x86: Fix offcore_rsp valid mask for SNB/IVB The valid mask for both offcore_response_0 and offcore_response_1 was wrong for SNB/SNB-EP, IVB/IVB-EP. It was possible to write to reserved bit and cause a GP fault crashing the kernel. This patch fixes the problem by correctly marking the reserved bits in the valid mask for all the processors mentioned above. A distinction between desktop and server parts is introduced because bits 24-30 are only available on the server parts. This version of the patch is just a rebase to perf/urgent tree and should apply to older kernels as well. Signed-off-by: Stephane Eranian <eranian@google.com> Cc: peterz@infradead.org Cc: jolsa@redhat.com Cc: gregkh@linuxfoundation.org Cc: security@kernel.org Cc: ak@linux.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>	2013-04-16 15:02:06 +02:00
Borislav Petkov	1077c932db	x86, CPU, AMD: Drop useless label All we want to do is return from this function so stop jumping around like a flea for no good reason. Signed-off-by: Borislav Petkov <bp@suse.de> Link: http://lkml.kernel.org/r/1365436666-9837-5-git-send-email-bp@alien8.de Signed-off-by: H. Peter Anvin <hpa@zytor.com>	2013-04-16 11:50:51 +02:00
Borislav Petkov	682469a5db	x86, AMD: Correct {rd,wr}msr_amd_safe warnings The idea with those routines is to slowly phase them out and not call them on anything else besides K8. They even have a check for that which, when called too early, fails. Let me explain: It gets the cpuinfo_x86 pointer from the per_cpu array and when this happens for cpu0, before its boot_cpu_data has been copied back to the per_cpu array in smp_store_boot_cpu_info(), we get an empty struct and thus the check fails. Use boot_cpu_data directly instead. Signed-off-by: Borislav Petkov <bp@suse.de> Link: http://lkml.kernel.org/r/1365436666-9837-4-git-send-email-bp@alien8.de Signed-off-by: H. Peter Anvin <hpa@zytor.com>	2013-04-16 11:50:51 +02:00
Borislav Petkov	55a36b65ee	x86: Fold-in trivial check_config function Fold it into its single call site. No functionality change. Signed-off-by: Borislav Petkov <bp@suse.de> Link: http://lkml.kernel.org/r/1365436666-9837-3-git-send-email-bp@alien8.de Signed-off-by: H. Peter Anvin <hpa@zytor.com>	2013-04-16 11:50:50 +02:00
Alexandre Courbot	7fd2bf3d32	Remove GENERIC_GPIO config option GENERIC_GPIO has been made equivalent to GPIOLIB in architecture code and all driver code has been switch to depend on GPIOLIB. It is thus safe to have GENERIC_GPIO removed. Signed-off-by: Alexandre Courbot <acourbot@nvidia.com> Acked-by: Linus Walleij <linus.walleij@linaro.org> Acked-by: Grant Likely <grant.likely@secretlab.ca>	2013-04-16 18:47:19 +09:00
Ingo Molnar	b5210b2a34	Merge branch 'uprobes/core' of git://git.kernel.org/pub/scm/linux/kernel/git/oleg/misc into perf/core Pull uprobes updates from Oleg Nesterov: - "uretprobes" - an optimization to uprobes, like kretprobes are an optimization to kprobes. "perf probe -x file sym%return" now works like kretprobes. - PowerPC fixes plus a couple of cleanups/optimizations in uprobes and trace_uprobes. Signed-off-by: Ingo Molnar <mingo@kernel.org>	2013-04-16 11:04:10 +02:00
Wang YanQing	26bfc540f6	x86/mm/gart: Drop unnecessary check The memblock_find_in_range() return value addr is guaranteed to be within "addr + aper_size" and not beyond GART_MAX_ADDR. Signed-off-by: Wang YanQing <udknight@gmail.com> Cc: yinghai@kernel.org Link: http://lkml.kernel.org/r/20130416013734.GA14641@udknight Signed-off-by: Ingo Molnar <mingo@kernel.org>	2013-04-16 10:54:40 +02:00
Yang Zhang	aa2fbe6d44	KVM: Let ioapic know the irq line status Userspace may deliver RTC interrupt without query the status. So we want to track RTC EOI for this case. Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com> Reviewed-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2013-04-15 23:20:34 -03:00
Yang Zhang	106069193c	KVM: Add reset/restore rtc_status support restore rtc_status from migration or save/restore Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com> Reviewed-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2013-04-15 23:20:34 -03:00
Yang Zhang	b4f2225c07	KVM: Return destination vcpu on interrupt injection Add a new parameter to know vcpus who received the interrupt. Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com> Reviewed-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2013-04-15 23:20:34 -03:00
Yang Zhang	1fcc7890db	KVM: Add vcpu info to ioapic_update_eoi() Add vcpu info to ioapic_update_eoi, so we can know which vcpu issued this EOI. Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com> Reviewed-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2013-04-15 23:20:33 -03:00
Matthew Garrett	31ff2f20d9	efi: Distinguish between "remaining space" and actually used space EFI implementations distinguish between space that is actively used by a variable and space that merely hasn't been garbage collected yet. Space that hasn't yet been garbage collected isn't available for use and so isn't counted in the remaining_space field returned by QueryVariableInfo(). Combined with commit `68d9298` this can cause problems. Some implementations don't garbage collect until the remaining space is smaller than the maximum variable size, and as a result check_var_size() will always fail once more than 50% of the variable store has been used even if most of that space is marked as available for garbage collection. The user is unable to create new variables, and deleting variables doesn't increase the remaining space. The problem that `68d9298` was attempting to avoid was one where certain platforms fail if the actively used space is greater than 50% of the available storage space. We should be able to calculate that by simply summing the size of each available variable and subtracting that from the total storage space. With luck this will fix the problem described in https://bugzilla.kernel.org/show_bug.cgi?id=55471 without permitting damage to occur to the machines `68d9298` was attempting to fix. Signed-off-by: Matthew Garrett <matthew.garrett@nebula.com> Signed-off-by: Matt Fleming <matt.fleming@intel.com>	2013-04-15 21:33:05 +01:00
Matthew Garrett	cc5a080c5d	efi: Pass boot services variable info to runtime code EFI variables can be flagged as being accessible only within boot services. This makes it awkward for us to figure out how much space they use at runtime. In theory we could figure this out by simply comparing the results from QueryVariableInfo() to the space used by all of our variables, but that fails if the platform doesn't garbage collect on every boot. Thankfully, calling QueryVariableInfo() while still inside boot services gives a more reliable answer. This patch passes that information from the EFI boot stub up to the efi platform code. Signed-off-by: Matthew Garrett <matthew.garrett@nebula.com> Signed-off-by: Matt Fleming <matt.fleming@intel.com>	2013-04-15 21:31:09 +01:00
Tang Chen	587ff8c4ea	x86/mm/hotplug: Put kernel_physical_mapping_remove() declaration in CONFIG_MEMORY_HOTREMOVE kernel_physical_mapping_remove() is only called by arch_remove_memory() in init_64.c, which is enclosed in CONFIG_MEMORY_HOTREMOVE. So when we don't configure CONFIG_MEMORY_HOTREMOVE, the compiler will give a warning: warning: ‘kernel_physical_mapping_remove’ defined but not used So put kernel_physical_mapping_remove() in CONFIG_MEMORY_HOTREMOVE. Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com> Cc: linux-mm@kvack.org Cc: gregkh@linuxfoundation.org Cc: yinghai@kernel.org Cc: wency@cn.fujitsu.com Cc: mgorman@suse.de Cc: tj@kernel.org Cc: liwanp@linux.vnet.ibm.com Link: http://lkml.kernel.org/r/1366019207-27818-3-git-send-email-tangchen@cn.fujitsu.com Signed-off-by: Ingo Molnar <mingo@kernel.org>	2013-04-15 12:03:24 +02:00
Andy Shevchenko	d50ba3687b	x86/lib: Fix spelling, put space between a numeral and its units As suggested by Peter Anvin. Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: H . Peter Anvin <hpa@linux.intel.com> Signed-off-by: Ingo Molnar <mingo@kernel.org>	2013-04-15 11:40:32 +02:00
Andy Shevchenko	bb916ff7cd	x86/lib: Fix spelling in the comments Apparently 'byts' should be 'bytes'. Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: H . Peter Anvin <hpa@linux.intel.com> Signed-off-by: Ingo Molnar <mingo@kernel.org>	2013-04-15 11:40:31 +02:00
Linus Torvalds	6c4c4d4bda	Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Ingo Molnar: "Misc fixes" * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/mm: Flush lazy MMU when DEBUG_PAGEALLOC is set x86/mm/cpa/selftest: Fix false positive in CPA self test x86/mm/cpa: Convert noop to functional fix x86, mm: Patch out arch_flush_lazy_mmu_mode() when running on bare metal x86, mm, paravirt: Fix vmalloc_fault oops during lazy MMU updates	2013-04-14 11:13:24 -07:00
Linus Torvalds	ae9f4939ba	Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf fixes from Ingo Molnar: "Misc fixlets" * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf: Fix error return code ftrace: Fix strncpy() use, use strlcpy() instead of strncpy() perf: Fix strncpy() use, use strlcpy() instead of strncpy() perf: Fix strncpy() use, always make sure it's NUL terminated perf: Fix ring_buffer perf_output_space() boundary calculation perf/x86: Fix uninitialized pt_regs in intel_pmu_drain_bts_buffer()	2013-04-14 11:10:44 -07:00
Jan Kiszka	c0d1c770c0	KVM: nVMX: Avoid reading VM_EXIT_INTR_ERROR_CODE needlessly on nested exits We only need to update vm_exit_intr_error_code if there is a valid exit interruption information and it comes with a valid error code. Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Gleb Natapov <gleb@redhat.com>	2013-04-14 18:27:10 +03:00
Jan Kiszka	e8457c67a4	KVM: nVMX: Fix conditions for interrupt injection If we are entering guest mode, we do not want L0 to interrupt this vmentry with all its side effects on the vmcs. Therefore, injection shall be disallowed during L1->L2 transitions, as in the previous version. However, this check is conceptually independent of nested_exit_on_intr, so decouple it. If L1 traps external interrupts, we can kick the guest from L2 to L1, also just like the previous code worked. But we no longer need to consider L1's idt_vectoring_info_field. It will always be empty at this point. Instead, if L2 has pending events, those are now found in the architectural queues and will, thus, prevent vmx_interrupt_allowed from being called at all. Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Gleb Natapov <gleb@redhat.com>	2013-04-14 18:27:09 +03:00
Jan Kiszka	5f3d579997	KVM: nVMX: Rework event injection and recovery The basic idea is to always transfer the pending event injection on vmexit into the architectural state of the VCPU and then drop it from there if it turns out that we left L2 to enter L1, i.e. if we enter prepare_vmcs12. vmcs12_save_pending_events takes care to transfer pending L0 events into the queue of L1. That is mandatory as L1 may decide to switch the guest state completely, invalidating or preserving the pending events for later injection (including on a different node, once we support migration). This concept is based on the rule that a pending vmlaunch/vmresume is not canceled. Otherwise, we would risk to lose injected events or leak them into the wrong queues. Encode this rule via a WARN_ON_ONCE at the entry of nested_vmx_vmexit. Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Gleb Natapov <gleb@redhat.com>	2013-04-14 18:27:07 +03:00
Jan Kiszka	3b656cf764	KVM: nVMX: Fix injection of PENDING_INTERRUPT and NMI_WINDOW exits to L1 Check if the interrupt or NMI window exit is for L1 by testing if it has the corresponding controls enabled. This is required when we allow direct injection from L0 to L2 Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Reviewed-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Gleb Natapov <gleb@redhat.com>	2013-04-14 18:27:05 +03:00
Gleb Natapov	188424ba10	KVM: emulator: mark 0xff 0x7d opcode as undefined. Signed-off-by: Gleb Natapov <gleb@redhat.com>	2013-04-14 09:44:17 +03:00
Gleb Natapov	1146a78b8d	KVM: emulator: Do not fail on emulation of undefined opcode Emulation of undefined opcode should inject #UD instead of causing emulation failure. Do that by moving Undefined flag check to emulation stage and injection #UD there. Signed-off-by: Gleb Natapov <gleb@redhat.com>	2013-04-14 09:44:17 +03:00
Gleb Natapov	991eebf9f8	KVM: VMX: do not try to reexecute failed instruction while emulating invalid guest state During invalid guest state emulation vcpu cannot enter guest mode to try to reexecute instruction that emulator failed to emulate, so emulation will happen again and again. Prevent that by telling the emulator that instruction reexecution should not be attempted. Signed-off-by: Gleb Natapov <gleb@redhat.com>	2013-04-14 09:44:17 +03:00
Gleb Natapov	0b789eee2c	KVM: emulator: fix unimplemented instruction detection Unimplemented instruction detection is broken for group instructions since it relies on "flags" field of opcode to be zero, but all instructions in a group inherit flags from a group encoding. Fix that by having a separate flag for unimplemented instructions. Signed-off-by: Gleb Natapov <gleb@redhat.com>	2013-04-14 09:44:16 +03:00
Anton Arapov	791eca1010	uretprobes/x86: Hijack return address Hijack the return address and replace it with a trampoline address. Signed-off-by: Anton Arapov <anton@redhat.com> Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Oleg Nesterov <oleg@redhat.com>	2013-04-13 15:31:55 +02:00
Dave Hansen	1de14c3c5c	x86-32: Fix possible incomplete TLB invalidate with PAE pagetables This patch attempts to fix: https://bugzilla.kernel.org/show_bug.cgi?id=56461 The symptom is a crash and messages like this: chrome: Corrupted page table at address 34a03000 pdpt = 0000000000000000 pde = 0000000000000000 Bad pagetable: 000f [#1] PREEMPT SMP Ingo guesses this got introduced by commit `611ae8e3f5` ("x86/tlb: enable tlb flush range support for x86") since that code started to free unused pagetables. On x86-32 PAE kernels, that new code has the potential to free an entire PMD page and will clear one of the four page-directory-pointer-table (aka pgd_t entries). The hardware aggressively "caches" these top-level entries and invlpg does not actually affect the CPU's copy. If we clear one we HAVE to do a full TLB flush, otherwise we might continue using a freed pmd page. (note, we do this properly on the population side in pud_populate()). This patch tracks whenever we clear one of these entries in the 'struct mmu_gather', and ensures that we follow up with a full tlb flush. BTW, I disassembled and checked that: if (tlb->fullmm == 0) and if (!tlb->fullmm && !tlb->need_flush_all) generate essentially the same code, so there should be zero impact there to the !PAE case. Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: Peter Anvin <hpa@zytor.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Artem S Tashkinov <t.artem@mailcity.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-04-12 16:56:47 -07:00
Jiang Liu	89016506b6	x86/PCI: Implement pcibios_{add\|remove}_bus() hooks Implement pcibios_{add\|remove}_bus() hooks for x86 platforms. Signed-off-by: Jiang Liu <jiang.liu@huawei.com> Signed-off-by: Yijing Wang <wangyijing@huawei.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Yinghai Lu <yinghai@kernel.org> Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com> Cc: Toshi Kani <toshi.kani@hp.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Myron Stowe <myron.stowe@redhat.com>	2013-04-12 15:38:25 -06:00
Paul Bolle	a7e6567585	x86/mm/fixmap: Remove unused FIX_CYCLONE_TIMER The last users of FIX_CYCLONE_TIMER were removed in v2.6.18. We can remove this unneeded constant. Signed-off-by: Paul Bolle <pebolle@tiscali.nl> Link: http://lkml.kernel.org/r/1365698982.1427.3.camel@x61.thuisdomein Signed-off-by: Ingo Molnar <mingo@kernel.org>	2013-04-12 07:21:18 +02:00
Boris Ostrovsky	26564600c9	x86/mm: Flush lazy MMU when DEBUG_PAGEALLOC is set When CONFIG_DEBUG_PAGEALLOC is set page table updates made by kernel_map_pages() are not made visible (via TLB flush) immediately if lazy MMU is on. In environments that support lazy MMU (e.g. Xen) this may lead to fatal page faults, for example, when zap_pte_range() needs to allocate pages in __tlb_remove_page() -> tlb_next_batch(). Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: konrad.wilk@oracle.com Link: http://lkml.kernel.org/r/1365703192-2089-1-git-send-email-boris.ostrovsky@oracle.com Signed-off-by: Ingo Molnar <mingo@kernel.org>	2013-04-12 07:19:19 +02:00
Andrea Arcangeli	18699739b6	x86/mm/cpa/selftest: Fix false positive in CPA self test If the pmd is not present, _PAGE_PSE will not be set anymore. Fix the false positive. Reported-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: Stefan Bader <stefan.bader@canonical.com> Cc: Andy Whitcroft <apw@canonical.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Borislav Petkov <bp@alien8.de> Link: http://lkml.kernel.org/r/1365687369-30802-1-git-send-email-aarcange@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>	2013-04-12 06:39:20 +02:00
konrad@kernel.org	4d681be3c3	x86, wakeup, sleep: Use pvops functions for changing GDT entries We check the TSS descriptor before we try to dereference it. Also we document what the value '9' actually means using the AMD64 Architecture Programmer's Manual Volume 2, pg 90: "Hex value 9: Available 64-bit TSS" and pg 91: "The available 32-bit TSS (09h), which is redefined as the available 64-bit TSS." Without this, on Xen, where the GDT is available as R/O (to protect the hypervisor from the guest modifying it), we end up with a pagetable fault. Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Link: http://lkml.kernel.org/r/1365194544-14648-5-git-send-email-konrad.wilk@oracle.com Cc: Rafael J. Wysocki <rjw@sisk.pl> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>	2013-04-11 15:41:15 -07:00
Konrad Rzeszutek Wilk	357d122670	x86, xen, gdt: Remove the pvops variant of store_gdt. The two use-cases where we needed to store the GDT were during ACPI S3 suspend and resume. As the patches: x86/gdt/i386: store/load GDT for ACPI S3 or hibernation/resume path is not needed x86/gdt/64-bit: store/load GDT for ACPI S3 or hibernate/resume path is not needed. have demonstrated - there are other mechanism by which the GDT is saved and reloaded during early resume path. Hence we do not need to worry about the pvops call-chain for saving the GDT and can and can eliminate it. The other areas where the store_gdt is used are never going to be hit when running under the pvops platforms. Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Link: http://lkml.kernel.org/r/1365194544-14648-4-git-send-email-konrad.wilk@oracle.com Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>	2013-04-11 15:40:38 -07:00
Konrad Rzeszutek Wilk	84e70971e6	x86-32, gdt: Store/load GDT for ACPI S3 or hibernation/resume path is not needed During the ACPI S3 suspend, we store the GDT in the wakup_header (see wakeup_asm.s) field called 'pmode_gdt'. Which is then used during the resume path and has the same exact value as what the store/load_gdt do with the saved_context (which is saved/restored via save/restore_processor_state()). The flow during resume from ACPI S3 is simpler than the 64-bit counterpart. We only use the early bootstrap once (wakeup_gdt) and do various checks in real mode. After the checks are completed, we load the saved GDT ('pmode_gdt') and continue on with the resume (by heading to startup_32 in trampoline_32.S) - which quickly jumps to what was saved in 'pmode_entry' aka 'wakeup_pmode_return'. The 'wakeup_pmode_return' restores the GDT (saved_gdt) again (which was saved in do_suspend_lowlevel initially). After that it ends up calling the 'ret_point' which calls 'restore_processor_state()'. We have two opportunities to remove code where we restore the same GDT twice. Here is the call chain: wakeup_start \|- lgdtl wakeup_gdt [the work-around broken BIOSes] \| \| - lgdtl pmode_gdt [the real one] \| \-- startup_32 (in trampoline_32.S) \-- wakeup_pmode_return (in wakeup_32.S) \|- lgdtl saved_gdt [the real one] \-- ret_point \|.. \|- call restore_processor_state The hibernate path is much simpler. During the saving of the hibernation image we call save_processor_state() and save the contents of that along with the rest of the kernel in the hibernation image destination. We save the EIP of 'restore_registers' (restore_jump_address) and cr3 (restore_cr3). During hibernate resume, the 'restore_registers' (via the 'restore_jump_address) in hibernate_asm_32.S is invoked which restores the contents of most registers. Naturally the resume path benefits from already being in 32-bit mode, so it does not have to reload the GDT. It only reloads the cr3 (from restore_cr3) and continues on. Note that the restoration of the restore image page-tables is done prior to this. After the 'restore_registers' it returns and we end up called restore_processor_state() - where we reload the GDT. The reload of the GDT is not needed as bootup kernel has already loaded the GDT which is at the same physical location as the the restored kernel. Note that the hibernation path assumes the GDT is correct during its 'restore_registers'. The assumption in the code is that the restored image is the same as saved - meaning we are not trying to restore an different kernel in the virtual address space of a new kernel. Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Link: http://lkml.kernel.org/r/1365194544-14648-3-git-send-email-konrad.wilk@oracle.com Cc: Rafael J. Wysocki <rjw@sisk.pl> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>	2013-04-11 15:40:17 -07:00
Konrad Rzeszutek Wilk	e7a5cd063c	x86-64, gdt: Store/load GDT for ACPI S3 or hibernate/resume path is not needed. During the ACPI S3 resume path the trampoline code handles it already. During the ACPI S3 suspend phase (acpi_suspend_lowlevel) we set: early_gdt_descr.address = (..)get_cpu_gdt_table(smp_processor_id()); which is then used during the resume path and has the same exact value as what the store/load_gdt do with the saved_context (which is saved/restored via save/restore_processor_state()). The flow during resume is complex and for 64-bit kernels we use three GDTs - one early bootstrap GDT (wakeup_igdt) that we load to workaround broken BIOSes, an early Protected Mode to Long Mode transition one (tr_gdt), and the final one - early_gdt_descr (which points to the real GDT). The early ('wakeup_gdt') is loaded in 'trampoline_start' for working around broken BIOSes, and then when we end up in Protected Mode in the startup_32 (in trampoline_64.s, not head_32.s) we use the 'tr_gdt' (still in trampoline_64.s). This 'tr_gdt' has a a 32-bit code segment, 64-bit code segment with L=1, and a 32-bit data segment. Once we have transitioned from Protected Mode to Long Mode we then set the GDT to 'early_gdt_desc' and then via an iretq emerge in wakeup_long64 (set via 'initial_code' variable in acpi_suspend_lowlevel). In the wakeup_long64 we end up restoring the %rip (which is set to 'resume_point') and jump there. In 'resume_point' we call 'restore_processor_state' which does the load_gdt on the saved context. This load_gdt is redundant as the GDT loaded via early_gdt_desc is the same. Here is the call-chain: wakeup_start \|- lgdtl wakeup_gdt [the work-around broken BIOSes] \| \-- trampoline_start (trampoline_64.S) \|- lgdtl tr_gdt \| \-- startup_32 (trampoline_64.S) \| \-- startup_64 (trampoline_64.S) \| \-- secondary_startup_64 \|- lgdtl early_gdt_desc \| ... \|- movq initial_code(%rip), %eax \|-.. lretq \-- wakeup_64 \|-- other registers are reloaded \|-- call restore_processor_state The hibernate path is much simpler. During the saving of the hibernation image we call save_processor_state() and save the contents of that along with the rest of the kernel in the hibernation image destination. We save the EIP of 'restore_registers' (restore_jump_address) and cr3 (restore_cr3). During hibernate resume, the 'restore_registers' (via the 'restore_jump_address) in hibernate_asm_64.S is invoked which restores the contents of most registers. Naturally the resume path benefits from already being in 64-bit mode, so it does not have to load the GDT. It only reloads the cr3 (from restore_cr3) and continues on. Note that the restoration of the restore image page-tables is done prior to this. After the 'restore_registers' it returns and we end up called restore_processor_state() - where we reload the GDT. The reload of the GDT is not needed as bootup kernel has already loaded the GDT which is at the same physical location as the the restored kernel. Note that the hibernation path assumes the GDT is correct during its 'restore_registers'. The assumption in the code is that the restored image is the same as saved - meaning we are not trying to restore an different kernel in the virtual address space of a new kernel. Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Link: http://lkml.kernel.org/r/1365194544-14648-2-git-send-email-konrad.wilk@oracle.com Cc: Rafael J. Wysocki <rjw@sisk.pl> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>	2013-04-11 15:39:38 -07:00
Kees Cook	4eefbe792b	x86: Use a read-only IDT alias on all CPUs Make a copy of the IDT (as seen via the "sidt" instruction) read-only. This primarily removes the IDT from being a target for arbitrary memory write attacks, and has the added benefit of also not leaking the kernel base offset, if it has been relocated. We already did this on vendor == Intel and family == 5 because of the F0 0F bug -- regardless of if a particular CPU had the F0 0F bug or not. Since the workaround was so cheap, there simply was no reason to be very specific. This patch extends the readonly alias to all CPUs, but does not activate the #PF to #UD conversion code needed to deliver the proper exception in the F0 0F case except on Intel family 5 processors. Signed-off-by: Kees Cook <keescook@chromium.org> Link: http://lkml.kernel.org/r/20130410192422.GA17344@www.outflux.net Cc: Eric Northup <digitaleric@google.com> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>	2013-04-11 13:53:19 -07:00
Richard Weinberger	7791c8423f	x86,efi: Check max_size only if it is non-zero. Some EFI implementations return always a MaximumVariableSize of 0, check against max_size only if it is non-zero. My Intel DQ67SW desktop board has such an implementation. Signed-off-by: Richard Weinberger <richard@nod.at> Signed-off-by: Matt Fleming <matt.fleming@intel.com>	2013-04-11 15:45:52 +01:00
Kevin Wolf	f8da94e9e4	KVM: x86 emulator: Fix segment loading in VM86 This fixes a regression introduced in commit `03ebebeb1` ("KVM: x86 emulator: Leave segment limit and attributs alone in real mode"). The mentioned commit changed the segment descriptors for both real mode and VM86 to only update the segment base instead of creating a completely new descriptor with limit 0xffff so that unreal mode keeps working across a segment register reload. This leads to an invalid segment descriptor in the eyes of VMX, which seems to be okay for real mode because KVM will fix it up before the next VM entry or emulate the state, but it doesn't do this if the guest is in VM86, so we end up with: KVM: entry failed, hardware error 0x80000021 Fix this by effectively reverting commit `03ebebeb1` for VM86 and leaving it only in place for real mode, which is where it's really needed. Signed-off-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Gleb Natapov <gleb@redhat.com>	2013-04-11 15:53:06 +03:00
Andrea Arcangeli	f76cfa3c24	x86/mm/cpa: Convert noop to functional fix Commit: `a8aed3e075` ("x86/mm/pageattr: Prevent PSE and GLOABL leftovers to confuse pmd/pte_present and pmd_huge") introduced a valid fix but one location that didn't trigger the bug that lead to finding those (small) problems, wasn't updated using the right variable. The wrong variable was also initialized for no good reason, that may have been the source of the confusion. Remove the noop initialization accordingly. Commit `a8aed3e075` also erroneously removed one canon_pgprot pass meant to clear pmd bitflags not supported in hardware by older CPUs, that automatically gets corrected by this patch too by applying it to the right variable in the new location. Reported-by: Stefan Bader <stefan.bader@canonical.com> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Borislav Petkov <bp@alien8.de> Cc: Andy Whitcroft <apw@canonical.com> Cc: Mel Gorman <mgorman@suse.de> Link: http://lkml.kernel.org/r/1365600505-19314-1-git-send-email-aarcange@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>	2013-04-11 10:34:42 +02:00
Linus Torvalds	722aacb285	Bug-fixes: - Early bootup issue found on DL380 machines - Fix for the timer interrupt not being processed right away. -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.13 (GNU/Linux) iQEcBAABAgAGBQJRZbq0AAoJEFjIrFwIi8fJTnsIAIWYw7g9j0T9gijc/t5wEZrK KpPBITlGFAeM7liEaUh5X5M2B86tBoI77uV5EGCvDDwth+FD5WsgeMesxV9KlMdj vbWLGubJpmd8zy6Q1f/T3LsxGHGCjz8jASeN7YTPRdBqITOQDqXjj2VC/4n7AQCh Le3ml3A/NZTZMiz2PK8lDzjpzY2lDgrIloevahVoYLe8Jxg2aW5JTaZQg7oPA6ir lqC1Sgju6RDKR0kmPmM8wl5TOIMCkrygriP62B+Ww9wl9HlS+X5/JlK3zVj0vJXo oxNyDKEK96M54oO5t/v7qzdfX2Xj+S/6JPZOlegCWGClS9rQoK3uDBZupGLiAh8= =jaNW -----END PGP SIGNATURE----- Merge tag 'stable/for-linus-3.9-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen Pull Xen fixes from Konrad Rzeszutek Wilk: "Two bug-fixes: - Early bootup issue found on DL380 machines - Fix for the timer interrupt not being processed right awaym leading to quite delayed time skew on certain workloads" * tag 'stable/for-linus-3.9-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen: xen/mmu: On early bootup, flush the TLB when changing RO->RW bits Xen provided pagetables. xen/events: Handle VIRQ_TIMER before any other hardirq in event loop.	2013-04-10 15:57:33 -07:00
Boris Ostrovsky	511ba86e1d	x86, mm: Patch out arch_flush_lazy_mmu_mode() when running on bare metal Invoking arch_flush_lazy_mmu_mode() results in calls to preempt_enable()/disable() which may have performance impact. Since lazy MMU is not used on bare metal we can patch away arch_flush_lazy_mmu_mode() so that it is never called in such environment. [ hpa: the previous patch "Fix vmalloc_fault oops during lazy MMU updates" may cause a minor performance regression on bare metal. This patch resolves that performance regression. It is somewhat unclear to me if this is a good -stable candidate. ] Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Link: http://lkml.kernel.org/r/1364045796-10720-2-git-send-email-konrad.wilk@oracle.com Tested-by: Josh Boyer <jwboyer@redhat.com> Tested-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Acked-by: Borislav Petkov <bp@suse.de> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Cc: <stable@vger.kernel.org> SEE NOTE ABOVE	2013-04-10 11:25:10 -07:00
Samu Kallio	1160c2779b	x86, mm, paravirt: Fix vmalloc_fault oops during lazy MMU updates In paravirtualized x86_64 kernels, vmalloc_fault may cause an oops when lazy MMU updates are enabled, because set_pgd effects are being deferred. One instance of this problem is during process mm cleanup with memory cgroups enabled. The chain of events is as follows: - zap_pte_range enables lazy MMU updates - zap_pte_range eventually calls mem_cgroup_charge_statistics, which accesses the vmalloc'd mem_cgroup per-cpu stat area - vmalloc_fault is triggered which tries to sync the corresponding PGD entry with set_pgd, but the update is deferred - vmalloc_fault oopses due to a mismatch in the PUD entries The OOPs usually looks as so: ------------[ cut here ]------------ kernel BUG at arch/x86/mm/fault.c:396! invalid opcode: 0000 [#1] SMP .. snip .. CPU 1 Pid: 10866, comm: httpd Not tainted 3.6.10-4.fc18.x86_64 #1 RIP: e030:[<ffffffff816271bf>] [<ffffffff816271bf>] vmalloc_fault+0x11f/0x208 .. snip .. Call Trace: [<ffffffff81627759>] do_page_fault+0x399/0x4b0 [<ffffffff81004f4c>] ? xen_mc_extend_args+0xec/0x110 [<ffffffff81624065>] page_fault+0x25/0x30 [<ffffffff81184d03>] ? mem_cgroup_charge_statistics.isra.13+0x13/0x50 [<ffffffff81186f78>] __mem_cgroup_uncharge_common+0xd8/0x350 [<ffffffff8118aac7>] mem_cgroup_uncharge_page+0x57/0x60 [<ffffffff8115fbc0>] page_remove_rmap+0xe0/0x150 [<ffffffff8115311a>] ? vm_normal_page+0x1a/0x80 [<ffffffff81153e61>] unmap_single_vma+0x531/0x870 [<ffffffff81154962>] unmap_vmas+0x52/0xa0 [<ffffffff81007442>] ? pte_mfn_to_pfn+0x72/0x100 [<ffffffff8115c8f8>] exit_mmap+0x98/0x170 [<ffffffff810050d9>] ? __raw_callee_save_xen_pmd_val+0x11/0x1e [<ffffffff81059ce3>] mmput+0x83/0xf0 [<ffffffff810624c4>] exit_mm+0x104/0x130 [<ffffffff8106264a>] do_exit+0x15a/0x8c0 [<ffffffff810630ff>] do_group_exit+0x3f/0xa0 [<ffffffff81063177>] sys_exit_group+0x17/0x20 [<ffffffff8162bae9>] system_call_fastpath+0x16/0x1b Calling arch_flush_lazy_mmu_mode immediately after set_pgd makes the changes visible to the consistency checks. Cc: <stable@vger.kernel.org> RedHat-Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=914737 Tested-by: Josh Boyer <jwboyer@redhat.com> Reported-and-Tested-by: Krishna Raman <kraman@redhat.com> Signed-off-by: Samu Kallio <samu.kallio@aberdeencloud.com> Link: http://lkml.kernel.org/r/1364045796-10720-1-git-send-email-konrad.wilk@oracle.com Tested-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>	2013-04-10 11:25:07 -07:00
Martin Bundgaard	7e9a2f0a08	x86/mm/numa: Simplify some bit mangling Minor. Reordered a few lines to lose a superfluous OR operation. Signed-off-by: Martin Bundgaard <martin@mindflux.org> Link: http://lkml.kernel.org/r/1363286075-62615-1-git-send-email-martin@mindflux.org Signed-off-by: Ingo Molnar <mingo@kernel.org>	2013-04-10 19:06:26 +02:00
Andi Kleen	f8378f5259	perf/x86: Add Sandy Bridge constraints for CYCLE_ACTIVITY.* Add CYCLE_ACTIVITY.CYCLES_NO_DISPATCH/CYCLES_L1D_PENDING constraints. These recently documented events have restrictions to counter 0-3 and counter 2 respectively. The perf scheduler needs to know that to schedule them correctly. IvyBridge already has the necessary constraints. Signed-off-by: Andi Kleen <ak@linux.intel.com> Cc: a.p.zijlstra@chello.nl Cc: Stephane Eranian <eranian@google.com> Link: http://lkml.kernel.org/r/1362784968-12542-1-git-send-email-andi@firstfloor.org Signed-off-by: Ingo Molnar <mingo@kernel.org>	2013-04-10 15:00:07 +02:00
Paul Bolle	cd69aa6b38	x86/mm: Re-enable DEBUG_TLBFLUSH for X86_32 CONFIG_INVLPG got removed in commit `094ab1db7c` ("x86, 386 removal: Remove CONFIG_INVLPG"). That commit left one instance of CONFIG_INVLPG untouched, effectively disabling DEBUG_TLBFLUSH for X86_32. Since all currently supported x86 CPUs should now be able to support that option, just drop the entire sub-dependency. Signed-off-by: Paul Bolle <pebolle@tiscali.nl> Link: http://lkml.kernel.org/r/1363262077.1335.71.camel@x61.thuisdomein Signed-off-by: Ingo Molnar <mingo@kernel.org>	2013-04-10 14:53:12 +02:00
Paul Bolle	781b0e870c	idle: Remove unused ARCH_HAS_DEFAULT_IDLE The Kconfig symbol ARCH_HAS_DEFAULT_IDLE is unused. Commit `a0bfa13738` ("cpuidle: stop depending on pm_idle") removed the only place were it was actually used. But it did not remove its Kconfig entries (for sh and x86). Remove those two entries now. Signed-off-by: Paul Bolle <pebolle@tiscali.nl> Cc: Len Brown <len.brown@intel.com> Cc: Paul Mundt <lethal@linux-sh.org> Link: http://lkml.kernel.org/r/1363869683.1390.134.camel@x61.thuisdomein Signed-off-by: Ingo Molnar <mingo@kernel.org>	2013-04-10 14:40:46 +02:00
Borislav Petkov	5952886bfe	x86/mm/cpa: Cleanup split_large_page() and its callee So basically we're generating the pte_t * from a struct page and we're handing it down to the __split_large_page() internal version which then goes and gets back struct page * from it because it needs it. Change the caller to hand down struct page * directly and the callee can compute the pte_t itself. Net save is one virt_to_page() call and simpler code. While at it, make __split_large_page() static. Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1363886217-24703-1-git-send-email-bp@alien8.de Signed-off-by: Ingo Molnar <mingo@kernel.org>	2013-04-10 14:39:08 +02:00
Jacob Shin	9c5320c8ea	cpufreq: AMD "frequency sensitivity feedback" powersave bias for ondemand governor Future AMD processors, starting with Family 16h, can provide software with feedback on how the workload may respond to frequency change -- memory-bound workloads will not benefit from higher frequency, where as compute-bound workloads will. This patch enables this "frequency sensitivity feedback" to aid the ondemand governor to make better frequency change decisions by hooking into the powersave bias. Signed-off-by: Jacob Shin <jacob.shin@amd.com> Acked-by: Thomas Renninger <trenn@suse.de> Acked-by: Borislav Petkov <bp@suse.de> Acked-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2013-04-10 13:19:26 +02:00
Matt Fleming	a6e4d5a03e	x86, efivars: firmware bug workarounds should be in platform code Let's not burden ia64 with checks in the common efivars code that we're not writing too much data to the variable store. That kind of thing is an x86 firmware bug, plain and simple. efi_query_variable_store() provides platforms with a wrapper in which they can perform checks and workarounds for EFI variable storage bugs. Cc: H. Peter Anvin <hpa@zytor.com> Cc: Matthew Garrett <mjg59@srcf.ucam.org> Signed-off-by: Matt Fleming <matt.fleming@intel.com>	2013-04-09 11:34:05 +01:00
Ingo Molnar	b6d5278dc8	Clean up cmci_rediscover code to fix problems found by Dave Jones -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.11 (GNU/Linux) iQIcBAABAgAGBQJRXI4YAAoJEKurIx+X31iBY1UP/Rq7DkmmffFh2faMVPMdIC/T Z3xwbqmqNpqJOjBojocOkwpG44Da51h4dZNIdm6xtEFojqf9UdaTPobe00jOEotX MPcc4VsuTS2WLd3dCkRHCfHu2nW6z/F4ysEtIirrHVj5JVqqH9atZPkGBB3Wcsic 00QTV/JeOP1QjvGEMxnaW2rNIXSE19NNbiHE0tx9jNYZuecOhll/GH2AOW17+Kvn ocn4dkisbHRZ/YixuwkqXzDABj2+qiwjHZbP4f4jZbdDuuDR2KWeW0Vkoau1vRHg TU3FX2BCBy4foA8St/AQ7WRbmSA/+5obUWvSWevXXZu7A6CxAmkv4s46D3rRuIaB MLDqov0NobqM4vjJn4Y4bcbvYMIv8zJijok6sg2MNFIHUy9iBFTlJHOai9guSdiB kHKTfLsCUpdkiwHMq3rcFCAMZi/DSxjjPbae8IzP7+IJAG1ff1SP6QwM6UOZq9h1 wVYpvavIOwBhKIsjghkqF0Ov/8a+4lyXMksQ3nKleKYD5908hpDYMD/2i6MTCkpy 50k3OVvnpWbGdQmYwbTiGaOW/jQ2tC72v24o/Y9tKYc2Fea0BlWn+yzIGahCn6ok j7oQHAW1OJ2oc3h/RWtDWSejXUpYsyclCZhlFB+TQj1J/2d7VytFxo+g156xD1Rj VJ0Dseml7vA4Orr6jAMh =wipk -----END PGP SIGNATURE----- Merge tag 'please-pull-cmci_rediscover' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras into x86/ras Pull clean up of the cmci_rediscover code to fix problems found by Dave Jones, from Tony Luck. Signed-off-by: Ingo Molnar <mingo@kernel.org>	2013-04-08 17:41:50 +02:00
Thomas Gleixner	7d1a941731	x86: Use generic idle loop Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Paul McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Reviewed-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com> Cc: Magnus Damm <magnus.damm@gmail.com> Link: http://lkml.kernel.org/r/20130321215235.486594473@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: x86@kernel.org	2013-04-08 17:39:29 +02:00
Thomas Gleixner	ee761f629d	arch: Consolidate tsk_is_polling() Move it to a common place. Preparatory patch for implementing set/clear for the idle need_resched poll implementation. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Paul McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Reviewed-by: Cc: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com> Cc: Magnus Damm <magnus.damm@gmail.com> Link: http://lkml.kernel.org/r/20130321215233.446034505@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2013-04-08 17:39:22 +02:00
Masami Hiramatsu	8101376dc5	kprobes/x86: Just return error for sanity check failure instead of using BUG_ON Return an error from __copy_instruction() and use printk() to give us a more productive message, since this is just an error case which we can handle and also the BUG_ON() never tells us why and what happened. This is related to the following bug-report: https://bugzilla.redhat.com/show_bug.cgi?id=910649 Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Cc: Frank Ch. Eigler <fche@redhat.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: yrl.pp-manager.tt@hitachi.com Link: http://lkml.kernel.org/r/20130404104230.22862.85242.stgit@mhiramat-M0-7522 Signed-off-by: Ingo Molnar <mingo@kernel.org>	2013-04-08 17:28:34 +02:00
Geoff Levand	8b415dcd76	KVM: Move kvm_rebooting declaration out of x86 The variable kvm_rebooting is a common kvm variable, so move its declaration from arch/x86/include/asm/kvm_host.h to include/asm/kvm_host.h. Fixes this sparse warning when building on arm64: virt/kvm/kvm_main.c⚠️ symbol 'kvm_rebooting' was not declared. Should it be static? Signed-off-by: Geoff Levand <geoff@infradead.org> Signed-off-by: Gleb Natapov <gleb@redhat.com>	2013-04-08 13:02:09 +03:00
Geoff Levand	e3ba45b804	KVM: Move kvm_spurious_fault to x86.c The routine kvm_spurious_fault() is an x86 specific routine, so move it from virt/kvm/kvm_main.c to arch/x86/kvm/x86.c. Fixes this sparse warning when building on arm64: virt/kvm/kvm_main.c⚠️ symbol 'kvm_spurious_fault' was not declared. Should it be static? Signed-off-by: Geoff Levand <geoff@infradead.org> Signed-off-by: Gleb Natapov <gleb@redhat.com>	2013-04-08 13:02:06 +03:00
Geoff Levand	fc1b74925f	KVM: Move vm_list kvm_lock declarations out of x86 The variables vm_list and kvm_lock are common to all architectures, so move the declarations from arch/x86/include/asm/kvm_host.h to include/linux/kvm_host.h. Fixes sparse warnings like these when building for arm64: virt/kvm/kvm_main.c: warning: symbol 'kvm_lock' was not declared. Should it be static? virt/kvm/kvm_main.c: warning: symbol 'vm_list' was not declared. Should it be static? Signed-off-by: Geoff Levand <geoff@infradead.org> Signed-off-by: Gleb Natapov <gleb@redhat.com>	2013-04-08 13:02:00 +03:00
Jan Kiszka	a63cb56061	KVM: VMX: Add missing braces to avoid redundant error check The code was already properly aligned, now also add the braces to avoid that err is checked even if alloc_apic_access_page didn't run and change it. Found via Coccinelle by Fengguang Wu. Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Gleb Natapov <gleb@redhat.com>	2013-04-08 12:46:06 +03:00
Ingo Molnar	529801898b	Merge branch 'for-tip' of git://git.kernel.org/pub/scm/linux/kernel/git/rric/oprofile into perf/core Pull IBM zEnterprise EC12 support patchlet from Robert Richter. Signed-off-by: Ingo Molnar <mingo@kernel.org>	2013-04-08 11:43:30 +02:00
Yang Zhang	458f212e36	KVM: x86: fix memory leak in vmx_init Free vmx_msr_bitmap_longmode_x2apic and vmx_msr_bitmap_longmode if kvm_init() fails. Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com> Signed-off-by: Gleb Natapov <gleb@redhat.com>	2013-04-08 10:56:08 +03:00
Linus Torvalds	875b7679ab	Merge git://git.kernel.org/pub/scm/virt/kvm/kvm Pull KVM fix from Gleb Natapov: "Bugfix for the regression introduced by commit c300aa64ddf5" * git://git.kernel.org/pub/scm/virt/kvm/kvm: KVM: Allow cross page reads and writes from cached translations.	2013-04-07 13:01:25 -07:00
Jan Kiszka	b8c07d55d0	KVM: nVMX: Check exit control for VM_EXIT_SAVE_IA32_PAT, not entry controls Obviously a copy&paste mistake: prepare_vmcs12 has to check L1's exit controls for VM_EXIT_SAVE_IA32_PAT. Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Gleb Natapov <gleb@redhat.com>	2013-04-07 14:06:42 +03:00
Yang Zhang	44944d4d28	KVM: Call kvm_apic_match_dest() to check destination vcpu For a given vcpu, kvm_apic_match_dest() will tell you whether the vcpu in the destination list quickly. Drop kvm_calculate_eoi_exitmap() and use kvm_apic_match_dest() instead. Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com> Signed-off-by: Gleb Natapov <gleb@redhat.com>	2013-04-07 13:55:49 +03:00
Takuya Yoshikawa	450e0b411f	Revert "KVM: MMU: Move kvm_mmu_free_some_pages() into kvm_mmu_alloc_page()" With the following commit, shadow pages can be zapped at random during a shadow page talbe walk: KVM: MMU: Move kvm_mmu_free_some_pages() into kvm_mmu_alloc_page() `7ddca7e43c` This patch reverts it and fixes __direct_map() and FNAME(fetch)(). Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp> Signed-off-by: Gleb Natapov <gleb@redhat.com>	2013-04-07 13:13:36 +03:00
Andrew Honig	8f964525a1	KVM: Allow cross page reads and writes from cached translations. This patch adds support for kvm_gfn_to_hva_cache_init functions for reads and writes that will cross a page. If the range falls within the same memslot, then this will be a fast operation. If the range is split between two memslots, then the slower kvm_read_guest and kvm_write_guest are used. Tested: Test against kvm_clock unit tests. Signed-off-by: Andrew Honig <ahonig@google.com> Signed-off-by: Gleb Natapov <gleb@redhat.com>	2013-04-07 13:05:35 +03:00
Jan Beulich	918708245e	x86: Fix rebuild with EFI_STUB enabled eboot.o and efi_stub_$(BITS).o didn't get added to "targets", and hence their .cmd files don't get included by the build machinery, leading to the files always getting rebuilt. Rather than adding the two files individually, take the opportunity and add $(VMLINUX_OBJS) to "targets" instead, thus allowing the assignment at the top of the file to be shrunk quite a bit. At the same time, remove a pointless flags override line - the variable assigned to was misspelled anyway, and the options added are meaningless for assembly sources. [ hpa: the patch is not minimal, but I am taking it for -urgent anyway since the excess impact of the patch seems to be small enough. ] Signed-off-by: Jan Beulich <jbeulich@suse.com> Link: http://lkml.kernel.org/r/515C5D2502000078000CA6AD@nat28.tlf.novell.com Cc: Matthew Garrett <mjg@redhat.com> Cc: Matt Fleming <matt.fleming@intel.com> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>	2013-04-05 13:59:23 -07:00
Thomas Gleixner	0ed2aef9b3	Merge branch 'fortglx/3.10/time' of git://git.linaro.org/people/jstultz/linux into timers/core	2013-04-03 12:27:29 +02:00
Tim Chen	d34a460092	crypto: sha256 - Optimized sha256 x86_64 routine using AVX2's RORX instructions Provides SHA256 x86_64 assembly routine optimized with SSE, AVX and AVX2's RORX instructions. Speedup of 70% or more has been measured over the generic implementation. Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2013-04-03 09:06:32 +08:00
Tim Chen	ec2b4c851f	crypto: sha256 - Optimized sha256 x86_64 assembly routine with AVX instructions. Provides SHA256 x86_64 assembly routine optimized with SSE and AVX instructions. Speedup of 60% or more has been measured over the generic implementation. Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2013-04-03 09:06:32 +08:00
Tim Chen	46d208a2bd	crypto: sha256 - Optimized sha256 x86_64 assembly routine using Supplemental SSE3 instructions. Provides SHA256 x86_64 assembly routine optimized with SSSE3 instructions. Speedup of 40% or more has been measured over the generic implementation. Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2013-04-03 09:06:31 +08:00
Jussi Kivilinna	873b9cafa8	crypto: x86 - build AVX block cipher implementations only if assembler supports AVX instructions These modules require AVX support in assembler, so add new check to Makefile for this. Other option would be to use CONFIG_AS_AVX inside source files, but that would result dummy/empty/no-fuctionality modules being created. Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2013-04-03 09:06:30 +08:00
Jussi Kivilinna	eca1726997	crypto: x86/crc32-pclmul - assembly clean-ups: use ENTRY/ENDPROC Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2013-04-03 09:06:29 +08:00
Borislav Petkov	73f460408c	x86, quirks: Shut-up a long-standing gcc warning So gcc nags about those since forever in randconfig builds. arch/x86/kernel/quirks.c: In function ‘ati_ixp4x0_rev’: arch/x86/kernel/quirks.c:361:4: warning: ‘b’ is used uninitialized in this function [-Wuninitialized] arch/x86/kernel/quirks.c: In function ‘ati_force_enable_hpet’: arch/x86/kernel/quirks.c:367:4: warning: ‘d’ may be used uninitialized in this function [-Wuninitialized] arch/x86/kernel/quirks.c:357:6: note: ‘d’ was declared here arch/x86/kernel/quirks.c:407:21: warning: ‘val’ may be used uninitialized in this function [-Wuninitialized] This function quirk is called on a SB400 chipset only anyway so the distant possibility of a PCI access failing becomes almost impossible there. Even if it did fail, then something else more serious is the problem. So zero-out the variables so that gcc shuts up but do a coarse check on the PCI accesses at the end and signal whether any of them had an error. They shouldn't but in case they do, we'll at least know and we can address it. Signed-off-by: Borislav Petkov <bp@suse.de> Link: http://lkml.kernel.org/r/1362428180-8865-6-git-send-email-bp@alien8.de Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>	2013-04-02 16:03:34 -07:00
Borislav Petkov	1423bed239	x86, msr: Unify variable names Make sure all MSR-accessing primitives which split MSR values in two 32-bit parts have their variables called 'low' and 'high' for consistence with the rest of the code and for ease of staring. Signed-off-by: Borislav Petkov <bp@suse.de> Link: http://lkml.kernel.org/r/1362428180-8865-5-git-send-email-bp@alien8.de Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>	2013-04-02 16:03:32 -07:00
Borislav Petkov	8e3c2a8cf6	x86: Drop KERNEL_IMAGE_START We have KERNEL_IMAGE_START and __START_KERNEL_map which both contain the start of the kernel text mapping's virtual address. Remove the prior one which has been replicated a lot less times around the tree. No functionality change. Signed-off-by: Borislav Petkov <bp@alien8.de> Link: http://lkml.kernel.org/r/1362428180-8865-3-git-send-email-bp@alien8.de Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>	2013-04-02 16:03:29 -07:00
Paul Moore	8b4b9f27e5	x86: remove the x32 syscall bitmask from syscall_get_nr() Commit `fca460f95e` simplified the x32 implementation by creating a syscall bitmask, equal to 0x40000000, that could be applied to x32 syscalls such that the masked syscall number would be the same as a x86_64 syscall. While that patch was a nice way to simplify the code, it went a bit too far by adding the mask to syscall_get_nr(); returning the masked syscall numbers can cause confusion with callers that expect syscall numbers matching the x32 ABI, e.g. unmasked syscall numbers. This patch fixes this by simply removing the mask from syscall_get_nr() while preserving the other changes from the original commit. While there are several syscall_get_nr() callers in the kernel, most simply check that the syscall number is greater than zero, in this case this patch will have no effect. Of those remaining callers, they appear to be few, seccomp and ftrace, and from my testing of seccomp without this patch the original commit definitely breaks things; the seccomp filter does not correctly filter the syscalls due to the difference in syscall numbers in the BPF filter and the value from syscall_get_nr(). Applying this patch restores the seccomp BPF filter functionality on x32. I've tested this patch with the seccomp BPF filters as well as ftrace and everything looks reasonable to me; needless to say general usage seemed fine as well. Signed-off-by: Paul Moore <pmoore@redhat.com> Link: http://lkml.kernel.org/r/20130215172143.12549.10292.stgit@localhost Cc: <stable@vger.kernel.org> Cc: Will Drewry <wad@chromium.org> Cc: H. Peter Anvin <hpa@zytor.com> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>	2013-04-02 14:38:09 -07:00
Srivatsa S. Bhat	7a0c819d28	x86/mce: Rework cmci_rediscover() to play well with CPU hotplug Dave Jones reports that offlining a CPU leads to this trace: numa_remove_cpu cpu 1 node 0: mask now 0,2-3 smpboot: CPU 1 is now offline BUG: using smp_processor_id() in preemptible [00000000] code: cpu-offline.sh/10591 caller is cmci_rediscover+0x6a/0xe0 Pid: 10591, comm: cpu-offline.sh Not tainted 3.9.0-rc3+ #2 Call Trace: [<ffffffff81333bbd>] debug_smp_processor_id+0xdd/0x100 [<ffffffff8101edba>] cmci_rediscover+0x6a/0xe0 [<ffffffff815f5b9f>] mce_cpu_callback+0x19d/0x1ae [<ffffffff8160ea66>] notifier_call_chain+0x66/0x150 [<ffffffff8107ad7e>] __raw_notifier_call_chain+0xe/0x10 [<ffffffff8104c2e3>] cpu_notify+0x23/0x50 [<ffffffff8104c31e>] cpu_notify_nofail+0xe/0x20 [<ffffffff815ef082>] _cpu_down+0x302/0x350 [<ffffffff815ef106>] cpu_down+0x36/0x50 [<ffffffff815f1c9d>] store_online+0x8d/0xd0 [<ffffffff813edc48>] dev_attr_store+0x18/0x30 [<ffffffff81226eeb>] sysfs_write_file+0xdb/0x150 [<ffffffff811adfb2>] vfs_write+0xa2/0x170 [<ffffffff811ae16c>] sys_write+0x4c/0xa0 [<ffffffff81613019>] system_call_fastpath+0x16/0x1b However, a look at cmci_rediscover shows that it can be simplified quite a bit, apart from solving the above issue. It invokes functions that take spin locks with interrupts disabled, and hence it can run in atomic context. Also, it is run in the CPU_POST_DEAD phase, so the dying CPU is already dead and out of the cpu_online_mask. So take these points into account and simplify the code, and thereby also fix the above issue. Reported-by: Dave Jones <davej@redhat.com> Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com> Signed-off-by: Tony Luck <tony.luck@intel.com>	2013-04-02 14:04:01 -07:00
Konrad Rzeszutek Wilk	b22227944b	xen/mmu: On early bootup, flush the TLB when changing RO->RW bits Xen provided pagetables. Occassionaly on a DL380 G4 the guest would crash quite early with this: (XEN) d244:v0: unhandled page fault (ec=0003) (XEN) Pagetable walk from ffffffff84dc7000: (XEN) L4[0x1ff] = 00000000c3f18067 0000000000001789 (XEN) L3[0x1fe] = 00000000c3f14067 000000000000178d (XEN) L2[0x026] = 00000000dc8b2067 0000000000004def (XEN) L1[0x1c7] = 00100000dc8da067 0000000000004dc7 (XEN) domain_crash_sync called from entry.S (XEN) Domain 244 (vcpu#0) crashed on cpu#3: (XEN) ----[ Xen-4.1.3OVM x86_64 debug=n Not tainted ]---- (XEN) CPU: 3 (XEN) RIP: e033:[<ffffffff81263f22>] (XEN) RFLAGS: 0000000000000216 EM: 1 CONTEXT: pv guest (XEN) rax: 0000000000000000 rbx: ffffffff81785f88 rcx: 000000000000003f (XEN) rdx: 0000000000000000 rsi: 00000000dc8da063 rdi: ffffffff84dc7000 The offending code shows it to be a loop writting the value zero (%rax) in the %rdi (the L4 provided by Xen) register: 0: 44 00 00 add %r8b,(%rax) 3: 31 c0 xor %eax,%eax 5: b9 40 00 00 00 mov $0x40,%ecx a: 66 0f 1f 84 00 00 00 nopw 0x0(%rax,%rax,1) 11: 00 00 13: ff c9 dec %ecx 15:* 48 89 07 mov %rax,(%rdi) <-- trapping instruction 18: 48 89 47 08 mov %rax,0x8(%rdi) 1c: 48 89 47 10 mov %rax,0x10(%rdi) which fails. xen_setup_kernel_pagetable recycles some of the Xen's page-table entries when it has switched over to its Linux page-tables. Right before try to clear the page, we make a hypercall to change it from _RO to _RW and that works (otherwise we would hit an BUG()). And the _RW flag is set for that page: (XEN) L1[0x1c7] = 001000004885f067 0000000000004dc7 The error code is 3, so PFEC_page_present and PFEC_write_access, so page is present (correct), and we tried to write to the page, but a violation occurred. The one theory is that the the page entries in hardware (which are cached) are not up to date with what we just set. Especially as we have just done an CR3 write and flushed the multicalls. This patch does solve the problem by flusing out the TLB page entry after changing it from _RO to _RW and we don't hit this issue anymore. Fixed-Oracle-Bug: 16243091 [ON OCCASIONS VM START GOES INTO 'CRASH' STATE: CLEAR_PAGE+0X12 ON HP DL380 G4] Reported-and-Tested-by: Saar Maoz <Saar.Maoz@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>	2013-04-02 14:02:23 -04:00
Borislav Petkov	7d7dc116e5	x86, cpu: Convert AMD Erratum 400 Convert AMD erratum 400 to the bug infrastructure. Then, retract all exports for modules since they're not needed now and make the AMD erratum checking machinery local to amd.c. Use forward declarations to avoid shuffling too much code around needlessly. Signed-off-by: Borislav Petkov <bp@suse.de> Link: http://lkml.kernel.org/r/1363788448-31325-7-git-send-email-bp@alien8.de Signed-off-by: H. Peter Anvin <hpa@zytor.com>	2013-04-02 10:12:55 -07:00
Borislav Petkov	e6ee94d58d	x86, cpu: Convert AMD Erratum 383 Convert the AMD erratum 383 testing code to the bug infrastructure. This allows keeping the AMD-specific erratum testing machinery private to amd.c and not export symbols to modules needlessly. Signed-off-by: Borislav Petkov <bp@suse.de> Link: http://lkml.kernel.org/r/1363788448-31325-6-git-send-email-bp@alien8.de Signed-off-by: H. Peter Anvin <hpa@zytor.com>	2013-04-02 10:12:54 -07:00
Borislav Petkov	c5b41a6750	x86, cpu: Convert Cyrix coma bug detection ... to the new facility. Signed-off-by: Borislav Petkov <bp@suse.de> Link: http://lkml.kernel.org/r/1363788448-31325-5-git-send-email-bp@alien8.de Signed-off-by: H. Peter Anvin <hpa@zytor.com>	2013-04-02 10:12:54 -07:00
Borislav Petkov	93a829e8e2	x86, cpu: Convert FDIV bug detection ... to the new facility. Add a reference to the wikipedia article explaining the FDIV test we're doing here. Signed-off-by: Borislav Petkov <bp@suse.de> Link: http://lkml.kernel.org/r/1363788448-31325-4-git-send-email-bp@alien8.de Signed-off-by: H. Peter Anvin <hpa@zytor.com>	2013-04-02 10:12:53 -07:00
Borislav Petkov	e2604b49e8	x86, cpu: Convert F00F bug detection ... to using the new facility and drop the cpuinfo_x86 member. Signed-off-by: Borislav Petkov <bp@suse.de> Link: http://lkml.kernel.org/r/1363788448-31325-3-git-send-email-bp@alien8.de Signed-off-by: H. Peter Anvin <hpa@zytor.com>	2013-04-02 10:12:52 -07:00
Borislav Petkov	65fc985b37	x86, cpu: Expand cpufeature facility to include cpu bugs We add another 32-bit vector at the end of the ->x86_capability bitvector which collects bugs present in CPUs. After all, a CPU bug is a kind of a capability, albeit a strange one. Signed-off-by: Borislav Petkov <bp@suse.de> Link: http://lkml.kernel.org/r/1363788448-31325-2-git-send-email-bp@alien8.de Signed-off-by: H. Peter Anvin <hpa@zytor.com>	2013-04-02 10:12:52 -07:00
Paolo Bonzini	afd80d85ae	pmu: prepare for migration support In order to migrate the PMU state correctly, we need to restore the values of MSR_CORE_PERF_GLOBAL_STATUS (a read-only register) and MSR_CORE_PERF_GLOBAL_OVF_CTRL (which has side effects when written). We also need to write the full 40-bit value of the performance counter, which would only be possible with a v3 architectural PMU's full-width counter MSRs. To distinguish host-initiated writes from the guest's, pass the full struct msr_data to kvm_pmu_set_msr. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Gleb Natapov <gleb@redhat.com>	2013-04-02 17:42:44 +03:00
David S. Miller	a210576cf8	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Conflicts: net/mac80211/sta_info.c net/wireless/core.h Two minor conflicts in wireless. Overlapping additions of extern declarations in net/wireless/core.h and a bug fix overlapping with the addition of a boolean parameter to __ieee80211_key_free(). Signed-off-by: David S. Miller <davem@davemloft.net>	2013-04-01 13:36:50 -04:00
Stephane Eranian	9ad64c0f48	perf/x86: Add support for PEBS Precise Store This patch adds support for PEBS Precise Store which is available on Intel Sandy Bridge and Ivy Bridge processors. To use Precise store, the proper PEBS event must be used: mem_trans_retired:precise_stores. For the perf tool, the generic mem-stores event exported via sysfs can be used directly. Signed-off-by: Stephane Eranian <eranian@google.com> Cc: peterz@infradead.org Cc: ak@linux.intel.com Cc: acme@redhat.com Cc: jolsa@redhat.com Cc: namhyung.kim@lge.com Link: http://lkml.kernel.org/r/1359040242-8269-11-git-send-email-eranian@google.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>	2013-04-01 12:17:06 -03:00
Stephane Eranian	a63fcab452	perf/x86: Export PEBS load latency threshold register to sysfs Make the PEBS Load Latency threshold register layout and encoding visible to user level tools. Signed-off-by: Stephane Eranian <eranian@google.com> Cc: peterz@infradead.org Cc: ak@linux.intel.com Cc: acme@redhat.com Cc: jolsa@redhat.com Cc: namhyung.kim@lge.com Link: http://lkml.kernel.org/r/1359040242-8269-10-git-send-email-eranian@google.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>	2013-04-01 12:16:49 -03:00
Stephane Eranian	f20093eef5	perf/x86: Add memory profiling via PEBS Load Latency This patch adds support for memory profiling using the PEBS Load Latency facility. Load accesses are sampled by HW and the instruction address, data address, load latency, data source, tlb, locked information can be saved in the sampling buffer if using the PERF_SAMPLE_COST (for latency), PERF_SAMPLE_ADDR, PERF_SAMPLE_DATA_SRC types. To enable PEBS Load Latency, users have to use the model specific event: - on NHM/WSM: MEM_INST_RETIRED:LATENCY_ABOVE_THRESHOLD - on SNB/IVB: MEM_TRANS_RETIRED:LATENCY_ABOVE_THRESHOLD To make things easier, this patch also exports a generic alias via sysfs: mem-loads. It export the right event encoding based on the host CPU and can be used directly by the perf tool. Loosely based on Intel's Lin Ming patch posted on LKML in July 2011. Signed-off-by: Stephane Eranian <eranian@google.com> Cc: peterz@infradead.org Cc: ak@linux.intel.com Cc: acme@redhat.com Cc: jolsa@redhat.com Cc: namhyung.kim@lge.com Link: http://lkml.kernel.org/r/1359040242-8269-9-git-send-email-eranian@google.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>	2013-04-01 12:16:31 -03:00
Stephane Eranian	9fac2cf316	perf/x86: Add flags to event constraints This patch adds a flags field to each event constraint. It can be used to store event specific features which can then later be used by scheduling code or low-level x86 code. The flags are propagated into event->hw.flags during the get_event_constraint() call. They are cleared during the put_event_constraint() call. This mechanism is going to be used by the PEBS-LL patches. It avoids defining yet another table to hold event specific information. Signed-off-by: Stephane Eranian <eranian@google.com> Cc: peterz@infradead.org Cc: ak@linux.intel.com Cc: jolsa@redhat.com Cc: namhyung.kim@lge.com Link: http://lkml.kernel.org/r/1359040242-8269-4-git-send-email-eranian@google.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>	2013-04-01 12:15:04 -03:00
Linus Torvalds	dfca53fb16	ACPI and power management fixes for 3.9-rc5 - Fix for a recent cpufreq regression related to acpi-cpufreq and suspend/resume from Viresh Kumar. - cpufreq stats reference counting fix from Viresh Kumar. - intel_pstate driver fixes from Dirk Brandewie and Konrad Rzeszutek Wilk. - New ACPI suspend blacklist entry for Sony Vaio VGN-FW21M from Fabio Valentini. - ACPI Platform Error Interface (APEI) fix from Chen Gong. - PCI root bridge hotplug locking fix from Yinghai Lu. / -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.19 (GNU/Linux) iQIcBAABAgAGBQJRVETOAAoJEKhOf7ml8uNs30kP/3GsKWacHsaIPdhIiHQC3f91 HMLabrW7NE7ldrOoXzj1lTHsIc1TQHm722vyI+aF061HErfkF8Jkdi5rkIai8VMq IJXe4CtwuuCi0SeKQsV9ymiQanTrgsP/AlGV5x/KM/As8dvAVW/1+Ln/gXAnH0IJ /Onqf3eA4NBw/1Hjg7AGHGeCmOlDHvcetHF7eX4MaiYZHEwuy/a7jswH4aNOjwgx GZtbrnwUO6OtDKv6ie//1EbP753VrkHDtK3jzIy2lUA5YyLmr0XOTvy4uQh2n/r7 tVTqsVoNZNA4En0YUspfsWwBruUic3ra9qVTrJqn7Fzymyr+TgyCQQzSUGrOGy2a wY0vwMAwm1dMwAsZWPhnui6aqvu0bbg0u7sxCZQs8WapdtjxPdD7iIhRk2YU4wOZ omtejW0thUIwEmHWgBPo9rFvfZmxy9hb044UfhkLI9xBmuTVrDb/HqeVPA767ZoO k7IVg1DG4Ye6xboCIILfluoUAsc3DvkHpCIvWVujK3pF5j/M9ptt3d8eXDFIzmWD J6tm9ARkQoUPRAs6751cG1N0nP++ZlErYseU/h6eXoC0rkeC/WbGyxIumii4xJhg Gs6GGeM8OgQ/7Fat68kA2Z7jriY+MTteLbq1Sl3PBlfdURaceOXkTIVrxXo33Itq jQiEKa1CbJDi6OBKog8K =0bjZ -----END PGP SIGNATURE----- Merge tag 'pm+acpi-3.9-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull ACPI and power management fixes from Rafael J Wysocki: - Fix for a recent cpufreq regression related to acpi-cpufreq and suspend/resume from Viresh Kumar. - cpufreq stats reference counting fix from Viresh Kumar. - intel_pstate driver fixes from Dirk Brandewie and Konrad Rzeszutek Wilk. - New ACPI suspend blacklist entry for Sony Vaio VGN-FW21M from Fabio Valentini. - ACPI Platform Error Interface (APEI) fix from Chen Gong. - PCI root bridge hotplug locking fix from Yinghai Lu. * tag 'pm+acpi-3.9-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: PCI / ACPI: hold acpi_scan_lock during root bus hotplug ACPI / APEI: fix error status check condition for CPER ACPI / PM: fix suspend and resume on Sony Vaio VGN-FW21M cpufreq: acpi-cpufreq: Don't set policy->related_cpus from .init() cpufreq: stats: do cpufreq_cpu_put() corresponding to cpufreq_cpu_get() intel-pstate: Use #defines instead of hard-coded values. cpufreq / intel_pstate: Fix calculation of current frequency cpufreq / intel_pstate: Add function to check that all MSRs are valid	2013-03-28 13:47:31 -07:00
Linus Torvalds	33b65f1e9c	Bug-fixes: - Regression fixes for C-and-P states not being parsed properly. - Fix possible security issue with guests triggering DoS via non-assigned MSI-Xs. - Fix regression (introduced in v3.7) with raising an event (v2). - Fix hastily introduced band-aid during c0 for the CR3 blowup. -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.13 (GNU/Linux) iQEcBAABAgAGBQJRUxlVAAoJEFjIrFwIi8fJiUsH/2a3A8EVqS7OYDNgT0ZFb1VI rMLNiA50sRJNDsq0NbGl1Y+Lubus1czc0c7HXFQ557OakN6WqcmPPjCKp4JT6NnV Jz/IZ0iimdoHiPru1Qe4ah3fSgzUtht2LB48Z/a0Is4k3LsRP2W3/niVC3ypnyuJ 52HjjuxeFAfXIkNeqsrO2a6cUXZeXzUyR4g9GNxDozi4jHpoPQ4j9okZbo218xH+ /pRnFeMD7t7dFkgNeyeGXUiJn2AkNPHi3Hx+RH5nN9KXQ1eem9R4p7Qpez1dUEWF YEc/bs7MyOYezzTVHPYk77Yt8baOHJt7UbHjM6jfi1aGYYINTRr3m5mORd3rCmc= =61IX -----END PGP SIGNATURE----- Merge tag 'stable/for-linus-3.9-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen Pull Xen bug-fixes from Konrad Rzeszutek Wilk: "This is mostly just the last stragglers of the regression bugs that this merge window had. There are also two bug-fixes: one that adds an extra layer of security, and a regression fix for a change that was added in v3.7 (the v1 was faulty, the v2 works). - Regression fixes for C-and-P states not being parsed properly. - Fix possible security issue with guests triggering DoS via non-assigned MSI-Xs. - Fix regression (introduced in v3.7) with raising an event (v2). - Fix hastily introduced band-aid during c0 for the CR3 blowup." * tag 'stable/for-linus-3.9-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen: xen/events: avoid race with raising an event in unmask_evtchn() xen/mmu: Move the setting of pvops.write_cr3 to later phase in bootup. xen/acpi-stub: Disable it b/c the acpi_processor_add is no longer called. xen-pciback: notify hypervisor about devices intended to be assigned to guests xen/acpi-processor: Don't dereference struct acpi_processor on all CPUs.	2013-03-27 12:56:25 -07:00
David S. Miller	e2a553dbf1	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Conflicts: include/net/ipip.h The changes made to ipip.h in 'net' were already included in 'net-next' before that header was moved to another location. Signed-off-by: David S. Miller <davem@davemloft.net>	2013-03-27 13:52:49 -04:00
Konrad Rzeszutek Wilk	d3eb2c89e7	xen/mmu: Move the setting of pvops.write_cr3 to later phase in bootup. We move the setting of write_cr3 from the early bootup variant (see git commit `0cc9129d75` "x86-64, xen, mmu: Provide an early version of write_cr3.") to a more appropiate location. This new location sets all of the other non-early variants of pvops calls - and most importantly is before the alternative_asm mechanism kicks in. Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>	2013-03-27 12:06:03 -04:00
Stephane Eranian	3a54aaa0a3	perf/x86: Improve sysfs event mapping with event string This patch extends Jiri's changes to make generic events mapping visible via sysfs. The patch extends the mechanism to non-generic events by allowing the mappings to be hardcoded in strings. This mechanism will be used by the PEBS-LL patch later on. Signed-off-by: Stephane Eranian <eranian@google.com> Cc: peterz@infradead.org Cc: ak@linux.intel.com Cc: acme@redhat.com Cc: jolsa@redhat.com Cc: namhyung.kim@lge.com Link: http://lkml.kernel.org/r/1359040242-8269-3-git-send-email-eranian@google.com Signed-off-by: Ingo Molnar <mingo@kernel.org> [ fixed up conflict with `2663960` "perf: Make EVENT_ATTR global" ] Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>	2013-03-26 17:36:45 -03:00
Andi Kleen	1a6461b128	perf/x86: Support CPU specific sysfs events Add a way for the CPU initialization code to register additional events, and merge them into the events attribute directory. Used in the next patch. Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Stephane Eranian <eranian@google.com> Cc: peterz@infradead.org Cc: acme@redhat.com Cc: jolsa@redhat.com Cc: namhyung.kim@lge.com Link: http://lkml.kernel.org/r/1359040242-8269-2-git-send-email-eranian@google.com [ small cleanups ] Signed-off-by: Ingo Molnar <mingo@kernel.org> [ merge_attr returns a *, not just ] Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>	2013-03-26 16:50:23 -03:00
Konrad Rzeszutek Wilk	05e99c8cf9	intel-pstate: Use #defines instead of hard-coded values. They are defined in coreboot (MSR_PLATFORM) and the other one is already defined in msr-index.h. Let's use those. Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Acked-by: Viresh Kumar <viresh.kumar@linaro.org> Acked-by: Dirk Brandewie <dirk.j.brandewie@intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2013-03-25 15:13:07 +01:00
Linus Torvalds	33b73e9b3e	Merge branch 'x86/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Peter Anvin: "A collection of minor fixes, more EFI variables paranoia (anti-bricking) plus the ability to disable the pstore either as a runtime default or completely, due to bricking concerns." * 'x86/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: efivars: Fix check for CONFIG_EFI_VARS_PSTORE_DEFAULT_DISABLE x86, microcode_intel_early: Mark apply_microcode_early() as cpuinit efivars: Handle duplicate names from get_next_variable() efivars: explicitly calculate length of VariableName efivars: Add module parameter to disable use as a pstore backend efivars: Allow disabling use as a pstore backend x86-32, microcode_intel_early: Fix crash with CONFIG_DEBUG_VIRTUAL x86-64: Fix the failure case in copy_user_handle_tail()	2013-03-24 10:10:34 -07:00
Jan Beulich	909b3fdb0d	xen-pciback: notify hypervisor about devices intended to be assigned to guests For MSI-X capable devices the hypervisor wants to write protect the MSI-X table and PBA, yet it can't assume that resources have been assigned to their final values at device enumeration time. Thus have pciback do that notification, as having the device controlled by it is a prerequisite to assigning the device to guests anyway. This is the kernel part of hypervisor side commit 4245d33 ("x86/MSI: add mechanism to fully protect MSI-X table from PV guest accesses") on the master branch of git://xenbits.xen.org/xen.git. CC: stable@vger.kernel.org Signed-off-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>	2013-03-22 10:20:55 -04:00
Boris Ostrovsky	bafcdd3b6c	x86, MCE, AMD: Use MCG_CAP MSR to find out number of banks on AMD Currently number of error reporting register banks is hardcoded to 6 on AMD processors. This may break in virtualized scenarios when a hypervisor prefers to report fewer banks than what the physical HW provides. Since number of supported banks is reported in MSR_IA32_MCG_CAP[7:0] that's what we should use. Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Link: http://lkml.kernel.org/r/1363295441-1859-3-git-send-email-boris.ostrovsky@oracle.com [ reverse NULL ptr test logic ] Signed-off-by: Borislav Petkov <bp@suse.de>	2013-03-22 11:25:01 +01:00
Boris Ostrovsky	c76e81643c	x86, MCE, AMD: Replace shared_bank array with is_shared_bank() helper Use helper function instead of an array to report whether register bank is shared. Currently only bank 4 (northbridge) is shared. Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Link: http://lkml.kernel.org/r/1363295441-1859-2-git-send-email-boris.ostrovsky@oracle.com Signed-off-by: Borislav Petkov <bp@suse.de>	2013-03-22 11:25:01 +01:00
H. Peter Anvin	f564c24103	x86, microcode_intel_early: Mark apply_microcode_early() as cpuinit Add missing __cpuinit annotation to apply_microcode_early(). Reported-by: Shaun Ruffell <sruffell@digium.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Link: http://lkml.kernel.org/r/20130320170310.GA23362@digium.com Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>	2013-03-21 17:32:36 -07:00
Takuya Yoshikawa	81f4f76bbc	KVM: MMU: Rename kvm_mmu_free_some_pages() to make_mmu_pages_available() The current name "kvm_mmu_free_some_pages" should be used for something that actually frees some shadow pages, as we expect from the name, but what the function is doing is to make some, KVM_MIN_FREE_MMU_PAGES, shadow pages available: it does nothing when there are enough. This patch changes the name to reflect this meaning better; while doing this renaming, the code in the wrapper function is inlined into the main body since the whole function will be inlined into the only caller now. Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2013-03-21 19:45:01 -03:00
Takuya Yoshikawa	7ddca7e43c	KVM: MMU: Move kvm_mmu_free_some_pages() into kvm_mmu_alloc_page() What this function is doing is to ensure that the number of shadow pages does not exceed the maximum limit stored in n_max_mmu_pages: so this is placed at every code path that can reach kvm_mmu_alloc_page(). Although it might have some sense to spread this function in each such code path when it could be called before taking mmu_lock, the rule was changed not to do so. Taking this background into account, this patch moves it into kvm_mmu_alloc_page() and simplifies the code. Note: the unlikely hint in kvm_mmu_free_some_pages() guarantees that the overhead of this function is almost zero except when we actually need to allocate some shadow pages, so we do not need to care about calling it multiple times in one path by doing kvm_mmu_get_page() a few times. Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2013-03-21 19:44:56 -03:00
Daniel Borkmann	79617801ea	filter: bpf_jit_comp: refactor and unify BPF JIT image dump output If bpf_jit_enable > 1, then we dump the emitted JIT compiled image after creation. Currently, only SPARC and PowerPC has similar output as in the reference implementation on x86_64. Make a small helper function in order to reduce duplicated code and make the dump output uniform across architectures x86_64, SPARC, PPC, ARM (e.g. on ARM flen, pass and proglen are currently not shown, but would be interesting to know as well), also for future BPF JIT implementations on other archs. Cc: Mircea Gherzan <mgherzan@gmail.com> Cc: Matt Evans <matt@ozlabs.org> Cc: Eric Dumazet <eric.dumazet@google.com> Cc: David S. Miller <davem@davemloft.net> Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-03-21 17:25:56 -04:00
Linus Torvalds	cd82346934	Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf fixes from Ingo Molnar: "A fair chunk of the linecount comes from a fix for a tracing bug that corrupts latency tracing buffers when the overwrite mode is changed on the fly - the rest is mostly assorted fewliner fixlets." * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf/x86: Add SNB/SNB-EP scheduling constraints for cycle_activity event kprobes/x86: Check Interrupt Flag modifier when registering probe kprobes: Make hash_64() as always inlined perf: Generate EXIT event only once per task context perf: Reset hwc->last_period on sw clock events tracing: Prevent buffer overwrite disabled for latency tracers tracing: Keep overwrite in sync between regular and snapshot buffers tracing: Protect tracer flags with trace_types_lock perf tools: Fix LIBNUMA build with glibc 2.12 and older. tracing: Fix free of probe entry by calling call_rcu_sched() perf/POWER7: Create a sysfs format entry for Power7 events perf probe: Fix segfault libtraceevent: Remove hard coded include to /usr/local/include in Makefile perf record: Fix -C option perf tools: check if -DFORTIFY_SOURCE=2 is allowed perf report: Fix build with NO_NEWT=1 perf annotate: Fix build with NO_NEWT=1 tracing: Fix race in snapshot swapping	2013-03-21 08:29:11 -07:00
Marcelo Tosatti	2ae33b3896	Merge remote-tracking branch 'upstream/master' into queue Merge reason: From: Alexander Graf <agraf@suse.de> "Just recently this really important patch got pulled into Linus' tree for 3.9: commit `1674400aae` Author: Anton Blanchard <anton <at> samba.org> Date: Tue Mar 12 01:51:51 2013 +0000 Without that commit, I can not boot my G5, thus I can't run automated tests on it against my queue. Could you please merge kvm/next against linus/master, so that I can base my trees against that?" * upstream/master: (653 commits) PCI: Use ROM images from firmware only if no other ROM source available sparc: remove unused "config BITS" sparc: delete "if !ULTRA_HAS_POPULATION_COUNT" KVM: Fix bounds checking in ioapic indirect register reads (CVE-2013-1798) KVM: x86: Convert MSR_KVM_SYSTEM_TIME to use gfn_to_hva_cache functions (CVE-2013-1797) KVM: x86: fix for buffer overflow in handling of MSR_KVM_SYSTEM_TIME (CVE-2013-1796) arm64: Kconfig.debug: Remove unused CONFIG_DEBUG_ERRORS arm64: Do not select GENERIC_HARDIRQS_NO_DEPRECATED inet: limit length of fragment queue hash table bucket lists qeth: Fix scatter-gather regression qeth: Fix invalid router settings handling qeth: delay feature trace sgy-cts1000: Remove __dev* attributes KVM: x86: fix deadlock in clock-in-progress request handling KVM: allow host header to be included even for !CONFIG_KVM hwmon: (lm75) Fix tcn75 prefix hwmon: (lm75.h) Update header inclusion MAINTAINERS: Remove Mark M. Hoffman xfs: ensure we capture IO errors correctly xfs: fix xfs_iomap_eof_prealloc_initial_size type ... Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2013-03-21 11:11:52 -03:00
Stephane Eranian	0e48026ae7	perf/x86: Fix uninitialized pt_regs in intel_pmu_drain_bts_buffer() This patch fixes an uninitialized pt_regs struct in drain BTS function. The pt_regs struct is propagated all the way to the code_get_segment() function from perf_instruction_pointer() and may get garbage. We cannot simply inherit the actual pt_regs from the interrupt because BTS must be flushed on context-switch or when the associated event is disabled. And there we do not have a pt_regs handy. Setting pt_regs to all zeroes may not be the best option but it is not clear what else to do given where the drain_bts_buffer() is called from. In V2, we move the memset() later in the code to avoid doing it when we end up returning early without doing the actual BTS processing. Also dropped the reg.val initialization because it is redundant with the memset() as suggested by PeterZ. Signed-off-by: Stephane Eranian <eranian@google.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: peterz@infradead.org Cc: sqazi@google.com Cc: ak@linux.intel.com Cc: jolsa@redhat.com Link: http://lkml.kernel.org/r/20130319151038.GA25439@quad Signed-off-by: Ingo Molnar <mingo@kernel.org>	2013-03-21 12:03:29 +01:00
Paolo Bonzini	04b66839d3	KVM: x86: correctly initialize the CS base on reset The CS base was initialized to 0 on VMX (wrong, but usually overridden by userspace before starting) or 0xf0000 on SVM. The correct value is 0xffff0000, and VMX is able to emulate it now, so use it. Reviewed-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2013-03-20 17:34:55 -03:00
Fenghua Yu	c83a9d5e42	x86-32, microcode_intel_early: Fix crash with CONFIG_DEBUG_VIRTUAL In 32-bit, __pa_symbol() in CONFIG_DEBUG_VIRTUAL accesses kernel data (e.g. max_low_pfn) that not only hasn't been setup yet in such early boot phase, but since we are in linear mode, cannot even be detected as uninitialized. Thus, use __pa_nodebug() rather than __pa_symbol() to get a global symbol's physical address. Signed-off-by: Fenghua Yu <fenghua.yu@intel.com> Link: http://lkml.kernel.org/r/1363705484-27645-1-git-send-email-fenghua.yu@intel.com Reported-and-tested-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: H. Peter Anvin <hpa@zytor.com>	2013-03-19 19:51:08 -07:00
Linus Torvalds	ea4a0ce111	Merge git://git.kernel.org/pub/scm/virt/kvm/kvm Pull kvm fixes from Marcelo Tosatti. * git://git.kernel.org/pub/scm/virt/kvm/kvm: KVM: Fix bounds checking in ioapic indirect register reads (CVE-2013-1798) KVM: x86: Convert MSR_KVM_SYSTEM_TIME to use gfn_to_hva_cache functions (CVE-2013-1797) KVM: x86: fix for buffer overflow in handling of MSR_KVM_SYSTEM_TIME (CVE-2013-1796) KVM: x86: fix deadlock in clock-in-progress request handling KVM: allow host header to be included even for !CONFIG_KVM	2013-03-19 18:24:12 -07:00
Andy Honig	0b79459b48	KVM: x86: Convert MSR_KVM_SYSTEM_TIME to use gfn_to_hva_cache functions (CVE-2013-1797) There is a potential use after free issue with the handling of MSR_KVM_SYSTEM_TIME. If the guest specifies a GPA in a movable or removable memory such as frame buffers then KVM might continue to write to that address even after it's removed via KVM_SET_USER_MEMORY_REGION. KVM pins the page in memory so it's unlikely to cause an issue, but if the user space component re-purposes the memory previously used for the guest, then the guest will be able to corrupt that memory. Tested: Tested against kvmclock unit test Signed-off-by: Andrew Honig <ahonig@google.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2013-03-19 14:17:35 -03:00
Andy Honig	c300aa64dd	KVM: x86: fix for buffer overflow in handling of MSR_KVM_SYSTEM_TIME (CVE-2013-1796) If the guest sets the GPA of the time_page so that the request to update the time straddles a page then KVM will write onto an incorrect page. The write is done byusing kmap atomic to get a pointer to the page for the time structure and then performing a memcpy to that page starting at an offset that the guest controls. Well behaved guests always provide a 32-byte aligned address, however a malicious guest could use this to corrupt host kernel memory. Tested: Tested against kvmclock unit test. Signed-off-by: Andrew Honig <ahonig@google.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2013-03-19 14:17:31 -03:00
Marcelo Tosatti	c09664bb44	KVM: x86: fix deadlock in clock-in-progress request handling There is a deadlock in pvclock handling: cpu0: cpu1: kvm_gen_update_masterclock() kvm_guest_time_update() spin_lock(pvclock_gtod_sync_lock) local_irq_save(flags) spin_lock(pvclock_gtod_sync_lock) kvm_make_mclock_inprogress_request(kvm) make_all_cpus_request() smp_call_function_many() Now if smp_call_function_many() called by cpu0 tries to call function on cpu1 there will be a deadlock. Fix by moving pvclock_gtod_sync_lock protected section outside irq disabled section. Analyzed by Gleb Natapov <gleb@redhat.com> Acked-by: Gleb Natapov <gleb@redhat.com> Reported-and-Tested-by: Yongjie Ren <yongjie.ren@intel.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2013-03-18 18:03:39 -03:00
CQ Tang	66db3feb48	x86-64: Fix the failure case in copy_user_handle_tail() The increment of "to" in copy_user_handle_tail() will have incremented before a failure has been noted. This causes us to skip a byte in the failure case. Only do the increment when assured there is no failure. Signed-off-by: CQ Tang <cq.tang@intel.com> Link: http://lkml.kernel.org/r/20130318150221.8439.993.stgit@phlsvslse11.ph.intel.com Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Cc: <stable@vger.kernel.org>	2013-03-18 11:32:03 -07:00
Jan Kiszka	4918c6ca68	KVM: VMX: Require KVM_SET_TSS_ADDR being called prior to running a VCPU Very old user space (namely qemu-kvm before kvm-49) didn't set the TSS base before running the VCPU. We always warned about this bug, but no reports about users actually seeing this are known. Time to finally remove the workaround that effectively prevented to call vmx_vcpu_reset while already holding the KVM srcu lock. Reviewed-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2013-03-18 13:48:15 -03:00
Stephane Eranian	fd4a5aef00	perf/x86: Add SNB/SNB-EP scheduling constraints for cycle_activity event Add scheduling constraints for SNB/SNB-EP CYCLE_ACTIVITY event as defined by SDM Jan 2013 edition. The STALLS umasks are combinations with the NO_DISPATCH umask. Signed-off-by: Stephane Eranian <eranian@gmail.com> Cc: peterz@infradead.org Cc: ak@linux.intel.com Cc: jolsa@redhat.com Link: http://lkml.kernel.org/r/20130317134957.GA8550@quad Signed-off-by: Ingo Molnar <mingo@kernel.org>	2013-03-18 10:23:13 +01:00
Masami Hiramatsu	9a556ab998	kprobes/x86: Check Interrupt Flag modifier when registering probe Currently kprobes check whether the copied instruction modifies IF (interrupt flag) on each probe hit. This results not only in introducing overhead but also involving inat_get_opcode_attribute into the kprobes hot path, and it can cause an infinite recursive call (and kernel panic in the end). Actually, since the copied instruction itself can never be modified on the buffer, it is needless to analyze the instruction on every probe hit. To fix this issue, we check it only once when registering probe and store the result on ainsn->if_modifier. Reported-by: Timo Juhani Lindfors <timo.lindfors@iki.fi> Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Acked-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: yrl.pp-manager.tt@hitachi.com Cc: Steven Rostedt <rostedt@goodmis.org> Cc: David S. Miller <davem@davemloft.net> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/20130314115242.19690.33573.stgit@mhiramat-M0-7522 Signed-off-by: Ingo Molnar <mingo@kernel.org>	2013-03-18 10:21:23 +01:00
Linus Torvalds	2a6e06b2ae	perf,x86: fix wrmsr_on_cpu() warning on suspend/resume Commit `1d9d8639c0` ("perf,x86: fix kernel crash with PEBS/BTS after suspend/resume") fixed a crash when doing PEBS performance profiling after resuming, but in using init_debug_store_on_cpu() to restore the DS_AREA mtrr it also resulted in a new WARN_ON() triggering. init_debug_store_on_cpu() uses "wrmsr_on_cpu()", which in turn uses CPU cross-calls to do the MSR update. Which is not really valid at the early resume stage, and the warning is quite reasonable. Now, it all happens to _work_, for the simple reason that smp_call_function_single() ends up just doing the call directly on the CPU when the CPU number matches, but we really should just do the wrmsr() directly instead. This duplicates the wrmsr() logic, but hopefully we can just remove the wrmsr_on_cpu() version eventually. Reported-and-tested-by: Parag Warudkar <parag.lkml@gmail.com> Cc: stable@vger.kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-03-17 15:44:43 -07:00
Feng Tang	82f9c080b2	x86: tsc: Add support for new S3_NONSTOP feature Add support for new S3_NONSTOP feature Signed-off-by: Feng Tang <feng.tang@intel.com> Signed-off-by: John Stultz <john.stultz@linaro.org>	2013-03-15 16:51:18 -07:00
Feng Tang	c54fdbb282	x86: Add cpu capability flag X86_FEATURE_NONSTOP_TSC_S3 On some new Intel Atom processors (Penwell and Cloverview), there is a feature that the TSC won't stop in S3 state, say the TSC value won't be reset to 0 after resume. This feature makes TSC a more reliable clocksource and could benefit the timekeeping code during system suspend/resume cycle, so add a flag for it. Signed-off-by: Feng Tang <feng.tang@intel.com> [jstultz: Fix checkpatch warning] Signed-off-by: John Stultz <john.stultz@linaro.org>	2013-03-15 16:50:26 -07:00
Prarit Bhargava	3195ef59cb	x86: Do full rtc synchronization with ntp Every 11 minutes ntp attempts to update the x86 rtc with the current system time. Currently, the x86 code only updates the rtc if the system time is within +/-15 minutes of the current value of the rtc. This was done originally to avoid setting the RTC if the RTC was in localtime mode (common with Windows dualbooting). Other architectures do a full synchronization and now that we have better infrastructure to detect when the RTC is in localtime, there is no reason that x86 should be software limited to a 30 minute window. This patch changes the behavior of the kernel to do a full synchronization (year, month, day, hour, minute, and second) of the rtc when ntp requests a synchronization between the system time and the rtc. I've used the RTC library functions in this patchset as they do all the required bounds checking. Cc: Thomas Gleixner <tglx@linutronix.de> Cc: John Stultz <john.stultz@linaro.org> Cc: x86@kernel.org Cc: Matt Fleming <matt.fleming@intel.com> Cc: David Vrabel <david.vrabel@citrix.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andi Kleen <ak@linux.intel.com> Cc: linux-efi@vger.kernel.org Signed-off-by: Prarit Bhargava <prarit@redhat.com> [jstultz: Tweak commit message, fold in build fix found by fengguang Also add select RTC_LIB to X86, per new dependency, as found by prarit] Signed-off-by: John Stultz <john.stultz@linaro.org>	2013-03-15 16:50:26 -07:00
Stephane Eranian	1d9d8639c0	perf,x86: fix kernel crash with PEBS/BTS after suspend/resume This patch fixes a kernel crash when using precise sampling (PEBS) after a suspend/resume. Turns out the CPU notifier code is not invoked on CPU0 (BP). Therefore, the DS_AREA (used by PEBS) is not restored properly by the kernel and keeps it power-on/resume value of 0 causing any PEBS measurement to crash when running on CPU0. The workaround is to add a hook in the actual resume code to restore the DS Area MSR value. It is invoked for all CPUS. So for all but CPU0, the DS_AREA will be restored twice but this is harmless. Reported-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Stephane Eranian <eranian@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-03-15 09:26:35 -07:00
Takuya Yoshikawa	982b3394dd	KVM: x86: Optimize mmio spte zapping when creating/moving memslot When we create or move a memory slot, we need to zap mmio sptes. Currently, zap_all() is used for this and this is causing two problems: - extra page faults after zapping mmu pages - long mmu_lock hold time during zapping mmu pages For the latter, Marcelo reported a disastrous mmu_lock hold time during hot-plug, which made the guest unresponsive for a long time. This patch takes a simple way to fix these problems: do not zap mmu pages unless they are marked mmio cached. On our test box, this took only 50us for the 4GB guest and we did not see ms of mmu_lock hold time any more. Note that we still need to do zap_all() for other cases. So another work is also needed: Xiao's work may be the one. Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp> Signed-off-by: Gleb Natapov <gleb@redhat.com>	2013-03-14 10:21:21 +02:00
Takuya Yoshikawa	95b0430d1a	KVM: MMU: Mark sp mmio cached when creating mmio spte This will be used not to zap unrelated mmu pages when creating/moving a memory slot later. Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp> Signed-off-by: Gleb Natapov <gleb@redhat.com>	2013-03-14 10:21:10 +02:00
Jan Kiszka	0238ea913c	KVM: nVMX: Add preemption timer support Provided the host has this feature, it's straightforward to offer it to the guest as well. We just need to load to timer value on L2 entry if the feature was enabled by L1 and watch out for the corresponding exit reason. Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Gleb Natapov <gleb@redhat.com>	2013-03-14 10:01:21 +02:00
Jan Kiszka	c18911a23c	KVM: nVMX: Provide EFER.LMA saving support We will need EFER.LMA saving to provide unrestricted guest mode. All what is missing for this is picking up EFER.LMA from VM_ENTRY_CONTROLS on L2->L1 switches. If the host does not support EFER.LMA saving, no change is performed, otherwise we properly emulate for L1 what the hardware does for L0. Advertise the support, depending on the host feature. Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Gleb Natapov <gleb@redhat.com>	2013-03-14 10:00:55 +02:00
Jan Kiszka	eabeaaccfc	KVM: nVMX: Clean up and fix pin-based execution controls Only interrupt and NMI exiting are mandatory for KVM to work, thus can be exposed to the guest unconditionally, virtual NMI exiting is optional. So we must not advertise it unless the host supports it. Introduce the symbolic constant PIN_BASED_ALWAYSON_WITHOUT_TRUE_MSR at this chance. Reviewed-by:: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Gleb Natapov <gleb@redhat.com>	2013-03-13 16:14:40 +02:00
Jan Kiszka	66450a21f9	KVM: x86: Rework INIT and SIPI handling A VCPU sending INIT or SIPI to some other VCPU races for setting the remote VCPU's mp_state. When we were unlucky, KVM_MP_STATE_INIT_RECEIVED was overwritten by kvm_emulate_halt and, thus, got lost. This introduces APIC events for those two signals, keeping them in kvm_apic until kvm_apic_accept_events is run over the target vcpu context. kvm_apic_has_events reports to kvm_arch_vcpu_runnable if there are pending events, thus if vcpu blocking should end. The patch comes with the side effect of effectively obsoleting KVM_MP_STATE_SIPI_RECEIVED. We still accept it from user space, but immediately translate it to KVM_MP_STATE_INIT_RECEIVED + KVM_APIC_SIPI. The vcpu itself will no longer enter the KVM_MP_STATE_SIPI_RECEIVED state. That also means we no longer exit to user space after receiving a SIPI event. Furthermore, we already reset the VCPU on INIT, only fixing up the code segment later on when SIPI arrives. Moreover, we fix INIT handling for the BSP: it never enter wait-for-SIPI but directly starts over on INIT. Tested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Gleb Natapov <gleb@redhat.com>	2013-03-13 16:08:10 +02:00
Thomas Gleixner	35b61edb41	x86: Use tick broadcast expired check Avoid going back into deep idle if the tick broadcast IPI is about to fire. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: John Stultz <john.stultz@linaro.org> Cc: Arjan van de Veen <arjan@infradead.org> Cc: x86@kernel.org Link: http://lkml.kernel.org/r/20130306111537.702278273@linutronix.de	2013-03-13 11:39:40 +01:00
Marcelo Tosatti	5d21881432	KVM: MMU: make kvm_mmu_available_pages robust against n_used_mmu_pages > n_max_mmu_pages As noticed by Ulrich Obergfell <uobergfe@redhat.com>, the mmu counters are for beancounting purposes only - so n_used_mmu_pages and n_max_mmu_pages could be relaxed (example: before `f0f5933a16`), resulting in n_used_mmu_pages > n_max_mmu_pages. Make code robust against n_used_mmu_pages > n_max_mmu_pages. Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Gleb Natapov <gleb@redhat.com>	2013-03-13 11:46:09 +02:00
Stephen Rothwell	4febd95a8a	Select VIRT_TO_BUS directly where needed In commit `887cbce0ad` ("arch Kconfig: centralise ARCH_NO_VIRT_TO_BUS") I introduced the config sybmol HAVE_VIRT_TO_BUS and selected that where needed. I am not sure what I was thinking. Instead, just directly select VIRT_TO_BUS where it is needed. Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-03-12 11:16:40 -07:00
Jan Kiszka	57f252f229	KVM: x86: Drop unused return code from VCPU reset callback Neither vmx nor svm nor the common part may generate an error on kvm_vcpu_reset. So drop the return code. Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Gleb Natapov <gleb@redhat.com>	2013-03-12 13:25:56 +02:00
Marcelo Tosatti	03ba32cae6	VMX: x86: handle host TSC calibration failure If the host TSC calibration fails, tsc_khz is zero (see tsc_init.c). Handle such case properly in KVM (instead of dividing by zero). https://bugzilla.redhat.com/show_bug.cgi?id=859282 Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Gleb Natapov <gleb@redhat.com>	2013-03-12 13:18:00 +02:00
Zhang Yanfei	ad0304cfd9	x86/platform/intel/mrst: Remove cast for kmalloc() return value Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Cc: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/513EB5DA.2010300@cn.fujitsu.com Signed-off-by: Ingo Molnar <mingo@kernel.org>	2013-03-12 09:10:20 +01:00
Jan Beulich	c391c78846	x86: Constify a few items This in particular re-does the compiler warning fix `9faec5b` ("perf/x86: Fix P6 driver section warning"), tightening the section attributes rather than relaxing them. Signed-off-by: Jan Beulich <jbeulich@suse.com> Cc: Shaun Ruffell <sruffell@digium.com> Cc: yangyongqiang <yangyongqiang01@baidu.com> Cc: Cyrill Gorcunov <gorcunov@gmail.com> Cc: Vince Weaver <vincent.weaver@maine.edu> Link: http://lkml.kernel.org/r/513DB84502000078000C4880@nat28.tlf.novell.com Signed-off-by: Ingo Molnar <mingo@kernel.org>	2013-03-11 15:11:03 +01:00
Jan Beulich	ec7fd34425	x86: Drop always empty .text..page_aligned section Commit `e44b7b7` ("x86: move suspend wakeup code to C") didn't care to also eliminate the side effects that the earlier `4c49156` ("x86: make arch/x86/kernel/acpi/wakeup_32.S use a separate") had, thus leaving a now pointless, almost page size gap at the beginning of .text. Signed-off-by: Jan Beulich <jbeulich@suse.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Pavel Machek <pavel@ucw.cz> Link: http://lkml.kernel.org/r/513DBAA402000078000C4896@nat28.tlf.novell.com Signed-off-by: Ingo Molnar <mingo@kernel.org>	2013-03-11 15:07:56 +01:00
Ioan Orghici	0fa24ce3f5	kvm: remove cast for kmalloc return value Signed-off-by: Ioan Orghici<ioan.orghici@gmail.com> Signed-off-by: Gleb Natapov <gleb@redhat.com>	2013-03-11 12:03:54 +02:00
Alexandru Gheorghiu	e87b686b51	x86/platform/uv: Replace kmalloc() & memset with kzalloc() This was found using coccicheck. Signed-off-by: Alexandru Gheorghiu <gheorghiuandru@gmail.com> Link: http://lkml.kernel.org/r/1362822043-15559-1-git-send-email-gheorghiuandru@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>	2013-03-11 08:33:01 +01:00
Tim Chen	918731fa28	crypto: crc32c - Update the links to the white papers on CRC32C calculations with PCLMULQDQ instructions. Herbert, The following patch update the stale link to the CRC32C white paper that was referenced. Tim Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2013-03-10 16:46:43 +08:00
Dave Hansen	60f583d56a	x86: Do not try to sync identity map for non-mapped pages kernel_map_sync_memtype() is called from a variety of contexts. The pat.c code that calls it seems to ensure that it is not called for non-ram areas by checking via pat_pagerange_is_ram(). It is important that it only be called on the actual identity map because there IS no map to sync for highmem pages, or for memory holes. The ioremap.c uses are not as careful as those from pat.c, and call kernel_map_sync_memtype() on PCI space which is in the middle of the kernel identity map _range_, but is not actually mapped. This patch adds a check to kernel_map_sync_memtype() which probably duplicates some of the checks already in pat.c. But, it is necessary for the ioremap.c uses and shouldn't hurt other callers. I have reproduced this bug and this patch fixes it for me and the original bug reporter: https://lkml.org/lkml/2013/2/5/396 Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com> Link: http://lkml.kernel.org/r/20130307163151.D9B58C4E@kernel.stglabs.ibm.com Signed-off-by: Dave Hansen <dave@sr71.net> Tested-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Signed-off-by: H. Peter Anvin <hpa@zytor.com>	2013-03-07 13:23:28 -08:00
Takuya Yoshikawa	5da596078f	KVM: MMU: Introduce a helper function for FIFO zapping Make the code for zapping the oldest mmu page, placed at the tail of the active list, a separate function. Reviewed-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2013-03-07 17:26:27 -03:00
Takuya Yoshikawa	945315b9db	KVM: MMU: Use list_for_each_entry_safe in kvm_mmu_commit_zap_page() We are traversing the linked list, invalid_list, deleting each entry by kvm_mmu_free_page(). _safe version is there for such a case. Reviewed-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2013-03-07 17:26:27 -03:00
Takuya Yoshikawa	1044b03034	KVM: MMU: Fix and clean up for_each_gfn_* macros The expression (sp)->gfn should not be expanded using @gfn. Although no user of these macros passes a string other than gfn now, this should be fixed before anyone sees strange errors. Note: ignored the following checkpatch errors: ERROR: Macros with complex values should be enclosed in parenthesis ERROR: trailing statements should be on next line Reviewed-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2013-03-07 17:26:27 -03:00
Jan Kiszka	1a0d74e664	KVM: nVMX: Fix setting of CR0 and CR4 in guest mode The logic for calculating the value with which we call kvm_set_cr0/4 was broken (will definitely be visible with nested unrestricted guest mode support). Also, we performed the check regarding CR0_ALWAYSON too early when in guest mode. What really needs to be done on both CR0 and CR4 is to mask out L1-owned bits and merge them in from L1's guest_cr0/4. In contrast, arch.cr0/4 and arch.cr0/4_guest_owned_bits contain the mangled L0+L1 state and, thus, are not suited as input. For both CRs, we can then apply the check against VMXON_CRx_ALWAYSON and refuse the update if it fails. To be fully consistent, we implement this check now also for CR4. For CR4, we move the check into vmx_set_cr4 while we keep it in handle_set_cr0. This is because the CR0 checks for vmxon vs. guest mode will diverge soon when adding unrestricted guest mode support. Finally, we have to set the shadow to the value L2 wanted to write originally. Reviewed-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2013-03-07 15:48:47 -03:00
Jan Kiszka	33fb20c39e	KVM: nVMX: Fix content of MSR_IA32_VMX_ENTRY/EXIT_CTLS Properly set those bits to 1 that the spec demands in case bit 55 of VMX_BASIC is 0 - like in our case. Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2013-03-07 15:47:11 -03:00
Frederic Weisbecker	6c1e0256fa	context_tracking: Restore correct previous context state on exception exit On exception exit, we restore the previous context tracking state based on the regs of the interrupted frame. Iff that frame is in user mode as stated by user_mode() helper, we restore the context tracking user mode. However there is a tiny chunck of low level arch code after we pass through user_enter() and until the CPU eventually resumes userspace. If an exception happens in this tiny area, exception_enter() correctly exits the context tracking user mode but exception_exit() won't restore it because of the value returned by user_mode(regs). As a result we may return to userspace with the wrong context tracking state. To fix this, change exception_enter() to return the context tracking state prior to its call and pass this saved state to exception_exit(). This restores the real context tracking state of the interrupted frame. (May be this patch was suggested to me, I don't recall exactly. If so, sorry for the missing credit). Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Li Zhong <zhong@linux.vnet.ibm.com> Cc: Kevin Hilman <khilman@linaro.org> Cc: Mats Liljegren <mats.liljegren@enea.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Namhyung Kim <namhyung.kim@lge.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2013-03-07 17:10:11 +01:00
Frederic Weisbecker	56dd9470d7	context_tracking: Move exception handling to generic code Exceptions handling on context tracking should share common treatment: on entry we exit user mode if the exception triggered in that context. Then on exception exit we return to that previous context. Generalize this to avoid duplication across archs. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Li Zhong <zhong@linux.vnet.ibm.com> Cc: Kevin Hilman <khilman@linaro.org> Cc: Mats Liljegren <mats.liljegren@enea.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Namhyung Kim <namhyung.kim@lge.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2013-03-07 17:09:25 +01:00
Peter Jones	3c4aff6b9a	x86, doc: Be explicit about what the x86 struct boot_params requires If the sentinel triggers, we do not want the boot loader authors to just poke it and make the error go away, we want them to actually fix the problem. This should help avoid making the incorrect change in non-compliant bootloaders. [ hpa: dropped the Documentation/x86/boot.txt hunk pending clarifications ] Signed-off-by: Peter Jones <pjones@redhat.com> Link: http://lkml.kernel.org/r/1362592823-28967-1-git-send-email-pjones@redhat.com Signed-off-by: H. Peter Anvin <hpa@zytor.com>	2013-03-06 20:34:43 -08:00
Josh Boyer	2e604c0f19	x86: Don't clear efi_info even if the sentinel hits When boot_params->sentinel is set, all we really know is that some undefined set of fields in struct boot_params contain garbage. In the particular case of efi_info, however, there is a private magic for that substructure, so it is generally safe to leave it even if the bootloader is broken. kexec (for which we did the initial analysis) did not initialize this field, but of course all the EFI bootloaders do, and most EFI bootloaders are broken in this respect (and should be fixed.) Reported-by: Robin Holt <holt@sgi.com> Link: http://lkml.kernel.org/r/CA%2B5PVA51-FT14p4CRYKbicykugVb=PiaEycdQ57CK2km_OQuRQ@mail.gmail.com Tested-by: Josh Boyer <jwboyer@gmail.com> Signed-off-by: H. Peter Anvin <hpa@zytor.com>	2013-03-06 20:23:30 -08:00
Yinghai Lu	98e7a98997	x86, mm: Make sure to find a 2M free block for the first mapped area Henrik reported that his MacAir 3.1 would not boot with \| commit `8d57470d8f` \| Date: Fri Nov 16 19:38:58 2012 -0800 \| \| x86, mm: setup page table in top-down It turns out that we do not calculate the real_end properly: We try to get 2M size with 4K alignment, and later will round down to 2M, so we will get less then 2M for first mapping, in extreme case could be only 4K only. In Henrik's system it has (1M-32K) as last usable rage is [mem 0x7f9db000-0x7fef8fff]. The problem is exposed when EFI booting have several holes and it will force mapping to use PTE instead as we only map usable areas. To fix it, just make it be 2M aligned, so we can be guaranteed to be able to use large pages to map it. Reported-by: Henrik Rydberg <rydberg@euromail.se> Bisected-by: Henrik Rydberg <rydberg@euromail.se> Tested-by: Henrik Rydberg <rydberg@euromail.se> Signed-off-by: Yinghai Lu <yinghai@kernel.org> Link: http://lkml.kernel.org/r/CAE9FiQX4nQ7_1kg5RL_vh56rmcSHXUi1ExrZX7CwED4NGMnHfg@mail.gmail.com Signed-off-by: H. Peter Anvin <hpa@zytor.com>	2013-03-06 20:18:32 -08:00
Krzysztof Mazur	015221fefb	x86: Fix 32-bit *_cpu_data initializers The commit `27be457000` ('x86 idle: remove 32-bit-only "no-hlt" parameter, hlt_works_ok flag') removed the hlt_works_ok flag from struct cpuinfo_x86, but boot_cpu_data and new_cpu_data initializers were not changed causing setting f00f_bug flag, instead of fdiv_bug. If CONFIG_X86_F00F_BUG is not set the f00f_bug flag is never cleared. To avoid such problems in future C99-style initialization is now used. Signed-off-by: Krzysztof Mazur <krzysiek@podlesie.net> Acked-by: Borislav Petkov <bp@suse.de> Cc: len.brown@intel.com Link: http://lkml.kernel.org/r/1362266082-2227-1-git-send-email-krzysiek@podlesie.net Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: H. Peter Anvin <hpa@zytor.com>	2013-03-06 20:15:50 -08:00
Jan Kiszka	c4627c72e9	KVM: nVMX: Reset RFLAGS on VM-exit Ouch, how could this work so well that far? We need to clear RFLAGS to the reset value as specified by the SDM. Particularly, IF must be off after VM-exit! Reviewed-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2013-03-05 20:49:17 -03:00
Borislav Petkov	576cfb404c	x86, smpboot: Remove unused variable The cpuinfo_x86 ptr is unused now. Drop it. Got obsolete by `69fb3676df` ("x86 idle: remove mwait_idle() and "idle=mwait" cmdline param") removing its only user. [ hpa: fixes gcc warning ] Signed-off-by: Borislav Petkov <bp@suse.de> Link: http://lkml.kernel.org/r/1362428180-8865-2-git-send-email-bp@alien8.de Cc: Len Brown <len.brown@intel.com> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>	2013-03-05 15:26:45 -08:00
Jan Kiszka	503cd0c50a	KVM: nVMX: Fix switching of debug state First of all, do not blindly overwrite GUEST_DR7 on L2 entry. The host may have guest debugging enabled. Then properly reset DR7 and DEBUG_CTL on L2->L1 switch as specified in the SDM. Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2013-03-04 21:37:28 -03:00
Takuya Yoshikawa	8482644aea	KVM: set_memory_region: Refactor commit_memory_region() This patch makes the parameter old a const pointer to the old memory slot and adds a new parameter named change to know the change being requested: the former is for removing extra copying and the latter is for cleaning up the code. Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2013-03-04 20:21:08 -03:00
Takuya Yoshikawa	7b6195a91d	KVM: set_memory_region: Refactor prepare_memory_region() This patch drops the parameter old, a copy of the old memory slot, and adds a new parameter named change to know the change being requested. This not only cleans up the code but also removes extra copying of the memory slot structure. Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2013-03-04 20:21:08 -03:00
Takuya Yoshikawa	47ae31e257	KVM: set_memory_region: Drop user_alloc from set_memory_region() Except ia64's stale code, KVM_SET_MEMORY_REGION support, this is only used for sanity checks in __kvm_set_memory_region() which can easily be changed to use slot id instead. Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2013-03-04 20:21:08 -03:00
Takuya Yoshikawa	462fce4606	KVM: set_memory_region: Drop user_alloc from prepare/commit_memory_region() X86 does not use this any more. The remaining user, s390's !user_alloc check, can be simply removed since KVM_SET_MEMORY_REGION ioctl is no longer supported. Note: fixed powerpc's indentations with spaces to suppress checkpatch errors. Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2013-03-04 20:21:08 -03:00
Marcelo Tosatti	ee2c25efdd	Merge branch 'master' into queue * master: (15791 commits) Linux 3.9-rc1 btrfs/raid56: Add missing #include <linux/vmalloc.h> fix compat_sys_rt_sigprocmask() SUNRPC: One line comment fix ext4: enable quotas before orphan cleanup ext4: don't allow quota mount options when quota feature enabled ext4: fix a warning from sparse check for ext4_dir_llseek ext4: convert number of blocks to clusters properly ext4: fix possible memory leak in ext4_remount() jbd2: fix ERR_PTR dereference in jbd2__journal_start metag: Provide dma_get_sgtable() metag: prom.h: remove declaration of metag_dt_memblock_reserve() metag: copy devicetree to non-init memory metag: cleanup metag_ksyms.c includes metag: move mm/init.c exports out of metag_ksyms.c metag: move usercopy.c exports out of metag_ksyms.c metag: move setup.c exports out of metag_ksyms.c metag: move kick.c exports out of metag_ksyms.c metag: move traps.c exports out of metag_ksyms.c metag: move irq enable out of irqflags.h on SMP ... Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Conflicts: arch/x86/kernel/kvmclock.c	2013-03-04 20:10:32 -03:00
Borislav Petkov	6276a074c6	x86: Make Linux guest support optional Put all config options needed to run Linux as a guest behind a CONFIG_HYPERVISOR_GUEST menu so that they don't get built-in by default but be selectable by the user. Also, make all units which depend on x86_hyper, depend on this new symbol so that compilation doesn't fail when CONFIG_HYPERVISOR_GUEST is disabled but those units assume its presence. Sort options in the new HYPERVISOR_GUEST menu, adapt config text and drop redundant select. Signed-off-by: Borislav Petkov <bp@suse.de> Link: http://lkml.kernel.org/r/1362428421-9244-3-git-send-email-bp@alien8.de Cc: Dmitry Torokhov <dtor@vmware.com> Cc: K. Y. Srinivasan <kys@microsoft.com> Cc: Haiyang Zhang <haiyangz@microsoft.com> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>	2013-03-04 13:14:25 -08:00
Borislav Petkov	2c59cad694	x86, Kconfig: Move PARAVIRT_DEBUG into the paravirt menu This should be under the PARAVIRT_GUEST menu. Signed-off-by: Borislav Petkov <bp@suse.de> Link: http://lkml.kernel.org/r/1362428421-9244-2-git-send-email-bp@alien8.de Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>	2013-03-04 13:14:06 -08:00
Borislav Petkov	15b9c359f2	x86, efi: Make efi_memblock_x86_reserve_range more readable So basically this function copies EFI memmap stuff from boot_params into the EFI memmap descriptor and reserves memory for it. Make it much more readable. Signed-off-by: Borislav Petkov <bp@suse.de> Cc: Matthew Garret <mjg59@srcf.ucam.org> Signed-off-by: Matt Fleming <matt.fleming@intel.com>	2013-03-04 15:23:56 +00:00
Al Viro	4cce1a207c	x86: trim sys_ia32.h remove the externs for functions that don't exist anymore Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2013-03-03 23:00:34 -05:00
Al Viro	07b053457b	x86: sys32_kill and sys32_mprotect are pointless their argument types are identical to those of sys_kill and sys_mprotect resp., so we are not doing any kind of argument validation, etc. in those - they turn into unconditional branches to corresponding syscalls. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2013-03-03 23:00:33 -05:00
Al Viro	56e41d3c5a	merge compat sys_ipc instances Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2013-03-03 23:00:27 -05:00
Al Viro	d5dc77bfee	consolidate compat lookup_dcookie() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2013-03-03 23:00:23 -05:00
Al Viro	19f4fc3aee	convert sendfile{,64} to COMPAT_SYSCALL_DEFINE Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2013-03-03 22:58:46 -05:00
Al Viro	2cf0966683	make SYSCALL_DEFINE<n>-generated wrappers do asmlinkage_protect ... and switch i386 to HAVE_SYSCALL_WRAPPERS, killing open-coded uses of asmlinkage_protect() in a bunch of syscalls. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2013-03-03 22:58:33 -05:00
Al Viro	e1b5bb6d12	consolidate cond_syscall and SYSCALL_ALIAS declarations take them to asm/linkage.h, with default in linux/linkage.h Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2013-03-03 22:55:19 -05:00
Linus Torvalds	8e8b180a5f	Bug-fixes: - Update the Xen ACPI memory and CPU hotplug locking mechanism. - Fix PAT issues wherein various applications would not start - Fix handling of multiple MSI as AHCI now does it. - Fix ARM compile failures. -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.13 (GNU/Linux) iQEcBAABAgAGBQJRMM+8AAoJEFjIrFwIi8fJnGsIANU/lVS5EwV6ZMP9GiVtbm68 sBn0MoDIkN2ID16gcrQdfvzgtTQHsptL2fOl756veTHN2AIpIFYShKZpbgR9VM+c MpD68ltakkfjoVeb7F7yPbDvcSftKRW5VAq1SeFMc2gOOmiqAWQGgBC+3Cd04zFk SqzDs1RLUHypwBOFlZKa1ex/ShuYfzRb9x+J6zqGO+OpjhlMobyag8rhSlgehlfP 6gS1IzmcH8a6SgBKZk/+YC+i+QLgPOyxiK6zcxa2rfc6iUwodpqBpKP1N+CS4lnu FIKOIIzzCwCEAq94wVV0GJwHyw7nsqjG8syfRyOPmauLrpOI70xrV+lYFMVVRA0= =HwzV -----END PGP SIGNATURE----- Merge tag 'stable/for-linus-3.9-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen Pull Xen bug-fixes from Konrad Rzeszutek Wilk: - Update the Xen ACPI memory and CPU hotplug locking mechanism. - Fix PAT issues wherein various applications would not start - Fix handling of multiple MSI as AHCI now does it. - Fix ARM compile failures. * tag 'stable/for-linus-3.9-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen: xenbus: fix compile failure on ARM with Xen enabled xen/pci: We don't do multiple MSI's. xen/pat: Disable PAT using pat_enabled value. xen/acpi: xen cpu hotplug minor updates xen/acpi: xen memory hotplug minor updates	2013-03-03 14:22:53 -08:00
Linus Torvalds	56a79b7b02	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull more VFS bits from Al Viro: "Unfortunately, it looks like xattr series will have to wait until the next cycle ;-/ This pile contains 9p cleanups and fixes (races in v9fs_fid_add() etc), fixup for nommu breakage in shmem.c, several cleanups and a bit more file_inode() work" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: constify path_get/path_put and fs_struct.c stuff fix nommu breakage in shmem.c cache the value of file_inode() in struct file 9p: if v9fs_fid_lookup() gets to asking server, it'd better have hashed dentry 9p: make sure ->lookup() adds fid to the right dentry 9p: untangle ->lookup() a bit 9p: double iput() in ->lookup() if d_materialise_unique() fails 9p: v9fs_fid_add() can't fail now v9fs: get rid of v9fs_dentry 9p: turn fid->dlist into hlist 9p: don't bother with private lock in ->d_fsdata; dentry->d_lock will do just fine more file_inode() open-coded instances selinux: opened file can't have NULL or negative ->f_path.dentry (In the meantime, the hlist traversal macros have changed, so this required a semantic conflict fixup for the newly hlistified fid->dlist)	2013-03-03 13:23:03 -08:00
Yinghai Lu	20e6926dcb	x86, ACPI, mm: Revert movablemem_map support Tim found: WARNING: at arch/x86/kernel/smpboot.c:324 topology_sane.isra.2+0x6f/0x80() Hardware name: S2600CP sched: CPU #1's llc-sibling CPU #0 is not on the same node! [node: 1 != 0]. Ignoring dependency. smpboot: Booting Node 1, Processors #1 Modules linked in: Pid: 0, comm: swapper/1 Not tainted 3.9.0-0-generic #1 Call Trace: set_cpu_sibling_map+0x279/0x449 start_secondary+0x11d/0x1e5 Don Morris reproduced on a HP z620 workstation, and bisected it to commit `e8d1955258` ("acpi, memory-hotplug: parse SRAT before memblock is ready") It turns out movable_map has some problems, and it breaks several things 1. numa_init is called several times, NOT just for srat. so those nodes_clear(numa_nodes_parsed) memset(&numa_meminfo, 0, sizeof(numa_meminfo)) can not be just removed. Need to consider sequence is: numaq, srat, amd, dummy. and make fall back path working. 2. simply split acpi_numa_init to early_parse_srat. a. that early_parse_srat is NOT called for ia64, so you break ia64. b. for (i = 0; i < MAX_LOCAL_APIC; i++) set_apicid_to_node(i, NUMA_NO_NODE) still left in numa_init. So it will just clear result from early_parse_srat. it should be moved before that.... c. it breaks ACPI_TABLE_OVERIDE...as the acpi table scan is moved early before override from INITRD is settled. 3. that patch TITLE is total misleading, there is NO x86 in the title, but it changes critical x86 code. It caused x86 guys did not pay attention to find the problem early. Those patches really should be routed via tip/x86/mm. 4. after that commit, following range can not use movable ram: a. real_mode code.... well..funny, legacy Node0 [0,1M) could be hot-removed? b. initrd... it will be freed after booting, so it could be on movable... c. crashkernel for kdump...: looks like we can not put kdump kernel above 4G anymore. d. init_mem_mapping: can not put page table high anymore. e. initmem_init: vmemmap can not be high local node anymore. That is not good. If node is hotplugable, the mem related range like page table and vmemmap could be on the that node without problem and should be on that node. We have workaround patch that could fix some problems, but some can not be fixed. So just remove that offending commit and related ones including: `f7210e6c4a` ("mm/memblock.c: use CONFIG_HAVE_MEMBLOCK_NODE_MAP to protect movablecore_map in memblock_overlaps_region().") `01a178a94e` ("acpi, memory-hotplug: support getting hotplug info from SRAT") `27168d38fa` ("acpi, memory-hotplug: extend movablemem_map ranges to the end of node") `e8d1955258` ("acpi, memory-hotplug: parse SRAT before memblock is ready") `fb06bc8e5f` ("page_alloc: bootmem limit with movablecore_map") `42f47e27e7` ("page_alloc: make movablemem_map have higher priority") `6981ec3114` ("page_alloc: introduce zone_movable_limit[] to keep movable limit for nodes") `34b71f1e04` ("page_alloc: add movable_memmap kernel parameter") `4d59a75125` ("x86: get pg_data_t's memory from other node") Later we should have patches that will make sure kernel put page table and vmemmap on local node ram instead of push them down to node0. Also need to find way to put other kernel used ram to local node ram. Reported-by: Tim Gardner <tim.gardner@canonical.com> Reported-by: Don Morris <don.morris@hp.com> Bisected-by: Don Morris <don.morris@hp.com> Tested-by: Don Morris <don.morris@hp.com> Signed-off-by: Yinghai Lu <yinghai@kernel.org> Cc: Tony Luck <tony.luck@intel.com> Cc: Thomas Renninger <trenn@suse.de> Cc: Tejun Heo <tj@kernel.org> Cc: Tang Chen <tangchen@cn.fujitsu.com> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-03-02 09:34:39 -08:00
Linus Torvalds	14cc0b55b7	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal Pull signal/compat fixes from Al Viro: "Fixes for several regressions introduced in the last signal.git pile, along with fixing bugs in truncate and ftruncate compat (on just about anything biarch at least one of those two had been done wrong)." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal: compat: restore timerfd settime and gettime compat syscalls [regression] braino in "sparc: convert to ksignal" fix compat truncate/ftruncate switch lseek to COMPAT_SYSCALL_DEFINE lseek() and truncate() on sparc really need sign extension	2013-03-02 08:34:06 -08:00
Lans Zhang	2dead15fb8	x86_64: Use __BOOT_DS instead_of __KERNEL_DS for safety In startup_32, the running code still uses the initial GDT located in setup. Thus, __BOOT_DS is preferred. Currently __KERNEL_DS is lucky to equal to __BOOT_DS, but this is not always a safe way. Signed-off-by: Lans Zhang <lans.zhang2008@gmail.com> Link: http://lkml.kernel.org/r/51300267.6000008@gmail.com Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>	2013-03-01 10:18:33 -08:00
Konrad Rzeszutek Wilk	884ac2978a	xen/pci: We don't do multiple MSI's. There is no hypercall to setup multiple MSI per PCI device. As such with these two new commits: - `08261d87f7` PCI/MSI: Enable multiple MSIs with pci_enable_msi_block_auto() - `5ca72c4f7c` AHCI: Support multiple MSIs we would call the PHYSDEVOP_map_pirq 'nvec' times with the same contents of the PCI device. Sander discovered that we would get the same PIRQ value 'nvec' times and return said values to the caller. That of course meant that the device was configured only with one MSI and AHCI would fail with: ahci 0000:00:11.0: version 3.0 xen: registering gsi 19 triggering 0 polarity 1 xen: --> pirq=19 -> irq=19 (gsi=19) (XEN) [2013-02-27 19:43:07] IOAPIC[0]: Set PCI routing entry (6-19 -> 0x99 -> IRQ 19 Mode:1 Active:1) ahci 0000:00:11.0: AHCI 0001.0200 32 slots 4 ports 6 Gbps 0xf impl SATA mode ahci 0000:00:11.0: flags: 64bit ncq sntf ilck pm led clo pmp pio slum part ahci: probe of 0000:00:11.0 failed with error -22 That is b/c in ahci_host_activate the second call to devm_request_threaded_irq would return -EINVAL as we passed in (on the second run) an IRQ that was never initialized. CC: stable@vger.kernel.org Reported-and-Tested-by: Sander Eikelenboom <linux@eikelenboom.it> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>	2013-03-01 10:54:21 -05:00
Linus Torvalds	1e5005979e	Merge git://git.kernel.org/pub/scm/virt/kvm/kvm Pull one kvm bugfix from Gleb Natapov. * git://git.kernel.org/pub/scm/virt/kvm/kvm: x86/kvm: Fix pvclock vsyscall fixmap	2013-02-28 20:44:23 -08:00
Linus Torvalds	2af78448ff	Merge branch 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/rzhang/linux Pull thermal management updates from Zhang Rui: "Highlights: - introduction of Dove thermal sensor driver. - introduction of Kirkwood thermal sensor driver. - introduction of intel_powerclamp thermal cooling device driver. - add interrupt and DT support for rcar thermal driver. - add thermal emulation support which allows platform thermal driver to do software/hardware emulation for thermal issues." * 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/rzhang/linux: (36 commits) thermal: rcar: remove __devinitconst thermal: return an error on failure to register thermal class Thermal: rename thermal governor Kconfig option to avoid generic naming thermal: exynos: Use the new thermal trend type for quick cooling action. Thermal: exynos: Add support for temperature falling interrupt. Thermal: Dove: Add Themal sensor support for Dove. thermal: Add support for the thermal sensor on Kirkwood SoCs thermal: rcar: add Device Tree support thermal: rcar: remove machine_power_off() from rcar_thermal_notify() thermal: rcar: add interrupt support thermal: rcar: add read/write functions for common/priv data thermal: rcar: multi channel support thermal: rcar: use mutex lock instead of spin lock thermal: rcar: enable CPCTL to use hardware TSC deciding thermal: rcar: use parenthesis on macro Thermal: fix a build warning when CONFIG_THERMAL_EMULATION cleared Thermal: fix a wrong comment thermal: sysfs: Add a new sysfs node emul_temp for thermal emulation PM: intel_powerclamp: off by one in start_power_clamp() thermal: exynos: Miscellaneous fixes to support falling threshold interrupt ...	2013-02-28 19:48:26 -08:00
Konrad Rzeszutek Wilk	c79c498262	xen/pat: Disable PAT using pat_enabled value. The git commit `8eaffa67b4` (xen/pat: Disable PAT support for now) explains in details why we want to disable PAT for right now. However that change was not enough and we should have also disabled the pat_enabled value. Otherwise we end up with: mmap-example:3481 map pfn expected mapping type write-back for [mem 0x00010000-0x00010fff], got uncached-minus ------------[ cut here ]------------ WARNING: at /build/buildd/linux-3.8.0/arch/x86/mm/pat.c:774 untrack_pfn+0xb8/0xd0() mem 0x00010000-0x00010fff], got uncached-minus ------------[ cut here ]------------ WARNING: at /build/buildd/linux-3.8.0/arch/x86/mm/pat.c:774 untrack_pfn+0xb8/0xd0() ... Pid: 3481, comm: mmap-example Tainted: GF 3.8.0-6-generic #13-Ubuntu Call Trace: [<ffffffff8105879f>] warn_slowpath_common+0x7f/0xc0 [<ffffffff810587fa>] warn_slowpath_null+0x1a/0x20 [<ffffffff8104bcc8>] untrack_pfn+0xb8/0xd0 [<ffffffff81156c1c>] unmap_single_vma+0xac/0x100 [<ffffffff81157459>] unmap_vmas+0x49/0x90 [<ffffffff8115f808>] exit_mmap+0x98/0x170 [<ffffffff810559a4>] mmput+0x64/0x100 [<ffffffff810560f5>] dup_mm+0x445/0x660 [<ffffffff81056d9f>] copy_process.part.22+0xa5f/0x1510 [<ffffffff81057931>] do_fork+0x91/0x350 [<ffffffff81057c76>] sys_clone+0x16/0x20 [<ffffffff816ccbf9>] stub_clone+0x69/0x90 [<ffffffff816cc89d>] ? system_call_fastpath+0x1a/0x1f ---[ end trace 4918cdd0a4c9fea4 ]--- (a similar message shows up if you end up launching 'mcelog') The call chain is (as analyzed by Liu, Jinsong): do_fork --> copy_process --> dup_mm --> dup_mmap --> copy_page_range --> track_pfn_copy --> reserve_pfn_range --> line 624: flags != want_flags It comes from different memory types of page table (_PAGE_CACHE_WB) and MTRR (_PAGE_CACHE_UC_MINUS). Stefan Bader dug in this deep and found out that: "That makes it clearer as this will do reserve_memtype(...) --> pat_x_mtrr_type --> mtrr_type_lookup --> __mtrr_type_lookup And that can return -1/0xff in case of MTRR not being enabled/initialized. Which is not the case (given there are no messages for it in dmesg). This is not equal to MTRR_TYPE_WRBACK and thus becomes _PAGE_CACHE_UC_MINUS. It looks like the problem starts early in reserve_memtype: if (!pat_enabled) { /* This is identical to page table setting without PAT / if (new_type) { if (req_type == _PAGE_CACHE_WC) new_type = _PAGE_CACHE_UC_MINUS; else *new_type = req_type & _PAGE_CACHE_MASK; } return 0; } This would be what we want, that is clearing the PWT and PCD flags from the supported flags - if pat_enabled is disabled." This patch does that - disabling PAT. CC: stable@vger.kernel.org # 3.3 and further Reported-by: Sander Eikelenboom <linux@eikelenboom.it> Reported-and-Tested-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Reported-and-Tested-by: Stefan Bader <stefan.bader@canonical.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>	2013-02-28 09:03:00 -05:00
Jan Kiszka	3ab66e8a45	KVM: VMX: Pass vcpu to __vmx_complete_interrupts Cleanup: __vmx_complete_interrupts has no use for the vmx structure. Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Gleb Natapov <gleb@redhat.com>	2013-02-28 10:29:03 +02:00
Jan Kiszka	44ceb9d665	KVM: nVMX: Avoid one redundant vmcs_read in prepare_vmcs12 IDT_VECTORING_INFO_FIELD was already read right after vmexit. Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Gleb Natapov <gleb@redhat.com>	2013-02-28 10:20:06 +02:00
Peter Hurley	3d2a80a230	x86/kvm: Fix pvclock vsyscall fixmap The physical memory fixmapped for the pvclock clock_gettime vsyscall was allocated, and thus is not a kernel symbol. __pa() is the proper method to use in this case. Fixes the crash below when booting a next-20130204+ smp guest on a 3.8-rc5+ KVM host. [ 0.666410] udevd[97]: starting version 175 [ 0.674043] udevd[97]: udevd:[97]: segfault at ffffffffff5fd020 ip 00007fff069e277f sp 00007fff068c9ef8 error d Acked-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Peter Hurley <peter@hurleysoftware.com> Signed-off-by: Gleb Natapov <gleb@redhat.com>	2013-02-28 08:50:11 +02:00
Linus Torvalds	2a7d2b96d5	Merge branch 'akpm' (final batch from Andrew) Merge third patch-bumb from Andrew Morton: "This wraps me up for -rc1. - Lots of misc stuff and things which were deferred/missed from patchbombings 1 & 2. - ocfs2 things - lib/scatterlist - hfsplus - fatfs - documentation - signals - procfs - lockdep - coredump - seqfile core - kexec - Tejun's large IDR tree reworkings - ipmi - partitions - nbd - random() things - kfifo - tools/testing/selftests updates - Sasha's large and pointless hlist cleanup" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (163 commits) hlist: drop the node parameter from iterators kcmp: make it depend on CHECKPOINT_RESTORE selftests: add a simple doc tools/testing/selftests/Makefile: rearrange targets selftests/efivarfs: add create-read test selftests/efivarfs: add empty file creation test selftests: add tests for efivarfs kfifo: fix kfifo_alloc() and kfifo_init() kfifo: move kfifo.c from kernel/ to lib/ arch Kconfig: centralise CONFIG_ARCH_NO_VIRT_TO_BUS w1: add support for DS2413 Dual Channel Addressable Switch memstick: move the dereference below the NULL test drivers/pps/clients/pps-gpio.c: use devm_kzalloc Documentation/DMA-API-HOWTO.txt: fix typo include/linux/eventfd.h: fix incorrect filename is a comment mtd: mtd_stresstest: use prandom_bytes() mtd: mtd_subpagetest: convert to use prandom library mtd: mtd_speedtest: use prandom_bytes mtd: mtd_pagetest: convert to use prandom library mtd: mtd_oobtest: convert to use prandom library ...	2013-02-27 20:58:09 -08:00
Sasha Levin	b67bfe0d42	hlist: drop the node parameter from iterators I'm not sure why, but the hlist for each entry iterators were conceived list_for_each_entry(pos, head, member) The hlist ones were greedy and wanted an extra parameter: hlist_for_each_entry(tpos, pos, head, member) Why did they need an extra pos parameter? I'm not quite sure. Not only they don't really need it, it also prevents the iterator from looking exactly like the list iterator, which is unfortunate. Besides the semantic patch, there was some manual work required: - Fix up the actual hlist iterators in linux/list.h - Fix up the declaration of other iterators based on the hlist ones. - A very small amount of places were using the 'node' parameter, this was modified to use 'obj->member' instead. - Coccinelle didn't handle the hlist_for_each_entry_safe iterator properly, so those had to be fixed up manually. The semantic patch which is mostly the work of Peter Senna Tschudin is here: @@ iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host; type T; expression a,c,d,e; identifier b; statement S; @@ -T b; <+... when != b ( hlist_for_each_entry(a, - b, c, d) S \| hlist_for_each_entry_continue(a, - b, c) S \| hlist_for_each_entry_from(a, - b, c) S \| hlist_for_each_entry_rcu(a, - b, c, d) S \| hlist_for_each_entry_rcu_bh(a, - b, c, d) S \| hlist_for_each_entry_continue_rcu_bh(a, - b, c) S \| for_each_busy_worker(a, c, - b, d) S \| ax25_uid_for_each(a, - b, c) S \| ax25_for_each(a, - b, c) S \| inet_bind_bucket_for_each(a, - b, c) S \| sctp_for_each_hentry(a, - b, c) S \| sk_for_each(a, - b, c) S \| sk_for_each_rcu(a, - b, c) S \| sk_for_each_from -(a, b) +(a) S + sk_for_each_from(a) S \| sk_for_each_safe(a, - b, c, d) S \| sk_for_each_bound(a, - b, c) S \| hlist_for_each_entry_safe(a, - b, c, d, e) S \| hlist_for_each_entry_continue_rcu(a, - b, c) S \| nr_neigh_for_each(a, - b, c) S \| nr_neigh_for_each_safe(a, - b, c, d) S \| nr_node_for_each(a, - b, c) S \| nr_node_for_each_safe(a, - b, c, d) S \| - for_each_gfn_sp(a, c, d, b) S + for_each_gfn_sp(a, c, d) S \| - for_each_gfn_indirect_valid_sp(a, c, d, b) S + for_each_gfn_indirect_valid_sp(a, c, d) S \| for_each_host(a, - b, c) S \| for_each_host_safe(a, - b, c, d) S \| for_each_mesh_entry(a, - b, c, d) S ) ...+> [akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c] [akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c] [akpm@linux-foundation.org: checkpatch fixes] [akpm@linux-foundation.org: fix warnings] [akpm@linux-foudnation.org: redo intrusive kvm changes] Tested-by: Peter Senna Tschudin <peter.senna@gmail.com> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Sasha Levin <sasha.levin@oracle.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Gleb Natapov <gleb@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-02-27 19:10:24 -08:00
Stephen Rothwell	887cbce0ad	arch Kconfig: centralise CONFIG_ARCH_NO_VIRT_TO_BUS Change it to CONFIG_HAVE_VIRT_TO_BUS and set it in all architecures that already provide virt_to_bus(). Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Reviewed-by: James Hogan <james.hogan@imgtec.com> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: H Hartley Sweeten <hartleys@visionengravers.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: "David S. Miller" <davem@davemloft.net> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Vineet Gupta <Vineet.Gupta1@synopsys.com> Cc: James Bottomley <James.Bottomley@HansenPartnership.com> Cc: <linux-arch@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-02-27 19:10:23 -08:00
Linus Torvalds	e3c4877de8	Merge branch 'x86-efi-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86/EFI changes from Peter Anvin: - Improve the initrd handling in the EFI boot stub by allowing forward slashes in the pathname - from Chun-Yi Lee. - Cleanup code duplication in the EFI mixed kernel/firmware code - from Satoru Takeuchi. - efivarfs bug fixes for more strict filename validation, with lots of input from Al Viro. * 'x86-efi-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86, efi: remove duplicate code in setup_arch() by using, efi_is_native() efivarfs: guid part of filenames are case-insensitive efivarfs: Validate filenames much more aggressively efivarfs: Use sizeof() instead of magic number x86, efi: Allow slash in file path of initrd	2013-02-27 16:17:42 -08:00
Linus Torvalds	18a44a7ff1	Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull more x86 fixes from Peter Anvin: "Additional x86 fixes. Three of these patches are pure documentation, two are pretty trivial; the remaining one fixes boot problems on some non-BIOS machines." * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86: Make sure we can boot in the case the BDA contains pure garbage x86, efi: Mark disable_runtime as __initdata x86, doc: Fix incorrect comment about 64-bit code segment descriptors doc, kernel-parameters: Document 'console=hvc<n>' doc, xen: Mention 'earlyprintk=xen' in the documentation. ACPI: Overriding ACPI tables via initrd only works with an initrd and on X86	2013-02-27 16:16:39 -08:00
Al Viro	6131ffaa1f	more file_inode() open-coded instances Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2013-02-27 16:59:05 -05:00
H. Peter Anvin	7c10093692	x86: Make sure we can boot in the case the BDA contains pure garbage On non-BIOS platforms it is possible that the BIOS data area contains garbage instead of being zeroed or something equivalent (firmware people: we are talking of 1.5K here, so please do the sane thing.) We need on the order of 20-30K of low memory in order to boot, which may grow up to < 64K in the future. We probably want to avoid the lowest of the low memory. At the same time, it seems extremely unlikely that a legitimate EBDA would ever reach down to the 128K (which would require it to be over half a megabyte in size.) Thus, pick 128K as the cutoff for "this is insane, ignore." We may still end up reserving a bunch of extra memory on the low megabyte, but that is not really a major issue these days. In the worst case we lose 512K of RAM. This code really should be merged with trim_bios_range() in arch/x86/kernel/setup.c, but that is a bigger patch for a later merge window. Reported-by: Darren Hart <dvhart@linux.intel.com> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Cc: Matt Fleming <matt.fleming@intel.com> Cc: <stable@vger.kernel.org> Link: http://lkml.kernel.org/n/tip-oebml055yyfm8yxmria09rja@git.kernel.org	2013-02-27 13:38:57 -08:00
Chen Gang	02cdb50fd7	arch/x86/kvm: beautify source code for __u32 irq which is never < 0 irp->irq is __u32 which is never < 0. Signed-off-by: Chen Gang <gang.chen@asianux.com> Signed-off-by: Gleb Natapov <gleb@redhat.com>	2013-02-27 16:07:45 +02:00
Jan Kiszka	957c897e8c	KVM: nVMX: Use cached exit reason No need to re-read what vmx_vcpu_run already picked up for us. Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Gleb Natapov <gleb@redhat.com>	2013-02-27 15:46:07 +02:00
Jan Kiszka	36c3cc422b	KVM: nVMX: Clear segment cache after switching between L1 and L2 Switching the VMCS obviously invalidates what may have been cached about the guest segments. Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Gleb Natapov <gleb@redhat.com>	2013-02-27 15:41:09 +02:00
Jan Kiszka	d6851fbeee	KVM: nVMX: Advertise PAUSE and WBINVD exiting support These exits have no preconditions, and we already process the corresponding reasons in nested_vmx_exit_handled correctly. Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Gleb Natapov <gleb@redhat.com>	2013-02-27 15:33:51 +02:00
Jan Kiszka	733568f9ce	KVM: VMX: Make prepare_vmcs12 and load_vmcs12_host_state static Both are only used locally. Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Gleb Natapov <gleb@redhat.com>	2013-02-27 15:31:15 +02:00
Jan Kiszka	fe1140cc36	x86: kvmclock: Do not setup kvmclock vsyscall in the absence of that clock This fixes boot lockups with "no-kvmclock", when the host is not exposing this particular feature (QEMU: -cpu ...,-kvmclock) or when the kvmclock initialization failed for whatever reason. Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Gleb Natapov <gleb@redhat.com>	2013-02-27 13:19:18 +02:00
Linus Torvalds	d895cb1af1	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs pile (part one) from Al Viro: "Assorted stuff - cleaning namei.c up a bit, fixing ->d_name/->d_parent locking violations, etc. The most visible changes here are death of FS_REVAL_DOT (replaced with "has ->d_weak_revalidate()") and a new helper getting from struct file to inode. Some bits of preparation to xattr method interface changes. Misc patches by various people sent this cycle and ocfs2 fixes from several cycles ago that should've been upstream right then. PS: the next vfs pile will be xattr stuff." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (46 commits) saner proc_get_inode() calling conventions proc: avoid extra pde_put() in proc_fill_super() fs: change return values from -EACCES to -EPERM fs/exec.c: make bprm_mm_init() static ocfs2/dlm: use GFP_ATOMIC inside a spin_lock ocfs2: fix possible use-after-free with AIO ocfs2: Fix oops in ocfs2_fast_symlink_readpage() code path get_empty_filp()/alloc_file() leave both ->f_pos and ->f_version zero target: writev() on single-element vector is pointless export kernel_write(), convert open-coded instances fs: encode_fh: return FILEID_INVALID if invalid fid_type kill f_vfsmnt vfs: kill FS_REVAL_DOT by adding a d_weak_revalidate dentry op nfsd: handle vfs_getattr errors in acl protocol switch vfs_getattr() to struct path default SET_PERSONALITY() in linux/elf.h ceph: prepopulate inodes only when request is aborted d_hash_and_lookup(): export, switch open-coded instances 9p: switch v9fs_set_create_acl() to inode+fid, do it before d_instantiate() 9p: split dropping the acls from v9fs_set_create_acl() ...	2013-02-26 20:16:07 -08:00
Linus Torvalds	2003cd90c4	Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Ingo Molnar. * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/mm/pageattr: Prevent PSE and GLOABL leftovers to confuse pmd/pte_present and pmd_huge Revert "x86, mm: Make spurious_fault check explicitly check explicitly check the PRESENT bit" x86/mm/numa: Don't check if node is NUMA_NO_NODE x86, efi: Make "noefi" really disable EFI runtime serivces x86/apic: Fix parsing of the 'lapic' cmdline option	2013-02-26 19:45:29 -08:00
Linus Torvalds	f8ef15d6b9	Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf fixes from Ingo Molnar. * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf/x86: Add Intel IvyBridge event scheduling constraints ftrace: Call ftrace cleanup module notifier after all other notifiers tracing/syscalls: Allow archs to ignore tracing compat syscalls	2013-02-26 19:40:37 -08:00
Herbert Xu	ca81a1a1b8	crypto: crc32c - Kill pointless CRYPTO_CRC32C_X86_64 option This bool option can never be set to anything other than y. So let's just kill it. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2013-02-26 17:52:15 +08:00
Linus Torvalds	556f12f602	PCI changes for the v3.9 merge window: Host bridge hotplug - Major overhaul of ACPI host bridge add/start (Rafael Wysocki, Yinghai Lu) - Major overhaul of PCI/ACPI binding (Rafael Wysocki, Yinghai Lu) - Split out ACPI host bridge and ACPI PCI device hotplug (Yinghai Lu) - Stop caching _PRT and make independent of bus numbers (Yinghai Lu) PCI device hotplug - Clean up cpqphp dead code (Sasha Levin) - Disable ARI unless device and upstream bridge support it (Yijing Wang) - Initialize all hot-added devices (not functions 0-7) (Yijing Wang) Power management - Don't touch ASPM if disabled (Joe Lawrence) - Fix ASPM link state management (Myron Stowe) Miscellaneous - Fix PCI_EXP_FLAGS accessor (Alex Williamson) - Disable Bus Master in pci_device_shutdown (Konstantin Khlebnikov) - Document hotplug resource and MPS parameters (Yijing Wang) - Add accessor for PCIe capabilities (Myron Stowe) - Drop pciehp suspend/resume messages (Paul Bolle) - Make pci_slot built-in only (not a module) (Jiang Liu) - Remove unused PCI/ACPI bind ops (Jiang Liu) - Removed used pci_root_bus (Bjorn Helgaas) -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.11 (GNU/Linux) iQIcBAABAgAGBQJRKS3hAAoJEFmIoMA60/r8xxoP/j1CS4oCZAnBIVT9fKBkis+/ CENcfHIUKj6J9iMfJEVvqBELvqaLqtpeNwAGMcGPxV7VuT3K1QumChfaTpRDP0HC VDRmrjcmfenEK+YPOG7acsrTyqk2wjpLOyu9MKRxtC5u7tF6376KQpkEFpO4haL4 eUHTxfE76OkrPBSvx3+PUSf6jqrvrNbjX8K6HdDVVlm3sVAQKmYJU/Wphv2NPOqa CAMyCzEGybFjr8hDRwvWgr+06c718GMwQUbnrPdHXAe7lMNMrN/XVBmU9ABN3Aas icd3lrDs+yPObgcO/gT8+sAZErCtdJ9zuHGYHdYpRbIQj/5JT4TMk7tw/Bj7vKY9 Mqmho9GR5YmYTRN9f1r+2n5AQ/KYWXJDrRNOnt5/ys5BOM3vwJ7WJ902zpSwtFQp nLX+oD/hLfzpnoIQGDuBAoAXp2Kam3XWRgVvG78buRNrPj+kUzimk14a8qQeY+CB El6UKuwi5Uv/qgs1gAqqjmZmsAkon2DnsRZa6Fl8NTkDlis7LY4gp9OU38ySFpB+ PhCmRyCZmDDqTVtwj6XzR3nPQ5LBSbvsTfgMxYMIUSXHa06tyb2q5p4mEIas0OmU RKaP5xQqZuTgD8fbdYrx0xgSrn7JHt/j/X//Qs6unlLCWhlpm3LjJZKxyw2FwBGr o4Lci+PiBh3MowCrju9D =ER3b -----END PGP SIGNATURE----- Merge tag 'pci-v3.9-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci Pull PCI changes from Bjorn Helgaas: "Host bridge hotplug - Major overhaul of ACPI host bridge add/start (Rafael Wysocki, Yinghai Lu) - Major overhaul of PCI/ACPI binding (Rafael Wysocki, Yinghai Lu) - Split out ACPI host bridge and ACPI PCI device hotplug (Yinghai Lu) - Stop caching _PRT and make independent of bus numbers (Yinghai Lu) PCI device hotplug - Clean up cpqphp dead code (Sasha Levin) - Disable ARI unless device and upstream bridge support it (Yijing Wang) - Initialize all hot-added devices (not functions 0-7) (Yijing Wang) Power management - Don't touch ASPM if disabled (Joe Lawrence) - Fix ASPM link state management (Myron Stowe) Miscellaneous - Fix PCI_EXP_FLAGS accessor (Alex Williamson) - Disable Bus Master in pci_device_shutdown (Konstantin Khlebnikov) - Document hotplug resource and MPS parameters (Yijing Wang) - Add accessor for PCIe capabilities (Myron Stowe) - Drop pciehp suspend/resume messages (Paul Bolle) - Make pci_slot built-in only (not a module) (Jiang Liu) - Remove unused PCI/ACPI bind ops (Jiang Liu) - Removed used pci_root_bus (Bjorn Helgaas)" * tag 'pci-v3.9-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci: (51 commits) PCI/ACPI: Don't cache _PRT, and don't associate them with bus numbers PCI: Fix PCI Express Capability accessors for PCI_EXP_FLAGS ACPI / PCI: Make pci_slot built-in only, not a module PCI/PM: Clear state_saved during suspend PCI: Use atomic_inc_return() rather than atomic_add_return() PCI: Catch attempts to disable already-disabled devices PCI: Disable Bus Master unconditionally in pci_device_shutdown() PCI: acpiphp: Remove dead code for PCI host bridge hotplug PCI: acpiphp: Create companion ACPI devices before creating PCI devices PCI: Remove unused "rc" in virtfn_add_bus() PCI: pciehp: Drop suspend/resume ENTRY messages PCI/ASPM: Don't touch ASPM if forcibly disabled PCI/ASPM: Deallocate upstream link state even if device is not PCIe PCI: Document MPS parameters pci=pcie_bus_safe, pci=pcie_bus_perf, etc PCI: Document hpiosize= and hpmemsize= resource reservation parameters PCI: Use PCI Express Capability accessor PCI: Introduce accessor to retrieve PCIe Capabilities Register PCI: Put pci_dev in device tree as early as possible PCI: Skip attaching driver in device_add() PCI: acpiphp: Keep driver loaded even if no slots found ...	2013-02-25 21:18:18 -08:00
Matt Fleming	058e7b5814	x86, efi: Mark disable_runtime as __initdata disable_runtime is only referenced from __init functions, so mark it as __initdata. Reported-by: Yinghai Lu <yinghai@kernel.org> Reviewed-by: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com> Signed-off-by: Matt Fleming <matt.fleming@intel.com> Link: http://lkml.kernel.org/r/1361545427-26393-1-git-send-email-matt@console-pimps.org Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>	2013-02-25 16:01:35 -08:00
Linus Torvalds	32dc43e40a	Merge git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6 Pull crypto update from Herbert Xu: "Here is the crypto update for 3.9: - Added accelerated implementation of crc32 using pclmulqdq. - Added test vector for fcrypt. - Added support for OMAP4/AM33XX cipher and hash. - Fixed loose crypto_user input checks. - Misc fixes" * git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (43 commits) crypto: user - ensure user supplied strings are nul-terminated crypto: user - fix empty string test in report API crypto: user - fix info leaks in report API crypto: caam - Added property fsl,sec-era in SEC4.0 device tree binding. crypto: use ERR_CAST crypto: atmel-aes - adjust duplicate test crypto: crc32-pclmul - Kill warning on x86-32 crypto: x86/twofish - assembler clean-ups: use ENTRY/ENDPROC, localize jump labels crypto: x86/sha1 - assembler clean-ups: use ENTRY/ENDPROC crypto: x86/serpent - use ENTRY/ENDPROC for assember functions and localize jump targets crypto: x86/salsa20 - assembler cleanup, use ENTRY/ENDPROC for assember functions and rename ECRYPT_* to salsa20_* crypto: x86/ghash - assembler clean-up: use ENDPROC at end of assember functions crypto: x86/crc32c - assembler clean-up: use ENTRY/ENDPROC crypto: cast6-avx: use ENTRY()/ENDPROC() for assembler functions crypto: cast5-avx: use ENTRY()/ENDPROC() for assembler functions and localize jump targets crypto: camellia-x86_64/aes-ni: use ENTRY()/ENDPROC() for assembler functions and localize jump targets crypto: blowfish-x86_64: use ENTRY()/ENDPROC() for assembler functions and localize jump targets crypto: aesni-intel - add ENDPROC statements for assembler functions crypto: x86/aes - assembler clean-ups: use ENTRY/ENDPROC, localize jump targets crypto: testmgr - add test vector for fcrypt ...	2013-02-25 15:56:15 -08:00
Linus Torvalds	9043a2650c	The sweeping change is to make add_taint() explicitly indicate whether to disable lockdep, but it's a mechanical change. Cheers, Rusty. -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.11 (GNU/Linux) iQIcBAABAgAGBQJRJAcuAAoJENkgDmzRrbjxsw0P/3eXb+LddYnx0V0uHYdKpCUf 4vdW7X0fX3Z+aUK69IWRL/6ahoO4TpaHYGHBDjEoivyQ0GDq14X7JNWsYYt3LdMf 3wmDgRc2cn/mZOJbFeVpNV8ox5l/xc0CUvV+iQ8tMjfQItXMXgWUFZKMECsXKSO6 eex3lrw9M2jAX2uL8LQPp9W8xtKu24nSZRC6tH5riE/8fCzi1cZPPAqfxP5c8Lee ZXtbCRSyAFENZLpKyMe1PC7HvtJyi5NDn9xwOQiXULZV/VOlvP94DGBLIKCM/6dn 4QvZxpG0P0uOlpCgRAVLyh/z7g4XY4VF/fHopLCmEcqLsvgD+V2LQpQ9zWUalLPC Z+pUpz2vu0gIddPU1nR8R6oGpEdJ8O12aJle62p/RSXWZGx12qUQ+Tamu0tgKcv1 AsiJfbUGNDYfxgU6sHsoQjl2f68LTVckCU1C1LqEbW/S104EIORtGx30CHM4LRiO 32kDC5TtgYDBKQAIqJ4bL48ZMh+9W3uX40p7xzOI5khHQjvswUKa3jcxupU0C1uv lx8KXo7pn8WT33QGysWC782wJCgJuzSc2vRn+KQoqoynuHGM6agaEtR59gil3QWO rQEcxH63BBRDgHlg4FM9IkJwwsnC3PWKL8gbX0uAWXAPMbgapJkuuGZAwt0WDGVK +GszxsFkCjlW0mK0egTb =tiSY -----END PGP SIGNATURE----- Merge tag 'modules-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux Pull module update from Rusty Russell: "The sweeping change is to make add_taint() explicitly indicate whether to disable lockdep, but it's a mechanical change." * tag 'modules-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux: MODSIGN: Add option to not sign modules during modules_install MODSIGN: Add -s <signature> option to sign-file MODSIGN: Specify the hash algorithm on sign-file command line MODSIGN: Simplify Makefile with a Kconfig helper module: clean up load_module a little more. modpost: Ignore ARC specific non-alloc sections module: constify within_module_* taint: add explicit flag to show whether lock dep is still OK. module: printk message when module signature fail taints kernel.	2013-02-25 15:41:43 -08:00
Konrad Rzeszutek Wilk	1256276c98	x86, doc: Fix incorrect comment about 64-bit code segment descriptors The AMD64 Architecture Programmer's Manual Volume 2, on page 89 mentions: "If the processor is running in 64-bit mode (L=1), the only valid setting of the D bit is 0." This matches with what the code does. Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Link: http://lkml.kernel.org/r/1361825650-14031-4-git-send-email-konrad.wilk@oracle.com Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>	2013-02-25 13:42:37 -08:00
Al Viro	3f6d078d4a	fix compat truncate/ftruncate Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2013-02-25 09:24:55 -05:00
Linus Torvalds	77be36de8b	Features: - Xen ACPI memory and CPU hotplug drivers - allowing Xen hypervisor to be aware of new CPU and new DIMMs - Cleanups Bug-fixes: - Fixes a long-standing bug in the PV spinlock wherein we did not kick VCPUs that were in a tight loop. - Fixes in the error paths for the event channel machinery. -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.13 (GNU/Linux) iQEcBAABAgAGBQJRJS1kAAoJEFjIrFwIi8fJj2YIAMO3/LVUZyojX/d8U9pqrCly lFfEF2UVjcxHJSj0ZFNXt1o3fnYP1SLRlT9u7ZLDjXf6Lmxmw6/C3Haw2wp3DfGq yUR0G/X9CPTBEgMYDdX7bjeTjyURvZcUaFwr+qodaaeL3uXx2pW6621Sc6jRKuia yAFVZMAKeaRrvUUIXjKHtlpRp9LKFdSztShMtYqmFvxEwrJPq2b37caKruoUCa6o X/YO0fvE9QtYD/pG0jsghFmLh/mcr+n9IFMCUXo1Yc9FdQBExtKzABDS5jdpuFND 4aMDE3dqUmHmpbaQhRE7SdblvpyrGdQXL6FSTjvwBgISfLo847CrnRKRgPp0YeA= =LQeU -----END PGP SIGNATURE----- Merge tag 'stable/for-linus-3.9-rc0-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen Pull Xen update from Konrad Rzeszutek Wilk: "This has two new ACPI drivers for Xen - a physical CPU offline/online and a memory hotplug. The way this works is that ACPI kicks the drivers and they make the appropiate hypercall to the hypervisor to tell it that there is a new CPU or memory. There also some changes to the Xen ARM ABIs and couple of fixes. One particularly nasty bug in the Xen PV spinlock code was fixed by Stefan Bader - and has been there since the 2.6.32! Features: - Xen ACPI memory and CPU hotplug drivers - allowing Xen hypervisor to be aware of new CPU and new DIMMs - Cleanups Bug-fixes: - Fixes a long-standing bug in the PV spinlock wherein we did not kick VCPUs that were in a tight loop. - Fixes in the error paths for the event channel machinery" Fix up a few semantic conflicts with the ACPI interface changes in drivers/xen/xen-acpi-{cpu,mem}hotplug.c. * tag 'stable/for-linus-3.9-rc0-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen: xen: event channel arrays are xen_ulong_t and not unsigned long xen: Send spinlock IPI to all waiters xen: introduce xen_remap, use it instead of ioremap xen: close evtchn port if binding to irq fails xen-evtchn: correct comment and error output xen/tmem: Add missing %s in the printk statement. xen/acpi: move xen_acpi_get_pxm under CONFIG_XEN_DOM0 xen/acpi: ACPI cpu hotplug xen/acpi: Move xen_acpi_get_pxm to Xen's acpi.h xen/stub: driver for CPU hotplug xen/acpi: ACPI memory hotplug xen/stub: driver for memory hotplug xen: implement updated XENMEM_add_to_physmap_range ABI xen/smp: Move the common CPU init code a bit to prep for PVH patch.	2013-02-24 16:18:31 -08:00
Linus Torvalds	89f883372f	Merge tag 'kvm-3.9-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm Pull KVM updates from Marcelo Tosatti: "KVM updates for the 3.9 merge window, including x86 real mode emulation fixes, stronger memory slot interface restrictions, mmu_lock spinlock hold time reduction, improved handling of large page faults on shadow, initial APICv HW acceleration support, s390 channel IO based virtio, amongst others" * tag 'kvm-3.9-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (143 commits) Revert "KVM: MMU: lazily drop large spte" x86: pvclock kvm: align allocation size to page size KVM: nVMX: Remove redundant get_vmcs12 from nested_vmx_exit_handled_msr x86 emulator: fix parity calculation for AAD instruction KVM: PPC: BookE: Handle alignment interrupts booke: Added DBCR4 SPR number KVM: PPC: booke: Allow multiple exception types KVM: PPC: booke: use vcpu reference from thread_struct KVM: Remove user_alloc from struct kvm_memory_slot KVM: VMX: disable apicv by default KVM: s390: Fix handling of iscs. KVM: MMU: cleanup __direct_map KVM: MMU: remove pt_access in mmu_set_spte KVM: MMU: cleanup mapping-level KVM: MMU: lazily drop large spte KVM: VMX: cleanup vmx_set_cr0(). KVM: VMX: add missing exit names to VMX_EXIT_REASONS array KVM: VMX: disable SMEP feature when guest is in non-paging mode KVM: Remove duplicate text in api.txt Revert "KVM: MMU: split kvm_mmu_free_page" ...	2013-02-24 13:07:18 -08:00
Al Viro	561c673197	switch lseek to COMPAT_SYSCALL_DEFINE Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2013-02-24 10:52:26 -05:00
Andrea Arcangeli	a8aed3e075	x86/mm/pageattr: Prevent PSE and GLOABL leftovers to confuse pmd/pte_present and pmd_huge Without this patch any kernel code that reads kernel memory in non present kernel pte/pmds (as set by pageattr.c) will crash. With this kernel code: static struct page crash_page; static unsigned long crash_address; [..] crash_page = alloc_pages(GFP_KERNEL, 9); crash_address = page_address(crash_page); if (set_memory_np((unsigned long)crash_address, 1)) printk("set_memory_np failure\n"); [..] The kernel will crash if inside the "crash tool" one would try to read the memory at the not present address. crash> p crash_address crash_address = $8 = (long unsigned int ) 0xffff88023c000000 crash> rd 0xffff88023c000000 [ lockup* ] The lockup happens because _PAGE_GLOBAL and _PAGE_PROTNONE shares the same bit, and pageattr leaves _PAGE_GLOBAL set on a kernel pte which is then mistaken as _PAGE_PROTNONE (so pte_present returns true by mistake and the kernel fault then gets confused and loops). With THP the same can happen after we taught pmd_present to check _PAGE_PROTNONE and _PAGE_PSE in commit `027ef6c878` ("mm: thp: fix pmd_present for split_huge_page and PROT_NONE with THP"). THP has the same problem with _PAGE_GLOBAL as the 4k pages, but it also has a problem with _PAGE_PSE, which must be cleared too. After the patch is applied copy_user correctly returns -EFAULT and doesn't lockup anymore. crash> p crash_address crash_address = $9 = (long unsigned int *) 0xffff88023c000000 crash> rd 0xffff88023c000000 rd: read error: kernel virtual address: ffff88023c000000 type: "64-bit KVADDR" Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Shaohua Li <shaohua.li@intel.com> Cc: "H. Peter Anvin" <hpa@linux.intel.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Hugh Dickins <hughd@google.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@kernel.org>	2013-02-24 13:01:45 +01:00
Andrea Arcangeli	954f857187	Revert "x86, mm: Make spurious_fault check explicitly check explicitly check the PRESENT bit" I got a report for a minor regression introduced by commit `027ef6c878` ("mm: thp: fix pmd_present for split_huge_page and PROT_NONE with THP"). So the problem is, pageattr creates kernel pagetables (pte and pmds) that breaks pte_present/pmd_present and the patch above exposed this invariant breakage for pmd_present. The same problem already existed for the pte and pte_present and it was fixed by commit `660a293ea9` ("x86, mm: Make spurious_fault check explicitly check the PRESENT bit") (if it wasn't for that commit, it wouldn't even be a regression). That fix avoids the pagefault to use pte_present. I could follow through by stopping using pmd_present/pmd_huge too. However I think it's more robust to fix pageattr and to clear the PSE/GLOBAL bitflags too in addition to the present bitflag. So the kernel page fault can keep using the regular pte_present/pmd_present/pmd_huge. The confusion arises because _PAGE_GLOBAL and _PAGE_PROTNONE are sharing the same bit, and in the pmd case we pretend _PAGE_PSE to be set only in present pmds (to facilitate split_huge_page final tlb flush). Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Shaohua Li <shaohua.li@intel.com> Cc: "H. Peter Anvin" <hpa@linux.intel.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Hugh Dickins <hughd@google.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@kernel.org>	2013-02-24 13:01:44 +01:00
Wen Congyang	942670d0dc	x86/mm/numa: Don't check if node is NUMA_NO_NODE If we aren't debugging per_cpu maps, the cpu's node is stored in per_cpu variable numa_node. If `node' is NUMA_NO_NODE, it means the caller wants to clear the cpu's node. So we should also call set_cpu_numa_node() in this case. Signed-off-by: Wen Congyang <wency@cn.fujitsu.com> Cc: Len Brown <len.brown@intel.com> Cc: Pavel Machek <pavel@ucw.cz> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@kernel.org>	2013-02-24 13:01:43 +01:00
Linus Torvalds	9e2d59ad58	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal Pull signal handling cleanups from Al Viro: "This is the first pile; another one will come a bit later and will contain SYSCALL_DEFINE-related patches. - a bunch of signal-related syscalls (both native and compat) unified. - a bunch of compat syscalls switched to COMPAT_SYSCALL_DEFINE (fixing several potential problems with missing argument validation, while we are at it) - a lot of now-pointless wrappers killed - a couple of architectures (cris and hexagon) forgot to save altstack settings into sigframe, even though they used the (uninitialized) values in sigreturn; fixed. - microblaze fixes for delivery of multiple signals arriving at once - saner set of helpers for signal delivery introduced, several architectures switched to using those." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal: (143 commits) x86: convert to ksignal sparc: convert to ksignal arm: switch to struct ksignal * passing alpha: pass k_sigaction and siginfo_t using ksignal pointer burying unused conditionals make do_sigaltstack() static arm64: switch to generic old sigaction() (compat-only) arm64: switch to generic compat rt_sigaction() arm64: switch compat to generic old sigsuspend arm64: switch to generic compat rt_sigqueueinfo() arm64: switch to generic compat rt_sigpending() arm64: switch to generic compat rt_sigprocmask() arm64: switch to generic sigaltstack sparc: switch to generic old sigsuspend sparc: COMPAT_SYSCALL_DEFINE does all sign-extension as well as SYSCALL_DEFINE sparc: kill sign-extending wrappers for native syscalls kill sparc32_open() sparc: switch to use of generic old sigaction sparc: switch sys_compat_rt_sigaction() to COMPAT_SYSCALL_DEFINE mips: switch to generic sys_fork() and sys_clone() ...	2013-02-23 18:50:11 -08:00
Linus Torvalds	5ce1a70e2f	Merge branch 'akpm' (more incoming from Andrew) Merge second patch-bomb from Andrew Morton: - A little DM fix - the MM queue * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (154 commits) ksm: allocate roots when needed mm: cleanup "swapcache" in do_swap_page mm,ksm: swapoff might need to copy mm,ksm: FOLL_MIGRATION do migration_entry_wait ksm: shrink 32-bit rmap_item back to 32 bytes ksm: treat unstable nid like in stable tree ksm: add some comments tmpfs: fix mempolicy object leaks tmpfs: fix use-after-free of mempolicy object mm/fadvise.c: drain all pagevecs if POSIX_FADV_DONTNEED fails to discard all pages mm: export mmu notifier invalidates mm: accelerate mm_populate() treatment of THP pages mm: use long type for page counts in mm_populate() and get_user_pages() mm: accurately document nr_free_*_pages functions with code comments HWPOISON: change order of error_states[]'s elements HWPOISON: fix misjudgement of page_action() for errors on mlocked pages memcg: stop warning on memcg_propagate_kmem net: change type of virtio_chan->p9_max_pages vmscan: change type of vm_total_pages to unsigned long fs/nfsd: change type of max_delegations, nfsd_drc_max_mem and nfsd_drc_mem_used ...	2013-02-23 17:50:35 -08:00
Tang Chen	01a178a94e	acpi, memory-hotplug: support getting hotplug info from SRAT We now provide an option for users who don't want to specify physical memory address in kernel commandline. /* * For movablemem_map=acpi: * * SRAT: \|_____\| \|_____\| \|_________\| \|_________\| ...... * node id: 0 1 1 2 * hotpluggable: n y y n * movablemem_map: \|_____\| \|_________\| * * Using movablemem_map, we can prevent memblock from allocating memory * on ZONE_MOVABLE at boot time. */ So user just specify movablemem_map=acpi, and the kernel will use hotpluggable info in SRAT to determine which memory ranges should be set as ZONE_MOVABLE. If all the memory ranges in SRAT is hotpluggable, then no memory can be used by kernel. But before parsing SRAT, memblock has already reserve some memory ranges for other purposes, such as for kernel image, and so on. We cannot prevent kernel from using these memory. So we need to exclude these ranges even if these memory is hotpluggable. Furthermore, there could be several memory ranges in the single node which the kernel resides in. We may skip one range that have memory reserved by memblock, but if the rest of memory is too small, then the kernel will fail to boot. So, make the whole node which the kernel resides in un-hotpluggable. Then the kernel has enough memory to use. NOTE: Using this way will cause NUMA performance down because the whole node will be set as ZONE_MOVABLE, and kernel cannot use memory on it. If users don't want to lose NUMA performance, just don't use it. [akpm@linux-foundation.org: fix warning] [akpm@linux-foundation.org: use strcmp()] Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Jiang Liu <jiang.liu@huawei.com> Cc: Jianguo Wu <wujianguo@huawei.com> Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Wu Jianguo <wujianguo@huawei.com> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Len Brown <lenb@kernel.org> Cc: "Brown, Len" <len.brown@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-02-23 17:50:14 -08:00
Tang Chen	27168d38fa	acpi, memory-hotplug: extend movablemem_map ranges to the end of node When implementing movablemem_map boot option, we introduced an array movablemem_map.map[] to store the memory ranges to be set as ZONE_MOVABLE. Since ZONE_MOVABLE is the latst zone of a node, if user didn't specify the whole node memory range, we need to extend it to the node end so that we can use it to prevent memblock from allocating memory in the ranges user didn't specify. We now implement movablemem_map boot option like this: /* * For movablemem_map=nn[KMG]@ss[KMG]: * * SRAT: \|_____\| \|_____\| \|_________\| \|_________\| ...... * node id: 0 1 1 2 * user specified: \|__\| \|___\| * movablemem_map: \|___\| \|_________\| \|______\| ...... * * Using movablemem_map, we can prevent memblock from allocating memory * on ZONE_MOVABLE at boot time. * * NOTE: In this case, SRAT info will be ingored. */ [akpm@linux-foundation.org: clean up code, fix build warning] Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Jiang Liu <jiang.liu@huawei.com> Cc: Jianguo Wu <wujianguo@huawei.com> Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Wu Jianguo <wujianguo@huawei.com> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Len Brown <lenb@kernel.org> Cc: "Brown, Len" <len.brown@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-02-23 17:50:14 -08:00
Tang Chen	e8d1955258	acpi, memory-hotplug: parse SRAT before memblock is ready On linux, the pages used by kernel could not be migrated. As a result, if a memory range is used by kernel, it cannot be hot-removed. So if we want to hot-remove memory, we should prevent kernel from using it. The way now used to prevent this is specify a memory range by movablemem_map boot option and set it as ZONE_MOVABLE. But when the system is booting, memblock will allocate memory, and reserve the memory for kernel. And before we parse SRAT, and know the node memory ranges, memblock is working. And it may allocate memory in ranges to be set as ZONE_MOVABLE. This memory can be used by kernel, and never be freed. So, let's parse SRAT before memblock is called first. And it is early enough. The first call of memblock_find_in_range_node() is in: setup_arch() \|-->setup_real_mode() so, this patch add a function early_parse_srat() to parse SRAT, and call it before setup_real_mode() is called. NOTE: 1) early_parse_srat() is called before numa_init(), and has initialized numa_meminfo. So DO NOT clear numa_nodes_parsed in numa_init() and DO NOT zero numa_meminfo in numa_init(), otherwise we will lose memory numa info. 2) I don't know why using count of memory affinities parsed from SRAT as a return value in original acpi_numa_init(). So I add a static variable srat_mem_cnt to remember this count and use it as the return value of the new acpi_numa_init() [mhocko@suse.cz: parse SRAT before memblock is ready fix] Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com> Reviewed-by: Wen Congyang <wency@cn.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Jiang Liu <jiang.liu@huawei.com> Cc: Jianguo Wu <wujianguo@huawei.com> Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Wu Jianguo <wujianguo@huawei.com> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Len Brown <lenb@kernel.org> Cc: "Brown, Len" <len.brown@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-02-23 17:50:14 -08:00
Yasuaki Ishimatsu	4d59a75125	x86: get pg_data_t's memory from other node During the implementation of SRAT support, we met a problem. In setup_arch(), we have the following call series: 1) memblock is ready; 2) some functions use memblock to allocate memory; 3) parse ACPI tables, such as SRAT. Before 3), we don't know which memory is hotpluggable, and as a result, we cannot prevent memblock from allocating hotpluggable memory. So, in 2), there could be some hotpluggable memory allocated by memblock. Now, we are trying to parse SRAT earlier, before memblock is ready. But I think we need more investigation on this topic. So in this v5, I dropped all the SRAT support, and v5 is just the same as v3, and it is based on 3.8-rc3. As we planned, we will support getting info from SRAT without users' participation at last. And we will post another patch-set to do so. And also, I think for now, we can add this boot option as the first step of supporting movable node. Since Linux cannot migrate the direct mapped pages, the only way for now is to limit the whole node containing only movable memory. Using SRAT is one way. But even if we can use SRAT, users still need an interface to enable/disable this functionality if they don't want to loose their NUMA performance. So I think, a user interface is always needed. For now, users can disable this functionality by not specifying the boot option. Later, we will post SRAT support, and add another option value "movablecore_map=acpi" to using SRAT. This patch: If system can create movable node which all memory of the node is allocated as ZONE_MOVABLE, setup_node_data() cannot allocate memory for the node's pg_data_t. So, use memblock_alloc_try_nid() instead of memblock_alloc_nid() to retry when the first allocation fails. Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com> Signed-off-by: Jiang Liu <jiang.liu@huawei.com> Cc: Wu Jianguo <wujianguo@huawei.com> Cc: Wen Congyang <wency@cn.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Ingo Molnar <mingo@elte.hu> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-02-23 17:50:14 -08:00
Wen Congyang	e13fe8695c	cpu-hotplug,memory-hotplug: clear cpu_to_node() when offlining the node When the node is offlined, there is no memory/cpu on the node. If a sleep task runs on a cpu of this node, it will be migrated to the cpu on the other node. So we can clear cpu-to-node mapping. [akpm@linux-foundation.org: numa_clear_node() and numa_set_node() can no longer be __cpuinit] Signed-off-by: Wen Congyang <wency@cn.fujitsu.com> Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Cc: Jiang Liu <liuj97@gmail.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-02-23 17:50:13 -08:00
Wen Congyang	c4c6052464	cpu_hotplug: clear apicid to node when the cpu is hotremoved When a cpu is hotpluged, we call acpi_map_cpu2node() in _acpi_map_lsapic() to store the cpu's node and apicid's node. But we don't clear the cpu's node in acpi_unmap_lsapic() when this cpu is hotremoved. If the node is also hotremoved, we will get the following messages: kernel BUG at include/linux/gfp.h:329! invalid opcode: 0000 [#1] SMP Modules linked in: ebtable_nat ebtables ipt_MASQUERADE iptable_nat nf_nat xt_CHECKSUM iptable_mangle bridge stp llc sunrpc ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables binfmt_misc dm_mirror dm_region_hash dm_log dm_mod vhost_net macvtap macvlan tun uinput iTCO_wdt iTCO_vendor_support coretemp kvm_intel kvm crc32c_intel microcode pcspkr i2c_i801 i2c_core lpc_ich mfd_core ioatdma e1000e i7core_edac edac_core sg acpi_memhotplug igb dca sd_mod crc_t10dif megaraid_sas mptsas mptscsih mptbase scsi_transport_sas scsi_mod Pid: 3126, comm: init Not tainted 3.6.0-rc3-tangchen-hostbridge+ #13 FUJITSU-SV PRIMEQUEST 1800E/SB RIP: 0010:[<ffffffff811bc3fd>] [<ffffffff811bc3fd>] allocate_slab+0x28d/0x300 RSP: 0018:ffff88078a049cf8 EFLAGS: 00010246 RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000000 RDX: 0000000000000001 RSI: 0000000000000001 RDI: 0000000000000246 RBP: ffff88078a049d38 R08: 00000000000040d0 R09: 0000000000000001 R10: 0000000000000000 R11: 0000000000000b5f R12: 00000000000052d0 R13: ffff8807c1417300 R14: 0000000000030038 R15: 0000000000000003 FS: 00007fa9b1b44700(0000) GS:ffff8807c3800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 00007fa9b09acca0 CR3: 000000078b855000 CR4: 00000000000007e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process init (pid: 3126, threadinfo ffff88078a048000, task ffff8807bb6f2650) Call Trace: new_slab+0x30/0x1b0 __slab_alloc+0x358/0x4c0 kmem_cache_alloc_node_trace+0xb4/0x1e0 alloc_fair_sched_group+0xd0/0x1b0 sched_create_group+0x3e/0x110 sched_autogroup_create_attach+0x4d/0x180 sys_setsid+0xd4/0xf0 system_call_fastpath+0x16/0x1b Code: 89 c4 e9 73 fe ff ff 31 c0 89 de 48 c7 c7 45 de 9e 81 44 89 45 c8 e8 22 05 4b 00 85 db 44 8b 45 c8 0f 89 4f ff ff ff 0f 0b eb fe <0f> 0b 90 eb fd 0f 0b eb fe 89 de 48 c7 c7 45 de 9e 81 31 c0 44 RIP [<ffffffff811bc3fd>] allocate_slab+0x28d/0x300 RSP <ffff88078a049cf8> ---[ end trace adf84c90f3fea3e5 ]--- The reason is that the cpu's node is not NUMA_NO_NODE, we will call alloc_pages_exact_node() to alloc memory on the node, but the node is offlined. If the node is onlined, we still need cpu's node. For example: a task on the cpu is sleeped when the cpu is hotremoved. We will choose another cpu to run this task when it is waked up. If we know the cpu's node, we will choose the cpu on the same node first. So we should clear cpu-to-node mapping when the node is offlined. This patch only clears apicid-to-node mapping when the cpu is hotremoved. [akpm@linux-foundation.org: fix section error] Signed-off-by: Wen Congyang <wency@cn.fujitsu.com> Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Cc: Jiang Liu <liuj97@gmail.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-02-23 17:50:13 -08:00

... 3 4 5 6 7 ...

17471 Commits