linux/arch/x86/kernel
Jeremy Fitzhardinge b4ecc12699 x86: Fix performance regression caused by paravirt_ops on native kernels
Xiaohui Xin and some other folks at Intel have been looking into what's
behind the performance hit of paravirt_ops when running native.

It appears that the hit is entirely due to the paravirtualized
spinlocks introduced by:

 | commit 8efcbab674
 | Date:   Mon Jul 7 12:07:51 2008 -0700
 |
 |     paravirt: introduce a "lock-byte" spinlock implementation

The extra call/return in the spinlock path is somehow
causing an increase in the cycles/instruction of somewhere around 2-7%
(seems to vary quite a lot from test to test).  The working theory is
that the CPU's pipeline is getting upset about the
call->call->locked-op->return->return, and seems to be failing to
speculate (though I haven't seen anything definitive about the precise
reasons).  This doesn't entirely make sense, because the performance
hit is also visible on unlock and other operations which don't involve
locked instructions.  But spinlock operations clearly swamp all the
other pvops operations, even though I can't imagine that they're
nearly as common (there's only a .05% increase in instructions
executed).

If I disable just the pv-spinlock calls, my tests show that pvops is
identical to non-pvops performance on native (my measurements show that
it is actually about .1% faster, but Xiaohui shows a .05% slowdown).

Summary of results, averaging 10 runs of the "mmperf" test, using a
no-pvops build as baseline:

                  nopv      pv-nospin   pv-spin
CPU cycles        100.00%   99.89%      102.18%
instructions      100.00%   100.10%     100.15%
CPI               100.00%   99.79%      102.03%
cache ref         100.00%   100.84%     100.28%
cache miss        100.00%   90.47%      88.56%
cache miss rate   100.00%   89.72%      88.31%
branches          100.00%   99.93%      100.04%
branch miss       100.00%   103.66%     107.72%
branch miss rate  100.00%   103.73%     107.67%
wallclock         100.00%   99.90%      102.20%

The clear effect here is that the 2% increase in CPI is
directly reflected in the final wallclock time.
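
As a sanity check on the table, the derived rows can be recomputed from
the raw rows: CPI is cycles divided by instructions, and each miss rate
is misses divided by references (or branches).  The helper below is a
minimal sketch; the name relative_pct is made up for illustration and
is not part of the kernel or of mmperf.

```c
#include <assert.h>

/*
 * Both inputs are percentages relative to the no-pvops baseline.
 * Returns the derived ratio, again as a percentage of the baseline.
 * Hypothetical helper, for illustration only.
 */
static double relative_pct(double numerator_pct, double denominator_pct)
{
	assert(denominator_pct > 0.0);
	return 100.0 * numerator_pct / denominator_pct;
}
```

For example, relative_pct(102.18, 100.15) gives roughly 102.03, which
matches the pv-spin CPI row.  Small differences in the last digit are
possible because the table's inputs are themselves rounded.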

(The other interesting effect is that the more ops are
out of line calls via pvops, the lower the cache access
and miss rates.  Not too surprising, but it suggests that
the non-pvops kernel is over-inlined.  On the flipside,
the branch misses go up correspondingly...)

So, what's the fix?

Paravirt patching turns all the pvops calls into direct calls, so
_spin_lock etc. do end up having direct calls.  For example, the
compiler-generated code for paravirtualized _spin_lock is:

<_spin_lock+0>:		mov    %gs:0xb4c8,%rax
<_spin_lock+9>:		incl   0xffffffffffffe044(%rax)
<_spin_lock+15>:	callq  *0xffffffff805a5b30
<_spin_lock+22>:	retq

The indirect call will get patched to:

<_spin_lock+0>:		mov    %gs:0xb4c8,%rax
<_spin_lock+9>:		incl   0xffffffffffffe044(%rax)
<_spin_lock+15>:	callq <__ticket_spin_lock>
<_spin_lock+20>:	nop; nop		/* or whatever 2-byte nop */
<_spin_lock+22>:	retq
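
The byte-level effect of that patching can be sketched in userspace C.
This is an illustration only, not the kernel's patching code: the
function name and the addresses used with it are hypothetical.  It
rewrites the 7-byte "callq *abs32" encoding (ff 14 25 + 32-bit address)
into a 5-byte "callq rel32" (e8 + 32-bit displacement) followed by a
2-byte nop (66 90), which is why the retq stays at offset +22.

```c
#include <assert.h>
#include <stdint.h>

enum { DIRECT_CALL_LEN = 5 };

/*
 * site:      the 7 bytes of the indirect call instruction
 * site_addr: address at which site[0] will execute (hypothetical)
 * target:    address of the function to call directly (hypothetical)
 */
static void patch_indirect_to_direct(uint8_t *site, uint64_t site_addr,
				     uint64_t target)
{
	int64_t delta;
	uint32_t rel32;
	int i;

	/* Only handle the "callq *abs32" form: ff 14 25 imm32. */
	assert(site[0] == 0xff && site[1] == 0x14 && site[2] == 0x25);

	/* rel32 is relative to the end of the 5-byte direct call. */
	delta = (int64_t)(target - (site_addr + DIRECT_CALL_LEN));
	assert(delta >= INT32_MIN && delta <= INT32_MAX);
	rel32 = (uint32_t)(int32_t)delta;

	site[0] = 0xe8;				/* callq rel32 */
	for (i = 0; i < 4; i++)			/* little-endian displacement */
		site[1 + i] = (uint8_t)(rel32 >> (8 * i));

	site[5] = 0x66;				/* 2-byte nop: 66 90 */
	site[6] = 0x90;
}
```

The range check matters: a direct call can only reach targets within
+/-2GB of the call site, so a patcher has to fall back to the indirect
form if the displacement does not fit in 32 bits.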

One possibility is to inline _spin_lock, etc., when building an
optimised kernel (i.e., when there's no spinlock/preempt
instrumentation/debugging enabled).  That will remove the outer
call/return pair, returning the instruction stream to a single
call/return, which will presumably execute the same as the non-pvops
case.  The downsides are: 1) it will replicate the
preempt_disable/enable code at each lock/unlock callsite; this code is
fairly small, but not nothing; and 2) the spinlock definitions are
already a very heavily tangled mass of #ifdefs and other preprocessor
magic, and making any changes will be non-trivial.

The other obvious answer is to disable pv-spinlocks.  Making them a
separate config option is fairly easy, and it would be trivial to
enable them only when Xen is enabled (as the only non-default user).
But it doesn't really address the common case of a distro build which
is going to have Xen support enabled, and leaves the open question of
whether the native performance cost of pv-spinlocks is worth the
performance improvement on a loaded Xen system (10% saving of overall
system CPU when guests block rather than spin).  Still, it is a
reasonable short-term workaround.
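
The shape of such a config switch can be sketched in plain C11.  All
names here (DEMO_PV_SPINLOCKS, demo_spinlock_t, demo_lock_ops, ...) are
hypothetical stand-ins and not the kernel's definitions; the lock is a
simple test-and-set lock rather than a ticket lock.  With the macro
undefined, the wrappers collapse to the native functions and the
compiler can inline them; with it defined, every operation goes through
a function pointer that a hypervisor backend could replace.

```c
#include <stdatomic.h>

typedef struct { atomic_flag locked; } demo_spinlock_t;

/* Native implementation: spin until the flag is acquired. */
static inline void native_spin_lock(demo_spinlock_t *l)
{
	while (atomic_flag_test_and_set_explicit(&l->locked,
						 memory_order_acquire))
		;	/* busy-wait */
}

static inline void native_spin_unlock(demo_spinlock_t *l)
{
	atomic_flag_clear_explicit(&l->locked, memory_order_release);
}

#ifdef DEMO_PV_SPINLOCKS
/* Indirect path: one extra call level, replaceable at runtime. */
struct demo_lock_ops {
	void (*lock)(demo_spinlock_t *);
	void (*unlock)(demo_spinlock_t *);
};

static struct demo_lock_ops demo_lock_ops = {
	.lock	= native_spin_lock,
	.unlock	= native_spin_unlock,
};

static inline void demo_spin_lock(demo_spinlock_t *l)
{
	demo_lock_ops.lock(l);
}

static inline void demo_spin_unlock(demo_spinlock_t *l)
{
	demo_lock_ops.unlock(l);
}
#else
/* Direct path: no indirection, identical to a non-pvops build. */
static inline void demo_spin_lock(demo_spinlock_t *l)
{
	native_spin_lock(l);
}

static inline void demo_spin_unlock(demo_spinlock_t *l)
{
	native_spin_unlock(l);
}
#endif
```

Building with -DDEMO_PV_SPINLOCKS selects the indirect path; callers
use demo_spin_lock()/demo_spin_unlock() either way, so the choice stays
a build-time decision.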

[ Impact: fix pvops performance regression when running native ]

Analysed-by: "Xin Xiaohui" <xiaohui.xin@intel.com>
Analysed-by: "Li Xin" <xin.li@intel.com>
Analysed-by: "Nakajima Jun" <jun.nakajima@intel.com>
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Acked-by: H. Peter Anvin <hpa@zytor.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Xen-devel <xen-devel@lists.xensource.com>
LKML-Reference: <4A0B62F7.5030802@goop.org>
[ fixed the help text ]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-05-15 20:07:42 +02:00
..
acpi Merge branch 'linus' into release 2009-04-05 02:14:15 -04:00
apic x86: Fix false positive section mismatch warnings in the apic code 2009-05-10 09:26:54 +02:00
cpu x86: mtrr: Fix high_width computation when phys-addr is >= 44bit 2009-05-11 11:40:43 +02:00
.gitignore
alternative.c x86: expand irq-off region in text_poke() 2009-03-10 16:24:23 +01:00
amd_iommu_init.c amd-iommu: fix iommu flag masks 2009-05-04 15:05:24 +02:00
amd_iommu.c Merge git://git.infradead.org/iommu-2.6 2009-04-03 10:36:57 -07:00
aperture_64.c aperture_64.c: clarify that too small aperture is valid reason for this code 2008-11-28 15:24:39 +01:00
apm_32.c Merge branch 'cpumask-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip 2009-03-31 13:33:50 +10:30
asm-offsets_32.c pm: rework includes, remove arch ifdefs 2009-04-01 08:59:16 -07:00
asm-offsets_64.c pm: rework includes, remove arch ifdefs 2009-04-01 08:59:16 -07:00
asm-offsets.c
audit_64.c
bios_uv.c x86, UV: system table in bios accessed after unmap 2009-04-03 19:25:57 +02:00
bootflag.c
check.c x86: fix 64k corruption-check 2009-03-15 07:03:15 +01:00
cpuid.c Merge branch 'cpus4096-for-linus-3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip 2009-01-03 12:04:39 -08:00
crash_dump_32.c
crash_dump_64.c
crash.c x86, apic: remove duplicate asm/apic.h inclusions 2009-02-17 17:52:44 +01:00
doublefault_32.c
ds.c x86, pebs: correct qualifier passed to ds_write_config() from ds_request_pebs() 2009-03-06 16:13:15 +01:00
dumpstack_32.c ftrace: print real return in dumpstack for function graph 2008-12-03 08:56:25 +01:00
dumpstack_64.c x86-64: Move current task from PDA to per-cpu and consolidate with 32-bit. 2009-01-19 00:38:58 +09:00
dumpstack.c Merge commit 'origin/master' into next 2009-03-30 14:04:53 +11:00
dumpstack.h ftrace: print real return in dumpstack for function graph 2008-12-03 08:56:25 +01:00
e820.c x86: fix boot hang in early_reserve_e820() 2009-05-07 21:42:39 -07:00
early_printk.c x86: properly __init-annotate recent early_printk additions 2009-03-13 02:37:18 +01:00
early-quirks.c x86: only scan the root bus in early PCI quirks 2009-01-09 12:46:22 -08:00
efi_32.c
efi_64.c Merge branch 'core/percpu' into percpu-cpumask-x86-for-linus-2 2009-03-27 17:28:43 +01:00
efi_stub_32.S Merge branch 'x86/asm' into x86/mm 2009-02-25 08:27:46 +01:00
efi_stub_64.S x86: efi_stub_32,64 - add missing ENDPROCs 2009-02-24 18:08:40 +01:00
efi.c Merge branch 'core/percpu' into percpu-cpumask-x86-for-linus-2 2009-03-27 17:28:43 +01:00
entry_32.S x86: entry_32.S fix compile warnings - fix work mask bit width 2009-03-14 09:42:51 +01:00
entry_64.S lockdep, x86: account for irqs enabled in paranoid_exit 2009-04-18 09:04:28 +02:00
ftrace.c tracing/syscalls: use a dedicated file header 2009-04-09 05:43:32 +02:00
geode_32.c
head32.c x86-32: use brk segment for allocating initial kernel pagetable 2009-03-14 17:23:47 -07:00
head64.c x86: add brk allocation for very, very early allocations 2009-03-14 15:37:14 -07:00
head_32.S x86-32: tighten the bound on additional memory to map 2009-03-17 11:52:10 -07:00
head_64.S x86: head_64.S - use IDT_ENTRIES instead of hardcoded number 2009-02-24 18:08:38 +01:00
head.c x86, debug: remove EBDA debug printk 2008-12-12 11:08:42 +01:00
hpet.c x86: hpet: fix periodic mode programming on AMD 81xx 2009-04-22 15:53:40 +02:00
i386_ksyms_32.c
i387.c x86, math-emu: fix init_fpu for task != current 2009-03-04 20:33:16 +01:00
i8237.c i8327: fix outb() parameter order 2009-02-10 13:13:23 +01:00
i8253.c clocksource: pass clocksource to read() callback 2009-04-21 13:41:47 -07:00
i8259.c x86: refactor x86_quirks support 2009-02-23 00:08:11 +01:00
init_task.c take init_fs to saner place 2008-12-31 18:07:42 -05:00
io_delay.c x86: io_delay.c cleanup 2009-03-21 16:57:04 +05:30
ioport.c x86-32: use non-lazy io bitmap context switching 2009-03-02 12:07:48 +01:00
irq_32.c Merge branch 'tj-percpu' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc into core/percpu 2009-02-24 21:52:45 +01:00
irq_64.c x86: unify do_IRQ() 2009-02-09 12:16:05 +01:00
irq.c x86: smarten /proc/interrupts output for new counters 2009-04-08 18:06:07 +02:00
irqinit_32.c Merge git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-cpumask 2009-03-30 18:00:26 -07:00
irqinit_64.c Merge git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-cpumask 2009-03-30 18:00:26 -07:00
k8.c
kdebugfs.c x86: kdebugfs.c cleanup 2009-03-21 16:55:45 +05:30
kgdb.c x86, apic: remove genapic.h 2009-02-17 17:52:44 +01:00
kprobes.c Merge branch 'tracing/core-v2' into tracing-for-linus 2009-04-02 00:49:02 +02:00
kvm.c x86: with the last user gone, remove set_pte_present 2009-03-19 14:04:19 +01:00
kvmclock.c clocksource: pass clocksource to read() callback 2009-04-21 13:41:47 -07:00
ldt.c x86: ldt.c fix style problems 2009-01-02 17:46:24 +01:00
machine_kexec_32.c x86, kexec: fix crashdump panic with CONFIG_KEXEC_JUMP 2009-05-07 22:01:05 -07:00
machine_kexec_64.c x86, kexec: fix crashdump panic with CONFIG_KEXEC_JUMP 2009-05-07 22:01:05 -07:00
Makefile x86: Fix performance regression caused by paravirt_ops on native kernels 2009-05-15 20:07:42 +02:00
mca_32.c x86: refactor x86_quirks support 2009-02-23 00:08:11 +01:00
mfgpt_32.c cpumask: remove references to struct irqaction's mask field. 2009-03-30 22:05:14 +10:30
microcode_amd.c x86: microcode: cleanup 2009-03-18 13:51:17 +01:00
microcode_core.c Merge branch 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip 2009-04-17 09:56:11 -07:00
microcode_intel.c x86: microcode: cleanup 2009-03-18 13:51:17 +01:00
mmconf-fam10h_64.c x86: move various CPU initialization objects into .cpuinit.rodata 2009-03-12 13:13:07 +01:00
module_32.c x86: module_32.c fix style problems 2009-01-12 11:22:55 +01:00
module_64.c x86: module_64.c fix style problems 2009-01-12 11:23:01 +01:00
mpparse.c x86: Fix section mismatches in mpparse 2009-04-12 12:32:18 +02:00
msr.c x86: msr.c fix style problems 2009-01-12 11:22:50 +01:00
olpc.c x86, olpc: fix model detection without OFW 2009-02-14 23:05:25 +01:00
paravirt_patch_32.c x86/pvops: add a paravirt_ident functions to allow special patching 2009-01-30 14:51:44 -08:00
paravirt_patch_64.c x86/pvops: add a paravirt_ident functions to allow special patching 2009-01-30 14:51:44 -08:00
paravirt-spinlocks.c x86: remove byte locks 2009-01-20 17:14:28 +01:00
paravirt.c x86: Fix performance regression caused by paravirt_ops on native kernels 2009-05-15 20:07:42 +02:00
pci-calgary_64.c x86, ia64: convert to use generic dma_map_ops struct 2009-01-06 14:06:57 +01:00
pci-dma.c dma-mapping: replace all DMA_24BIT_MASK macro with DMA_BIT_MASK(24) 2009-04-07 08:31:12 -07:00
pci-gart_64.c Merge branch 'linus' into core/iommu 2009-03-05 12:47:28 +01:00
pci-nommu.c dma-mapping: replace all DMA_32BIT_MASK macro with DMA_BIT_MASK(32) 2009-04-07 08:31:11 -07:00
pci-swiotlb.c x86: pci-swiotlb.c swiotlb_dma_ops should be static 2009-04-14 02:51:04 +02:00
pcspeaker.c
pmtimer_64.c
probe_roms_32.c x86: move mach-default/*.h files to asm/ 2009-01-29 14:16:51 +01:00
process_32.c Simplify copy_thread() 2009-04-02 19:04:51 -07:00
process_64.c Simplify copy_thread() 2009-04-02 19:04:51 -07:00
process.c Merge branch 'tracing-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip 2009-04-05 11:04:19 -07:00
ptrace.c tracing/syscalls: use a dedicated file header 2009-04-09 05:43:32 +02:00
pvclock.c
quirks.c x86, hpet: Stop soliciting hpet=force users on ICH4M 2009-04-24 08:41:39 +02:00
reboot_fixups_32.c
reboot.c x86: DMI match for the Dell DXP061 as it needs BIOS reboot 2009-04-08 17:53:27 +02:00
relocate_kernel_32.S x86, kexec: fix kexec x86 coding style 2009-03-10 18:13:25 -07:00
relocate_kernel_64.S x86, kexec: x86_64: add kexec jump support for x86_64 2009-03-10 18:13:25 -07:00
rtc.c x86: rtc.c cleanup 2009-03-21 16:56:37 +05:30
scx200_32.c
setup_percpu.c x86: remove duplicated code with pcpu_need_numa() 2009-04-02 06:08:05 +02:00
setup.c Merge git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-cpumask 2009-03-30 18:00:26 -07:00
signal.c x86: signal: check sas_ss_size instead of sas_ss_flags() 2009-04-01 17:13:17 +02:00
smp.c x86, apic: remove genapic.h 2009-02-17 17:52:44 +01:00
smpboot.c cpumask: use new cpumask functions throughout x86 2009-03-13 14:49:54 +10:30
stacktrace.c x86: update copyrights 2009-01-31 04:21:18 +01:00
step.c
sys_i386_32.c
sys_x86_64.c
syscall_64.c
syscall_table_32.S preadv/pwritev: Add preadv and pwritev system calls. 2009-04-02 19:05:08 -07:00
tce_64.c
test_nx.c
test_rodata.c
time_32.c x86: refactor x86_quirks support 2009-02-23 00:08:11 +01:00
time_64.c cpumask: remove references to struct irqaction's mask field. 2009-03-30 22:05:14 +10:30
tlb_uv.c Merge branch 'x86/uv' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip 2009-04-16 16:43:20 -07:00
tls.c
tls.h
topology.c x86: topology.c cleanup 2009-03-21 16:55:24 +05:30
trampoline_32.S x86: use _types.h headers in asm where available 2009-02-13 11:35:01 -08:00
trampoline_64.S x86: use _types.h headers in asm where available 2009-02-13 11:35:01 -08:00
trampoline.c x86: change static allocation of trampoline area 2008-12-08 13:49:45 +01:00
traps.c x86-32: use non-lazy io bitmap context switching 2009-03-02 12:07:48 +01:00
tsc_sync.c Merge branches 'x86/apic', 'x86/cleanups', 'x86/cpufeature', 'x86/crashdump', 'x86/debug', 'x86/defconfig', 'x86/detect-hyper', 'x86/doc', 'x86/dumpstack', 'x86/early-printk', 'x86/fpu', 'x86/idle', 'x86/io', 'x86/memory-corruption-check', 'x86/microcode', 'x86/mm', 'x86/mtrr', 'x86/nmi-watchdog', 'x86/pat2', 'x86/pci-ioapic-boot-irq-quirks', 'x86/ptrace', 'x86/quirks', 'x86/reboot', 'x86/setup-memory', 'x86/signal', 'x86/sparse-fixes', 'x86/time', 'x86/uv' and 'x86/xen' into x86/core 2008-12-23 16:27:23 +01:00
tsc.c clocksource: pass clocksource to read() callback 2009-04-21 13:41:47 -07:00
uv_irq.c x86, apic: remove genapic.h 2009-02-17 17:52:44 +01:00
uv_sysfs.c x86: prevent /sys/firmware/sgi_uv from being created on non-uv systems 2009-04-08 14:58:10 +02:00
uv_time.c uv_time: add parameter to uv_read_rtc() 2009-04-22 17:41:25 +02:00
verify_cpu_64.S
visws_quirks.c x86: convert obsolete irq_desc_t typedef to struct irq_desc 2009-03-11 09:49:01 +01:00
vm86_32.c x86: use regparm(3) for passed-in pt_regs pointer 2009-02-11 14:00:56 -08:00
vmi_32.c x86: with the last user gone, remove set_pte_present 2009-03-19 14:04:19 +01:00
vmiclock_32.c clocksource: pass clocksource to read() callback 2009-04-21 13:41:47 -07:00
vmlinux_32.lds.S x86-32: move _end to a dummy section 2009-03-17 14:16:02 -07:00
vmlinux_64.lds.S x86/brk: put the brk reservations in their own section 2009-03-17 12:58:15 -07:00
vmlinux.lds.S
vsmp_64.c Revert "x86: don't compile vsmp_64 for 32bit" 2009-03-25 21:34:28 +01:00
vsyscall_64.c Merge branch 'tracing-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip 2008-12-28 12:21:10 -08:00
x8664_ksyms_64.c x86: convert pda ops to wrappers around x86 percpu accessors 2009-01-16 14:20:22 +01:00
xsave.c x86-64: fix FPU corruption with signals and preemption 2009-04-20 14:33:00 -07:00