linux/arch/powerpc/kvm
Stewart Smith 9678cdaae9 Use the POWER8 Micro Partition Prefetch Engine in KVM HV on POWER8
The POWER8 processor has a Micro Partition Prefetch Engine, which is
a fancy way of saying "has a way to store and load the contents of L2 or
L2+MRU way of L3 cache". We initiate the storing of the log (list of
addresses) using the logmpp instruction and start restore by writing
to a SPR.

The logmpp instruction takes parameters in a single 64-bit register:
- starting address of the table to store log of L2/L2+L3 cache contents
  - 32 KB for L2
  - 128 KB for L2+L3
  - Aligned relative to maximum size of the table (32 KB or 128 KB)
- Log control (no-op, L2 only, L2 and L3, abort logout)
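The operand layout described above can be sketched in C. This is only an illustration: the table sizes come from the text, but the bit position of the log-control field and the numeric control codes are assumptions made up for the example, not the real POWER8 encoding, and all names (`mpp_logmpp_operand` and friends) are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Table sizes stated in the text. */
#define MPP_L2_TABLE_SIZE   (32ULL * 1024)   /* L2 only */
#define MPP_L2L3_TABLE_SIZE (128ULL * 1024)  /* L2 + L3 */

/* ASSUMPTION: field position and codes are illustrative only. */
#define MPP_LOG_CTL_SHIFT 54
enum mpp_log_ctl {
    MPP_LOG_NOOP  = 0,
    MPP_LOG_L2    = 1,
    MPP_LOG_L2L3  = 2,
    MPP_LOG_ABORT = 3
};

/* The table must be aligned to its own (power-of-two) size. */
static inline int mpp_table_aligned(uint64_t addr, uint64_t table_size)
{
    return (addr & (table_size - 1)) == 0;
}

/* Pack table address and log control into one 64-bit operand. */
static inline uint64_t mpp_logmpp_operand(uint64_t table_addr,
                                          uint64_t table_size,
                                          enum mpp_log_ctl ctl)
{
    assert(mpp_table_aligned(table_addr, table_size));
    return table_addr | ((uint64_t)ctl << MPP_LOG_CTL_SHIFT);
}
```

Because the address is size-aligned, its low bits are zero, which is what leaves room for control fields to share the same register.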

We should abort any ongoing logging before initiating one.
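The abort-then-log ordering can be shown with a small user-space sketch. The real logmpp instruction only exists on POWER8, so `fake_logmpp` below is a stand-in that records operands. The control codes and all function names are assumptions for illustration, not the kernel's definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* ASSUMPTION: illustrative control encodings, not the real ones. */
#define LOG_CTL_SHIFT 54
#define LOG_CTL_L2    (1ULL << LOG_CTL_SHIFT)
#define LOG_CTL_ABORT (3ULL << LOG_CTL_SHIFT)

/* Stand-in for the logmpp instruction: records each operand. */
#define MAX_ISSUED 8
static uint64_t issued[MAX_ISSUED];
static size_t n_issued;

static void fake_logmpp(uint64_t operand)
{
    assert(n_issued < MAX_ISSUED);
    issued[n_issued++] = operand;
}

/* Abort any in-flight log first, then start a fresh L2 log. */
static void start_l2_log(uint64_t table_addr)
{
    fake_logmpp(table_addr | LOG_CTL_ABORT);
    fake_logmpp(table_addr | LOG_CTL_L2);
}
```

Issuing the abort unconditionally keeps the caller simple: it does not need to track whether an earlier log is still running.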

To initiate restore, we write to the MPPR SPR. The format of what to write
to the SPR is similar to the logmpp instruction parameter:
- starting address of the table to read from (same alignment requirements)
- table size (no data, until end of table)
- prefetch rate (from fastest possible to slower: about every 8, 16, 24 or
  32 cycles)
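A value for the MPPR write could be assembled along these lines. The three fields come from the list above, but the field positions, the numeric codes, and the mapping of four rate codes onto 8/16/24/32 cycles are assumptions for illustration. All names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* ASSUMPTION: field positions and codes are illustrative only. */
#define MPPR_SIZE_SHIFT 60
#define MPPR_RATE_SHIFT 56

enum mppr_size { MPPR_NO_DATA = 0, MPPR_WHOLE_TABLE = 1 };
enum mppr_rate { MPPR_RATE_0 = 0, MPPR_RATE_1, MPPR_RATE_2, MPPR_RATE_3 };

/* Approximate cycles between prefetches for an assumed rate code. */
static inline unsigned mppr_rate_cycles(enum mppr_rate r)
{
    return 8u * ((unsigned)r + 1u);
}

/* Pack table address, table size selector and prefetch rate. */
static inline uint64_t mppr_value(uint64_t table_addr, uint64_t align,
                                  enum mppr_size size, enum mppr_rate rate)
{
    /* Same alignment requirement as the logmpp table address. */
    assert((table_addr & (align - 1)) == 0);
    return table_addr
         | ((uint64_t)size << MPPR_SIZE_SHIFT)
         | ((uint64_t)rate << MPPR_RATE_SHIFT);
}
```

In this sketch, writing a value with `MPPR_NO_DATA` plays the role of "nothing to restore", while `MPPR_WHOLE_TABLE` asks the engine to prefetch until the end of the logged table.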

The idea behind loading and storing the contents of L2/L3 cache is to
reduce memory latency in a system that is frequently swapping vcores on
a physical CPU.

The best case scenario for doing this is when some vcores are doing very
cache heavy workloads. The worst case is when they have about 0 cache hits,
so we just generate needless memory operations.

This implementation just does L2 store/load. In my benchmarks this proves
to be useful.

Benchmark 1:
 - 16 core POWER8
 - 3x Ubuntu 14.04LTS guests (LE) with 8 VCPUs each
 - No split core/SMT
 - two guests running sysbench memory test.
   sysbench --test=memory --num-threads=8 run
 - one guest running apache bench (of default HTML page)
   ab -n 490000 -c 400 http://localhost/

This benchmark aims to measure the performance of a real-world application
(Apache) where other guests are cache hot with their own workloads. The sysbench
memory benchmark does pointer-sized writes to a (small) memory buffer in a loop.

In this benchmark with this patch I can see an improvement both in requests
per second (~5%) and in mean and median response times (again, about 5%).
The spread of minimum and maximum response times was largely unchanged.

Benchmark 2:
 - Same VM config as benchmark 1
 - all three guests running sysbench memory benchmark

This benchmark aims to see whether there is a positive or negative effect on
this cache-heavy benchmark. Due to the nature of the benchmark (stores), we may
not see a difference in performance, but hopefully an improvement in
consistency of performance (when a vcore is switched in, it doesn't have to
wait many times for cachelines to be pulled in).

The results of this benchmark are improvements in consistency of performance
rather than performance itself. With this patch, the few outliers in duration
go away and we get more consistent performance in each guest.

Benchmark 3:
 - same 3 guests and CPU configuration as benchmark 1 and 2.
 - two idle guests
 - 1 guest running STREAM benchmark

This scenario also saw a performance improvement with this patch. On the Copy
and Scale workloads from STREAM, I got a 5-6% improvement with this patch. For
Add and Triad, it was around 10% (or more).

Benchmark 4:
 - same 3 guests as previous benchmarks
 - two guests running sysbench --memory, distinctly different cache heavy
   workload
 - one guest running STREAM benchmark.

Similar improvements to benchmark 3.

Benchmark 5:
 - 1 guest, 8 VCPUs, Ubuntu 14.04
 - Host configured with split core (SMT8, subcores-per-core=4)
 - STREAM benchmark

In this benchmark, we see a 10-20% performance improvement across the board
of STREAM benchmark results with this patch.

Based on preliminary investigation and microbenchmarks
by Prerna Saxena <prerna@linux.vnet.ibm.com>

Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
Acked-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2014-07-28 15:23:17 +02:00
book3s_32_mmu_host.c KVM: PPC: Book3S: Make magic page properly 4k mappable 2014-07-28 15:23:11 +02:00
book3s_32_mmu.c KVM: PPC: Book3S: Stop PTE lookup on write errors 2014-07-28 15:23:10 +02:00
book3s_32_sr.S
book3s_64_mmu_host.c KVM: PPC: Book3S: Make magic page properly 4k mappable 2014-07-28 15:23:11 +02:00
book3s_64_mmu_hv.c KVM: PPC: Allow kvmppc_get_last_inst() to fail 2014-07-28 15:23:14 +02:00
book3s_64_mmu.c KVM: PPC: Disable NX for old magic page using guests 2014-05-30 14:26:24 +02:00
book3s_64_slb.S KVM: PPC: Book3S PR: Rework SLB switching code 2014-05-30 14:26:30 +02:00
book3s_64_vio_hv.c KVM: PPC: Book3S: Introduce hypervisor call H_GET_TCE 2014-03-26 23:34:27 +11:00
book3s_64_vio.c ppc: kvm: use anon_inode_getfd() with O_CLOEXEC flag 2013-08-26 13:19:56 +03:00
book3s_emulate.c KVM: PPC: BOOK3S: PR: Emulate instruction counter 2014-07-28 15:22:10 +02:00
book3s_exports.c KVM: PPC: Make shared struct aka magic page guest endian 2014-05-30 14:26:21 +02:00
book3s_hv_builtin.c KVM: PPC: Book3S: Allow only implemented hcalls to be enabled or disabled 2014-07-28 15:22:18 +02:00
book3s_hv_cma.c powerpc/kvm: Use 256K chunk to track both RMA and hash page table allocation. 2013-07-08 16:21:13 +02:00
book3s_hv_cma.h powerpc/kvm: Use 256K chunk to track both RMA and hash page table allocation. 2013-07-08 16:21:13 +02:00
book3s_hv_interrupts.S powerpc: No need to use dot symbols when branching to a function 2014-04-23 10:05:16 +10:00
book3s_hv_ras.c KVM: PPC: Book3S HV: Access guest VPA in BE 2014-07-28 15:22:22 +02:00
book3s_hv_rm_mmu.c KVM: PPC: Book3S HV: Make HTAB code LE host aware 2014-07-28 15:22:22 +02:00
book3s_hv_rm_xics.c KVM: PPC: Book3S HV: Add support for real mode ICP in XICS emulation 2013-04-26 20:27:32 +02:00
book3s_hv_rmhandlers.S KVM: PPC: Book3S HV: Fix ABIv2 on LE 2014-07-28 15:22:25 +02:00
book3s_hv.c Use the POWER8 Micro Partition Prefetch Engine in KVM HV on POWER8 2014-07-28 15:23:17 +02:00
book3s_interrupts.S KVM: PPC: Book3S PR: Fix ABIv2 on LE 2014-07-28 15:22:15 +02:00
book3s_mmu_hpte.c kvm: powerpc: book3s: pr: move PR related tracepoints to a separate header 2013-10-17 15:36:22 +02:00
book3s_paired_singles.c KVM: PPC: Allow kvmppc_get_last_inst() to fail 2014-07-28 15:23:14 +02:00
book3s_pr_papr.c KVM: PPC: Book3S PR: Take SRCU read lock around RTAS kvm_read_guest() call 2014-07-28 15:23:16 +02:00
book3s_pr.c KVM: PPC: Book3S: Fix LPCR one_reg interface 2014-07-28 15:23:16 +02:00
book3s_rmhandlers.S KVM: PPC: Book3S PR: Fix ABIv2 on LE 2014-07-28 15:22:15 +02:00
book3s_rtas.c KVM: PPC: Book3S PR: PAPR: Access RTAS in big endian 2014-05-30 14:26:20 +02:00
book3s_segment.S KVM: PPC: Book3S PR: Handle Facility interrupt and FSCR 2014-05-30 14:26:22 +02:00
book3s_xics.c KVM: PPC: fix couple of memory leaks in MPIC/XICS devices 2014-01-09 10:14:54 +01:00
book3s_xics.h KVM: PPC: Book3S: Add API for in-kernel XICS emulation 2013-05-02 15:28:36 +02:00
book3s.c KVM: PPC: Book3S: Make kvmppc_ld return a more accurate error indication 2014-07-28 15:23:16 +02:00
book3s.h kvm: powerpc: book3s: Allow the HV and PR selection per virtual machine 2013-10-17 18:42:36 +02:00
booke_emulate.c kvm: ppc: booke: Use the shared struct helpers for SPRN_SPRG0-7 2014-07-28 15:23:12 +02:00
booke_interrupts.S KVM: PPC: Remove 440 support 2014-07-28 15:23:15 +02:00
booke.c KVM: PPC: Bookehv: Get vcpu's last instruction for emulation 2014-07-28 15:23:14 +02:00
booke.h KVM: PPC: Remove 440 support 2014-07-28 15:23:15 +02:00
bookehv_interrupts.S KVM: PPC: Remove 440 support 2014-07-28 15:23:15 +02:00
e500_emulate.c KVM: PPC: e500: Emulate power management control SPR 2014-07-28 15:22:27 +02:00
e500_mmu_host.c KVM: PPC: Bookehv: Get vcpu's last instruction for emulation 2014-07-28 15:23:14 +02:00
e500_mmu_host.h
e500_mmu.c KVM: PPC: e500: Fix bad address type in deliver_tlb_misss() 2014-01-27 16:00:54 +01:00
e500.c KVM: PPC: Add devname:kvm aliases for modules 2014-01-09 10:14:00 +01:00
e500.h kvm: powerpc: use caching attributes as per linux pte 2014-01-09 10:15:08 +01:00
e500mc.c KVM: PPC: Booke-hv: Add one reg interface for SPRG9 2014-07-28 15:23:15 +02:00
emulate.c KVM: PPC: Allow kvmppc_get_last_inst() to fail 2014-07-28 15:23:14 +02:00
fpu.S
irq.h KVM: PPC: Book3S: Add API for in-kernel XICS emulation 2013-05-02 15:28:36 +02:00
Kconfig KVM: PPC: Remove 440 support 2014-07-28 15:23:15 +02:00
Makefile KVM: PPC: Remove 440 support 2014-07-28 15:23:15 +02:00
mpic.c KVM: PPC: MPIC: Reset IRQ source private members 2014-05-30 14:26:26 +02:00
powerpc.c KVM: PPC: Remove 440 support 2014-07-28 15:23:15 +02:00
timing.c
timing.h
trace_booke.h kvm: powerpc: booke: Move booke related tracepoints to separate header 2013-10-17 15:37:16 +02:00
trace_pr.h KVM: PPC: Make shared struct aka magic page guest endian 2014-05-30 14:26:21 +02:00
trace.h kvm: powerpc: booke: Move booke related tracepoints to separate header 2013-10-17 15:37:16 +02:00