Pull locking updates from Ingo Molnar:
"The main changes in this cycle are:
- rwsem scalability improvements, phase #2, by Waiman Long, which are
rather impressive:
"On a 2-socket 40-core 80-thread Skylake system with 40 reader
and writer locking threads, the min/mean/max locking operations
done in a 5-second testing window before the patchset were:
40 readers, Iterations Min/Mean/Max = 1,807/1,808/1,810
40 writers, Iterations Min/Mean/Max = 1,807/50,344/151,255
After the patchset, they became:
40 readers, Iterations Min/Mean/Max = 30,057/31,359/32,741
40 writers, Iterations Min/Mean/Max = 94,466/95,845/97,098"
There's a lot of changes to the locking implementation that makes
it similar to qrwlock, including owner handoff for more fair
locking.
Another microbenchmark shows how across the spectrum the
improvements are:
"With a locking microbenchmark running on 5.1 based kernel, the
total locking rates (in kops/s) on a 2-socket Skylake system
with equal numbers of readers and writers (mixed) before and
after this patchset were:
# of Threads Before Patch After Patch
------------ ------------ -----------
2 2,618 4,193
4 1,202 3,726
8 802 3,622
16 729 3,359
32 319 2,826
64 102 2,744"
The changes are extensive and the patch-set has been through
several iterations addressing various locking workloads. There
might be more regressions, but unless they are pathological I
believe we want to use this new implementation as the baseline
going forward.
- jump-label optimizations by Daniel Bristot de Oliveira: the primary
motivation was to remove IPI disturbance of isolated RT-workload
CPUs, which resulted in the implementation of batched jump-label
updates. Beyond the improvement of the real-time characteristics
kernel, in one test this patchset improved static key update
overhead from 57 msecs to just 1.4 msecs - which is a nice speedup
as well.
- atomic64_t cross-arch type cleanups by Mark Rutland: over the last
~10 years of atomic64_t existence the various types used by the
APIs only had to be self-consistent within each architecture -
which means they became wildly inconsistent across architectures.
Mark puts and end to this by reworking all the atomic64
implementations to use 's64' as the base type for atomic64_t, and
to ensure that this type is consistently used for parameters and
return values in the API, avoiding further problems in this area.
- A large set of small improvements to lockdep by Yuyang Du: type
cleanups, output cleanups, function return type and othr cleanups
all around the place.
- A set of percpu ops cleanups and fixes by Peter Zijlstra.
- Misc other changes - please see the Git log for more details"
* 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (82 commits)
locking/lockdep: increase size of counters for lockdep statistics
locking/atomics: Use sed(1) instead of non-standard head(1) option
locking/lockdep: Move mark_lock() inside CONFIG_TRACE_IRQFLAGS && CONFIG_PROVE_LOCKING
x86/jump_label: Make tp_vec_nr static
x86/percpu: Optimize raw_cpu_xchg()
x86/percpu, sched/fair: Avoid local_clock()
x86/percpu, x86/irq: Relax {set,get}_irq_regs()
x86/percpu: Relax smp_processor_id()
x86/percpu: Differentiate this_cpu_{}() and __this_cpu_{}()
locking/rwsem: Guard against making count negative
locking/rwsem: Adaptive disabling of reader optimistic spinning
locking/rwsem: Enable time-based spinning on reader-owned rwsem
locking/rwsem: Make rwsem->owner an atomic_long_t
locking/rwsem: Enable readers spinning on writer
locking/rwsem: Clarify usage of owner's nonspinaable bit
locking/rwsem: Wake up almost all readers in wait queue
locking/rwsem: More optimal RT task handling of null owner
locking/rwsem: Always release wait_lock before waking up tasks
locking/rwsem: Implement lock handoff to prevent lock starvation
locking/rwsem: Make rwsem_spin_on_owner() return owner state
...
Pull SMP/hotplug updates from Thomas Gleixner:
"A small set of updates for SMP and CPU hotplug:
- Abort disabling secondary CPUs in the freezer when a wakeup is
pending instead of evaluating it only after all CPUs have been
offlined.
- Remove the shared annotation for the strict per CPU cfd_data in the
smp function call core code.
- Remove the return values of smp_call_function() and on_each_cpu()
as they are unconditionally 0. Fixup the few callers which actually
bothered to check the return value"
* 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
smp: Remove smp_call_function() and on_each_cpu() return values
smp: Do not mark call_function_data as shared
cpu/hotplug: Abort disabling secondary CPUs if wakeup is pending
cpu/hotplug: Fix notify_cpu_starting() reference in bringup_wait_for_ap()
Pull s390 updates from Vasily Gorbik:
- Improve stop_machine wait logic: replace cpu_relax_yield call in
generic stop_machine function with a weak stop_machine_yield
function. This is overridden on s390, which yields the current cpu to
the neighbouring cpu after a couple of retries, instead of blindly
giving up the cpu to the hipervisor. This significantly improves
stop_machine performance on s390 in overcommitted scenarios.
This includes common code changes which have been Acked by Peter
Zijlstra and Thomas Gleixner.
- Improve jump label transformation speed: transform jump labels
without using stop_machine.
- Refactoring of the vfio-ccw cp handling, simplifying the code and
avoiding unneeded allocating/copying.
- Various vfio-ccw fixes (ccw translation, state machine).
- Add support for vfio-ap queue interrupt control in the guest. This
includes s390 kvm changes which have been Acked by Christian
Borntraeger.
- Add protected virtualization support for virtio-ccw.
- Enforce both CONFIG_SMP and CONFIG_HOTPLUG_CPU, which allows to
remove some code which most likely isn't working at all, besides that
s390 didn't even compile for !CONFIG_SMP.
- Support for special flagged EP11 CPRBs for zcrypt.
- Handle PCI devices with no support for new MIO instructions.
- Avoid KASAN false positives in reworked stack unwinder.
- Couple of fixes for the QDIO layer.
- Convert s390 specific documentation to ReST format.
- Let s390 crypto modules return -ENODEV instead of -EOPNOTSUPP if
hardware is missing. This way our modules behave like most other
modules and which is also what systemd's systemd-modules-load.service
expects.
- Replace defconfig with performance_defconfig, so there is one config
file less to maintain.
- Remove the SCLP call home device driver, which was never useful.
- Cleanups all over the place.
* tag 's390-5.3-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (83 commits)
docs: s390: s390dbf: typos and formatting, update crash command
docs: s390: unify and update s390dbf kdocs at debug.c
docs: s390: restore important non-kdoc parts of s390dbf.rst
vfio-ccw: Fix the conversion of Format-0 CCWs to Format-1
s390/pci: correctly handle MIO opt-out
s390/pci: deal with devices that have no support for MIO instructions
s390: ap: kvm: Enable PQAP/AQIC facility for the guest
s390: ap: implement PAPQ AQIC interception in kernel
vfio: ap: register IOMMU VFIO notifier
s390: ap: kvm: add PQAP interception for AQIC
s390/unwind: cleanup unused READ_ONCE_TASK_STACK
s390/kasan: avoid false positives during stack unwind
s390/qdio: don't touch the dsci in tiqdio_add_input_queues()
s390/qdio: (re-)initialize tiqdio list entries
s390/dasd: Fix a precision vs width bug in dasd_feature_list()
s390/cio: introduce driver_override on the css bus
vfio-ccw: make convert_ccw0_to_ccw1 static
vfio-ccw: Remove copy_ccw_from_iova()
vfio-ccw: Factor out the ccw0-to-ccw1 transition
vfio-ccw: Copy CCW data outside length calculation
...
Pull arm64 updates from Catalin Marinas:
- arm64 support for syscall emulation via PTRACE_SYSEMU{,_SINGLESTEP}
- Wire up VM_FLUSH_RESET_PERMS for arm64, allowing the core code to
manage the permissions of executable vmalloc regions more strictly
- Slight performance improvement by keeping softirqs enabled while
touching the FPSIMD/SVE state (kernel_neon_begin/end)
- Expose a couple of ARMv8.5 features to user (HWCAP): CondM (new
XAFLAG and AXFLAG instructions for floating point comparison flags
manipulation) and FRINT (rounding floating point numbers to integers)
- Re-instate ARM64_PSEUDO_NMI support which was previously marked as
BROKEN due to some bugs (now fixed)
- Improve parking of stopped CPUs and implement an arm64-specific
panic_smp_self_stop() to avoid warning on not being able to stop
secondary CPUs during panic
- perf: enable the ARM Statistical Profiling Extensions (SPE) on ACPI
platforms
- perf: DDR performance monitor support for iMX8QXP
- cache_line_size() can now be set from DT or ACPI/PPTT if provided to
cope with a system cache info not exposed via the CPUID registers
- Avoid warning on hardware cache line size greater than
ARCH_DMA_MINALIGN if the system is fully coherent
- arm64 do_page_fault() and hugetlb cleanups
- Refactor set_pte_at() to avoid redundant READ_ONCE(*ptep)
- Ignore ACPI 5.1 FADTs reported as 5.0 (infer from the
'arm_boot_flags' introduced in 5.1)
- CONFIG_RANDOMIZE_BASE now enabled in defconfig
- Allow the selection of ARM64_MODULE_PLTS, currently only done via
RANDOMIZE_BASE (and an erratum workaround), allowing modules to spill
over into the vmalloc area
- Make ZONE_DMA32 configurable
* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (54 commits)
perf: arm_spe: Enable ACPI/Platform automatic module loading
arm_pmu: acpi: spe: Add initial MADT/SPE probing
ACPI/PPTT: Add function to return ACPI 6.3 Identical tokens
ACPI/PPTT: Modify node flag detection to find last IDENTICAL
x86/entry: Simplify _TIF_SYSCALL_EMU handling
arm64: rename dump_instr as dump_kernel_instr
arm64/mm: Drop [PTE|PMD]_TYPE_FAULT
arm64: Implement panic_smp_self_stop()
arm64: Improve parking of stopped CPUs
arm64: Expose FRINT capabilities to userspace
arm64: Expose ARMv8.5 CondM capability to userspace
arm64: defconfig: enable CONFIG_RANDOMIZE_BASE
arm64: ARM64_MODULES_PLTS must depend on MODULES
arm64: bpf: do not allocate executable memory
arm64/kprobes: set VM_FLUSH_RESET_PERMS on kprobe instruction pages
arm64/mm: wire up CONFIG_ARCH_HAS_SET_DIRECT_MAP
arm64: module: create module allocations without exec permissions
arm64: Allow user selection of ARM64_MODULE_PLTS
acpi/arm64: ignore 5.1 FADTs that are reported as 5.0
arm64: Allow selecting Pseudo-NMI again
...
* pm-sleep:
PM: sleep: Drop dev_pm_skip_next_resume_phases()
ACPI: PM: Drop unused function and function header
ACPI: PM: Introduce "poweroff" callbacks for ACPI PM domain and LPSS
ACPI: PM: Simplify and fix PM domain hibernation callbacks
PCI: PM: Simplify bus-level hibernation callbacks
PM: ACPI/PCI: Resume all devices during hibernation
kernel: power: swap: use kzalloc() instead of kmalloc() followed by memset()
PM: sleep: Update struct wakeup_source documentation
drivers: base: power: remove wakeup_sources_stats_dentry variable
PM: suspend: Rename pm_suspend_via_s2idle()
PM: sleep: Show how long dpm_suspend_start() and dpm_suspend_end() take
PM: hibernate: powerpc: Expose pfn_is_nosave() prototype
To increase readability/maintainability, replace hard coded
instructions values by symbolic names.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
[mpe: Fix R_PPC64_ENTRY case, the addi reads from r2 not r12]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
To increase readability/maintainability, replace hard coded
instructions values by symbolic names.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
PPC_HA() PPC_HI() and PPC_LO() macros are nice macros. Move them
from module64.c to ppc-opcode.h in order to use them in other places.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
[mpe: Clean up formatting in new code, drop duplicates in ftrace.c]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The comment here is wrong, the addi reads from r2 not r12. The code is
correct, 0x38420000 = addi r2,r2,0.
Fixes: a61674bdfc ("powerpc/module: Handle R_PPC64_ENTRY relocations")
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This patch modifies the generation of uImage by handing over
the selected compression type instead of forcing gzip
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Some SCC functions like the QMC requires an extended parameter RAM.
On modern 8xx (ie 866 and 885), SPI area can already be relocated,
allowing the use of those functions on SCC2. But SCC3 and SCC4
parameter RAM collide with SMC1 and SMC2 parameter RAMs.
This patch adds microcode to allow the relocation of both SMC1 and
SMC2, and relocate them at offsets 0x1ec0 and 0x1fc0.
Those offsets are by default for the CPM1 DSP1 and DSP2, but there
is no kernel driver using them at the moment so this area can be
reused.
This microcode is provided by Freescale/NXP in Engineering Bulletin
EB662 ("MPC8xx I2C/SPI and SMC Relocation Microcode Packages")
dated 2006. The binary code is public. The source is not available.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Change microcode functions to use IO accessors and get rid
of volatile attributes.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The CPM registers RCCR and CPMCR1..4 registers has to be set in
accordance with the microcode patch beeing programmed. Lets
define them as part of the patch set and refactor their
programming from that definition.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Define patch name together with the patch code, and refactor
the associated printk() while replacing it by a pr_info()
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Add empty microcode tables so that all tables are defined
all the time. Regroup the writing of the 3 tables regardless
of the selected microcode.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Create a function to refactor the writing of CPM microcode arrays.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Compact obscure microcode arrays by putting 4 values per line
in order to reduce number of lines in the file to increase
readability.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
verify_patch() has been opted out since many years, and
the comment suggests it doesn't work. So drop it.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Only 8xx selects CPM1 and related CONFIG options are already
in platforms/8xx/Kconfig
Move the related C files to platforms/8xx/.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
[mpe: Minor formatting fixes]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This patch defines C helpers to retrieve the size of
cache blocks and uses them in the cacheflush functions.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
On most arches having function flush_dcache_range(), including PPC32,
this function does a writeback and invalidation of the cache bloc.
On PPC64, flush_dcache_range() only does a writeback while
flush_inval_dcache_range() does the invalidation in addition.
In addition it looks like within arch/powerpc/, there are no PPC64
platforms using flush_dcache_range()
This patch drops the existing 64 bits version of flush_dcache_range()
and renames flush_inval_dcache_range() into flush_dcache_range().
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Cache instructions (dcbz, dcbi, dcbf and dcbst) take two registers
that are summed to obtain the target address. Using 'Z' constraint
and '%y0' argument gives GCC the opportunity to use both registers
instead of only one with the second being forced to 0.
Suggested-by: Segher Boessenkool <segher@kernel.crashing.org>
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This makes sure we don't enable HugeTLB if the cache is not configured.
I am still not sure about this. IMHO hugetlb support should be a hardware
support derivative and any cache allocation failure should be handled as I did
in the earlier patch. But then if we were not able to create hugetlb page table
cache, we can as well declare hugetlb support disabled thereby avoiding calling
into allocation routines.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
We only check for hugetlb allocations, because with hugetlb we do conditional
registration. For PGD/PUD/PMD levels we register them always in
pgtable_cache_init.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This fixes kernel crash that arises due to not handling page table allocation
failures while allocating hugetlb page table.
Fixes: e2b3d202d1 ("powerpc: Switch 16GB and 16MB explicit hugepages to a different page table format")
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Now that we have switched the page table walk to use pmd_is_leaf we can now
revert commit 8adddf349f ("powerpc/mm/radix: Make Radix require HUGETLB_PAGE")
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
large devmap usage is dependent on THP. Hence once check is sufficient.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Even when we have HugeTLB and THP disabled, kernel linear map can still be
mapped with hugepages. This is only an issue with radix translation because hash
MMU doesn't map kernel linear range in linux page table and other kernel
map areas are not mapped using hugepage.
Add config independent helpers and put WARN_ON() when we don't expect things
to be mapped via hugepages.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
We used uuid_parse to convert uuid string from device tree to two u64
components. We want to make sure we look at the uuid read from device
tree in an endian-neutral fashion. For now, I am picking little-endian
to be format so that we don't end up doing an additional conversion.
The reason to store in a specific endian format is to enable reading
the namespace created with a little-endian kernel config on a
big-endian kernel. We do store the device tree uuid string as a 64-bit
little-endian cookie in the label area. When booting the kernel we
also compare this cookie against what is read from the device tree.
For this, to work we have to store and compare these values in a CPU
endian config independent fashion.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
SCM_READ/WRITE_MEATADATA hcall supports multibyte read/write. This patch
updates the metadata read/write to use 1, 2, 4 or 8 byte read/write as
mentioned in PAPR document.
READ/WRITE_METADATA hcall supports the 1, 2, 4, or 8 bytes read/write.
For other values hcall results H_P3.
Hypervisor stores the metadata contents in big-endian format and in-order
to enable read/write in different granularity, we need to switch the contents
to big-endian before calling HCALL.
Based on an patch from Oliver O'Halloran <oohall@gmail.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The device tree node is documented as below:
“ibm,cache-flush-required”:
property name indicates Cache Flush Required for this Persistent Memory Segment to persist memory
prop-encoded-array: None, this is a name only property.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Allocation from altmap area can fail based on vmemmap page size used.
Add kernel info message to indicate the failure. That allows the user
to identify whether they are really using persistent memory reserved
space for per-page metadata.
The message looks like:
[ 136.587212] altmap block allocation failed, falling back to system memory
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Reviewed-by: Oliver O'Halloran <oohall@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
If we fail to parse min_common_depth from device tree we boot with
numa disabled. Reflect the same by updating numa_enabled variable
to false. Also, switch all min_common_depth failure check to
if (!numa_enabled) check.
This helps us to avoid checking for both in different code paths.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
If we fail to parse the associativity array we should default to
NUMA_NO_NODE instead of NODE 0. Rest of the code fallback to the
right default if we find the numa node value NUMA_NO_NODE.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
We use mmu_vmemmap_psize to find the page size for mapping the vmmemap area.
With radix translation, we are suboptimally setting this value to PAGE_SIZE.
We do check for 2M page size support and update mmu_vmemap_psize to use
hugepage size but we suboptimally reset the value to PAGE_SIZE in
radix__early_init_mmu(). This resulted in always mapping vmemmap area with
64K page size.
Fixes: 2bfd65e45e ("powerpc/mm/radix: Add radix callbacks for early init routines")
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
With hash translation and 4K PAGE_SIZE config, we need to make sure we don't
use 64K page size for vmemmap.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Since commit 0034d395f8 ("powerpc/mm/hash64: Map all the kernel
regions in the same 0xc range") __kernel_virt_size is not used
anymore.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
When enabling or disabling the vcpu dispatch statistics, we do a lot of
work including allocating/deallocating memory across all possible cpus
for the DTL buffer. In order to guard against hogging the cpu for too
long, track the time we're taking and yield the processor if necessary.
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
For Shared Processor LPARs, the POWER Hypervisor maintains a
relatively static mapping of the LPAR processors (vcpus) to physical
processor chips (representing the "home" node) and tries to always
dispatch vcpus on their associated physical processor chip. However,
under certain scenarios, vcpus may be dispatched on a different
processor chip (away from its home node). The actual physical
processor number on which a certain vcpu is dispatched is available to
the guest in the 'processor_id' field of each DTL entry.
The guest can discover the home node of each vcpu through the
H_HOME_NODE_ASSOCIATIVITY(flags=1) hcall. The guest can also discover
the associativity of physical processors, as represented in the DTL
entry, through the H_HOME_NODE_ASSOCIATIVITY(flags=2) hcall.
These can then be compared to determine if the vcpu was dispatched on
its home node or not. If the vcpu was not dispatched on the home node,
it is possible to determine if the vcpu was dispatched in a different
chip, socket or drawer.
Introduce a procfs file /proc/powerpc/vcpudispatch_stats that can be
used to obtain these statistics. Writing '1' to this file enables
collecting the statistics, while writing '0' disables the statistics.
The statistics themselves are available by reading the procfs file. By
default, the DTLB log for each vcpu is processed 50 times a second so
as not to miss any entries. This processing frequency can be changed
through /proc/powerpc/vcpudispatch_stats_freq.
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
hcall_vphn() is specific to pseries and will be used in a subsequent
patch. So, move it to a more appropriate place under
arch/powerpc/platforms/pseries. Also merge vphn.h into lppaca.h
and update vphn selftest to use the new files.
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
H_HOME_NODE_ASSOCIATIVITY hcall can take two different flags and return
different associativity information in each case. Generalize the
existing hcall_vphn() function to take flags as an argument and to
return the result. Update the only existing user to pass the proper
arguments.
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Since we would be introducing a new user of the DTL buffer in a
subsequent patch, we need a way to gatekeep use of the DTL buffer.
The current debugfs interface for DTL allows registering and opening
cpu-specific DTL buffers. Cpu specific files are exposed under
debugfs 'powerpc/dtl/' node, and changing 'dtl_event_mask' in the same
directory enables controlling the event mask used when registering DTL
buffer for a particular cpu.
Subsequently, we will be introducing a user of the DTL buffers that
registers access to the DTL buffers across all cpus with the same event
mask. To ensure these two users do not step on each other, we introduce
a rwlock to gatekeep DTL buffer access. This fits the requirement of the
current debugfs interface wanting to allow multiple independent
cpu-specific users (read lock), and the subsequent user wanting
exclusive access (write lock).
Suggested-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Introduce new helpers for DTL buffer allocation and registration and
have the existing code use those.
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
[mpe: Don't split error messages across lines, for grepability]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
When CONFIG_VIRT_CPU_ACCOUNTING_NATIVE is enabled, we always initialize
DTL enable mask to DTL_LOG_PREEMPT (0x2). There are no other places
where the mask is changed. As such, when reading the DTL log buffer
through debugfs, there is no need to save and restore the previous mask
value.
We don't need to save and restore the earlier mask value if
CONFIG_VIRT_CPU_ACCOUNTING_NATIVE is not enabled. So, remove the field
from the structure as well.
Acked-by: Nathan Lynch <nathanl@linux.ibm.com>
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>