The instruction latency information can be recorded on
some platforms, e.g., the Intel Sapphire Rapids server. With both memory
latency (weight) and the new instruction latency information, users can
easily locate the expensive load instructions, and also understand the time
spent in different stages. The users can optimize their applications in
different pipeline stages.
Add a new field "ins_lat" to filter the instruction latency information,
which is available with sample type PERF_SAMPLE_WEIGHT_STRUCT.
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Joe Mario <jmario@redhat.com>
Link: https://lore.kernel.org/r/1632929894-102778-2-git-send-email-kan.liang@linux.intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Add a new dlfilter to show cycles.
Cycle counts are accumulated per CPU (or per thread if CPU is not recorded)
from IPC information, and printed together with the change since the last
print, at the start of each line. Separate counts are kept for branches,
instructions or other events.
Note also, the itrace A option can be useful to provide higher granularity
cycle information.
Example:
$ perf record -e intel_pt/cyc/u uname
Linux
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.044 MB perf.data ]
$ perf script --itrace=A --call-trace --dlfilter dlfilter-show-cycles.so --deltatime | head
0 perf-exec 8509 [001] 0.000000000: psb offs: 0
0 perf-exec 8509 [001] 0.000000000: cbr: 42 freq: 4219 MHz (156%)
833 833 uname 8509 [001] 0.000047689: (/usr/lib/x86_64-linux-gnu/ld-2.31.so ) _start
833 uname 8509 [001] 0.000003261: (/usr/lib/x86_64-linux-gnu/ld-2.31.so ) _dl_start
2015 1182 uname 8509 [001] 0.000000282: (/usr/lib/x86_64-linux-gnu/ld-2.31.so ) _dl_start
2676 661 uname 8509 [001] 0.000002629: (/usr/lib/x86_64-linux-gnu/ld-2.31.so ) _dl_start
3612 936 uname 8509 [001] 0.000001232: (/usr/lib/x86_64-linux-gnu/ld-2.31.so ) _dl_start
4579 967 uname 8509 [001] 0.000002519: (/usr/lib/x86_64-linux-gnu/ld-2.31.so ) _dl_start
6145 1566 uname 8509 [001] 0.000001050: (/usr/lib/x86_64-linux-gnu/ld-2.31.so ) _dl_setup_hash
6239 94 uname 8509 [001] 0.000000023: (/usr/lib/x86_64-linux-gnu/ld-2.31.so ) _dl_sysdep_start
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Link: https://lore.kernel.org/r/20211027080334.365596-5-adrian.hunter@intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Normally, for cycle-acccurate mode, IPC values are an exact number of
instructions and cycles. Due to the granularity of timestamps, that happens
only when a CYC packet correlates to the event.
Support the itrace 'A' option, to use instead, the number of cycles
associated with the current timestamp. This provides IPC information for
every change of timestamp, but at the expense of accuracy. Due to the
granularity of timestamps, the actual number of cycles increases even
though the cycles reported does not. The number of instructions is known,
but if IPC is reported, cycles can be too low and so IPC is too high. Note
that inaccuracy decreases as the period of sampling increases i.e. if the
number of cycles is too low by a small amount, that becomes less
significant if the number of cycles is large.
Furthermore, it can be used in conjunction with dlfilter-show-cycles.so
to provide higher granularity cycle information.
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Link: https://lore.kernel.org/r/20211027080334.365596-4-adrian.hunter@intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Like all locally-built programs, dlfilters may need to be re-built if
shared libraries they use change. Also there may be unexpected results
if the dfilter uses different versions of the shared libraries that perf
uses.
Note those things in the documentation.
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Link: https //lore.kernel.org/r/20210811101036.17986-5-adrian.hunter@intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
perf_dlfilter_fns must not be const, because it is not.
Declaring it const can result in it being mapped read-only, causing a
segfaullt when it is written. Update documentation accordingly.
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Fixes: 8defa7147d5572 ("perf script Add API for filtering via dynamically loaded shared object")
Link: https //lore.kernel.org/r/20210811101036.17986-2-adrian.hunter@intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
The Intel PT decoder limits the number of unconditional branches (e.g.
jmps) decoded without consuming any trace packets. Generally, a loop
needs a conditional branch which generates a TNT packet, whereas a "ret"
instruction will generate a TIP or TNT packet. So exceeding the limit is
assumed to be a never-ending loop, which can happen if there has been a
decoding error putting the decoder at the wrong place in the code.
Up until now, the limit of 10000 has been enough but some analytic
purposes have been reported to exceed that.
Increase the limit to 100000, and make it configurable via perf config
intel-pt.max-loops. Also amend the "Never-ending loop" message to
mention the configuration entry.
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Link: http://lore.kernel.org/lkml/20210701175132.3977-1-adrian.hunter@intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Perf has supported the CPU_PMU_CAPS feature to display a list of CPU PMU
capabilities. But on a hybrid platform, it may have several CPU PMUs (such
as "cpu_core" and "cpu_atom"). The CPU_PMU_CAPS feature is hard to extend
to support multiple CPU PMUs well if it needs to be compatible for the case
of old perf data file + new perf tool.
So for better compatibility we now create a new feature HYBRID_CPU_PMU_CAPS
in the header.
For the perf.data generated on hybrid platform,
root@otcpl-adl-s-2:~# perf report --header-only -I
# cpu_core pmu capabilities: branches=32, max_precise=3, pmu_name=alderlake_hybrid
# cpu_atom pmu capabilities: branches=32, max_precise=3, pmu_name=alderlake_hybrid
# missing features: TRACING_DATA BRANCH_STACK GROUP_DESC AUXTRACE STAT CLOCKID DIR_FORMAT COMPRESSED CPU_PMU_CAPS CLOCK_DATA
For the perf.data generated on non-hybrid platform
root@kbl-ppc:~# perf report --header-only -I
# cpu pmu capabilities: branches=32, max_precise=3, pmu_name=skylake
# missing features: TRACING_DATA BRANCH_STACK GROUP_DESC AUXTRACE STAT CLOCKID DIR_FORMAT COMPRESSED CLOCK_DATA HYBRID_TOPOLOGY HYBRID_CPU_PMU_CAPS
Signed-off-by: Jin Yao <yao.jin@linux.intel.com>
Acked-by: Jiri Olsa <jolsa@redhat.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Jin Yao <yao.jin@intel.com>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lore.kernel.org/lkml/20210514122948.9472-3-yao.jin@linux.intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
It is useful to let the user know about the hybrid topology.
Add the HYBRID_TOPOLOGY feature in header to indicate the core CPUs
and the atom CPUs.
With this patch a perf.data generated on a hybrid platform reports
the hybrid CPU list:
root@otcpl-adl-s-2:~# perf report --header-only -I
...
# hybrid cpu system:
# cpu_core cpu list : 0-15
# cpu_atom cpu list : 16-23
For a perf.data generated on a non-hybrid platform, reports a message
that HYBRID_TOPOLOGY is missing:
root@kbl-ppc:~# perf report --header-only -I
...
# missing features: TRACING_DATA BRANCH_STACK GROUP_DESC AUXTRACE STAT CLOCKID DIR_FORMAT COMPRESSED CLOCK_DATA HYBRID_TOPOLOGY
Signed-off-by: Jin Yao <yao.jin@linux.intel.com>
Acked-by: Jiri Olsa <jolsa@redhat.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Jin Yao <yao.jin@intel.com>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lore.kernel.org/lkml/20210514122948.9472-2-yao.jin@linux.intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Intel PT timestamps are affected by virtualization. Add a new option
that will allow the Intel PT decoder to correlate the timestamps and
translate the virtual machine timestamps to host timestamps.
The advantages of making this a separate step, rather than a part of
normal decoding are that it is simpler to implement, and it needs to
be done only once.
This patch adds only the option. Later patches add Intel PT support.
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Link: https://lore.kernel.org/r/20210430070309.17624-6-adrian.hunter@intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>