linux/tools/perf/Documentation/perf-stat.txt

perf-stat(1)
============
NAME
----
perf-stat - Run a command and gather performance counter statistics
SYNOPSIS
--------
[verse]
'perf stat' [-e <EVENT> | --event=EVENT] [-a] <command>
'perf stat' [-e <EVENT> | --event=EVENT] [-a] \-- <command> [<options>]
'perf stat' [-e <EVENT> | --event=EVENT] [-a] record [-o file] \-- <command> [<options>]
'perf stat' report [-i file]
DESCRIPTION
-----------
This command runs a command and gathers performance counter statistics
from it.
OPTIONS
-------
<command>...::
Any command you can specify in a shell.
record::
See STAT RECORD.
report::
See STAT REPORT.
-e::
--event=::
Select the PMU event. Selection can be:
- a symbolic event name (use 'perf list' to list all events)
- a raw PMU event in the form of rN where N is a hexadecimal value
that represents the raw register encoding with the layout of the
event control registers as described by entries in
/sys/bus/event_source/devices/cpu/format/*.
- a symbolic or raw PMU event followed by an optional colon
and a list of event modifiers, e.g., cpu-cycles:p. See the
linkperf:perf-list[1] man page for details on event modifiers.
- a symbolically formed event like 'pmu/param1=0x3,param2/' where
param1 and param2 are defined as formats for the PMU in
/sys/bus/event_source/devices/<pmu>/format/*
'percore' is an event qualifier that sums up the event counts for both
hardware threads in a core. For example:
perf stat -A -a -e cpu/event,percore=1/,otherevent ...
- a symbolically formed event like 'pmu/config=M,config1=N,config2=K/'
where M, N, K are numbers (in decimal, hex, or octal format).
Acceptable values for each of 'config', 'config1' and 'config2'
parameters are defined by corresponding entries in
/sys/bus/event_source/devices/<pmu>/format/*
Note that the last two syntaxes support prefix and glob matching in
the PMU name to simplify creation of events across multiple instances
of the same type of PMU in large systems (e.g. memory controller PMUs).
Multiple PMU instances are typical for uncore PMUs, so the prefix
'uncore_' is also ignored when performing this match.
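
The raw rN form described above requires packing each field into the bit
ranges listed in the PMU's format files. The following is a minimal Python
sketch of that packing, not part of perf itself. The format strings are
illustrative values of the kind an AMD core PMU may report (event select split
across bits 0-7 and 32-35, unit mask in bits 8-15); read the real ones from
/sys/bus/event_source/devices/<pmu>/format/ on the target machine. The helper
names parse_format, scatter and encode_raw are hypothetical.

```python
def parse_format(spec):
    """Parse a sysfs format string such as 'config:0-7,32-35'.

    Returns (config_word_name, bit_positions), lowest field bit first.
    """
    name, ranges = spec.strip().split(":")
    bits = []
    for part in ranges.split(","):
        lo, _, hi = part.partition("-")
        hi = hi or lo  # a single bit is written without a dash, e.g. 'config:23'
        bits.extend(range(int(lo), int(hi) + 1))
    return name, bits


def scatter(value, bits):
    """Place bit i of value at position bits[i] of the result."""
    if value >> len(bits):
        raise ValueError("value does not fit in the field")
    out = 0
    for i, pos in enumerate(bits):
        if (value >> i) & 1:
            out |= 1 << pos
    return out


def encode_raw(fields, formats):
    """Combine field values into one 'config' word.

    This sketch only handles fields that live in 'config'; fields in
    'config1' or 'config2' are rejected.
    """
    config = 0
    for field, value in fields.items():
        name, bits = parse_format(formats[field])
        if name != "config":
            raise ValueError("only the 'config' word is handled here")
        config |= scatter(value, bits)
    return config


# Assumed format strings for illustration, not read from a real system.
formats = {"event": "config:0-7,32-35", "umask": "config:8-15"}
config = encode_raw({"event": 0x28F, "umask": 0x03}, formats)
print("r%x" % config)  # prints "r20000038f"
```

With these assumed formats, concatenating the unit mask and event select as
r0328f would put the upper event bits in the wrong place, which is why the
format files should be consulted before writing a raw event.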
-i::
--no-inherit::
child tasks do not inherit counters
-p::
--pid=<pid>::
stat events on existing process id (comma separated list)
-t::
--tid=<tid>::
stat events on existing thread id (comma separated list)
-b::
--bpf-prog::
stat events on existing bpf program id (comma separated list),
requiring root rights. bpftool-prog can be used to find the program
ids of all bpf programs in the system. For example:

  # bpftool prog | head -n 1
  17247: tracepoint  name sys_enter  tag 192d548b9d754067  gpl

  # perf stat -e cycles,instructions --bpf-prog 17247 --timeout 1000

   Performance counter stats for 'BPF program(s) 17247':

             85,967      cycles
             28,982      instructions              #    0.34  insn per cycle

        1.102235068 seconds time elapsed
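
The "insn per cycle" annotation in the output above is the ratio of the two
counts. A small Python sketch of that arithmetic, using the numbers from the
example (the function name is hypothetical and not part of perf):

```python
def insn_per_cycle(instructions, cycles):
    """Return instructions retired per cycle, or None if no cycles were counted."""
    if cycles == 0:
        return None
    return instructions / cycles


ipc = insn_per_cycle(28_982, 85_967)
print(f"{ipc:.2f} insn per cycle")  # prints "0.34 insn per cycle"
```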
--bpf-counters::
Use BPF programs to aggregate readings from perf_events. This
allows multiple perf-stat sessions that are counting the same metric (cycles,
instructions, etc.) to share hardware counters.
To use BPF programs on common events by default, use
"perf config stat.bpf-counter-events=<list_of_events>".
--bpf-attr-map::
With option "--bpf-counters", different perf-stat sessions share
information about shared BPF programs and maps via a pinned hashmap.
Use "--bpf-attr-map" to specify the path of this pinned hashmap.
The default path is /sys/fs/bpf/perf_attr_map.
ifdef::HAVE_LIBPFM[]
--pfm-events events::
Select a PMU event using libpfm4 syntax (see http://perfmon2.sf.net)
including support for event filters. For example '--pfm-events
inst_retired:any_p:u:c=1:i'. More than one event can be passed to the
option using the comma separator. Hardware events and generic hardware
events cannot be mixed together. The latter must be used with the -e
option. The -e option and this one can be mixed and matched. Events
can be grouped using the {} notation.
endif::HAVE_LIBPFM[]
-a::
--all-cpus::
system-wide collection from all CPUs (default if no target is specified)
--no-scale::
Don't scale/normalize counter values
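
When more events are requested than there are hardware counters, the kernel
time-multiplexes them and each event is counted for only part of the run.
Scaling extrapolates the raw count to the full enabled time. A minimal
Python sketch of that arithmetic, assuming the usual time_enabled and
time_running values returned by perf_event_open(2) reads (the function name
is hypothetical and not part of perf):

```python
def scale_count(raw, time_enabled, time_running):
    """Extrapolate a multiplexed counter value to the full enabled time.

    Returns None when the event never ran, since no estimate is possible.
    """
    if time_running == 0:
        return None
    if time_running >= time_enabled:
        return raw  # the event was counting the whole time, nothing to scale
    return round(raw * time_enabled / time_running)


# An event that counted 1000 occurrences while running for half of the time.
print(scale_count(1000, time_enabled=100, time_running=50))  # prints 2000
```

With --no-scale, the raw value would be reported instead of the estimate.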
-d::
--detailed::
print more detailed statistics, can be specified up to 3 times
-d: detailed events, L1 and LLC data cache
-d -d: more detailed events, dTLB and iTLB events
-d -d -d: very detailed events, adding prefetch events
-r::
--repeat=<n>::
repeat command and print average + stddev (max: 100). 0 means forever.
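
The summary printed for repeated runs can be approximated as follows. This is
a Python sketch under the assumption of a plain arithmetic mean and sample
standard deviation; the exact estimator and the way perf presents the
deviation may differ. The sample values are made up for illustration.

```python
import statistics


def summarize(samples):
    """Return (mean, sample standard deviation) for a list of per-run counts."""
    mean = statistics.fmean(samples)
    # A single run has no spread to report.
    stddev = statistics.stdev(samples) if len(samples) > 1 else 0.0
    return mean, stddev


runs = [100, 102, 98, 100]  # hypothetical counts from four repetitions
mean, stddev = summarize(runs)
print(f"{mean:.1f} +- {stddev:.2f}")
```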
-B::
--big-num::
print large numbers with thousands' separators according to locale.
Enabled by default. Use "--no-big-num" to disable.
Default setting can be changed with "perf config stat.big-num=false".
-C::
--cpu=::
Count only on the list of CPUs provided. Multiple CPUs can be provided as a
comma-separated list with no space: 0,1. Ranges of CPUs are specified with -: 0-2.
In per-thread mode, this option is ignored. The -a option is still necessary
to activate system-wide monitoring. Default is to count on all CPUs.
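
To illustrate the list syntax accepted by -C, the following sketch expands a CPU list into the individual CPU numbers it selects. The function name `expand_cpu_list` is hypothetical and is not part of perf; it only mirrors the comma and range notation described above.

```shell
#!/bin/bash
# Expand a CPU list such as "0,2-4" into "0 2 3 4".
# Hypothetical helper for illustration; perf parses the list itself.
expand_cpu_list() {
    local list=$1 item lo hi i out=()
    local IFS=,
    for item in $list; do
        if [[ $item == *-* ]]; then
            # A range "lo-hi" selects every CPU from lo to hi inclusive.
            lo=${item%-*}
            hi=${item#*-}
            for ((i = lo; i <= hi; i++)); do
                out+=("$i")
            done
        else
            out+=("$item")
        fi
    done
    IFS=' '
    printf '%s\n' "${out[*]}"
}

# Corresponding perf invocation (needs perf and suitable permissions):
#   perf stat -a -C 0,2-4 -e cycles -- sleep 1
expand_cpu_list 0,2-4
```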
-A::
--no-aggr::
Do not aggregate counts across all monitored CPUs.
-n::
--null::
null run - Don't start any counters.
This can be useful to measure just elapsed wall-clock time - or to assess the
raw overhead of perf stat itself, without running any counters.
-v::
--verbose::
be more verbose (show counter open errors, etc)
-x SEP::
--field-separator SEP::
print counts using a CSV-style output to make it easy to import directly into
spreadsheets. Columns are separated by the string specified in SEP.
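
As a rough sketch of post-processing this output in a script rather than a spreadsheet, the function below prints "event value" pairs from CSV lines read on standard input. It assumes ',' as the separator and the aggregated layout in which the counter value is the first field and the event name the third; other modes (for example -A or -I) prepend extra fields, and the layout may differ between perf versions. The function name `csv_event_counts` is hypothetical.

```shell
#!/bin/bash
# Print "event value" for each line whose first field is a number.
# Assumed layout (aggregated mode): value,unit,event,...
# Header comments and "<not counted>"/"<not supported>" lines are skipped
# because their first field is not numeric.
csv_event_counts() {
    awk -F, '$1 ~ /^[0-9.]+$/ { print $3, $1 }'
}

# Typical use (needs perf and suitable permissions):
#   perf stat -x , -o counts.csv -e cycles,instructions -- sleep 1
#   csv_event_counts < counts.csv
```

Filtering on a numeric first field keeps the sketch tolerant of the extra header line written when -o is used.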
--table::
Display time for each run (-r option), in a table format, e.g.:
$ perf stat --null -r 5 --table perf bench sched pipe
Performance counter stats for 'perf bench sched pipe' (5 runs):
# Table of individual measurements:
5.189 (-0.293) #
5.189 (-0.294) #
5.186 (-0.296) #
5.663 (+0.181) ##
6.186 (+0.703) ####
# Final result:
5.483 +- 0.198 seconds time elapsed ( +- 3.62% )
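
The "Final result" line above can be reproduced from the individual measurements. The sketch below is inferred from the numbers in the example rather than taken from the perf source: it computes the mean of the run times, the standard error of the mean (sample standard deviation divided by the square root of the number of runs), and that error as a percentage of the mean. The function name `run_summary` is hypothetical, and at least two values are required.

```shell
#!/bin/bash
# Summarize per-run elapsed times given as arguments:
# mean, standard error of the mean, and relative error in percent.
run_summary() {
    printf '%s\n' "$@" | awk '
        { x[NR] = $1; sum += $1 }
        END {
            mean = sum / NR
            for (i = 1; i <= NR; i++)
                ss += (x[i] - mean) ^ 2
            # sample standard deviation divided by sqrt(n)
            err = sqrt(ss / (NR - 1)) / sqrt(NR)
            printf "%.3f +- %.3f ( +- %.2f%% )\n", mean, err, 100 * err / mean
        }'
}

# The five run times from the table above.
run_summary 5.189 5.189 5.186 5.663 6.186
```

The parenthesized value next to each run in the table is that run's difference from the mean; small discrepancies in the last digit come from the run times being shown rounded to three decimals.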
-G name::
--cgroup name::
monitor only in the container (cgroup) called "name". This option is available only
in per-cpu mode. The cgroup filesystem must be mounted. All threads belonging to
container "name" are monitored when they run on the monitored CPUs. Multiple cgroups
can be provided. Each cgroup is applied to the corresponding event, i.e., first cgroup
to first event, second cgroup to second event and so on. It is possible to provide
an empty cgroup (monitor all the time) using, e.g., -G foo,,bar. Cgroups must have
corresponding events, i.e., they always refer to events defined earlier on the command
line. To track multiple events for a specific cgroup, use
'-e e1 -e e2 -G foo,foo' or just '-e e1 -e e2 -G foo'.
To monitor, say, 'cycles' for a cgroup and also system wide, use:
'perf stat -e cycles -G cgroup_name -a -e cycles'.
--for-each-cgroup name::
Expand event list for each cgroup in "name" (allow multiple cgroups separated
by comma). It also supports regex patterns to match multiple groups. This has the same
effect as repeating the -e and -G options for each event x name. This option
cannot be used with the -G/--cgroup option.
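
The "event x name" expansion can be pictured with a small sketch that prints the repeated -e/-G pairs for a given event list and cgroup list. This only illustrates the equivalence stated above under the assumption that cgroups are listed literally; it does not handle regex patterns, and the function name `expand_for_each_cgroup` is hypothetical rather than anything perf provides.

```shell
#!/bin/bash
# Print the -e/-G pairs that "--for-each-cgroup" stands for,
# one pair per (cgroup, event) combination, cgroup by cgroup.
# Arguments: comma-separated events, comma-separated cgroup names.
expand_for_each_cgroup() {
    local events=$1 cgroups=$2 ev cg args=()
    local IFS=,
    for cg in $cgroups; do
        for ev in $events; do
            args+=(-e "$ev" -G "$cg")
        done
    done
    IFS=' '
    printf '%s\n' "${args[*]}"
}

# Shorthand form (needs perf, cgroups A and B, and suitable permissions):
#   perf stat -a -e cycles,instructions --for-each-cgroup A,B sleep 1
expand_for_each_cgroup cycles,instructions A,B
```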
-o file::
--output file::
Print the output into the designated file.
--append::
Append to the output file designated with the -o option. Ignored if -o is not specified.
--log-fd::
Log output to fd, instead of stderr. Complementary to --output, and mutually exclusive
with it. --append may be used here. Examples:
3>results perf stat --log-fd 3 \-- $cmd
3>>results perf stat --log-fd 3 --append \-- $cmd
--control=fifo:ctl-fifo[,ack-fifo]::
--control=fd:ctl-fd[,ack-fd]::
ctl-fifo / ack-fifo are opened and used as ctl-fd / ack-fd as follows.
Listen on ctl-fd descriptor for command to control measurement ('enable': enable events,
'disable': disable events). Measurements can be started with events disabled using
--delay=-1 option. Optionally send control command completion ('ack\n') to ack-fd descriptor
to synchronize with the controlling process. Example of bash shell script to enable and
disable events during measurements:
#!/bin/bash
ctl_dir=/tmp/
ctl_fifo=${ctl_dir}perf_ctl.fifo
test -p ${ctl_fifo} && unlink ${ctl_fifo}
mkfifo ${ctl_fifo}
exec {ctl_fd}<>${ctl_fifo}
ctl_ack_fifo=${ctl_dir}perf_ctl_ack.fifo
test -p ${ctl_ack_fifo} && unlink ${ctl_ack_fifo}
mkfifo ${ctl_ack_fifo}
exec {ctl_fd_ack}<>${ctl_ack_fifo}
perf stat -D -1 -e cpu-cycles -a -I 1000 \
--control fd:${ctl_fd},${ctl_fd_ack} \
\-- sleep 30 &
perf_pid=$!
sleep 5 && echo 'enable' >&${ctl_fd} && read -u ${ctl_fd_ack} e1 && echo "enabled(${e1})"
sleep 10 && echo 'disable' >&${ctl_fd} && read -u ${ctl_fd_ack} d1 && echo "disabled(${d1})"
exec {ctl_fd_ack}>&-
unlink ${ctl_ack_fifo}
exec {ctl_fd}>&-
unlink ${ctl_fifo}
wait -n ${perf_pid}
exit $?
--pre::
--post::
Pre and post measurement hooks, e.g.:
perf stat --repeat 10 --null --sync --pre 'make -s O=defconfig-build/clean' \-- make -s -j64 O=defconfig-build/ bzImage
-I msecs::
--interval-print msecs::
Print count deltas every N milliseconds (minimum: 1ms)
The overhead percentage could be high in some cases, for instance with small, sub 100ms intervals. Use with caution.
example: 'perf stat -I 1000 -e cycles -a sleep 5'
If a metric exists, it is calculated from the counts generated in this interval and printed after the '#' character.
--interval-count times::
Print count deltas a fixed number of times.
This option should be used together with the "-I" option.
example: 'perf stat -I 1000 --interval-count 2 -e cycles -a'
--interval-clear::
Clear the screen before next interval.
--timeout msecs::
Stop the 'perf stat' session and print count deltas after N milliseconds (minimum: 10 ms).
This option is not supported with the "-I" option.
example: 'perf stat --timeout 2000 -e cycles -a'
--metric-only::
Only print computed metrics. Print them in a single line.
Don't show any raw values. Not supported with --per-thread.
--per-socket::
Aggregate counts per processor socket for system-wide mode measurements. This
is a useful mode to detect imbalance between sockets. To enable this mode,
use --per-socket in addition to -a (system-wide). The output includes the
socket number and the number of online processors on that socket. This is
useful to gauge the amount of aggregation.
--per-die::
Aggregate counts per processor die for system-wide mode measurements. This
is a useful mode to detect imbalance between dies. To enable this mode,
use --per-die in addition to -a (system-wide). The output includes the
die number and the number of online processors on that die. This is
useful to gauge the amount of aggregation.
--per-cluster::
Aggregate counts per processor cluster for system-wide mode measurement. This
is a useful mode to detect imbalance between clusters. To enable this mode,
use --per-cluster in addition to -a (system-wide). The output includes the
cluster number and the number of online processors on that cluster. This is
useful to gauge the amount of aggregation. The cluster ID and the related
CPUs can be obtained from /sys/devices/system/cpu/cpuX/topology/cluster_{id, cpus}.
--per-cache::
Aggregate counts per cache instance for system-wide mode measurements. By
default, the aggregation happens for the cache level at the highest index
in the system. To specify a particular level, mention the cache level
alongside the option in the format [Ll][1-9][0-9]*. For example:
Using option "--per-cache=l3" or "--per-cache=L3" will aggregate the
information at the boundary of the level 3 cache in the system.
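The level format above can be checked before invoking perf. The following is a
minimal sketch: is_valid_cache_level is a hypothetical helper, not part of perf,
and it only mirrors the documented pattern [Ll][1-9][0-9]*.

```shell
# Hypothetical helper (not part of perf): test a string against the
# documented --per-cache level format [Ll][1-9][0-9]*.
is_valid_cache_level() {
    printf '%s\n' "$1" | grep -Eq '^[Ll][1-9][0-9]*$'
}

for level in L3 l2 L0 3; do
    if is_valid_cache_level "$level"; then
        # Only print the command that would be run; nothing is measured here.
        echo "$level: accepted, e.g. perf stat -a --per-cache=$level -- sleep 1"
    else
        echo "$level: rejected"
    fi
done
```

A well-formed level may still name a cache that the machine does not have, so
this check covers the syntax only.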
--per-core::
Aggregate counts per physical processor for system-wide mode measurements. This
is a useful mode to detect imbalance between physical cores. To enable this mode,
use --per-core in addition to -a (system-wide). The output includes the
core number and the number of online logical processors on that physical processor.
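To get a rough idea of how many logical processors sit on each physical core
before measuring, the sysfs topology files can be read directly. This is a
sketch under assumptions: count_cpus_per_core is a hypothetical helper, the
"S<package>-C<core>" label is chosen for illustration and is not guaranteed to
match the labels perf prints, and the optional directory argument exists only
so the function can be pointed at a different tree.

```shell
# Hypothetical helper (not part of perf): count logical CPUs per physical
# core using the topology files under /sys/devices/system/cpu.
count_cpus_per_core() {
    dir="${1:-/sys/devices/system/cpu}"
    for t in "$dir"/cpu[0-9]*/topology; do
        # Skip CPUs that do not expose topology information.
        [ -r "$t/core_id" ] || continue
        echo "S$(cat "$t/physical_package_id")-C$(cat "$t/core_id")"
    done | sort | uniq -c | awk '{ print $2, $1 }'
}

# Prints one "label count" pair per physical core found on this machine.
count_cpus_per_core
```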
--per-thread::
Aggregate counts per monitored thread, when monitoring threads (-t option)
or processes (-p option).
--per-node::
Aggregate counts per NUMA node for system-wide mode measurements. This
is a useful mode to detect imbalance between NUMA nodes. To enable this
mode, use --per-node in addition to -a (system-wide).
-D msecs::
--delay msecs::
After starting the program, wait msecs before measuring (-1: start with events
disabled). This is useful to filter out the startup phase of the program,
which is often very different from the rest of the workload.
-T::
--transaction::
Print statistics of transactional execution if supported.
--metric-no-group::
By default, events to compute a metric are placed in weak groups. The
group tries to enforce scheduling all or none of the events. The
--metric-no-group option places events outside of groups and may
increase the chance of each event being scheduled, which can improve
accuracy. However, because the events may no longer be scheduled together,
accuracy for metrics like instructions per cycle can be lower, as the
underlying events may not be measured at the same time.
--metric-no-merge::
By default metric events in different weak groups can be shared if one
group contains all the events needed by another. In such cases one
group will be eliminated reducing event multiplexing and making it so
that certain groups of metrics sum to 100%. A downside to sharing a
group is that the group may require multiplexing and so accuracy for a
small group that need not have multiplexing is lowered. This option
forbids the event merging logic from sharing events between groups and
may be used to increase accuracy in this case.
--metric-no-threshold::
Metric thresholds may increase the number of events necessary to
compute whether a metric has exceeded its threshold expression. This
may not be desirable, for example, as the events can introduce
multiplexing. This option disables the adding of threshold expression
events for a metric. However, if there are sufficient events to
compute the threshold then the threshold is still computed and used to
color the metric's computed value.
--quiet::
Don't print output, warnings or messages. This is useful with perf stat
record below to only write data to the perf.data file.
STAT RECORD
-----------
Stores stat data into a perf data file.
-o file::
--output file::
Output file name.
STAT REPORT
-----------
Reads and reports stat data from a perf data file.
-i file::
--input file::
Input file name.
--per-socket::
Aggregate counts per processor socket for system-wide mode measurements.
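A record-then-report session can be sketched as a small wrapper. The function
name stat_record_report and the PERF override variable are assumptions made for
illustration. Only options documented on this page are used (-a, -o, -i,
--per-socket).

```shell
# PERF may be overridden (for example PERF=echo) to preview the commands
# instead of running them. This variable is an assumption of the sketch.
PERF="${PERF:-perf}"

# stat_record_report FILE COMMAND [ARGS...]
# Hypothetical wrapper: record system-wide counts into FILE while COMMAND
# runs, then report the stored counts aggregated per socket.
stat_record_report() {
    data_file="$1"; shift
    "$PERF" stat record -a -o "$data_file" -- "$@" &&
    "$PERF" stat report -i "$data_file" --per-socket
}

# Usage (system-wide measurement usually needs elevated privileges):
#   stat_record_report perf.data sleep 1
```

Because the aggregation mode is chosen at report time, the same data file can be
reported again with a different option such as --per-die.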
--per-die::
Aggregate counts per processor die for system-wide mode measurements.
--per-cluster::
Aggregate counts per processor cluster for system-wide mode measurements.
--per-cache::
Aggregate counts per cache instance for system-wide mode measurements. By
default, the aggregation happens for the cache level at the highest index
in the system. To specify a particular level, mention the cache level
alongside the option in the format [Ll][1-9][0-9]*. For example: Using
option "--per-cache=l3" or "--per-cache=L3" will aggregate the
information at the boundary of the level 3 cache in the system.
--per-core::
Aggregate counts per physical processor for system-wide mode measurements.
-M::
--metrics::
Print metrics or metricgroups specified in a comma-separated list.
For a group, all metrics from the group are added.
The events from the metrics are automatically measured.
See perf list output for the possible metrics and metricgroups.
When threshold information is available for a metric, the
color red is used to signify a metric has exceeded a threshold
while green shows it hasn't. The default color means that
no threshold information was available or the threshold
couldn't be computed.
-A::
--no-aggr::
--no-merge::
Do not aggregate/merge counts across monitored CPUs or PMUs.
When multiple events are created from a single event specification,
stat will, by default, aggregate the event counts and show the result
in a single row. This option disables that behavior and shows the
individual events and counts.
Multiple events are created from a single event specification when:
1. PID monitoring isn't requested and the system has more than one
CPU. For example, a system with 8 SMT threads will have one event
opened on each thread and aggregation is performed across them.
2. Prefix or glob wildcard matching is used for the PMU name. For
example, multiple memory controller PMUs may exist, typically with a
suffix of _0, _1, etc. By default the event counts will all be
combined if the PMU is specified without the suffix, such as
uncore_imc rather than uncore_imc_0.
3. Aliases, which are listed immediately after the Kernel PMU events
by perf list, are used.
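As an illustration of the merging described above, the following Python sketch models the readings of one event specification as (CPU, PMU, count) tuples and contrasts the default single merged row with the individual rows. The function names and all numbers are hypothetical and not part of perf; this models only the shape of the output, not how perf opens or reads events.

```python
from collections import defaultdict

# Hypothetical readings for one event specification: (cpu, pmu, count).
# The numbers are made up for illustration only.
READINGS = [
    (0, "uncore_imc_0", 100),
    (0, "uncore_imc_1", 150),
    (1, "uncore_imc_0", 200),
    (1, "uncore_imc_1", 250),
]

def merged(readings):
    """Default behavior: one row holding the sum over all CPUs and PMUs."""
    return sum(count for _cpu, _pmu, count in readings)

def unmerged(readings):
    """Simplified model of --no-merge: one row per (cpu, pmu) pair."""
    rows = defaultdict(int)
    for cpu, pmu, count in readings:
        rows[(cpu, pmu)] += count
    return dict(rows)

print("merged:", merged(READINGS))
for (cpu, pmu), count in sorted(unmerged(READINGS).items()):
    print(f"CPU{cpu} {count} [{pmu}]")
```

The individual rows always add up to the merged row, so the option changes only how the counts are presented.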
--hybrid-merge::
Merge core event counts from all core PMUs. In hybrid or big.LITTLE
systems by default each core PMU will report its count
separately. This option forces core PMU counts to be combined to give
a behavior closer to having a single CPU type in the system.
--topdown::
Print top-down metrics supported by the CPU. This helps to identify
bottlenecks in the CPU pipeline for CPU-bound workloads by breaking
the cycles consumed down into frontend bound, backend bound, bad
speculation and retiring.
Frontend bound means that the CPU cannot fetch and decode instructions fast
enough. Backend bound means that computation or memory access is the
bottleneck. Bad Speculation means that the CPU wasted cycles due to branch
mispredictions and similar issues. Retiring means that the CPU computed without
an apparent bottleneck. The reported bottleneck is only the real bottleneck
if the workload is actually bound by the CPU and not by something else.
For best results it is usually a good idea to use it with interval
mode like -I 1000, as the bottleneck of workloads can change often.
This enables --metric-only, unless overridden with --no-metric-only.
On newer CPUs (IceLake and later) TopDown can be collected for any thread.
The following restrictions only apply to older Intel CPUs and Atom:
The top-down metrics are collected per core instead of per
CPU thread. Per-core mode is automatically enabled
and -a (global monitoring) is needed, requiring root rights or
kernel.perf_event_paranoid=-1.
Topdown uses the full Performance Monitoring Unit, and needs
disabling of the NMI watchdog (as root):
echo 0 > /proc/sys/kernel/nmi_watchdog
for best results. Otherwise the bottlenecks may be inconsistent
on workloads with changing phases.
To interpret the results it is usually necessary to know which
CPUs the workload runs on. If needed, the CPUs can be forced using
taskset.
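The restrictions above can be checked before a measurement. The following Python sketch is one possible pre-flight check, assuming the usual procfs locations /proc/sys/kernel/nmi_watchdog and /proc/sys/kernel/perf_event_paranoid. The helper names are hypothetical, and the paranoid check simply mirrors the condition stated above (root rights or a value of -1).

```python
import os
from pathlib import Path

def read_int(path):
    """Return the integer stored in a procfs file, or None if unreadable."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None

def topdown_warnings(nmi_watchdog, perf_event_paranoid, is_root):
    """Return warnings for the restrictions that apply to older CPUs."""
    warnings = []
    # A value of None means the setting could not be read; stay silent then.
    if nmi_watchdog not in (0, None):
        warnings.append("NMI watchdog is enabled; disable it for best results")
    if (not is_root and perf_event_paranoid is not None
            and perf_event_paranoid > -1):
        warnings.append("-a needs root rights or perf_event_paranoid=-1")
    return warnings

if __name__ == "__main__":
    for line in topdown_warnings(
        read_int("/proc/sys/kernel/nmi_watchdog"),
        read_int("/proc/sys/kernel/perf_event_paranoid"),
        os.geteuid() == 0,
    ):
        print(line)
```

Keeping the decision logic in a function that takes plain values makes it usable without touching the running system.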
--td-level::
Print the top-down statistics for the given level. This lets users
print the top-down metrics level they are interested in instead of
the level 1 top-down metrics.
As the higher levels gather more metrics and use more counters they
will be less accurate. By convention a metric can be examined by
appending '_group' to it and this will increase accuracy compared to
gathering all metrics for a level. For example, level 1 analysis may
highlight 'tma_frontend_bound'. This metric may be drilled into with
'tma_frontend_bound_group' with
'perf stat -M tma_frontend_bound_group...'.
Error out if the input is higher than the supported max level.
--smi-cost::
Measure SMI cost if msr/aperf/ and msr/smi/ events are supported.
During the measurement, /sys/devices/cpu/freeze_on_smi will be set to
freeze core counters on SMI.
The aperf counter will not be affected by the setting.
The cost of SMI can be measured by (aperf - unhalted core cycles).
In practice, the percentage of SMI cycles is very useful for
performance-oriented analysis. --metric-only will be applied by default.
The output is SMI cycles%, which equals (aperf - unhalted core cycles) / aperf.
Users who want to get the actual value can apply --no-metric-only.
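The formula above can be worked through with a small Python sketch. Because the core counters are frozen during an SMI while aperf keeps counting, the difference between the two is the number of cycles spent in SMI. The function name and the counter values below are made up for illustration and are not measured data.

```python
def smi_cycles_percent(aperf, unhalted_core_cycles):
    """SMI cycles% = (aperf - unhalted core cycles) / aperf, as a percentage."""
    if aperf <= 0:
        raise ValueError("aperf must be positive")
    return 100.0 * (aperf - unhalted_core_cycles) / aperf

# Made-up counter values, not measured data.
aperf = 1_000_000
cycles = 950_000
print(f"SMI cycles%: {smi_cycles_percent(aperf, cycles):.1f}%")
```

With these values, 50,000 of 1,000,000 aperf cycles are unaccounted for by the core cycle counter, so the SMI share is 5.0%.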
--all-kernel::
Configure all used events to run in kernel space.
--all-user::
Configure all used events to run in user space.
--percore-show-thread::
The event modifier "percore" sums up the event counts
for all hardware threads in a core and shows the counts per core.
When used together with the "percore" event modifier, this option also
sums up the event counts for all hardware threads in a core, but shows
the summed counts per hardware thread. This is essentially a replacement
for the any bit and is convenient for post-processing.
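The following Python sketch models the behavior described above: counts are summed per core, and that sum is then reported once for every hardware thread of the core. The topology (4 cores with 2 threads each, CPU n sharing a core with CPU n+4), the function name and the counts are all hypothetical.

```python
from collections import defaultdict

def percore_show_thread(thread_counts, core_of):
    """Sum counts per core, then report that sum for every thread of the core."""
    core_sum = defaultdict(int)
    for cpu, count in thread_counts.items():
        core_sum[core_of[cpu]] += count
    return {cpu: core_sum[core_of[cpu]] for cpu in thread_counts}

# Hypothetical topology: 4 cores, 2 hardware threads each (CPU n and CPU n+4).
CORE_OF = {cpu: cpu % 4 for cpu in range(8)}
# Made-up per-thread counts.
COUNTS = {0: 10, 1: 20, 2: 30, 3: 40, 4: 1, 5: 2, 6: 3, 7: 4}

for cpu, total in sorted(percore_show_thread(COUNTS, CORE_OF).items()):
    print(f"CPU{cpu} {total}")
```

In this model the two threads of a core always show the same value, so a post-processing script can read the per-core count from any thread's row.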
--summary::
Print summary for interval mode (-I).
--no-csv-summary::
Don't print 'summary' in the first column of the CSV summary output.
This option must be used with -x and --summary.
This option can be enabled in perf config by setting the variable
'stat.no-csv-summary'.
$ perf config stat.no-csv-summary=true
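A minimal sketch of why the 'summary' column helps scripts: with it, summary rows can be told apart from interval rows by the first field alone. The helper name `split_summary_rows` and the sample rows are illustrative assumptions, not part of perf; the rows only mimic the shape of `perf stat -x, -I1000 --summary` output.

```python
def split_summary_rows(text, sep=","):
    """Split perf stat CSV output into interval rows and summary rows.

    Assumes the default behaviour, where summary rows carry the literal
    string 'summary' in the first column (i.e. --no-csv-summary is NOT set).
    """
    intervals, summary = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment headers
        fields = line.split(sep)
        if fields[0] == "summary":
            summary.append(fields[1:])  # drop the marker column
        else:
            intervals.append(fields)    # first field is the timestamp
    return intervals, summary


# Illustrative rows in the shape printed with -x, -I1000 --summary.
SAMPLE = """\
1.001196053,8012.72,msec,cpu-clock,8012722903,100.00,8.013,CPUs utilized
1.001196053,218,,context-switches,8012753271,100.00,0.027,K/sec
summary,8012.72,msec,cpu-clock,8012722903,100.00,7.986,CPUs utilized
summary,218,,context-switches,8012753271,100.00,0.027,K/sec
"""

intervals, summary = split_summary_rows(SAMPLE)
print(len(intervals), "interval rows,", len(summary), "summary rows")
```

With --no-csv-summary the marker column is absent, so summary rows have one field fewer than interval rows and a script has to distinguish them by field count instead.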
--cputype::
Only enable events on CPUs of this type on hybrid platforms
(e.g. core or atom).
EXAMPLES
--------
$ perf stat \-- make
Performance counter stats for 'make':
83723.452481 task-clock:u (msec) # 1.004 CPUs utilized
0 context-switches:u # 0.000 K/sec
0 cpu-migrations:u # 0.000 K/sec
3,228,188 page-faults:u # 0.039 M/sec
229,570,665,834 cycles:u # 2.742 GHz
313,163,853,778 instructions:u # 1.36 insn per cycle
69,704,684,856 branches:u # 832.559 M/sec
2,078,861,393 branch-misses:u # 2.98% of all branches
83.409183620 seconds time elapsed
74.684747000 seconds user
8.739217000 seconds sys
TIMINGS
-------
As shown in the example above, perf stat can display three types of timings.
We always display the time the counters were enabled/alive:
83.409183620 seconds time elapsed
For workload sessions we also display the time the workload spent in
user and system mode:
74.684747000 seconds user
8.739217000 seconds sys
Those times are the very same as displayed by the 'time' tool.
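To illustrate how the three timings relate, here is a rough sketch that measures the same quantities for a child process from Python. It is an analogue, not perf's implementation: it reads child resource usage with `resource.getrusage(RUSAGE_CHILDREN)` and wall-clock time with `time.monotonic()`. The function name `timed_run` and the tiny workload are assumptions made for the example.

```python
import resource
import subprocess
import sys
import time


def timed_run(argv):
    """Run argv and return elapsed, user and sys times in seconds."""
    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    start = time.monotonic()
    subprocess.run(argv, check=True)
    elapsed = time.monotonic() - start
    after = resource.getrusage(resource.RUSAGE_CHILDREN)
    return {
        "elapsed": elapsed,                            # wall-clock time
        "user": after.ru_utime - before.ru_utime,      # CPU time in user mode
        "sys": after.ru_stime - before.ru_stime,       # CPU time in kernel mode
    }


# Hypothetical workload: a short Python one-liner run as a child process.
timings = timed_run([sys.executable, "-c", "sum(range(200000))"])
for name in ("elapsed", "user", "sys"):
    print(f"{timings[name]:14.9f} seconds {name}")
```

Note that user plus sys can exceed the elapsed time when the workload runs on more than one CPU, as in the 'make' example above (74.68 + 8.74 seconds against 83.41 seconds elapsed).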
CSV FORMAT
----------
With -x, perf stat can produce output in a not-quite-CSV format.
Commas in the output are not quoted with "". To make the output easy to parse,
it is recommended to use a different separator character, such as -x \;
The fields are in this order:
- optional usec time stamp in fractions of second (with -I xxx)
- optional CPU, core, or socket identifier
- optional number of logical CPUs aggregated
- counter value
- unit of the counter value or empty
- event name
- run time of counter
- percentage of measurement time the counter was running
- optional variance if multiple values are collected with -r
- optional metric value
- optional unit of metric
Additional metrics may be printed with all earlier fields being empty.
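The field order above can be turned into a small parser. The following is a sketch under stated assumptions: no per-CPU/core/socket aggregation, no -r (so no variance field), and ';' as the separator. The names `CsvRow` and `parse_row` are hypothetical, and the sample rows are illustrative. Values are kept as strings because a counter may print a placeholder such as `<not supported>` instead of a number.

```python
from typing import NamedTuple, Optional


class CsvRow(NamedTuple):
    timestamp: Optional[str]     # only present with -I
    value: str                   # counter value (may be a placeholder string)
    unit: str                    # unit of the counter value, or empty
    event: str                   # event name
    run_time: str                # run time of counter
    pct_running: str             # percentage of time the counter was running
    metric_value: Optional[str]  # optional metric value
    metric_unit: Optional[str]   # optional unit of metric


def parse_row(line, sep=";", has_timestamp=False):
    """Parse one perf stat -x line (no aggregation ids, no -r variance)."""
    fields = line.rstrip("\n").split(sep)
    timestamp = fields.pop(0) if has_timestamp else None
    # Raises ValueError if the five mandatory fields are not all present.
    value, unit, event, run_time, pct_running = fields[:5]
    rest = fields[5:]
    metric_value = rest[0] if len(rest) > 0 and rest[0] else None
    metric_unit = rest[1] if len(rest) > 1 and rest[1] else None
    return CsvRow(timestamp, value, unit, event, run_time,
                  pct_running, metric_value, metric_unit)


# Illustrative rows: one with an interval timestamp, one without.
interval_row = parse_row(
    "1.001196053;8012.72;msec;cpu-clock;8012722903;100.00;8.013;CPUs utilized",
    has_timestamp=True,
)
plain_row = parse_row("218;;context-switches;8012753271;100.00;0.027;K/sec")
print(interval_row.event, interval_row.value, interval_row.unit)
print(plain_row.event, plain_row.metric_value, plain_row.metric_unit)
```

Because the optional leading fields shift every later column, a script needs to know which options (-I, aggregation mode, -r) were used to produce the output before it can assign meaning to a column.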
include::intel-hybrid.txt[]
JSON FORMAT
-----------
With -j, perf stat prints JSON-formatted output
that is suitable for parsing.
- timestamp : optional usec time stamp in fractions of second (with -I)
- optional aggregate options:
- core : core identifier (with --per-core)
- die : die identifier (with --per-die)
- socket : socket identifier (with --per-socket)
- node : node identifier (with --per-node)
- thread : thread identifier (with --per-thread)
- counter-value : counter value
- unit : unit of the counter value or empty
- event : event name
- variance : optional variance if multiple values are collected (with -r)
- event-runtime : run time of counter
- pcnt-running : percentage of measurement time the counter was running
- metric-value : optional metric value
- metric-unit : optional unit of metric
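A short sketch of consuming this output, assuming each counter is printed as one JSON object per line (an assumption about the layout, not something stated above) and that `counter-value` arrives as a string. The helper name `parse_json_lines` and the sample records are illustrative; only the `event` and `counter-value` keys from the list above are relied on.

```python
import json


def parse_json_lines(text):
    """Parse perf stat -j output, assuming one JSON object per line."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("{"):
            continue  # ignore anything that is not a JSON object
        records.append(json.loads(line))
    return records


# Illustrative records in the shape printed by 'perf stat -j'.
SAMPLE = """\
{"counter-value" : "0.731750", "unit" : "msec", "event" : "task-clock:u", "metric-value" : 0.000731, "metric-unit" : "CPUs utilized"}
{"counter-value" : "75.000000", "unit" : "", "event" : "page-faults:u", "metric-value" : 102.494021, "metric-unit" : "K/sec"}
{"metric-value" : 0.046955, "metric-unit" : "stalled cycles per insn"}
"""

records = parse_json_lines(SAMPLE)
# Records without an "event" key carry only an additional metric; skip them.
by_event = {r["event"]: float(r["counter-value"]) for r in records if "event" in r}
print(by_event)
```

Looking fields up by name rather than by position is the main advantage over the CSV format: optional fields can appear or disappear without shifting the others.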
SEE ALSO
--------
linkperf:perf-top[1], linkperf:perf-list[1]