perf-stat(1)
============
NAME
----
perf-stat - Run a command and gather performance counter statistics
SYNOPSIS
--------
[verse]
'perf stat' [-e <EVENT> | --event=EVENT] [-a] <command>
'perf stat' [-e <EVENT> | --event=EVENT] [-a] -- <command> [<options>]
DESCRIPTION
-----------
This command runs a command and gathers performance counter statistics
from it.
OPTIONS
-------
<command>...::
Any command you can specify in a shell.
-e::
--event=::
Select the PMU event. Selection can be a symbolic event name
(use 'perf list' to list all events) or a raw PMU
event (eventsel+umask) in the form of rNNN where NNN is a
hexadecimal event descriptor.
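
As a rough illustration of the rNNN form, the sketch below packs an event select
and a unit mask into a raw descriptor. It assumes the common x86 layout (event
select in bits 0-7, umask in bits 8-15); other architectures encode raw events
differently, and `raw_event` is a hypothetical helper, not part of perf.

```shell
# Hypothetical helper: build an rNNN descriptor from eventsel and umask.
# Assumes the x86 layout: bits 0-7 = event select, bits 8-15 = unit mask.
raw_event() {
    printf 'r%x\n' $(( ($2 << 8) | $1 ))
}

raw_event 0x2e 0x41    # prints r412e
```

The resulting string would then be passed as, for example, 'perf stat -e r412e -- <command>'.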
-i::
--no-inherit::
child tasks do not inherit counters
-p::
--pid=<pid>::
stat events on existing process id (comma separated list)
-t::
--tid=<tid>::
stat events on existing thread id (comma separated list)
-a::
--all-cpus::
system-wide collection from all CPUs
-c::
--scale::
scale/normalize counter values
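
When more events are requested than there are hardware counters, the kernel
time-multiplexes them, so a counter may be live for only part of the run. The
sketch below shows the usual extrapolation (raw count times time enabled,
divided by time running). `scale_count` and the sample numbers are
illustrative only.

```shell
# Illustrative only: extrapolate a multiplexed counter to the full run.
# scaled = raw * time_enabled / time_running
scale_count() {
    awk -v raw="$1" -v enabled="$2" -v running="$3" \
        'BEGIN { printf "%.0f\n", raw * enabled / running }'
}

scale_count 700 100 70    # counter was live 70% of the time; prints 1000
```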
-r::
--repeat=<n>::
repeat command and print average + stddev (max: 100). 0 means forever.
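
To show what "average + stddev" summarises, here is a minimal sketch that
computes the mean and sample standard deviation of a few per-run values. The
numbers are made up and `mean_stddev` is a hypothetical helper; perf's own
noise estimate may be computed and presented differently.

```shell
# Mean and sample standard deviation of the values given as arguments.
mean_stddev() {
    printf '%s\n' "$@" | awk '
        { n++; sum += $1; sumsq += $1 * $1 }
        END {
            mean = sum / n
            var = (n > 1) ? (sumsq - sum * sum / n) / (n - 1) : 0
            printf "mean=%.3f stddev=%.3f\n", mean, sqrt(var)
        }'
}

mean_stddev 10 12 14    # prints mean=12.000 stddev=2.000
```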
-B::
--big-num::
print large numbers with thousands' separators according to locale
-C::
--cpu=::
Count only on the list of CPUs provided. Multiple CPUs can be provided as a
comma-separated list with no space: 0,1. Ranges of CPUs are specified with -: 0-2.
In per-thread mode, this option is ignored. The -a option is still necessary
to activate system-wide monitoring. Default is to count on all CPUs.
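
The list syntax can be illustrated with a small sketch that expands a -C style
argument into individual CPU numbers. `expand_cpu_list` is a hypothetical
helper written for this example; it is not how perf itself parses the list.

```shell
# Hypothetical helper: expand a -C style list such as 0,2-4 into CPU numbers.
expand_cpu_list() {
    out=""
    for part in $(printf '%s' "$1" | tr ',' ' '); do
        case $part in
            *-*)
                # A range: walk from the first CPU to the last, inclusive.
                i=${part%-*}
                last=${part#*-}
                while [ "$i" -le "$last" ]; do
                    out="$out $i"
                    i=$((i + 1))
                done
                ;;
            *)
                # A single CPU number.
                out="$out $part"
                ;;
        esac
    done
    echo "${out# }"
}

expand_cpu_list 0,2-4    # prints 0 2 3 4
```

With that list, 'perf stat -a -C 0,2-4 -- sleep 1' would count on those four CPUs only.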
-A::
--no-aggr::
Do not aggregate counts across all monitored CPUs in system-wide mode (-a).
This option is only valid in system-wide mode.
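
A per-CPU breakdown can be folded back into a single total. The sketch below
assumes -A is combined with -x , and that each line then has the layout
CPU<n>,<count>,<event>; check this against your perf version. `sum_event` is a
hypothetical helper and the sample values are illustrative.

```shell
# Sum the per-CPU counts of one event read from standard input.
# Assumed input layout (one line per CPU): CPU<n>,<count>,<event>
sum_event() {
    awk -F, -v ev="$1" '$3 == ev { total += $2 } END { printf "%.0f\n", total }'
}

# Illustrative sample data, not output captured for this document.
printf '%s\n' \
    'CPU0,2398163767,cycles' \
    'CPU1,2398180817,cycles' \
    'CPU2,2398217115,cycles' \
    'CPU3,2398247483,cycles' |
    sum_event cycles    # prints 9592809182
```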
-n::
--null::
null run - don't start any counters
-v::
--verbose::
be more verbose (show counter open errors, etc)
-x SEP::
--field-separator SEP::
print counts using a CSV-style output to make it easy to import directly into
spreadsheets. Columns are separated by the string specified in SEP.
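
CSV-style output is also easy to post-process from a script. The sketch below
derives instructions per cycle from such output, assuming the count is the
first field and the event name appears in a later field; the exact column
layout may vary between perf versions. `ipc_from_csv` is a hypothetical helper
and the sample values are illustrative.

```shell
# Compute instructions per cycle from CSV-style counts on standard input.
# Assumed layout: <count>,...,<event name>,... with "," as the separator.
ipc_from_csv() {
    awk -F, '
        {
            for (i = 2; i <= NF; i++) {
                if ($i == "cycles")       cycles = $1
                if ($i == "instructions") insns  = $1
            }
        }
        END { if (cycles > 0) printf "%.3f\n", insns / cycles }'
}

# Illustrative sample data, not output captured for this document.
printf '%s\n' '9596385684,cycles' '3493659441,instructions' |
    ipc_from_csv    # prints 0.364
```

In practice the input would come from something like
'perf stat -x , -e cycles,instructions -o counts.csv -- sleep 1', followed by
'ipc_from_csv < counts.csv'.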
-G name::
--cgroup name::
monitor only in the container (cgroup) called "name". This option is available only
in per-cpu mode. The cgroup filesystem must be mounted. All threads belonging to
container "name" are monitored when they run on the monitored CPUs. Multiple cgroups
can be provided. Each cgroup is applied to the corresponding event, i.e., first cgroup
to first event, second cgroup to second event and so on. It is possible to provide
an empty cgroup (monitor all the time) using, e.g., -G foo,,bar. Cgroups must have
corresponding events, i.e., they always refer to events defined earlier on the command
line.
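
The positional pairing of cgroups with events can be sketched as follows.
`pair_cgroups` is a hypothetical helper that only prints which cgroup each
event would be tied to; the event and cgroup names are examples.

```shell
# Show the positional pairing between an -e list and a -G list.
# An empty cgroup slot means the event is not restricted to a cgroup.
pair_cgroups() {
    awk -v events="$1" -v cgroups="$2" 'BEGIN {
        n = split(events, ev, ",")
        split(cgroups, cg, ",")
        for (i = 1; i <= n; i++)
            printf "%s -> %s\n", ev[i], (cg[i] == "" ? "(no cgroup)" : cg[i])
    }'
}

pair_cgroups cycles,instructions,branches foo,,bar
# prints:
#   cycles -> foo
#   instructions -> (no cgroup)
#   branches -> bar
```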
-o file::
--output file::
Print the output into the designated file.
--append::
Append to the output file designated with the -o option. Ignored if -o is not specified.
--log-fd::
Log output to fd, instead of stderr. Complementary to --output, and mutually exclusive
with it. --append may be used here. Examples:
3>results perf stat --log-fd 3 -- $cmd
3>>results perf stat --log-fd 3 --append -- $cmd
--pre::
--post::
Pre and post measurement hooks, e.g.:
perf stat --repeat 10 --null --sync --pre 'make -s O=defconfig-build/clean' -- make -s -j64 O=defconfig-build/ bzImage
-I msecs::
--interval-print msecs::
Print count deltas at the given interval in milliseconds (minimum: 100ms)
example: perf stat -I 1000 -e cycles -a sleep 5
--per-socket::
Aggregate counts per processor socket for system-wide mode measurements. This
is a useful mode to detect imbalance between sockets. To enable this mode,
use --per-socket in addition to -a (system-wide). The output includes the
socket number and the number of online processors on that socket. This is
useful to gauge the amount of aggregation.
--per-core::
Aggregate counts per physical processor for system-wide mode measurements. This
is a useful mode to detect imbalance between physical cores. To enable this mode,
use --per-core in addition to -a (system-wide). The output includes the
core number and the number of online logical processors on that physical processor.
-D msecs::
--delay msecs::
After starting the program, wait msecs before measuring. This is useful to
filter out the startup phase of the program, which often behaves very differently
from the rest of the run.
-T::
--transaction::
Print statistics of transactional execution if supported.
EXAMPLES
--------
$ perf stat -- make -j

 Performance counter stats for 'make -j':

    8117.370256  task clock ticks     #      11.281 CPU utilization factor
            678  context switches     #       0.000 M/sec
            133  CPU migrations       #       0.000 M/sec
         235724  pagefaults           #       0.029 M/sec
    24821162526  CPU cycles           #    3057.784 M/sec
    18687303457  instructions         #    2302.138 M/sec
      172158895  cache references     #      21.209 M/sec
       27075259  cache misses         #       3.335 M/sec

 Wall-clock time elapsed:   719.554352 msecs
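
Counts like these are often more useful as ratios. As a small worked example
using the numbers above, the sketch below expresses cache misses as a
percentage of cache references; `ratio_pct` is a hypothetical helper, not a
perf feature.

```shell
# Express one counter as a percentage of another.
ratio_pct() {
    awk -v num="$1" -v den="$2" 'BEGIN { printf "%.2f\n", 100 * num / den }'
}

ratio_pct 27075259 172158895    # cache misses / cache references; prints 15.73
```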
SEE ALSO
--------
linkperf:perf-top[1], linkperf:perf-list[1]