If for some strange reason the buffer_page gets bigger, or the page struct
gets smaller, I want to know this ASAP. The best way is to not let the
kernel compile.
This patch adds code to test the size of the struct buffer_page against the
page struct and will cause compile issues if the buffer_page ever gets bigger
than the page struct.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This is a unified tracing buffer that implements a ring buffer that
hopefully everyone will eventually be able to use.
The events recorded into the buffer have the following structure:
struct ring_buffer_event {
u32 type:2, len:3, time_delta:27;
u32 array[];
};
The minimum size of an event is 8 bytes. All events are 4 byte
aligned inside the buffer.
There are 4 types (all internal use for the ring buffer, only
the data type is exported to the interface users).
RINGBUF_TYPE_PADDING: this type is used to note extra space at the end
of a buffer page.
RINGBUF_TYPE_TIME_EXTENT: This type is used when the time between events
is greater than the 27 bit delta can hold. We add another
32 bits, and record that in its own event (8 byte size).
RINGBUF_TYPE_TIME_STAMP: (Not implemented yet). This will hold data to
help keep the buffer timestamps in sync.
RINGBUF_TYPE_DATA: The event actually holds user data.
The "len" field is only three bits. Since the data must be
4 byte aligned, this field is shifted left by 2, giving a
max length of 28 bytes. If the data load is greater than 28
bytes, the first array field holds the full length of the
data load and the len field is set to zero.
Example, data size of 7 bytes:
type = RINGBUF_TYPE_DATA
len = 2
time_delta: <time-stamp> - <prev_event-time-stamp>
array[0..1]: <7 bytes of data> <1 byte empty>
This event is saved in 12 bytes of the buffer.
An event with 82 bytes of data:
type = RINGBUF_TYPE_DATA
len = 0
time_delta: <time-stamp> - <prev_event-time-stamp>
array[0]: 84 (Note the alignment)
array[1..14]: <82 bytes of data> <2 bytes empty>
The above event is saved in 92 bytes (if my math is correct).
82 bytes of data, 2 bytes empty, 4 byte header, 4 byte length.
Do not reference the above event struct directly. Use the following
functions to gain access to the event table, since the
ring_buffer_event structure may change in the future.
ring_buffer_event_length(event): get the length of the event.
This is the size of the memory used to record this
event, and not the size of the data pay load.
ring_buffer_time_delta(event): get the time delta of the event
This returns the delta time stamp since the last event.
Note: Even though this is in the header, there should
be no reason to access this directly, accept
for debugging.
ring_buffer_event_data(event): get the data from the event
This is the function to use to get the actual data
from the event. Note, it is only a pointer to the
data inside the buffer. This data must be copied to
another location otherwise you risk it being written
over in the buffer.
ring_buffer_lock: A way to lock the entire buffer.
ring_buffer_unlock: unlock the buffer.
ring_buffer_alloc: create a new ring buffer. Can choose between
overwrite or consumer/producer mode. Overwrite will
overwrite old data, where as consumer producer will
throw away new data if the consumer catches up with the
producer. The consumer/producer is the default.
ring_buffer_free: free the ring buffer.
ring_buffer_resize: resize the buffer. Changes the size of each cpu
buffer. Note, it is up to the caller to provide that
the buffer is not being used while this is happening.
This requirement may go away but do not count on it.
ring_buffer_lock_reserve: locks the ring buffer and allocates an
entry on the buffer to write to.
ring_buffer_unlock_commit: unlocks the ring buffer and commits it to
the buffer.
ring_buffer_write: writes some data into the ring buffer.
ring_buffer_peek: Look at a next item in the cpu buffer.
ring_buffer_consume: get the next item in the cpu buffer and
consume it. That is, this function increments the head
pointer.
ring_buffer_read_start: Start an iterator of a cpu buffer.
For now, this disables the cpu buffer, until you issue
a finish. This is just because we do not want the iterator
to be overwritten. This restriction may change in the future.
But note, this is used for static reading of a buffer which
is usually done "after" a trace. Live readings would want
to use the ring_buffer_consume above, which will not
disable the ring buffer.
ring_buffer_read_finish: Finishes the read iterator and reenables
the ring buffer.
ring_buffer_iter_peek: Look at the next item in the cpu iterator.
ring_buffer_read: Read the iterator and increment it.
ring_buffer_iter_reset: Reset the iterator to point to the beginning
of the cpu buffer.
ring_buffer_iter_empty: Returns true if the iterator is at the end
of the cpu buffer.
ring_buffer_size: returns the size in bytes of each cpu buffer.
Note, the real size is this times the number of CPUs.
ring_buffer_reset_cpu: Sets the cpu buffer to empty
ring_buffer_reset: sets all cpu buffers to empty
ring_buffer_swap_cpu: swaps a cpu buffer from one buffer with a
cpu buffer of another buffer. This is handy when you
want to take a snap shot of a running trace on just one
cpu. Having a backup buffer, to swap with facilitates this.
Ftrace max latencies use this.
ring_buffer_empty: Returns true if the ring buffer is empty.
ring_buffer_empty_cpu: Returns true if the cpu buffer is empty.
ring_buffer_record_disable: disable all cpu buffers (read only)
ring_buffer_record_disable_cpu: disable a single cpu buffer (read only)
ring_buffer_record_enable: enable all cpu buffers.
ring_buffer_record_enabl_cpu: enable a single cpu buffer.
ring_buffer_entries: The number of entries in a ring buffer.
ring_buffer_overruns: The number of entries removed due to writing wrap.
ring_buffer_time_stamp: Get the time stamp used by the ring buffer
ring_buffer_normalize_time_stamp: normalize the ring buffer time stamp
into nanosecs.
I still need to implement the GTOD feature. But we need support from
the cpu frequency infrastructure. But this can be done at a later
time without affecting the ring buffer interface.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
It is possible that the testing thread in the ftrace wakeup test does not
run before we stop the trace. This will cause the trace to fail since nothing
will be in the buffers.
This patch adds a small wait in the wakeup test to allow for the woken task
to run and be traced.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
When the boot tracer can't handle an entry output, it returns 1.
It should return 0 to relay on other output functions.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The tracing engine resets the ring buffer and the tracers touch it
too during self-tests. These self-tests happen during tracers registering
and work against boot tracing which is logging initcalls.
We have to disable tracing self-tests if the boot-tracer is selected.
Reported-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Bring the entry to choose the boot tracer on the kernel config.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The tracing engine have now to be init in early_initcall to set the
boot tracer. Only the debugfs settings will be initialized at
fs_initcall time.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Add the boot/initcall tracer.
It's primary purpose is to be able to trace the initcalls.
It is intended to be used with scripts/bootgraph.pl after some small
improvements.
Note that it is not active after its init. To avoid tracing (and so
crashing) before the whole tracing engine init, you have to explicitly
call start_boot_trace() after do_pre_smp_initcalls() to enable it.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
With the recent updates to ftrace, there should not be any failures when
modifying the code. If there is, then we need to warn about it.
This patch has a cleaned up version of the code that I used to discover
that the weak symbols were causing failures.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Replace "none" tracer by the recently created "nop" tracer.
Both are pretty similar except that nop accepts TRACE_PRINT
or TRACE_SPECIAL entries.
And as a consequence, changing the size of the ring buffer now
requires that tracing has already been disabled.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Steven Noonan <steven@uplinklabs.net>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Now that the nop tracer is used as the default tracer by
replacing the "none" tracer, tracing engine depends on it.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Steven Noonan <steven@uplinklabs.net>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
If nop tracer is selected, some old entries from the previous tracer
could still be enqueued. Tracing have to be reset.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Steven Noonan <steven@uplinklabs.net>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The functions are already 'extern' anyway, so there's no problem
with linkage. Removing these ifdefs also helps find any potential
compiler errors.
Suggested by Andrew Morton.
Signed-off-by: Steven Noonan <steven@uplinklabs.net>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
When CONFIG_DYNAMIC_FTRACE isn't used, neither is mcount_addr. This
patch eliminates that warning.
Signed-off-by: Steven Noonan <steven@uplinklabs.net>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
A no-op tracer which can serve two purposes:
1. A template for development of a new tracer.
2. A convenient way to see ftrace_printk() calls without
an irrelevant trace making the output messy.
[ mingo@elte.hu: resolved conflicts ]
Signed-off-by: Steven Noonan <steven@uplinklabs.net>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Allow a user to inject a marker (TRACE_PRINT entry) into the trace ring
buffer. The related file operations are derived from code by Frédéric
Weisbecker <fweisbec@gmail.com>.
Signed-off-by: Pekka Paalanen <pq@iki.fi>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Also make trace_seq_print_cont() non-static, and add a newline if the
seq buffer can't hold all data.
Signed-off-by: Pekka Paalanen <pq@iki.fi>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Offer mmiotrace users a function to inject markers from inside the kernel.
This depends on the trace_vprintk() patch.
Signed-off-by: Pekka Paalanen <pq@iki.fi>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
trace_vprintk() for easier implementation of tracer specific *_printk
functions. Add check check for no_tracer, and implement
__ftrace_printk() as a wrapper.
Signed-off-by: Pekka Paalanen <pq@iki.fi>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Moves the mmiotrace specific functions from trace.c to
trace_mmiotrace.c. Functions trace_wake_up(), tracing_get_trace_entry(),
and tracing_generic_entry_update() are therefore made available outside
trace.c.
Signed-off-by: Pekka Paalanen <pq@iki.fi>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This must be brown paper bag week for Steven Rostedt!
While working on ftrace for PPC, I discovered that the hash locking done
when CONFIG_FTRACE_MCOUNT_RECORD is not set, is totally incorrect.
With a cut and paste error, I had the hash lock macro to lock for both
hash_lock _and_ hash_unlock!
This bug did not affect x86 since this bug was introduced when
CONFIG_FTRACE_MCOUNT_RECORD was added to x86.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
make most of the tracers depend on DEBUG_KERNEL - that's their intended
purpose. (most distributions have DEBUG_KERNEL enabled anyway so this is
not a practical limitation - but it simplifies the tracing menu in the
normal case)
Signed-off-by: Ingo Molnar <mingo@elte.hu>
While profiling the smp behaviour of the scheduler it was needed to know to
which cpu a task got woken.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Currently ftrace_printk only works with the ftrace tracer, switch it to an
iter_ctrl setting so we can make us of them with other tracers too.
[rostedt@redhat.com: tweak to the disable condition]
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
An item in the trace buffer that is bigger than one entry may be split
up using the TRACE_CONT entry. This makes it a virtual single entry.
The current code increments the iterator index even while traversing
TRACE_CONT entries, making it look like the iterator is further than
it actually is.
This patch adds code to not increment the iterator index while skipping
over TRACE_CONT entries.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Peter Zijlstra provided me with a nice brown paper bag while letting me know
that I was doing a logical AND and not a binary one, making a condition
true more often than it should be.
Luckily, a false true is handled by the calling function and no harm is
done. But this needs to be fixed regardless.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Currently some of the ftrace output goes skewiff if you have more
than 9 cpus, and some if you have more than 99.
Twiddle with the headers and format strings to make up to 999 cpus
display without causing spacing problems.
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Acked-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The mcount record method of ftrace scans objdump for references to mcount.
Using mcount as the reference to test if the calls to mcount being replaced
are indeed calls to mcount, this use of mcount was also caught as a
location to change. Using a variable that points to the mcount address
moves this reference into the data section that is not scanned, and
we do not use a false location to try and modify.
The warn on code was what was used to detect this bug.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This is another tracer using the ftrace infrastructure, that examines
at each function call the size of the stack. If the stack use is greater
than the previous max it is recorded.
You can always see (and set) the max stack size seen. By setting it
to zero will start the recording again. The backtrace is also available.
For example:
# cat /debug/tracing/stack_max_size
1856
# cat /debug/tracing/stack_trace
[<c027764d>] stack_trace_call+0x8f/0x101
[<c021b966>] ftrace_call+0x5/0x8
[<c02553cc>] clocksource_get_next+0x12/0x48
[<c02542a5>] update_wall_time+0x538/0x6d1
[<c0245913>] do_timer+0x23/0xb0
[<c0257657>] tick_do_update_jiffies64+0xd9/0xf1
[<c02576b9>] tick_sched_timer+0x4a/0xad
[<c0250fe6>] __run_hrtimer+0x3e/0x75
[<c02518ed>] hrtimer_interrupt+0xf1/0x154
[<c022c870>] smp_apic_timer_interrupt+0x71/0x84
[<c021b7e9>] apic_timer_interrupt+0x2d/0x34
[<c0238597>] finish_task_switch+0x29/0xa0
[<c05abd13>] schedule+0x765/0x7be
[<c05abfca>] schedule_timeout+0x1b/0x90
[<c05ab4d4>] wait_for_common+0xab/0x101
[<c05ab5ac>] wait_for_completion+0x12/0x14
[<c033cfc3>] blk_execute_rq+0x84/0x99
[<c0402470>] scsi_execute+0xc2/0x105
[<c040250a>] scsi_execute_req+0x57/0x7f
[<c043afe0>] sr_test_unit_ready+0x3e/0x97
[<c043bbd6>] sr_media_change+0x43/0x205
[<c046b59f>] media_changed+0x48/0x77
[<c046b5ff>] cdrom_media_changed+0x31/0x37
[<c043b091>] sr_block_media_changed+0x16/0x18
[<c02b9e69>] check_disk_change+0x1b/0x63
[<c046f4c3>] cdrom_open+0x7a1/0x806
[<c043b148>] sr_block_open+0x78/0x8d
[<c02ba4c0>] do_open+0x90/0x257
[<c02ba869>] blkdev_open+0x2d/0x56
[<c0296a1f>] __dentry_open+0x14d/0x23c
[<c0296b32>] nameidata_to_filp+0x24/0x38
[<c02a1c68>] do_filp_open+0x347/0x626
[<c02967ef>] do_sys_open+0x47/0xbc
[<c02968b0>] sys_open+0x23/0x2b
[<c021aadd>] sysenter_do_call+0x12/0x26
I've tested this on both x86_64 and i386.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
After disabling FTRACE_MCOUNT_RECORD via a patch, a dormant build
failure surfaced:
kernel/trace/ftrace.c: In function 'ftrace_record_ip':
kernel/trace/ftrace.c:416: error: incompatible type for argument 1 of '_spin_lock_irqsave'
kernel/trace/ftrace.c:433: error: incompatible type for argument 1 of '_spin_lock_irqsave'
Introduced by commit 6dad8e07f4c10b17b038e84d29f3ca41c2e55cd0 ("ftrace:
add necessary locking for ftrace records").
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The new design of pre-recorded mcounts and updating the code outside of
kstop_machine has changed the way the records themselves are protected.
This patch uses the ftrace_lock to protect the records. Note, the lock
still does not need to be taken within calls that are only called via
kstop_machine, since the that code can not run while the spin lock is held.
Also removed the hash_lock needed for the daemon when MCOUNT_RECORD is
configured. Also did a slight cleanup of an unused variable.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
If one of the self tests of ftrace has disabled the function tracer,
do not run the code to convert the mcount calls in modules.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This patch fixes some mistakes on the tracer in warning messages when
debugfs fails to create tracing files.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: srostedt@redhat.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
At OLS I had a lot of interest to be able to have the ftrace buffers
dumped on panic. Usually one would expect to uses kexec and examine
the buffers after a new kernel is loaded. But sometimes the resources
do not permit kdump and kexec, so having an option to still see the
sequence of events up to the crash is very advantageous.
This patch adds the option to have the ftrace buffers dumped to the
console in the latency_trace format on a panic. When the option is set,
the default entries per CPU buffer are lowered to 16384, since the writing
to the serial (if that is the console) may take an awful long time
otherwise.
[
Changes since -v1:
Got alpine to send correctly (as well as spell check working).
Removed config option.
Moved the static variables into ftrace_dump itself.
Gave printk a log level.
]
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Based on Randy Dunlap's suggestion, the ftrace_printk kernel-doc belongs
with the ftrace_printk macro that should be used. Not with the
__ftrace_printk internal function.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Acked-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This patch adds a feature that can help kernel developers debug their
code using ftrace.
int ftrace_printk(const char *fmt, ...);
This records into the ftrace buffer using printf formatting. The entry
size in the buffers are still a fixed length. A new type has been added
that allows for more entries to be used for a single recording.
The start of the print is still the same as the other entries.
It returns the number of characters written to the ftrace buffer.
For example:
Having a module with the following code:
static int __init ftrace_print_test(void)
{
ftrace_printk("jiffies are %ld\n", jiffies);
return 0;
}
Gives me:
insmod-5441 3...1 7569us : ftrace_print_test: jiffies are 4296626666
for the latency_trace file and:
insmod-5441 [03] 1959.370498: ftrace_print_test jiffies are 4296626666
for the trace file.
Note: Only the infrastructure should go into the kernel. It is to help
facilitate debugging for other kernel developers. Calls to ftrace_printk
is not intended to be left in the kernel, and should be frowned upon just
like scattering printks around in the code.
But having this easily at your fingertips helps the debugging go faster
and bugs be solved quicker.
Maybe later on, we can hook this with markers and have their printf format
be sucked into ftrace output.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Some tracers will need to work with more than one entry. In order to do this
the trace_entry structure was split into two fields. One for the start of
all entries, and one to continue an existing entry.
The trace_entry structure now has a "field" entry that consists of the previous
content of the trace_entry, and a "cont" entry that is just a string buffer
the size of the "field" entry.
Thanks to Andrew Morton for suggesting this idea.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
When a mcount pointer is recorded into a table, it is used to add or
remove calls to mcount (replacing them with nops). If the code is removed
via removing a module, the pointers still exist. At modifying the code
a check is always made to make sure the code being replaced is the code
expected. In-other-words, the code being replaced is compared to what
it is expected to be before being replaced.
There is a very small chance that the code being replaced just happens
to look like code that calls mcount (very small since the call to mcount
is relative). To remove this chance, this patch adds ftrace_release to
allow module unloading to remove the pointers to mcount within the module.
Another change for init calls is made to not trace calls marked with
__init. The tracing can not be started until after init is done anyway.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Seems that freed records can appear in the available_filter_functions list.
This patch fixes that.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This patch enables the loading of the __mcount_section of modules and
changing all the callers of mcount into nops.
The modification is done before the init_module function is called, so
again, we do not need to use kstop_machine to make these changes.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This is the infrastructure to the converting the mcount call sites
recorded by the __mcount_loc section into nops on boot. It also allows
for using these sites to enable tracing as normal. When the __mcount_loc
section is used, the "ftraced" kernel thread is disabled.
This uses the current infrastructure to record the mcount call sites
as well as convert them to nops. The mcount function is kept as a stub
on boot up and not converted to the ftrace_record_ip function. We use the
ftrace_record_ip to only record from the table.
This patch does not handle modules. That comes with a later patch.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This patch creates a section in the kernel called "__mcount_loc".
This will hold a list of pointers to the mcount relocation for
each call site of mcount.
For example:
objdump -dr init/main.o
[...]
Disassembly of section .text:
0000000000000000 <do_one_initcall>:
0: 55 push %rbp
[...]
000000000000017b <init_post>:
17b: 55 push %rbp
17c: 48 89 e5 mov %rsp,%rbp
17f: 53 push %rbx
180: 48 83 ec 08 sub $0x8,%rsp
184: e8 00 00 00 00 callq 189 <init_post+0xe>
185: R_X86_64_PC32 mcount+0xfffffffffffffffc
[...]
We will add a section to point to each function call.
.section __mcount_loc,"a",@progbits
[...]
.quad .text + 0x185
[...]
The offset to of the mcount call site in init_post is an offset from
the start of the section, and not the start of the function init_post.
The mcount relocation is at the call site 0x185 from the start of the
.text section.
.text + 0x185 == init_post + 0xa
We need a way to add this __mcount_loc section in a way that we do not
lose the relocations after final link. The .text section here will
be attached to all other .text sections after final link and the
offsets will be meaningless. We need to keep track of where these
.text sections are.
To do this, we use the start of the first function in the section.
do_one_initcall. We can make a tmp.s file with this function as a reference
to the start of the .text section.
.section __mcount_loc,"a",@progbits
[...]
.quad do_one_initcall + 0x185
[...]
Then we can compile the tmp.s into a tmp.o
gcc -c tmp.s -o tmp.o
And link it into back into main.o.
ld -r main.o tmp.o -o tmp_main.o
mv tmp_main.o main.o
But we have a problem. What happens if the first function in a section
is not exported, and is a static function. The linker will not let
the tmp.o use it. This case exists in main.o as well.
Disassembly of section .init.text:
0000000000000000 <set_reset_devices>:
0: 55 push %rbp
1: 48 89 e5 mov %rsp,%rbp
4: e8 00 00 00 00 callq 9 <set_reset_devices+0x9>
5: R_X86_64_PC32 mcount+0xfffffffffffffffc
The first function in .init.text is a static function.
00000000000000a8 t __setup_set_reset_devices
000000000000105f t __setup_str_set_reset_devices
0000000000000000 t set_reset_devices
The lowercase 't' means that set_reset_devices is local and is not exported.
If we simply try to link the tmp.o with the set_reset_devices we end
up with two symbols: one local and one global.
.section __mcount_loc,"a",@progbits
.quad set_reset_devices + 0x10
00000000000000a8 t __setup_set_reset_devices
000000000000105f t __setup_str_set_reset_devices
0000000000000000 t set_reset_devices
U set_reset_devices
We still have an undefined reference to set_reset_devices, and if we try
to compile the kernel, we will end up with an undefined reference to
set_reset_devices, or even worst, it could be exported someplace else,
and then we will have a reference to the wrong location.
To handle this case, we make an intermediate step using objcopy.
We convert set_reset_devices into a global exported symbol before linking
it with tmp.o and set it back afterwards.
00000000000000a8 t __setup_set_reset_devices
000000000000105f t __setup_str_set_reset_devices
0000000000000000 T set_reset_devices
00000000000000a8 t __setup_set_reset_devices
000000000000105f t __setup_str_set_reset_devices
0000000000000000 T set_reset_devices
00000000000000a8 t __setup_set_reset_devices
000000000000105f t __setup_str_set_reset_devices
0000000000000000 t set_reset_devices
Now we have a section in main.o called __mcount_loc that we can place
somewhere in the kernel using vmlinux.ld.S and access it to convert
all these locations that call mcount into nops before starting SMP
and thus, eliminating the need to do this with kstop_machine.
Note, A well documented perl script (scripts/recordmcount.pl) is used
to do all this in one location.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
do not expose users to CONFIG_TRACEPOINTS - tracers can select it
just fine.
update ftrace to select CONFIG_TRACEPOINTS.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: per CPU hrtimers can be migrated from a dead CPU
The hrtimer code has no knowledge about per CPU timers, but we need to
prevent the migration of such timers and warn when such a timer is
active at migration time.
Explicitely mark the timers as per CPU and use a more understandable
mode descriptor for the interrupts safe unlocked callback mode, which
is used by hrtimer_sleeper and the scheduler code.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Remove the following warning with CONFIG_TRACING=y:
kernel/trace/trace.c: In function ‘s_next’:
kernel/trace/trace.c:1186: warning: unused variable ‘last_ent’
Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Replace previous instances of the cpumask_of_cpu_ptr* macros
with a the new (lvalue capable) generic cpumask_of_cpu().
Signed-off-by: Mike Travis <travis@sgi.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Jack Steiner <steiner@sgi.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
remove the :vim=ft=help tag from trace files.
I used them years ago to syntax-highlight traces and forgot about this hack.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* This patch replaces the dangerous lvalue version of cpumask_of_cpu
with new cpumask_of_cpu_ptr macros. These are patterned after the
node_to_cpumask_ptr macros.
In general terms, if there is a cpumask_of_cpu_map[] then a pointer to
the cpumask_of_cpu_map[cpu] entry is used. The cpumask_of_cpu_map
is provided when there is a large NR_CPUS count, reducing
greatly the amount of code generated and stack space used for
cpumask_of_cpu(). The pointer to the cpumask_t value is needed for
calling set_cpus_allowed_ptr() to reduce the amount of stack space
needed to pass the cpumask_t value.
If there isn't a cpumask_of_cpu_map[], then a temporary variable is
declared and filled in with value from cpumask_of_cpu(cpu) as well as
a pointer variable pointing to this temporary variable. Afterwards,
the pointer is used to reference the cpumask value. The compiler
will optimize out the extra dereference through the pointer as well
as the stack space used for the pointer, resulting in identical code.
A good example of the orthogonal usages is in net/sunrpc/svc.c:
case SVC_POOL_PERCPU:
{
unsigned int cpu = m->pool_to[pidx];
cpumask_of_cpu_ptr(cpumask, cpu);
*oldmask = current->cpus_allowed;
set_cpus_allowed_ptr(current, cpumask);
return 1;
}
case SVC_POOL_PERNODE:
{
unsigned int node = m->pool_to[pidx];
node_to_cpumask_ptr(nodecpumask, node);
*oldmask = current->cpus_allowed;
set_cpus_allowed_ptr(current, nodecpumask);
return 1;
}
Signed-off-by: Mike Travis <travis@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
When PREEMPT_TRACER and IRQSOFF_TRACER are both configured and irqsoff
tracer is running, the preempt_off sections might also be traced.
Thanks to Andrew Morton for pointing out my mistake of spin_lock disabling
interrupts while he was reviewing ftrace.txt. Seems that my example I used
actually hit this bug.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
fix:
kernel/trace/ftrace.c:1615: error: 'ftraced_suspend' undeclared (first use in this function)
kernel/trace/ftrace.c:1615: error: (Each undeclared identifier is reported only once
kernel/trace/ftrace.c:1615: error: for each function it appears in.)
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Currently the function tracer uses the global tracer_enabled variable that
is used to keep track if the tracer is enabled or not. The function tracing
startup needs to be separated out, otherwise the internal happenings of
the tracer startup is also recorded.
This patch creates a ftrace_function_enabled variable to all the starting
of the function traces to happen after everything has been started.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Cc: Steven Rostedt <srostedt@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
It has been suggested that I add a way to disable the function tracer
on an oops. This code adds a ftrace_kill_atomic. It is not meant to be
used in normal situations. It will disable the ftrace tracer, but will
not perform the nice shutdown that requires scheduling.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Cc: Steven Rostedt <srostedt@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This is more of a clean up. Currently the function tracer initializes the
tracer with which ever CPU was last used for tracing. This value isn't
realy useful for function tracing, but at least it should be something other
than a random number.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Cc: Steven Rostedt <srostedt@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Enabling the wakeup tracer before enabling the function tracing causes
some strange results due to the dynamic enabling of the functions.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Cc: Steven Rostedt <srostedt@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
There is no CONFIG_PREEMPT_DESKTOP. Use the proper entry CONFIG_PREEMPT.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Cc: Steven Rostedt <srostedt@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
When CONFIG_FTRACE is not enabled, the tracing_start_functon_trace
and tracing_stop_function_trace should be nops.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Cc: Steven Rostedt <srostedt@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
We have two markers now that are enabled on sched_switch. One that records
the context switching and the other that records task wake ups. Currently
we enable the tracing first and then set the markers. This causes some
confusing traces:
# tracer: sched_switch
#
# TASK-PID CPU# TIMESTAMP FUNCTION
# | | | | |
trace-cmd-3973 [00] 115.834817: 3973:120:R + 3: 0:S
trace-cmd-3973 [01] 115.834910: 3973:120:R + 6: 0:S
trace-cmd-3973 [02] 115.834910: 3973:120:R + 9: 0:S
trace-cmd-3973 [03] 115.834910: 3973:120:R + 12: 0:S
trace-cmd-3973 [02] 115.834910: 3973:120:R + 9: 0:S
<idle>-0 [02] 115.834910: 0:140:R ==> 3973:120:R
Here we see that trace-cmd with PID 3973 wakes up task 9 but the next line
shows the idle task doing a context switch to task 3973.
Enabling the tracing to _after_ the markers are set creates a much saner
output:
# tracer: sched_switch
#
# TASK-PID CPU# TIMESTAMP FUNCTION
# | | | | |
<idle>-0 [02] 7922.634225: 0:140:R ==> 4790:120:R
trace-cmd-4789 [03] 7922.634225: 0:140:R + 4790:120:R
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Cc: Steven Rostedt <srostedt@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Avoid modifying the mcount call-site if there is a kprobe installed on it.
These records are not marked as failed however. This allowed the filter
rules on them to remain up-to-date. Whenever the kprobe on the corresponding
record is removed, the record gets updated as normal.
Signed-off-by: Abhishek Sagar <sagar.abhishek@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Let records identified as being kprobe'd be marked as "frozen". The trouble
with records which have a kprobe installed on their mcount call-site is
that they don't get updated. So if such a function which is currently being
traced gets its tracing disabled due to a new filter rule (or because it
was added to the notrace list) then it won't be updated and continue being
traced. This patch allows scanning of all frozen records during tracing to
check if they should be traced.
Signed-off-by: Abhishek Sagar <sagar.abhishek@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Record the address of the mcount call-site. Currently all archs except sparc64
record the address of the instruction following the mcount call-site. Some
general cleanups are entailed. Storing mcount addresses in rec->ip enables
looking them up in the kprobe hash table later on to check if they're kprobe'd.
Signed-off-by: Abhishek Sagar <sagar.abhishek@gmail.com>
Cc: davem@davemloft.net
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This is a fix to give notrace filter rules priority over "set_ftrace_filter"
rules.
This fix ensures that functions which are set to be filtered and are
concurrently marked as "notrace" don't get recorded. As of now, if
a record is marked as FTRACE_FL_FILTER and is enabled, then the notrace
flag is not checked. Tested on x86-32.
Signed-off-by: Abhishek Sagar <sagar.abhishek@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Do not print loglevel before "entries of %ld bytes". Move it to the previous
pr_info.
Signed-off-by: Jiri Slaby <jirislaby@gmail.com>
Cc: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Found that inspite of setting the current_tracer to "none", trace from
the previous trace type continued to be collected. The patch below fixes
this and causes the trace to be disabled when the "none" type is
selected.
Compile and boot tested the patch for functionality.
Signed-off-by: Ankita Garg <ankita@in.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Steven Rostedt wrote:
> If we unload a module and reload it, will it ever get converted again?
The intent was always to filter core kernel functions to prevent their freeing.
Here's a fix which should allow re-recording of module call-sites.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Identify functions which had their mcount call-site updates failed. This can
help us track functions which ftrace shouldn't fiddle with, and are thus not
being traced. If there is no race with any external agent which is modifying
the mcount call-site, then this file displays no entries (normal case).
Signed-off-by: Abhishek Sagar <sagar.abhishek@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Prevent freeing of records which cause problems and correspond to function from
core kernel text. A new flag, FTRACE_FL_CONVERTED is used to mark a record
as "converted". All other records are patched lazily to NOPs. Failed records
now also remain on frace_hash table. Each invocation of ftrace_record_ip now
checks whether the traced function has ever been recorded (including past
failures) and doesn't re-record it again.
Signed-off-by: Abhishek Sagar <sagar.abhishek@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
In dynamic ftrace, the mcount function starts off pointing to a stub
function that just returns.
On start up, the call to the stub is modified to point to a "record_ip"
function. The job of the record_ip function is to add the function to
a pre-allocated hash list. If the function is already there, it simply is
ignored, otherwise it is added to the list.
Later, a ftraced daemon wakes up and calls kstop_machine if any functions
have been recorded, and changes the calls to the recorded functions to
a simple nop. If no functions were recorded, the daemon goes back to sleep.
The daemon wakes up once a second to see if it needs to update any newly
recorded functions into nops. Usually it does not, but if a lot of code
has been executed for the first time in the kernel, the ftraced daemon
will call kstop_machine to update those into nops.
The problem currently is that there's no way to stop the daemon from doing
this, and it can cause unneeded latencies (800us which for some is bothersome).
This patch adds a new file /debugfs/tracing/ftraced_enabled. If the daemon
is active, reading this will return "enabled\n" and "disabled\n" when the
daemon is not running. To disable the daemon, the user can echo "0" or
"disable" into this file, and "1" or "enable" to re-enable the daemon.
Since the daemon is used to convert the functions into nops to increase
the performance of the system, I also added that anytime something is
written into the ftraced_enabled file, kstop_machine will run if there
are new functions that have been detected that need to be converted.
This way the user can disable the daemon but still be able to control the
conversion of the mcount calls to nops by simply,
"echo 0 > /debugfs/tracing/ftraced_enabled"
when they need to do more conversions.
To see the number of converted functions:
"cat /debugfs/tracing/dyn_ftrace_total_info"
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Tracing functions via ftrace which have a kretprobe installed on them, can produce misleading output in their trace logs. E.g, consider the correct trace of the following sequence:
do_IRQ()
{
~
irq_enter();
~
}
Trace log (sample):
<idle>-0 [00] 4154504455.781616: irq_enter <- do_IRQ
But if irq_enter() has a kretprobe installed on it, the return value stored on the stack at each invocation is modified to divert the return to a kprobe trampoline function called kretprobe_trampoline(). So with this the trace would (currently) look like:
<idle>-0 [00] 4154504455.781616: irq_enter <- kretprobe_trampoline
Now this is quite misleading to the end user, as it suggests something that didn't actually happen. So just to avoid such misinterpretations, the inlined patch aims to output such a log as:
<idle>-0 [00] 4154504455.781616: irq_enter <- [unknown/kretprobe'd]
Signed-off-by: Abhishek Sagar <sagar.abhishek@gmail.com>
Acked-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Hi Ingo/Steven,
Ftrace currently maintains an update count which includes false updates,
i.e, updates which failed. If anything, such failures should be tracked
by some separate variable, but this patch provides a minimal fix.
Signed-off-by: Abhishek Sagar <sagar.abhishek@gmail.com>
Cc: rostedt@goodmis.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Hi Steven,
I noticed that concurrent instances of ftrace_record_ip()
have a race between ftrace_hash list traversal during
ftrace_ip_in_hash() (before acquiring ftrace_shutdown_lock)
and ftrace_add_hash(). If it's so then this should fix it.
Signed-off-by: Abhishek Sagar <sagar.abhishek@gmail.com>
Cc: rostedt@goodmis.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The new work with converting the trace hooks over to markers broke the
command line recording of ftrace. This patch fixes it again.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
While debugging latencies in the RT kernel, I found that it would be nice
to be able to filter away functions from the trace than just to filter
on functions.
I added a new interface to the debugfs tracing directory called
set_ftrace_notrace
When dynamic frace is enabled, this lets you filter away functions that will
not be recorded in the trace. It is similar to adding 'notrace' to those
functions but by doing it without recompiling the kernel.
Here's how set_ftrace_filter and set_ftrace_notrace interact. Remember, if
set_ftrace_filter is set, it removes all functions from the trace execpt for
those listed in the set_ftrace_filter. set_ftrace_notrace will prevent those
functions from being traced.
If you were to set one function in both set_ftrace_filter and
set_ftrace_notrace and that function was the same, then you would end up
with an empty trace.
the set of functions to trace is:
set_ftrace_filter == empty then
all functions not in set_ftrace_notrace
else
set of the set_ftrace_filter and not in set of set_ftrace_notrace.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Printing out new max latencies was fine for the old RT tracer. But for
mainline it is a bit messy. We also need to test if the run queue
is locked before we can do the print. This means that we may not be
printing out latencies if the run queue is locked on another CPU.
This produces inconsistencies in the output.
This patch simply removes the print altogether.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Cc: pq@iki.fi
Cc: proski@gnu.org
Cc: sandmann@redhat.com
Cc: a.p.zijlstra@chello.nl
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
This patch adds function tracing to the functions that are called
on the CPU of the task being traced.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Cc: pq@iki.fi
Cc: proski@gnu.org
Cc: sandmann@redhat.com
Cc: a.p.zijlstra@chello.nl
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The check_pages function is called often enough that it can cause problems
with trace outputs or even bringing the system to a halt.
This patch limits the check_pages to the places that are most likely to
have problems. The check is made at the flip between the global array and
the max save array, as well as when the size of the buffers changes and
the self tests.
This patch also removes the BUG_ON from check_pages and replaces it with
a WARN_ON and disabling of the tracer.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Cc: pq@iki.fi
Cc: proski@gnu.org
Cc: sandmann@redhat.com
Cc: a.p.zijlstra@chello.nl
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Non-zero pid indicates the MMIO access originated in user space.
We do not catch that kind of accesses yet, so always print zero for now.
Signed-off-by: Pekka Paalanen <pq@iki.fi>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>