diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 9e507e6cb4c8..9bb50dc78338 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -6785,6 +6785,8 @@ reserve_mem=12M:4096:trace trace_instance=boot_map^traceoff^traceprintk@trace,sched,irq + See also Documentation/trace/debugging.rst + trace_options=[option-list] [FTRACE] Enable or disable tracer options at boot. diff --git a/Documentation/trace/debugging.rst b/Documentation/trace/debugging.rst new file mode 100644 index 000000000000..54fb16239d70 --- /dev/null +++ b/Documentation/trace/debugging.rst @@ -0,0 +1,159 @@ +============================== +Using the tracer for debugging +============================== + +Copyright 2024 Google LLC. + +:Author: Steven Rostedt +:License: The GNU Free Documentation License, Version 1.2 + (dual licensed under the GPL v2) + +- Written for: 6.12 + +Introduction +------------ +The tracing infrastructure can be very useful for debugging the Linux +kernel. This document is a place to add various methods of using the tracer +for debugging. + +First, make sure that the tracefs file system is mounted:: + + $ sudo mount -t tracefs tracefs /sys/kernel/tracing + + +Using trace_printk() +-------------------- + +trace_printk() is a very lightweight utility that can be used in any context +inside the kernel, with the exception of "noinstr" sections. It can be used +in normal, softirq, interrupt and even NMI context. The trace data is +written to the tracing ring buffer in a lockless way. To make it even +lighter weight, when possible, it will only record the pointer to the format +string, and save the raw arguments into the buffer. The format and the +arguments will be post processed when the ring buffer is read. This way the +trace_printk() format conversions are not done during the hot path, where +the trace is being recorded. + +trace_printk() is meant only for debugging, and should never be added into +a subsystem of the kernel. If you need debugging traces, add trace events +instead. If a trace_printk() is found in the kernel, the following will +appear in the dmesg:: + + ********************************************************** + ** NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE ** + ** ** + ** trace_printk() being used. Allocating extra memory. ** + ** ** + ** This means that this is a DEBUG kernel and it is ** + ** unsafe for production use. ** + ** ** + ** If you see this message and you are not debugging ** + ** the kernel, report this immediately to your vendor! ** + ** ** + ** NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE ** + ********************************************************** + +Debugging kernel crashes +------------------------ +There is various methods of acquiring the state of the system when a kernel +crash occurs. This could be from the oops message in printk, or one could +use kexec/kdump. But these just show what happened at the time of the crash. +It can be very useful in knowing what happened up to the point of the crash. +The tracing ring buffer, by default, is a circular buffer than will +overwrite older events with newer ones. When a crash happens, the content of +the ring buffer will be all the events that lead up to the crash. + +There are several kernel command line parameters that can be used to help in +this. The first is "ftrace_dump_on_oops". This will dump the tracing ring +buffer when a oops occurs to the console. This can be useful if the console +is being logged somewhere. If a serial console is used, it may be prudent to +make sure the ring buffer is relatively small, otherwise the dumping of the +ring buffer may take several minutes to hours to finish. Here's an example +of the kernel command line:: + + ftrace_dump_on_oops trace_buf_size=50K + +Note, the tracing buffer is made up of per CPU buffers where each of these +buffers is broken up into sub-buffers that are by default PAGE_SIZE. The +above trace_buf_size option above sets each of the per CPU buffers to 50K, +so, on a machine with 8 CPUs, that's actually 400K total. + +Persistent buffers across boots +------------------------------- +If the system memory allows it, the tracing ring buffer can be specified at +a specific location in memory. If the location is the same across boots and +the memory is not modified, the tracing buffer can be retrieved from the +following boot. There's two ways to reserve memory for the use of the ring +buffer. + +The more reliable way (on x86) is to reserve memory with the "memmap" kernel +command line option and then use that memory for the trace_instance. This +requires a bit of knowledge of the physical memory layout of the system. The +advantage of using this method, is that the memory for the ring buffer will +always be the same:: + + memmap==12M$0x284500000 trace_instance=boot_map@0x284500000:12M + +The memmap above reserves 12 megabytes of memory at the physical memory +location 0x284500000. Then the trace_instance option will create a trace +instance "boot_map" at that same location with the same amount of memory +reserved. As the ring buffer is broke up into per CPU buffers, the 12 +megabytes will be broken up evenly between those CPUs. If you have 8 CPUs, +each per CPU ring buffer will be 1.5 megabytes in size. Note, that also +includes meta data, so the amount of memory actually used by the ring buffer +will be slightly smaller. + +Another more generic but less robust way to allocate a ring buffer mapping +at boot is with the "reserve_mem" option:: + + reserve_mem=12M:4096:trace trace_instance=boot_map@trace + +The reserve_mem option above will find 12 megabytes that are available at +boot up, and align it by 4096 bytes. It will label this memory as "trace" +that can be used by later command line options. + +The trace_instance option creates a "boot_map" instance and will use the +memory reserved by reserve_mem that was labeled as "trace". This method is +more generic but may not be as reliable. Due to KASLR, the memory reserved +by reserve_mem may not be located at the same location. If this happens, +then the ring buffer will not be from the previous boot and will be reset. + +Sometimes, by using a larger alignment, it can keep KASLR from moving things +around in such a way that it will move the location of the reserve_mem. By +using a larger alignment, you may find better that the buffer is more +consistent to where it is placed:: + + reserve_mem=12M:0x2000000:trace trace_instance=boot_map@trace + +On boot up, the memory reserved for the ring buffer is validated. It will go +through a series of tests to make sure that the ring buffer contains valid +data. If it is, it will then set it up to be available to read from the +instance. If it fails any of the tests, it will clear the entire ring buffer +and initialize it as new. + +The layout of this mapped memory may not be consistent from kernel to +kernel, so only the same kernel is guaranteed to work if the mapping is +preserved. Switching to a different kernel version may find a different +layout and mark the buffer as invalid. + +Using trace_printk() in the boot instance +----------------------------------------- +By default, the content of trace_printk() goes into the top level tracing +instance. But this instance is never preserved across boots. To have the +trace_printk() content, and some other internal tracing go to the preserved +buffer (like dump stacks), either set the instance to be the trace_printk() +destination from the kernel command line, or set it after boot up via the +trace_printk_dest option. + +After boot up:: + + echo 1 > /sys/kernel/tracing/instances/boot_map/options/trace_printk_dest + +From the kernel command line:: + + reserve_mem=12M:4096:trace trace_instance=boot_map^traceprintk^traceoff@trace + +If setting it from the kernel command line, it is recommended to also +disable tracing with the "traceoff" flag, and enable tracing after boot up. +Otherwise the trace from the most recent boot will be mixed with the trace +from the previous boot, and may make it confusing to read.