e2c18168c3
4895 Commits
Author | SHA1 | Message | Date | |
---|---|---|---|---|
|
757fa80f4e |
Tracing updates for 5.14:
- Added option for per CPU threads to the hwlat tracer - Have hwlat tracer handle hotplug CPUs - New tracer: osnoise, that detects latency caused by interrupts, softirqs and scheduling of other tasks. - Added timerlat tracer that creates a thread and measures in detail what sources of latency it has for wake ups. - Removed the "success" field of the sched_wakeup trace event. This has been hardcoded as "1" since 2015, no tooling should be looking at it now. If one exists, we can revert this commit, fix that tool and try to remove it again in the future. - tgid mapping fixed to handle more than PID_MAX_DEFAULT pids/tgids. - New boot command line option "tp_printk_stop", as tp_printk causes trace events to write to console. When user space starts, this can easily live lock the system. Having a boot option to stop just after boot up is useful to prevent that from happening. - Have ftrace_dump_on_oops boot command line option take numbers that match the numbers shown in /proc/sys/kernel/ftrace_dump_on_oops. - Bootconfig clean ups, fixes and enhancements. - New ktest script that tests bootconfig options. - Add tracepoint_probe_register_may_exist() to register a tracepoint without triggering a WARN*() if it already exists. BPF has a path from user space that can do this. All other paths are considered a bug. - Small clean ups and fixes -----BEGIN PGP SIGNATURE----- iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCYN8YPhQccm9zdGVkdEBn b29kbWlzLm9yZwAKCRAp5XQQmuv6qhxLAP9Mo5hHv7Hg6W7Ddv77rThm+qclsMR/ yW0P+eJpMm4+xAD8Cq03oE1DimPK+9WZBKU5rSqAkqG6CjgDRw6NlIszzQQ= =WEPR -----END PGP SIGNATURE----- Merge tag 'trace-v5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull tracing updates from Steven Rostedt: - Added option for per CPU threads to the hwlat tracer - Have hwlat tracer handle hotplug CPUs - New tracer: osnoise, that detects latency caused by interrupts, softirqs and scheduling of other tasks. - Added timerlat tracer that creates a thread and measures in detail what sources of latency it has for wake ups. - Removed the "success" field of the sched_wakeup trace event. This has been hardcoded as "1" since 2015, no tooling should be looking at it now. If one exists, we can revert this commit, fix that tool and try to remove it again in the future. - tgid mapping fixed to handle more than PID_MAX_DEFAULT pids/tgids. - New boot command line option "tp_printk_stop", as tp_printk causes trace events to write to console. When user space starts, this can easily live lock the system. Having a boot option to stop just after boot up is useful to prevent that from happening. - Have ftrace_dump_on_oops boot command line option take numbers that match the numbers shown in /proc/sys/kernel/ftrace_dump_on_oops. - Bootconfig clean ups, fixes and enhancements. - New ktest script that tests bootconfig options. - Add tracepoint_probe_register_may_exist() to register a tracepoint without triggering a WARN*() if it already exists. BPF has a path from user space that can do this. All other paths are considered a bug. - Small clean ups and fixes * tag 'trace-v5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (49 commits) tracing: Resize tgid_map to pid_max, not PID_MAX_DEFAULT tracing: Simplify & fix saved_tgids logic treewide: Add missing semicolons to __assign_str uses tracing: Change variable type as bool for clean-up trace/timerlat: Fix indentation on timerlat_main() trace/osnoise: Make 'noise' variable s64 in run_osnoise() tracepoint: Add tracepoint_probe_register_may_exist() for BPF tracing tracing: Fix spelling in osnoise tracer "interferences" -> "interference" Documentation: Fix a typo on trace/osnoise-tracer trace/osnoise: Fix return value on osnoise_init_hotplug_support trace/osnoise: Make interval u64 on osnoise_main trace/osnoise: Fix 'no previous prototype' warnings tracing: Have osnoise_main() add a quiescent state for task rcu seq_buf: Make trace_seq_putmem_hex() support data longer than 8 seq_buf: Fix overflow in seq_buf_putmem_hex() trace/osnoise: Support hotplug operations trace/hwlat: Support hotplug operations trace/hwlat: Protect kdata->kthread with get/put_online_cpus trace: Add timerlat tracer trace: Add osnoise tracer ... |
||
|
71bd934101 |
Merge branch 'akpm' (patches from Andrew)
Merge more updates from Andrew Morton: "190 patches. Subsystems affected by this patch series: mm (hugetlb, userfaultfd, vmscan, kconfig, proc, z3fold, zbud, ras, mempolicy, memblock, migration, thp, nommu, kconfig, madvise, memory-hotplug, zswap, zsmalloc, zram, cleanups, kfence, and hmm), procfs, sysctl, misc, core-kernel, lib, lz4, checkpatch, init, kprobes, nilfs2, hfs, signals, exec, kcov, selftests, compress/decompress, and ipc" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (190 commits) ipc/util.c: use binary search for max_idx ipc/sem.c: use READ_ONCE()/WRITE_ONCE() for use_global_lock ipc: use kmalloc for msg_queue and shmid_kernel ipc sem: use kvmalloc for sem_undo allocation lib/decompressors: remove set but not used variabled 'level' selftests/vm/pkeys: exercise x86 XSAVE init state selftests/vm/pkeys: refill shadow register after implicit kernel write selftests/vm/pkeys: handle negative sys_pkey_alloc() return code selftests/vm/pkeys: fix alloc_random_pkey() to make it really, really random kcov: add __no_sanitize_coverage to fix noinstr for all architectures exec: remove checks in __register_bimfmt() x86: signal: don't do sas_ss_reset() until we are certain that sigframe won't be abandoned hfsplus: report create_date to kstat.btime hfsplus: remove unnecessary oom message nilfs2: remove redundant continue statement in a while-loop kprobes: remove duplicated strong free_insn_page in x86 and s390 init: print out unknown kernel parameters checkpatch: do not complain about positive return values starting with EPOLL checkpatch: improve the indented label test checkpatch: scripts/spdxcheck.py now requires python3 ... |
||
|
4030a6e6a6 |
tracing: Resize tgid_map to pid_max, not PID_MAX_DEFAULT
Currently tgid_map is sized at PID_MAX_DEFAULT entries, which means that
on systems where pid_max is configured higher than PID_MAX_DEFAULT the
ftrace record-tgid option doesn't work so well. Any tasks with PIDs
higher than PID_MAX_DEFAULT are simply not recorded in tgid_map, and
don't show up in the saved_tgids file.
In particular since systemd v243 & above configure pid_max to its
highest possible 1<<22 value by default on 64 bit systems this renders
the record-tgids option of little use.
Increase the size of tgid_map to the configured pid_max instead,
allowing it to cover the full range of PIDs up to the maximum value of
PID_MAX_LIMIT if the system is configured that way.
On 64 bit systems with pid_max == PID_MAX_LIMIT this will increase the
size of tgid_map from 256KiB to 16MiB. Whilst this 64x increase in
memory overhead sounds significant 64 bit systems are presumably best
placed to accommodate it, and since tgid_map is only allocated when the
record-tgid option is actually used presumably the user would rather it
spends sufficient memory to actually record the tgids they expect.
The size of tgid_map could also increase for CONFIG_BASE_SMALL=y
configurations, but these seem unlikely to be systems upon which people
are both configuring a large pid_max and running ftrace with record-tgid
anyway.
Of note is that we only allocate tgid_map once, the first time that the
record-tgid option is enabled. Therefore its size is only set once, to
the value of pid_max at the time the record-tgid option is first
enabled. If a user increases pid_max after that point, the saved_tgids
file will not contain entries for any tasks with pids beyond the earlier
value of pid_max.
Link: https://lkml.kernel.org/r/20210701172407.889626-2-paulburton@google.com
Fixes:
|
||
|
f39650de68 |
kernel.h: split out panic and oops helpers
kernel.h is being used as a dump for all kinds of stuff for a long time. Here is the attempt to start cleaning it up by splitting out panic and oops helpers. There are several purposes of doing this: - dropping dependency in bug.h - dropping a loop by moving out panic_notifier.h - unload kernel.h from something which has its own domain At the same time convert users tree-wide to use new headers, although for the time being include new header back to kernel.h to avoid twisted indirected includes for existing users. [akpm@linux-foundation.org: thread_info.h needs limits.h] [andriy.shevchenko@linux.intel.com: ia64 fix] Link: https://lkml.kernel.org/r/20210520130557.55277-1-andriy.shevchenko@linux.intel.com Link: https://lkml.kernel.org/r/20210511074137.33666-1-andriy.shevchenko@linux.intel.com Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Reviewed-by: Bjorn Andersson <bjorn.andersson@linaro.org> Co-developed-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Mike Rapoport <rppt@linux.ibm.com> Acked-by: Corey Minyard <cminyard@mvista.com> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Acked-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Kees Cook <keescook@chromium.org> Acked-by: Wei Liu <wei.liu@kernel.org> Acked-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Sebastian Reichel <sre@kernel.org> Acked-by: Luis Chamberlain <mcgrof@kernel.org> Acked-by: Stephen Boyd <sboyd@kernel.org> Acked-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Acked-by: Helge Deller <deller@gmx.de> # parisc Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
b81b3e959a |
tracing: Simplify & fix saved_tgids logic
The tgid_map array records a mapping from pid to tgid, where the index
of an entry within the array is the pid & the value stored at that index
is the tgid.
The saved_tgids_next() function iterates over pointers into the tgid_map
array & dereferences the pointers which results in the tgid, but then it
passes that dereferenced value to trace_find_tgid() which treats it as a
pid & does a further lookup within the tgid_map array. It seems likely
that the intent here was to skip over entries in tgid_map for which the
recorded tgid is zero, but instead we end up skipping over entries for
which the thread group leader hasn't yet had its own tgid recorded in
tgid_map.
A minimal fix would be to remove the call to trace_find_tgid, turning:
if (trace_find_tgid(*ptr))
into:
if (*ptr)
..but it seems like this logic can be much simpler if we simply let
seq_read() iterate over the whole tgid_map array & filter out empty
entries by returning SEQ_SKIP from saved_tgids_show(). Here we take that
approach, removing the incorrect logic here entirely.
Link: https://lkml.kernel.org/r/20210630003406.4013668-1-paulburton@google.com
Fixes:
|
||
|
bfbf8d157a |
tracing: Change variable type as bool for clean-up
The wakeup_rt wakeup_dl, tracing_dl is only set to 0, 1. So changing type of wakeup_rt wakeup_dl, tracing_dl as bool makes relevant routine be more readable. Link: https://lkml.kernel.org/r/20210629140548.GA1627@raspberrypi Signed-off-by: Austin Kim <austin.kim@lge.com> [ Removed unneeded initialization of static bool tracing_dl ] Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> |
||
|
6a82f42a2e |
trace/timerlat: Fix indentation on timerlat_main()
Dan Carpenter reported that: The patch |
||
|
19c3eaa722 |
trace/osnoise: Make 'noise' variable s64 in run_osnoise()
Dab Carpenter reported that: The patch |
||
|
9913d5745b |
tracepoint: Add tracepoint_probe_register_may_exist() for BPF tracing
All internal use cases for tracepoint_probe_register() is set to not ever
be called with the same function and data. If it is, it is considered a
bug, as that means the accounting of handling tracepoints is corrupted.
If the function and data for a tracepoint is already registered when
tracepoint_probe_register() is called, it will call WARN_ON_ONCE() and
return with EEXISTS.
The BPF system call can end up calling tracepoint_probe_register() with
the same data, which now means that this can trigger the warning because
of a user space process. As WARN_ON_ONCE() should not be called because
user space called a system call with bad data, there needs to be a way to
register a tracepoint without triggering a warning.
Enter tracepoint_probe_register_may_exist(), which can be called, but will
not cause a WARN_ON() if the probe already exists. It will still error out
with EEXIST, which will then be sent to the user space that performed the
BPF system call.
This keeps the previous testing for issues with other users of the
tracepoint code, while letting BPF call it with duplicated data and not
warn about it.
Link: https://lore.kernel.org/lkml/20210626135845.4080-1-penguin-kernel@I-love.SAKURA.ne.jp/
Link: https://syzkaller.appspot.com/bug?id=41f4318cf01762389f4d1c1c459da4f542fe5153
Cc: stable@vger.kernel.org
Fixes:
|
||
|
a358f40600 |
once: implement DO_ONCE_LITE for non-fast-path "do once" functionality
Certain uses of "do once" functionality reside outside of fast path, and so do not require jump label patching via static keys, making existing DO_ONCE undesirable in such cases. Replace uses of __section(".data.once") with DO_ONCE_LITE(_IF)? This patch changes the return values of xfs_printk_once, printk_once, and printk_deferred_once. Before, they returned whether the print was performed, but now, they always return true. This is okay because the return values of the following macros are entirely ignored throughout the kernel: - xfs_printk_once - xfs_warn_once - xfs_notice_once - xfs_info_once - printk_once - pr_emerg_once - pr_alert_once - pr_crit_once - pr_err_once - pr_warn_once - pr_notice_once - pr_info_once - pr_devel_once - pr_debug_once - printk_deferred_once - orc_warn Changes v3: - Expand commit message to explain why changing return values of xfs_printk_once, printk_once, printk_deferred_once is benign v2: - Fix i386 build warnings Signed-off-by: Tanner Love <tannerlove@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Acked-by: Mahesh Bandewar <maheshb@google.com> Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: David S. Miller <davem@davemloft.net> |
||
|
e1289cfb63 |
Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says: ==================== pull-request: bpf-next 2021-06-28 The following pull-request contains BPF updates for your *net-next* tree. We've added 37 non-merge commits during the last 12 day(s) which contain a total of 56 files changed, 394 insertions(+), 380 deletions(-). The main changes are: 1) XDP driver RCU cleanups, from Toke Høiland-Jørgensen and Paul E. McKenney. 2) Fix bpf_skb_change_proto() IPv4/v6 GSO handling, from Maciej Żenczykowski. 3) Fix false positive kmemleak report for BPF ringbuf alloc, from Rustam Kovhaev. 4) Fix x86 JIT's extable offset calculation for PROBE_LDX NULL, from Ravi Bangoria. 5) Enable libbpf fallback probing with tracing under RHEL7, from Jonathan Edwards. 6) Clean up x86 JIT to remove unused cnt tracking from EMIT macro, from Jiri Olsa. 7) Netlink cleanups for libbpf to please Coverity, from Kumar Kartikeya Dwivedi. 8) Allow to retrieve ancestor cgroup id in tracing programs, from Namhyung Kim. 9) Fix lirc BPF program query to use user-provided prog_cnt, from Sean Young. 10) Add initial libbpf doc including generated kdoc for its API, from Grant Seltzer. 11) Make xdp_rxq_info_unreg_mem_model() more robust, from Jakub Kicinski. 12) Fix up bpfilter startup log-level to info level, from Gary Lin. ==================== Signed-off-by: David S. Miller <davem@davemloft.net> |
||
|
498627b4ac |
trace/osnoise: Fix return value on osnoise_init_hotplug_support
kernel test robot reported:
>> kernel/trace/trace_osnoise.c:1584:2: error: void function
'osnoise_init_hotplug_support' should not return a
value [-Wreturn-type]
return 0;
When !CONFIG_HOTPLUG_CPU.
Fix it problem by removing the return value.
Link: https://lkml.kernel.org/r/c7fc67f1a117cc88bab2e508c898634872795341.1624872608.git.bristot@redhat.com
Fixes:
|
||
|
2a81afa326 |
trace/osnoise: Make interval u64 on osnoise_main
kernel test robot reported:
>> kernel/trace/trace_osnoise.c:966:3: warning: comparison of distinct
pointer types ('typeof ((interval)) *' (aka 'long long *') and
'uint64_t *' (aka 'unsigned long long *'))
[-Wcompare-distinct-pointer-types]
do_div(interval, USEC_PER_MSEC);
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
include/asm-generic/div64.h:228:28: note: expanded from macro 'do_div'
(void)(((typeof((n)) *)0) == ((uint64_t *)0)); \
~~~~~~~~~~~~~~~~~~ ^ ~~~~~~~~~~~~~~~
As interval cannot be negative because sample_period >= sample_runtime,
making interval u64 on osnoise_main() is enough to fix this problem.
Link: https://lkml.kernel.org/r/4ae1e7780563598563de079a3ef6d4d10b5f5546.1624872608.git.bristot@redhat.com
Fixes:
|
||
|
f7d9f6370e |
trace/osnoise: Fix 'no previous prototype' warnings
kernel test robot reported some osnoise functions with "no previous
prototype."
Fix these warnings by making local functions static, and by adding:
void osnoise_trace_irq_entry(int id);
void osnoise_trace_irq_exit(int id, const char *desc);
to include/linux/trace.h.
Link: https://lkml.kernel.org/r/e40d3cb4be8bde921f4b40fa6a095cf85ab807bd.1624872608.git.bristot@redhat.com
Fixes:
|
||
|
b96285e10a |
tracing: Have osnoise_main() add a quiescent state for task rcu
ftracetest triggered:
INFO: rcu_tasks detected stalls on tasks:
00000000b92b832d: .. nvcsw: 1/1 holdout: 1 idle_cpu: -1/7
task:osnoise/7 state:R running task stack: 0 pid: 2133 ppid: 2 flags:0x00004000
Call Trace:
? asm_sysvec_apic_timer_interrupt+0x12/0x20
? asm_sysvec_apic_timer_interrupt+0x12/0x20
? trace_hardirqs_on+0x2b/0xe0
? asm_sysvec_apic_timer_interrupt+0x12/0x20
? trace_clock_local+0xc/0x20
? osnoise_main+0x10e/0x450
? trace_softirq_entry_callback+0x50/0x50
? kthread+0x153/0x170
? __kthread_bind_mask+0x60/0x60
? ret_from_fork+0x22/0x30
While running osnoise tracer with other tracers that rely on
synchronize_rcu_tasks(), where that just hung.
The reason is that osnoise_main() never schedules out if the interval
is less than 1, and this will cause synchronize_rcu_tasks() to never
return.
Link: https://lkml.kernel.org/r/20210628114953.6dc06a91@oasis.local.home
Fixes:
|
||
|
95b861a793 |
bpf: Allow bpf_get_current_ancestor_cgroup_id for tracing
Allow the helper to be called from tracing programs. This is needed to handle cgroup hiererachies in the program. Signed-off-by: Namhyung Kim <namhyung@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20210627153627.824198-1-namhyung@kernel.org |
||
|
c8895e271f |
trace/osnoise: Support hotplug operations
Enable and disable osnoise/timerlat thread during on CPU hotplug online and offline operations respectivelly. Link: https://lore.kernel.org/linux-doc/20210621134636.5b332226@oasis.local.home/ Link: https://lkml.kernel.org/r/39f98590b3caeb3c32f09526214058efe0e9272a.1624372313.git.bristot@redhat.com Cc: Phil Auld <pauld@redhat.com> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Kate Carcia <kcarcia@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Alexandre Chartre <alexandre.chartre@oracle.com> Cc: Clark Willaims <williams@redhat.com> Cc: John Kacur <jkacur@redhat.com> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: x86@kernel.org Cc: linux-doc@vger.kernel.org Cc: linux-kernel@vger.kernel.org Suggested-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> |
||
|
ba998f7d95 |
trace/hwlat: Support hotplug operations
Enable and disable hwlat thread during cpu hotplug online and offline operations, respectivelly. Link: https://lore.kernel.org/linux-doc/20210621134636.5b332226@oasis.local.home/ Link: https://lkml.kernel.org/r/52012d25ea35491a0f8088b947864d8df8e25157.1624372313.git.bristot@redhat.com Cc: Phil Auld <pauld@redhat.com> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Kate Carcia <kcarcia@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Alexandre Chartre <alexandre.chartre@oracle.com> Cc: Clark Willaims <williams@redhat.com> Cc: John Kacur <jkacur@redhat.com> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: x86@kernel.org Cc: linux-doc@vger.kernel.org Cc: linux-kernel@vger.kernel.org Suggested-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> |
||
|
039a602db3 |
trace/hwlat: Protect kdata->kthread with get/put_online_cpus
In preparation to the hotplug support, protect kdata->kthread with get/put_online_cpus() to avoid concurrency with hotplug operations. Link: https://lore.kernel.org/linux-doc/20210621134636.5b332226@oasis.local.home/ Link: https://lkml.kernel.org/r/8bdb2a56f46abfd301d6fffbf43448380c09a6f5.1624372313.git.bristot@redhat.com Cc: Phil Auld <pauld@redhat.com> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Kate Carcia <kcarcia@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Alexandre Chartre <alexandre.chartre@oracle.com> Cc: Clark Willaims <williams@redhat.com> Cc: John Kacur <jkacur@redhat.com> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: x86@kernel.org Cc: linux-doc@vger.kernel.org Cc: linux-kernel@vger.kernel.org Suggested-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> |
||
|
a955d7eac1 |
trace: Add timerlat tracer
The timerlat tracer aims to help the preemptive kernel developers to found souces of wakeup latencies of real-time threads. Like cyclictest, the tracer sets a periodic timer that wakes up a thread. The thread then computes a *wakeup latency* value as the difference between the *current time* and the *absolute time* that the timer was set to expire. The main goal of timerlat is tracing in such a way to help kernel developers. Usage Write the ASCII text "timerlat" into the current_tracer file of the tracing system (generally mounted at /sys/kernel/tracing). For example: [root@f32 ~]# cd /sys/kernel/tracing/ [root@f32 tracing]# echo timerlat > current_tracer It is possible to follow the trace by reading the trace trace file: [root@f32 tracing]# cat trace # tracer: timerlat # # _-----=> irqs-off # / _----=> need-resched # | / _---=> hardirq/softirq # || / _--=> preempt-depth # || / # |||| ACTIVATION # TASK-PID CPU# |||| TIMESTAMP ID CONTEXT LATENCY # | | | |||| | | | | <idle>-0 [000] d.h1 54.029328: #1 context irq timer_latency 932 ns <...>-867 [000] .... 54.029339: #1 context thread timer_latency 11700 ns <idle>-0 [001] dNh1 54.029346: #1 context irq timer_latency 2833 ns <...>-868 [001] .... 54.029353: #1 context thread timer_latency 9820 ns <idle>-0 [000] d.h1 54.030328: #2 context irq timer_latency 769 ns <...>-867 [000] .... 54.030330: #2 context thread timer_latency 3070 ns <idle>-0 [001] d.h1 54.030344: #2 context irq timer_latency 935 ns <...>-868 [001] .... 54.030347: #2 context thread timer_latency 4351 ns The tracer creates a per-cpu kernel thread with real-time priority that prints two lines at every activation. The first is the *timer latency* observed at the *hardirq* context before the activation of the thread. The second is the *timer latency* observed by the thread, which is the same level that cyclictest reports. The ACTIVATION ID field serves to relate the *irq* execution to its respective *thread* execution. The irq/thread splitting is important to clarify at which context the unexpected high value is coming from. The *irq* context can be delayed by hardware related actions, such as SMIs, NMIs, IRQs or by a thread masking interrupts. Once the timer happens, the delay can also be influenced by blocking caused by threads. For example, by postponing the scheduler execution via preempt_disable(), by the scheduler execution, or by masking interrupts. Threads can also be delayed by the interference from other threads and IRQs. The timerlat can also take advantage of the osnoise: traceevents. For example: [root@f32 ~]# cd /sys/kernel/tracing/ [root@f32 tracing]# echo timerlat > current_tracer [root@f32 tracing]# echo osnoise > set_event [root@f32 tracing]# echo 25 > osnoise/stop_tracing_total_us [root@f32 tracing]# tail -10 trace cc1-87882 [005] d..h... 548.771078: #402268 context irq timer_latency 1585 ns cc1-87882 [005] dNLh1.. 548.771082: irq_noise: local_timer:236 start 548.771077442 duration 4597 ns cc1-87882 [005] dNLh2.. 548.771083: irq_noise: reschedule:253 start 548.771083017 duration 56 ns cc1-87882 [005] dNLh2.. 548.771086: irq_noise: call_function_single:251 start 548.771083811 duration 2048 ns cc1-87882 [005] dNLh2.. 548.771088: irq_noise: call_function_single:251 start 548.771086814 duration 1495 ns cc1-87882 [005] dNLh2.. 548.771091: irq_noise: call_function_single:251 start 548.771089194 duration 1558 ns cc1-87882 [005] dNLh2.. 548.771094: irq_noise: call_function_single:251 start 548.771091719 duration 1932 ns cc1-87882 [005] dNLh2.. 548.771096: irq_noise: call_function_single:251 start 548.771094696 duration 1050 ns cc1-87882 [005] d...3.. 548.771101: thread_noise: cc1:87882 start 548.771078243 duration 10909 ns timerlat/5-1035 [005] ....... 548.771103: #402268 context thread timer_latency 25960 ns For further information see: Documentation/trace/timerlat-tracer.rst Link: https://lkml.kernel.org/r/71f18efc013e1194bcaea1e54db957de2b19ba62.1624372313.git.bristot@redhat.com Cc: Phil Auld <pauld@redhat.com> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Kate Carcia <kcarcia@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Alexandre Chartre <alexandre.chartre@oracle.com> Cc: Clark Willaims <williams@redhat.com> Cc: John Kacur <jkacur@redhat.com> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: x86@kernel.org Cc: linux-doc@vger.kernel.org Cc: linux-kernel@vger.kernel.org Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> |
||
|
bce29ac9ce |
trace: Add osnoise tracer
In the context of high-performance computing (HPC), the Operating System Noise (*osnoise*) refers to the interference experienced by an application due to activities inside the operating system. In the context of Linux, NMIs, IRQs, SoftIRQs, and any other system thread can cause noise to the system. Moreover, hardware-related jobs can also cause noise, for example, via SMIs. The osnoise tracer leverages the hwlat_detector by running a similar loop with preemption, SoftIRQs and IRQs enabled, thus allowing all the sources of *osnoise* during its execution. Using the same approach of hwlat, osnoise takes note of the entry and exit point of any source of interferences, increasing a per-cpu interference counter. The osnoise tracer also saves an interference counter for each source of interference. The interference counter for NMI, IRQs, SoftIRQs, and threads is increased anytime the tool observes these interferences' entry events. When a noise happens without any interference from the operating system level, the hardware noise counter increases, pointing to a hardware-related noise. In this way, osnoise can account for any source of interference. At the end of the period, the osnoise tracer prints the sum of all noise, the max single noise, the percentage of CPU available for the thread, and the counters for the noise sources. Usage Write the ASCII text "osnoise" into the current_tracer file of the tracing system (generally mounted at /sys/kernel/tracing). For example:: [root@f32 ~]# cd /sys/kernel/tracing/ [root@f32 tracing]# echo osnoise > current_tracer It is possible to follow the trace by reading the trace trace file:: [root@f32 tracing]# cat trace # tracer: osnoise # # _-----=> irqs-off # / _----=> need-resched # | / _---=> hardirq/softirq # || / _--=> preempt-depth MAX # || / SINGLE Interference counters: # |||| RUNTIME NOISE % OF CPU NOISE +-----------------------------+ # TASK-PID CPU# |||| TIMESTAMP IN US IN US AVAILABLE IN US HW NMI IRQ SIRQ THREAD # | | | |||| | | | | | | | | | | <...>-859 [000] .... 81.637220: 1000000 190 99.98100 9 18 0 1007 18 1 <...>-860 [001] .... 81.638154: 1000000 656 99.93440 74 23 0 1006 16 3 <...>-861 [002] .... 81.638193: 1000000 5675 99.43250 202 6 0 1013 25 21 <...>-862 [003] .... 81.638242: 1000000 125 99.98750 45 1 0 1011 23 0 <...>-863 [004] .... 81.638260: 1000000 1721 99.82790 168 7 0 1002 49 41 <...>-864 [005] .... 81.638286: 1000000 263 99.97370 57 6 0 1006 26 2 <...>-865 [006] .... 81.638302: 1000000 109 99.98910 21 3 0 1006 18 1 <...>-866 [007] .... 81.638326: 1000000 7816 99.21840 107 8 0 1016 39 19 In addition to the regular trace fields (from TASK-PID to TIMESTAMP), the tracer prints a message at the end of each period for each CPU that is running an osnoise/CPU thread. The osnoise specific fields report: - The RUNTIME IN USE reports the amount of time in microseconds that the osnoise thread kept looping reading the time. - The NOISE IN US reports the sum of noise in microseconds observed by the osnoise tracer during the associated runtime. - The % OF CPU AVAILABLE reports the percentage of CPU available for the osnoise thread during the runtime window. - The MAX SINGLE NOISE IN US reports the maximum single noise observed during the runtime window. - The Interference counters display how many each of the respective interference happened during the runtime window. Note that the example above shows a high number of HW noise samples. The reason being is that this sample was taken on a virtual machine, and the host interference is detected as a hardware interference. Tracer options The tracer has a set of options inside the osnoise directory, they are: - osnoise/cpus: CPUs at which a osnoise thread will execute. - osnoise/period_us: the period of the osnoise thread. - osnoise/runtime_us: how long an osnoise thread will look for noise. - osnoise/stop_tracing_us: stop the system tracing if a single noise higher than the configured value happens. Writing 0 disables this option. - osnoise/stop_tracing_total_us: stop the system tracing if total noise higher than the configured value happens. Writing 0 disables this option. - tracing_threshold: the minimum delta between two time() reads to be considered as noise, in us. When set to 0, the default value will be used, which is currently 5 us. Additional Tracing In addition to the tracer, a set of tracepoints were added to facilitate the identification of the osnoise source. - osnoise:sample_threshold: printed anytime a noise is higher than the configurable tolerance_ns. - osnoise:nmi_noise: noise from NMI, including the duration. - osnoise:irq_noise: noise from an IRQ, including the duration. - osnoise:softirq_noise: noise from a SoftIRQ, including the duration. - osnoise:thread_noise: noise from a thread, including the duration. Note that all the values are *net values*. For example, if while osnoise is running, another thread preempts the osnoise thread, it will start a thread_noise duration at the start. Then, an IRQ takes place, preempting the thread_noise, starting a irq_noise. When the IRQ ends its execution, it will compute its duration, and this duration will be subtracted from the thread_noise, in such a way as to avoid the double accounting of the IRQ execution. This logic is valid for all sources of noise. Here is one example of the usage of these tracepoints:: osnoise/8-961 [008] d.h. 5789.857532: irq_noise: local_timer:236 start 5789.857529929 duration 1845 ns osnoise/8-961 [008] dNh. 5789.858408: irq_noise: local_timer:236 start 5789.858404871 duration 2848 ns migration/8-54 [008] d... 5789.858413: thread_noise: migration/8:54 start 5789.858409300 duration 3068 ns osnoise/8-961 [008] .... 5789.858413: sample_threshold: start 5789.858404555 duration 8723 ns interferences 2 In this example, a noise sample of 8 microseconds was reported in the last line, pointing to two interferences. Looking backward in the trace, the two previous entries were about the migration thread running after a timer IRQ execution. The first event is not part of the noise because it took place one millisecond before. It is worth noticing that the sum of the duration reported in the tracepoints is smaller than eight us reported in the sample_threshold. The reason roots in the overhead of the entry and exit code that happens before and after any interference execution. This justifies the dual approach: measuring thread and tracing. Link: https://lkml.kernel.org/r/e649467042d60e7b62714c9c6751a56299d15119.1624372313.git.bristot@redhat.com Cc: Phil Auld <pauld@redhat.com> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Kate Carcia <kcarcia@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Alexandre Chartre <alexandre.chartre@oracle.com> Cc: Clark Willaims <williams@redhat.com> Cc: John Kacur <jkacur@redhat.com> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: x86@kernel.org Cc: linux-doc@vger.kernel.org Cc: linux-kernel@vger.kernel.org Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com> [ Made the following functions static: trace_irqentry_callback() trace_irqexit_callback() trace_intel_irqentry_callback() trace_intel_irqexit_callback() Added to include/trace.h: osnoise_arch_register() osnoise_arch_unregister() Fixed define logic for LATENCY_FS_NOTIFY Reported-by: kernel test robot <lkp@intel.com> ] Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> |
||
|
6880c987e4 |
tracing: Add LATENCY_FS_NOTIFY to define if latency_fsnotify() is defined
With the coming addition of the osnoise tracer, the configs needed to include the latency_fsnotify() has become more complex, and to keep the declaration in the header file the same as in the C file, just have the logic needed to define it in one place, and that defines LATENCY_FS_NOTIFY which will be used in the C code. Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> |
||
|
aa892f8c88 |
trace/hwlat: Remove printk from sampling loop
hwlat has some time operation checks on the sample loop, and it is currently using pr_err (printk) to report them. The problem is that this can lead the system to an unresponsible state due to an overflow of printk messages. This problem can be mitigated by writing the error message to the trace buffer. Remove the printk messages from the sampling loop, switching the to messages in the trace buffer. No functional change. Link: https://lkml.kernel.org/r/9d77c34869748aa105e965c769d24642914eea3a.1624372313.git.bristot@redhat.com Cc: Phil Auld <pauld@redhat.com> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Kate Carcia <kcarcia@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Alexandre Chartre <alexandre.chartre@oracle.com> Cc: Clark Willaims <williams@redhat.com> Cc: John Kacur <jkacur@redhat.com> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: x86@kernel.org Cc: linux-doc@vger.kernel.org Cc: linux-kernel@vger.kernel.org Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> |
||
|
f27a1c9e1b |
trace/hwlat: Use trace_min_max_param for width and window params
Use the trace_min_max_param to reduce code duplication. No functional change. Link: https://lkml.kernel.org/r/b91accd5a7c6c14ea02d3379aae974ba22b47dd6.1624372313.git.bristot@redhat.com Cc: Phil Auld <pauld@redhat.com> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Kate Carcia <kcarcia@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Alexandre Chartre <alexandre.chartre@oracle.com> Cc: Clark Willaims <williams@redhat.com> Cc: John Kacur <jkacur@redhat.com> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: x86@kernel.org Cc: linux-doc@vger.kernel.org Cc: linux-kernel@vger.kernel.org Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> |
||
|
bc87cf0a08 |
trace: Add a generic function to read/write u64 values from tracefs
The hwlat detector and (in preparation for) the osnoise/timerlat tracers have a set of u64 parameters that the user can read/write via tracefs. For instance, we have hwlat_detector's window and width. To reduce the code duplication, hwlat's window and width share the same read function. However, they do not share the write functions because they do different parameter checks. For instance, the width needs to be smaller than the window, while the window needs to be larger than the window. The same pattern repeats on osnoise/timerlat, and a large portion of the code was devoted to the write function. Despite having different checks, the write functions have the same structure: read a user-space buffer take the lock that protects the value check for minimum and maximum acceptable values save the value release the lock return success or error To reduce the code duplication also in the write functions, this patch provides a generic read and write implementation for u64 values that need to be within some minimum and/or maximum parameters, while (potentially) being protected by a lock. To use this interface, the structure trace_min_max_param needs to be filled: struct trace_min_max_param { struct mutex *lock; u64 *val; u64 *min; u64 *max; }; The desired value is stored on the variable pointed by *val. If *min points to a minimum acceptable value, it will be checked during the write operation. Likewise, if *max points to a maximum allowable value, it will be checked during the write operation. Finally, if *lock points to a mutex, it will be taken at the beginning of the operation and released at the end. The definition of a trace_min_max_param needs to passed as the (private) *data for tracefs_create_file(), and the trace_min_max_fops (added by this patch) as the *fops file_operations. Link: https://lkml.kernel.org/r/3e35760a7c8b5c55f16ae5ad5fc54a0e71cbe647.1624372313.git.bristot@redhat.com Cc: Phil Auld <pauld@redhat.com> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Kate Carcia <kcarcia@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Alexandre Chartre <alexandre.chartre@oracle.com> Cc: Clark Willaims <williams@redhat.com> Cc: John Kacur <jkacur@redhat.com> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: x86@kernel.org Cc: linux-doc@vger.kernel.org Cc: linux-kernel@vger.kernel.org Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> |
||
|
f46b16520a |
trace/hwlat: Implement the per-cpu mode
Implements the per-cpu mode in which a sampling thread is created for each cpu in the "cpus" (and tracing_mask). The per-cpu mode has the potention to speed up the hwlat detection by running on multiple CPUs at the same time, at the cost of higher cpu usage with irqs disabled. Use with care. [ Changed get_cpu_data() to static. Reported-by: kernel test robot <lkp@intel.com> ] Link: https://lkml.kernel.org/r/ec06d0ab340e8460d293772faba19ad8a5c371aa.1624372313.git.bristot@redhat.com Cc: Phil Auld <pauld@redhat.com> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Kate Carcia <kcarcia@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Alexandre Chartre <alexandre.chartre@oracle.com> Cc: Clark Willaims <williams@redhat.com> Cc: John Kacur <jkacur@redhat.com> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: x86@kernel.org Cc: linux-doc@vger.kernel.org Cc: linux-kernel@vger.kernel.org Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> |
||
|
7bb7d802af |
trace/hwlat: Switch disable_migrate to mode none
When in the round-robin mode, if the tracer detects a change in the hwlatd thread affinity by an external tool, e.g., taskset, the round-robin logic is disabled. The disable_migrate variable currently tracks this. With the addition of the "mode" config and the mode "none," the disable_migrate logic is equivalent to switch to the "none" mode. Hence, instead of using a hidden variable to track this behavior, switch the mode to none, informing the user about this change. Link: https://lkml.kernel.org/r/a679af672458d6b1f62252605905c5214030f247.1624372313.git.bristot@redhat.com Cc: Phil Auld <pauld@redhat.com> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Kate Carcia <kcarcia@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Alexandre Chartre <alexandre.chartre@oracle.com> Cc: Clark Willaims <williams@redhat.com> Cc: John Kacur <jkacur@redhat.com> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: x86@kernel.org Cc: linux-doc@vger.kernel.org Cc: linux-kernel@vger.kernel.org Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> |
||
|
8fa826b734 |
trace/hwlat: Implement the mode config option
Provides the "mode" config to the hardware latency detector. hwlatd has two different operation modes. The default mode is the "round-robin" one, in which a single hwlatd thread runs, migrating among the allowed CPUs in a "round-robin" fashion. This is the current behavior. The "none" sets the allowed cpumask for a single hwlatd thread at the startup, but skips the round-robin, letting the scheduler handle the migration. In preparation to the per-cpu mode. Link: https://lkml.kernel.org/r/f3b1271262aa030c680e26615c1b9b2d71e55e92.1624372313.git.bristot@redhat.com Cc: Phil Auld <pauld@redhat.com> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Kate Carcia <kcarcia@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Alexandre Chartre <alexandre.chartre@oracle.com> Cc: Clark Willaims <williams@redhat.com> Cc: John Kacur <jkacur@redhat.com> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: x86@kernel.org Cc: linux-doc@vger.kernel.org Cc: linux-kernel@vger.kernel.org Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> |
||
|
bb1b24cf41 |
trace/hwlat: Fix Clark's email
Clark's email is williams@redhat.com. No functional change. Link: https://lkml.kernel.org/r/6fa4b49e17ab8a1ff19c335ab7cde38d8afb0e29.1624372313.git.bristot@redhat.com Cc: Phil Auld <pauld@redhat.com> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Kate Carcia <kcarcia@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Alexandre Chartre <alexandre.chartre@oracle.com> Cc: Clark Willaims <williams@redhat.com> Cc: John Kacur <jkacur@redhat.com> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: x86@kernel.org Cc: linux-doc@vger.kernel.org Cc: linux-kernel@vger.kernel.org Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> |
||
|
89529d8b8f |
tracing: Do no increment trace_clock_global() by one
The trace_clock_global() tries to make sure the events between CPUs is somewhat in order. A global value is used and updated by the latest read of a clock. If one CPU is ahead by a little, and is read by another CPU, a lock is taken, and if the timestamp of the other CPU is behind, it will simply use the other CPUs timestamp. The lock is also only taken with a "trylock" due to tracing, and strange recursions can happen. The lock is not taken at all in NMI context. In the case where the lock is not able to be taken, the non synced timestamp is returned. But it will not be less than the saved global timestamp. The problem arises because when the time goes "backwards" the time returned is the saved timestamp plus 1. If the lock is not taken, and the plus one to the timestamp is returned, there's a small race that can cause the time to go backwards! CPU0 CPU1 ---- ---- trace_clock_global() { ts = clock() [ 1000 ] trylock(clock_lock) [ success ] global_ts = ts; [ 1000 ] <interrupted by NMI> trace_clock_global() { ts = clock() [ 999 ] if (ts < global_ts) ts = global_ts + 1 [ 1001 ] trylock(clock_lock) [ fail ] return ts [ 1001] } unlock(clock_lock); return ts; [ 1000 ] } trace_clock_global() { ts = clock() [ 1000 ] if (ts < global_ts) [ false 1000 == 1000 ] trylock(clock_lock) [ success ] global_ts = ts; [ 1000 ] unlock(clock_lock) return ts; [ 1000 ] } The above case shows to reads of trace_clock_global() on the same CPU, but the second read returns one less than the first read. That is, time when backwards, and this is not what is allowed by trace_clock_global(). This was triggered by heavy tracing and the ring buffer checker that tests for the clock going backwards: Ring buffer clock went backwards: 20613921464 -> 20613921463 ------------[ cut here ]------------ WARNING: CPU: 2 PID: 0 at kernel/trace/ring_buffer.c:3412 check_buffer+0x1b9/0x1c0 Modules linked in: [..] [CPU: 2]TIME DOES NOT MATCH expected:20620711698 actual:20620711697 delta:6790234 before:20613921463 after:20613921463 [20613915818] PAGE TIME STAMP [20613915818] delta:0 [20613915819] delta:1 [20613916035] delta:216 [20613916465] delta:430 [20613916575] delta:110 [20613916749] delta:174 [20613917248] delta:499 [20613917333] delta:85 [20613917775] delta:442 [20613917921] delta:146 [20613918321] delta:400 [20613918568] delta:247 [20613918768] delta:200 [20613919306] delta:538 [20613919353] delta:47 [20613919980] delta:627 [20613920296] delta:316 [20613920571] delta:275 [20613920862] delta:291 [20613921152] delta:290 [20613921464] delta:312 [20613921464] delta:0 TIME EXTEND [20613921464] delta:0 This happened more than once, and always for an off by one result. It also started happening after commit |
||
|
4fdd595e4f |
tracing: Do not stop recording comms if the trace file is being read
A while ago, when the "trace" file was opened, tracing was stopped, and
code was added to stop recording the comms to saved_cmdlines, for mapping
of the pids to the task name.
Code has been added that only records the comm if a trace event occurred,
and there's no reason to not trace it if the trace file is opened.
Cc: stable@vger.kernel.org
Fixes:
|
||
|
85550c83da |
tracing: Do not stop recording cmdlines when tracing is off
The saved_cmdlines is used to map pids to the task name, such that the
output of the tracing does not just show pids, but also gives a human
readable name for the task.
If the name is not mapped, the output looks like this:
<...>-1316 [005] ...2 132.044039: ...
Instead of this:
gnome-shell-1316 [005] ...2 132.044039: ...
The names are updated when tracing is running, but are skipped if tracing
is stopped. Unfortunately, this stops the recording of the names if the
top level tracer is stopped, and not if there's other tracers active.
The recording of a name only happens when a new event is written into a
ring buffer, so there is no need to test if tracing is on or not. If
tracing is off, then no event is written and no need to test if tracing is
off or not.
Remove the check, as it hides the names of tasks for events in the
instance buffers.
Cc: stable@vger.kernel.org
Fixes:
|
||
|
2db7ab6b4c |
tracing: Have ftrace_dump_on_oops kernel parameter take numbers
The kernel parameter for ftrace_dump_on_oops can take a single assignment. That is, it can be: ftrace_dump_on_oops or ftrace_dump_on_oops=orig_cpu But the content in the sysctl file is a number. 0 for disabled 1 for ftrace_dump_on_oops (all CPUs) 2 for ftrace_dump_on_oops (orig CPU) Allow the kernel command line to take a number as well to match the sysctl numbers. That is: ftrace_dump_on_oops=1 is the same as ftrace_dump_on_oops and ftrace_dump_on_oops=2 is the same as ftrace_dump_on_oops=orig_cpu Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> |
||
|
f38601368f |
tracing: Add tp_printk_stop_on_boot option
Add a kernel command line option that disables printing of events to console at late_initcall_sync(). This is useful when needing to see specific events written to console on boot up, but not wanting it when user space starts, as user space may make the console so noisy that the system becomes inoperable. Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> |
||
|
8f0901cda1 |
tracing: Add better comments for the filtering temp buffer use case
When filtering is enabled, the event is copied into a temp buffer instead of being written into the ring buffer directly, because the discarding of events from the ring buffer is very expensive, and doing the extra copy is much faster than having to discard most of the time. As that logic is subtle, add comments to explain in more detail to what is going on and how it works. Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> |
||
|
faa76a6c28 |
tracing: Simplify the max length test when using the filtering temp buffer
When filtering trace events, a temp buffer is used because the extra copy from the temp buffer into the ring buffer is still faster than the direct write into the ring buffer followed by a discard if the filter does not match. But the data that can be stored in the temp buffer is a PAGE_SIZE minus the ring buffer event header. The calculation of that header size is complex, but using the helper macro "struct_size()" can simplify it. Link: https://lore.kernel.org/stable/CAHk-=whKbJkuVmzb0hD3N6q7veprUrSpiBHRxVY=AffWZPtxmg@mail.gmail.com/ Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> |
||
|
4f99f84899 |
tracing/boot: Add per-group/all events enablement
Add ftrace.event.<GROUP>.enable and ftrace.event.enable boot-time tracing, which enables all events under given GROUP and all events respectivly. Link: https://lkml.kernel.org/r/162264438005.302580.12019174481201855444.stgit@devnote2 Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> |
||
|
6c610dba6e |
tracing: Add WARN_ON_ONCE when returned value is negative
ret is assigned return value of event_hist_trigger_func, but the value is unused. It is better to warn when returned value is negative, rather than just ignoring it. Link: https://lkml.kernel.org/r/20210529061423.GA103954@hyeyoo Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> |
||
|
099dcc1801 |
tracing: Fix set_named_trigger_data() kernel-doc comment
Fix the description of the parameters. Link: https://lkml.kernel.org/r/20210515105735.52785-1-hqjagain@gmail.com Signed-off-by: Qiujun Huang <hqjagain@gmail.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> |
||
|
08b0c9b4b9 |
tracing: Remove redundant initialization of variable ret
The variable ret is being initialized with a value that is never read, it is being updated later on. The assignment is redundant and can be removed. Link: https://lkml.kernel.org/r/20210513115517.58178-1-colin.king@canonical.com Addresses-Coverity: ("Unused value") Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> |
||
|
957cdcd9bd |
ring-buffer: Use fallthrough pseudo-keyword
Replace /* fall through */ comment with pseudo-keyword macro fallthrough[1] [1] https://www.kernel.org/doc/html/latest/process/deprecated.html?highlight=fallthrough#implicit-switch-case-fall-through Link: https://lkml.kernel.org/r/20210511140246.18868-1-jj251510319013@gmail.com Signed-off-by: Wei Ming Chen <jj251510319013@gmail.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> |
||
|
614db49c72 |
tracing: Remove redundant assignment to event_var
Variable event_var is set to 'ERR_PTR(-EINVAL)', but this value is never read as it is overwritten or not used later on, hence it is a redundant assignment and can be removed. Clean up the following clang-analyzer warning: kernel/trace/trace_events_hist.c:2437:21: warning: Value stored to 'event_var' during its initialization is never read [clang-analyzer-deadcode.DeadStores]. Link: https://lkml.kernel.org/r/1620470236-26562-1-git-send-email-jiapeng.chong@linux.alibaba.com Reported-by: Abaci Robot <abaci@linux.alibaba.com> Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> |
||
|
3e08a9f976 |
tracing: Correct the length check which causes memory corruption
We've suffered from severe kernel crashes due to memory corruption on our production environment, like, Call Trace: [1640542.554277] general protection fault: 0000 [#1] SMP PTI [1640542.554856] CPU: 17 PID: 26996 Comm: python Kdump: loaded Tainted:G [1640542.556629] RIP: 0010:kmem_cache_alloc+0x90/0x190 [1640542.559074] RSP: 0018:ffffb16faa597df8 EFLAGS: 00010286 [1640542.559587] RAX: 0000000000000000 RBX: 0000000000400200 RCX: 0000000006e931bf [1640542.560323] RDX: 0000000006e931be RSI: 0000000000400200 RDI: ffff9a45ff004300 [1640542.560996] RBP: 0000000000400200 R08: 0000000000023420 R09: 0000000000000000 [1640542.561670] R10: 0000000000000000 R11: 0000000000000000 R12: ffffffff9a20608d [1640542.562366] R13: ffff9a45ff004300 R14: ffff9a45ff004300 R15: 696c662f65636976 [1640542.563128] FS: 00007f45d7c6f740(0000) GS:ffff9a45ff840000(0000) knlGS:0000000000000000 [1640542.563937] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [1640542.564557] CR2: 00007f45d71311a0 CR3: 000000189d63e004 CR4: 00000000003606e0 [1640542.565279] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [1640542.566069] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [1640542.566742] Call Trace: [1640542.567009] anon_vma_clone+0x5d/0x170 [1640542.567417] __split_vma+0x91/0x1a0 [1640542.567777] do_munmap+0x2c6/0x320 [1640542.568128] vm_munmap+0x54/0x70 [1640542.569990] __x64_sys_munmap+0x22/0x30 [1640542.572005] do_syscall_64+0x5b/0x1b0 [1640542.573724] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [1640542.575642] RIP: 0033:0x7f45d6e61e27 James Wang has reproduced it stably on the latest 4.19 LTS. After some debugging, we finally proved that it's due to ftrace buffer out-of-bound access using a debug tool as follows: [ 86.775200] BUG: Out-of-bounds write at addr 0xffff88aefe8b7000 [ 86.780806] no_context+0xdf/0x3c0 [ 86.784327] __do_page_fault+0x252/0x470 [ 86.788367] do_page_fault+0x32/0x140 [ 86.792145] page_fault+0x1e/0x30 [ 86.795576] strncpy_from_unsafe+0x66/0xb0 [ 86.799789] fetch_memory_string+0x25/0x40 [ 86.804002] fetch_deref_string+0x51/0x60 [ 86.808134] kprobe_trace_func+0x32d/0x3a0 [ 86.812347] kprobe_dispatcher+0x45/0x50 [ 86.816385] kprobe_ftrace_handler+0x90/0xf0 [ 86.820779] ftrace_ops_assist_func+0xa1/0x140 [ 86.825340] 0xffffffffc00750bf [ 86.828603] do_sys_open+0x5/0x1f0 [ 86.832124] do_syscall_64+0x5b/0x1b0 [ 86.835900] entry_SYSCALL_64_after_hwframe+0x44/0xa9 commit |
||
|
6c14133d2d |
ftrace: Do not blindly read the ip address in ftrace_bug()
It was reported that a bug on arm64 caused a bad ip address to be used for
updating into a nop in ftrace_init(), but the error path (rightfully)
returned -EINVAL and not -EFAULT, as the bug caused more than one error to
occur. But because -EINVAL was returned, the ftrace_bug() tried to report
what was at the location of the ip address, and read it directly. This
caused the machine to panic, as the ip was not pointing to a valid memory
address.
Instead, read the ip address with copy_from_kernel_nofault() to safely
access the memory, and if it faults, report that the address faulted,
otherwise report what was in that location.
Link: https://lore.kernel.org/lkml/20210607032329.28671-1-mark-pk.tsai@mediatek.com/
Cc: stable@vger.kernel.org
Fixes:
|
||
|
ff40e51043 |
bpf, lockdown, audit: Fix buggy SELinux lockdown permission checks
Commit |
||
|
eb01f5353b |
tracing: Handle %.*s in trace_check_vprintf()
If a trace event uses the %*.s notation, the trace_check_vprintf() will
fail and will warn about a bad processing of strings, because it does not
take into account the length field when processing the star (*) part.
Have it handle this case as well.
Link: https://lore.kernel.org/linux-nfs/238C0E2D-C2A4-4578-ADD2-C565B3B99842@oracle.com/
Reported-by: Chuck Lever III <chuck.lever@oracle.com>
Fixes:
|
||
|
7ec901b6fa |
tracing: Fix probes written to the set_ftrace_filter file
Now that there's a library that accesses the tracefs file system, (libtracefs), the way the files are interacted with is slightly different than the command line. For instance, the write() system call is used directly instead of an echo. This exposes some old bugs. If a probe is written to "set_ftrace_filter" without any white space after it, it will be ignored. This is because the write expects that a string written to it that does not end with white spaces thinks there is more to come. But if the file is closed, the release function needs to finish it. The "set_ftrace_filter" release function handles the filter part of the "set_ftrace_filter" file, but did not handle the probe part. -----BEGIN PGP SIGNATURE----- iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCYJP4OBQccm9zdGVkdEBn b29kbWlzLm9yZwAKCRAp5XQQmuv6quzjAQCoFQXkJtYhwlMk0dTxclrsQlm0t93H pHwJA9Zyxe25UgD8D/rpG/wtHaSSuP6omEDbqvshpNdszqKb0Nt+UM116QU= =niJ6 -----END PGP SIGNATURE----- Merge tag 'trace-v5.13-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull tracing fix from Steven Rostedt: "Fix probes written to the set_ftrace_filter file Now that there's a library that accesses the tracefs file system (libtracefs), the way the files are interacted with is slightly different than the command line. For instance, the write() system call is used directly instead of an echo. This exposes some old bugs. If a probe is written to "set_ftrace_filter" without any white space after it, it will be ignored. This is because the write expects that a string written to it that does not end with white spaces thinks there is more to come. But if the file is closed, the release function needs to finish it. The "set_ftrace_filter" release function handles the filter part of the "set_ftrace_filter" file, but did not handle the probe part" * tag 'trace-v5.13-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: ftrace: Handle commands when closing set_ftrace_filter file |
||
|
8c9af478c0 |
ftrace: Handle commands when closing set_ftrace_filter file
# echo switch_mm:traceoff > /sys/kernel/tracing/set_ftrace_filter
will cause switch_mm to stop tracing by the traceoff command.
# echo -n switch_mm:traceoff > /sys/kernel/tracing/set_ftrace_filter
does nothing.
The reason is that the parsing in the write function only processes
commands if it finished parsing (there is white space written after the
command). That's to handle:
write(fd, "switch_mm:", 10);
write(fd, "traceoff", 8);
cases, where the command is broken over multiple writes.
The problem is if the file descriptor is closed, then the write call is
not processed, and the command needs to be processed in the release code.
The release code can handle matching of functions, but does not handle
commands.
Cc: stable@vger.kernel.org
Fixes:
|
||
|
9b1f61d5d7 |
tracing updates for 5.13
New feature: The "func-no-repeats" option in tracefs/options directory. When set the function tracer will detect if the current function being traced is the same as the previous one, and instead of recording it, it will keep track of the number of times that the function is repeated in a row. And when another function is recorded, it will write a new event that shows the function that repeated, the number of times it repeated and the time stamp of when the last repeated function occurred. Enhancements: In order to implement the above "func-no-repeats" option, the ring buffer timestamp can now give the accurate timestamp of the event as it is being recorded, instead of having to record an absolute timestamp for all events. This helps the histogram code which no longer needs to waste ring buffer space. New validation logic to make sure all trace events that access dereferenced pointers do so in a safe way, and will warn otherwise. Fixes: No longer limit the PIDs of tasks that are recorded for "saved_cmdlines" to PID_MAX_DEFAULT (32768), as systemd now allows for a much larger range. This caused the mapping of PIDs to the task names to be dropped for all tasks with a PID greater than 32768. Change trace_clock_global() to never block. This caused a deadlock. Clean ups: Typos, prototype fixes, and removing of duplicate or unused code. Better management of ftrace_page allocations. -----BEGIN PGP SIGNATURE----- iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCYI/1vBQccm9zdGVkdEBn b29kbWlzLm9yZwAKCRAp5XQQmuv6qiL0AP9EemIC5TDh2oihqLRNeUjdTu0ryEoM HRFqxozSF985twD/bfkt86KQC8rLHwxTbxQZ863bmdaC6cMGFhWiF+H/MAs= =psYt -----END PGP SIGNATURE----- Merge tag 'trace-v5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull tracing updates from Steven Rostedt: "New feature: - A new "func-no-repeats" option in tracefs/options directory. When set the function tracer will detect if the current function being traced is the same as the previous one, and instead of recording it, it will keep track of the number of times that the function is repeated in a row. And when another function is recorded, it will write a new event that shows the function that repeated, the number of times it repeated and the time stamp of when the last repeated function occurred. Enhancements: - In order to implement the above "func-no-repeats" option, the ring buffer timestamp can now give the accurate timestamp of the event as it is being recorded, instead of having to record an absolute timestamp for all events. This helps the histogram code which no longer needs to waste ring buffer space. - New validation logic to make sure all trace events that access dereferenced pointers do so in a safe way, and will warn otherwise. Fixes: - No longer limit the PIDs of tasks that are recorded for "saved_cmdlines" to PID_MAX_DEFAULT (32768), as systemd now allows for a much larger range. This caused the mapping of PIDs to the task names to be dropped for all tasks with a PID greater than 32768. - Change trace_clock_global() to never block. This caused a deadlock. Clean ups: - Typos, prototype fixes, and removing of duplicate or unused code. - Better management of ftrace_page allocations" * tag 'trace-v5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (32 commits) tracing: Restructure trace_clock_global() to never block tracing: Map all PIDs to command lines ftrace: Reuse the output of the function tracer for func_repeats tracing: Add "func_no_repeats" option for function tracing tracing: Unify the logic for function tracing options tracing: Add method for recording "func_repeats" events tracing: Add "last_func_repeats" to struct trace_array tracing: Define new ftrace event "func_repeats" tracing: Define static void trace_print_time() ftrace: Simplify the calculation of page number for ftrace_page->records some more ftrace: Store the order of pages allocated in ftrace_page tracing: Remove unused argument from "ring_buffer_time_stamp() tracing: Remove duplicate struct declaration in trace_events.h tracing: Update create_system_filter() kernel-doc comment tracing: A minor cleanup for create_system_filter() kernel: trace: Mundane typo fixes in the file trace_events_filter.c tracing: Fix various typos in comments scripts/recordmcount.pl: Make vim and emacs indent the same scripts/recordmcount.pl: Make indent spacing consistent tracing: Add a verifier to check string pointers for trace events ... |
||
|
aafe104aa9 |
tracing: Restructure trace_clock_global() to never block
It was reported that a fix to the ring buffer recursion detection would cause a hung machine when performing suspend / resume testing. The following backtrace was extracted from debugging that case: Call Trace: trace_clock_global+0x91/0xa0 __rb_reserve_next+0x237/0x460 ring_buffer_lock_reserve+0x12a/0x3f0 trace_buffer_lock_reserve+0x10/0x50 __trace_graph_return+0x1f/0x80 trace_graph_return+0xb7/0xf0 ? trace_clock_global+0x91/0xa0 ftrace_return_to_handler+0x8b/0xf0 ? pv_hash+0xa0/0xa0 return_to_handler+0x15/0x30 ? ftrace_graph_caller+0xa0/0xa0 ? trace_clock_global+0x91/0xa0 ? __rb_reserve_next+0x237/0x460 ? ring_buffer_lock_reserve+0x12a/0x3f0 ? trace_event_buffer_lock_reserve+0x3c/0x120 ? trace_event_buffer_reserve+0x6b/0xc0 ? trace_event_raw_event_device_pm_callback_start+0x125/0x2d0 ? dpm_run_callback+0x3b/0xc0 ? pm_ops_is_empty+0x50/0x50 ? platform_get_irq_byname_optional+0x90/0x90 ? trace_device_pm_callback_start+0x82/0xd0 ? dpm_run_callback+0x49/0xc0 With the following RIP: RIP: 0010:native_queued_spin_lock_slowpath+0x69/0x200 Since the fix to the recursion detection would allow a single recursion to happen while tracing, this lead to the trace_clock_global() taking a spin lock and then trying to take it again: ring_buffer_lock_reserve() { trace_clock_global() { arch_spin_lock() { queued_spin_lock_slowpath() { /* lock taken */ (something else gets traced by function graph tracer) ring_buffer_lock_reserve() { trace_clock_global() { arch_spin_lock() { queued_spin_lock_slowpath() { /* DEAD LOCK! */ Tracing should *never* block, as it can lead to strange lockups like the above. Restructure the trace_clock_global() code to instead of simply taking a lock to update the recorded "prev_time" simply use it, as two events happening on two different CPUs that calls this at the same time, really doesn't matter which one goes first. Use a trylock to grab the lock for updating the prev_time, and if it fails, simply try again the next time. If it failed to be taken, that means something else is already updating it. Link: https://lkml.kernel.org/r/20210430121758.650b6e8a@gandalf.local.home Cc: stable@vger.kernel.org Tested-by: Konstantin Kharlamov <hi-angel@yandex.ru> Tested-by: Todd Brandt <todd.e.brandt@linux.intel.com> Fixes: |