linux

mirror of https://github.com/torvalds/linux.git synced 2024-11-23 20:51:44 +00:00

Author	SHA1	Message	Date
Masami Hiramatsu (Google)	302db0f5b3	tracing/probes: Add a function to search a member of a struct/union Add btf_find_struct_member() API to search a member of a given data structure or union from the member's name. Link: https://lore.kernel.org/all/169272156248.160970.8868479822371129043.stgit@devnote2/ Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Reviewed-by: Alan Maguire <alan.maguire@oracle.com> Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2023-08-23 09:40:16 +09:00
Masami Hiramatsu (Google)	ebeed8d4a5	tracing/probes: Move finding func-proto API and getting func-param API to trace_btf Move generic function-proto find API and getting function parameter API to BTF library code from trace_probe.c. This will avoid redundant efforts on different feature. Link: https://lore.kernel.org/all/169272155255.160970.719426926348706349.stgit@devnote2/ Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2023-08-23 09:39:45 +09:00
Masami Hiramatsu (Google)	b1d1e90490	tracing/probes: Support BTF argument on module functions Since the btf returned from bpf_get_btf_vmlinux() only covers functions in the vmlinux, BTF argument is not available on the functions in the modules. Use bpf_find_btf_id() instead of bpf_get_btf_vmlinux()+btf_find_name_kind() so that BTF argument can find the correct struct btf and btf_type in it. With this fix, fprobe events can use `$arg` on module functions as below # grep nf_log_ip_packet /proc/kallsyms ffffffffa0005c00 t nf_log_ip_packet [nf_log_syslog] ffffffffa0005bf0 t __pfx_nf_log_ip_packet [nf_log_syslog] # echo 'f nf_log_ip_packet $arg' > dynamic_events # cat dynamic_events f:fprobes/nf_log_ip_packet__entry nf_log_ip_packet net=net pf=pf hooknum=hooknum skb=skb in=in out=out loginfo=loginfo prefix=prefix To support the module's btf which is removable, the struct btf needs to be ref-counted. So this also records the btf in the traceprobe_parse_context and returns the refcount when the parse has done. Link: https://lore.kernel.org/all/169272154223.160970.3507930084247934031.stgit@devnote2/ Suggested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2023-08-23 09:39:15 +09:00
Chuang Wang	f8bbf8b990	tracing/eprobe: Iterate trace_eprobe directly Refer to the description in [1], we can skip "container_of()" following "list_for_each_entry()" by using "list_for_each_entry()" with "struct trace_eprobe" and "tp.list". Also, this patch defines "for_each_trace_eprobe_tp" to simplify the code of the same logic. [1] https://lore.kernel.org/all/CAHk-=wjakjw6-rDzDDBsuMoDCqd+9ogifR_EE1F0K-jYek1CdA@mail.gmail.com/ Link: https://lore.kernel.org/all/20230822022433.262478-1-nashuiliang@gmail.com/ Signed-off-by: Chuang Wang <nashuiliang@gmail.com> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>	2023-08-23 09:38:56 +09:00
Ruan Jinjie	8865aea047	kernel: kprobes: Use struct_size() Use struct_size() instead of hand-writing it, when allocating a structure with a flex array. This is less verbose. Link: https://lore.kernel.org/all/20230725195424.3469242-1-ruanjinjie@huawei.com/ Signed-off-by: Ruan Jinjie <ruanjinjie@huawei.com> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>	2023-08-23 09:38:17 +09:00
Kumar Kartikeya Dwivedi	6785b2edf4	bpf: Fix check_func_arg_reg_off bug for graph root/node The commit being fixed introduced a hunk into check_func_arg_reg_off that bypasses reg->off == 0 enforcement when offset points to a graph node or root. This might possibly be done for treating bpf_rbtree_remove and others as KF_RELEASE and then later check correct reg->off in helper argument checks. But this is not the case, those helpers are already not KF_RELEASE and permit non-zero reg->off and verify it later to match the subobject in BTF type. However, this logic leads to bpf_obj_drop permitting free of register arguments with non-zero offset when they point to a graph root or node within them, which is not ok. For instance: struct foo { int i; int j; struct bpf_rb_node node; }; struct foo f = bpf_obj_new(typeof(f)); if (!f) ... bpf_obj_drop(f); // OK bpf_obj_drop(&f->i); // still ok from verifier PoV bpf_obj_drop(&f->node); // Not OK, but permitted right now Fix this by dropping the whole part of code altogether. Fixes: `6a3cd3318f` ("bpf: Migrate release_on_unlock logic to non-owning ref semantics") Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20230822175140.1317749-2-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-08-22 12:52:48 -07:00
Clive Lin	5f55836ab4	PM: QoS: Add check to make sure CPU latency is non-negative CPU latency should never be negative, which will be incorrectly high when converted to unsigned data type. Commit `8d36694245` ("PM: QoS: Add check to make sure CPU freq is non-negative") makes sure CPU frequency is non-negative to fix incorrect behavior in freqency QoS. Add an analogous check to make sure CPU latency is non-negative so as to prevent this problem from happening in CPU latency QoS. Signed-off-by: Clive Lin <clive.lin@mediatek.com> [ rjw: Changelog edits ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2023-08-22 21:37:29 +02:00
Yonghong Song	ab6c637ad0	bpf: Fix a bpf_kptr_xchg() issue with local kptr When reviewing local percpu kptr support, Alexei discovered a bug wherea bpf_kptr_xchg() may succeed even if the map value kptr type and locally allocated obj type do not match ([1]). Missed struct btf_id comparison is the reason for the bug. This patch added such struct btf_id comparison and will flag verification failure if types do not match. [1] https://lore.kernel.org/bpf/20230819002907.io3iphmnuk43xblu@macbook-pro-8.dhcp.thefacebook.com/#t Reported-by: Alexei Starovoitov <ast@kernel.org> Fixes: `738c96d5e2` ("bpf: Allow local kptrs to be exchanged via bpf_kptr_xchg") Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20230822050053.2886960-1-yonghong.song@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-08-22 09:43:55 -07:00
Eric Vaughn	a943188dab	tracing/user_events: Optimize safe list traversals Several of the list traversals in the user_events facility use safe list traversals where they could be using the unsafe versions instead. Replace these safe traversals with their unsafe counterparts in the interest of optimization. Link: https://lore.kernel.org/linux-trace-kernel/20230810194337.695983-1-ervaughn@linux.microsoft.com Suggested-by: Beau Belgrave <beaub@linux.microsoft.com> Signed-off-by: Eric Vaughn <ervaughn@linux.microsoft.com> Acked-by: Beau Belgrave <beaub@linux.microsoft.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2023-08-22 05:22:10 -04:00
Yue Haibing	efde97a175	tracing: Remove unused function declarations Commit `9457158bbc` ("tracing: Fix reset of time stamps during trace_clock changes") left behind tracing_reset_current() declaration. Also commit `6954e41526` ("tracing: Place trace_pid_list logic into abstract functions") removed trace_free_pid_list() implementation but leave declaration. Link: https://lore.kernel.org/linux-trace-kernel/20230803144028.25492-1-yuehaibing@huawei.com Cc: <mhiramat@kernel.org> Signed-off-by: Yue Haibing <yuehaibing@huawei.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2023-08-22 05:19:35 -04:00
Valentin Schneider	38c6f68083	tracing/filters: Further optimise scalar vs cpumask comparison Per the previous commits, we now only enter do_filter_scalar_cpumask() with a mask of weight greater than one. Optimise the equality checks. Link: https://lkml.kernel.org/r/20230707172155.70873-9-vschneid@redhat.com Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Daniel Bristot de Oliveira <bristot@redhat.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Leonardo Bras <leobras@redhat.com> Cc: Frederic Weisbecker <frederic@kernel.org> Suggested-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Valentin Schneider <vschneid@redhat.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2023-08-22 05:13:29 -04:00
Valentin Schneider	1cffbe6c62	tracing/filters: Optimise CPU vs cpumask filtering when the user mask is a single CPU Steven noted that when the user-provided cpumask contains a single CPU, then the filtering function can use a scalar as input instead of a full-fledged cpumask. In this case we can directly re-use filter_pred_cpu(), we just need to transform '&' into '==' before executing it. Link: https://lkml.kernel.org/r/20230707172155.70873-8-vschneid@redhat.com Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Daniel Bristot de Oliveira <bristot@redhat.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Leonardo Bras <leobras@redhat.com> Cc: Frederic Weisbecker <frederic@kernel.org> Suggested-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Valentin Schneider <vschneid@redhat.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2023-08-22 05:13:29 -04:00
Valentin Schneider	ca77dd8ce4	tracing/filters: Optimise scalar vs cpumask filtering when the user mask is a single CPU Steven noted that when the user-provided cpumask contains a single CPU, then the filtering function can use a scalar as input instead of a full-fledged cpumask. When the mask contains a single CPU, directly re-use the unsigned field predicate functions. Transform '&' into '==' beforehand. Link: https://lkml.kernel.org/r/20230707172155.70873-7-vschneid@redhat.com Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Daniel Bristot de Oliveira <bristot@redhat.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Leonardo Bras <leobras@redhat.com> Cc: Frederic Weisbecker <frederic@kernel.org> Suggested-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Valentin Schneider <vschneid@redhat.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2023-08-22 05:13:29 -04:00
Valentin Schneider	fe4fa4ec9b	tracing/filters: Optimise cpumask vs cpumask filtering when user mask is a single CPU Steven noted that when the user-provided cpumask contains a single CPU, then the filtering function can use a scalar as input instead of a full-fledged cpumask. Reuse do_filter_scalar_cpumask() when the input mask has a weight of one. Link: https://lkml.kernel.org/r/20230707172155.70873-6-vschneid@redhat.com Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Daniel Bristot de Oliveira <bristot@redhat.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Leonardo Bras <leobras@redhat.com> Cc: Frederic Weisbecker <frederic@kernel.org> Suggested-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Valentin Schneider <vschneid@redhat.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2023-08-22 05:13:28 -04:00
Valentin Schneider	347d24fc82	tracing/filters: Enable filtering the CPU common field by a cpumask The tracing_cpumask lets us specify which CPUs are traced in a buffer instance, but doesn't let us do this on a per-event basis (unless one creates an instance per event). A previous commit added filtering scalar fields by a user-given cpumask, make this work with the CPU common field as well. This enables doing things like $ trace-cmd record -e 'sched_switch' -f 'CPU & CPUS{12-52}' \ -e 'sched_wakeup' -f 'target_cpu & CPUS{12-52}' Link: https://lkml.kernel.org/r/20230707172155.70873-5-vschneid@redhat.com Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Daniel Bristot de Oliveira <bristot@redhat.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Leonardo Bras <leobras@redhat.com> Cc: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Valentin Schneider <vschneid@redhat.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2023-08-22 05:13:28 -04:00
Valentin Schneider	3cbec9d7b9	tracing/filters: Enable filtering a scalar field by a cpumask Several events use a scalar field to denote a CPU: o sched_wakeup.target_cpu o sched_migrate_task.orig_cpu,dest_cpu o sched_move_numa.src_cpu,dst_cpu o ipi_send_cpu.cpu o ... Filtering these currently requires using arithmetic comparison functions, which can be tedious when dealing with interleaved SMT or NUMA CPU ids. Allow these to be filtered by a user-provided cpumask, which enables e.g.: $ trace-cmd record -e 'sched_wakeup' -f 'target_cpu & CPUS{2,4,6,8-32}' Link: https://lkml.kernel.org/r/20230707172155.70873-4-vschneid@redhat.com Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Daniel Bristot de Oliveira <bristot@redhat.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Leonardo Bras <leobras@redhat.com> Cc: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Valentin Schneider <vschneid@redhat.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2023-08-22 05:13:28 -04:00
Valentin Schneider	39f7c41c90	tracing/filters: Enable filtering a cpumask field by another cpumask The recently introduced ipi_send_cpumask trace event contains a cpumask field, but it currently cannot be used in filter expressions. Make event filtering aware of cpumask fields, and allow these to be filtered by a user-provided cpumask. The user-provided cpumask is to be given in cpulist format and wrapped as: "CPUS{$cpulist}". The use of curly braces instead of parentheses is to prevent predicate_parse() from parsing the contents of CPUS{...} as a full-fledged predicate subexpression. This enables e.g.: $ trace-cmd record -e 'ipi_send_cpumask' -f 'cpumask & CPUS{2,4,6,8-32}' Link: https://lkml.kernel.org/r/20230707172155.70873-3-vschneid@redhat.com Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Daniel Bristot de Oliveira <bristot@redhat.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Leonardo Bras <leobras@redhat.com> Cc: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Valentin Schneider <vschneid@redhat.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2023-08-22 05:13:28 -04:00
Valentin Schneider	cfb58e278c	tracing/filters: Dynamically allocate filter_pred.regex Every predicate allocation includes a MAX_FILTER_STR_VAL (256) char array in the regex field, even if the predicate function does not use the field. A later commit will introduce a dynamically allocated cpumask to struct filter_pred, which will require a dedicated freeing function. Bite the bullet and make filter_pred.regex dynamically allocated. While at it, reorder the fields of filter_pred to fill in the byte holes. The struct now fits on a single cacheline. No change in behaviour intended. The kfree()'s were patched via Coccinelle: @@ struct filter_pred *pred; @@ -kfree(pred); +free_predicate(pred); Link: https://lkml.kernel.org/r/20230707172155.70873-2-vschneid@redhat.com Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Daniel Bristot de Oliveira <bristot@redhat.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Leonardo Bras <leobras@redhat.com> Cc: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Valentin Schneider <vschneid@redhat.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2023-08-22 05:13:28 -04:00
Jiri Olsa	686328d80c	bpf: Add bpf_get_func_ip helper support for uprobe link Adding support for bpf_get_func_ip helper being called from ebpf program attached by uprobe_multi link. It returns the ip of the uprobe. Acked-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Jiri Olsa <jolsa@kernel.org> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20230809083440.3209381-7-jolsa@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-08-21 15:51:25 -07:00
Jiri Olsa	b733eeade4	bpf: Add pid filter support for uprobe_multi link Adding support to specify pid for uprobe_multi link and the uprobes are created only for task with given pid value. Using the consumer.filter filter callback for that, so the task gets filtered during the uprobe installation. We still need to check the task during runtime in the uprobe handler, because the handler could get executed if there's another system wide consumer on the same uprobe (thanks Oleg for the insight). Cc: Oleg Nesterov <oleg@redhat.com> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Jiri Olsa <jolsa@kernel.org> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20230809083440.3209381-6-jolsa@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-08-21 15:51:25 -07:00
Jiri Olsa	0b779b61f6	bpf: Add cookies support for uprobe_multi link Adding support to specify cookies array for uprobe_multi link. The cookies array share indexes and length with other uprobe_multi arrays (offsets/ref_ctr_offsets). The cookies[i] value defines cookie for i-the uprobe and will be returned by bpf_get_attach_cookie helper when called from ebpf program hooked to that specific uprobe. Acked-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Yafang Shao <laoar.shao@gmail.com> Signed-off-by: Jiri Olsa <jolsa@kernel.org> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20230809083440.3209381-5-jolsa@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-08-21 15:51:25 -07:00
Jiri Olsa	89ae89f53d	bpf: Add multi uprobe link Adding new multi uprobe link that allows to attach bpf program to multiple uprobes. Uprobes to attach are specified via new link_create uprobe_multi union: struct { __aligned_u64 path; __aligned_u64 offsets; __aligned_u64 ref_ctr_offsets; __u32 cnt; __u32 flags; } uprobe_multi; Uprobes are defined for single binary specified in path and multiple calling sites specified in offsets array with optional reference counters specified in ref_ctr_offsets array. All specified arrays have length of 'cnt'. The 'flags' supports single bit for now that marks the uprobe as return probe. Acked-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Yafang Shao <laoar.shao@gmail.com> Signed-off-by: Jiri Olsa <jolsa@kernel.org> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20230809083440.3209381-4-jolsa@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-08-21 15:51:25 -07:00
Jiri Olsa	3505cb9fa2	bpf: Add attach_type checks under bpf_prog_attach_check_attach_type Add extra attach_type checks from link_create under bpf_prog_attach_check_attach_type. Suggested-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Yafang Shao <laoar.shao@gmail.com> Signed-off-by: Jiri Olsa <jolsa@kernel.org> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20230809083440.3209381-3-jolsa@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-08-21 15:51:25 -07:00
Hou Tao	c2e42ddf26	bpf, cpumask: Clean up bpf_cpu_map_entry directly in cpu_map_free After synchronous_rcu(), both the dettached XDP program and xdp_do_flush() are completed, and the only user of bpf_cpu_map_entry will be cpu_map_kthread_run(), so instead of calling __cpu_map_entry_replace() to stop kthread and cleanup entry after a RCU grace period, do these things directly. Signed-off-by: Hou Tao <houtao1@huawei.com> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Link: https://lore.kernel.org/r/20230816045959.358059-3-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-08-21 15:21:16 -07:00
Hou Tao	8f8500a247	bpf, cpumap: Use queue_rcu_work() to remove unnecessary rcu_barrier() As for now __cpu_map_entry_replace() uses call_rcu() to wait for the inflight xdp program to exit the RCU read critical section, and then launch kworker cpu_map_kthread_stop() to call kthread_stop() to flush all pending xdp frames or skbs. But it is unnecessary to use rcu_barrier() in cpu_map_kthread_stop() to wait for the completion of __cpu_map_entry_free(), because rcu_barrier() will wait for all pending RCU callbacks and cpu_map_kthread_stop() only needs to wait for the completion of a specific __cpu_map_entry_free(). So use queue_rcu_work() to replace call_rcu(), schedule_work() and rcu_barrier(). queue_rcu_work() will queue a __cpu_map_entry_free() kworker after a RCU grace period. Because __cpu_map_entry_free() is running in a kworker context, so it is OK to do all of these freeing procedures include kthread_stop() in it. After the update, there is no need to do reference-counting for bpf_cpu_map_entry, because bpf_cpu_map_entry is freed directly in __cpu_map_entry_free(), so just remove it. Signed-off-by: Hou Tao <houtao1@huawei.com> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Link: https://lore.kernel.org/r/20230816045959.358059-2-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-08-21 15:21:16 -07:00
Matthew Wilcox (Oracle)	ebc1baf5c9	mm: free up a word in the first tail page Store the folio order in the low byte of the flags word in the first tail page. This frees up the word that was being used to store the order and dtor bytes previously. Link: https://lkml.kernel.org/r/20230816151201.3655946-11-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: David Hildenbrand <david@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: Yanteng Si <siyanteng@loongson.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-08-21 14:28:45 -07:00
Matthew Wilcox (Oracle)	de53c05f2a	mm: add large_rmappable page flag Stored in the first tail page's flags, this flag replaces the destructor. That removes the last of the destructors, so remove all references to folio_dtor and compound_dtor. Link: https://lkml.kernel.org/r/20230816151201.3655946-9-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: David Hildenbrand <david@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: Yanteng Si <siyanteng@loongson.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-08-21 14:28:44 -07:00
Matthew Wilcox (Oracle)	9c5ccf2db0	mm: remove HUGETLB_PAGE_DTOR We can use a bit in page[1].flags to indicate that this folio belongs to hugetlb instead of using a value in page[1].dtors. That lets folio_test_hugetlb() become an inline function like it should be. We can also get rid of NULL_COMPOUND_DTOR. Link: https://lkml.kernel.org/r/20230816151201.3655946-8-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: David Hildenbrand <david@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: Yanteng Si <siyanteng@loongson.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-08-21 14:28:44 -07:00
Randy Dunlap	ef815d2cba	treewide: drop CONFIG_EMBEDDED There is only one Kconfig user of CONFIG_EMBEDDED and it can be switched to EXPERT or "if !ARCH_MULTIPLATFORM" (suggested by Arnd). Link: https://lkml.kernel.org/r/20230816055010.31534-1-rdunlap@infradead.org Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Acked-by: Geert Uytterhoeven <geert+renesas@glider.be> Acked-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Palmer Dabbelt <palmer@rivosinc.com> [RISC-V] Acked-by: Greg Ungerer <gerg@linux-m68k.org> Acked-by: Jason A. Donenfeld <Jason@zx2c4.com> Acked-by: Michael Ellerman <mpe@ellerman.id.au> [powerpc] Cc: Russell King <linux@armlinux.org.uk> Cc: Vineet Gupta <vgupta@kernel.org> Cc: Brian Cain <bcain@quicinc.com> Cc: Michal Simek <monstr@monstr.eu> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Dinh Nguyen <dinguyen@kernel.org> Cc: Jonas Bonn <jonas@southpole.se> Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi> Cc: Stafford Horne <shorne@gmail.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: Rich Felker <dalias@libc.org> Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Masahiro Yamada <masahiroy@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-08-21 13:46:25 -07:00
Helge Deller	0a6b58c5cd	lockdep: fix static memory detection even more On the parisc architecture, lockdep reports for all static objects which are in the __initdata section (e.g. "setup_done" in devtmpfs, "kthreadd_done" in init/main.c) this warning: INFO: trying to register non-static key. The warning itself is wrong, because those objects are in the __initdata section, but the section itself is on parisc outside of range from _stext to _end, which is why the static_obj() functions returns a wrong answer. While fixing this issue, I noticed that the whole existing check can be simplified a lot. Instead of checking against the _stext and _end symbols (which include code areas too) just check for the .data and .bss segments (since we check a data object). This can be done with the existing is_kernel_core_data() macro. In addition objects in the __initdata section can be checked with init_section_contains(), and is_kernel_rodata() allows keys to be in the _ro_after_init section. This partly reverts and simplifies commit `bac59d18c7` ("x86/setup: Fix static memory detection"). Link: https://lkml.kernel.org/r/ZNqrLRaOi/3wPAdp@p100 Fixes: `bac59d18c7` ("x86/setup: Fix static memory detection") Signed-off-by: Helge Deller <deller@gmx.de> Cc: Borislav Petkov <bp@suse.de> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Guenter Roeck <linux@roeck-us.net> Cc: Peter Zijlstra <peterz@infradead.org> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-08-21 13:46:24 -07:00
Mateusz Guzik	a7031f1452	kernel/fork: stop playing lockless games for exe_file replacement xchg originated in `6e399cd144` ("prctl: avoid using mmap_sem for exe_file serialization"). While the commit message does not explain why the change, I found the original submission [1] which ultimately claims it cleans things up by removing dependency of exe_file on the semaphore. However, `fe69d560b5` ("kernel/fork: always deny write access to current MM exe_file") added a semaphore up/down cycle to synchronize the state of exe_file against fork, defeating the point of the original change. This is on top of semaphore trips already present both in the replacing function and prctl (the only consumer). Normally replacing exe_file does not happen for busy processes, thus write-locking is not an impediment to performance in the intended use case. If someone keeps invoking the routine for a busy processes they are trying to play dirty and that's another reason to avoid any trickery. As such I think the atomic here only adds complexity for no benefit. Just write-lock around the replacement. I also note that replacement races against the mapping check loop as nothing synchronizes actual assignment with with said checks but I am not addressing it in this patch. (Is the loop of any use to begin with?) Link: https://lore.kernel.org/linux-mm/1424979417.10344.14.camel@stgolabs.net/ [1] Link: https://lkml.kernel.org/r/20230814172140.1777161-1-mjguzik@gmail.com Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: "Christian Brauner (Microsoft)" <brauner@kernel.org> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mateusz Guzik <mjguzik@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-08-21 13:46:24 -07:00
Aleksa Sarai	9876cfe8ec	memfd: replace ratcheting feature from vm.memfd_noexec with hierarchy This sysctl has the very unusual behaviour of not allowing any user (even CAP_SYS_ADMIN) to reduce the restriction setting, meaning that if you were to set this sysctl to a more restrictive option in the host pidns you would need to reboot your machine in order to reset it. The justification given in [1] is that this is a security feature and thus it should not be possible to disable. Aside from the fact that we have plenty of security-related sysctls that can be disabled after being enabled (fs.protected_symlinks for instance), the protection provided by the sysctl is to stop users from being able to create a binary and then execute it. A user with CAP_SYS_ADMIN can trivially do this without memfd_create(2): % cat mount-memfd.c #include <fcntl.h> #include <string.h> #include <stdio.h> #include <stdlib.h> #include <unistd.h> #include <linux/mount.h> #define SHELLCODE "#!/bin/echo this file was executed from this totally private tmpfs:" int main(void) { int fsfd = fsopen("tmpfs", FSOPEN_CLOEXEC); assert(fsfd >= 0); assert(!fsconfig(fsfd, FSCONFIG_CMD_CREATE, NULL, NULL, 2)); int dfd = fsmount(fsfd, FSMOUNT_CLOEXEC, 0); assert(dfd >= 0); int execfd = openat(dfd, "exe", O_CREAT \| O_RDWR \| O_CLOEXEC, 0782); assert(execfd >= 0); assert(write(execfd, SHELLCODE, strlen(SHELLCODE)) == strlen(SHELLCODE)); assert(!close(execfd)); char execpath = NULL; char argv[] = { "bad-exe", NULL }, envp[] = { NULL }; execfd = openat(dfd, "exe", O_PATH \| O_CLOEXEC); assert(execfd >= 0); assert(asprintf(&execpath, "/proc/self/fd/%d", execfd) > 0); assert(!execve(execpath, argv, envp)); } % ./mount-memfd this file was executed from this totally private tmpfs: /proc/self/fd/5 % Given that it is possible for CAP_SYS_ADMIN users to create executable binaries without memfd_create(2) and without touching the host filesystem (not to mention the many other things a CAP_SYS_ADMIN process would be able to do that would be equivalent or worse), it seems strange to cause a fair amount of headache to admins when there doesn't appear to be an actual security benefit to blocking this. There appear to be concerns about confused-deputy-esque attacks[2] but a confused deputy that can write to arbitrary sysctls is a bigger security issue than executable memfds. / New API / The primary requirement from the original author appears to be more based on the need to be able to restrict an entire system in a hierarchical manner[3], such that child namespaces cannot re-enable executable memfds. So, implement that behaviour explicitly -- the vm.memfd_noexec scope is evaluated up the pidns tree to &init_pid_ns and you have the most restrictive value applied to you. The new lower limit you can set vm.memfd_noexec is whatever limit applies to your parent. Note that a pidns will inherit a copy of the parent pidns's effective vm.memfd_noexec setting at unshare() time. This matches the existing behaviour, and it also ensures that a pidns will never have its vm.memfd_noexec setting lowered* behind its back (but it will be raised if the parent raises theirs). /* Backwards Compatibility / As the previous version of the sysctl didn't allow you to lower the setting at all, there are no backwards compatibility issues with this aspect of the change. However it should be noted that now that the setting is completely hierarchical. Previously, a cloned pidns would just copy the current pidns setting, meaning that if the parent's vm.memfd_noexec was changed it wouldn't propoagate to existing pid namespaces. Now, the restriction applies recursively. This is a uAPI change, however: The sysctl is very new, having been merged in 6.3. * Several aspects of the sysctl were broken up until this patchset and the other patchset by Jeff Xu last month. And thus it seems incredibly unlikely that any real users would run into this issue. In the worst case, if this causes userspace isues we could make it so that modifying the setting follows the hierarchical rules but the restriction checking uses the cached copy. [1]: https://lore.kernel.org/CABi2SkWnAgHK1i6iqSqPMYuNEhtHBkO8jUuCvmG3RmUB5TKHJw@mail.gmail.com/ [2]: https://lore.kernel.org/CALmYWFs_dNCzw_pW1yRAo4bGCPEtykroEQaowNULp7svwMLjOg@mail.gmail.com/ [3]: https://lore.kernel.org/CALmYWFuahdUF7cT4cm7_TGLqPanuHXJ-hVSfZt7vpTnc18DPrw@mail.gmail.com/ Link: https://lkml.kernel.org/r/20230814-memfd-vm-noexec-uapi-fixes-v2-4-7ff9e3e10ba6@cyphar.com Fixes: `105ff5339f` ("mm/memfd: add MFD_NOEXEC_SEAL and MFD_EXEC") Signed-off-by: Aleksa Sarai <cyphar@cyphar.com> Cc: Dominique Martinet <asmadeus@codewreck.org> Cc: Christian Brauner <brauner@kernel.org> Cc: Daniel Verkamp <dverkamp@chromium.org> Cc: Jeff Xu <jeffxu@google.com> Cc: Kees Cook <keescook@chromium.org> Cc: Shuah Khan <shuah@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-08-21 13:37:59 -07:00
Kefeng Wang	549f5c771e	perf/core: use vma_is_initial_stack() and vma_is_initial_heap() Use the helpers to simplify code, also kill unneeded goto cpy_name. Link: https://lkml.kernel.org/r/20230728050043.59880-5-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: David Hildenbrand <david@redhat.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Alex Deucher <alexander.deucher@amd.com> Cc: Christian Göttsche <cgzones@googlemail.com> Cc: "Christian König" <christian.koenig@amd.com> Cc: Daniel Vetter <daniel@ffwll.ch> Cc: David Airlie <airlied@gmail.com> Cc: Eric Paris <eparis@parisplace.org> Cc: Felix Kuehling <felix.kuehling@amd.com> Cc: "Pan, Xinhui" <Xinhui.Pan@amd.com> Cc: Paul Moore <paul@paul-moore.com> Cc: Stephen Smalley <stephen.smalley.work@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-08-21 13:37:32 -07:00
Arnd Bergmann	68af05143f	kernel/iomem.c: remove __weak ioremap_cache helper No portable code calls into this function any more, and on architectures that don't use or define their own, it causes a warning: kernel/iomem.c:10:22: warning: no previous prototype for 'ioremap_cache' [-Wmissing-prototypes] 10 \| __weak void __iomem *ioremap_cache(resource_size_t offset, unsigned long size) Fold it into the only caller that uses it on architectures without the #define. Note that the fallback to ioremap is probably still wrong on those architectures, but this is what it's always done there. Link: https://lkml.kernel.org/r/20230726145432.1617809-1-arnd@kernel.org Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Baoquan He <bhe@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-08-21 13:37:28 -07:00
Elena Reshetova	2ddd3cac1f	nsproxy: Convert nsproxy.count to refcount_t atomic_t variables are currently used to implement reference counters with the following properties: - counter is initialized to 1 using atomic_set() - a resource is freed upon counter reaching zero - once counter reaches zero, its further increments aren't allowed - counter schema uses basic atomic operations (set, inc, inc_not_zero, dec_and_test, etc.) Such atomic variables should be converted to a newly provided refcount_t type and API that prevents accidental counter overflows and underflows. This is important since overflows and underflows can lead to use-after-free situation and be exploitable. The variable nsproxy.count is used as pure reference counter. Convert it to refcount_t and fix up the operations. **Important note for maintainers: Some functions from refcount_t API defined in refcount.h have different memory ordering guarantees than their atomic counterparts. Please check Documentation/core-api/refcount-vs-atomic.rst for more information. Normally the differences should not matter since refcount_t provides enough guarantees to satisfy the refcounting use cases, but in some rare cases it might matter. Please double check that you don't have some undocumented memory guarantees for this variable usage. For the nsproxy.count it might make a difference in following places: - put_nsproxy() and switch_task_namespaces(): decrement in refcount_dec_and_test() only provides RELEASE ordering and ACQUIRE ordering on success vs. fully ordered atomic counterpart Suggested-by: Kees Cook <keescook@chromium.org> Signed-off-by: Elena Reshetova <elena.reshetova@intel.com> Reviewed-by: David Windsor <dwindsor@gmail.com> Reviewed-by: Hans Liljestrand <ishkamiel@gmail.com> Reviewed-by: Christian Brauner <brauner@kernel.org> Link: https://lore.kernel.org/r/20230818041327.gonna.210-kees@kernel.org Signed-off-by: Kees Cook <keescook@chromium.org>	2023-08-21 11:29:12 -07:00
Zheng Yejian	c2489bb7e6	tracing: Introduce pipe_cpumask to avoid race on trace_pipes There is race issue when concurrently splice_read main trace_pipe and per_cpu trace_pipes which will result in data read out being different from what actually writen. As suggested by Steven: > I believe we should add a ref count to trace_pipe and the per_cpu > trace_pipes, where if they are opened, nothing else can read it. > > Opening trace_pipe locks all per_cpu ref counts, if any of them are > open, then the trace_pipe open will fail (and releases any ref counts > it had taken). > > Opening a per_cpu trace_pipe will up the ref count for just that > CPU buffer. This will allow multiple tasks to read different per_cpu > trace_pipe files, but will prevent the main trace_pipe file from > being opened. But because we only need to know whether per_cpu trace_pipe is open or not, using a cpumask instead of using ref count may be easier. After this patch, users will find that: - Main trace_pipe can be opened by only one user, and if it is opened, all per_cpu trace_pipes cannot be opened; - Per_cpu trace_pipes can be opened by multiple users, but each per_cpu trace_pipe can only be opened by one user. And if one of them is opened, main trace_pipe cannot be opened. Link: https://lore.kernel.org/linux-trace-kernel/20230818022645.1948314-1-zhengyejian1@huawei.com Suggested-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com> Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2023-08-21 11:17:14 -04:00
Greg Kroah-Hartman	642073c306	Merge commit `b320441c04` ("Merge tag 'tty-6.5-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty") into tty-next We need the serial-core fixes in here as well. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2023-08-20 14:29:37 +02:00
Jakub Kicinski	7ff57803d2	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Cross-merge networking fixes after downstream PR. Conflicts: drivers/net/ethernet/sfc/tc.c `fa165e1949` ("sfc: don't unregister flow_indr if it was never registered") `3bf969e88a` ("sfc: add MAE table machinery for conntrack table") https://lore.kernel.org/all/20230818112159.7430e9b4@canb.auug.org.au/ No adjacent changes. Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-08-18 12:44:56 -07:00
Douglas Anderson	1f38c86bb2	watchdog/hardlockup: avoid large stack frames in watchdog_hardlockup_check() After commit `77c12fc959` ("watchdog/hardlockup: add a "cpu" param to watchdog_hardlockup_check()") we started storing a `struct cpumask` on the stack in watchdog_hardlockup_check(). On systems with CONFIG_NR_CPUS set to 8192 this takes up 1K on the stack. That triggers warnings with `CONFIG_FRAME_WARN` set to 1024. We'll use the new trigger_allbutcpu_cpu_backtrace() to avoid needing to use a CPU mask at all. Link: https://lkml.kernel.org/r/20230804065935.v4.2.I501ab68cb926ee33a7c87e063d207abf09b9943c@changeid Fixes: `77c12fc959` ("watchdog/hardlockup: add a "cpu" param to watchdog_hardlockup_check()") Signed-off-by: Douglas Anderson <dianders@chromium.org> Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/r/202307310955.pLZDhpnl-lkp@intel.com Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Petr Mladek <pmladek@suse.com> Cc: Lecopzer Chen <lecopzer.chen@mediatek.com> Cc: Pingfan Liu <kernelfans@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-08-18 10:19:00 -07:00
Douglas Anderson	8d539b84f1	nmi_backtrace: allow excluding an arbitrary CPU The APIs that allow backtracing across CPUs have always had a way to exclude the current CPU. This convenience means callers didn't need to find a place to allocate a CPU mask just to handle the common case. Let's extend the API to take a CPU ID to exclude instead of just a boolean. This isn't any more complex for the API to handle and allows the hardlockup detector to exclude a different CPU (the one it already did a trace for) without needing to find space for a CPU mask. Arguably, this new API also encourages safer behavior. Specifically if the caller wants to avoid tracing the current CPU (maybe because they already traced the current CPU) this makes it more obvious to the caller that they need to make sure that the current CPU ID can't change. [akpm@linux-foundation.org: fix trigger_allbutcpu_cpu_backtrace() stub] Link: https://lkml.kernel.org/r/20230804065935.v4.1.Ia35521b91fc781368945161d7b28538f9996c182@changeid Signed-off-by: Douglas Anderson <dianders@chromium.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: kernel test robot <lkp@intel.com> Cc: Lecopzer Chen <lecopzer.chen@mediatek.com> Cc: Petr Mladek <pmladek@suse.com> Cc: Pingfan Liu <kernelfans@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-08-18 10:19:00 -07:00
Greg Kroah-Hartman	be33db2142	kthread: unexport __kthread_should_park() There are no in-kernel users of __kthread_should_park() so mark it as static and do not export it. Link: https://lkml.kernel.org/r/2023080450-handcuff-stump-1d6e@gregkh Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Kees Cook <keescook@chromium.org> Cc: John Stultz <jstultz@google.com> Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org> Cc: "Arve Hjønnevåg" <arve@android.com> Cc: Valentin Schneider <vschneid@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: "Christian Brauner (Microsoft)" <brauner@kernel.org> Cc: Mike Christie <michael.christie@oracle.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Zqiang <qiang1.zhang@intel.com> Cc: Prathu Baronia <quic_pbaronia@quicinc.com> Cc: Sami Tolvanen <samitolvanen@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-08-18 10:18:59 -07:00
Arnd Bergmann	29665c1e2a	gcov: shut up missing prototype warnings for internal stubs gcov uses global functions that are called from generated code, but these have no prototype in a header, which causes a W=1 build warning: kernel/gcov/gcc_base.c:12:6: error: no previous prototype for '__gcov_init' [-Werror=missing-prototypes] kernel/gcov/gcc_base.c:40:6: error: no previous prototype for '__gcov_flush' [-Werror=missing-prototypes] kernel/gcov/gcc_base.c:46:6: error: no previous prototype for '__gcov_merge_add' [-Werror=missing-prototypes] kernel/gcov/gcc_base.c:52:6: error: no previous prototype for '__gcov_merge_single' [-Werror=missing-prototypes] Just turn off these warnings unconditionally for the two files that contain them. Link: https://lore.kernel.org/all/0820010f-e9dc-779d-7924-49c7df446bce@linux.ibm.com/ Link: https://lkml.kernel.org/r/20230725123042.2269077-1-arnd@kernel.org Signed-off-by: Arnd Bergmann <arnd@arndb.de> Tested-by: Peter Oberparleiter <oberpar@linux.ibm.com> Acked-by: Peter Oberparleiter <oberpar@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-08-18 10:18:58 -07:00
Li kunyu	598f0046e9	kernel: relay: remove unnecessary NULL values from relay_open_buf buf is assigned first, so it does not need to initialize the assignment. Link: https://lkml.kernel.org/r/20230713234459.2908-1-kunyu@nfschina.com Signed-off-by: Li kunyu <kunyu@nfschina.com> Reviewed-by: Andrew Morton <akpm@linux-foudation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-08-18 10:18:55 -07:00
Eric DeVolder	95d1fef537	remove ARCH_DEFAULT_KEXEC from Kconfig.kexec This patch is a minor cleanup to the series "refactor Kconfig to consolidate KEXEC and CRASH options". In that series, a new option ARCH_DEFAULT_KEXEC was introduced in order to obtain the equivalent behavior of s390 original Kconfig settings for KEXEC. As it turns out, this new option did not fully provide the equivalent behavior, rather a "select KEXEC" did. As such, the ARCH_DEFAULT_KEXEC is not needed anymore, so remove it. Link: https://lkml.kernel.org/r/20230802161750.2215-1-eric.devolder@oracle.com Signed-off-by: Eric DeVolder <eric.devolder@oracle.com> Acked-by: Alexander Gordeev <agordeev@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-08-18 10:18:55 -07:00
Eric DeVolder	e6265fe777	kexec: rename ARCH_HAS_KEXEC_PURGATORY The Kconfig refactor to consolidate KEXEC and CRASH options utilized option names of the form ARCH_SUPPORTS_<option>. Thus rename the ARCH_HAS_KEXEC_PURGATORY to ARCH_SUPPORTS_KEXEC_PURGATORY to follow the same. Link: https://lkml.kernel.org/r/20230712161545.87870-15-eric.devolder@oracle.com Signed-off-by: Eric DeVolder <eric.devolder@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-08-18 10:18:54 -07:00
Eric DeVolder	89cde45591	kexec: consolidate kexec and crash options into kernel/Kconfig.kexec Patch series "refactor Kconfig to consolidate KEXEC and CRASH options", v6. The Kconfig is refactored to consolidate KEXEC and CRASH options from various arch/<arch>/Kconfig files into new file kernel/Kconfig.kexec. The Kconfig.kexec is now a submenu titled "Kexec and crash features" located under "General Setup". The following options are impacted: - KEXEC - KEXEC_FILE - KEXEC_SIG - KEXEC_SIG_FORCE - KEXEC_IMAGE_VERIFY_SIG - KEXEC_BZIMAGE_VERIFY_SIG - KEXEC_JUMP - CRASH_DUMP Over time, these options have been copied between Kconfig files and are very similar to one another, but with slight differences. The following architectures are impacted by the refactor (because of use of one or more KEXEC/CRASH options): - arm - arm64 - ia64 - loongarch - m68k - mips - parisc - powerpc - riscv - s390 - sh - x86 More information: In the patch series "crash: Kernel handling of CPU and memory hot un/plug" https://lore.kernel.org/lkml/20230503224145.7405-1-eric.devolder@oracle.com/ the new kernel feature introduces the config option CRASH_HOTPLUG. In reviewing, Thomas Gleixner requested that the new config option not be placed in x86 Kconfig. Rather the option needs a generic/common home. To Thomas' point, the KEXEC and CRASH options have largely been duplicated in the various arch/<arch>/Kconfig files, with minor differences. This kind of proliferation is to be avoid/stopped. https://lore.kernel.org/lkml/875y91yv63.ffs@tglx/ To that end, I have refactored the arch Kconfigs so as to consolidate the various KEXEC and CRASH options. Generally speaking, this work has the following themes: - KEXEC and CRASH options are moved into new file kernel/Kconfig.kexec - These items from arch/Kconfig: CRASH_CORE KEXEC_CORE KEXEC_ELF HAVE_IMA_KEXEC - These items from arch/x86/Kconfig form the common options: KEXEC KEXEC_FILE KEXEC_SIG KEXEC_SIG_FORCE KEXEC_BZIMAGE_VERIFY_SIG KEXEC_JUMP CRASH_DUMP - These items from arch/arm64/Kconfig form the common options: KEXEC_IMAGE_VERIFY_SIG - The crash hotplug series appends CRASH_HOTPLUG to Kconfig.kexec - The Kconfig.kexec is now a submenu titled "Kexec and crash features" and is now listed in "General Setup" submenu from init/Kconfig. - To control the common options, each has a new ARCH_SUPPORTS_<option> option. These gateway options determine whether the common options options are valid for the architecture. - To account for the slight differences in the original architecture coding of the common options, each now has a corresponding ARCH_SELECTS_<option> which are used to elicit the same side effects as the original arch/<arch>/Kconfig files for KEXEC and CRASH options. An example, 'make menuconfig' illustrating the submenu: > General setup > Kexec and crash features [] Enable kexec system call [] Enable kexec file based system call [] Verify kernel signature during kexec_file_load() syscall [ ] Require a valid signature in kexec_file_load() syscall [ ] Enable bzImage signature verification support [] kexec jump [] kernel crash dumps [] Update the crash elfcorehdr on system configuration changes In the process of consolidating the common options, I encountered slight differences in the coding of these options in several of the architectures. As a result, I settled on the following solution: - Each of the common options has a 'depends on ARCH_SUPPORTS_<option>' statement. For example, the KEXEC_FILE option has a 'depends on ARCH_SUPPORTS_KEXEC_FILE' statement. This approach is needed on all common options so as to prevent options from appearing for architectures which previously did not allow/enable them. For example, arm supports KEXEC but not KEXEC_FILE. The arch/arm/Kconfig does not provide ARCH_SUPPORTS_KEXEC_FILE and so KEXEC_FILE and related options are not available to arm. - The boolean ARCH_SUPPORTS_<option> in effect allows the arch to determine when the feature is allowed. Archs which don't have the feature simply do not provide the corresponding ARCH_SUPPORTS_<option>. For each arch, where there previously were KEXEC and/or CRASH options, these have been replaced with the corresponding boolean ARCH_SUPPORTS_<option>, and an appropriate def_bool statement. For example, if the arch supports KEXEC_FILE, then the ARCH_SUPPORTS_KEXEC_FILE simply has a 'def_bool y'. This permits the KEXEC_FILE option to be available. If the arch has a 'depends on' statement in its original coding of the option, then that expression becomes part of the def_bool expression. For example, arm64 had: config KEXEC depends on PM_SLEEP_SMP and in this solution, this converts to: config ARCH_SUPPORTS_KEXEC def_bool PM_SLEEP_SMP - In order to account for the architecture differences in the coding for the common options, the ARCH_SELECTS_<option> in the arch/<arch>/Kconfig is used. This option has a 'depends on <option>' statement to couple it to the main option, and from there can insert the differences from the common option and the arch original coding of that option. For example, a few archs enable CRYPTO and CRYTPO_SHA256 for KEXEC_FILE. These require a ARCH_SELECTS_KEXEC_FILE and 'select CRYPTO' and 'select CRYPTO_SHA256' statements. Illustrating the option relationships: For each of the common KEXEC and CRASH options: ARCH_SUPPORTS_<option> <- <option> <- ARCH_SELECTS_<option> <option> # in Kconfig.kexec ARCH_SUPPORTS_<option> # in arch/<arch>/Kconfig, as needed ARCH_SELECTS_<option> # in arch/<arch>/Kconfig, as needed For example, KEXEC: ARCH_SUPPORTS_KEXEC <- KEXEC <- ARCH_SELECTS_KEXEC KEXEC # in Kconfig.kexec ARCH_SUPPORTS_KEXEC # in arch/<arch>/Kconfig, as needed ARCH_SELECTS_KEXEC # in arch/<arch>/Kconfig, as needed To summarize, the ARCH_SUPPORTS_<option> permits the <option> to be enabled, and the ARCH_SELECTS_<option> handles side effects (ie. select statements). Examples: A few examples to show the new strategy in action: ===== x86 (minus the help section) ===== Original: config KEXEC bool "kexec system call" select KEXEC_CORE config KEXEC_FILE bool "kexec file based system call" select KEXEC_CORE select HAVE_IMA_KEXEC if IMA depends on X86_64 depends on CRYPTO=y depends on CRYPTO_SHA256=y config ARCH_HAS_KEXEC_PURGATORY def_bool KEXEC_FILE config KEXEC_SIG bool "Verify kernel signature during kexec_file_load() syscall" depends on KEXEC_FILE config KEXEC_SIG_FORCE bool "Require a valid signature in kexec_file_load() syscall" depends on KEXEC_SIG config KEXEC_BZIMAGE_VERIFY_SIG bool "Enable bzImage signature verification support" depends on KEXEC_SIG depends on SIGNED_PE_FILE_VERIFICATION select SYSTEM_TRUSTED_KEYRING config CRASH_DUMP bool "kernel crash dumps" depends on X86_64 \|\| (X86_32 && HIGHMEM) config KEXEC_JUMP bool "kexec jump" depends on KEXEC && HIBERNATION help becomes... New: config ARCH_SUPPORTS_KEXEC def_bool y config ARCH_SUPPORTS_KEXEC_FILE def_bool X86_64 && CRYPTO && CRYPTO_SHA256 config ARCH_SELECTS_KEXEC_FILE def_bool y depends on KEXEC_FILE select HAVE_IMA_KEXEC if IMA config ARCH_SUPPORTS_KEXEC_PURGATORY def_bool KEXEC_FILE config ARCH_SUPPORTS_KEXEC_SIG def_bool y config ARCH_SUPPORTS_KEXEC_SIG_FORCE def_bool y config ARCH_SUPPORTS_KEXEC_BZIMAGE_VERIFY_SIG def_bool y config ARCH_SUPPORTS_KEXEC_JUMP def_bool y config ARCH_SUPPORTS_CRASH_DUMP def_bool X86_64 \|\| (X86_32 && HIGHMEM) ===== powerpc (minus the help section) ===== Original: config KEXEC bool "kexec system call" depends on PPC_BOOK3S \|\| PPC_E500 \|\| (44x && !SMP) select KEXEC_CORE config KEXEC_FILE bool "kexec file based system call" select KEXEC_CORE select HAVE_IMA_KEXEC if IMA select KEXEC_ELF depends on PPC64 depends on CRYPTO=y depends on CRYPTO_SHA256=y config ARCH_HAS_KEXEC_PURGATORY def_bool KEXEC_FILE config CRASH_DUMP bool "Build a dump capture kernel" depends on PPC64 \|\| PPC_BOOK3S_32 \|\| PPC_85xx \|\| (44x && !SMP) select RELOCATABLE if PPC64 \|\| 44x \|\| PPC_85xx becomes... New: config ARCH_SUPPORTS_KEXEC def_bool PPC_BOOK3S \|\| PPC_E500 \|\| (44x && !SMP) config ARCH_SUPPORTS_KEXEC_FILE def_bool PPC64 && CRYPTO=y && CRYPTO_SHA256=y config ARCH_SUPPORTS_KEXEC_PURGATORY def_bool KEXEC_FILE config ARCH_SELECTS_KEXEC_FILE def_bool y depends on KEXEC_FILE select KEXEC_ELF select HAVE_IMA_KEXEC if IMA config ARCH_SUPPORTS_CRASH_DUMP def_bool PPC64 \|\| PPC_BOOK3S_32 \|\| PPC_85xx \|\| (44x && !SMP) config ARCH_SELECTS_CRASH_DUMP def_bool y depends on CRASH_DUMP select RELOCATABLE if PPC64 \|\| 44x \|\| PPC_85xx Testing Approach and Results There are 388 config files in the arch/<arch>/configs directories. For each of these config files, a .config is generated both before and after this Kconfig series, and checked for equivalence. This approach allows for a rather rapid check of all architectures and a wide variety of configs wrt/ KEXEC and CRASH, and avoids requiring compiling for all architectures and running kernels and run-time testing. For each config file, the olddefconfig, allnoconfig and allyesconfig targets are utilized. In testing the randconfig has revealed problems as well, but is not used in the before and after equivalence check since one can not generate the "same" .config for before and after, even if using the same KCONFIG_SEED since the option list is different. As such, the following script steps compare the before and after of 'make olddefconfig'. The new symbols introduced by this series are filtered out, but otherwise the config files are PASS only if they were equivalent, and FAIL otherwise. The script performs the test by doing the following: # Obtain the "golden" .config output for given config file # Reset test sandbox git checkout master git branch -D test_Kconfig git checkout -B test_Kconfig master make distclean # Write out updated config cp -f <config file> .config make ARCH=<arch> olddefconfig # Track each item in .config, LHSB is "golden" scoreboard .config # Obtain the "changed" .config output for given config file # Reset test sandbox make distclean # Apply this Kconfig series git am <this Kconfig series> # Write out updated config cp -f <config file> .config make ARCH=<arch> olddefconfig # Track each item in .config, RHSB is "changed" scoreboard .config # Determine test result # Filter-out new symbols introduced by this series # Filter-out symbol=n which not in either scoreboard # Compare LHSB "golden" and RHSB "changed" scoreboards and issue PASS/FAIL The script was instrumental during the refactoring of Kconfig as it continually revealed problems. The end result being that the solution presented in this series passes all configs as checked by the script, with the following exceptions: - arch/ia64/configs/zx1_config with olddefconfig This config file has: # CONFIG_KEXEC is not set CONFIG_CRASH_DUMP=y and this refactor now couples KEXEC to CRASH_DUMP, so it is not possible to enable CRASH_DUMP without KEXEC. - arch/sh/configs/* with allyesconfig The arch/sh/Kconfig codes CRASH_DUMP as dependent upon BROKEN_ON_MMU (which clearly is not meant to be set). This symbol is not provided but with the allyesconfig it is set to yes which enables CRASH_DUMP. But KEXEC is coded as dependent upon MMU, and is set to no in arch/sh/mm/Kconfig, so KEXEC is not enabled. This refactor now couples KEXEC to CRASH_DUMP, so it is not possible to enable CRASH_DUMP without KEXEC. While the above exceptions are not equivalent to their original, the config file produced is valid (and in fact better wrt/ CRASH_DUMP handling). This patch (of 14) The config options for kexec and crash features are consolidated into new file kernel/Kconfig.kexec. Under the "General Setup" submenu is a new submenu "Kexec and crash handling". All the kexec and crash options that were once in the arch-dependent submenu "Processor type and features" are now consolidated in the new submenu. The following options are impacted: - KEXEC - KEXEC_FILE - KEXEC_SIG - KEXEC_SIG_FORCE - KEXEC_BZIMAGE_VERIFY_SIG - KEXEC_JUMP - CRASH_DUMP The three main options are KEXEC, KEXEC_FILE and CRASH_DUMP. Architectures specify support of certain KEXEC and CRASH features with similarly named new ARCH_SUPPORTS_<option> config options. Architectures can utilize the new ARCH_SELECTS_<option> config options to specify additional components when <option> is enabled. To summarize, the ARCH_SUPPORTS_<option> permits the <option> to be enabled, and the ARCH_SELECTS_<option> handles side effects (ie. select statements). Link: https://lkml.kernel.org/r/20230712161545.87870-1-eric.devolder@oracle.com Link: https://lkml.kernel.org/r/20230712161545.87870-2-eric.devolder@oracle.com Signed-off-by: Eric DeVolder <eric.devolder@oracle.com> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Baoquan He <bhe@redhat.com> Cc: Borislav Petkov (AMD) <bp@alien8.de> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Cc. "H. Peter Anvin" <hpa@zytor.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dave Hansen <dave.hansen@linux.intel.com> # for x86 Cc: Frederic Weisbecker <frederic@kernel.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Hari Bathini <hbathini@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Helge Deller <deller@gmx.de> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com> Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Cc: Juerg Haefliger <juerg.haefliger@canonical.com> Cc: Kees Cook <keescook@chromium.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Linus Walleij <linus.walleij@linaro.org> Cc: Marc Aurèle La France <tsi@tuyoix.net> Cc: Masahiro Yamada <masahiroy@kernel.org> Cc: Masami Hiramatsu (Google) <mhiramat@kernel.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Nick Desaulniers <ndesaulniers@google.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rich Felker <dalias@libc.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Cc: Sami Tolvanen <samitolvanen@google.com> Cc: Sebastian Reichel <sebastian.reichel@collabora.com> Cc: Sourabh Jain <sourabhjain@linux.ibm.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Tejun Heo <tj@kernel.org> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: WANG Xuerui <kernel@xen0n.name> Cc: Will Deacon <will@kernel.org> Cc: Xin Li <xin3.li@intel.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: Zhen Lei <thunder.leizhen@huawei.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-08-18 10:18:51 -07:00
Azeem Shaikh	4264be505d	acct: replace all non-returning strlcpy with strscpy strlcpy() reads the entire source buffer first. This read may exceed the destination size limit. This is both inefficient and can lead to linear read overflows if a source string is not NUL-terminated [1]. In an effort to remove strlcpy() completely [2], replace strlcpy() here with strscpy(). No return values were used, so direct replacement is safe. [1] https://www.kernel.org/doc/html/latest/process/deprecated.html#strlcpy [2] https://github.com/KSPP/linux/issues/89 Link: https://lkml.kernel.org/r/20230710011748.3538624-1-azeemshaikh38@gmail.com Signed-off-by: Azeem Shaikh <azeemshaikh38@gmail.com> Cc: Kees Cook <keescook@chromium.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-08-18 10:18:51 -07:00
Vincent Whitchurch	b0b88e02f0	signal: print comm and exe name on fatal signals Make the print-fatal-signals message more useful by printing the comm and the exe name for the process which received the fatal signal: Before: potentially unexpected fatal signal 4 potentially unexpected fatal signal 11 After: buggy-program: pool: potentially unexpected fatal signal 4 some-daemon: gdbus: potentially unexpected fatal signal 11 comm used to be present but was removed in commit `681a90ffe8` ("arc, print-fatal-signals: reduce duplicated information") because it's also included as part of the later stack trace. Having the comm as part of the main "unexpected fatal..." print is rather useful though when analysing logs, and the exe name is also valuable as shown in the examples above where the comm ends up having some generic name like "pool". [akpm@linux-foundation.org: don't include linux/file.h twice] Link: https://lkml.kernel.org/r/20230707-fatal-comm-v1-1-400363905d5e@axis.com Signed-off-by: Vincent Whitchurch <vincent.whitchurch@axis.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Tejun Heo <tj@kernel.org> Cc: Vineet Gupta <vgupta@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-08-18 10:18:50 -07:00
tiozhang	4099451ac2	cred: convert printks to pr_<level> Use current logging style. Link: https://lkml.kernel.org/r/20230625033452.GA22858@didi-ThinkCentre-M930t-N000 Signed-off-by: tiozhang <tiozhang@didiglobal.com> Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Joe Perches <joe@perches.com> Cc: Kees Cook <keescook@chromium.org> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: Paulo Alcantara <pc@cjr.nz> Cc: Weiping Zhang <zwp10758@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-08-18 10:18:49 -07:00
Alistair Popple	ec8832d007	mmu_notifiers: don't invalidate secondary TLBs as part of mmu_notifier_invalidate_range_end() Secondary TLBs are now invalidated from the architecture specific TLB invalidation functions. Therefore there is no need to explicitly notify or invalidate as part of the range end functions. This means we can remove mmu_notifier_invalidate_range_end_only() and some of the ptep_*_notify() functions. Link: https://lkml.kernel.org/r/90d749d03cbab256ca0edeb5287069599566d783.1690292440.git-series.apopple@nvidia.com Signed-off-by: Alistair Popple <apopple@nvidia.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Cc: Andrew Donnellan <ajd@linux.ibm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Chaitanya Kumar Borah <chaitanya.kumar.borah@intel.com> Cc: Frederic Barrat <fbarrat@linux.ibm.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Kevin Tian <kevin.tian@intel.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Nicolin Chen <nicolinc@nvidia.com> Cc: Robin Murphy <robin.murphy@arm.com> Cc: Sean Christopherson <seanjc@google.com> Cc: SeongJae Park <sj@kernel.org> Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> Cc: Will Deacon <will@kernel.org> Cc: Zhi Wang <zhi.wang.linux@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-08-18 10:12:41 -07:00
Baoquan He	016fec9101	mm: move is_ioremap_addr() into new header file Now is_ioremap_addr() is only used in kernel/iomem.c and gonna be used in mm/ioremap.c. Move it into its own new header file linux/ioremap.h. Link: https://lkml.kernel.org/r/20230706154520.11257-17-bhe@redhat.com Suggested-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Baoquan He <bhe@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Brian Cain <bcain@quicinc.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Chris Zankel <chris@zankel.net> Cc: David Laight <David.Laight@ACULAB.COM> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Helge Deller <deller@gmx.de> Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com> Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Cc: Jonas Bonn <jonas@southpole.se> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Nathan Chancellor <nathan@kernel.org> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Niklas Schnelle <schnelle@linux.ibm.com> Cc: Rich Felker <dalias@libc.org> Cc: Stafford Horne <shorne@gmail.com> Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vineet Gupta <vgupta@kernel.org> Cc: Will Deacon <will@kernel.org> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-08-18 10:12:35 -07:00
Miaohe Lin	3fade62b62	mm/mm_init.c: remove obsolete macro HASH_SMALL HASH_SMALL only works when parameter numentries is 0. But the sole caller futex_init() never calls alloc_large_system_hash() with numentries set to 0. So HASH_SMALL is obsolete and remove it. Link: https://lkml.kernel.org/r/20230625021323.849147-1-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org> Cc: André Almeida <andrealmeid@igalia.com> Cc: Darren Hart <dvhart@infradead.org> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Ingo Molnar <mingo@redhat.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-08-18 10:12:07 -07:00
Kefeng Wang	527ed4f7d9	mm: remove arguments of show_mem() All callers of show_mem() pass 0 and NULL, so we can remove the two arguments by directly calling __show_mem(0, NULL, MAX_NR_ZONES - 1) in show_mem(). Link: https://lkml.kernel.org/r/20230630062253.189440-1-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-08-18 10:12:02 -07:00
Gustavo A. R. Silva	78d44b824e	cgroup: Avoid -Wstringop-overflow warnings Change the notation from pointer-to-array to pointer-to-pointer. With this, we avoid the compiler complaining about trying to access a region of size zero as an argument during function calls. This is a workaround to prevent the compiler complaining about accessing an array of size zero when evaluating the arguments of a couple of function calls. See below: kernel/cgroup/cgroup.c: In function 'find_css_set': kernel/cgroup/cgroup.c:1206:16: warning: 'find_existing_css_set' accessing 4 bytes in a region of size 0 [-Wstringop-overflow=] 1206 \| cset = find_existing_css_set(old_cset, cgrp, template); \| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ kernel/cgroup/cgroup.c:1206:16: note: referencing argument 3 of type 'struct cgroup_subsys_state [0]' kernel/cgroup/cgroup.c:1071:24: note: in a call to function 'find_existing_css_set' 1071 \| static struct css_set find_existing_css_set(struct css_set *old_cset, \| ^~~~~~~~~~~~~~~~~~~~~ With the change to pointer-to-pointer, the functions are not prevented from being executed, and they will do what they have to do when CGROUP_SUBSYS_COUNT == 0. Address the following -Wstringop-overflow warnings seen when built with ARM architecture and aspeed_g4_defconfig configuration (notice that under this configuration CGROUP_SUBSYS_COUNT == 0): kernel/cgroup/cgroup.c:1208:16: warning: 'find_existing_css_set' accessing 4 bytes in a region of size 0 [-Wstringop-overflow=] kernel/cgroup/cgroup.c:1258:15: warning: 'css_set_hash' accessing 4 bytes in a region of size 0 [-Wstringop-overflow=] kernel/cgroup/cgroup.c:6089:18: warning: 'css_set_hash' accessing 4 bytes in a region of size 0 [-Wstringop-overflow=] kernel/cgroup/cgroup.c:6153:18: warning: 'css_set_hash' accessing 4 bytes in a region of size 0 [-Wstringop-overflow=] This results in no differences in binary output. Link: https://github.com/KSPP/linux/issues/316 Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Tejun Heo <tj@kernel.org>	2023-08-17 11:55:05 -10:00
Kees Cook	46822860a5	seccomp: Add missing kerndoc notations The kerndoc for some struct member and function arguments were missing. Add them. Cc: Andy Lutomirski <luto@amacapital.net> Cc: Will Drewry <wad@chromium.org> Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202308171742.AncabIG1-lkp@intel.com/ Signed-off-by: Kees Cook <keescook@chromium.org>	2023-08-17 12:32:15 -07:00
Zheng Yejian	eecb91b9f9	tracing: Fix memleak due to race between current_tracer and trace Kmemleak report a leak in graph_trace_open(): unreferenced object 0xffff0040b95f4a00 (size 128): comm "cat", pid 204981, jiffies 4301155872 (age 99771.964s) hex dump (first 32 bytes): e0 05 e7 b4 ab 7d 00 00 0b 00 01 00 00 00 00 00 .....}.......... f4 00 01 10 00 a0 ff ff 00 00 00 00 65 00 10 00 ............e... backtrace: [<000000005db27c8b>] kmem_cache_alloc_trace+0x348/0x5f0 [<000000007df90faa>] graph_trace_open+0xb0/0x344 [<00000000737524cd>] __tracing_open+0x450/0xb10 [<0000000098043327>] tracing_open+0x1a0/0x2a0 [<00000000291c3876>] do_dentry_open+0x3c0/0xdc0 [<000000004015bcd6>] vfs_open+0x98/0xd0 [<000000002b5f60c9>] do_open+0x520/0x8d0 [<00000000376c7820>] path_openat+0x1c0/0x3e0 [<00000000336a54b5>] do_filp_open+0x14c/0x324 [<000000002802df13>] do_sys_openat2+0x2c4/0x530 [<0000000094eea458>] __arm64_sys_openat+0x130/0x1c4 [<00000000a71d7881>] el0_svc_common.constprop.0+0xfc/0x394 [<00000000313647bf>] do_el0_svc+0xac/0xec [<000000002ef1c651>] el0_svc+0x20/0x30 [<000000002fd4692a>] el0_sync_handler+0xb0/0xb4 [<000000000c309c35>] el0_sync+0x160/0x180 The root cause is descripted as follows: __tracing_open() { // 1. File 'trace' is being opened; ... iter->trace = tr->current_trace; // 2. Tracer 'function_graph' is // currently set; ... iter->trace->open(iter); // 3. Call graph_trace_open() here, // and memory are allocated in it; ... } s_start() { // 4. The opened file is being read; ... iter->trace = tr->current_trace; // 5. If tracer is switched to // 'nop' or others, then memory // in step 3 are leaked!!! ... } To fix it, in s_start(), close tracer before switching then reopen the new tracer after switching. And some tracers like 'wakeup' may not update 'iter->private' in some cases when reopen, then it should be cleared to avoid being mistakenly closed again. Link: https://lore.kernel.org/linux-trace-kernel/20230817125539.1646321-1-zhengyejian1@huawei.com Fixes: `d7350c3f45` ("tracing/core: make the read callbacks reentrants") Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2023-08-17 13:49:37 -04:00
Peter Zijlstra	63304558ba	sched/eevdf: Curb wakeup-preemption Mike and others noticed that EEVDF does like to over-schedule quite a bit -- which does hurt performance of a number of benchmarks / workloads. In particular, what seems to cause over-scheduling is that when lag is of the same order (or larger) than the request / slice then placement will not only cause the task to be placed left of current, but also with a smaller deadline than current, which causes immediate preemption. [ notably, lag bounds are relative to HZ ] Mike suggested we stick to picking 'current' for as long as it's eligible to run, giving it uninterrupted runtime until it reaches parity with the pack. Augment Mike's suggestion by only allowing it to exhaust it's initial request. One random data point: echo NO_RUN_TO_PARITY > /debug/sched/features perf stat -a -e context-switches --repeat 10 -- perf bench sched messaging -g 20 -t -l 5000 3,723,554 context-switches ( +- 0.56% ) 9.5136 +- 0.0394 seconds time elapsed ( +- 0.41% ) echo RUN_TO_PARITY > /debug/sched/features perf stat -a -e context-switches --repeat 10 -- perf bench sched messaging -g 20 -t -l 5000 2,556,535 context-switches ( +- 0.51% ) 9.2427 +- 0.0302 seconds time elapsed ( +- 0.33% ) Suggested-by: Mike Galbraith <umgwanakikbuti@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20230816134059.GC982867@hirez.programming.kicks-ass.net	2023-08-17 17:07:07 +02:00
Paul E. McKenney	fe24a0b632	Merge branches 'doc.2023.07.14b', 'fixes.2023.08.16a', 'rcu-tasks.2023.07.24a', 'rcuscale.2023.07.14b', 'refscale.2023.07.14b', 'torture.2023.08.14a' and 'torturescripts.2023.07.20a' into HEAD doc.2023.07.14b: Documentation updates. fixes.2023.08.16a: Miscellaneous fixes. rcu-tasks.2023.07.24a: RCU Tasks updates. rcuscale.2023.07.14b: RCU (updater) scalability test updates. refscale.2023.07.14b: Reference (reader) scalability test updates. torture.2023.08.14a: Other torture-test updates. torturescripts.2023.07.20a: Other torture-test scripting updates.	2023-08-16 14:31:08 -07:00
Paul E. McKenney	3292ba0229	rcu: Make the rcu_nocb_poll boot parameter usable via boot config The rcu_nocb_poll kernel boot parameter is defined via early_param(), whose parsing functions are invoked from parse_early_param() which is in turn invoked by setup_arch(), which is very early indeed. It is invoked so early that the console output timestamps read 0.000000, in other words, before time begins. This use of early_param() means that the rcu_nocb_poll kernel boot parameter cannot usefully be embedded into the kernel image. Yes, you can embed it, but setup_boot_config() is invoked from start_kernel() too late for it to be parsed. But it makes no sense to parse this parameter so early. After all, it cannot do anything until the rcuog kthreads are created, which is long after rcu_init() time, let alone setup_boot_config() time. This commit therefore switches the rcu_nocb_poll kernel boot parameter from early_param() to __setup(), which allows boot-config parsing of this parameter, in turn allowing it to be embedded into the kernel image. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>	2023-08-16 14:27:41 -07:00
Paul E. McKenney	343640cb5b	rcu: Mark __rcu_irq_enter_check_tick() ->rcu_urgent_qs load The rcu_request_urgent_qs_task() function does a cross-CPU store to ->rcu_urgent_qs, so this commit therefore marks the load in __rcu_irq_enter_check_tick() with READ_ONCE(). Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>	2023-08-16 14:27:41 -07:00
Sven Schnelle	c4d6b54381	tracing/synthetic: Allocate one additional element for size While debugging another issue I noticed that the stack trace contains one invalid entry at the end: <idle>-0 [008] d..4. 26.484201: wake_lat: pid=0 delta=2629976084 000000009cc24024 stack=STACK: => __schedule+0xac6/0x1a98 => schedule+0x126/0x2c0 => schedule_timeout+0x150/0x2c0 => kcompactd+0x9ca/0xc20 => kthread+0x2f6/0x3d8 => __ret_from_fork+0x8a/0xe8 => 0x6b6b6b6b6b6b6b6b This is because the code failed to add the one element containing the number of entries to field_size. Link: https://lkml.kernel.org/r/20230816154928.4171614-4-svens@linux.ibm.com Cc: Masami Hiramatsu <mhiramat@kernel.org> Fixes: `00cf3d672a` ("tracing: Allow synthetic events to pass around stacktraces") Signed-off-by: Sven Schnelle <svens@linux.ibm.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2023-08-16 16:37:07 -04:00
Sven Schnelle	887f92e09e	tracing/synthetic: Skip first entry for stack traces While debugging another issue I noticed that the stack trace output contains the number of entries on top: <idle>-0 [000] d..4. 203.322502: wake_lat: pid=0 delta=2268270616 stack=STACK: => 0x10 => __schedule+0xac6/0x1a98 => schedule+0x126/0x2c0 => schedule_timeout+0x242/0x2c0 => __wait_for_common+0x434/0x680 => __wait_rcu_gp+0x198/0x3e0 => synchronize_rcu+0x112/0x138 => ring_buffer_reset_online_cpus+0x140/0x2e0 => tracing_reset_online_cpus+0x15c/0x1d0 => tracing_set_clock+0x180/0x1d8 => hist_register_trigger+0x486/0x670 => event_hist_trigger_parse+0x494/0x1318 => trigger_process_regex+0x1d4/0x258 => event_trigger_write+0xb4/0x170 => vfs_write+0x210/0xad0 => ksys_write+0x122/0x208 Fix this by skipping the first element. Also replace the pointer logic with an index variable which is easier to read. Link: https://lkml.kernel.org/r/20230816154928.4171614-3-svens@linux.ibm.com Cc: Masami Hiramatsu <mhiramat@kernel.org> Fixes: `00cf3d672a` ("tracing: Allow synthetic events to pass around stacktraces") Signed-off-by: Sven Schnelle <svens@linux.ibm.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2023-08-16 16:34:25 -04:00
Sven Schnelle	ddeea494a1	tracing/synthetic: Use union instead of casts The current code uses a lot of casts to access the fields member in struct synth_trace_events with different sizes. This makes the code hard to read, and had already introduced an endianness bug. Use a union and struct instead. Link: https://lkml.kernel.org/r/20230816154928.4171614-2-svens@linux.ibm.com Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Fixes: `00cf3d672a` ("tracing: Allow synthetic events to pass around stacktraces") Signed-off-by: Sven Schnelle <svens@linux.ibm.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2023-08-16 16:33:27 -04:00
Zheng Yejian	b71645d6af	tracing: Fix cpu buffers unavailable due to 'record_disabled' missed Trace ring buffer can no longer record anything after executing following commands at the shell prompt: # cd /sys/kernel/tracing # cat tracing_cpumask fff # echo 0 > tracing_cpumask # echo 1 > snapshot # echo fff > tracing_cpumask # echo 1 > tracing_on # echo "hello world" > trace_marker -bash: echo: write error: Bad file descriptor The root cause is that: 1. After `echo 0 > tracing_cpumask`, 'record_disabled' of cpu buffers in 'tr->array_buffer.buffer' became 1 (see tracing_set_cpumask()); 2. After `echo 1 > snapshot`, 'tr->array_buffer.buffer' is swapped with 'tr->max_buffer.buffer', then the 'record_disabled' became 0 (see update_max_tr()); 3. After `echo fff > tracing_cpumask`, the 'record_disabled' become -1; Then array_buffer and max_buffer are both unavailable due to value of 'record_disabled' is not 0. To fix it, enable or disable both array_buffer and max_buffer at the same time in tracing_set_cpumask(). Link: https://lkml.kernel.org/r/20230805033816.3284594-2-zhengyejian1@huawei.com Cc: <mhiramat@kernel.org> Cc: <vnagarnaik@google.com> Cc: <shuah@kernel.org> Fixes: `71babb2705` ("tracing: change CPU ring buffer state from tracing_cpumask") Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2023-08-16 15:12:42 -04:00
Enlin Mu	3e00123a13	printk: export symbols for debug modules the module is out-of-tree, it saves kernel logs when panic Signed-off-by: Enlin Mu <enlin.mu@unisoc.com> Acked-by: Petr Mladek <pmladek@suse.com> Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org> Signed-off-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20230815020711.2604939-1-yunlong.xing@unisoc.com	2023-08-16 17:06:38 +02:00
Yafang Shao	0aa35162d2	bpf: Fix uninitialized symbol in bpf_perf_link_fill_kprobe() The commit `1b715e1b0e` ("bpf: Support ->fill_link_info for perf_event") leads to the following Smatch static checker warning: kernel/bpf/syscall.c:3416 bpf_perf_link_fill_kprobe() error: uninitialized symbol 'type'. That can happens when uname is NULL. So fix it by verifying the uname when we really need to fill it. Fixes: `1b715e1b0e` ("bpf: Support ->fill_link_info for perf_event") Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Yonghong Song <yonghong.song@linux.dev> Acked-by: Jiri Olsa <jolsa@kernel.org> Closes: https://lore.kernel.org/bpf/85697a7e-f897-4f74-8b43-82721bebc462@kili.mountain Link: https://lore.kernel.org/bpf/20230813141900.1268-2-laoar.shao@gmail.com	2023-08-16 16:44:23 +02:00
Benjamin Gray	53834a0c09	perf/hw_breakpoint: Remove arch breakpoint hooks PowerPC was the only user of these hooks, and has been refactored to no longer require them. There is no need to keep them around, so remove them to reduce complexity. Signed-off-by: Benjamin Gray <bgray@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20230801011744.153973-8-bgray@linux.ibm.com	2023-08-16 23:54:50 +10:00
Joel Granados	9edbfe92a0	sysctl: Add size to register_sysctl This commit adds table_size to register_sysctl in preparation for the removal of the sentinel elements in the ctl_table arrays (last empty markers). And though we do not remove any sentinels in this commit, we set things up by either passing the table_size explicitly or using ARRAY_SIZE on the ctl_table arrays. We replace the register_syctl function with a macro that will add the ARRAY_SIZE to the new register_sysctl_sz function. In this way the callers that are already using an array of ctl_table structs do not change. For the callers that pass a ctl_table array pointer, we pass the table_size to register_sysctl_sz instead of the macro. Signed-off-by: Joel Granados <j.granados@samsung.com> Suggested-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>	2023-08-15 15:26:17 -07:00
Joel Granados	bff97cf11b	sysctl: Add a size arg to __register_sysctl_table We make these changes in order to prepare __register_sysctl_table and its callers for when we remove the sentinel element (empty element at the end of ctl_table arrays). We don't actually remove any sentinels in this commit, but we do make sure to use ARRAY_SIZE so the table_size is available when the removal occurs. We add a table_size argument to __register_sysctl_table and adjust callers, all of which pass ctl_table pointers and need an explicit call to ARRAY_SIZE. We implement a size calculation in register_net_sysctl in order to forward the size of the array pointer received from the network register calls. The new table_size argument does not yet have any effect in the init_header call which is still dependent on the sentinel's presence. table_size does however drive the `kzalloc` allocation in __register_sysctl_table with no adverse effects as the allocated memory is either one element greater than the calculated ctl_table array (for the calls in ipc_sysctl.c, mq_sysctl.c and ucount.c) or the exact size of the calculated ctl_table array (for the call from sysctl_net.c and register_sysctl). This approach will allows us to "just" remove the sentinel without further changes to __register_sysctl_table as table_size will represent the exact size for all the callers at that point. Signed-off-by: Joel Granados <j.granados@samsung.com> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>	2023-08-15 15:26:17 -07:00
Atul Kumar Pant	b1a0f64cc6	audit: move trailing statements to next line Fixes following checkpatch.pl issue: ERROR: trailing statements should be on next line Signed-off-by: Atul Kumar Pant <atulpant.linux@gmail.com> [PM: subject line tweak] Signed-off-by: Paul Moore <paul@paul-moore.com>	2023-08-15 18:16:14 -04:00
Atul Kumar Pant	22cde1012f	audit: cleanup function braces and assignment-in-if-condition The patch fixes following checkpatch.pl issue: ERROR: open brace '{' following function definitions go on the next line ERROR: do not use assignment in if condition Signed-off-by: Atul Kumar Pant <atulpant.linux@gmail.com> [PM: subject line tweaks] Signed-off-by: Paul Moore <paul@paul-moore.com>	2023-08-15 18:10:56 -04:00
Atul Kumar Pant	62acadda11	audit: add space before parenthesis and around '=', "==", and '<' Fixes following checkpatch.pl issue: ERROR: space required before the open parenthesis '(' ERROR: spaces required around that '=' ERROR: spaces required around that '<' ERROR: spaces required around that '==' Signed-off-by: Atul Kumar Pant <atulpant.linux@gmail.com> [PM: subject line tweaks] Signed-off-by: Paul Moore <paul@paul-moore.com>	2023-08-15 18:09:20 -04:00
David Vernet	8ba651ed7f	bpf: Support default .validate() and .update() behavior for struct_ops links Currently, if a struct_ops map is loaded with BPF_F_LINK, it must also define the .validate() and .update() callbacks in its corresponding struct bpf_struct_ops in the kernel. Enabling struct_ops link is useful in its own right to ensure that the map is unloaded if an application crashes. For example, with sched_ext, we want to automatically unload the host-wide scheduler if the application crashes. We would likely never support updating elements of a sched_ext struct_ops map, so we'd have to implement these callbacks showing that they _can't_ support element updates just to benefit from the basic lifetime management of struct_ops links. Let's enable struct_ops maps to work with BPF_F_LINK even if they haven't defined these callbacks, by assuming that a struct_ops map element cannot be updated by default. Acked-by: Kui-Feng Lee <thinker.li@gmail.com> Signed-off-by: David Vernet <void@manifault.com> Link: https://lore.kernel.org/r/20230814185908.700553-2-void@manifault.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>	2023-08-14 22:23:39 -07:00
Lu Jialin	82b90b6c5b	cgroup:namespace: Remove unused cgroup_namespaces_init() cgroup_namspace_init() just return 0. Therefore, there is no need to call it during start_kernel. Just remove it. Fixes: `a79a908fd2` ("cgroup: introduce cgroup namespaces") Signed-off-by: Lu Jialin <lujialin4@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2023-08-14 14:29:47 -10:00
Aaron Tomlin	b6a46f7263	workqueue: Rename rescuer kworker Each CPU-specific and unbound kworker kthread conforms to a particular naming scheme. However, this does not extend to the rescuer kworker. At present, a rescuer kworker is simply named according to its workqueue's name. This can be cryptic. This patch modifies a rescuer to follow the kworker naming scheme. The "R" is indicative of a rescuer and after "-" is its workqueue's name e.g. "kworker/R-ext4-rsv-conver". tj: Use "R" instead of "r" as the prefix to make it more distinctive and consistent with how highpri pools are marked. Signed-off-by: Aaron Tomlin <atomlin@atomlin.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2023-08-14 14:20:26 -10:00
Paul E. McKenney	bc19e86e28	rcutorture: Stop right-shifting torture_random() return values Now that torture_random() uses swahw32(), its callers no longer see not-so-random low-order bits, as these are now swapped up into the upper 16 bits of the torture_random() function's return value. This commit therefore removes the right-shifting of torture_random() return values. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2023-08-14 15:01:08 -07:00
Paul E. McKenney	6cab60ceb1	torture: Stop right-shifting torture_random() return values Now that torture_random() uses swahw32(), its callers no longer see not-so-random low-order bits, as these are now swapped up into the upper 16 bits of the torture_random() function's return value. This commit therefore removes the right-shifting of torture_random() return values. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2023-08-14 15:01:08 -07:00
Paul E. McKenney	10af43671e	torture: Move stutter_wait() timeouts to hrtimers In order to gain better race coverage, move the test start/stop waits in stutter_wait() to torture_hrtimeout_jiffies(). Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2023-08-14 15:01:08 -07:00
Paul E. McKenney	dea81dcfd3	torture: Move torture_shuffle() timeouts to hrtimers In order to gain better race coverage, move the CPU-migration timed waits in torture_shuffle() to torture_hrtimeout_jiffies(). Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2023-08-14 15:01:08 -07:00
Paul E. McKenney	3f0c06e1cb	torture: Move torture_onoff() timeouts to hrtimers In order to gain better race coverage, move the CPU-hotplug-related timed waits in torture_onoff() to torture_hrtimeout_jiffies(). Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2023-08-14 15:01:08 -07:00
Paul E. McKenney	872948c665	torture: Make torture_hrtimeout_() use TASK_IDLE Given that it is expected that more code will use torture_hrtimeout_(), including for longer timeouts, make it use TASK_IDLE instead of TASK_UNINTERRUPTIBLE. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2023-08-14 15:01:07 -07:00
Dietmar Eggemann	5d248bb39f	torture: Add lock_torture writer_fifo module parameter This commit adds a module parameter that causes the locktorture writer to run at real-time priority. To use it: insmod /lib/modules/torture.ko random_shuffle=1 insmod /lib/modules/locktorture.ko torture_type=mutex_lock rt_boost=1 rt_boost_factor=50 nested_locks=3 writer_fifo=1 ^^^^^^^^^^^^^ A predecessor to this patch has been helpful to uncover issues with the proxy-execution series. [ paulmck: Remove locktorture-specific code from kernel/torture.c. ] Cc: "Paul E. McKenney" <paulmck@kernel.org> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Joel Fernandes <joel@joelfernandes.org> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Valentin Schneider <vschneid@redhat.com> Cc: kernel-team@android.com Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com> [jstultz: Include header change to build, reword commit message] Signed-off-by: John Stultz <jstultz@google.com> Acked-by: Davidlohr Bueso <dave@stgolabs.net> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2023-08-14 15:01:07 -07:00
Paul E. McKenney	67d5404d27	torture: Add a kthread-creation callback to _torture_create_kthread() This commit adds a kthread-creation callback to the _torture_create_kthread() function, which allows callers of a new torture_create_kthread_cb() macro to specify a function to be invoked after the kthread is created but before it is awakened for the first time. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Valentin Schneider <vschneid@redhat.com> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: kernel-team@android.com Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org> Acked-by: John Stultz <jstultz@google.com>	2023-08-14 15:00:37 -07:00
Paul E. McKenney	9d0cce2bc3	rcu-tasks: Fix boot-time RCU tasks debug-only deadlock In kernels built with CONFIG_PROVE_RCU=y (for example, lockdep kernels), the following sequence of events can occur: o rcu_init_tasks_generic() is invoked just before init is spawned. It invokes rcu_spawn_tasks_kthread() and friends. o rcu_spawn_tasks_kthread() invokes rcu_spawn_tasks_kthread_generic(), which uses kthread_run() to create the needed kthread. o Control returns to rcu_init_tasks_generic(), which, because this is a CONFIG_PROVE_RCU=y kernel, invokes the version of the rcu_tasks_initiate_self_tests() function that actually does something, including invoking synchronize_rcu_tasks(), which in turn invokes synchronize_rcu_tasks_generic(). o synchronize_rcu_tasks_generic() sees that the ->kthread_ptr is still NULL, because the newly spawned kthread has not yet started. o The new kthread starts, preempting synchronize_rcu_tasks_generic() just after its check. This kthread invokes rcu_tasks_one_gp(), which acquires ->tasks_gp_mutex, and, seeing no work, blocks in rcuwait_wait_event(). Note that this step requires either a preemptible kernel or a fault-injection-style sleep at the beginning of mutex_lock(). o synchronize_rcu_tasks_generic() resumes and invokes rcu_tasks_one_gp(). o rcu_tasks_one_gp() attempts to acquire ->tasks_gp_mutex, which is still held by the newly spawned kthread's rcu_tasks_one_gp() function. Deadlock. Because the only reason for ->tasks_gp_mutex is to handle pre-kthread synchronous grace periods, this commit avoids this deadlock by having rcu_tasks_one_gp() momentarily release ->tasks_gp_mutex while invoking rcuwait_wait_event(). This allows the call to rcu_tasks_one_gp() from synchronize_rcu_tasks_generic() proceed. Note that it is not necessary to release the mutex anywhere else in rcu_tasks_one_gp() because rcuwait_wait_event() is the only function that can block indefinitely. Reported-by: Guenter Roeck <linux@roeck-us.net> Reported-by: Roy Hopkins <rhopkins@suse.de> Reported-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Tested-by: Roy Hopkins <rhopkins@suse.de>	2023-08-14 14:58:25 -07:00
Peter Zijlstra	7170509cad	sched: Simplify sched_core_cpu_{starting,deactivate}() Use guards to reduce gotos and simplify control flow. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Link: https://lore.kernel.org/r/20230801211812.371787909@infradead.org	2023-08-14 17:01:27 +02:00
Peter Zijlstra	b4e1fa1e14	sched: Simplify try_steal_cookie() Use guards to reduce gotos and simplify control flow. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Link: https://lore.kernel.org/r/20230801211812.304154828@infradead.org	2023-08-14 17:01:27 +02:00
Peter Zijlstra	6dafc713e3	sched: Simplify sched_tick_remote() Use guards to reduce gotos and simplify control flow. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Link: https://lore.kernel.org/r/20230801211812.236247952@infradead.org	2023-08-14 17:01:26 +02:00
Peter Zijlstra	4bdada79f3	sched: Simplify sched_exec() Use guards to reduce gotos and simplify control flow. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Link: https://lore.kernel.org/r/20230801211812.168490417@infradead.org	2023-08-14 17:01:26 +02:00
Peter Zijlstra	857d315f12	sched: Simplify ttwu() Use guards to reduce gotos and simplify control flow. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Link: https://lore.kernel.org/r/20230801211812.101069260@infradead.org	2023-08-14 17:01:25 +02:00
Peter Zijlstra	4eb054f92b	sched: Simplify wake_up_if_idle() Use guards to reduce gotos and simplify control flow. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Link: https://lore.kernel.org/r/20230801211812.032678917@infradead.org	2023-08-14 17:01:25 +02:00
Peter Zijlstra	5bb76f1ddf	sched: Simplify: migrate_swap_stop() Use guards to reduce gotos and simplify control flow. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Link: https://lore.kernel.org/r/20230801211811.964370836@infradead.org	2023-08-14 17:01:25 +02:00
Peter Zijlstra	0f92cdf36f	sched: Simplify sysctl_sched_uclamp_handler() Use guards to reduce gotos and simplify control flow. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Link: https://lore.kernel.org/r/20230801211811.896559109@infradead.org	2023-08-14 17:01:24 +02:00
Peter Zijlstra	7537b90c00	sched: Simplify get_nohz_timer_target() Use guards to reduce gotos and simplify control flow. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Link: https://lore.kernel.org/r/20230801211811.828443100@infradead.org	2023-08-14 17:01:24 +02:00
Cyril Hrubis	c1fc6484e1	sched/rt: sysctl_sched_rr_timeslice show default timeslice after reset The sched_rr_timeslice can be reset to default by writing value that is <= 0. However after reading from this file we always got the last value written, which is not useful at all. $ echo -1 > /proc/sys/kernel/sched_rr_timeslice_ms $ cat /proc/sys/kernel/sched_rr_timeslice_ms -1 Fix this by setting the variable that holds the sysctl file value to the jiffies_to_msecs(RR_TIMESLICE) in case that <= 0 value was written. Signed-off-by: Cyril Hrubis <chrubis@suse.cz> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Petr Vorel <pvorel@suse.cz> Acked-by: Mel Gorman <mgorman@suse.de> Tested-by: Petr Vorel <pvorel@suse.cz> Link: https://lore.kernel.org/r/20230802151906.25258-3-chrubis@suse.cz	2023-08-14 17:01:23 +02:00
Cyril Hrubis	c7fcb99877	sched/rt: Fix sysctl_sched_rr_timeslice intial value There is a 10% rounding error in the intial value of the sysctl_sched_rr_timeslice with CONFIG_HZ_300=y. This was found with LTP test sched_rr_get_interval01: sched_rr_get_interval01.c:57: TPASS: sched_rr_get_interval() passed sched_rr_get_interval01.c:64: TPASS: Time quantum 0s 99999990ns sched_rr_get_interval01.c:72: TFAIL: /proc/sys/kernel/sched_rr_timeslice_ms != 100 got 90 sched_rr_get_interval01.c:57: TPASS: sched_rr_get_interval() passed sched_rr_get_interval01.c:64: TPASS: Time quantum 0s 99999990ns sched_rr_get_interval01.c:72: TFAIL: /proc/sys/kernel/sched_rr_timeslice_ms != 100 got 90 What this test does is to compare the return value from the sched_rr_get_interval() and the sched_rr_timeslice_ms sysctl file and fails if they do not match. The problem it found is the intial sysctl file value which was computed as: static int sysctl_sched_rr_timeslice = (MSEC_PER_SEC / HZ) * RR_TIMESLICE; which works fine as long as MSEC_PER_SEC is multiple of HZ, however it introduces 10% rounding error for CONFIG_HZ_300: (MSEC_PER_SEC / HZ) * (100 * HZ / 1000) (1000 / 300) * (100 * 300 / 1000) 3 * 30 = 90 This can be easily fixed by reversing the order of the multiplication and division. After this fix we get: (MSEC_PER_SEC * (100 * HZ / 1000)) / HZ (1000 * (100 * 300 / 1000)) / 300 (1000 * 30) / 300 = 100 Fixes: `975e155ed8` ("sched/rt: Show the 'sched_rr_timeslice' SCHED_RR timeslice tuning knob in milliseconds") Signed-off-by: Cyril Hrubis <chrubis@suse.cz> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Petr Vorel <pvorel@suse.cz> Acked-by: Mel Gorman <mgorman@suse.de> Tested-by: Petr Vorel <pvorel@suse.cz> Link: https://lore.kernel.org/r/20230802151906.25258-2-chrubis@suse.cz	2023-08-14 17:01:23 +02:00
Kees Cook	53e9e33ede	printk: ringbuffer: Fix truncating buffer size min_t cast If an output buffer size exceeded U16_MAX, the min_t(u16, ...) cast in copy_data() was causing writes to truncate. This manifested as output bytes being skipped, seen as %NUL bytes in pstore dumps when the available record size was larger than 65536. Fix the cast to no longer truncate the calculation. Cc: Petr Mladek <pmladek@suse.com> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: John Ogness <john.ogness@linutronix.de> Reported-by: Vijay Balakrishna <vijayb@linux.microsoft.com> Link: https://lore.kernel.org/lkml/d8bb1ec7-a4c5-43a2-9de0-9643a70b899f@linux.microsoft.com/ Fixes: `b6cf8b3f33` ("printk: add lockless ringbuffer") Cc: stable@vger.kernel.org Signed-off-by: Kees Cook <keescook@chromium.org> Tested-by: Vijay Balakrishna <vijayb@linux.microsoft.com> Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com> # Steam Deck Reviewed-by: Tyler Hicks (Microsoft) <code@tyhicks.com> Tested-by: Tyler Hicks (Microsoft) <code@tyhicks.com> Reviewed-by: John Ogness <john.ogness@linutronix.de> Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org> Reviewed-by: Petr Mladek <pmladek@suse.com> Signed-off-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20230811054528.never.165-kees@kernel.org	2023-08-14 13:05:22 +02:00
Rafael J. Wysocki	8e1d6a9223	Merge back system-wide sleep material for v6.6.	2023-08-14 09:55:44 +02:00
Linus Torvalds	9578b04c32	Power management fixes for 6.5-rc6 - Make amd-pstate use device_attributes as expected by the CPU root kobject (Thomas Weißschuh). - Restore the previous behavior of resume_store() when hibernation is not available which is to return the full number of bytes that were to be written by user space (Vlastimil Babka). -----BEGIN PGP SIGNATURE----- iQJGBAABCAAwFiEE4fcc61cGeeHD/fCwgsRv/nhiVHEFAmTWgJ8SHHJqd0Byand5 c29ja2kubmV0AAoJEILEb/54YlRxGEgP/01+F+nmq0c5QebC3LWw4cyYuepeCJ86 jfIbJR+XHOoiTaQMORHKBEk8xlelL/R65tRhkB/Gq1uFzeIId+xYJJlsW4Lpj7bz rx/FXOAW8mAyPe/kNitBtcjh4tqEiPBiVzn1tKTA4OOLm0CzOE5v9KML93U2vsOa Y2I3Jp1N6HHC8oRzbYpQgvB6R2MXX/oRd5fCvrVyMidFFbgYz8sWssRe8eUTGFAj U/bufaKM7N/qlavikSul1f4T3KpRN+xpu7+I3W6M5/w0EQt663u3TffY1Mo+qllB uoIM7emwsR6J6WsJyWbHgZEh/fIPmPAhGtsUsam9dN4aoDXfac2Trqrf+xYYbAtS 7mafAyWa+NxQCy/90QxoTrqhj3U4/dIbne4l1ZqgZQ7vyzM/NA4Gi0VBDEpt1BZU q6uvhS4PXvkRm/PezQSQCSMaP66F0erMCHxKTXTN1wYNob0AKjV6l1bmG5LdPcIh Nsk+CDkAVGmbqfDrtek9FfJZWgH3/lPDg0oVVMi9WiE8CdhYfKoB+Eh/MFVGiiDg 69cogAHqTUeuB46NPNedeOacGc6F0+mnAwkgNkClCTCHZJ0QSDlh2yVR003ZhnUj sHx6jf6rYodW+nBQydjUzVm+twH47tltY0ibzN3ZIXiMM0UlALHBF+Oj4hOtGxUa jiiqkLyB/9kH =0RaA -----END PGP SIGNATURE----- Merge tag 'pm-6.5-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull power management fixes from Rafael Wysocki: "These fix an amd-pstate cpufreq driver issues and recently introduced hibernation-related breakage. Specifics: - Make amd-pstate use device_attributes as expected by the CPU root kobject (Thomas Weißschuh) - Restore the previous behavior of resume_store() when hibernation is not available which is to return the full number of bytes that were to be written by user space (Vlastimil Babka)" * tag 'pm-6.5-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: cpufreq: amd-pstate: fix global sysfs attribute type PM: hibernate: fix resume_store() return value when hibernation not available	2023-08-11 12:24:22 -07:00
Jakub Kicinski	6a1ed1430d	bpf-next pull-request 2023-08-09 -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQRdM/uy1Ege0+EN1fNar9k/UBDW4wUCZNRx8QAKCRBar9k/UBDW 46MBAQC3YDFsEfPzX4P7ZnlM5Lf1NynjNbso5bYW0TF/dp/Y+gD+M8wdM5Vj2Mb0 Zr56TnwCJei0kGBemiel4sStt3e4qwY= =+0u+ -----END PGP SIGNATURE----- Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Martin KaFai Lau says: ==================== pull-request: bpf-next 2023-08-09 We've added 19 non-merge commits during the last 6 day(s) which contain a total of 25 files changed, 369 insertions(+), 141 deletions(-). The main changes are: 1) Fix array-index-out-of-bounds access when detaching from an already empty mprog entry from Daniel Borkmann. 2) Adjust bpf selftest because of a recent llvm change related to the cpu-v4 ISA from Eduard Zingerman. 3) Add uprobe support for the bpf_get_func_ip helper from Jiri Olsa. 4) Fix a KASAN splat due to the kernel incorrectly accepted an invalid program using the recent cpu-v4 instruction from Yonghong Song. * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: bpf: btf: Remove two unused function declarations bpf: lru: Remove unused declaration bpf_lru_promote() selftests/bpf: relax expected log messages to allow emitting BPF_ST selftests/bpf: remove duplicated functions bpf, docs: Fix small typo and define semantics of sign extension selftests/bpf: Add bpf_get_func_ip test for uprobe inside function selftests/bpf: Add bpf_get_func_ip tests for uprobe on function entry bpf: Add support for bpf_get_func_ip helper for uprobe program selftests/bpf: Add a movsx selftest for sign-extension of R10 bpf: Fix an incorrect verification success with movsx insn bpf, docs: Formalize type notation and function semantics in ISA standard bpf: change bpf_alu_sign_string and bpf_movsx_string to static libbpf: Use local includes inside the library bpf: fix bpf_dynptr_slice() to stop return an ERR_PTR. bpf: fix inconsistent return types of bpf_xdp_copy_buf(). selftests/bpf: fix the incorrect verification of port numbers. selftests/bpf: Add test for detachment on empty mprog entry bpf: Fix mprog detachment for empty mprog entry bpf: bpf_struct_ops: Remove unnecessary initial values of variables ==================== Link: https://lore.kernel.org/r/20230810055123.109578-1-martin.lau@linux.dev Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-08-10 14:12:34 -07:00
Jakub Kicinski	4d016ae42e	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Cross-merge networking fixes after downstream PR. No conflicts. Adjacent changes: drivers/net/ethernet/intel/igc/igc_main.c `06b412589e` ("igc: Add lock to safeguard global Qbv variables") `d3750076d4` ("igc: Add TransmissionOverrun counter") drivers/net/ethernet/microsoft/mana/mana_en.c `a7dfeda6fd` ("net: mana: Fix MANA VF unload when hardware is unresponsive") `a9ca9f9cef` ("page_pool: split types and declarations from page_pool.h") `92272ec410` ("eth: add missing xdp.h includes in drivers") net/mptcp/protocol.h `511b90e392` ("mptcp: fix disconnect vs accept race") `b8dc6d6ce9` ("mptcp: fix rcv buffer auto-tuning") tools/testing/selftests/net/mptcp/mptcp_join.sh `c8c101ae39` ("selftests: mptcp: join: fix 'implicit EP' test") `03668c65d1` ("selftests: mptcp: join: rework detailed report") Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-08-10 14:10:53 -07:00
Ingo Molnar	b41bbb33cf	Merge branch 'sched/eevdf' into sched/core Pick up the EEVDF work into the main branch - it's looking good so far. Conflicts: kernel/sched/features.h Signed-off-by: Ingo Molnar <mingo@kernel.org>	2023-08-10 09:05:43 +02:00
Yue Haibing	526bc5ba19	bpf: lru: Remove unused declaration bpf_lru_promote() Commit `3a08c2fd76` ("bpf: LRU List") declared but never implemented this. Signed-off-by: Yue Haibing <yuehaibing@huawei.com> Link: https://lore.kernel.org/r/20230808145531.19692-1-yuehaibing@huawei.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>	2023-08-08 17:21:42 -07:00
Khadija Kamran	6672efbb68	lsm: constify the 'target' parameter in security_capget() Three LSMs register the implementations for the "capget" hook: AppArmor, SELinux, and the normal capability code. Looking at the function implementations we may observe that the first parameter "target" is not changing. Mark the first argument "target" of LSM hook security_capget() as "const" since it will not be changing in the LSM hook. cap_capget() LSM hook declaration exceeds the 80 characters per line limit. Split the function declaration to multiple lines to decrease the line length. Signed-off-by: Khadija Kamran <kamrankhadijadj@gmail.com> Acked-by: John Johansen <john.johansen@canonical.com> [PM: align the cap_capget() declaration, spelling fixes] Signed-off-by: Paul Moore <paul@paul-moore.com>	2023-08-08 16:48:47 -04:00
Gaosheng Cui	b59bc6e372	audit: fix possible soft lockup in __audit_inode_child() Tracefs or debugfs maybe cause hundreds to thousands of PATH records, too many PATH records maybe cause soft lockup. For example: 1. CONFIG_KASAN=y && CONFIG_PREEMPTION=n 2. auditctl -a exit,always -S open -k key 3. sysctl -w kernel.watchdog_thresh=5 4. mkdir /sys/kernel/debug/tracing/instances/test There may be a soft lockup as follows: watchdog: BUG: soft lockup - CPU#45 stuck for 7s! [mkdir:15498] Kernel panic - not syncing: softlockup: hung tasks Call trace: dump_backtrace+0x0/0x30c show_stack+0x20/0x30 dump_stack+0x11c/0x174 panic+0x27c/0x494 watchdog_timer_fn+0x2bc/0x390 __run_hrtimer+0x148/0x4fc __hrtimer_run_queues+0x154/0x210 hrtimer_interrupt+0x2c4/0x760 arch_timer_handler_phys+0x48/0x60 handle_percpu_devid_irq+0xe0/0x340 __handle_domain_irq+0xbc/0x130 gic_handle_irq+0x78/0x460 el1_irq+0xb8/0x140 __audit_inode_child+0x240/0x7bc tracefs_create_file+0x1b8/0x2a0 trace_create_file+0x18/0x50 event_create_dir+0x204/0x30c __trace_add_new_event+0xac/0x100 event_trace_add_tracer+0xa0/0x130 trace_array_create_dir+0x60/0x140 trace_array_create+0x1e0/0x370 instance_mkdir+0x90/0xd0 tracefs_syscall_mkdir+0x68/0xa0 vfs_mkdir+0x21c/0x34c do_mkdirat+0x1b4/0x1d4 __arm64_sys_mkdirat+0x4c/0x60 el0_svc_common.constprop.0+0xa8/0x240 do_el0_svc+0x8c/0xc0 el0_svc+0x20/0x30 el0_sync_handler+0xb0/0xb4 el0_sync+0x160/0x180 Therefore, we add cond_resched() to __audit_inode_child() to fix it. Fixes: `5195d8e217` ("audit: dynamically allocate audit_names when not enough space is in the names array") Signed-off-by: Gaosheng Cui <cuigaosheng1@huawei.com> Signed-off-by: Paul Moore <paul@paul-moore.com>	2023-08-08 14:45:20 -04:00
Petr Tesarik	d069ed288a	swiotlb: optimize get_max_slots() Use a simple logical shift and increment to calculate the number of slots taken by the DMA segment boundary. At least GCC-13 is not able to optimize the expression, producing this horrible assembly code on x86: cmpq $-1, %rcx je .L364 addq $2048, %rcx shrq $11, %rcx movq %rcx, %r13 .L331: // rest of the function here... // after function epilogue and return: .L364: movabsq $9007199254740992, %r13 jmp .L331 After the optimization, the code looks more reasonable: shrq $11, %r11 leaq 1(%r11), %rbx Signed-off-by: Petr Tesarik <petr.tesarik.ext@huawei.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2023-08-08 10:29:21 -07:00
Petr Tesarik	f94cb36e76	swiotlb: move slot allocation explanation comment where it belongs Move the comment down in front of the loop that actually sets the list member of struct io_tlb_slot to zero. Fixes: `26a7e09478` ("swiotlb: refactor swiotlb_tbl_map_single") Signed-off-by: Petr Tesarik <petr.tesarik.ext@huawei.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2023-08-08 10:29:06 -07:00
Tejun Heo	523a301e66	workqueue: Make default affinity_scope dynamically updatable While workqueue.default_affinity_scope is writable, it only affects workqueues which are created afterwards and isn't very useful. Instead, let's introduce explicit "default" scope and update the effective scope dynamically when workqueue.default_affinity_scope is changed. Signed-off-by: Tejun Heo <tj@kernel.org>	2023-08-07 15:57:25 -10:00
Tejun Heo	8639ecebc9	workqueue: Implement non-strict affinity scope for unbound workqueues An unbound workqueue can be served by multiple worker_pools to improve locality. The segmentation is achieved by grouping CPUs into pods. By default, the cache boundaries according to cpus_share_cache() define the CPUs are grouped. Let's a workqueue is allowed to run on all CPUs and the system has two L3 caches. The workqueue would be mapped to two worker_pools each serving one L3 cache domains. While this improves locality, because the pod boundaries are strict, it limits the total bandwidth a given issuer can consume. For example, let's say there is a thread pinned to a CPU issuing enough work items to saturate the whole machine. With the machine segmented into two pods, no matter how many work items it issues, it can only use half of the CPUs on the system. While this limitation has existed for a very long time, it wasn't very pronounced because the affinity grouping used to be always by NUMA nodes. With cache boundaries as the default and support for even finer grained scopes (smt and cpu), it is now an a lot more pressing problem. This patch implements non-strict affinity scope where the pod boundaries aren't enforced strictly. Going back to the previous example, the workqueue would still be mapped to two worker_pools; however, the affinity enforcement would be soft. The workers in both pools would have their cpus_allowed set to the whole machine thus allowing the scheduler to migrate them anywhere on the machine. However, whenever an idle worker is woken up, the workqueue code asks the scheduler to bring back the task within the pod if the worker is outside. ie. work items start executing within its affinity scope but can be migrated outside as the scheduler sees fit. This removes the hard cap on utilization while maintaining the benefits of affinity scopes. After the earlier ->__pod_cpumask changes, the implementation is pretty simple. When non-strict which is the new default: * pool_allowed_cpus() returns @pool->attrs->cpumask instead of ->__pod_cpumask so that the workers are allowed to run on any CPU that the associated workqueues allow. * If the idle worker task's ->wake_cpu is outside the pod, kick_pool() sets the field to a CPU within the pod. This would be the first use of task_struct->wake_cpu outside scheduler proper, so it isn't clear whether this would be acceptable. However, other methods of migrating tasks are significantly more expensive and are likely prohibitively so if we want to do this on every work item. This needs discussion with scheduler folks. There is also a race window where setting ->wake_cpu wouldn't be effective as the target task is still on CPU. However, the window is pretty small and this being a best-effort optimization, it doesn't seem to warrant more complexity at the moment. While the non-strict cache affinity scopes seem to be the best option, the performance picture interacts with the affinity scope and is a bit complicated to fully discuss in this patch, so the behavior is made easily selectable through wqattrs and sysfs and the next patch will add documentation to discuss performance implications. v2: pool->attrs->affn_strict is set to true for per-cpu worker_pools. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org>	2023-08-07 15:57:25 -10:00
Tejun Heo	9546b29e4a	workqueue: Add workqueue_attrs->__pod_cpumask workqueue_attrs has two uses: * to specify the required unouned workqueue properties by users * to match worker_pool's properties to workqueues by core code For example, if the user wants to restrict a workqueue to run only CPUs 0 and 2, and the two CPUs are on different affinity scopes, the workqueue's attrs->cpumask would contains CPUs 0 and 2, and the workqueue would be associated with two worker_pools, one with attrs->cpumask containing just CPU 0 and the other CPU 2. Workqueue wants to support non-strict affinity scopes where work items are started in their matching affinity scopes but the scheduler is free to migrate them outside the starting scopes, which can enable utilizing the whole machine while maintaining most of the locality benefits from affinity scopes. To enable that, worker_pools need to distinguish the strict affinity that it has to follow (because that's the restriction coming from the user) and the soft affinity that it wants to apply when dispatching work items. Note that two worker_pools with different soft dispatching requirements have to be separate; otherwise, for example, we'd be ping-ponging worker threads across NUMA boundaries constantly. This patch adds workqueue_attrs->__pod_cpumask. The new field is double underscored as it's only used internally to distinguish worker_pools. A worker_pool's ->cpumask is now always the same as the online subset of allowed CPUs of the associated workqueues, and ->__pod_cpumask is the pod's subset of that ->cpumask. Going back to the example above, both worker_pools would have ->cpumask containing both CPUs 0 and 2 but one's ->__pod_cpumask would contain 0 while the other's 2. * pool_allowed_cpus() is added. It returns the worker_pool's strict cpumask that the pool's workers must stay within. This is currently always ->__pod_cpumask as all boundaries are still strict. * As a workqueue_attrs can now track both the associated workqueues' cpumask and its per-pod subset, wq_calc_pod_cpumask() no longer needs an external out-argument. Drop @cpumask and instead store the result in ->__pod_cpumask. * The above also simplifies apply_wqattrs_prepare() as the same workqueue_attrs can be used to create all pods associated with a workqueue. tmp_attrs is dropped. * wq_update_pod() is updated to use wqattrs_equal() to test whether a pwq update is needed instead of only comparing ->cpumask so that ->__pod_cpumask is compared too. It can directly compare ->__pod_cpumaks but the code is easier to understand and more robust this way. The only user-visible behavior change is that two workqueues with different cpumasks no longer can share worker_pools even when their pod subsets coincide. Going back to the example, let's say there's another workqueue with cpumask 0, 2, 3, where 2 and 3 are in the same pod. It would be mapped to two worker_pools - one with CPU 0, the other with 2 and 3. The former has the same cpumask as the first pod of the earlier example and would have shared the same worker_pool but that's no longer the case after this patch. The worker_pools would have the same ->__pod_cpumask but their ->cpumask's wouldn't match. While this is necessary to support non-strict affinity scopes, there can be further optimizations to maintain sharing among strict affinity scopes. However, non-strict affinity scopes are going to be preferable for most use cases and we don't see very diverse mixture of unbound workqueue cpumasks anyway, so the additional overhead doesn't seem to justify the extra complexity. v2: - wq_update_pod() was incorrectly comparing target_attrs->__pod_cpumask to pool->attrs->cpumask instead of its ->__pod_cpumask. Fix it by using wqattrs_equal() for comparison instead. - Per-cpu worker pools weren't initializing ->__pod_cpumask which caused a subtle problem later on. Set it to cpumask_of(cpu) like ->cpumask. Signed-off-by: Tejun Heo <tj@kernel.org>	2023-08-07 15:57:25 -10:00
Tejun Heo	0219a3528d	workqueue: Factor out need_more_worker() check and worker wake-up Checking need_more_worker() and calling wake_up_worker() is a repeated pattern. Let's add kick_pool(), which checks need_more_worker() and open-code wake_up_worker(), and replace wake_up_worker() uses. The following conversions aren't one-to-one: * __queue_work() was using __need_more_work() because it knows that pool->worklist isn't empty. Switching to kick_pool() adds an extra list_empty() test. * create_worker() always needs to wake up the newly minted worker whether there's more work to do or not to avoid triggering hung task check on the new task. Keep the current wake_up_process() and still add kick_pool(). This may lead to an extra wakeup which isn't harmful. * pwq_adjust_max_active() was explicitly checking whether it needs to wake up a worker or not to avoid spurious wakeups. As kick_pool() only wakes up a worker when necessary, this explicit check is no longer necessary and dropped. * unbind_workers() now calls kick_pool() instead of wake_up_worker() adding a need_more_worker() test. This avoids spurious wakeups and shouldn't break anything. wake_up_worker() is dropped as kick_pool() replaces all its users. After this patch, all paths that wakes up a non-rescuer worker to initiate work item execution use kick_pool(). This will enable future changes to improve locality. Signed-off-by: Tejun Heo <tj@kernel.org>	2023-08-07 15:57:25 -10:00
Tejun Heo	873eaca6ea	workqueue: Factor out work to worker assignment and collision handling The two work execution paths in worker_thread() and rescuer_thread() use move_linked_works() to claim work items from @pool->worklist. Once claimed, process_schedule_works() is called which invokes process_one_work() on each work item. process_one_work() then uses find_worker_executing_work() to detect and handle collisions - situations where the work item to be executed is still running on another worker. This works fine, but, to improve work execution locality, we want to establish work to worker association earlier and know for sure that the worker is going to excute the work once asssigned, which requires performing collision handling earlier while trying to assign the work item to the worker. This patch introduces assign_work() which assigns a work item to a worker using move_linked_works() and then performs collision handling. As collision handling is handled earlier, process_one_work() no longer needs to worry about them. After the this patch, collision checks for linked work items are skipped, which should be fine as they can't be queued multiple times concurrently. For work items running from rescuers, the timing of collision handling may change but the invariant that the work items go through collision handling before starting execution does not. This patch shouldn't cause noticeable behavior changes, especially given that worker_thread() behavior remains the same. Signed-off-by: Tejun Heo <tj@kernel.org>	2023-08-07 15:57:25 -10:00
Tejun Heo	63c5484e74	workqueue: Add multiple affinity scopes and interface to select them Add three more affinity scopes - WQ_AFFN_CPU, SMT and CACHE - and make CACHE the default. The code changes to actually add the additional scopes are trivial. Also add module parameter "workqueue.default_affinity_scope" to override the default scope and "affinity_scope" sysfs file to configure it per workqueue. wq_dump.py and documentations are updated accordingly. This enables significant flexibility in configuring how unbound workqueues behave. If affinity scope is set to "cpu", it'll behave close to a per-cpu workqueue. On the other hand, "system" removes all locality boundaries. Many modern machines have multiple L3 caches often while being mostly uniform in terms of memory access. Thus, workqueue's previous behavior of spreading work items in each NUMA node had negative performance implications from unncessarily crossing L3 boundaries between issue and execution. However, picking a finer grained affinity scope also has a downside in that an issuer in one group can't utilize CPUs in other groups. While dependent on the specifics of workload, there's usually a noticeable penalty in crossing L3 boundaries, so let's default to CACHE. This issue will be further addressed and documented with examples in future patches. Signed-off-by: Tejun Heo <tj@kernel.org>	2023-08-07 15:57:24 -10:00
Tejun Heo	025e168458	workqueue: Modularize wq_pod_type initialization While wq_pod_type[] can now group CPUs in any aribitrary way, WQ_AFFN_NUM init is hard coded into workqueue_init_topology(). This patch modularizes the init path by introducing init_pod_type() which takes a callback to determine whether two CPUs should share a pod as an argument. init_pod_type() first scans the CPU combinations testing for sharing to assign consecutive pod IDs and initialize pod_type->cpu_pod[]. Once ->cpu_pod[] is determined, ->pod_cpus[] and ->pod_node[] are initialized accordingly. WQ_AFFN_NUMA is now initialized by calling init_pod_type() with cpus_share_numa() which tests whether the CPU belongs to the same NUMA node. This patch may change the pod ID assigned to each NUMA node but that shouldn't cause any behavior changes as the NUMA node to use for allocations are tracked separately in pod_type->pod_node[]. This makes adding new affinty types pretty easy. Signed-off-by: Tejun Heo <tj@kernel.org>	2023-08-07 15:57:24 -10:00
Tejun Heo	84193c0710	workqueue: Generalize unbound CPU pods While renamed to pod, the code still assumes that the pods are defined by NUMA boundaries. Let's generalize it: * workqueue_attrs->affn_scope is added. Each enum represents the type of boundaries that define the pods. There are currently two scopes - WQ_AFFN_NUMA and WQ_AFFN_SYSTEM. The former is the same behavior as before - one pod per NUMA node. The latter defines one global pod across the whole system. * struct wq_pod_type is added which describes how pods are configured for each affnity scope. For each pod, it lists the member CPUs and the preferred NUMA node for memory allocations. The reverse mapping from CPU to pod is also available. * wq_pod_enabled is dropped. Pod is now always enabled. The previously disabled behavior is now implemented through WQ_AFFN_SYSTEM. * get_unbound_pool() wants to determine the NUMA node to allocate memory from for the new pool. The variables are renamed from node to pod but the logic still assumes they're one and the same. Clearly distinguish them - walk the WQ_AFFN_NUMA pods to find the matching pod and then use the pod's NUMA node. * wq_calc_pod_cpumask() was taking @pod but assumed that it was the NUMA node. Take @cpu instead and determine the cpumask to use from the pod_type matching @attrs. * apply_wqattrs_prepare() is update to return ERR_PTR() on error instead of NULL so that it can indicate -EINVAL on invalid affinity scopes. This patch allows CPUs to be grouped into pods however desired per type. While this patch causes some internal behavior changes, nothing material should change for workqueue users. v2: Trigger WARN_ON_ONCE() in wqattrs_pod_type() if affn_scope is WQ_AFFN_NR_TYPES which indicates that the function is called with a worker_pool's attrs instead of a workqueue's. Signed-off-by: Tejun Heo <tj@kernel.org>	2023-08-07 15:57:24 -10:00
Tejun Heo	5de7a03cac	workqueue: Factor out clearing of workqueue-only attrs fields workqueue_attrs can be used for both workqueues and worker_pools. However, some fields, currently only ->ordered, only apply to workqueues and should be cleared to the default / invalid values. Currently, an unbound workqueue explicitly clears attrs->ordered in get_unbound_pool() after copying the source workqueue attrs, while per-cpu workqueues rely on the fact that zeroing on allocation gives us the desired default value for pool->attrs->ordered. This is fragile. Let's add wqattrs_clear_for_pool() which clears attrs->ordered and is called from both init_worker_pool() and get_unbound_pool(). This will ease adding more workqueue-only attrs fields. In get_unbound_pool(), pool->node initialization is moved upwards for readability. This shouldn't cause any behavior changes. Signed-off-by: Tejun Heo <tj@kernel.org>	2023-08-07 15:57:24 -10:00
Tejun Heo	0f36ee24cd	workqueue: Factor out actual cpumask calculation to reduce subtlety in wq_update_pod() For an unbound pool, multiple cpumasks are involved. U: The user-specified cpumask (may be filtered with cpu_possible_mask). A: The actual cpumask filtered by wq_unbound_cpumask. If the filtering leaves no CPU, wq_unbound_cpumask is used. P: Per-pod subsets of #A. wq->attrs stores #U, wq->dfl_pwq->pool->attrs->cpumask #A, and wq->cpu_pwq[CPU]->pool->attrs->cpumask #P. wq_update_pod() is called to update per-pod pwq's during CPU hotplug. To calculate the new #P for each workqueue, it needs to call wq_calc_pod_cpumask() with @attrs that contains #A. Currently, wq_update_pod() achieves this by calling wq_calc_pod_cpumask() with wq->dfl_pwq->pool->attrs. This is rather fragile because we're calling wq_calc_pod_cpumask() with @attrs of a worker_pool rather than the workqueue's actual attrs when what we want to calculate is the workqueue's cpumask on the pod. While this works fine currently, future changes will add fields which are used differently between workqueues and worker_pools and this subtlety will bite us. This patch factors out #U -> #A calculation from apply_wqattrs_prepare() into wqattrs_actualize_cpumask and updates wq_update_pod() to copy wq->unbound_attrs and use the new helper to obtain #A freshly instead of abusing wq->dfl_pwq->pool_attrs. This shouldn't cause any behavior changes in the current code. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: K Prateek Nayak <kprateek.nayak@amd.com> Reference: http://lkml.kernel.org/r/30625cdd-4d61-594b-8db9-6816b017dde3@amd.com	2023-08-07 15:57:24 -10:00
Tejun Heo	2930155b2e	workqueue: Initialize unbound CPU pods later in the boot During boot, to initialize unbound CPU pods, wq_pod_init() was called from workqueue_init(). This is early enough for NUMA nodes to be set up but before SMP is brought up and CPU topology information is populated. Workqueue is in the process of improving CPU locality for unbound workqueues and will need access to topology information during pod init. This adds a new init function workqueue_init_topology() which is called after CPU topology information is available and replaces wq_pod_init(). As unbound CPU pods are now initialized after workqueues are activated, we need to revisit the workqueues to apply the pod configuration. Workqueues which are created before workqueue_init_topology() are set up so that they always use the default worker pool. After pods are set up in workqueue_init_topology(), wq_update_pod() is called on all existing workqueues to update the pool associations accordingly. Note that wq_update_pod_attrs_buf allocation is moved to workqueue_init_early(). This isn't necessary right now but enables further generalization of pod handling in the future. This patch changes the initialization sequence but the end result should be the same. Signed-off-by: Tejun Heo <tj@kernel.org>	2023-08-07 15:57:24 -10:00
Tejun Heo	a86feae619	workqueue: Move wq_pod_init() below workqueue_init() wq_pod_init() is called from workqueue_init() and responsible for initializing unbound CPU pods according to NUMA node. Workqueue is in the process of improving affinity awareness and wants to use other topology information to initialize unbound CPU pods; however, unlike NUMA nodes, other topology information isn't yet available in workqueue_init(). The next patch will introduce a later stage init function for workqueue which will be responsible for initializing unbound CPU pods. Relocate wq_pod_init() below workqueue_init() where the new init function is going to be located so that the diff can show the content differences. Just a relocation. No functional changes. Signed-off-by: Tejun Heo <tj@kernel.org>	2023-08-07 15:57:24 -10:00
Tejun Heo	fef59c9cab	workqueue: Rename NUMA related names to use pod instead Workqueue is in the process of improving CPU affinity awareness. It will become more flexible and won't be tied to NUMA node boundaries. This patch renames all NUMA related names in workqueue.c to use "pod" instead. While "pod" isn't a very common term, it short and captures the grouping of CPUs well enough. These names are only going to be used within workqueue implementation proper, so the specific naming doesn't matter that much. * wq_numa_possible_cpumask -> wq_pod_cpus * wq_numa_enabled -> wq_pod_enabled * wq_update_unbound_numa_attrs_buf -> wq_update_pod_attrs_buf * workqueue_select_cpu_near -> select_numa_node_cpu This rename is different from others. The function is only used by queue_work_node() and specifically tries to find a CPU in the specified NUMA node. As workqueue affinity will become more flexible and untied from NUMA, this function's name should specifically describe that it's for NUMA. * wq_calc_node_cpumask -> wq_calc_pod_cpumask * wq_update_unbound_numa -> wq_update_pod * wq_numa_init -> wq_pod_init * node -> pod in local variables Only renames. No functional changes. Signed-off-by: Tejun Heo <tj@kernel.org>	2023-08-07 15:57:23 -10:00
Tejun Heo	af73f5c9fe	workqueue: Rename workqueue_attrs->no_numa to ->ordered With the recent removal of NUMA related module param and sysfs knob, workqueue_attrs->no_numa is now only used to implement ordered workqueues. Let's rename the field so that it's less confusing especially with the planned CPU affinity awareness improvements. Just a rename. No functional changes. Signed-off-by: Tejun Heo <tj@kernel.org>	2023-08-07 15:57:23 -10:00
Tejun Heo	636b927eba	workqueue: Make unbound workqueues to use per-cpu pool_workqueues A pwq (pool_workqueue) represents an association between a workqueue and a worker_pool. When a work item is queued, the workqueue selects the pwq to use, which in turn determines the pool, and queues the work item to the pool through the pwq. pwq is also what implements the maximum concurrency limit - @max_active. As a per-cpu workqueue should be assocaited with a different worker_pool on each CPU, it always had per-cpu pwq's that are accessed through wq->cpu_pwq. However, unbound workqueues were sharing a pwq within each NUMA node by default. The sharing has several downsides: * Because @max_active is per-pwq, the meaning of @max_active changes depending on the machine configuration and whether workqueue NUMA locality support is enabled. * Makes per-cpu and unbound code deviate. * Gets in the way of making workqueue CPU locality awareness more flexible. This patch makes unbound workqueues use per-cpu pwq's the same way per-cpu workqueues do by making the following changes: * wq->numa_pwq_tbl[] is removed and unbound workqueues now use wq->cpu_pwq just like per-cpu workqueues. wq->cpu_pwq is now RCU protected for unbound workqueues. * numa_pwq_tbl_install() is renamed to install_unbound_pwq() and installs the specified pwq to the target CPU's wq->cpu_pwq. * apply_wqattrs_prepare() now always allocates a separate pwq for each CPU unless the workqueue is ordered. If ordered, all CPUs use wq->dfl_pwq. This makes the return value of wq_calc_node_cpumask() unnecessary. It now returns void. * @max_active now means the same thing for both per-cpu and unbound workqueues. WQ_UNBOUND_MAX_ACTIVE now equals WQ_MAX_ACTIVE and documentation is updated accordingly. WQ_UNBOUND_MAX_ACTIVE is no longer used in workqueue implementation and will be removed later. * All unbound pwq operations which used to be per-numa-node are now per-cpu. For most unbound workqueue users, this shouldn't cause noticeable changes. Work item issue and completion will be a small bit faster, flush_workqueue() would become a bit more expensive, and the total concurrency limit would likely become higher. All @max_active==1 use cases are currently being audited for conversion into alloc_ordered_workqueue() and they shouldn't be affected once the audit and conversion is complete. One area where the behavior change may be more noticeable is workqueue_congested() as the reported congestion state is now per CPU instead of NUMA node. There are only two users of this interface - drivers/infiniband/hw/hfi1 and net/smc. Maintainers of both subsystems are cc'd. Inputs on the behavior change would be very much appreciated. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Leon Romanovsky <leon@kernel.org> Cc: Karsten Graul <kgraul@linux.ibm.com> Cc: Wenjia Zhang <wenjia@linux.ibm.com> Cc: Jan Karcher <jaka@linux.ibm.com>	2023-08-07 15:57:23 -10:00
Tejun Heo	4cbfd3de73	workqueue: Call wq_update_unbound_numa() on all CPUs in NUMA node on CPU hotplug When a CPU went online or offline, wq_update_unbound_numa() was called only on the CPU which was going up or down. This works fine because all CPUs on the same NUMA node share the same pool_workqueue slot - one CPU updating it updates it for everyone in the node. However, future changes will make each CPU use a separate pool_workqueue even when they're sharing the same worker_pool, which requires updating pool_workqueue's for all CPUs which may be sharing the same pool_workqueue on hotplug. To accommodate the planned changes, this patch updates workqueue_on/offline_cpu() so that they call wq_update_unbound_numa() for all CPUs sharing the same NUMA node as the CPU going up or down. In the current code, the second+ calls would be noops and there shouldn't be any behavior changes. * As wq_update_unbound_numa() is now called on multiple CPUs per each hotplug event, @cpu is renamed to @hotplug_cpu and another @cpu argument is added. The former indicates the CPU being hot[un]plugged and the latter the CPU whose pool_workqueue is being updated. * In wq_update_unbound_numa(), cpu_off is renamed to off_cpu for consistency with the new @hotplug_cpu. Signed-off-by: Tejun Heo <tj@kernel.org>	2023-08-07 15:57:23 -10:00
Tejun Heo	687a9aa56f	workqueue: Make per-cpu pool_workqueues allocated and released like unbound ones Currently, all per-cpu pwq's (pool_workqueue's) are allocated directly through a per-cpu allocation and thus, unlike unbound workqueues, not reference counted. This difference in lifetime management between the two types is a bit confusing. Unbound workqueues are currently accessed through wq->numa_pwq_tbl[] which isn't suitiable for the planned CPU locality related improvements. The plan is to unify pwq handling across per-cpu and unbound workqueues so that they're always accessed through wq->cpu_pwq. In preparation, this patch makes per-cpu pwq's to be allocated, reference counted and released the same way as unbound pwq's. wq->cpu_pwq now holds pointers to pwq's instead of containing them directly. pwq_unbound_release_workfn() is renamed to pwq_release_workfn() as it's now also used for per-cpu work items. Signed-off-by: Tejun Heo <tj@kernel.org>	2023-08-07 15:57:23 -10:00
Tejun Heo	967b494e2f	workqueue: Use a kthread_worker to release pool_workqueues pool_workqueue release path is currently bounced to system_wq; however, this is a bit tricky because this bouncing occurs while holding a pool lock and thus has risk of causing a A-A deadlock. This is currently addressed by the fact that only unbound workqueues use this bouncing path and system_wq is a per-cpu workqueue. While this works, it's brittle and requires a work-around like setting the lockdep subclass for the lock of unbound pools. Besides, future changes will use the bouncing path for per-cpu workqueues too making the current approach unusable. Let's just use a dedicated kthread_worker to untangle the dependency. This is just one more kthread for all workqueues and makes the pwq release logic simpler and more robust. Signed-off-by: Tejun Heo <tj@kernel.org>	2023-08-07 15:57:23 -10:00
Tejun Heo	fcecfa8f27	workqueue: Remove module param disable_numa and sysfs knobs pool_ids and numa Unbound workqueue CPU affinity is going to receive an overhaul and the NUMA specific knobs won't make sense anymore. Remove them. Also, the pool_ids knob was used for debugging and not really meaningful given that there is no visibility into the pools associated with those IDs. Remove it too. A future patch will improve overall visibility. Signed-off-by: Tejun Heo <tj@kernel.org>	2023-08-07 15:57:23 -10:00
Tejun Heo	797e8345cb	workqueue: Relocate worker and work management functions Collect first_idle_worker(), worker_enter/leave_idle(), find_worker_executing_work(), move_linked_works() and wake_up_worker() into one place. These functions will later be used to implement higher level worker management logic. No functional changes. Signed-off-by: Tejun Heo <tj@kernel.org>	2023-08-07 15:57:23 -10:00
Tejun Heo	ee1ceef727	workqueue: Rename wq->cpu_pwqs to wq->cpu_pwq wq->cpu_pwqs is a percpu variable carraying one pointer to a pool_workqueue. The field name being plural is unusual and confusing. Rename it to singular. This patch doesn't cause any functional changes. Signed-off-by: Tejun Heo <tj@kernel.org>	2023-08-07 15:57:23 -10:00
Tejun Heo	fe089f87cc	workqueue: Not all work insertion needs to wake up a worker insert_work() always tried to wake up a worker; however, the only time it needs to try to wake up a worker is when a new active work item is queued. When a work item goes on the inactive list or queueing a flush work item, there's no reason to try to wake up a worker. This patch moves the worker wakeup logic out of insert_work() and places it in the active new work item queueing path in __queue_work(). While at it: * __queue_work() is dereferencing pwq->pool repeatedly. Add local variable pool. * Every caller of insert_work() calls debug_work_activate(). Consolidate the invocations into insert_work(). * In __queue_work() pool->watchdog_ts update is relocated slightly. This is to better accommodate future changes. This makes wakeups more precise and will help the planned change to assign work items to workers before waking them up. No behavior changes intended. v2: WARN_ON_ONCE(pool != last_pool) added in __queue_work() to clarify as suggested by Lai. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Lai Jiangshan <jiangshanlai@gmail.com>	2023-08-07 15:57:22 -10:00
Tejun Heo	c0ab017d43	workqueue: Cleanups around process_scheduled_works() * Drop the trivial optimization in worker_thread() where it bypasses calling process_scheduled_works() if the first work item isn't linked. This is a mostly pointless micro optimization and gets in the way of improving the work processing path. * Consolidate pool->watchdog_ts updates in the two callers into process_scheduled_works(). Signed-off-by: Tejun Heo <tj@kernel.org>	2023-08-07 15:57:22 -10:00
Tejun Heo	bc8b50c2df	workqueue: Drop the special locking rule for worker->flags and worker_pool->flags worker->flags used to be accessed from scheduler hooks without grabbing pool->lock for concurrency management. This is no longer true since `6d25be5782` ("sched/core, workqueues: Distangle worker accounting from rq lock"). Also, it's unclear why worker_pool->flags was using the "X" rule. All relevant users are accessing it under the pool lock. Let's drop the special "X" rule and use the "L" rule for these flag fields instead. While at it, replace the CONTEXT comment with lockdep_assert_held(). This allows worker_set/clr_flags() to be used from context which isn't the worker itself. This will be used later to implement assinging work items to workers before waking them up so that workqueue can have better control over which worker executes which work item on which CPU. The only actual changes are sanity checks. There shouldn't be any visible behavior changes. Signed-off-by: Tejun Heo <tj@kernel.org>	2023-08-07 15:57:22 -10:00
Tejun Heo	87437656c2	workqueue: Merge branch 'for-6.5-fixes' into for-6.6 Unbound workqueue execution locality improvement patchset is about to applied which will cause merge conflicts with changes in for-6.5-fixes. Let's avoid future merge conflict by pulling in for-6.5-fixes. Signed-off-by: Tejun Heo <tj@kernel.org>	2023-08-07 15:54:25 -10:00
Jiri Olsa	a3c485a5d8	bpf: Add support for bpf_get_func_ip helper for uprobe program Adding support for bpf_get_func_ip helper for uprobe program to return probed address for both uprobe and return uprobe. We discussed this in [1] and agreed that uprobe can have special use of bpf_get_func_ip helper that differs from kprobe. The kprobe bpf_get_func_ip returns: - address of the function if probe is attach on function entry for both kprobe and return kprobe - 0 if the probe is not attach on function entry The uprobe bpf_get_func_ip returns: - address of the probe for both uprobe and return uprobe The reason for this semantic change is that kernel can't really tell if the probe user space address is function entry. The uprobe program is actually kprobe type program attached as uprobe. One of the consequences of this design is that uprobes do not have its own set of helpers, but share them with kprobes. As we need different functionality for bpf_get_func_ip helper for uprobe, I'm adding the bool value to the bpf_trace_run_ctx, so the helper can detect that it's executed in uprobe context and call specific code. The is_uprobe bool is set as true in bpf_prog_run_array_sleepable, which is currently used only for executing bpf programs in uprobe. Renaming bpf_prog_run_array_sleepable to bpf_prog_run_array_uprobe to address that it's only used for uprobes and that it sets the run_ctx.is_uprobe as suggested by Yafang Shao. Suggested-by: Andrii Nakryiko <andrii@kernel.org> Tested-by: Alan Maguire <alan.maguire@oracle.com> [1] https://lore.kernel.org/bpf/CAEf4BzZ=xLVkG5eurEuvLU79wAMtwho7ReR+XJAgwhFF4M-7Cg@mail.gmail.com/ Signed-off-by: Jiri Olsa <jolsa@kernel.org> Tested-by: Viktor Malik <vmalik@redhat.com> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20230807085956.2344866-2-jolsa@kernel.org Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>	2023-08-07 16:42:58 -07:00
Yonghong Song	db2baf82b0	bpf: Fix an incorrect verification success with movsx insn syzbot reports a verifier bug which triggers a runtime panic. The test bpf program is: 0: (62) (u32 )(r10 -8) = 553656332 1: (bf) r1 = (s16)r10 2: (07) r1 += -8 3: (b7) r2 = 3 4: (bd) if r2 <= r1 goto pc+0 5: (85) call bpf_trace_printk#-138320 6: (b7) r0 = 0 7: (95) exit At insn 1, the current implementation keeps 'r1' as a frame pointer, which caused later bpf_trace_printk helper call crash since frame pointer address is not valid any more. Note that at insn 4, the 'pointer vs. scalar' comparison is allowed for privileged prog run. To fix the problem with above insn 1, the fix in the patch adopts similar pattern to existing 'R1 = (u32) R2' handling. For unprivileged prog run, verification will fail with 'R<num> sign-extension part of pointer'. For privileged prog run, the dst_reg 'r1' will be marked as an unknown scalar, so later 'bpf_trace_pointk' helper will complain since it expected certain pointers. Reported-by: syzbot+d61b595e9205573133b3@syzkaller.appspotmail.com Fixes: `8100928c88` ("bpf: Support new sign-extension mov insns") Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20230807175721.671696-1-yonghong.song@linux.dev Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>	2023-08-07 16:23:35 -07:00
Linus Torvalds	14f9643dc9	workqueue: Fixes for v6.5-rc5 Two commits: * The recently added cpu_intensive auto detection and warning mechanism was spuriously triggered on slow CPUs. While not causing serious issues, it's still a nuisance and can cause unintended concurrency management behaviors. Relax the threshold on machines with lower BogoMIPS. While BogoMIPS is not an accurate measure of performance by most measures, we don't have to be accurate and it has rough but strong enough correlation. * A correction in Kconfig help text. -----BEGIN PGP SIGNATURE----- iIQEABYIACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCZNFMTQ4cdGpAa2VybmVs Lm9yZwAKCRCxYfJx3gVYGb+4AQCniWx3rwWWmLgviPR0AfYWbcQ8/P/qGh++fmsR tEF3sQD/bLdeWcVa1pSzXjhGtRVGsTis6oOhk81A0zIZlx0v2Qg= =sThu -----END PGP SIGNATURE----- Merge tag 'wq-for-6.5-rc5-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq Pull workqueue fixes from Tejun Heo: - The recently added cpu_intensive auto detection and warning mechanism was spuriously triggered on slow CPUs. While not causing serious issues, it's still a nuisance and can cause unintended concurrency management behaviors. Relax the threshold on machines with lower BogoMIPS. While BogoMIPS is not an accurate measure of performance by most measures, we don't have to be accurate and it has rough but strong enough correlation. - A correction in Kconfig help text * tag 'wq-for-6.5-rc5-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: workqueue: Scale up wq_cpu_intensive_thresh_us if BogoMIPS is below 4000 workqueue: Fix cpu_intensive_thresh_us name in help text	2023-08-07 13:07:12 -07:00
Hao Jia	0437719c1a	cgroup/rstat: Record the cumulative per-cpu time of cgroup and its descendants The member variable bstat of the structure cgroup_rstat_cpu records the per-cpu time of the cgroup itself, but does not include the per-cpu time of its descendants. The per-cpu time including descendants is very useful for calculating the per-cpu usage of cgroups. Although we can indirectly obtain the total per-cpu time of the cgroup and its descendants by accumulating the per-cpu bstat of each descendant of the cgroup. But after a child cgroup is removed, we will lose its bstat information. This will cause the cumulative value to be non-monotonic, thus affecting the accuracy of cgroup per-cpu usage. So we add the subtree_bstat variable to record the total per-cpu time of this cgroup and its descendants, which is similar to "cpuacct.usage*" in cgroup v1. And this is also helpful for the migration from cgroup v1 to cgroup v2. After adding this variable, we can obtain the per-cpu time of cgroup and its descendants in user mode through eBPF/drgn, etc. And we are still trying to determine how to expose it in the cgroupfs interface. Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Hao Jia <jiahao.os@bytedance.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2023-08-07 08:41:25 -10:00
Yang Yingliang	9680540c0c	workqueue: use LIST_HEAD to initialize cull_list Use LIST_HEAD() to initialize cull_list instead of open-coding it. Signed-off-by: Yang Yingliang <yangyingliang@huawei.com> Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2023-08-07 08:36:51 -10:00
Miaohe Lin	e7e64a1bff	cgroup: clean up if condition in cgroup_pidlist_start() There's no need to use '<=' when knowing 'l->list[mid] != pid' already. No functional change intended. Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2023-08-07 08:30:06 -10:00
Wenyu Liu	55e2b69649	kexec_lock: Replace kexec_mutex() by kexec_lock() in two comments kexec_mutex is replaced by an atomic variable in `05c6257433` (panic, kexec: make __crash_kexec() NMI safe). But there are still two comments that referenced kexec_mutex, replace them by kexec_lock. Signed-off-by: Wenyu Liu <liuwenyu7@huawei.com> Acked-by: Baoquan He <bhe@redhat.com> Acked-by: Paul Menzel <pmenzel@molgen.mpg.de> Signed-off-by: Mimi Zohar <zohar@linux.ibm.com>	2023-08-07 09:55:42 -04:00
Vlastimil Babka	df2f7cde73	PM: hibernate: fix resume_store() return value when hibernation not available On a laptop with hibernation set up but not actively used, and with secure boot and lockdown enabled kernel, 6.5-rc1 gets stuck on boot with the following repeated messages: A start job is running for Resume from hibernation using device /dev/system/swap (24s / no limit) lockdown_is_locked_down: 25311154 callbacks suppressed Lockdown: systemd-hiberna: hibernation is restricted; see man kernel_lockdown.7 ... Checking the resume code leads to commit `cc89c63e2f` ("PM: hibernate: move finding the resume device out of software_resume") which inadvertently changed the return value from resume_store() to 0 when !hibernation_available(). This apparently translates to userspace write() returning 0 as in number of bytes written, and userspace looping indefinitely in the attempt to write the intended value. Fix this by returning the full number of bytes that were to be written, as that's what was done before the commit. Fixes: `cc89c63e2f` ("PM: hibernate: move finding the resume device out of software_resume") Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2023-08-07 11:41:11 +02:00
Yang Yingliang	1e8e2efb34	bpf: change bpf_alu_sign_string and bpf_movsx_string to static The bpf_alu_sign_string and bpf_movsx_string introduced in commit `f835bb6222` ("bpf: Add kernel/bpftool asm support for new instructions") are only used in disasm.c now, change them to static. Signed-off-by: Yang Yingliang <yangyingliang@huawei.com> Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202308050615.wxAn1v2J-lkp@intel.com/ Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20230803023128.3753323-1-yangyingliang@huawei.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>	2023-08-04 16:15:50 -07:00
Kui-Feng Lee	5426700e68	bpf: fix bpf_dynptr_slice() to stop return an ERR_PTR. Verify if the pointer obtained from bpf_xdp_pointer() is either an error or NULL before returning it. The function bpf_dynptr_slice() mistakenly returned an ERR_PTR. Instead of solely checking for NULL, it should also verify if the pointer returned by bpf_xdp_pointer() is an error or NULL. Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Closes: https://lore.kernel.org/bpf/d1360219-85c3-4a03-9449-253ea905f9d1@moroto.mountain/ Fixes: `66e3a13e7c` ("bpf: Add bpf_dynptr_slice and bpf_dynptr_slice_rdwr") Suggested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com> Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20230803231206.1060485-1-thinker.li@gmail.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>	2023-08-04 14:53:15 -07:00
Daniel Borkmann	d210f9735e	bpf: Fix mprog detachment for empty mprog entry syzbot reported an UBSAN array-index-out-of-bounds access in bpf_mprog_read() upon bpf_mprog_detach(). While it did not have a reproducer, I was able to manually reproduce through an empty mprog entry which just has miniq present. The latter is important given otherwise we get an ENOENT error as tcx detaches the whole mprog entry. The index 4294967295 was triggered via NULL dtuple.prog which then attempts to detach from the back. bpf_mprog_fetch() in this case did hit the idx == total and therefore tried to grab the entry at idx -1. Fix it by adding an explicit bpf_mprog_total() check in bpf_mprog_detach() and bail out early with ENOENT. Fixes: `053c8e1f23` ("bpf: Add generic attach/detach/query API for multi-progs") Reported-by: syzbot+0c06ba0f831fe07a8f27@syzkaller.appspotmail.com Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/r/20230804131112.11012-1-daniel@iogearbox.net Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>	2023-08-04 09:35:39 -07:00
Li kunyu	5964d1e459	bpf: bpf_struct_ops: Remove unnecessary initial values of variables err and tlinks is assigned first, so it does not need to initialize the assignment. Signed-off-by: Li kunyu <kunyu@nfschina.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20230804175929.2867-1-kunyu@nfschina.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>	2023-08-03 17:54:35 -07:00
Miaohe Lin	7f828eacc4	cgroup: fix obsolete function name in cgroup_destroy_locked() Since commit `e76ecaeef6` ("cgroup: use cgroup_kn_lock_live() in other cgroup kernfs methods"), cgroup_kn_lock_live() is used in cgroup kernfs methods. Update corresponding comment. Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Tejun Heo <tj@kernel.org>	2023-08-03 14:13:33 -10:00
Jakub Kicinski	d07b7b32da	pull-request: bpf-next 2023-08-03 -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQRdM/uy1Ege0+EN1fNar9k/UBDW4wUCZMvevwAKCRBar9k/UBDW 42Z0AP90hLZ9OmoghYAlALHLl8zqXuHCV8OeFXR5auqG+kkcCwEAx6h99vnh4zgP Tngj6Yid60o39/IZXXblhV37HfSiyQ8= =/kVE -----END PGP SIGNATURE----- Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Martin KaFai Lau says: ==================== pull-request: bpf-next 2023-08-03 We've added 54 non-merge commits during the last 10 day(s) which contain a total of 84 files changed, 4026 insertions(+), 562 deletions(-). The main changes are: 1) Add SO_REUSEPORT support for TC bpf_sk_assign from Lorenz Bauer, Daniel Borkmann 2) Support new insns from cpu v4 from Yonghong Song 3) Non-atomically allocate freelist during prefill from YiFei Zhu 4) Support defragmenting IPv(4\|6) packets in BPF from Daniel Xu 5) Add tracepoint to xdp attaching failure from Leon Hwang 6) struct netdev_rx_queue and xdp.h reshuffling to reduce rebuild time from Jakub Kicinski * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (54 commits) net: invert the netdevice.h vs xdp.h dependency net: move struct netdev_rx_queue out of netdevice.h eth: add missing xdp.h includes in drivers selftests/bpf: Add testcase for xdp attaching failure tracepoint bpf, xdp: Add tracepoint to xdp attaching failure selftests/bpf: fix static assert compilation issue for test_cls_*.c bpf: fix bpf_probe_read_kernel prototype mismatch riscv, bpf: Adapt bpf trampoline to optimized riscv ftrace framework libbpf: fix typos in Makefile tracing: bpf: use struct trace_entry in struct syscall_tp_t bpf, devmap: Remove unused dtab field from bpf_dtab_netdev bpf, cpumap: Remove unused cmap field from bpf_cpu_map_entry netfilter: bpf: Only define get_proto_defrag_hook() if necessary bpf: Fix an array-index-out-of-bounds issue in disasm.c net: remove duplicate INDIRECT_CALLABLE_DECLARE of udp[6]_ehashfn docs/bpf: Fix malformed documentation bpf: selftests: Add defrag selftests bpf: selftests: Support custom type and proto for client sockets bpf: selftests: Support not connecting client socket netfilter: bpf: Support BPF_F_NETFILTER_IP_DEFRAG in netfilter link ... ==================== Link: https://lore.kernel.org/r/20230803174845.825419-1-martin.lau@linux.dev Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-08-03 15:34:36 -07:00
Jakub Kicinski	35b1b1fd96	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Cross-merge networking fixes after downstream PR. Conflicts: net/dsa/port.c `9945c1fb03` ("net: dsa: fix older DSA drivers using phylink") `a88dd75384` ("net: dsa: remove legacy_pre_march2020 detection") https://lore.kernel.org/all/20230731102254.2c9868ca@canb.auug.org.au/ net/xdp/xsk.c `3c5b4d69c3` ("net: annotate data-races around sk->sk_mark") `b7f72a30e9` ("xsk: introduce wrappers and helpers for supporting multi-buffer in Tx path") https://lore.kernel.org/all/20230731102631.39988412@canb.auug.org.au/ drivers/net/ethernet/broadcom/bnxt/bnxt.c `37b61cda9c` ("bnxt: don't handle XDP in netpoll") `2b56b3d992` ("eth: bnxt: handle invalid Tx completions more gracefully") https://lore.kernel.org/all/20230801101708.1dc7faac@canb.auug.org.au/ Adjacent changes: drivers/net/ethernet/mellanox/mlx5/core/en_accel/ipsec_fs.c `62da08331f` ("net/mlx5e: Set proper IPsec source port in L4 selector") `fbd517549c` ("net/mlx5e: Add function to get IPsec offload namespace") drivers/net/ethernet/sfc/selftest.c `55c1528f9b` ("sfc: fix field-spanning memcpy in selftest") `ae9d445cd4` ("sfc: Miscellaneous comment removals") Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-08-03 14:34:37 -07:00
Linus Torvalds	999f663186	Including fixes from bpf and wireless. Nothing scary here. Feels like the first wave of regressions from v6.5 is addressed - one outstanding fix still to come in TLS for the sendpage rework. Current release - regressions: - udp: fix __ip_append_data()'s handling of MSG_SPLICE_PAGES - dsa: fix older DSA drivers using phylink Previous releases - regressions: - gro: fix misuse of CB in udp socket lookup - mlx5: unregister devlink params in case interface is down - Revert "wifi: ath11k: Enable threaded NAPI" Previous releases - always broken: - sched: cls_u32: fix match key mis-addressing - sched: bind logic fixes for cls_fw, cls_u32 and cls_route - add bound checks to a number of places which hand-parse netlink - bpf: disable preemption in perf_event_output helpers code - qed: fix scheduling in a tasklet while getting stats - avoid using APIs which are not hardirq-safe in couple of drivers, when we may be in a hard IRQ (netconsole) - wifi: cfg80211: fix return value in scan logic, avoid page allocator warning - wifi: mt76: mt7615: do not advertise 5 GHz on first PHY of MT7615D (DBDC) Misc: - drop handful of inactive maintainers, put some new in place Signed-off-by: Jakub Kicinski <kuba@kernel.org> -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmTMCRwACgkQMUZtbf5S Irv1tRAArN6rfYrr2ulaTOfMqhWb1Q+kAs00nBCKqC+OdWgT0hqw2QAuqTAVjhje 8HBYlNGyhJ10yp0Q5y4Fp9CsBDHDDNjIp/YGEbr0vC/9mUDOhYD8WV07SmZmzEJu gmt4LeFPTk07yZy7VxMLY5XKuwce6MWGHArehZE7PSa9+07yY2Ov9X02ntr9hSdH ih+VdDI12aTVSj208qb0qNb2JkefFHW9dntVxce4/mtYJE9+47KMR2aXDXtCh0C6 ECgx0LQkdEJ5vNSYfypww0SXIG5aj7sE6HMTdJkjKH7ws4xrW8H+P9co77Hb/DTH TsRBS4SgB20hFNxz3OQwVmAvj+2qfQssL7SeIkRnaEWeTBuVqCwjLdoIzKXJxxq+ cvtUAAM8XUPqec5cPiHPkeAJV6aJhrdUdMjjbCI9uFYU32AWFBQEqvVGP9xdhXHK QIpTLiy26Vw8PwiJdROuGiZJCXePqQRLDuMX1L43ZO1rwIrZcWGHjCNtsR9nXKgQ apbbxb2/rq2FBMB+6obKeHzWDy3JraNCsUspmfleqdjQ2mpbRokd4Vw2564FJgaC 5OznPIX6OuoCY5sftLUcRcpH5ncNj01BvyqjWyCIfJdkCqCUL7HSAgxfm5AUnZip ZIXOzZnZ6uTUQFptXdjey/jNEQ6qpV8RmwY0CMsmJoo88DXI34Y= =HYkl -----END PGP SIGNATURE----- Merge tag 'net-6.5-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Jakub Kicinski: "Including fixes from bpf and wireless. Nothing scary here. Feels like the first wave of regressions from v6.5 is addressed - one outstanding fix still to come in TLS for the sendpage rework. Current release - regressions: - udp: fix __ip_append_data()'s handling of MSG_SPLICE_PAGES - dsa: fix older DSA drivers using phylink Previous releases - regressions: - gro: fix misuse of CB in udp socket lookup - mlx5: unregister devlink params in case interface is down - Revert "wifi: ath11k: Enable threaded NAPI" Previous releases - always broken: - sched: cls_u32: fix match key mis-addressing - sched: bind logic fixes for cls_fw, cls_u32 and cls_route - add bound checks to a number of places which hand-parse netlink - bpf: disable preemption in perf_event_output helpers code - qed: fix scheduling in a tasklet while getting stats - avoid using APIs which are not hardirq-safe in couple of drivers, when we may be in a hard IRQ (netconsole) - wifi: cfg80211: fix return value in scan logic, avoid page allocator warning - wifi: mt76: mt7615: do not advertise 5 GHz on first PHY of MT7615D (DBDC) Misc: - drop handful of inactive maintainers, put some new in place" * tag 'net-6.5-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (98 commits) MAINTAINERS: update TUN/TAP maintainers test/vsock: remove vsock_perf executable on `make clean` tcp_metrics: fix data-race in tcpm_suck_dst() vs fastopen tcp_metrics: annotate data-races around tm->tcpm_net tcp_metrics: annotate data-races around tm->tcpm_vals[] tcp_metrics: annotate data-races around tm->tcpm_lock tcp_metrics: annotate data-races around tm->tcpm_stamp tcp_metrics: fix addr_same() helper prestera: fix fallback to previous version on same major version udp: Fix __ip_append_data()'s handling of MSG_SPLICE_PAGES net/mlx5e: Set proper IPsec source port in L4 selector net/mlx5: fs_core: Skip the FTs in the same FS_TYPE_PRIO_CHAINS fs_prio net/mlx5: fs_core: Make find_closest_ft more generic wifi: brcmfmac: Fix field-spanning write in brcmf_scan_params_v2_to_v1() vxlan: Fix nexthop hash size ip6mr: Fix skb_under_panic in ip6mr_cache_report() s390/qeth: Don't call dev_close/dev_open (DOWN/UP) net: tap_open(): set sk_uid from current_fsuid() net: tun_chr_open(): set sk_uid from current_fsuid() net: dcb: choose correct policy to parse DCB_ATTR_BCN ...	2023-08-03 14:00:02 -07:00
James Morse	2abcc4b5a6	module: Expose module_init_layout_section() module_init_layout_section() choses whether the core module loader considers a section as init or not. This affects the placement of the exit section when module unloading is disabled. This code will never run, so it can be free()d once the module has been initialised. arm and arm64 need to count the number of PLTs they need before applying relocations based on the section name. The init PLTs are stored separately so they can be free()d. arm and arm64 both use within_module_init() to decide which list of PLTs to use when applying the relocation. Because within_module_init()'s behaviour changes when module unloading is disabled, both architecture would need to take this into account when counting the PLTs. Today neither architecture does this, meaning when module unloading is disabled there are insufficient PLTs in the init section to load some modules, resulting in warnings: \| WARNING: CPU: 2 PID: 51 at arch/arm64/kernel/module-plts.c:99 module_emit_plt_entry+0x184/0x1cc \| Modules linked in: crct10dif_common \| CPU: 2 PID: 51 Comm: modprobe Not tainted 6.5.0-rc4-yocto-standard-dirty #15208 \| Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015 \| pstate: 20400005 (nzCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) \| pc : module_emit_plt_entry+0x184/0x1cc \| lr : module_emit_plt_entry+0x94/0x1cc \| sp : ffffffc0803bba60 [...] \| Call trace: \| module_emit_plt_entry+0x184/0x1cc \| apply_relocate_add+0x2bc/0x8e4 \| load_module+0xe34/0x1bd4 \| init_module_from_file+0x84/0xc0 \| __arm64_sys_finit_module+0x1b8/0x27c \| invoke_syscall.constprop.0+0x5c/0x104 \| do_el0_svc+0x58/0x160 \| el0_svc+0x38/0x110 \| el0t_64_sync_handler+0xc0/0xc4 \| el0t_64_sync+0x190/0x194 Instead of duplicating module_init_layout_section()s logic, expose it. Reported-by: Adam Johnston <adam.johnston@arm.com> Fixes: `055f23b74b` ("module: check for exit sections in layout_sections() instead of module_init_section()") Cc: stable@vger.kernel.org Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>	2023-08-03 13:42:02 -07:00
Jakub Kicinski	3932f22723	pull-request: bpf 2023-08-03 -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQRdM/uy1Ege0+EN1fNar9k/UBDW4wUCZMvqewAKCRBar9k/UBDW 48yeAQCnPnwzcvy+JDrdosuJEErhMv0pH3ECixNpPBpns95kzAEA9QhSYwjAhlFf 61d6hoiXj/sIibgMQT/ihODgeJ4wfQE= =u7qn -----END PGP SIGNATURE----- Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf Martin KaFai Lau says: ==================== pull-request: bpf 2023-08-03 We've added 5 non-merge commits during the last 7 day(s) which contain a total of 3 files changed, 37 insertions(+), 20 deletions(-). The main changes are: 1) Disable preemption in perf_event_output helpers code, from Jiri Olsa 2) Add length check for SK_DIAG_BPF_STORAGE_REQ_MAP_FD parsing, from Lin Ma 3) Multiple warning splat fixes in cpumap from Hou Tao * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: bpf, cpumap: Handle skb as well when clean up ptr_ring bpf, cpumap: Make sure kthread is running before map update returns bpf: Add length check for SK_DIAG_BPF_STORAGE_REQ_MAP_FD parsing bpf: Disable preemption in bpf_event_output bpf: Disable preemption in bpf_perf_event_output ==================== Link: https://lore.kernel.org/r/20230803181429.994607-1-martin.lau@linux.dev Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-08-03 11:22:53 -07:00
Jakub Kicinski	680ee0456a	net: invert the netdevice.h vs xdp.h dependency xdp.h is far more specific and is included in only 67 other files vs netdevice.h's 1538 include sites. Make xdp.h include netdevice.h, instead of the other way around. This decreases the incremental allmodconfig builds size when xdp.h is touched from 5947 to 662 objects. Move bpf_prog_run_xdp() to xdp.h, seems appropriate and filter.h is a mega-header in its own right so it's nice to avoid xdp.h getting included there as well. The only unfortunate part is that the typedef for xdp_features_t has to move to netdevice.h, since its embedded in struct netdevice. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Acked-by: Jesper Dangaard Brouer <hawk@kernel.org> Link: https://lore.kernel.org/r/20230803010230.1755386-4-kuba@kernel.org Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>	2023-08-03 08:38:07 -07:00

1 2 3 4 5 ...

42708 Commits