Currently the following code in check_and_init_map_value()
*(struct bpf_timer *)(dst + map->timer_off) =
(struct bpf_timer){};
can help generate bpf_timer definition in vmlinuxBTF.
But the code above may not zero the whole structure
due to anonymour members and that code will be replaced
by memset in the subsequent patch and
bpf_timer definition will disappear from vmlinuxBTF.
Let us emit the type explicitly so bpf program can continue
to use it from vmlinux.h.
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20220211194948.3141529-1-yhs@fb.com
- Revert a recent change that attempted to avoid issues with
conflicting address ranges during PCI initialization, because it
turned out to introduce a regression (Hans de Goede).
- Revert a change that limited EC GPE wakeups from suspend-to-idle
to systems based on Intel hardware, because it turned out that
systems based on hardware from other vendors depended on that
functionality too (Mario Limonciello).
- Fix two issues related to the handling of wakeup interrupts and
wakeup events signaled through the EC GPE during suspend-to-idle
on x86 (Rafael Wysocki).
-----BEGIN PGP SIGNATURE-----
iQJGBAABCAAwFiEE4fcc61cGeeHD/fCwgsRv/nhiVHEFAmIGkq4SHHJqd0Byand5
c29ja2kubmV0AAoJEILEb/54YlRxs3MQAKhmnM5mHOqFIM7VM9SoutEoOIXPWLCz
CNjjqfwA63GvxTp//Dks4KGCLF90mfFOv8Mt7vlzVHyQDvTv2w0YddmBxQAgXTU4
v1yfHcRssND1w2kgIg7wThWSxSKvP/4XgqwrhTmEKHb1qPFrpq9pt7+4RIco9362
pacxW1QPX2+5yNDs67JuTsjh2mksKC8CUgiA8BUa+57jrIXvQYWqqZ490PpkbXW0
Nh1naNwT1xDqte5U98PrzZZRt2qUMuKrG2ro0lE97rb067zSoIQo2XCODlo7T12O
7vFzypOTvNJZjd57U9SoKyEDCzDnHkNh2O2jiNKaqzMJHqgh0bkMrMHysPbdTwMu
VH0fw+VElRONyQ5knUcvTG3IRfEuTBl2iqDoO+hLb+cmkL48KL9yJXH7dTRMrNJ8
zqHSCqpON8rZgLfDzrxoVvMXv9Al5ra5wM41EljiUWUDFEB1HbtleysHIMFMxEKy
gBvO6zAyEoh5ie2pxYdSAyY+pq+d1JFeAgpbXpF3miQFE2gchC1BjNjUCguekfty
t0kwOefLOphgShpIm/04F2WBsYNM+QSIoJWRvE5LsEVTdD/j59MweCTJN2KNSIpj
vapl8ymAr6umMrPhxpjY6K4DD000+LYpV9WJ8Q1Wv2HWV9OkevL5KnEXoQ94czpb
ybb/Nd4SwHV+
=aGYp
-----END PGP SIGNATURE-----
Merge tag 'acpi-5.17-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull ACPI fixes from Rafael Wysocki:
"These revert two commits that turned out to be problematic and fix two
issues related to wakeup from suspend-to-idle on x86.
Specifics:
- Revert a recent change that attempted to avoid issues with
conflicting address ranges during PCI initialization, because it
turned out to introduce a regression (Hans de Goede).
- Revert a change that limited EC GPE wakeups from suspend-to-idle to
systems based on Intel hardware, because it turned out that systems
based on hardware from other vendors depended on that functionality
too (Mario Limonciello).
- Fix two issues related to the handling of wakeup interrupts and
wakeup events signaled through the EC GPE during suspend-to-idle on
x86 (Rafael Wysocki)"
* tag 'acpi-5.17-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
x86/PCI: revert "Ignore E820 reservations for bridge windows on newer systems"
PM: s2idle: ACPI: Fix wakeup interrupts handling
ACPI: PM: s2idle: Cancel wakeup before dispatching EC GPE
ACPI: PM: Revert "Only mark EC GPE for wakeup on Intel systems"
- Fixes to the RTLA tooling.
- A fix to a tp_printk overriding tp_printk_stop_on_boot on command line.
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCYgWtxxQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qsaaAQD+4lcpIRKdkfGb09xMlh8Gr8OvRoVb
5XAhzHVpETjGUAEAyIJAG+7Epw/St8FCSupNAEWTzGghjhoJhFblTd17jAg=
=CCON
-----END PGP SIGNATURE-----
Merge tag 'trace-v5.17-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fixes from Steven Rostedt:
- Fixes to the RTLA tooling
- A fix to a tp_printk overriding tp_printk_stop_on_boot on the
command line
* tag 'trace-v5.17-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Fix tp_printk option related with tp_printk_stop_on_boot
MAINTAINERS: Add RTLA entry
rtla: Fix segmentation fault when failing to enable -t
rtla/trace: Error message fixup
rtla/utils: Fix session duration parsing
rtla: Follow kernel version
This patch adds __sched attributes to a few missing places
to show blocked function rather than locking function
in get_wchan.
Signed-off-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20220115231657.84828-1-minchan@kernel.org
Add validation to ensure data is at or greater than the min size for the
fields of the event. If a dynamic array is used and is a type of char,
ensure null termination of the array exists.
Link: https://lkml.kernel.org/r/20220118204326.2169-7-beaub@linux.microsoft.com
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Beau Belgrave <beaub@linux.microsoft.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Pass iterator through to probes to allow copying data directly to the
probe buffers instead of taking multiple copies. Enables eBPF user and
raw iterator types out to programs for no-copy scenarios.
Link: https://lkml.kernel.org/r/20220118204326.2169-6-beaub@linux.microsoft.com
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Beau Belgrave <beaub@linux.microsoft.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Adds support to write out user_event data to perf_probe/perf files as
well as to any attached eBPF program.
Link: https://lkml.kernel.org/r/20220118204326.2169-5-beaub@linux.microsoft.com
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Beau Belgrave <beaub@linux.microsoft.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Ensures that when dynamic events requests a match with arguments that
they match what is in the user_event.
Link: https://lkml.kernel.org/r/20220118204326.2169-4-beaub@linux.microsoft.com
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Beau Belgrave <beaub@linux.microsoft.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Addes print_fmt format generation for basic types that are supported for
user processes. Only supports sizes that are the same on 32 and 64 bit.
Link: https://lkml.kernel.org/r/20220118204326.2169-3-beaub@linux.microsoft.com
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Beau Belgrave <beaub@linux.microsoft.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Minimal support for interacting with dynamic events, trace_event and
ftrace. Core outline of flow between user process, ioctl and trace_event
APIs.
User mode processes that wish to use trace events to get data into
ftrace, perf, eBPF, etc are limited to uprobes today. The user events
features enables an ABI for user mode processes to create and write to
trace events that are isolated from kernel level trace events. This
enables a faster path for tracing from user mode data as well as opens
managed code to participate in trace events, where stub locations are
dynamic.
User processes often want to trace only when it's useful. To enable this
a set of pages are mapped into the user process space that indicate the
current state of the user events that have been registered. User
processes can check if their event is hooked to a trace/probe, and if it
is, emit the event data out via the write() syscall.
Two new files are introduced into tracefs to accomplish this:
user_events_status - This file is mmap'd into participating user mode
processes to indicate event status.
user_events_data - This file is opened and register/delete ioctl's are
issued to create/open/delete trace events that can be used for tracing.
The typical scenario is on process start to mmap user_events_status. Processes
then register the events they plan to use via the REG ioctl. The ioctl reads
and updates the passed in user_reg struct. The status_index of the struct is
used to know the byte in the status page to check for that event. The
write_index of the struct is used to describe that event when writing out to
the fd that was used for the ioctl call. The data must always include this
index first when writing out data for an event. Data can be written either by
write() or by writev().
For example, in memory:
int index;
char data[];
Psuedo code example of typical usage:
struct user_reg reg;
int page_fd = open("user_events_status", O_RDWR);
char *page_data = mmap(NULL, PAGE_SIZE, PROT_READ, MAP_SHARED, page_fd, 0);
close(page_fd);
int data_fd = open("user_events_data", O_RDWR);
reg.size = sizeof(reg);
reg.name_args = (__u64)"test";
ioctl(data_fd, DIAG_IOCSREG, ®);
int status_id = reg.status_index;
int write_id = reg.write_index;
struct iovec io[2];
io[0].iov_base = &write_id;
io[0].iov_len = sizeof(write_id);
io[1].iov_base = payload;
io[1].iov_len = sizeof(payload);
if (page_data[status_id])
writev(data_fd, io, 2);
User events are also exposed via the dynamic_events tracefs file for
both create and delete. Current status is exposed via the user_events_status
tracefs file.
Simple example to register a user event via dynamic_events:
echo u:test >> dynamic_events
cat dynamic_events
u:test
If an event is hooked to a probe, the probe hooked shows up:
echo 1 > events/user_events/test/enable
cat user_events_status
1:test # Used by ftrace
Active: 1
Busy: 1
Max: 4096
If an event is not hooked to a probe, no probe status shows up:
echo 0 > events/user_events/test/enable
cat user_events_status
1:test
Active: 1
Busy: 0
Max: 4096
Users can describe the trace event format via the following format:
name[:FLAG1[,FLAG2...] [field1[;field2...]]
Each field has the following format:
type name
Example for char array with a size of 20 named msg:
echo 'u:detailed char[20] msg' >> dynamic_events
cat dynamic_events
u:detailed char[20] msg
Data offsets are based on the data written out via write() and will be
updated to reflect the correct offset in the trace_event fields. For dynamic
data it is recommended to use the new __rel_loc data type. This type will be
the same as __data_loc, but the offset is relative to this entry. This allows
user_events to not worry about what common fields are being inserted before
the data.
The above format is valid for both the ioctl and the dynamic_events file.
Link: https://lkml.kernel.org/r/20220118204326.2169-2-beaub@linux.microsoft.com
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Beau Belgrave <beaub@linux.microsoft.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Use the sched_switch function to save both the wakee and the waker comms
in the saved cmdlines list when sched_wakeup is done.
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Currently, synthetic event command error strings are restricted to a
length of MAX_FILTER_STR_VAL (256), which is too short for some
commands already seen in the wild (with cmd strings longer than that
showing up truncated in err_log).
Remove the restriction so that no synthetic event command error string
is ever truncated.
Link: https://lkml.kernel.org/r/0376692396a81d0b795127c66ea92ca5bf60f481.1643399022.git.zanussi@kernel.org
Signed-off-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Currently, hist trigger command error strings are restricted to a
length of MAX_FILTER_STR_VAL (256), which is too short for some
commands already seen in the wild (with cmd strings longer than that
showing up truncated in err_log).
Remove the restriction so that no hist trigger command error string is
ever truncated.
Link: https://lkml.kernel.org/r/0f9d46407222eaf6632cd3b417bc50a11f401b71.1643399022.git.zanussi@kernel.org
Signed-off-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Currently, tracing_log_err.cmd strings are restricted to a length of
MAX_FILTER_STR_VAL (256), which is too short for some commands already
seen in the wild (with cmd strings longer than that showing up
truncated).
Remove the restriction so that no command string is ever truncated.
Link: https://lkml.kernel.org/r/ca965f23256b350ebd94b3dc1a319f28e8267f5f.1643319703.git.zanussi@kernel.org
Signed-off-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
If seccomp tries to kill a process, it should never see that process
again. To enforce this proactively, switch the mode to something
impossible. If encountered: WARN, reject all syscalls, and attempt to
kill the process again even harder.
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Will Drewry <wad@chromium.org>
Fixes: 8112c4f140 ("seccomp: remove 2-phase API")
Cc: stable@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Fatal SIGSYS signals (i.e. seccomp RET_KILL_* syscall filter actions)
were not being delivered to ptraced pid namespace init processes. Make
sure the SIGNAL_UNKILLABLE doesn't get set for these cases.
Reported-by: Robert Święcki <robert@swiecki.net>
Suggested-by: "Eric W. Biederman" <ebiederm@xmission.com>
Fixes: 00b06da29c ("signal: Add SA_IMMUTABLE to ensure forced siganls do not get changed")
Cc: stable@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com>
Link: https://lore.kernel.org/lkml/878rui8u4a.fsf@email.froward.int.ebiederm.org
-----BEGIN PGP SIGNATURE-----
iQJIBAABCAAyFiEES0KozwfymdVUl37v6iDy2pc3iXMFAmIEf2AUHHBhdWxAcGF1
bC1tb29yZS5jb20ACgkQ6iDy2pc3iXNB3A/7BzHTEAqMeOZx9x8L0f4dFpqsyRba
1wSyWwH6cPqGZew6IRbrj5hSv8XTAQKtqmhJJNfs4CoKV5pX+XnyKuPDGblBAkFn
uRgzQb6SQSQU7Bm/qWziwa/PVFmkkwbEG/08O5nKQ3cK4JEXzLPuDtN1T196Nlxf
IrbC3E18juIoFlzKk2oLwoYWB6JkQ3GmTnIQ7g91/673U8w4mrAH3ZBTFl8vKaqu
TiWMIujjVrEhgpFmKZTRz+5EeHxuB8XAY6NZH7qI3FcMWhyk/at7orF6JsaKZeyX
/UW0m2Ztr9d4r4U0H/sMJgLmEbj0JApboIAp/O35KsfQi2yquH71c/fY1qW54WD6
Gsn503VkkgZ0J+DWHSk1hQxAylpJ8Jv2W7Rukk74uCXQAiPEI4xq00ZyueIEF2fM
V10oSJtISS9nAwuoZVMQbre+mFsHjuA6qIPutoWJtEsk2f9fnFxKvOtAbhjNLKG0
4CmPIXXKm3h4umdMkh6wsnqPk0UcP6FSDX/bbea+YGKELKYXbbIDhDUbFnfdViie
wuKFFMRAH8N6IiPa7jyKhvLYUdp2coCxgsiaN+sZ5Z9+0O/tLu40quH0TCsjvrog
Ol/en8rIdccgkBCCo/N2y3h4CzYZ0R5ODUTOTfx9MxciBzfUeRhnvN1bNoQxYSjH
KSw7Ppu8b98Iflk=
=iaaT
-----END PGP SIGNATURE-----
Merge tag 'audit-pr-20220209' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit
Pull audit fix from Paul Moore:
"Another audit fix, this time a single rather small but important fix
for an oops/page-fault caused by improperly accessing userspace
memory"
* tag 'audit-pr-20220209' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit:
audit: don't deref the syscall args when checking the openat2 open_how::flags
As reported by Jeff, dereferencing the openat2 syscall argument in
audit_match_perm() to obtain the open_how::flags can result in an
oops/page-fault. This patch fixes this by using the open_how struct
that we store in the audit_context with audit_openat2_how().
Independent of this patch, Richard Guy Briggs posted a similar patch
to the audit mailing list roughly 40 minutes after this patch was
posted.
Cc: stable@vger.kernel.org
Fixes: 1c30e3af8a ("audit: add support for the openat2 syscall")
Reported-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
As a preparation to moving the reference to the device used for
runtime power management, add a new 'dev' field to the irqdomain
structure for that exact purpose.
The irq_chip_pm_{get,put}() helpers are made aware of the dual
location via a new private helper.
No functional change intended.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Geert Uytterhoeven <geert+renesas@glider.be>
Tested-by: Geert Uytterhoeven <geert+renesas@glider.be>
Tested-by: Tony Lindgren <tony@atomide.com>
Acked-by: Bartosz Golaszewski <brgl@bgdev.pl>
Link: https://lore.kernel.org/r/20220201120310.878267-2-maz@kernel.org
The kernel parameter "tp_printk_stop_on_boot" starts with "tp_printk" which is
the same as another kernel parameter "tp_printk". If "tp_printk" setup is
called before the "tp_printk_stop_on_boot", it will override the latter
and keep it from being set.
This is similar to other kernel parameter issues, such as:
Commit 745a600cf1 ("um: console: Ignore console= option")
or init/do_mounts.c:45 (setup function of "ro" kernel param)
Fix it by checking for a "_" right after the "tp_printk" and if that
exists do not process the parameter.
Link: https://lkml.kernel.org/r/20220208195421.969326-1-jsyoo5b@gmail.com
Signed-off-by: JaeSang Yoo <jsyoo5b@gmail.com>
[ Fixed up change log and added space after if condition ]
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Currently, call_rcu_tasks_generic() sets ->percpu_enqueue_shift to
order_base_2(nr_cpu_ids) upon encountering sufficient contention.
This does not shift to use of non-CPU-0 callback queues as intended, but
rather continues using only CPU 0's queue. Although this does provide
some decrease in contention due to spreading work over multiple locks,
it is not the dramatic decrease that was intended.
This commit therefore makes call_rcu_tasks_generic() set
->percpu_enqueue_shift to 0.
Reported-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The ilog2() function can be used to generate a shift count, but it will
generate the same count for a power of two as for one greater than a power
of two. This results in shift counts that are larger than necessary for
systems with a power-of-two number of CPUs because the CPUs are numbered
from zero, so that the maximum CPU number is one less than that power
of two.
This commit therefore substitutes order_base_2(), which appears to have
been designed for exactly this use case.
Suggested-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The pattern "rdp->grpmask & rcu_rnp_online_cpus(rnp)" occurs frequently
in RCU code in order to determine whether rdp->cpu is online from an
RCU perspective. This commit therefore creates an rcu_rdp_cpu_online()
function to replace it.
[ paulmck: Apply kernel test robot unused-variable feedback. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
This commit removes the cpus_read_lock() and cpus_read_unlock() calls
from rcu_barrier(), thus allowing CPUs to come and go during the course
of rcu_barrier() execution. Posting of the ->barrier_head callbacks does
synchronize with portions of RCU's CPU-hotplug notifiers, but these locks
are held for short time periods on both sides. Thus, full CPU-hotplug
operations could both start and finish during the execution of a given
rcu_barrier() invocation.
Additional synchronization is provided by a global ->barrier_lock.
Since the ->barrier_lock is only used during rcu_barrier() execution and
during onlining/offlining a CPU, the contention for this lock should
be low. It might be tempting to make use of a per-CPU lock just on
general principles, but straightforward attempts to do this have the
problems shown below.
Initial state: 3 CPUs present, CPU 0 and CPU1 do not have
any callback and CPU2 has callbacks.
1. CPU0 calls rcu_barrier().
2. CPU1 starts offlining for CPU2. CPU1 calls
rcutree_migrate_callbacks(). rcu_barrier_entrain() is called
from rcutree_migrate_callbacks(), with CPU2's rdp->barrier_lock.
It does not entrain ->barrier_head for CPU2, as rcu_barrier()
on CPU0 hasn't started the barrier sequence (by calling
rcu_seq_start(&rcu_state.barrier_sequence)) yet.
3. CPU0 starts new barrier sequence. It iterates over
CPU0 and CPU1, after acquiring their per-cpu ->barrier_lock
and finds 0 segcblist length. It updates ->barrier_seq_snap
for CPU0 and CPU1 and continues loop iteration to CPU2.
for_each_possible_cpu(cpu) {
raw_spin_lock_irqsave(&rdp->barrier_lock, flags);
if (!rcu_segcblist_n_cbs(&rdp->cblist)) {
WRITE_ONCE(rdp->barrier_seq_snap, gseq);
raw_spin_unlock_irqrestore(&rdp->barrier_lock, flags);
rcu_barrier_trace(TPS("NQ"), cpu, rcu_state.barrier_sequence);
continue;
}
4. rcutree_migrate_callbacks() completes execution on CPU1.
Segcblist len for CPU2 becomes 0.
5. The loop iteration on CPU0, checks rcu_segcblist_n_cbs(&rdp->cblist)
for CPU2 and completes the loop iteration after setting
->barrier_seq_snap.
6. As there isn't any ->barrier_head callback entrained; at
this point, rcu_barrier() in CPU0 returns.
7. The callbacks, which migrated from CPU2 to CPU1, execute.
Straightforward per-CPU locking is also subject to the following race
condition noted by Boqun Feng:
1. CPU0 calls rcu_barrier(), starting a new barrier sequence by invoking
rcu_seq_start() and init_completion(), but does not yet initialize
rcu_state.barrier_cpu_count.
2. CPU1 starts offlining for CPU2, calling rcutree_migrate_callbacks(),
which in turn calls rcu_barrier_entrain() holding CPU2's.
rdp->barrier_lock. It then entrains ->barrier_head for CPU2
and atomically increments rcu_state.barrier_cpu_count, which is
unfortunately not yet initialized to the value 2.
3. The just-entrained RCU callback is invoked. It atomically
decrements rcu_state.barrier_cpu_count and sees that it is
now zero. This callback therefore invokes complete().
4. CPU0 continues executing rcu_barrier(), but is not blocked
by its call to wait_for_completion(). This results in rcu_barrier()
returning before all pre-existing callbacks have been invoked,
which is a bug.
Therefore, synchronization is provided by rcu_state.barrier_lock,
which is also held across the initialization sequence, especially the
rcu_seq_start() and the atomic_set() that sets rcu_state.barrier_cpu_count
to the value 2. In addition, this lock is held when entraining the
rcu_barrier() callback, when deciding whether or not a CPU has callbacks
that rcu_barrier() must wait on, when setting the ->qsmaskinitnext for
incoming CPUs, and when migrating callbacks from a CPU that is going
offline.
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Co-developed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Signed-off-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
This commit reworks rcu_barrier() and callback-migration logic to
permit allowing rcu_barrier() to run concurrently with CPU-hotplug
operations. The key trick is for callback migration to check to see if
an rcu_barrier() is in flight, and, if so, enqueue the ->barrier_head
callback on its behalf.
This commit adds synchronization with RCU's CPU-hotplug notifiers. Taken
together, this will permit a later commit to remove the cpus_read_lock()
and cpus_read_unlock() calls from rcu_barrier().
[ paulmck: Updated per kbuild test robot feedback. ]
[ paulmck: Updated per reviews session with Neeraj, Frederic, Uladzislau, and Boqun. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
This commit saves a few lines by checking first for an empty callback
list. If the callback list is empty, then that CPU is taken care of,
regardless of its online or nocb state. Also simplify tracing accordingly
and fold a few lines together.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
If we allow architectures to bring APs online in parallel, then we end
up requiring rcu_cpu_starting() to be reentrant. But currently, the
manipulation of rnp->ofl_seq is not thread-safe.
However, rnp->ofl_seq is also fairly much pointless anyway since both
rcu_cpu_starting() and rcu_report_dead() hold rcu_state.ofl_lock for
fairly much the whole time that rnp->ofl_seq is set to an odd number
to indicate that an operation is in progress.
So drop rnp->ofl_seq completely, and use only rcu_state.ofl_lock.
This has a couple of minor complexities: lockdep will complain when we
take rcu_state.ofl_lock, and currently accepts the 'excuse' of having
an odd value in rnp->ofl_seq. So switch it to an arch_spinlock_t to
avoid that false positive complaint. Since we're killing rnp->ofl_seq
of course that 'excuse' has to be changed too, so make it check for
arch_spin_is_locked(rcu_state.ofl_lock).
There's no arch_spin_lock_irqsave() so we have to manually save and
restore local interrupts around the locking.
At Paul's request based on Neeraj's analysis, make rcu_gp_init not just
wait but *exclude* any CPU online/offline activity, which was fairly
much true already by virtue of it holding rcu_state.ofl_lock.
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
After commit e3728b50cd ("ACPI: PM: s2idle: Avoid possible race
related to the EC GPE") wakeup interrupts occurring immediately after
the one discarded by acpi_s2idle_wake() may be missed. Moreover, if
the SCI triggers again immediately after the rearming in
acpi_s2idle_wake(), that wakeup may be missed too.
The problem is that pm_system_irq_wakeup() only calls pm_system_wakeup()
when pm_wakeup_irq is 0, but that's not the case any more after the
interrupt causing acpi_s2idle_wake() to run until pm_wakeup_irq is
cleared by the pm_wakeup_clear() call in s2idle_loop(). However,
there may be wakeup interrupts occurring in that time frame and if
that happens, they will be missed.
To address that issue first move the clearing of pm_wakeup_irq to
the point at which it is known that the interrupt causing
acpi_s2idle_wake() to tun will be discarded, before rearming the SCI
for wakeup. Moreover, because that only reduces the size of the
time window in which the issue may manifest itself, allow
pm_system_irq_wakeup() to register two second wakeup interrupts in
a row and, when discarding the first one, replace it with the second
one. [Of course, this assumes that only one wakeup interrupt can be
discarded in one go, but currently that is the case and I am not
aware of any plans to change that.]
Fixes: e3728b50cd ("ACPI: PM: s2idle: Avoid possible race related to the EC GPE")
Cc: 5.4+ <stable@vger.kernel.org> # 5.4+
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
There's list corruption on cgrp_cpuctx_list. This happens on the
following path:
perf_cgroup_switch: list_for_each_entry(cgrp_cpuctx_list)
cpu_ctx_sched_in
ctx_sched_in
ctx_pinned_sched_in
merge_sched_in
perf_cgroup_event_disable: remove the event from the list
Use list_for_each_entry_safe() to allow removing an entry during
iteration.
Fixes: 058fe1c044 ("perf/core: Make cgroup switch visit only cpuctxs with cgroup events")
Signed-off-by: Song Liu <song@kernel.org>
Reviewed-by: Rik van Riel <riel@surriel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20220204004057.2961252-1-song@kernel.org
Syzbot found a GPF in reweight_entity. This has been bisected to
commit 4ef0c5c6b5 ("kernel/sched: Fix sched_fork() access an invalid
sched_task_group")
There is a race between sched_post_fork() and setpriority(PRIO_PGRP)
within a thread group that causes a null-ptr-deref in
reweight_entity() in CFS. The scenario is that the main process spawns
number of new threads, which then call setpriority(PRIO_PGRP, 0, -20),
wait, and exit. For each of the new threads the copy_process() gets
invoked, which adds the new task_struct and calls sched_post_fork()
for it.
In the above scenario there is a possibility that
setpriority(PRIO_PGRP) and set_one_prio() will be called for a thread
in the group that is just being created by copy_process(), and for
which the sched_post_fork() has not been executed yet. This will
trigger a null pointer dereference in reweight_entity(), as it will
try to access the run queue pointer, which hasn't been set.
Before the mentioned change the cfs_rq pointer for the task has been
set in sched_fork(), which is called much earlier in copy_process(),
before the new task is added to the thread_group. Now it is done in
the sched_post_fork(), which is called after that. To fix the issue
the remove the update_load param from the update_load param() function
and call reweight_task() only if the task flag doesn't have the
TASK_NEW flag set.
Fixes: 4ef0c5c6b5 ("kernel/sched: Fix sched_fork() access an invalid sched_task_group")
Reported-by: syzbot+af7a719bc92395ee41b3@syzkaller.appspotmail.com
Signed-off-by: Tadeusz Struk <tadeusz.struk@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20220203161846.1160750-1-tadeusz.struk@linaro.org
- Intel/PT: filters could crash the kernel
- Intel: default disable the PMU for SMM, some new-ish EFI firmware has
started using CPL3 and the PMU CPL filters don't discriminate against
SMM, meaning that CPL3 (userspace only) events now also count EFI/SMM
cycles.
- Fixup for perf_event_attr::sig_data
(Peter Zijlstra)
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmH/vpgACgkQEsHwGGHe
VUqIGw/9EWg7Ek89BG9ZZui8EEDAzx3x0s/tyxiz0z18YfvtTnex5I87uJUYpw2s
hFhxxmGN+rwhcMGQDc0sDLLLxp170Yg0383N6OBBBMWPtNyxMWihBOHQgz8hQzbW
KtwoiBewmvAycHw0aoOtDMqFZTn5RToONnG9h7yV9rUIGKq75XNh72MBy9sCSE2F
w8lA3WWVTrv91YTPSMbsrm/tMC6eQCRiJGRMHTapxrWxkVu/H8O42pxJgS6dlo+h
vw025hXcf0KGBLzwVSHYdZg8jMn7uD2oSMh+wQ+Jy15XjKVWDfF1m3sA5S+zSJsS
THHtmqni5mF5xn0H7eOK9nYmRXR013zx6weo9miK4SN1pcoJq+PTNdSZOIwBm3Nh
eUXR/bXFYL0GGuPOk0QHA9AjqbCBPrkiw1nfppbJem2rrZ0uKHyKa8REVcVg/Xzy
e/nDy8I2y2bnwU9Ugk9BNWBRmn54Q2kb4/egmtLME6oYiqOXumQ4ZB/CmwRaSwxG
bB99/tBKblrWSA6wcgATkqYFSg4ZJniDxKipnrEYX8ePkGODKHoIQS4EUyjxuPW/
fO2G4Oe8aO/qYS/yei8XcubyEFaSPyUo+th+ZiPODCt15JKzQCAxeOYxqnEI4I4s
5afDBmAo47bs9Eem7GRjZOgrDOP88+lISZ1rZidp5paDwWAmL2E=
=0tH5
-----END PGP SIGNATURE-----
Merge tag 'perf_urgent_for_v5.17_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Borislav Petkov:
- Intel/PT: filters could crash the kernel
- Intel: default disable the PMU for SMM, some new-ish EFI firmware has
started using CPL3 and the PMU CPL filters don't discriminate against
SMM, meaning that CPL3 (userspace only) events now also count EFI/SMM
cycles.
- Fixup for perf_event_attr::sig_data
* tag 'perf_urgent_for_v5.17_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86/intel/pt: Fix crash with stop filters in single-range mode
perf: uapi: Document perf_event_attr::sig_data truncation on 32 bit architectures
selftests/perf_events: Test modification of perf_event_attr::sig_data
perf: Copy perf_event_attr::sig_data on modification
x86/perf: Default set FREEZE_ON_SMI for all
The to_gov_attr_set() has been moved to the cpufreq.h, so use it to get
the gov_attr_set.
Signed-off-by: Kevin Hao <haokexin@gmail.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
While the stackleak plugin was already using notrace, objtool is now a
bit more picky. Update the notrace uses to noinstr. Silences the
following objtool warnings when building with:
CONFIG_DEBUG_ENTRY=y
CONFIG_STACK_VALIDATION=y
CONFIG_VMLINUX_VALIDATION=y
CONFIG_GCC_PLUGIN_STACKLEAK=y
vmlinux.o: warning: objtool: do_syscall_64()+0x9: call to stackleak_track_stack() leaves .noinstr.text section
vmlinux.o: warning: objtool: do_int80_syscall_32()+0x9: call to stackleak_track_stack() leaves .noinstr.text section
vmlinux.o: warning: objtool: exc_general_protection()+0x22: call to stackleak_track_stack() leaves .noinstr.text section
vmlinux.o: warning: objtool: fixup_bad_iret()+0x20: call to stackleak_track_stack() leaves .noinstr.text section
vmlinux.o: warning: objtool: do_machine_check()+0x27: call to stackleak_track_stack() leaves .noinstr.text section
vmlinux.o: warning: objtool: .text+0x5346e: call to stackleak_erase() leaves .noinstr.text section
vmlinux.o: warning: objtool: .entry.text+0x143: call to stackleak_erase() leaves .noinstr.text section
vmlinux.o: warning: objtool: .entry.text+0x10eb: call to stackleak_erase() leaves .noinstr.text section
vmlinux.o: warning: objtool: .entry.text+0x17f9: call to stackleak_erase() leaves .noinstr.text section
Note that the plugin's addition of calls to stackleak_track_stack() from
noinstr functions is expected to be safe, as it isn't runtime
instrumentation and is self-contained.
Cc: Alexander Popov <alex.popov@linux.com>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
and ieee802154.
Current release - regressions:
- Partially revert "net/smc: Add netlink net namespace support",
fix uABI breakage
- netfilter:
- nft_ct: fix use after free when attaching zone template
- nft_byteorder: track register operations
Previous releases - regressions:
- ipheth: fix EOVERFLOW in ipheth_rcvbulk_callback
- phy: qca8081: fix speeds lower than 2.5Gb/s
- sched: fix use-after-free in tc_new_tfilter()
Previous releases - always broken:
- tcp: fix mem under-charging with zerocopy sendmsg()
- tcp: add missing tcp_skb_can_collapse() test in tcp_shift_skb_data()
- neigh: do not trigger immediate probes on NUD_FAILED from
neigh_managed_work, avoid a deadlock
- bpf: use VM_MAP instead of VM_ALLOC for ringbuf, avoid KASAN
false-positives
- netfilter: nft_reject_bridge: fix for missing reply from prerouting
- smc: forward wakeup to smc socket waitqueue after fallback
- ieee802154:
- return meaningful error codes from the netlink helpers
- mcr20a: fix lifs/sifs periods
- at86rf230, ca8210: stop leaking skbs on error paths
- macsec: add missing un-offload call for NETDEV_UNREGISTER of parent
- ax25: add refcount in ax25_dev to avoid UAF bugs
- eth: mlx5e:
- fix SFP module EEPROM query
- fix broken SKB allocation in HW-GRO
- IPsec offload: fix tunnel mode crypto for non-TCP/UDP flows
- eth: amd-xgbe:
- fix skb data length underflow
- ensure reset of the tx_timer_active flag, avoid Tx timeouts
- eth: stmmac: fix runtime pm use in stmmac_dvr_remove()
- eth: e1000e: handshake with CSME starts from Alder Lake platforms
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmH8X9UACgkQMUZtbf5S
IrsxuhAAlAvFHGL6y5Y2gAmhKvVUvCYjiIJBcvk7R66CwYVRxofvlhmxi6GM/Czs
9SrVSaN4RXu3p3d7UtAl1gAQwHqzLIHH3m2g5dSKVvHZWQgkm/+n74x0aZQ9Fll7
mWs9uu5fWsQr/qZBnnjoQTvUxRUNVd4trBy7nXGzkNqJL5j0+2TT4BhH4qalhE28
iPc9YFCyKPdjoWFksteZqD3hAQbXxK/xRRr6xuvFHENlZdEHM6ARftHnJthTG/fY
32rdn9YUkQ9lNtOBJNMN9yP2z1B7TcxASBqjjk55I7XtT1QAI9/PskszavHC0hOk
BCSMX779bLNW4+G0wiSKVB4tq4tvswtawq8Hxa6zdU4TKIzfQ84ZL/Nf66GtH+4W
C0mbZohmyJV9hQFkNT0ZLeihljd7i4BkDttlbK3uz2IL9tHeX3uSo5V7AgS/Xaf6
frXgbGgjQTaR6IL9AUhfN3GTCx60mzpH/aRpFho8A5xAl3EtHWCJcRhbY/CEhQBR
zyCndcLcG5mUzbhx/TxlKrrpRCLxqCUG/Tsb2wCh5jMxO1zonW9Hhv4P1ie6EFuI
h+XiJT2WWObS/KTze9S86WOR0zcqrtRqaOGJlNB+/+K8ClZU8UsDTFXLQ0dqpVZF
Mvp7VchBzyFFJrrvO8WkkJgLTKdaPJmM9wuWUZb4J6d2MWlmDkE=
=qKvf
-----END PGP SIGNATURE-----
Merge tag 'net-5.17-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Including fixes from bpf, netfilter, and ieee802154.
Current release - regressions:
- Partially revert "net/smc: Add netlink net namespace support", fix
uABI breakage
- netfilter:
- nft_ct: fix use after free when attaching zone template
- nft_byteorder: track register operations
Previous releases - regressions:
- ipheth: fix EOVERFLOW in ipheth_rcvbulk_callback
- phy: qca8081: fix speeds lower than 2.5Gb/s
- sched: fix use-after-free in tc_new_tfilter()
Previous releases - always broken:
- tcp: fix mem under-charging with zerocopy sendmsg()
- tcp: add missing tcp_skb_can_collapse() test in
tcp_shift_skb_data()
- neigh: do not trigger immediate probes on NUD_FAILED from
neigh_managed_work, avoid a deadlock
- bpf: use VM_MAP instead of VM_ALLOC for ringbuf, avoid KASAN
false-positives
- netfilter: nft_reject_bridge: fix for missing reply from prerouting
- smc: forward wakeup to smc socket waitqueue after fallback
- ieee802154:
- return meaningful error codes from the netlink helpers
- mcr20a: fix lifs/sifs periods
- at86rf230, ca8210: stop leaking skbs on error paths
- macsec: add missing un-offload call for NETDEV_UNREGISTER of parent
- ax25: add refcount in ax25_dev to avoid UAF bugs
- eth: mlx5e:
- fix SFP module EEPROM query
- fix broken SKB allocation in HW-GRO
- IPsec offload: fix tunnel mode crypto for non-TCP/UDP flows
- eth: amd-xgbe:
- fix skb data length underflow
- ensure reset of the tx_timer_active flag, avoid Tx timeouts
- eth: stmmac: fix runtime pm use in stmmac_dvr_remove()
- eth: e1000e: handshake with CSME starts from Alder Lake platforms"
* tag 'net-5.17-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (69 commits)
ax25: fix reference count leaks of ax25_dev
net: stmmac: ensure PTP time register reads are consistent
net: ipa: request IPA register values be retained
dt-bindings: net: qcom,ipa: add optional qcom,qmp property
tools/resolve_btfids: Do not print any commands when building silently
bpf: Use VM_MAP instead of VM_ALLOC for ringbuf
net, neigh: Do not trigger immediate probes on NUD_FAILED from neigh_managed_work
tcp: add missing tcp_skb_can_collapse() test in tcp_shift_skb_data()
net: sparx5: do not refer to skb after passing it on
Partially revert "net/smc: Add netlink net namespace support"
net/mlx5e: Avoid field-overflowing memcpy()
net/mlx5e: Use struct_group() for memcpy() region
net/mlx5e: Avoid implicit modify hdr for decap drop rule
net/mlx5e: IPsec: Fix tunnel mode crypto offload for non TCP/UDP traffic
net/mlx5e: IPsec: Fix crypto offload for non TCP/UDP encapsulated traffic
net/mlx5e: Don't treat small ceil values as unlimited in HTB offload
net/mlx5: E-Switch, Fix uninitialized variable modact
net/mlx5e: Fix handling of wrong devices during bond netevent
net/mlx5e: Fix broken SKB allocation in HW-GRO
net/mlx5e: Fix wrong calculation of header index in HW_GRO
...
Daniel Borkmann says:
====================
pull-request: bpf 2022-02-03
We've added 6 non-merge commits during the last 10 day(s) which contain
a total of 7 files changed, 11 insertions(+), 236 deletions(-).
The main changes are:
1) Fix BPF ringbuf to allocate its area with VM_MAP instead of VM_ALLOC
flag which otherwise trips over KASAN, from Hou Tao.
2) Fix unresolved symbol warning in resolve_btfids due to LSM callback
rename, from Alexei Starovoitov.
3) Fix a possible race in inc_misses_counter() when IRQ would trigger
during counter update, from He Fengqing.
4) Fix tooling infra for cross-building with clang upon probing whether
gcc provides the standard libraries, from Jean-Philippe Brucker.
5) Fix silent mode build for resolve_btfids, from Nathan Chancellor.
6) Drop unneeded and outdated lirc.h header copy from tooling infra as
BPF does not require it anymore, from Sean Young.
* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
tools/resolve_btfids: Do not print any commands when building silently
bpf: Use VM_MAP instead of VM_ALLOC for ringbuf
tools: Ignore errors from `which' when searching a GCC toolchain
tools headers UAPI: remove stale lirc.h
bpf: Fix possible race in inc_misses_counter
bpf: Fix renaming task_getsecid_subj->current_getsecid_subj.
====================
Link: https://lore.kernel.org/r/20220203155815.25689-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The move of proc_dointvec_minmax_sysadmin() from kernel/sysctl.c to
kernel/printk/sysctl.c introduced an incorrect __user attribute to the
buffer argument. I spotted this change in [1] as well as the kernel
test robot. Revert this change to please sparse:
kernel/printk/sysctl.c:20:51: warning: incorrect type in argument 3 (different address spaces)
kernel/printk/sysctl.c:20:51: expected void *
kernel/printk/sysctl.c:20:51: got void [noderef] __user *buffer
Fixes: faaa357a55 ("printk: move printk sysctl to printk/sysctl.c")
Link: https://lore.kernel.org/r/20220104155024.48023-2-mic@digikod.net [1]
Reported-by: kernel test robot <lkp@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: John Ogness <john.ogness@linutronix.de>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Xiaoming Ni <nixiaoming@huawei.com>
Signed-off-by: Mickaël Salaün <mic@linux.microsoft.com>
Link: https://lore.kernel.org/r/20220203145029.272640-1-mic@digikod.net
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This reverts commit 774a1221e8.
We need to finish all async code before the module init sequence is
done. In the reverted commit the PF_USED_ASYNC flag was added to mark a
thread that called async_schedule(). Then the PF_USED_ASYNC flag was
used to determine whether or not async_synchronize_full() needs to be
invoked. This works when modprobe thread is calling async_schedule(),
but it does not work if module dispatches init code to a worker thread
which then calls async_schedule().
For example, PCI driver probing is invoked from a worker thread based on
a node where device is attached:
if (cpu < nr_cpu_ids)
error = work_on_cpu(cpu, local_pci_probe, &ddi);
else
error = local_pci_probe(&ddi);
We end up in a situation where a worker thread gets the PF_USED_ASYNC
flag set instead of the modprobe thread. As a result,
async_synchronize_full() is not invoked and modprobe completes without
waiting for the async code to finish.
The issue was discovered while loading the pm80xx driver:
(scsi_mod.scan=async)
modprobe pm80xx worker
...
do_init_module()
...
pci_call_probe()
work_on_cpu(local_pci_probe)
local_pci_probe()
pm8001_pci_probe()
scsi_scan_host()
async_schedule()
worker->flags |= PF_USED_ASYNC;
...
< return from worker >
...
if (current->flags & PF_USED_ASYNC) <--- false
async_synchronize_full();
Commit 21c3c5d280 ("block: don't request module during elevator init")
fixed the deadlock issue which the reverted commit 774a1221e8
("module, async: async_synchronize_full() on module init iff async is
used") tried to fix.
Since commit 0fdff3ec6d ("async, kmod: warn on synchronous
request_module() from async workers") synchronous module loading from
async is not allowed.
Given that the original deadlock issue is fixed and it is no longer
allowed to call synchronous request_module() from async we can remove
PF_USED_ASYNC flag to make module init consistently invoke
async_synchronize_full() unless async module probe is requested.
Signed-off-by: Igor Pylypiv <ipylypiv@google.com>
Reviewed-by: Changyuan Lyu <changyuanl@google.com>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull cgroup fixes from Tejun Heo:
- Eric's fix for a long standing cgroup1 permission issue where it only
checks for uid 0 instead of CAP which inadvertently allows
unprivileged userns roots to modify release_agent userhelper
- Fixes for the fallout from Waiman's recent cpuset work
* 'for-5.17-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup/cpuset: Fix "suspicious RCU usage" lockdep warning
cgroup-v1: Require capabilities to set release_agent
cpuset: Fix the bug that subpart_cpus updated wrongly in update_cpumask()
cgroup/cpuset: Make child cpusets restrict parents on v1 hierarchy
It was found that a "suspicious RCU usage" lockdep warning was issued
with the rcu_read_lock() call in update_sibling_cpumasks(). It is
because the update_cpumasks_hier() function may sleep. So we have
to release the RCU lock, call update_cpumasks_hier() and reacquire
it afterward.
Also add a percpu_rwsem_assert_held() in update_sibling_cpumasks()
instead of stating that in the comment.
Fixes: 4716909cc5 ("cpuset: Track cpusets that use parent's effective_cpus")
Signed-off-by: Waiman Long <longman@redhat.com>
Tested-by: Phil Auld <pauld@redhat.com>
Reviewed-by: Phil Auld <pauld@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
After commit 2fd3fb0be1d1 ("kasan, vmalloc: unpoison VM_ALLOC pages
after mapping"), non-VM_ALLOC mappings will be marked as accessible
in __get_vm_area_node() when KASAN is enabled. But now the flag for
ringbuf area is VM_ALLOC, so KASAN will complain out-of-bound access
after vmap() returns. Because the ringbuf area is created by mapping
allocated pages, so use VM_MAP instead.
After the change, info in /proc/vmallocinfo also changes from
[start]-[end] 24576 ringbuf_map_alloc+0x171/0x290 vmalloc user
to
[start]-[end] 24576 ringbuf_map_alloc+0x171/0x290 vmap user
Fixes: 457f44363a ("bpf: Implement BPF ring buffer and verifier support for it")
Reported-by: syzbot+5ad567a418794b9b5983@syzkaller.appspotmail.com
Signed-off-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20220202060158.6260-1-houtao1@huawei.com
Rename blk_flush_plug to __blk_flush_plug and add a wrapper that includes
the NULL check instead of open coding that check everywhere.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20220127070549.1377856-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
blk_needs_flush_plug fails to account for the cb_list, which needs
flushing as well. Remove it and just check if there is a plug instead
of poking into the internals of the plug structure.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220127070549.1377856-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pass the block_device and operation that we plan to use this bio for to
bio_alloc to optimize the assignment. NULL/0 can be passed, both for the
passthrough case on a raw request_queue and to temporarily avoid
refactoring some nasty code.
Also move the gfp_mask argument after the nr_vecs argument for a much
more logical calling convention matching what most of the kernel does.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20220124091107.642561-18-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There is no good reason to keep genhd.h separate from the main blkdev.h
header that includes it. So fold the contents of genhd.h into blkdev.h
and remove genhd.h entirely.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20220124093913.742411-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The so-called 'kernel' address filter can also be useful for filtering
fixed addresses in user space. Allow that.
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20220131072453.2839535-6-adrian.hunter@intel.com
Reset appropriate variables in the parser loop between parsing separate
filters, so that they do not interfere with parsing the next filter.
Fixes: 375637bc52 ("perf/core: Introduce address range filtering")
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20220131072453.2839535-4-adrian.hunter@intel.com
The intent has always been that perf_event_attr::sig_data should also be
modifiable along with PERF_EVENT_IOC_MODIFY_ATTRIBUTES, because it is
observable by user space if SIGTRAP on events is requested.
Currently only PERF_TYPE_BREAKPOINT is modifiable, and explicitly copies
relevant breakpoint-related attributes in hw_breakpoint_copy_attr().
This misses copying perf_event_attr::sig_data.
Since sig_data is not specific to PERF_TYPE_BREAKPOINT, introduce a
helper to copy generic event-type-independent attributes on
modification.
Fixes: 97ba62b278 ("perf: Add support for SIGTRAP on perf events")
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
Link: https://lore.kernel.org/r/20220131103407.1971678-1-elver@google.com
move autogroup sysctls to autogroup.c and use the new
register_sysctl_init() to register the sysctl interface.
Signed-off-by: Zhen Ni <nizhen@uniontech.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20220128095025.8745-1-nizhen@uniontech.com
The rseq rseq_cs.ptr.{ptr32,padding} uapi endianness handling is
entirely wrong on 32-bit little endian: a preprocessor logic mistake
wrongly uses the big endian field layout on 32-bit little endian
architectures.
Fortunately, those ptr32 accessors were never used within the kernel,
and only meant as a convenience for user-space.
Remove those and replace the whole rseq_cs union by a __u64 type, as
this is the only thing really needed to express the ABI. Document how
32-bit architectures are meant to interact with this field.
Fixes: ec9c82e03a ("rseq: uapi: Declare rseq_cs field as union, update includes")
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20220127152720.25898-1-mathieu.desnoyers@efficios.com
A watchdog maximum skew of 100us may still be too small for
some systems or archs. It may also be too small when some kernel
debug config options are enabled. So add a new Kconfig option
CLOCKSOURCE_WATCHDOG_MAX_SKEW_US to allow kernel builders to have more
control on the threshold for marking clocksource as unstable.
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
This commit allows up to 50,000 callbacks worth of callback-flooding
tests of SRCU. The goal of this change is to exercise Tree SRCU's
ability to transition from SRCU_SIZE_SMALL to SRCU_SIZE_BIG triggered
by callback-queue-time lock contention.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Currently, _torture_create_kthread() uses kthread_run() to create
torture-test kthreads, which means that the resulting task_struct
pointer is stored after the newly created kthread has been marked
runnable. This in turn can cause spurious failure of checks for
code being run by a particular kthread. This commit therefore changes
_torture_create_kthread() to use kthread_create(), then to do an explicit
wake_up_process() after the task_struct pointer has been stored.
Reported-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The rcu_torture_fwd_cb_hist() function acquires rcu_fwd_mutex, but is
invoked from rcutorture_oom_notify() function, which hold this same
mutex across this call. This commit fixes the resulting deadlock.
Reported-by: kernel test robot <oliver.sang@intel.com>
Tested-by: Oliver Sang <oliver.sang@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The second and subsequent forward-progress kthreads loop waiting for
the first forward-progress kthread to start the next test interval.
Unfortunately, if the test ends while one of those kthreads is waiting,
the test will hang. This hang occurs because that wait loop fails to
check for the end of the test. This commit therefore adds an end-of-test
check to that wait loop.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Back when only one rcutorture kthread could do forward-progress testing,
it was just fine for rcu_fwd_cb_nodelay to be a non-atomic bool. It was
set at the start of forward-progress testing and cleared at the end.
But now that there are multiple threads, the value can be cleared while
one of the threads is still doing forward-progress testing. This commit
therefore makes rcu_fwd_cb_nodelay be an atomic counter, replacing the
WRITE_ONCE() operations with atomic_inc() and atomic_dec().
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
This commit adds a few pr_alert() calls to rcutorture's forward-progress
testing in order to better diagnose shutdown-time hangs.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Right now, if a given kthread (call it "kthread") realizes that it needs
to stop, "Stopping kthread" is written to the console. When the cleanup
code decides that it is time to stop that kthread, "Stopping kthread
tasks" is written to the console. These two events might happen in
either order, especially in the case of time-based torture-test shutdown.
But it is hard to distinguish these, especially for those unfamiliar with
the torture tests. This commit therefore changes the first case from
"Stopping kthread" to "kthread is stopping" to make things more clear.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The various ->cb_barrier() functions, for example, rcu_barrier(),
sometimes cause rcutorture hangs. But currently, the last console
message is the unenlightening "Stopping rcu_torture_stats". This commit
therefore prints a message of the form "rcu_torture_cleanup: Invoking
rcu_barrier+0x0/0x1e0()" to help point people in the right direction.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
When the rcutree.use_softirq kernel boot parameter is set to zero, all
RCU_SOFTIRQ processing is carried out by the per-CPU rcuc kthreads.
If these kthreads are being starved, quiescent states will not be
reported, which in turn means that the grace period will not end, which
can in turn trigger RCU CPU stall warnings. This commit therefore dumps
stack traces of stalled CPUs' rcuc kthreads, which can help identify
what is preventing those kthreads from running.
Suggested-by: Ammar Faizi <ammarfaizi2@gnuweeb.org>
Reviewed-by: Ammar Faizi <ammarfaizi2@gnuweeb.org>
Signed-off-by: Zqiang <qiang1.zhang@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Currently rcu_preempt_deferred_qs_irqrestore() releases rnp->boost_mtx
before reporting the expedited quiescent state. Under heavy real-time
load, this can result in this function being preempted before the
quiescent state is reported, which can in turn prevent the expedited grace
period from completing. Tim Murray reports that the resulting expedited
grace periods can take hundreds of milliseconds and even more than one
second, when they should normally complete in less than a millisecond.
This was fine given that there were no particular response-time
constraints for synchronize_rcu_expedited(), as it was designed
for throughput rather than latency. However, some users now need
sub-100-millisecond response-time constratints.
This patch therefore follows Neeraj's suggestion (seconded by Tim and
by Uladzislau Rezki) of simply reversing the two operations.
Reported-by: Tim Murray <timmurray@google.com>
Reported-by: Joel Fernandes <joelaf@google.com>
Reported-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Reviewed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Tested-by: Tim Murray <timmurray@google.com>
Cc: Todd Kjos <tkjos@google.com>
Cc: Sandeep Patil <sspatil@google.com>
Cc: <stable@vger.kernel.org> # 5.4.x
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
When CONFIG_PREEMPT_RT=y, the rcutree.kthread_prio command-line
parameter signals initialization code to boost the priority of rcuc
callbacks to the designated value. With the additional
CONFIG_RCU_NOCB_CPU=y configuration and an additional rcu_nocbs
command-line parameter, the callbacks on the listed cores are
offloaded to new rcuop kthreads that are not pinned to the cores whose
post-grace-period work is performed. While the rcuop kthreads perform
the same function as the rcuc kthreads they offload, the kthread_prio
parameter only boosts the priority of the rcuc kthreads. Fix this
inconsistency by elevating rcuop kthreads to the same priority as the rcuc
kthreads.
Signed-off-by: Alison Chaiken <achaiken@aurora.tech>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The priority of RCU grace period threads is set to kthread_prio when
they are launched from rcu_spawn_gp_kthread(). The same is not true
of rcu_spawn_one_nocb_kthread(). Accordingly, add priority elevation
to rcu_spawn_one_nocb_kthread().
Signed-off-by: Alison Chaiken <achaiken@aurora.tech>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Move the bounds-check of the kthread_prio cmdline parameter to a new
function in order to faciliate a different callsite.
Signed-off-by: Alison Chaiken <achaiken@aurora.tech>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The per-CPU "rcuc" kthreads are used only by kernels booted with
rcutree.use_softirq=0, but they are nevertheless unconditionally created
by kernels built with CONFIG_RCU_BOOST=y. This results in "rcuc"
kthreads being created that are never actually used. This commit
therefore refrains from creating these kthreads unless the kernel
is actually booted with rcutree.use_softirq=0.
Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Zqiang <qiang1.zhang@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
When multiple CPUs in the same nocb gp/cb group concurrently
come online, they might try to concurrently create the same
rcuog kthread. Fix this by using nocb gp CPU's spawn mutex to
provide mutual exclusion for the rcuog kthread creation code.
[ paulmck: Whitespace fixes per kernel test robot feedback. ]
Acked-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The boost_starttime shared variable has conflicting unmarked C-language
accesses, which are dangerous at best. This commit therefore adds
appropriate marking. This was found by KCSAN.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
This commit adds a READ_ONCE() to an access to the rcu_node structure's
->expmask field to prevent compiler mischief. Detected by KCSAN.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
For PREEMPT_RCU, the rcu_exp_handler() function checks
whether the current CPU is in idle, by calling
rcu_dynticks_curr_cpu_in_eqs(). However, rcu_exp_handler()
is called in IPI handler context. So, it should be checking
the idle context using rcu_is_cpu_rrupt_from_idle(). Fix this
by using rcu_is_cpu_rrupt_from_idle() instead of
rcu_dynticks_curr_cpu_in_eqs(). Non-preempt configuration
already uses the correct check.
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJIBAABCAAyFiEES0KozwfymdVUl37v6iDy2pc3iXMFAmH4I8oUHHBhdWxAcGF1
bC1tb29yZS5jb20ACgkQ6iDy2pc3iXP9zw//ZAU42ylQXvGfLJbCsbZHKq7H/Ljo
QKbwlSKh+aP+IXcRxnh9fV3vTkc7LkHZfrhGx35aHlS1HK8XIa/XHHNTPHfaEKai
iDiXwICP5j1JysWCKJXu0uhe7juW3ko7+bQyI8MYBAeLKrbC1F04oswSgVNuX2bq
aMyD9+GdAv7PiVVw0oc5+hKFS/8q/VRdbCsJmESKUDhthS3hqH8wZJqy37aYIpF1
/yqkvEIts+GzeQrSWvsGL+O720GIyZ8V2/cEH2y+pnGYgoYdlXhD1CccXOOCocb8
M/6uQZTgQiEVf1LMuu/WIW6CzrRQIjOt4SNU0cXLkWlgxAN1p5b9sP+YcncWU23N
zbIACFiFVz1ZhxHT0AEVs+thdZrF6CJX0xfsb+GvJJeYy9aw11s7VGlYhaM+1haG
8oeYmtjQ+rjjkEKMUcOWQYxRvCZIsI6z5JlqoFC0zuJda1k3418LyDZARwCUrm1c
6QY35M0HHxa8k7TWtvJ6aopxM4pg+ZL8WS0shULHRqw/NprvYc6KeZc0/VNyFojJ
S4wa+Z6rXoYIvGReeDkUOiJjigW78/kyQR2rsxHWaGlDfL8+bALdnorkTbU2G4oG
Jl1HQdOSuAeAT/D7w/UmKFXDBPXHU77sLnjxiCDMmbmVY6Vxja/hZD/wy9RyyK4K
UAXDJNwhndHsgX4=
=sFi4
-----END PGP SIGNATURE-----
Merge tag 'audit-pr-20220131' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit
Pull audit fix from Paul Moore:
"A single audit patch to fix problems relating to audit queuing and
system responsiveness when "audit=1" is specified on the kernel
command line and the audit daemon is SIGSTOP'd for an extended period
of time"
* tag 'audit-pr-20220131' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit:
audit: improve audit queue handling when "audit=1" on cmdline
The cgroup release_agent is called with call_usermodehelper. The function
call_usermodehelper starts the release_agent with a full set fo capabilities.
Therefore require capabilities when setting the release_agaent.
Reported-by: Tabitha Sable <tabitha.c.sable@gmail.com>
Tested-by: Tabitha Sable <tabitha.c.sable@gmail.com>
Fixes: 81a6a5cdd2 ("Task Control Groups: automatic userspace notification of idle cgroups")
Cc: stable@vger.kernel.org # v2.6.24+
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
padata_do_parallel() calls cpumask_weight() to check if any bit of a
given cpumask is set. We can do it more efficiently with cpumask_empty()
because cpumask_empty() stops traversing the cpumask as soon as it finds
first set bit, while cpumask_weight() counts all bits unconditionally.
Signed-off-by: Yury Norov <yury.norov@gmail.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
except the one it belongs to, to avoid list corruption
- Make sure parent events are always woken up to avoid indefinite hangs
in the traced workload
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmH2cukACgkQEsHwGGHe
VUrQZA//WGHM7aNDp9xNkEhSsH6tDZqYSzg8yxzQGrBx3hNuTraj1imHVsCXLeU1
yCS5hBr3LdnhsAL+IaNtgTfOyM7m9SLtvFWTLINY2j7L/DkljagIB3Ur+8No565U
a/gd1UbeW2btRYUidFmXokIXMxMcsHCawX1hf1FKm7tP2b+DvD9uGePyxHmwk9BB
o5HTc1ft6qPt7LFfTxrtXEg76Z8DFovusvaPgISmv5RFnvLLwMnY6CN/oMUP+Fr+
PyCxFatgscEDgiDqRWNYvDyKOBV9dDa72Pab/yFoREM7Pc88sLhy9ZSPaf8ZiJDz
fvqBVl9d93i2UExNSdedC4dq8l3s0ioHOd4soSdBTVkrUxYTPvx31WfNLUigNxws
q7vJulsOUCPApuRvybTBy8A12ksjlT8qNDC3RLYfEKNuNMeYzXiZnijJ98FdVirF
FFv+5pPCzz1dhcsu9rMyuNgFwaIik8ptnt4ffaTsy70mT5h1UmKZsLr3clurRX8v
XpzAVCcJKiDBbc4NdtweQqm2zoY9Uxwhez3+wxmlZXndOPdcYbACKQ65U0bl6OYw
Qir8WZMOzHnlJLItPuH+TMuoJ7TdFwD9Sndiih+sKK7PEfU3XvWIRATvyjCoPINA
FfqX59lLDVBpjapdODVCt7St9e7VKAxPuc9yZIBJOo7K0QRLquE=
=+ssI
-----END PGP SIGNATURE-----
Merge tag 'perf_urgent_for_v5.17_rc2_p2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Borislav Petkov:
- Prevent accesses to the per-CPU cgroup context list from another CPU
except the one it belongs to, to avoid list corruption
- Make sure parent events are always woken up to avoid indefinite hangs
in the traced workload
* tag 'perf_urgent_for_v5.17_rc2_p2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/core: Fix cgroup event list management
perf: Always wake the parent event
set when querying membarrier(2) commands through MEMBARRIER_CMD_QUERY
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmH2cDYACgkQEsHwGGHe
VUo7YhAAs4trrOaq5KCr2YW7agAHqYjYsMfPLEBUOvJ4hGggMAYCwGXYDdzMo8UH
tAxLfh2fyy3VUNKPeLMqz6XVn697oTQnSAkVBDJIcMJlctZGvXmwNnv3g133xlML
n4xLhVAfC76HA0Q0zkPZ6gdN/VBrRl94n3mAWxB9FVFUDwFIMr6PkGrjw9Olml/r
Xm2WUuKfvwHqEhNUDu2rS0qqal3mPO6DV+4Y7JGyQL5fYAu3HfabZV4CBfYt5z3P
m3y4Y2yfpWcB4D6KfXHPQwLhWOoxVZ0X2YiAjfj8x/+dYaTIoBSg/EG+44nX6tBV
RcYK9gARwfCgSdcfWZFYYfHlGbNv1x/HHOVvkuJEzfCya4+FQTRdrqkoOWsPUPk1
jqa9Ybz9wQxJ2GI57wT7W+fyl2M+agyvwywrELLy6w6nwAKdWpbIGW4b1ummfIk2
MGqL99Hge6aX9ONFT6IA4rHd0fvWNC00l4evzNVfyHQnYN3f0ul8h8p950F09I25
apSalhyz40LbJdKsRFGt8CjIgM2Y+rP87JI8ZawjauWFIp+lLiXcLOuLXTNJjAe6
Sw9EWkkr3sTHzOxudrFHz/QwM+m7KkoYnuGQw0gDqzXdqc1Gpc91llLqqWNuyX++
y3NWzEdmnjlV+H1Fls94UhrdsHAiI8d+OTH3fQVY6VEtGcLQJ2c=
=kI2Z
-----END PGP SIGNATURE-----
Merge tag 'sched_urgent_for_v5.17_rc2_p2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fix from Borislav Petkov:
"Make sure the membarrier-rseq fence commands are part of the reported
set when querying membarrier(2) commands through MEMBARRIER_CMD_QUERY"
* tag 'sched_urgent_for_v5.17_rc2_p2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/membarrier: Fix membarrier-rseq fence command missing from query bitmask
When CONFIG_PROC_FS is disabled psi code generates the following
warnings:
kernel/sched/psi.c:1364:30: warning: 'psi_cpu_proc_ops' defined but not used [-Wunused-const-variable=]
1364 | static const struct proc_ops psi_cpu_proc_ops = {
| ^~~~~~~~~~~~~~~~
kernel/sched/psi.c:1355:30: warning: 'psi_memory_proc_ops' defined but not used [-Wunused-const-variable=]
1355 | static const struct proc_ops psi_memory_proc_ops = {
| ^~~~~~~~~~~~~~~~~~~
kernel/sched/psi.c:1346:30: warning: 'psi_io_proc_ops' defined but not used [-Wunused-const-variable=]
1346 | static const struct proc_ops psi_io_proc_ops = {
| ^~~~~~~~~~~~~~~
Make definitions of these structures and related functions conditional
on CONFIG_PROC_FS config.
Link: https://lkml.kernel.org/r/20220119223940.787748-3-surenb@google.com
Fixes: 0e94682b73 ("psi: introduce psi monitor")
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reported-by: kernel test robot <lkp@intel.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
- Make the buffer handling in pm_show_wakelocks() more robust by
using sysfs_emit_at() in it to generate output (Greg Kroah-Hartman).
- Drop register_nosave_region_late() which is not used (Amadeusz
Sławiński).
-----BEGIN PGP SIGNATURE-----
iQJGBAABCAAwFiEE4fcc61cGeeHD/fCwgsRv/nhiVHEFAmHz6CASHHJqd0Byand5
c29ja2kubmV0AAoJEILEb/54YlRxwEMP/2WnqXjyNvhpw8x0qqz7qHU+2SJuqZBT
FRyY574tGQlBqsatfJAvWVtC8hFOHAKClKNinxemwXNzHHUv5era4WNTu0xvdykf
w6pCgHfHGd+eHB9hWD4Q4lXL3ohv16A3859ldprChN0vygmnoPOBXcvJMO277dNn
b1IobLhr3HMNGK3d4eZY/+xN2SP/dOyJmzK+R8reUGwzVa52PNPii0jxF0N2FyWg
hKIUKNnryQ56TMJyNbpJM47zOe7IgN7YGjSuI1Vx8P5HfSYCmQ+EVmv7qhN6THX/
s1ONWUwZxn/8cgGX/5XCK9b4cAwHZZ2peC/AWueL4qFmXgHHVNRC2rm5Ywn2eVFd
pyyNsZGWLXxpS02g69CR32fiUn3Hh/YKFRDOtg/nibTJ1Gu2xFeGo8xBYqDp26Zl
z2squ2zWBLxnT6cg2ie9mjy7KTcIcCecWiEv20n3b3qQ+kd2e0AKvV+LN8IeisFI
q1cHnAwzgG9oMayDMDl2LdzpH9UvxIGJyT8ucSFoShK2TlV/ZPGz0HkZHRLOvFPc
VAhAP+E9dNr2bd11Eme9Ts02PvKeIJ7tCQ4xa/PfPwQXsg030VldQC+YXKO3CPga
70EsSyk+ryFRG7nhTVqFVSdcnKKiUFHrxWtC7a/ePvPVCjvvkEeqnqam0dRisLhg
HFDWrKFJ78i+
=P7NO
-----END PGP SIGNATURE-----
Merge tag 'pm-5.17-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management fixes from Rafael Wysocki:
"These make the buffer handling in pm_show_wakelocks() more robust and
drop an unused hibernation-related function.
Specifics:
- Make the buffer handling in pm_show_wakelocks() more robust by
using sysfs_emit_at() in it to generate output (Greg
Kroah-Hartman).
- Drop register_nosave_region_late() which is not used (Amadeusz
Sławiński)"
* tag 'pm-5.17-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
PM: hibernate: Remove register_nosave_region_late()
PM: wakeup: simplify the output logic of pm_show_wakelocks()
- Limit mcount build time sorting to only those archs that
we know it works for.
- Fix memory leak in error path of histogram setup
- Fix and clean up rel_loc array out of bounds issue
- tools/rtla documentation fixes
- Fix issues with histogram logic
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCYfQlIhQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qlXXAP9orNMD5Rkj4/2kTULIXhdx/O6l7d6f
Qq/Hy09evN1h+wEAlMIEE2Yr6tyIbO3uaoW6D8RbwG3napr/w4aUkzxGtgM=
=3MmM
-----END PGP SIGNATURE-----
Merge tag 'trace-v5.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pulltracing fixes from Steven Rostedt:
- Limit mcount build time sorting to only those archs that we know it
works for.
- Fix memory leak in error path of histogram setup
- Fix and clean up rel_loc array out of bounds issue
- tools/rtla documentation fixes
- Fix issues with histogram logic
* tag 'trace-v5.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Don't inc err_log entry count if entry allocation fails
tracing: Propagate is_signed to expression
tracing: Fix smatch warning for do while check in event_hist_trigger_parse()
tracing: Fix smatch warning for null glob in event_hist_trigger_parse()
tools/tracing: Update Makefile to build rtla
rtla: Make doc build optional
tracing/perf: Avoid -Warray-bounds warning for __rel_loc macro
tracing: Avoid -Warray-bounds warning for __rel_loc macro
tracing/histogram: Fix a potential memory leak for kstrdup()
ftrace: Have architectures opt-in for mcount build time sorting
Pull ucount rlimit fix from Eric Biederman.
Make sure the ucounts have a reference to the user namespace it refers
to, so that users that themselves don't carry such a reference around
can safely use the ucount functions.
* 'ucount-rlimit-fixes-for-v5.17-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
ucount: Make get_ucount a safe get_user replacement
This pull request fixes a math error added in 7a30871b6a ("rcu-tasks:
Introduce ->percpu_enqueue_shift for dynamic queue selection') during the
v5.17 merge window. This commit works correctly only on systems with a
power-of-two number of CPUs, which just so happens to be the kind that
rcutorture always uses by default.
This pull request fixes the math so that things also work on systems
that don't happen to have a power-of-two number of CPUs.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEbK7UrM+RBIrCoViJnr8S83LZ+4wFAmHx1pUTHHBhdWxtY2tA
a2VybmVsLm9yZwAKCRCevxLzctn7jCqkD/4unL2PTYq1lnIs3Kx9ib6/YO1uwwlm
x8p/JJXjPgr2lSDowsIWAPm1Clbvo3JjSm+aWaWIYLpwk2BomAslY860I12MqPD9
Q57Wj7A/wWcnSD5Nco/FERq2JNQF/WXws/IwNSQmphnug1LPIioHD9ubweM1d4Wn
VD3e4BzVpCxaepidBQnHlb13HyTfQ/UCazmf+/4DX5yGe+FMmHB0n3d+O59CsHGf
cj7ssQ8PrNA3S+dekqpZy8EGaNEauEQmd/duUziqoD4nlNCdRkj51kKn/uflXN9e
mwZ/LOjKuEnepnxF+6+3BsxR3GGqNX9z5n0PvC7eErgJ+REUdG7HxaLpBvd7wZ98
PNnUg+wws0t8ALsnAzS03vh9beSAb9ttvyWoRCzdKW9gG6LG9MHhpisC+koZZEDc
uBF4CeX2xS/DJ1EuVb2gFbkEP9cpITlf3RCqk5w55XnEm1UpTvdH9gaDe4+WYUD6
FD2L6PkT9Ns7aInGLJkKocj1NpJ9T7a3fxvi2OGkvQ77Yj1/3Du38k0I5N4JPxyO
CB3BwOks9tf2+yOCscjvWn+cEvu2LFqU91boXs9DkUBpGJrw09vAa0z6Pwt3H8kr
ZTn/kLhZoOfIVG5ybpETCR1DzVm2K2sFjCRTE2Fdwu0S6PXU70fwpZ+Lb2TivRdz
nyAfhPOrm6enGg==
=wRnw
-----END PGP SIGNATURE-----
Merge tag 'rcu-urgent.2022.01.26a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu
Pull RCU fix from Paul McKenney:
"This fixes a brown-paper-bag bug in RCU tasks that causes things like
BPF and ftrace to fail miserably on systems with non-power-of-two
numbers of CPUs.
It fixes a math error added in 7a30871b6a ("rcu-tasks: Introduce
->percpu_enqueue_shift for dynamic queue selection') during the v5.17
merge window. This commit works correctly only on systems with a
power-of-two number of CPUs, which just so happens to be the kind that
rcutorture always uses by default.
This pull request fixes the math so that things also work on systems
that don't happen to have a power-of-two number of CPUs"
* tag 'rcu-urgent.2022.01.26a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu:
rcu-tasks: Fix computation of CPU-to-list shift counts
The patch ec5ce09875: "tracing: Allow whitespace to surround hist
trigger filter" from Jan 15, 2018, leads to the following Smatch
static checker warning:
kernel/trace/trace_events_hist.c:6199 event_hist_trigger_parse()
warn: 'p' can't be NULL.
Since p is always checked for a NULL value at the top of loop and
nothing in the rest of the loop will set it to NULL, the warning
is correct and might as well be 1 to silence the warning.
Link: https://lkml.kernel.org/r/a1d4c79766c0cf61e20438dc35244d216633fef6.1643319703.git.zanussi@kernel.org
Fixes: ec5ce09875 ("tracing: Allow whitespace to surround hist trigger filter")
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
The recent rename of event_hist_trigger_parse() caused smatch
re-evaluation of trace_events_hist.c and as a result an old warning
was found:
kernel/trace/trace_events_hist.c:6174 event_hist_trigger_parse()
error: we previously assumed 'glob' could be null (see line 6166)
glob should never be null (and apparently smatch can also figure that
out and skip the warning when using the cross-function DB (but which
can't be used with a 0day build as it takes too much time to
generate)).
Nonetheless for clarity, remove the test but add a WARN_ON() in case
the code ever changes.
Link: https://lkml.kernel.org/r/96925e5c1f116654ada7ea0613d930b1266b5e1c.1643319703.git.zanussi@kernel.org
Fixes: f404da6e1d ("tracing: Add 'last error' error facility for hist triggers")
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
kfree() is missing on an error path to free the memory allocated by
kstrdup():
p = param = kstrdup(data->params[i], GFP_KERNEL);
So it is better to free it via kfree(p).
Link: https://lkml.kernel.org/r/tencent_C52895FD37802832A3E5B272D05008866F0A@qq.com
Cc: stable@vger.kernel.org
Fixes: d380dcde9a ("tracing: Fix now invalid var_ref_vals assumption in trace action")
Signed-off-by: Xiaoke Wang <xkernel.wang@foxmail.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
First S390 complained that the sorting of the mcount sections at build
time caused the kernel to crash on their architecture. Now PowerPC is
complaining about it too. And also ARM64 appears to be having issues.
It may be necessary to also update the relocation table for the values
in the mcount table. Not only do we have to sort the table, but also
update the relocations that may be applied to the items in the table.
If the system is not relocatable, then it is fine to sort, but if it is,
some architectures may have issues (although x86 does not as it shifts all
addresses the same).
Add a HAVE_BUILDTIME_MCOUNT_SORT that an architecture can set to say it is
safe to do the sorting at build time.
Also update the config to compile in build time sorting in the sorttable
code in scripts/ to depend on CONFIG_BUILDTIME_MCOUNT_SORT.
Link: https://lore.kernel.org/all/944D10DA-8200-4BA9-8D0A-3BED9AA99F82@linux.ibm.com/
Link: https://lkml.kernel.org/r/20220127153821.3bc1ac6e@gandalf.local.home
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Yinan Liu <yinan@linux.alibaba.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Reported-by: Sachin Sant <sachinp@linux.ibm.com>
Reviewed-by: Mark Rutland <mark.rutland@arm.com>
Tested-by: Mark Rutland <mark.rutland@arm.com> [arm64]
Tested-by: Sachin Sant <sachinp@linux.ibm.com>
Fixes: 72b3942a17 ("scripts: ftrace - move the sort-processing in ftrace_init")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
When CONFIG_PROC_FS is disabled psi code generates the following warnings:
kernel/sched/psi.c:1364:30: warning: 'psi_cpu_proc_ops' defined but not used [-Wunused-const-variable=]
1364 | static const struct proc_ops psi_cpu_proc_ops = {
| ^~~~~~~~~~~~~~~~
kernel/sched/psi.c:1355:30: warning: 'psi_memory_proc_ops' defined but not used [-Wunused-const-variable=]
1355 | static const struct proc_ops psi_memory_proc_ops = {
| ^~~~~~~~~~~~~~~~~~~
kernel/sched/psi.c:1346:30: warning: 'psi_io_proc_ops' defined but not used [-Wunused-const-variable=]
1346 | static const struct proc_ops psi_io_proc_ops = {
| ^~~~~~~~~~~~~~~
Make definitions of these structures and related functions conditional on
CONFIG_PROC_FS config.
Fixes: 0e94682b73 ("psi: introduce psi monitor")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20220119223940.787748-3-surenb@google.com
iowait_boost signal is applied independently of util and doesn't take
into account uclamp settings of the rq. An io heavy task that is capped
by uclamp_max could still request higher frequency because
sugov_iowait_apply() doesn't clamp the boost via uclamp_rq_util_with()
like effective_cpu_util() does.
Make sure that iowait_boost honours uclamp requests by calling
uclamp_rq_util_with() when applying the boost.
Fixes: 982d9cdc22 ("sched/cpufreq, sched/uclamp: Add clamps for FAIR and RT tasks")
Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Link: https://lore.kernel.org/r/20211216225320.2957053-3-qais.yousef@arm.com
sugov_update_single_{freq, perf}() contains a 'busy' filter that ensures
we don't bring the frqeuency down if there's no idle time (CPU is busy).
The problem is that with uclamp_max we will have scenarios where a busy
task is capped to run at a lower frequency and this filter prevents
applying the capping when this task starts running.
We handle this by skipping the filter when uclamp is enabled and the rq
is being capped by uclamp_max.
We introduce a new function uclamp_rq_is_capped() to help detecting when
this capping is taking effect. Some code shuffling was required to allow
using cpu_util_{cfs, rt}() in this new function.
On 2 Core SMT2 Intel laptop I see:
Without this patch:
uclampset -M 0 sysbench --test=cpu --threads = 4 run
produces a score of ~3200 consistently. Which is the highest possible.
Compiling the kernel also results in frequency running at max 3.1GHz all
the time - running uclampset -M 400 to cap it has no effect without this
patch.
With this patch:
uclampset -M 0 sysbench --test=cpu --threads = 4 run
produces a score of ~1100 with some outliers in ~1700. Uclamp max
aggregates the performance requirements, so having high values sometimes
is expected if some other task happens to require that frequency starts
running at the same time.
When compiling the kernel with uclampset -M 400 I can see the
frequencies mostly in the ~2GHz region. Helpful to conserve power and
prevent heating when not plugged in.
Fixes: 982d9cdc22 ("sched/cpufreq, sched/uclamp: Add clamps for FAIR and RT tasks")
Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20211216225320.2957053-2-qais.yousef@arm.com
We can't use this tracepoint in modules without having the symbol
exported first, fix that.
Fixes: 765047932f ("sched/pelt: Add support to track thermal pressure")
Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20211028115005.873539-1-qais.yousef@arm.com
The child processes will inherit numa_pages_migrated and
total_numa_faults from the parent. It means even if there is no numa
fault happen on the child, the statistics in /proc/$pid of the child
process might show huge amount. This is a bit weird. Let's initialize
them when do fork.
Signed-off-by: Honglei Wang <wanghonglei@didichuxing.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Mel Gorman <mgorman@suse.de>
Link: https://lore.kernel.org/r/20220113133920.49900-1-wanghonglei@didichuxing.com
The older format of /proc/pid/sched printed home node info which
required the mempolicy and task lock around mpol_get(). However
the format has changed since then and there is no need for
sched_show_numa() any more to have mempolicy argument,
asssociated mpol_get/put and task_lock/unlock. Remove them.
Fixes: 397f2378f1 ("sched/numa: Fix numa balancing stats in /proc/pid/sched")
Signed-off-by: Bharata B Rao <bharata@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Link: https://lore.kernel.org/r/20220118050515.2973-1-bharata@amd.com
When the ucount code was refactored to create get_ucount it was missed
that some of the contexts in which a rlimit is kept elevated can be
the only reference to the user/ucount in the system.
Ordinary ucount references exist in places that also have a reference
to the user namspace, but in POSIX message queues, the SysV shm code,
and the SIGPENDING code there is no independent user namespace
reference.
Inspection of the the user_namespace show no instance of circular
references between struct ucounts and the user_namespace. So
hold a reference from struct ucount to i's user_namespace to
resolve this problem.
Link: https://lore.kernel.org/lkml/YZV7Z+yXbsx9p3JN@fixkernel.com/
Reported-by: Qian Cai <quic_qiancai@quicinc.com>
Reported-by: Mathias Krause <minipli@grsecurity.net>
Tested-by: Mathias Krause <minipli@grsecurity.net>
Reviewed-by: Mathias Krause <minipli@grsecurity.net>
Reviewed-by: Alexey Gladkov <legion@kernel.org>
Fixes: d646969055 ("Reimplement RLIMIT_SIGPENDING on top of ucounts")
Fixes: 6e52a9f053 ("Reimplement RLIMIT_MSGQUEUE on top of ucounts")
Fixes: d7c9e99aee ("Reimplement RLIMIT_MEMLOCK on top of ucounts")
Cc: stable@vger.kernel.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
The ->percpu_enqueue_shift field is used to map from the running CPU
number to the index of the corresponding callback list. This mapping
can change at runtime in response to varying callback load, resulting
in varying levels of contention on the callback-list locks.
Unfortunately, the initial value of this field is correct only if the
system happens to have a power-of-two number of CPUs, otherwise the
callbacks from the high-numbered CPUs can be placed into the callback list
indexed by 1 (rather than 0), and those index-1 callbacks will be ignored.
This can result in soft lockups and hangs.
This commit therefore corrects this mapping, adding one to this shift
count as needed for systems having odd numbers of CPUs.
Fixes: 7a30871b6a ("rcu-tasks: Introduce ->percpu_enqueue_shift for dynamic queue selection")
Reported-by: Andrii Nakryiko <andrii.nakryiko@gmail.com>
Cc: Reported-by: Martin Lau <kafai@fb.com>
Cc: Neeraj Upadhyay <neeraj.iitr10@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
subparts_cpus should be limited as a subset of cpus_allowed, but it is
updated wrongly by using cpumask_andnot(). Use cpumask_and() instead to
fix it.
Fixes: ee8dde0cd2 ("cpuset: Add new v2 cpuset.sched.partition flag")
Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Reviewed-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
prb_next_seq() always iterates from the first known sequence number.
In the worst case, it might loop 8k times for 256kB buffer,
15k times for 512kB buffer, and 64k times for 2MB buffer.
It was reported that polling and reading using syslog interface
might occupy 50% of CPU.
Speedup the search by storing @id of the last finalized descriptor.
The loop is still needed because the @id is stored and read in the best
effort way. An atomic variable is used to keep the @id consistent.
But the stores and reads are not serialized against each other.
The descriptor could get reused in the meantime. The related sequence
number will be used only when it is still valid.
An invalid value should be read _only_ when there is a flood of messages
and the ringbuffer is rapidly reused. The performance is the least
problem in this case.
Reported-by: Chunlei Wang <chunlei.wang@mediatek.com>
Signed-off-by: Mukesh Ojha <quic_mojha@quicinc.com>
Reviewed-by: John Ogness <john.ogness@linutronix.de>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/1642770388-17327-1-git-send-email-quic_mojha@quicinc.com
Link: https://lore.kernel.org/lkml/YXlddJxLh77DKfIO@alley/T/#m43062e8b2a17f8dbc8c6ccdb8851fb0dbaabbb14
The active cgroup events are managed in the per-cpu cgrp_cpuctx_list.
This list is only accessed from current cpu and not protected by any
locks. But from the commit ef54c1a476 ("perf: Rework
perf_event_exit_event()"), it's possible to access (actually modify)
the list from another cpu.
In the perf_remove_from_context(), it can remove an event from the
context without an IPI when the context is not active. This is not
safe with cgroup events which can have some active events in the
context even if ctx->is_active is 0 at the moment. The target cpu
might be in the middle of list iteration at the same time.
If the event is enabled when it's about to be closed, it might call
perf_cgroup_event_disable() and list_del() with the cgrp_cpuctx_list
on a different cpu.
This resulted in a crash due to an invalid list pointer access during
the cgroup list traversal on the cpu which the event belongs to.
Let's fallback to IPI to access the cgrp_cpuctx_list from that cpu.
Similarly, perf_install_in_context() should use IPI for the cgroup
events too.
Fixes: ef54c1a476 ("perf: Rework perf_event_exit_event()")
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20220124195808.2252071-1-namhyung@kernel.org
When using per-process mode and event inheritance is set to true,
forked processes will create a new perf events via inherit_event() ->
perf_event_alloc(). But these events will not have ring buffers
assigned to them. Any call to wakeup will be dropped if it's called on
an event with no ring buffer assigned because that's the object that
holds the wakeup list.
If the child event is disabled due to a call to
perf_aux_output_begin() or perf_aux_output_end(), the wakeup is
dropped leaving userspace hanging forever on the poll.
Normally the event is explicitly re-enabled by userspace after it
wakes up to read the aux data, but in this case it does not get woken
up so the event remains disabled.
This can be reproduced when using Arm SPE and 'stress' which forks once
before running the workload. By looking at the list of aux buffers read,
it's apparent that they stop after the fork:
perf record -e arm_spe// -vvv -- stress -c 1
With this patch applied they continue to be printed. This behaviour
doesn't happen when using systemwide or per-cpu mode.
Reported-by: Ruben Ayrapetyan <Ruben.Ayrapetyan@arm.com>
Signed-off-by: James Clark <james.clark@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20211206113840.130802-2-james.clark@arm.com
It seems inc_misses_counter() suffers from same issue fixed in
the commit d979617aa8 ("bpf: Fixes possible race in update_prog_stats()
for 32bit arches"):
As it can run while interrupts are enabled, it could
be re-entered and the u64_stats syncp could be mangled.
Fixes: 9ed9e9ba23 ("bpf: Count the number of times recursion was prevented")
Signed-off-by: He Fengqing <hefengqing@huawei.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/r/20220122102936.1219518-1-hefengqing@huawei.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
It was found that reading /proc/lockdep after a lockdep splat may
potentially cause an access to freed memory if lockdep_unregister_key()
is called after the splat but before access to /proc/lockdep [1]. This
is due to the fact that graph_lock() call in lockdep_unregister_key()
fails after the clearing of debug_locks by the splat process.
After lockdep_unregister_key() is called, the lock_name may be freed
but the corresponding lock_class structure still have a reference to
it. That invalid memory pointer will then be accessed when /proc/lockdep
is read by a user and a use-after-free (UAF) error will be reported if
KASAN is enabled.
To fix this problem, lockdep_unregister_key() is now modified to always
search for a matching key irrespective of the debug_locks state and
zap the corresponding lock class if a matching one is found.
[1] https://lore.kernel.org/lkml/77f05c15-81b6-bddd-9650-80d5f23fe330@i-love.sakura.ne.jp/
Fixes: 8b39adbee8 ("locking/lockdep: Make lockdep_unregister_key() honor 'debug_locks' again")
Reported-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lkml.kernel.org/r/20220103023558.1377055-1-longman@redhat.com
The membarrier command MEMBARRIER_CMD_QUERY allows querying the
available membarrier commands. When the membarrier-rseq fence commands
were added, a new MEMBARRIER_CMD_PRIVATE_EXPEDITED_RSEQ_BITMASK was
introduced with the intent to expose them with the MEMBARRIER_CMD_QUERY
command, the but it was never added to MEMBARRIER_CMD_BITMASK.
The membarrier-rseq fence commands are therefore not wired up with the
query command.
Rename MEMBARRIER_CMD_PRIVATE_EXPEDITED_RSEQ_BITMASK to
MEMBARRIER_PRIVATE_EXPEDITED_RSEQ_BITMASK (the bitmask is not a command
per-se), and change the erroneous
MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED_RSEQ_BITMASK (which does not
actually exist) to MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED_RSEQ.
Wire up MEMBARRIER_PRIVATE_EXPEDITED_RSEQ_BITMASK in
MEMBARRIER_CMD_BITMASK. Fixing this allows discovering availability of
the membarrier-rseq fence feature.
Fixes: 2a36ab717e ("rseq/membarrier: Add MEMBARRIER_CMD_PRIVATE_EXPEDITED_RSEQ")
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <stable@vger.kernel.org> # 5.10+
Link: https://lkml.kernel.org/r/20220117203010.30129-1-mathieu.desnoyers@efficios.com
When an admin enables audit at early boot via the "audit=1" kernel
command line the audit queue behavior is slightly different; the
audit subsystem goes to greater lengths to avoid dropping records,
which unfortunately can result in problems when the audit daemon is
forcibly stopped for an extended period of time.
This patch makes a number of changes designed to improve the audit
queuing behavior so that leaving the audit daemon in a stopped state
for an extended period does not cause a significant impact to the
system.
- kauditd_send_queue() is now limited to looping through the
passed queue only once per call. This not only prevents the
function from looping indefinitely when records are returned
to the current queue, it also allows any recovery handling in
kauditd_thread() to take place when kauditd_send_queue()
returns.
- Transient netlink send errors seen as -EAGAIN now cause the
record to be returned to the retry queue instead of going to
the hold queue. The intention of the hold queue is to store,
perhaps for an extended period of time, the events which led
up to the audit daemon going offline. The retry queue remains
a temporary queue intended to protect against transient issues
between the kernel and the audit daemon.
- The retry queue is now limited by the audit_backlog_limit
setting, the same as the other queues. This allows admins
to bound the size of all of the audit queues on the system.
- kauditd_rehold_skb() now returns records to the end of the
hold queue to ensure ordering is preserved in the face of
recent changes to kauditd_send_queue().
Cc: stable@vger.kernel.org
Fixes: 5b52330bbf ("audit: fix auditd/kernel connection state tracking")
Fixes: f4b3ee3c85 ("audit: improve robustness of the audit queue handling")
Reported-by: Gaosheng Cui <cuigaosheng1@huawei.com>
Tested-by: Gaosheng Cui <cuigaosheng1@huawei.com>
Reviewed-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
It is an unused wrapper forcing kmalloc allocation for registering
nosave regions. Also, rename __register_nosave_region() to
register_nosave_region() now that there is no need for disambiguation.
Signed-off-by: Amadeusz Sławiński <amadeuszx.slawinski@linux.intel.com>
Reviewed-by: Cezary Rojewski <cezary.rojewski@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The buffer handling in pm_show_wakelocks() is tricky, and hopefully
correct. Ensure it really is correct by using sysfs_emit_at() which
handles all of the tricky string handling logic in a PAGE_SIZE buffer
for us automatically as this is a sysfs file being read from.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Lee Jones <lee.jones@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
- A series of bpf fixes, including an oops fix and some codegen fixes.
- Fix a regression in syscall_get_arch() for compat processes.
- Fix boot failure on some 32-bit systems with KASAN enabled.
- A couple of other build/minor fixes.
Thanks to: Athira Rajeev, Christophe Leroy, Dmitry V. Levin, Jiri Olsa, Johan Almbladh,
Maxime Bizon, Naveen N. Rao, Nicholas Piggin.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCAAxFiEEJFGtCPCthwEv2Y/bUevqPMjhpYAFAmHtOJgTHG1wZUBlbGxl
cm1hbi5pZC5hdQAKCRBR6+o8yOGlgLMeD/9eBHqMy+yhXfTw9mjAf8rVlsdY6Xjq
2LYgHCJpeP0VjxBm8GcyFtAfJc778LyykU18xsi0WVsuwPmVdVwDfZpzxWJrYqeq
6AuK9uRZAbEx6gZVCsWlk3Em7GyUEqHm5PTxKsDfbzh+hBTNQ8CcKKbCVmEZ+loA
nWdaWJOZMpE4hQOx0Ph5JRAos/ZUEt2I57qZkRgvIEV+hroFrmdP1rnncdalnLHN
pXsEn380Q+rpH+48HIpaTpl5+R8CvgFnnlI6/luZC6HoA8K6jph8FdOkkI9eU1v6
TgdNMiNVIe1wRFzVp+nH7SGqL3l9X/UPS57hL52X+ccpHoioBW9CLPha06g/+UKT
kJU3mYZg3jvQGrM9H/2R3P97UH9SwHE5SH0xcXj3ItH1NpOKpMO0h5UuWx+Ze8Wo
zO80xbYLbALtEfx8JMCf95rnUVFL0n55xVJN+7+fs2hU3Be/ZpIyHfDujlHlbd1m
o3KRhkg1a0RyGsun6Gd2vwLGaBbwYNWvgHM9IDeiAFdJ7HFpp1YrVGpQRkAIUbJS
bzYiRczjbSof7c9WqLyKNatgNFO/JPI5vi1iNQ3u9Oh9fzjpv5Zok4vK534mOnqU
QnhiqLJbY+yBpoAvSePGQTCCzDAnJq0r53n0d1O/2T1GcpX9MKM4YlL+Wu4m0yLn
RcbOCTkmoqWZEw==
=KjWT
-----END PGP SIGNATURE-----
Merge tag 'powerpc-5.17-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc fixes from Michael Ellerman:
- A series of bpf fixes, including an oops fix and some codegen fixes.
- Fix a regression in syscall_get_arch() for compat processes.
- Fix boot failure on some 32-bit systems with KASAN enabled.
- A couple of other build/minor fixes.
Thanks to Athira Rajeev, Christophe Leroy, Dmitry V. Levin, Jiri Olsa,
Johan Almbladh, Maxime Bizon, Naveen N. Rao, and Nicholas Piggin.
* tag 'powerpc-5.17-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
powerpc/64s: Mask SRR0 before checking against the masked NIP
powerpc/perf: Only define power_pmu_wants_prompt_pmi() for CONFIG_PPC64
powerpc/32s: Fix kasan_init_region() for KASAN
powerpc/time: Fix build failure due to do_hard_irq_enable() on PPC32
powerpc/audit: Fix syscall_get_arch()
powerpc64/bpf: Limit 'ldbrx' to processors compliant with ISA v2.06
tools/bpf: Rename 'struct event' to avoid naming conflict
powerpc/bpf: Update ldimm64 instructions during extra pass
powerpc32/bpf: Fix codegen for bpf-to-bpf calls
bpf: Guard against accessing NULL pt_regs in bpf_get_task_stack()
propagation in the sched hierarchies and other minor cleanups and
improvements
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmHtNkcACgkQEsHwGGHe
VUru2xAAq2sJYOjb3AFQQskKDMjUqY42+Z2LnFk+zbv/2NfXPG17lGRNl8zIFWgK
en+RguHOnBDo4Lc4qcx06k02gmZmSA7YonLJVYtT/N1mwsW6zkW0wDho/W3+ssU5
5fJEFSd/y9XmoFOyFj7k+POND/Prk/sguxYcYDRMwjdw4pZoDZ4WgPU3oS3PCiBk
ISua8zqxNC+kqSnlKzDbc23K22mdcsneW/aLFK7npyaKqzypy9IvqaBL6h8tyOgb
Q7jOBavUQwmfi/J5A39JgUrYs90gMuQKMJ0wxWrix+YCgvdRLCX3gcWBvdxHwlmm
KkxmWmM3iGO4qKXUDmmTt8e8GO1c0HgR7tBiVKkG2977fIojLGXTXwZKjIz/gn7f
wg3oltKWj2JZ7X3Z3Te4TDjtWSfibUkUHhrVlm94HgZL9ZiFFY+qigBTUoa/QVAf
q1nkk/acpSDAKY2CGcjeQZtkuIcfz+5Z94n07NsV4O8OriwkEOgVWGGXkky3687C
/woT4a3iIeqiFzSQ8raJq0bdMj3J+wpDe4gmjKmx7oPjiS7FzsyGc8HckwQtiOQ3
kGTTB+9zJS9ChWEk2ViQQgNOUUaJJjAwsBoYkRQakFnQ4AhvQKHmD+MS02vSPBD7
j3k3RPkO0Gm+gUBnkgyKSRTQpAcoVY0lBwttJoEr0IlA/MUWMJ0=
=4m7x
-----END PGP SIGNATURE-----
Merge tag 'sched_urgent_for_v5.17_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Borislav Petkov:
"A bunch of fixes: forced idle time accounting, utilization values
propagation in the sched hierarchies and other minor cleanups and
improvements"
* tag 'sched_urgent_for_v5.17_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
kernel/sched: Remove dl_boosted flag comment
sched: Avoid double preemption in __cond_resched_*lock*()
sched/fair: Fix all kernel-doc warnings
sched/core: Accounting forceidle time for all tasks except idle task
sched/pelt: Relax the sync of load_sum with load_avg
sched/pelt: Relax the sync of runnable_sum with runnable_avg
sched/pelt: Continue to relax the sync of util_sum with util_avg
sched/pelt: Relax the sync of util_sum with util_avg
psi: Fix uaf issue when psi trigger is destroyed while being polled
- Add new LBR format v7 support which is v5 modulo TSX
- Fix counter enumeration on Alder Lake hybrids
- Overhaul how context time updates are done and get rid of
perf_event::shadow_ctx_time.
- The usual amount of fixes: event mask correction, supported event
types reporting, etc.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmHtMN0ACgkQEsHwGGHe
VUqbiA//We8Vop4/Nf2mi6Vje1YRb3a6419U54v0kH7buS1h4VjthrM8rTHmUQdr
wAxjnOVRFW8MTfxiMWN50xJ4mStUYP+PGm2MDYd7d9bB7d69JaXrf/9BIvKej8KJ
xbQXIv1MxhOlReiv0cPmek8IQf+/tFJvrvJNX7YPhTt4mtG8oDU6Cec2DgKqpg1Y
nEru1v4Hom2jQRNI1YNlsXOjxEkA0zCFfp+4iJ2opupqTguFnIHYLeK+0ymTbih7
BK2QrKUFMCxz23gLRjJoyykCWoEVDiUCQ6Tmb+DjuDIJAoT1vvujX95DTnh4c9DV
djWumE8fobFM32c9FnDFq3h6AsI2RlrNFLBWtLKBbr2L7j/sOeBDTdvDnKNPV/UT
DKjoH57tgj3Ool41bpTe4mAaLICndOhIMIpBUGPTouVxWM1RIIuzcvaTZIoM6uB5
DpHww79ENcAvQCgqvRCfVhCLv8jxRuRrgoPKY4iQFwq1xWMEy54Oy/4NhCJ/gZQB
yC63eWmTFUVFe0fzU78EgZoEaYdpw0A0aSUN2aU35zM4AWASg7kZd4APivBJDqSk
Ec1N+9j5F2FNVDQuw9BqcZQo6TQ0pQnblWW1wV5xi7oAxPtNqm8dqd9jd75dNxE8
yTDWwwNVH12P3TzqLpzJXyjq8A08Z9m72Zz+l9zJGaUAFJj/L9s=
=7fx9
-----END PGP SIGNATURE-----
Merge tag 'perf_urgent_for_v5.17_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Borislav Petkov:
- Add support for accessing the general purpose counters on Alder Lake
via MMIO
- Add new LBR format v7 support which is v5 modulo TSX
- Fix counter enumeration on Alder Lake hybrids
- Overhaul how context time updates are done and get rid of
perf_event::shadow_ctx_time.
- The usual amount of fixes: event mask correction, supported event
types reporting, etc.
* tag 'perf_urgent_for_v5.17_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/perf: Avoid warning for Arch LBR without XSAVE
perf/x86/intel/uncore: Add IMC uncore support for ADL
perf/x86/intel/lbr: Add static_branch for LBR INFO flags
perf/x86/intel/lbr: Support LBR format V7
perf/x86/rapl: fix AMD event handling
perf/x86/intel/uncore: Fix CAS_COUNT_WRITE issue for ICX
perf/x86/intel: Add a quirk for the calculation of the number of counters on Alder Lake
perf: Fix perf_event_read_local() time
The latest merge of the tracing tree sorts the mcount table at build time.
But s390 appears to do things differently (like always) and replaces the
sorted table back to the original unsorted one. As the ftrace algorithm
depends on it being sorted, bad things happen when it is not, and s390
experienced those bad things.
Add a new config to tell the boot if the mcount table is sorted or not,
and allow s390 to opt out of it.
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCYezl+RQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qs5wAQCO2K1uB+XI4rown8zKLrGxOiFIXChi
Ds5GKJwYlcOnwQEAqKK3mAodmMy6ydbeY1OY1IDdTkOCAd752FDIe1Qozws=
=aF8W
-----END PGP SIGNATURE-----
Merge tag 'trace-v5.17-3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull ftrace fix from Steven Rostedt:
"Fix s390 breakage from sorting mcount tables.
The latest merge of the tracing tree sorts the mcount table at build
time. But s390 appears to do things differently (like always) and
replaces the sorted table back to the original unsorted one. As the
ftrace algorithm depends on it being sorted, bad things happen when it
is not, and s390 experienced those bad things.
Add a new config to tell the boot if the mcount table is sorted or
not, and allow s390 to opt out of it"
* tag 'trace-v5.17-3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
ftrace: Fix assuming build time sort works for s390
To speed up the boot process, as mcount_loc needs to be sorted for ftrace
to work properly, sorting it at build time is more efficient than boot up
and can save milliseconds of time. Unfortunately, this change broke s390
as it will modify the mcount_loc location after the sorting takes place
and will put back the unsorted locations. Since the sorting is skipped at
boot up if it is believed that it was sorted at run time, ftrace can crash
as its algorithms are dependent on the list being sorted.
Add a new config BUILDTIME_MCOUNT_SORT that is set when
BUILDTIME_TABLE_SORT but not if S390 is set. Use this config to determine
if sorting should take place at boot up.
Link: https://lore.kernel.org/all/yt9dee51ctfn.fsf@linux.ibm.com/
Fixes: 72b3942a17 ("scripts: ftrace - move the sort-processing in ftrace_init")
Reported-by: Sven Schnelle <svens@linux.ibm.com>
Tested-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
-----BEGIN PGP SIGNATURE-----
iQHJBAABCgAzFiEEi8GdvG6xMhdgpu/4sUSA/TofvsgFAmHi+xgVHHl1cnkubm9y
b3ZAZ21haWwuY29tAAoJELFEgP06H77IxdoMAMf3E+L51Ys/4iAiyJQNVoT3aIBC
A8ZVOB9he1OA3o3wBNIRKmICHk+ovnfCWcXTr9fG/Ade2wJz88NAsGPQ1Phywb+s
iGlpySllFN72RT9ZqtJhLEzgoHHOL0CzTW07TN9GJy4gQA2h2G9CTP+OmsQdnVqE
m9Fn3PSlJ5lhzePlKfnln8rGZFgrriJakfEFPC79n/7an4+2Hvkb5rWigo7KQc4Z
9YNqYUcHWZFUgq80adxEb9LlbMXdD+Z/8fCjOrAatuwVkD4RDt6iKD0mFGjHXGL7
MZ9KRS8AfZXawmetk3jjtsV+/QkeS+Deuu7k0FoO0Th2QV7BGSDhsLXAS5By/MOC
nfSyHhnXHzCsBMyVNrJHmNhEZoN29+tRwI84JX9lWcf/OLANcCofnP6f2UIX7tZY
CAZAgVELp+0YQXdybrfzTQ8BT3TinjS/aZtCrYijRendI1GwUXcyl69vdOKqAHuk
5jy8k/xHyp+ZWu6v+PyAAAEGowY++qhL0fmszA==
=RKW4
-----END PGP SIGNATURE-----
Merge tag 'bitmap-5.17-rc1' of git://github.com/norov/linux
Pull bitmap updates from Yury Norov:
- introduce for_each_set_bitrange()
- use find_first_*_bit() instead of find_next_*_bit() where possible
- unify for_each_bit() macros
* tag 'bitmap-5.17-rc1' of git://github.com/norov/linux:
vsprintf: rework bitmap_list_string
lib: bitmap: add performance test for bitmap_print_to_pagebuf
bitmap: unify find_bit operations
mm/percpu: micro-optimize pcpu_is_populated()
Replace for_each_*_bit_from() with for_each_*_bit() where appropriate
find: micro-optimize for_each_{set,clear}_bit()
include/linux: move for_each_bit() macros from bitops.h to find.h
cpumask: replace cpumask_next_* with cpumask_first_* where appropriate
tools: sync tools/bitmap with mother linux
all: replace find_next{,_zero}_bit with find_first{,_zero}_bit where appropriate
cpumask: use find_first_and_bit()
lib: add find_first_and_bit()
arch: remove GENERIC_FIND_FIRST_BIT entirely
include: move find.h from asm_generic to linux
bitops: move find_bit_*_le functions from le.h to find.h
bitops: protect find_first_{,zero}_bit properly
Remove PDE_DATA() completely and replace it with pde_data().
[akpm@linux-foundation.org: fix naming clash in drivers/nubus/proc.c]
[akpm@linux-foundation.org: now fix it properly]
Link: https://lkml.kernel.org/r/20211124081956.87711-2-songmuchun@bytedance.com
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Alexey Gladkov <gladkov.alexey@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In preparation for converting bit_spin_lock to rwlock in zsmalloc so
that multiple writers of zspages can run at the same time but those
zspages are supposed to be different zspage instance. Thus, it's not
deadlock. This patch adds write_lock_nested to support the case for
LOCKDEP.
[minchan@kernel.org: fix write_lock_nested for RT]
Link: https://lkml.kernel.org/r/YZfrMTAXV56HFWJY@google.com
[bigeasy@linutronix.de: fixup write_lock_nested() implementation]
Link: https://lkml.kernel.org/r/20211123170134.y6xb7pmpgdn4m3bn@linutronix.de
Link: https://lkml.kernel.org/r/20211115185909.3949505-8-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Tested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Naresh Kamboju <naresh.kamboju@linaro.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When we pass a negative value to the proc_doulongvec_minmax() function,
the function returns 0, but the corresponding interface value does not
change.
we can easily reproduce this problem with the following commands:
cd /proc/sys/fs/epoll
echo -1 > max_user_watches; echo $?; cat max_user_watches
This function requires a non-negative number to be passed in, so when a
negative number is passed in, -EINVAL is returned.
Link: https://lkml.kernel.org/r/20211220092627.3744624-1-libaokun1@huawei.com
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reported-by: Hulk Robot <hulkci@huawei.com>
Acked-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The const variable ten_thousand is not used, it is redundant and can be
removed.
Cleans up clang warning:
kernel/sysctl.c:99:18: warning: unused variable 'ten_thousand' [-Wunused-const-variable]
static const int ten_thousand = 10000;
Link: https://lkml.kernel.org/r/20211221184501.574670-1-colin.i.king@gmail.com
Fixes: c26da54dc8ca ("printk: move printk sysctl to printk/sysctl.c")
Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Acked-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
kernel/sysctl.c is a kitchen sink where everyone leaves their dirty
dishes, this makes it very difficult to maintain.
To help with this maintenance let's start by moving sysctls to places
where they actually belong. The proc sysctl maintainers do not want to
know what sysctl knobs you wish to add for your own piece of code, we
just care about the core logic.
Move sysctl_kprobes_optimization from kernel/sysctl.c to
kernel/kprobes.c. Use register_sysctl() to register the sysctl
interface.
[mcgrof@kernel.org: fix compile issue when CONFIG_OPTPROBES is disabled]
Link: https://lkml.kernel.org/r/20211129211943.640266-7-mcgrof@kernel.org
Signed-off-by: Xiaoming Ni <nixiaoming@huawei.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Cc: Antti Palosaari <crope@iki.fi>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Eric Biggers <ebiggers@google.com>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Lukas Middendorf <kernel@tuxforce.de>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com>
Cc: Stephen Kitt <steve@sk2.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This moves the fs/coredump.c respective sysctls to its own file.
Link: https://lkml.kernel.org/r/20211129211943.640266-6-mcgrof@kernel.org
Signed-off-by: Xiaoming Ni <nixiaoming@huawei.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Cc: Antti Palosaari <crope@iki.fi>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Eric Biggers <ebiggers@google.com>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Lukas Middendorf <kernel@tuxforce.de>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com>
Cc: Stephen Kitt <steve@sk2.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
build warning when CONFIG_PRINTK=n
kernel/printk/printk.c:175:5: warning: no previous prototype for
'devkmsg_sysctl_set_loglvl' [-Wmissing-prototypes]
devkmsg_sysctl_set_loglvl() is only used in sysctl.c when
CONFIG_PRINTK=y, but it participates in the build when CONFIG_PRINTK=n.
So add compile dependency CONFIG_PRINTK=y && CONFIG_SYSCTL=y to fix the
build warning.
Link: https://lkml.kernel.org/r/20211129211943.640266-5-mcgrof@kernel.org
Signed-off-by: Xiaoming Ni <nixiaoming@huawei.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Cc: Antti Palosaari <crope@iki.fi>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Eric Biggers <ebiggers@google.com>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Lukas Middendorf <kernel@tuxforce.de>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com>
Cc: Stephen Kitt <steve@sk2.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Rename sysctl_init() to sysctl_init_bases() so to reflect exactly what
this is doing.
Link: https://lkml.kernel.org/r/20211129211943.640266-4-mcgrof@kernel.org
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Cc: Antti Palosaari <crope@iki.fi>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Eric Biggers <ebiggers@google.com>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Lukas Middendorf <kernel@tuxforce.de>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com>
Cc: Stephen Kitt <steve@sk2.org>
Cc: Xiaoming Ni <nixiaoming@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This moves the namespace sysctls to its own file as part of the
kernel/sysctl.c spring cleaning
Since we have now removed all sysctls for "fs", we now have to declare
it on the filesystem code, we do that using the new helper, which
reduces boiler plate code.
We rename init_fs_shared_sysctls() to init_fs_sysctls() to reflect that
now fs/sysctls.c is taking on the burden of being the first to register
the base directory as well.
Lastly, since init code will load in the order in which we link it we
have to move the sysctl code to be linked in early, so that its early
init routine runs prior to other fs code. This way, other filesystem
code can register their own sysctls using the helpers after this:
* register_sysctl_init()
* register_sysctl()
Link: https://lkml.kernel.org/r/20211129211943.640266-3-mcgrof@kernel.org
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Cc: Antti Palosaari <crope@iki.fi>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Eric Biggers <ebiggers@google.com>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Lukas Middendorf <kernel@tuxforce.de>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com>
Cc: Stephen Kitt <steve@sk2.org>
Cc: Xiaoming Ni <nixiaoming@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "sysctl: add and use base directory declarer and
registration helper".
In this patch series we start addressing base directories, and so we
start with the "fs" sysctls. The end goal is we end up completely
moving all "fs" sysctl knobs out from kernel/sysctl.
This patch (of 6):
Add a set of helpers which can be used to declare and register base
directory sysctls on their own. We do this so we can later move each of
the base sysctl directories like "fs", "kernel", etc, to their own
respective files instead of shoving the declarations and registrations
all on kernel/sysctl.c. The lazy approach has caught up and with this,
we just end up extending the list of base directories / sysctls on one
file and this makes maintenance difficult due to merge conflicts from
many developers.
The declarations are used first by kernel/sysctl.c for registration its
own base which over time we'll try to clean up. It will be used in the
next patch to demonstrate how to cleanly deal with base sysctl
directories.
[mcgrof@kernel.org: null-terminate the ctl_table arrays]
Link: https://lkml.kernel.org/r/YafJY3rXDYnjK/gs@bombadil.infradead.org
Link: https://lkml.kernel.org/r/20211129211943.640266-1-mcgrof@kernel.org
Link: https://lkml.kernel.org/r/20211129211943.640266-2-mcgrof@kernel.org
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Kees Cook <keescook@chromium.org>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: Xiaoming Ni <nixiaoming@huawei.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Stephen Kitt <steve@sk2.org>
Cc: Lukas Middendorf <kernel@tuxforce.de>
Cc: Antti Palosaari <crope@iki.fi>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Eric Biggers <ebiggers@google.com>
Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
kernel/sysctl.c is a kitchen sink where everyone leaves their dirty
dishes, this makes it very difficult to maintain.
To help with this maintenance let's start by moving sysctls to places
where they actually belong. The proc sysctl maintainers do not want to
know what sysctl knobs you wish to add for your own piece of code, we
just care about the core logic.
So move the pipe sysctls to its own file.
Link: https://lkml.kernel.org/r/20211129205548.605569-10-mcgrof@kernel.org
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Antti Palosaari <crope@iki.fi>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Lukas Middendorf <kernel@tuxforce.de>
Cc: Stephen Kitt <steve@sk2.org>
Cc: Xiaoming Ni <nixiaoming@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
kernel/sysctl.c is a kitchen sink where everyone leaves their dirty
dishes, this makes it very difficult to maintain.
To help with this maintenance let's start by moving sysctls to places
where they actually belong. The proc sysctl maintainers do not want to
know what sysctl knobs you wish to add for your own piece of code, we
just care about the core logic.
So move the fs/exec.c respective sysctls to its own file.
Since checkpatch complains about style issues with the old code, this
move also fixes a few of those minor style issues:
* Use pr_warn() instead of prink(WARNING
* New empty lines are wanted at the beginning of routines
Link: https://lkml.kernel.org/r/20211129205548.605569-9-mcgrof@kernel.org
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Antti Palosaari <crope@iki.fi>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Lukas Middendorf <kernel@tuxforce.de>
Cc: Stephen Kitt <steve@sk2.org>
Cc: Xiaoming Ni <nixiaoming@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
kernel/sysctl.c is a kitchen sink where everyone leaves their dirty
dishes, this makes it very difficult to maintain.
To help with this maintenance let's start by moving sysctls to places
where they actually belong. The proc sysctl maintainers do not want to
know what sysctl knobs you wish to add for your own piece of code, we
just care about the core logic.
So move namei's own sysctl knobs to its own file.
Other than the move we also avoid initializing two static variables to 0
as this is not needed:
* sysctl_protected_symlinks
* sysctl_protected_hardlinks
Link: https://lkml.kernel.org/r/20211129205548.605569-8-mcgrof@kernel.org
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Antti Palosaari <crope@iki.fi>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Lukas Middendorf <kernel@tuxforce.de>
Cc: Stephen Kitt <steve@sk2.org>
Cc: Xiaoming Ni <nixiaoming@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
kernel/sysctl.c is a kitchen sink where everyone leaves their dirty
dishes, this makes it very difficult to maintain.
To help with this maintenance let's start by moving sysctls to places
where they actually belong. The proc sysctl maintainers do not want to
know what sysctl knobs you wish to add for your own piece of code, we
just care about the core logic.
The locking fs sysctls are only used on fs/locks.c, so move them there.
Link: https://lkml.kernel.org/r/20211129205548.605569-7-mcgrof@kernel.org
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Antti Palosaari <crope@iki.fi>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Lukas Middendorf <kernel@tuxforce.de>
Cc: Stephen Kitt <steve@sk2.org>
Cc: Xiaoming Ni <nixiaoming@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
To help with this maintenance let's start by moving sysctls to places
where they actually belong. The proc sysctl maintainers do not want to
know what sysctl knobs you wish to add for your own piece of code, we
just care about the core logic.
To help with this maintenance let's start by moving sysctls to places
where they actually belong. The proc sysctl maintainers do not want to
know what sysctl knobs you wish to add for your own piece of code, we
just care about the core logic.
So move sysctls which are shared between filesystems into a common file
outside of kernel/sysctl.c.
Link: https://lkml.kernel.org/r/20211129205548.605569-6-mcgrof@kernel.org
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Antti Palosaari <crope@iki.fi>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Lukas Middendorf <kernel@tuxforce.de>
Cc: Stephen Kitt <steve@sk2.org>
Cc: Xiaoming Ni <nixiaoming@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The maxolduid value is only shared for sysctl purposes for use on a max
range. Just stuff this into our shared const array.
[akpm@linux-foundation.org: fix sysctl_vals[], per Mickaël]
Link: https://lkml.kernel.org/r/20211129205548.605569-5-mcgrof@kernel.org
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Mickaël Salaün <mic@digikod.net>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Antti Palosaari <crope@iki.fi>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Lukas Middendorf <kernel@tuxforce.de>
Cc: Stephen Kitt <steve@sk2.org>
Cc: Xiaoming Ni <nixiaoming@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
kernel/sysctl.c is a kitchen sink where everyone leaves their dirty
dishes, this makes it very difficult to maintain.
To help with this maintenance let's start by moving sysctls to places
where they actually belong. The proc sysctl maintainers do not want to
know what sysctl knobs you wish to add for your own piece of code, we
just care about the core logic.
So move the dcache sysctl clutter out of kernel/sysctl.c. This is a
small one-off entry, perhaps later we can simplify this representation,
but for now we use the helpers we have. We won't know how we can
simplify this further untl we're fully done with the cleanup.
[arnd@arndb.de: avoid unused-function warning]
Link: https://lkml.kernel.org/r/20211203190123.874239-2-arnd@kernel.org
Link: https://lkml.kernel.org/r/20211129205548.605569-4-mcgrof@kernel.org
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Antti Palosaari <crope@iki.fi>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Lukas Middendorf <kernel@tuxforce.de>
Cc: Stephen Kitt <steve@sk2.org>
Cc: Xiaoming Ni <nixiaoming@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
kernel/sysctl.c is a kitchen sink where everyone leaves their dirty
dishes, this makes it very difficult to maintain.
To help with this maintenance let's start by moving sysctls to places
where they actually belong. The proc sysctl maintainers do not want to
know what sysctl knobs you wish to add for your own piece of code, we
just care about the core logic.
We can create the sysctl dynamically on early init for fs stat to help
with this clutter. This dusts off the fs stat syctls knobs and puts
them into where they are declared.
Link: https://lkml.kernel.org/r/20211129205548.605569-3-mcgrof@kernel.org
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Antti Palosaari <crope@iki.fi>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Lukas Middendorf <kernel@tuxforce.de>
Cc: Stephen Kitt <steve@sk2.org>
Cc: Xiaoming Ni <nixiaoming@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "sysctl: 4th set of kernel/sysctl cleanups".
This is slimming down the fs uses of kernel/sysctl.c to the point that
the next step is to just get rid of the fs base directory for it and
move that elsehwere, so that next patch series starts dealing with that
to demo how we can end up cleaning up a full base directory from
kernel/sysctl.c, one at a time.
This patch (of 9):
kernel/sysctl.c is a kitchen sink where everyone leaves their dirty
dishes, this makes it very difficult to maintain.
To help with this maintenance let's start by moving sysctls to places
where they actually belong. The proc sysctl maintainers do not want to
know what sysctl knobs you wish to add for your own piece of code, we
just care about the core logic.
So move the inode sysctls to its own file. Since we are no longer using
this outside of fs/ remove the extern declaration of its respective proc
helper.
We use early_initcall() as it is the earliest we can use.
[arnd@arndb.de: avoid unused-variable warning]
Link: https://lkml.kernel.org/r/20211203190123.874239-1-arnd@kernel.org
Link: https://lkml.kernel.org/r/20211129205548.605569-1-mcgrof@kernel.org
Link: https://lkml.kernel.org/r/20211129205548.605569-2-mcgrof@kernel.org
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Kees Cook <keescook@chromium.org>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: Xiaoming Ni <nixiaoming@huawei.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Stephen Kitt <steve@sk2.org>
Cc: Lukas Middendorf <kernel@tuxforce.de>
Cc: Antti Palosaari <crope@iki.fi>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Provide a way to share unsigned long values. This will allow others to
not have to re-invent these values.
Link: https://lkml.kernel.org/r/20211124231435.1445213-9-mcgrof@kernel.org
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Amir Goldstein <amir73il@gmail.com>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Antti Palosaari <crope@iki.fi>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: Clemens Ladisch <clemens@ladisch.de>
Cc: David Airlie <airlied@linux.ie>
Cc: Douglas Gilbert <dgilbert@interlog.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: James E.J. Bottomley <jejb@linux.ibm.com>
Cc: Jani Nikula <jani.nikula@intel.com>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: John Ogness <john.ogness@linutronix.de>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Julia Lawall <julia.lawall@inria.fr>
Cc: Kees Cook <keescook@chromium.org>
Cc: Lukas Middendorf <kernel@tuxforce.de>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Phillip Potter <phil@philpotter.co.uk>
Cc: Qing Wang <wangqing@vivo.com>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Sebastian Reichel <sre@kernel.org>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Stephen Kitt <steve@sk2.org>
Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Xiaoming Ni <nixiaoming@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
kernel/sysctl.c is a kitchen sink where everyone leaves their dirty
dishes, this makes it very difficult to maintain.
To help with this maintenance let's start by moving sysctls to places
where they actually belong. The proc sysctl maintainers do not want to
know what sysctl knobs you wish to add for your own piece of code, we
just care about the core logic.
So move the stack_erasing sysctl from kernel/sysctl.c to
kernel/stackleak.c and use register_sysctl() to register the sysctl
interface.
[mcgrof@kernel.org: commit log update]
Link: https://lkml.kernel.org/r/20211124231435.1445213-8-mcgrof@kernel.org
Signed-off-by: Xiaoming Ni <nixiaoming@huawei.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Amir Goldstein <amir73il@gmail.com>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Antti Palosaari <crope@iki.fi>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: Clemens Ladisch <clemens@ladisch.de>
Cc: David Airlie <airlied@linux.ie>
Cc: Douglas Gilbert <dgilbert@interlog.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: James E.J. Bottomley <jejb@linux.ibm.com>
Cc: Jani Nikula <jani.nikula@intel.com>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: John Ogness <john.ogness@linutronix.de>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Julia Lawall <julia.lawall@inria.fr>
Cc: Kees Cook <keescook@chromium.org>
Cc: Lukas Middendorf <kernel@tuxforce.de>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Phillip Potter <phil@philpotter.co.uk>
Cc: Qing Wang <wangqing@vivo.com>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Sebastian Reichel <sre@kernel.org>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Stephen Kitt <steve@sk2.org>
Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
kernel/sysctl.c is a kitchen sink where everyone leaves their dirty
dishes, this makes it very difficult to maintain.
To help with this maintenance let's start by moving sysctls to places
where they actually belong. The proc sysctl maintainers do not want to
know what sysctl knobs you wish to add for your own piece of code, we
just care about the core logic.
So move the sg-big-buff sysctl from kernel/sysctl.c to drivers/scsi/sg.c
and use register_sysctl() to register the sysctl interface.
[mcgrof@kernel.org: commit log update]
Link: https://lkml.kernel.org/r/20211124231435.1445213-7-mcgrof@kernel.org
Signed-off-by: Xiaoming Ni <nixiaoming@huawei.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Amir Goldstein <amir73il@gmail.com>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Antti Palosaari <crope@iki.fi>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: Clemens Ladisch <clemens@ladisch.de>
Cc: David Airlie <airlied@linux.ie>
Cc: Douglas Gilbert <dgilbert@interlog.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: James E.J. Bottomley <jejb@linux.ibm.com>
Cc: Jani Nikula <jani.nikula@intel.com>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: John Ogness <john.ogness@linutronix.de>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Julia Lawall <julia.lawall@inria.fr>
Cc: Kees Cook <keescook@chromium.org>
Cc: Lukas Middendorf <kernel@tuxforce.de>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Phillip Potter <phil@philpotter.co.uk>
Cc: Qing Wang <wangqing@vivo.com>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Sebastian Reichel <sre@kernel.org>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Stephen Kitt <steve@sk2.org>
Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
kernel/sysctl.c is a kitchen sink where everyone leaves their dirty
dishes, this makes it very difficult to maintain.
To help with this maintenance let's start by moving sysctls to places
where they actually belong. The proc sysctl maintainers do not want to
know what sysctl knobs you wish to add for your own piece of code, we
just care about the core logic.
So move printk sysctl from kernel/sysctl.c to kernel/printk/sysctl.c.
Use register_sysctl() to register the sysctl interface.
[mcgrof@kernel.org: fixed compile issues when PRINTK is not set, commit log update]
Link: https://lkml.kernel.org/r/20211124231435.1445213-6-mcgrof@kernel.org
Signed-off-by: Xiaoming Ni <nixiaoming@huawei.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Amir Goldstein <amir73il@gmail.com>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Antti Palosaari <crope@iki.fi>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: Clemens Ladisch <clemens@ladisch.de>
Cc: David Airlie <airlied@linux.ie>
Cc: Douglas Gilbert <dgilbert@interlog.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: James E.J. Bottomley <jejb@linux.ibm.com>
Cc: Jani Nikula <jani.nikula@intel.com>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: John Ogness <john.ogness@linutronix.de>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Julia Lawall <julia.lawall@inria.fr>
Cc: Kees Cook <keescook@chromium.org>
Cc: Lukas Middendorf <kernel@tuxforce.de>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Phillip Potter <phil@philpotter.co.uk>
Cc: Qing Wang <wangqing@vivo.com>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Sebastian Reichel <sre@kernel.org>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Stephen Kitt <steve@sk2.org>
Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
kernel/sysctl.c is a kitchen sink where everyone leaves their dirty
dishes, this makes it very difficult to maintain.
To help with this maintenance let's start by moving sysctls to places
where they actually belong. The proc sysctl maintainers do not want to
know what sysctl knobs you wish to add for your own piece of code, we
just care about the core logic.
This moves the binfmt_misc sysctl to its own file to help remove clutter
from kernel/sysctl.c.
Link: https://lkml.kernel.org/r/20211124231435.1445213-5-mcgrof@kernel.org
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Amir Goldstein <amir73il@gmail.com>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Antti Palosaari <crope@iki.fi>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: Clemens Ladisch <clemens@ladisch.de>
Cc: David Airlie <airlied@linux.ie>
Cc: Douglas Gilbert <dgilbert@interlog.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: James E.J. Bottomley <jejb@linux.ibm.com>
Cc: Jani Nikula <jani.nikula@intel.com>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: John Ogness <john.ogness@linutronix.de>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Julia Lawall <julia.lawall@inria.fr>
Cc: Kees Cook <keescook@chromium.org>
Cc: Lukas Middendorf <kernel@tuxforce.de>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Phillip Potter <phil@philpotter.co.uk>
Cc: Qing Wang <wangqing@vivo.com>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Sebastian Reichel <sre@kernel.org>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Stephen Kitt <steve@sk2.org>
Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Xiaoming Ni <nixiaoming@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
kernel/sysctl.c is a kitchen sink where everyone leaves their dirty
dishes, this makes it very difficult to maintain.
To help with this maintenance let's start by moving sysctls to places
where they actually belong. The proc sysctl maintainers do not want to
know what sysctl knobs you wish to add for your own piece of code, we
just care about the core logic.
So move the random sysctls to their own file and use
register_sysctl_init().
[mcgrof@kernel.org: commit log update to justify the move]
Link: https://lkml.kernel.org/r/20211124231435.1445213-3-mcgrof@kernel.org
Signed-off-by: Xiaoming Ni <nixiaoming@huawei.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Amir Goldstein <amir73il@gmail.com>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Antti Palosaari <crope@iki.fi>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: Clemens Ladisch <clemens@ladisch.de>
Cc: David Airlie <airlied@linux.ie>
Cc: Douglas Gilbert <dgilbert@interlog.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: James E.J. Bottomley <jejb@linux.ibm.com>
Cc: Jani Nikula <jani.nikula@intel.com>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: John Ogness <john.ogness@linutronix.de>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Julia Lawall <julia.lawall@inria.fr>
Cc: Kees Cook <keescook@chromium.org>
Cc: Lukas Middendorf <kernel@tuxforce.de>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Phillip Potter <phil@philpotter.co.uk>
Cc: Qing Wang <wangqing@vivo.com>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Sebastian Reichel <sre@kernel.org>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Stephen Kitt <steve@sk2.org>
Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "sysctl: 3rd set of kernel/sysctl cleanups", v2.
This is the third set of patches to help address cleaning the kitchen
seink in kernel/sysctl.c and to move sysctls away to where they are
actually implemented / used.
This patch (of 8):
kernel/sysctl.c is a kitchen sink where everyone leaves their dirty
dishes, this makes it very difficult to maintain.
To help with this maintenance let's start by moving sysctls to places
where they actually belong. The proc sysctl maintainers do not want to
know what sysctl knobs you wish to add for your own piece of code, we
just care about the core logic.
So move the firmware configuration sysctl table to the only place where
it is used, and make it clear that if sysctls are disabled this is not
used.
[akpm@linux-foundation.org: export register_firmware_config_sysctl and unregister_firmware_config_sysctl to modules]
[akpm@linux-foundation.org: use EXPORT_SYMBOL_NS_GPL instead]
[sfr@canb.auug.org.au: fix that so it compiles]
Link: https://lkml.kernel.org/r/20211201160626.401d828d@canb.auug.org.au
[mcgrof@kernel.org: major commit log update to justify the move]
Link: https://lkml.kernel.org/r/20211124231435.1445213-1-mcgrof@kernel.org
Link: https://lkml.kernel.org/r/20211124231435.1445213-2-mcgrof@kernel.org
Signed-off-by: Xiaoming Ni <nixiaoming@huawei.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Kees Cook <keescook@chromium.org>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Stephen Kitt <steve@sk2.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: John Ogness <john.ogness@linutronix.de>
Cc: Douglas Gilbert <dgilbert@interlog.com>
Cc: James E.J. Bottomley <jejb@linux.ibm.com>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Lukas Middendorf <kernel@tuxforce.de>
Cc: Antti Palosaari <crope@iki.fi>
Cc: Amir Goldstein <amir73il@gmail.com>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: Clemens Ladisch <clemens@ladisch.de>
Cc: David Airlie <airlied@linux.ie>
Cc: Jani Nikula <jani.nikula@intel.com>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Julia Lawall <julia.lawall@inria.fr>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Phillip Potter <phil@philpotter.co.uk>
Cc: Qing Wang <wangqing@vivo.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Sebastian Reichel <sre@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The kernel/sysctl.c is a kitchen sink where everyone leaves their dirty
dishes, this makes it very difficult to maintain.
To help with this maintenance let's start by moving sysctls to places
where they actually belong. The proc sysctl maintainers do not want to
know what sysctl knobs you wish to add for your own piece of code, we
just care about the core logic.
So move the epoll_table sysctl to fs/eventpoll.c and use
register_sysctl().
Link: https://lkml.kernel.org/r/20211123202422.819032-9-mcgrof@kernel.org
Signed-off-by: Xiaoming Ni <nixiaoming@huawei.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Amir Goldstein <amir73il@gmail.com>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Antti Palosaari <crope@iki.fi>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: Clemens Ladisch <clemens@ladisch.de>
Cc: David Airlie <airlied@linux.ie>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Julia Lawall <julia.lawall@inria.fr>
Cc: Kees Cook <keescook@chromium.org>
Cc: Lukas Middendorf <kernel@tuxforce.de>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Phillip Potter <phil@philpotter.co.uk>
Cc: Qing Wang <wangqing@vivo.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Sebastian Reichel <sre@kernel.org>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Stephen Kitt <steve@sk2.org>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Douglas Gilbert <dgilbert@interlog.com>
Cc: James E.J. Bottomley <jejb@linux.ibm.com>
Cc: Jani Nikula <jani.nikula@intel.com>
Cc: John Ogness <john.ogness@linutronix.de>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There is no need to user boiler plate code to specify a set of base
directories we're going to stuff sysctls under. Simplify this by using
register_sysctl() and specifying the directory path directly.
Move inotify_user sysctl to inotify_user.c while at it to remove clutter
from kernel/sysctl.c.
[mcgrof@kernel.org: remember to register fanotify_table]
Link: https://lkml.kernel.org/r/YZ5A6iWLb0h3N3RC@bombadil.infradead.org
[mcgrof@kernel.org: update commit log to reflect new path we decided to take]
Link: https://lkml.kernel.org/r/20211123202422.819032-7-mcgrof@kernel.org
Signed-off-by: Xiaoming Ni <nixiaoming@huawei.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Amir Goldstein <amir73il@gmail.com>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Antti Palosaari <crope@iki.fi>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: Clemens Ladisch <clemens@ladisch.de>
Cc: David Airlie <airlied@linux.ie>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Julia Lawall <julia.lawall@inria.fr>
Cc: Kees Cook <keescook@chromium.org>
Cc: Lukas Middendorf <kernel@tuxforce.de>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Phillip Potter <phil@philpotter.co.uk>
Cc: Qing Wang <wangqing@vivo.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Sebastian Reichel <sre@kernel.org>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Stephen Kitt <steve@sk2.org>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Douglas Gilbert <dgilbert@interlog.com>
Cc: James E.J. Bottomley <jejb@linux.ibm.com>
Cc: Jani Nikula <jani.nikula@intel.com>
Cc: John Ogness <john.ogness@linutronix.de>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The kernel/sysctl.c is a kitchen sink where everyone leaves their dirty
dishes, this makes it very difficult to maintain.
To help with this maintenance let's start by moving sysctls to places
where they actually belong. The proc sysctl maintainers do not want to
know what sysctl knobs you wish to add for your own piece of code, we
just care about the core logic.
So move dnotify sysctls to dnotify.c and use the new
register_sysctl_init() to register the sysctl interface.
[mcgrof@kernel.org: adjust the commit log to justify the move]
Link: https://lkml.kernel.org/r/20211123202347.818157-10-mcgrof@kernel.org
Signed-off-by: Xiaoming Ni <nixiaoming@huawei.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Acked-by: Jan Kara <jack@suse.cz>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Amir Goldstein <amir73il@gmail.com>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Qing Wang <wangqing@vivo.com>
Cc: Sebastian Reichel <sre@kernel.org>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Stephen Kitt <steve@sk2.org>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Antti Palosaari <crope@iki.fi>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Clemens Ladisch <clemens@ladisch.de>
Cc: David Airlie <airlied@linux.ie>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Julia Lawall <julia.lawall@inria.fr>
Cc: Lukas Middendorf <kernel@tuxforce.de>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Phillip Potter <phil@philpotter.co.uk>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Douglas Gilbert <dgilbert@interlog.com>
Cc: James E.J. Bottomley <jejb@linux.ibm.com>
Cc: Jani Nikula <jani.nikula@intel.com>
Cc: John Ogness <john.ogness@linutronix.de>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The kernel/sysctl.c is a kitchen sink where everyone leaves their dirty
dishes, this makes it very difficult to maintain.
To help with this maintenance let's start by moving sysctls to places
where they actually belong. The proc sysctl maintainers do not want to
know what sysctl knobs you wish to add for your own piece of code, we
just care about the core logic.
Move aio sysctl to aio.c and use the new register_sysctl_init() to
register the sysctl interface for aio.
[mcgrof@kernel.org: adjust commit log to justify the move]
Link: https://lkml.kernel.org/r/20211123202347.818157-9-mcgrof@kernel.org
Signed-off-by: Xiaoming Ni <nixiaoming@huawei.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Amir Goldstein <amir73il@gmail.com>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Qing Wang <wangqing@vivo.com>
Cc: Sebastian Reichel <sre@kernel.org>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Stephen Kitt <steve@sk2.org>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Antti Palosaari <crope@iki.fi>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Clemens Ladisch <clemens@ladisch.de>
Cc: David Airlie <airlied@linux.ie>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Julia Lawall <julia.lawall@inria.fr>
Cc: Lukas Middendorf <kernel@tuxforce.de>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Phillip Potter <phil@philpotter.co.uk>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Douglas Gilbert <dgilbert@interlog.com>
Cc: James E.J. Bottomley <jejb@linux.ibm.com>
Cc: Jani Nikula <jani.nikula@intel.com>
Cc: John Ogness <john.ogness@linutronix.de>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use the variable SYSCTL_ZERO to replace some static int boundary
variables with a value of 0 (minolduid, min_extfrag_threshold,
min_wakeup_granularity_ns).
Link: https://lkml.kernel.org/r/20211123202347.818157-8-mcgrof@kernel.org
Signed-off-by: Xiaoming Ni <nixiaoming@huawei.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Amir Goldstein <amir73il@gmail.com>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Kees Cook <keescook@chromium.org>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Qing Wang <wangqing@vivo.com>
Cc: Sebastian Reichel <sre@kernel.org>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Stephen Kitt <steve@sk2.org>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Antti Palosaari <crope@iki.fi>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Clemens Ladisch <clemens@ladisch.de>
Cc: David Airlie <airlied@linux.ie>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Julia Lawall <julia.lawall@inria.fr>
Cc: Lukas Middendorf <kernel@tuxforce.de>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Phillip Potter <phil@philpotter.co.uk>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Douglas Gilbert <dgilbert@interlog.com>
Cc: James E.J. Bottomley <jejb@linux.ibm.com>
Cc: Jani Nikula <jani.nikula@intel.com>
Cc: John Ogness <john.ogness@linutronix.de>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When proc_dointvec_minmax() or proc_doulongvec_minmax() are used we are
using the extra1 and extra2 parameters on the sysctl table only for a
min and max boundary, these extra1 and extra2 arguments are then used
for read-only operations. So make them const to reflect this.
[mcgrof@kernel.org: commit log love]
Link: https://lkml.kernel.org/r/20211123202347.818157-7-mcgrof@kernel.org
Signed-off-by: Xiaoming Ni <nixiaoming@huawei.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Amir Goldstein <amir73il@gmail.com>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Kees Cook <keescook@chromium.org>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Qing Wang <wangqing@vivo.com>
Cc: Sebastian Reichel <sre@kernel.org>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Stephen Kitt <steve@sk2.org>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Antti Palosaari <crope@iki.fi>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Clemens Ladisch <clemens@ladisch.de>
Cc: David Airlie <airlied@linux.ie>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Julia Lawall <julia.lawall@inria.fr>
Cc: Lukas Middendorf <kernel@tuxforce.de>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Phillip Potter <phil@philpotter.co.uk>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Douglas Gilbert <dgilbert@interlog.com>
Cc: James E.J. Bottomley <jejb@linux.ibm.com>
Cc: Jani Nikula <jani.nikula@intel.com>
Cc: John Ogness <john.ogness@linutronix.de>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
ngroups_max is a read-only sysctl entry, reflecting NGROUPS_MAX. Make
it const, in the same way as cap_last_cap.
Link: https://lkml.kernel.org/r/20211123202347.818157-6-mcgrof@kernel.org
Signed-off-by: Stephen Kitt <steve@sk2.org>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Amir Goldstein <amir73il@gmail.com>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Kees Cook <keescook@chromium.org>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Qing Wang <wangqing@vivo.com>
Cc: Sebastian Reichel <sre@kernel.org>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Xiaoming Ni <nixiaoming@huawei.com>
Cc: Antti Palosaari <crope@iki.fi>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Clemens Ladisch <clemens@ladisch.de>
Cc: David Airlie <airlied@linux.ie>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Julia Lawall <julia.lawall@inria.fr>
Cc: Lukas Middendorf <kernel@tuxforce.de>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Phillip Potter <phil@philpotter.co.uk>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Douglas Gilbert <dgilbert@interlog.com>
Cc: James E.J. Bottomley <jejb@linux.ibm.com>
Cc: Jani Nikula <jani.nikula@intel.com>
Cc: John Ogness <john.ogness@linutronix.de>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The kernel/sysctl.c is a kitchen sink where everyone leaves their dirty
dishes, this makes it very difficult to maintain.
To help with this maintenance let's start by moving sysctls to places
where they actually belong. The proc sysctl maintainers do not want to
know what sysctl knobs you wish to add for your own piece of code, we
just care about the core logic of proc sysctl.
So, move the watchdog syscl interface to watchdog.c. Use
register_sysctl() to register the sysctl interface to avoid merge
conflicts when different features modify sysctl.c at the same time.
[mcgrof@kernel.org: justify the move on the commit log]
Link: https://lkml.kernel.org/r/20211123202347.818157-5-mcgrof@kernel.org
Signed-off-by: Xiaoming Ni <nixiaoming@huawei.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Amir Goldstein <amir73il@gmail.com>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Qing Wang <wangqing@vivo.com>
Cc: Sebastian Reichel <sre@kernel.org>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Stephen Kitt <steve@sk2.org>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Antti Palosaari <crope@iki.fi>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Clemens Ladisch <clemens@ladisch.de>
Cc: David Airlie <airlied@linux.ie>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Julia Lawall <julia.lawall@inria.fr>
Cc: Lukas Middendorf <kernel@tuxforce.de>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Phillip Potter <phil@philpotter.co.uk>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Douglas Gilbert <dgilbert@interlog.com>
Cc: James E.J. Bottomley <jejb@linux.ibm.com>
Cc: Jani Nikula <jani.nikula@intel.com>
Cc: John Ogness <john.ogness@linutronix.de>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The kernel/sysctl.c is a kitchen sink where everyone leaves their dirty
dishes, this makes it very difficult to maintain.
To help with this maintenance let's start by moving sysctls to places
where they actually belong. The proc sysctl maintainers do not want to
know what sysctl knobs you wish to add for your own piece of code, we
just care about the core logic.
So move hung_task sysctl interface to hung_task.c and use
register_sysctl() to register the sysctl interface.
[mcgrof@kernel.org: commit log refresh and fixed 2-3 0day reported compile issues]
Link: https://lkml.kernel.org/r/20211123202347.818157-4-mcgrof@kernel.org
Signed-off-by: Xiaoming Ni <nixiaoming@huawei.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Amir Goldstein <amir73il@gmail.com>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Qing Wang <wangqing@vivo.com>
Cc: Sebastian Reichel <sre@kernel.org>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Stephen Kitt <steve@sk2.org>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Antti Palosaari <crope@iki.fi>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Clemens Ladisch <clemens@ladisch.de>
Cc: David Airlie <airlied@linux.ie>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Julia Lawall <julia.lawall@inria.fr>
Cc: Lukas Middendorf <kernel@tuxforce.de>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Phillip Potter <phil@philpotter.co.uk>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Douglas Gilbert <dgilbert@interlog.com>
Cc: James E.J. Bottomley <jejb@linux.ibm.com>
Cc: Jani Nikula <jani.nikula@intel.com>
Cc: John Ogness <john.ogness@linutronix.de>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
sysctl has helpers which let us specify boundary values for a min or max
int value. Since these are used for a boundary check only they don't
change, so move these variables to sysctl_vals to avoid adding duplicate
variables. This will help with our cleanup of kernel/sysctl.c.
[akpm@linux-foundation.org: update it for "mm/pagealloc: sysctl: change watermark_scale_factor max limit to 30%"]
[mcgrof@kernel.org: major rebase]
Link: https://lkml.kernel.org/r/20211123202347.818157-3-mcgrof@kernel.org
Signed-off-by: Xiaoming Ni <nixiaoming@huawei.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Amir Goldstein <amir73il@gmail.com>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Qing Wang <wangqing@vivo.com>
Cc: Sebastian Reichel <sre@kernel.org>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Stephen Kitt <steve@sk2.org>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Antti Palosaari <crope@iki.fi>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Clemens Ladisch <clemens@ladisch.de>
Cc: David Airlie <airlied@linux.ie>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Julia Lawall <julia.lawall@inria.fr>
Cc: Lukas Middendorf <kernel@tuxforce.de>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Phillip Potter <phil@philpotter.co.uk>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Douglas Gilbert <dgilbert@interlog.com>
Cc: James E.J. Bottomley <jejb@linux.ibm.com>
Cc: Jani Nikula <jani.nikula@intel.com>
Cc: John Ogness <john.ogness@linutronix.de>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Current release - regressions:
- fix memory leaks in the skb free deferral scheme if upper layer
protocols are used, i.e. in-kernel TCP readers like TLS
Current release - new code bugs:
- nf_tables: fix NULL check typo in _clone() functions
- change the default to y for Vertexcom vendor Kconfig
- a couple of fixes to incorrect uses of ref tracking
- two fixes for constifying netdev->dev_addr
Previous releases - regressions:
- bpf:
- various verifier fixes mainly around register offset handling
when passed to helper functions
- fix mount source displayed for bpffs (none -> bpffs)
- bonding:
- fix extraction of ports for connection hash calculation
- fix bond_xmit_broadcast return value when some devices are down
- phy: marvell: add Marvell specific PHY loopback
- sch_api: don't skip qdisc attach on ingress, prevent ref leak
- htb: restore minimal packet size handling in rate control
- sfp: fix high power modules without diagnostic monitoring
- mscc: ocelot:
- don't let phylink re-enable TX PAUSE on the NPI port
- don't dereference NULL pointers with shared tc filters
- smsc95xx: correct reset handling for LAN9514
- cpsw: avoid alignment faults by taking NET_IP_ALIGN into account
- phy: micrel: use kszphy_suspend/_resume for irq aware devices,
avoid races with the interrupt
Previous releases - always broken:
- xdp: check prog type before updating BPF link
- smc: resolve various races around abnormal connection termination
- sit: allow encapsulated IPv6 traffic to be delivered locally
- axienet: fix init/reset handling, add missing barriers,
read the right status words, stop queues correctly
- add missing dev_put() in sock_timestamping_bind_phc()
Misc:
- ipv4: prevent accidentally passing RTO_ONLINK to
ip_route_output_key_hash() by sanitizing flags
- ipv4: avoid quadratic behavior in netns dismantle
- stmmac: dwmac-oxnas: add support for OX810SE
- fsl: xgmac_mdio: add workaround for erratum A-009885
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmHoS14ACgkQMUZtbf5S
IrtMQA/6AxhWuj2JsoNhvTzBCi4vkeo53rKU941bxOaST9Ow8dqDc7yAT8YeJU2B
lGw6/pXx+Fm9twGsRkkQ0vX7piIk25vKzEwnlCYVVXLAnE+lPu9qFH49X1HO5Fwy
K+frGDC524MrbJFb+UbZfJG4UitsyHoqc58Mp7ZNBe2gn12DcHotsiSJikzdd02F
rzQZhvwRKsDS2prcIHdvVAxva380cn99mvaFqIPR9MemhWKOzVa3NfkiC3tSlhW/
OphG3UuOfKCVdofYAO5/oXlVQcDKx0OD9Sr2q8aO0mlME0p0ounKz+LDcwkofaYQ
pGeMY2pEAHujLyRewunrfaPv8/SIB/ulSPcyreoF28TTN20M+4onvgTHvVSyzLl7
MA4kYH7tkPgOfbW8T573OFPdrqsy4WTrFPFovGqvDuiE8h65Pll/gTcAqsWjF/xw
CmfmtICcsBwVGMLUzpUjKAWuB0/voa/sQUuQoxvQFsgCteuslm1suLY5EfSIhdu8
nvhySJjPXRHicZQNflIwKTiOYYWls7yYVGe76u9hqjyD36peJXYjUjyyENIfLiFA
0XclGIfSBMGWMGmxvGYIZDwGOKK0j+s0PipliXVjP2otLrPYUjma5Co37KW8SiSV
9TT673FAXJNB0IJ7xiT7nRUZ/fjRrweP1glte/6d148J1Lf9MTQ=
=XM4Y
-----END PGP SIGNATURE-----
Merge tag 'net-5.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Including fixes from netfilter, bpf.
Quite a handful of old regression fixes but most of those are
pre-5.16.
Current release - regressions:
- fix memory leaks in the skb free deferral scheme if upper layer
protocols are used, i.e. in-kernel TCP readers like TLS
Current release - new code bugs:
- nf_tables: fix NULL check typo in _clone() functions
- change the default to y for Vertexcom vendor Kconfig
- a couple of fixes to incorrect uses of ref tracking
- two fixes for constifying netdev->dev_addr
Previous releases - regressions:
- bpf:
- various verifier fixes mainly around register offset handling
when passed to helper functions
- fix mount source displayed for bpffs (none -> bpffs)
- bonding:
- fix extraction of ports for connection hash calculation
- fix bond_xmit_broadcast return value when some devices are down
- phy: marvell: add Marvell specific PHY loopback
- sch_api: don't skip qdisc attach on ingress, prevent ref leak
- htb: restore minimal packet size handling in rate control
- sfp: fix high power modules without diagnostic monitoring
- mscc: ocelot:
- don't let phylink re-enable TX PAUSE on the NPI port
- don't dereference NULL pointers with shared tc filters
- smsc95xx: correct reset handling for LAN9514
- cpsw: avoid alignment faults by taking NET_IP_ALIGN into account
- phy: micrel: use kszphy_suspend/_resume for irq aware devices,
avoid races with the interrupt
Previous releases - always broken:
- xdp: check prog type before updating BPF link
- smc: resolve various races around abnormal connection termination
- sit: allow encapsulated IPv6 traffic to be delivered locally
- axienet: fix init/reset handling, add missing barriers, read the
right status words, stop queues correctly
- add missing dev_put() in sock_timestamping_bind_phc()
Misc:
- ipv4: prevent accidentally passing RTO_ONLINK to
ip_route_output_key_hash() by sanitizing flags
- ipv4: avoid quadratic behavior in netns dismantle
- stmmac: dwmac-oxnas: add support for OX810SE
- fsl: xgmac_mdio: add workaround for erratum A-009885"
* tag 'net-5.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (92 commits)
ipv4: add net_hash_mix() dispersion to fib_info_laddrhash keys
ipv4: avoid quadratic behavior in netns dismantle
net/fsl: xgmac_mdio: Fix incorrect iounmap when removing module
powerpc/fsl/dts: Enable WA for erratum A-009885 on fman3l MDIO buses
dt-bindings: net: Document fsl,erratum-a009885
net/fsl: xgmac_mdio: Add workaround for erratum A-009885
net: mscc: ocelot: fix using match before it is set
net: phy: micrel: use kszphy_suspend()/kszphy_resume for irq aware devices
net: cpsw: avoid alignment faults by taking NET_IP_ALIGN into account
nfc: llcp: fix NULL error pointer dereference on sendmsg() after failed bind()
net: axienet: increase default TX ring size to 128
net: axienet: fix for TX busy handling
net: axienet: fix number of TX ring slots for available check
net: axienet: Fix TX ring slot available check
net: axienet: limit minimum TX ring size
net: axienet: add missing memory barriers
net: axienet: reset core on initialization prior to MDIO access
net: axienet: Wait for PhyRstCmplt after core reset
net: axienet: increase reset timeout
bpf, selftests: Add ringbuf memory type confusion test
...
Merge more updates from Andrew Morton:
"55 patches.
Subsystems affected by this patch series: percpu, procfs, sysctl,
misc, core-kernel, get_maintainer, lib, checkpatch, binfmt, nilfs2,
hfs, fat, adfs, panic, delayacct, kconfig, kcov, and ubsan"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (55 commits)
lib: remove redundant assignment to variable ret
ubsan: remove CONFIG_UBSAN_OBJECT_SIZE
kcov: fix generic Kconfig dependencies if ARCH_WANTS_NO_INSTR
lib/Kconfig.debug: make TEST_KMOD depend on PAGE_SIZE_LESS_THAN_256KB
btrfs: use generic Kconfig option for 256kB page size limit
arch/Kconfig: split PAGE_SIZE_LESS_THAN_256KB from PAGE_SIZE_LESS_THAN_64KB
configs: introduce debug.config for CI-like setup
delayacct: track delays from memory compact
Documentation/accounting/delay-accounting.rst: add thrashing page cache and direct compact
delayacct: cleanup flags in struct task_delay_info and functions use it
delayacct: fix incomplete disable operation when switch enable to disable
delayacct: support swapin delay accounting for swapping without blkio
panic: remove oops_id
panic: use error_report_end tracepoint on warnings
fs/adfs: remove unneeded variable make code cleaner
FAT: use io_schedule_timeout() instead of congestion_wait()
hfsplus: use struct_group_attr() for memcpy() region
nilfs2: remove redundant pointer sbufs
fs/binfmt_elf: use PT_LOAD p_align values for static PIE
const_structs.checkpatch: add frequently used ops structs
...
Some general debugging features like kmemleak, KASAN, lockdep, UBSAN etc
help fix many viruses like a microscope. On the other hand, those
features are scatter around and mixed up with more situational debugging
options making them difficult to consume properly. This cold help
amplify the general debugging/testing efforts and help establish
sensitive default values for those options across the broad. This could
also help different distros to collaborate on maintaining debug-flavored
kernels.
The config is based on years' experiences running daily CI inside the
largest enterprise Linux distro company to seek regressions on
linux-next builds on different bare-metal and virtual platforms. It can
be used for example,
$ make ARCH=arm64 defconfig debug.config
Since KASAN and KCSAN can't be enabled together, we will need to create
a separate one for KCSAN later as well.
Link: https://lkml.kernel.org/r/20211115134754.7334-1-quic_qiancai@quicinc.com
Signed-off-by: Qian Cai <quic_qiancai@quicinc.com>
Acked-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Marco Elver <elver@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Daniel Thompson <daniel.thompson@linaro.org>
Cc: Masahiro Yamada <masahiroy@kernel.org>
Cc: Naresh Kamboju <naresh.kamboju@linaro.org>
Cc: "Stephen Rothwell" <sfr@canb.auug.org.au>
Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Delay accounting does not track the delay of memory compact. When there
is not enough free memory, tasks can spend a amount of their time
waiting for compact.
To get the impact of tasks in direct memory compact, measure the delay
when allocating memory through memory compact.
Also update tools/accounting/getdelays.c:
/ # ./getdelays_next -di -p 304
print delayacct stats ON
printing IO accounting
PID 304
CPU count real total virtual total delay total delay average
277 780000000 849039485 18877296 0.068ms
IO count delay total delay average
0 0 0ms
SWAP count delay total delay average
0 0 0ms
RECLAIM count delay total delay average
5 11088812685 2217ms
THRASHING count delay total delay average
0 0 0ms
COMPACT count delay total delay average
3 72758 0ms
watch: read=0, write=0, cancelled_write=0
Link: https://lkml.kernel.org/r/1638619795-71451-1-git-send-email-wang.yong12@zte.com.cn
Signed-off-by: wangyong <wang.yong12@zte.com.cn>
Reviewed-by: Jiang Xuexin <jiang.xuexin@zte.com.cn>
Reviewed-by: Zhang Wenya <zhang.wenya1@zte.com.cn>
Reviewed-by: Yang Yang <yang.yang29@zte.com.cn>
Reviewed-by: Balbir Singh <bsingharora@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently delayacct accounts swapin delay only for swapping that cause
blkio. If we use zram for swapping, tools/accounting/getdelays can't
get any SWAP delay.
It's useful to get zram swapin delay information, for example to adjust
compress algorithm or /proc/sys/vm/swappiness.
Reference to PSI, it accounts any kind of swapping by doing its work in
swap_readpage(), no matter whether swapping causes blkio. Let delayacct
do the similar work.
Link: https://lkml.kernel.org/r/20211112083813.8559-1-yang.yang29@zte.com.cn
Signed-off-by: Yang Yang <yang.yang29@zte.com.cn>
Reported-by: Zeal Robot <zealci@zte.com.cn>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The oops id has been added as part of the end of trace marker for the
kerneloops.org project. The id is used to automatically identify
duplicate submissions of the same report. Identical looking reports
with different a id can be considered as the same oops occurred again.
The early initialisation of the oops_id can create a warning if the
random core is not yet fully initialized. On PREEMPT_RT it is
problematic if the id is initialized on demand from non preemptible
context.
The kernel oops project is not available since 2017. Remove the oops_id
and use 0 in the output in case parser rely on it.
Link: https://bugs.debian.org/953172
Link: https://lkml.kernel.org/r/Ybdi16aP2NEugWHq@linutronix.de
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Introduce the error detector "warning" to the error_report event and use
the error_report_end tracepoint at the end of a warning report.
This allows in-kernel tests but also userspace to more easily determine
if a warning occurred without polling kernel logs.
[akpm@linux-foundation.org: add comma to enum list, per Andy]
Link: https://lkml.kernel.org/r/20211115085630.1756817-1-elver@google.com
Signed-off-by: Marco Elver <elver@google.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Wei Liu <wei.liu@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: John Ogness <john.ogness@linutronix.de>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Alexander Popov <alex.popov@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
PRIO_PGRP needs the tasklist_lock mainly to serialize vs setpgid(2), to
protect against any concurrent change_pid(PIDTYPE_PGID) that can move
the task from one hlist to another while iterating.
However, the remaining can only rely only on RCU:
PRIO_PROCESS only does the task lookup and never iterates over tasklist
and we already have an rcu-aware stable pointer.
PRIO_USER is already racy vs setuid(2) so with creds being rcu
protected, we can end up seeing stale data. When removing the
tasklist_lock there can be a race with (i) fork but this is benign as
the child's nice is inherited and the new task is not observable by the
user yet either, hence the return semantics do not differ. And (ii) a
race with exit, which is a small window and can cause us to miss a task
which was removed from the list and it had the highest nice.
Similarly change the buggy do_each_thread/while_each_thread combo in
PRIO_USER for the rcu-safe for_each_process_thread flavor, which doesn't
make use of next_thread/p->thread_group.
[akpm@linux-foundation.org: coding style fixes]
Link: https://lkml.kernel.org/r/20211210182250.43734-1-dave@stgolabs.net
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When I was implementing a new per-cpu kthread cfs_migration, I found the
comm of it "cfs_migration/%u" is truncated due to the limitation of
TASK_COMM_LEN. For example, the comm of the percpu thread on CPU10~19
all have the same name "cfs_migration/1", which will confuse the user.
This issue is not critical, because we can get the corresponding CPU
from the task's Cpus_allowed. But for kthreads corresponding to other
hardware devices, it is not easy to get the detailed device info from
task comm, for example,
jbd2/nvme0n1p2-
xfs-reclaim/sdf
Currently there are so many truncated kthreads:
rcu_tasks_kthre
rcu_tasks_rude_
rcu_tasks_trace
poll_mpt3sas0_s
ext4-rsv-conver
xfs-reclaim/sd{a, b, c, ...}
xfs-blockgc/sd{a, b, c, ...}
xfs-inodegc/sd{a, b, c, ...}
audit_send_repl
ecryptfs-kthrea
vfio-irqfd-clea
jbd2/nvme0n1p2-
...
We can shorten these names to work around this problem, but it may be
not applied to all of the truncated kthreads. Take 'jbd2/nvme0n1p2-'
for example, it is a nice name, and it is not a good idea to shorten it.
One possible way to fix this issue is extending the task comm size, but
as task->comm is used in lots of places, that may cause some potential
buffer overflows. Another more conservative approach is introducing a
new pointer to store kthread's full name if it is truncated, which won't
introduce too much overhead as it is in the non-critical path. Finally
we make a dicision to use the second approach. See also the discussions
in this thread:
https://lore.kernel.org/lkml/20211101060419.4682-1-laoar.shao@gmail.com/
After this change, the full name of these truncated kthreads will be
displayed via /proc/[pid]/comm:
rcu_tasks_kthread
rcu_tasks_rude_kthread
rcu_tasks_trace_kthread
poll_mpt3sas0_statu
ext4-rsv-conversion
xfs-reclaim/sdf1
xfs-blockgc/sdf1
xfs-inodegc/sdf1
audit_send_reply
ecryptfs-kthread
vfio-irqfd-cleanup
jbd2/nvme0n1p2-8
Link: https://lkml.kernel.org/r/20211120112850.46047-1-laoar.shao@gmail.com
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Suggested-by: Petr Mladek <pmladek@suse.com>
Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Arnaldo Carvalho de Melo <arnaldo.melo@gmail.com>
Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: Andrii Nakryiko <andrii.nakryiko@gmail.com>
Cc: Michal Miroslaw <mirq-linux@rere.qmqm.pl>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
- Add new kconfig target 'make mod2noconfig', which will be useful to
speed up the build and test iteration.
- Raise the minimum supported version of LLVM to 11.0.0
- Refactor certs/Makefile
- Change the format of include/config/auto.conf to stop double-quoting
string type CONFIG options.
- Fix ARCH=sh builds in dash
- Separate compression macros for general purposes (cmd_bzip2 etc.) and
the ones for decompressors (cmd_bzip2_with_size etc.)
- Misc Makefile cleanups
-----BEGIN PGP SIGNATURE-----
iQJJBAABCgAzFiEEbmPs18K1szRHjPqEPYsBB53g2wYFAmHnFNIVHG1hc2FoaXJv
eUBrZXJuZWwub3JnAAoJED2LAQed4NsGiQEP/1tkt9IHP7vFvkN9xChQI8HQ7HOC
mPIxBAUzHIp1V2IALb0lfojjnpkzcMNpJZVlmqjgyYShLEPPBFwKVXs1War6GViX
aprUMz7w1zR/vZJ2fplFmrkNwSxNp3+LSE6sHVmsliS4Vfzh7CjHb8DnaKjBvQLZ
M+eQugjHsWI3d3E81/qtRG5EaVs6q8osF3b0Km59mrESWVYKqwlUP3aUyQCCUGFK
mI+zC4SrHH6EAIZd//VpaleXxVtDcjjadb7Iru5MFhFdCBIRoSC3d1IWPUNUKNnK
i0ocDXuIoAulA/mROgrpyAzLXg10qYMwwTmX+tplkHA055gKcY/v4aHym6ypH+TX
6zd34UMTLM32LSjs8hssiQT8BiZU0uZoa/m2E9IBaiExA2sTsRZxgQMKXFFaPQJl
jn4cRiG0K1NDeRKtq4xh2WO46OS4sPlR6zW9EXDEsS/bI05Y7LpUz7Flt6iA2Mq3
0g8uYIYr/9drl96X83tFgTkxxB6lpB29tbsmsrKJRGxvrCDnAhXlXhPCkMajkm2Q
PjJfNtMFzwemSZWq09+F+X5BgCjzZtroOdFI9FTMNhGWyaUJZXCtcXQ6UTIKnTHO
cDjcURvh+l56eNEQ5SMTNtAkxB+pX8gPUmyO1wLwRUT4YodxylkTUXGyBBR9tgTn
Yks1TnPD06ld364l
=8BQf
-----END PGP SIGNATURE-----
Merge tag 'kbuild-v5.17' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild
Pull Kbuild updates from Masahiro Yamada:
- Add new kconfig target 'make mod2noconfig', which will be useful to
speed up the build and test iteration.
- Raise the minimum supported version of LLVM to 11.0.0
- Refactor certs/Makefile
- Change the format of include/config/auto.conf to stop double-quoting
string type CONFIG options.
- Fix ARCH=sh builds in dash
- Separate compression macros for general purposes (cmd_bzip2 etc.) and
the ones for decompressors (cmd_bzip2_with_size etc.)
- Misc Makefile cleanups
* tag 'kbuild-v5.17' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: (34 commits)
kbuild: add cmd_file_size
arch: decompressor: remove useless vmlinux.bin.all-y
kbuild: rename cmd_{bzip2,lzma,lzo,lz4,xzkern,zstd22}
kbuild: drop $(size_append) from cmd_zstd
sh: rename suffix-y to suffix_y
doc: kbuild: fix default in `imply` table
microblaze: use built-in function to get CPU_{MAJOR,MINOR,REV}
certs: move scripts/extract-cert to certs/
kbuild: do not quote string values in include/config/auto.conf
kbuild: do not include include/config/auto.conf from shell scripts
certs: simplify $(srctree)/ handling and remove config_filename macro
kbuild: stop using config_filename in scripts/Makefile.modsign
certs: remove misleading comments about GCC PR
certs: refactor file cleaning
certs: remove unneeded -I$(srctree) option for system_certificates.o
certs: unify duplicated cmd_extract_certs and improve the log
certs: use $< and $@ to simplify the key generation rule
kbuild: remove headers_check stub
kbuild: move headers_check.pl to usr/include/
certs: use if_changed to re-generate the key when the key type is changed
...
The bpf_ringbuf_submit() and bpf_ringbuf_discard() have ARG_PTR_TO_ALLOC_MEM
in their bpf_func_proto definition as their first argument, and thus both expect
the result from a prior bpf_ringbuf_reserve() call which has a return type of
RET_PTR_TO_ALLOC_MEM_OR_NULL.
While the non-NULL memory from bpf_ringbuf_reserve() can be passed to other
helpers, the two sinks (bpf_ringbuf_submit(), bpf_ringbuf_discard()) right now
only enforce a register type of PTR_TO_MEM.
This can lead to potential type confusion since it would allow other PTR_TO_MEM
memory to be passed into the two sinks which did not come from bpf_ringbuf_reserve().
Add a new MEM_ALLOC composable type attribute for PTR_TO_MEM, and enforce that:
- bpf_ringbuf_reserve() returns NULL or PTR_TO_MEM | MEM_ALLOC
- bpf_ringbuf_submit() and bpf_ringbuf_discard() only take PTR_TO_MEM | MEM_ALLOC
but not plain PTR_TO_MEM arguments via ARG_PTR_TO_ALLOC_MEM
- however, other helpers might treat PTR_TO_MEM | MEM_ALLOC as plain PTR_TO_MEM
to populate the memory area when they use ARG_PTR_TO_{UNINIT_,}MEM in their
func proto description
Fixes: 457f44363a ("bpf: Implement BPF ring buffer and verifier support for it")
Reported-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Both bpf_ringbuf_submit() and bpf_ringbuf_discard() have ARG_PTR_TO_ALLOC_MEM
in their bpf_func_proto definition as their first argument. They both expect
the result from a prior bpf_ringbuf_reserve() call which has a return type of
RET_PTR_TO_ALLOC_MEM_OR_NULL.
Meaning, after a NULL check in the code, the verifier will promote the register
type in the non-NULL branch to a PTR_TO_MEM and in the NULL branch to a known
zero scalar. Generally, pointer arithmetic on PTR_TO_MEM is allowed, so the
latter could have an offset.
The ARG_PTR_TO_ALLOC_MEM expects a PTR_TO_MEM register type. However, the non-
zero result from bpf_ringbuf_reserve() must be fed into either bpf_ringbuf_submit()
or bpf_ringbuf_discard() but with the original offset given it will then read
out the struct bpf_ringbuf_hdr mapping.
The verifier missed to enforce a zero offset, so that out of bounds access
can be triggered which could be used to escalate privileges if unprivileged
BPF was enabled (disabled by default in kernel).
Fixes: 457f44363a ("bpf: Implement BPF ring buffer and verifier support for it")
Reported-by: <tr3e.wang@gmail.com> (SecCoder Security Lab)
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Right now the assertion on check_ptr_off_reg() is only enforced for register
types PTR_TO_CTX (and open coded also for PTR_TO_BTF_ID), however, this is
insufficient since many other PTR_TO_* register types such as PTR_TO_FUNC do
not handle/expect register offsets when passed to helper functions.
Given this can slip-through easily when adding new types, make this an explicit
allow-list and reject all other current and future types by default if this is
encountered.
Also, extend check_ptr_off_reg() to handle PTR_TO_BTF_ID as well instead of
duplicating it. For PTR_TO_BTF_ID, reg->off is used for BTF to match expected
BTF ids if struct offset is used. This part still needs to be allowed, but the
dynamic off from the tnum must be rejected.
Fixes: 69c087ba62 ("bpf: Add bpf_for_each_map_elem() helper")
Fixes: eaa6bcb71e ("bpf: Introduce bpf_per_cpu_ptr()")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Similar as with other pointer types where we use ldimm64, clear the register
content to zero first, and then populate the PTR_TO_FUNC type and subprogno
number. Currently this is not done, and leads to reuse of stale register
tracking data.
Given for special ldimm64 cases we always clear the register offset, make it
common for all cases, so it won't be forgotten in future.
Fixes: 69c087ba62 ("bpf: Add bpf_for_each_map_elem() helper")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Generalize the check_ctx_reg() helper function into a more generic named one
so that it can be reused for other register types as well to check whether
their offset is non-zero. No functional change.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
For PREEMPT/DYNAMIC_PREEMPT the *_unlock() will already trigger a
preemption, no point in then calling preempt_schedule_common()
*again*.
Use _cond_resched() instead, since this is a NOP for the preemptible
configs while it provide a preemption point for the others.
Reported-by: xuhaifeng <xuhaifeng@oppo.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/YcGnvDEYBwOiV0cR@hirez.programming.kicks-ass.net
Quieten all kernel-doc warnings in kernel/sched/fair.c:
kernel/sched/fair.c:3663: warning: No description found for return value of 'update_cfs_rq_load_avg'
kernel/sched/fair.c:8601: warning: No description found for return value of 'asym_smt_can_pull_tasks'
kernel/sched/fair.c:8673: warning: Function parameter or member 'sds' not described in 'update_sg_lb_stats'
kernel/sched/fair.c:9483: warning: contents before sections
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20211218055900.2704-1-rdunlap@infradead.org
There are two types of forced idle time: forced idle time from cookie'd
task and forced idle time form uncookie'd task. The forced idle time from
uncookie'd task is actually caused by the cookie'd task in runqueue
indirectly, and it's more accurate to measure the capacity loss with the
sum of both.
Assuming cpu x and cpu y are a pair of SMT siblings, consider the
following scenarios:
1.There's a cookie'd task running on cpu x, and there're 4 uncookie'd
tasks running on cpu y. For cpu x, there will be 80% forced idle time
(from uncookie'd task); for cpu y, there will be 20% forced idle time
(from cookie'd task).
2.There's a uncookie'd task running on cpu x, and there're 4 cookie'd
tasks running on cpu y. For cpu x, there will be 80% forced idle time
(from cookie'd task); for cpu y, there will be 20% forced idle time
(from uncookie'd task).
The scenario1 can recurrent by stress-ng(scenario2 can recurrent similary):
(cookie'd)taskset -c x stress-ng -c 1 -l 100
(uncookie'd)taskset -c y stress-ng -c 4 -l 100
In the above two scenarios, the total capacity loss is 1 cpu, but in
scenario1, the cookie'd forced idle time tells us 20% cpu capacity loss, in
scenario2, the cookie'd forced idle time tells us 80% cpu capacity loss,
which are not accurate. It'll be more accurate to measure with cookie'd
forced idle time and uncookie'd forced idle time.
Signed-off-by: Cruz Zhao <CruzZhao@linux.alibaba.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Josh Don <joshdon@google.com>
Link: https://lore.kernel.org/r/1641894961-9241-2-git-send-email-CruzZhao@linux.alibaba.com
Similarly to util_avg and util_sum, don't sync load_sum with the low
bound of load_avg but only ensure that load_sum stays in the correct range.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Sachin Sant <sachinp@linux.ibm.com>
Link: https://lkml.kernel.org/r/20220111134659.24961-5-vincent.guittot@linaro.org
Similarly to util_avg and util_sum, don't sync runnable_sum with the low
bound of runnable_avg but only ensure that runnable_sum stays in the
correct range.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Sachin Sant <sachinp@linux.ibm.com>
Link: https://lkml.kernel.org/r/20220111134659.24961-4-vincent.guittot@linaro.org
Rick reported performance regressions in bugzilla because of cpu frequency
being lower than before:
https://bugzilla.kernel.org/show_bug.cgi?id=215045
He bisected the problem to:
commit 1c35b07e6d ("sched/fair: Ensure _sum and _avg values stay consistent")
This commit forces util_sum to be synced with the new util_avg after
removing the contribution of a task and before the next periodic sync. By
doing so util_sum is rounded to its lower bound and might lost up to
LOAD_AVG_MAX-1 of accumulated contribution which has not yet been
reflected in util_avg.
update_tg_cfs_util() is not the only place where we round util_sum and
lost some accumulated contributions that are not already reflected in
util_avg. Modify update_tg_cfs_util() and detach_entity_load_avg() to not
sync util_sum with the new util_avg. Instead of always setting util_sum to
the low bound of util_avg, which can significantly lower the utilization,
we propagate the difference. In addition, we also check that cfs's util_sum
always stays above the lower bound for a given util_avg as it has been
observed that sched_entity's util_sum is sometimes above cfs one.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Sachin Sant <sachinp@linux.ibm.com>
Link: https://lkml.kernel.org/r/20220111134659.24961-3-vincent.guittot@linaro.org
Rick reported performance regressions in bugzilla because of cpu frequency
being lower than before:
https://bugzilla.kernel.org/show_bug.cgi?id=215045
He bisected the problem to:
commit 1c35b07e6d ("sched/fair: Ensure _sum and _avg values stay consistent")
This commit forces util_sum to be synced with the new util_avg after
removing the contribution of a task and before the next periodic sync. By
doing so util_sum is rounded to its lower bound and might lost up to
LOAD_AVG_MAX-1 of accumulated contribution which has not yet been
reflected in util_avg.
Instead of always setting util_sum to the low bound of util_avg, which can
significantly lower the utilization of root cfs_rq after propagating the
change down into the hierarchy, we revert the change of util_sum and
propagate the difference.
In addition, we also check that cfs's util_sum always stays above the
lower bound for a given util_avg as it has been observed that
sched_entity's util_sum is sometimes above cfs one.
Fixes: 1c35b07e6d ("sched/fair: Ensure _sum and _avg values stay consistent")
Reported-by: Rick Yiu <rickyiu@google.com>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Sachin Sant <sachinp@linux.ibm.com>
Link: https://lkml.kernel.org/r/20220111134659.24961-2-vincent.guittot@linaro.org
With write operation on psi files replacing old trigger with a new one,
the lifetime of its waitqueue is totally arbitrary. Overwriting an
existing trigger causes its waitqueue to be freed and pending poll()
will stumble on trigger->event_wait which was destroyed.
Fix this by disallowing to redefine an existing psi trigger. If a write
operation is used on a file descriptor with an already existing psi
trigger, the operation will fail with EBUSY error.
Also bypass a check for psi_disabled in the psi_trigger_destroy as the
flag can be flipped after the trigger is created, leading to a memory
leak.
Fixes: 0e94682b73 ("psi: introduce psi monitor")
Reported-by: syzbot+cdb5dd11c97cc532efad@syzkaller.appspotmail.com
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Analyzed-by: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20220111232309.1786347-1-surenb@google.com
Time readers that cannot take locks (due to NMI etc..) currently make
use of perf_event::shadow_ctx_time, which, for that event gives:
time' = now + (time - timestamp)
or, alternatively arranged:
time' = time + (now - timestamp)
IOW, the progression of time since the last time the shadow_ctx_time
was updated.
There's problems with this:
A) the shadow_ctx_time is per-event, even though the ctx_time it
reflects is obviously per context. The direct concequence of this
is that the context needs to iterate all events all the time to
keep the shadow_ctx_time in sync.
B) even with the prior point, the context itself might not be active
meaning its time should not advance to begin with.
C) shadow_ctx_time isn't consistently updated when ctx_time is
There are 3 users of this stuff, that suffer differently from this:
- calc_timer_values()
- perf_output_read()
- perf_event_update_userpage() /* A */
- perf_event_read_local() /* A,B */
In particular, perf_output_read() doesn't suffer at all, because it's
sample driven and hence only relevant when the event is actually
running.
This same was supposed to be true for perf_event_update_userpage(),
after all self-monitoring implies the context is active *HOWEVER*, as
per commit f792565326 ("perf/core: fix userpage->time_enabled of
inactive events") this goes wrong when combined with counter
overcommit, in that case those events that do not get scheduled when
the context becomes active (task events typically) miss out on the
EVENT_TIME update and ENABLED time is inflated (for a little while)
with the time the context was inactive. Once the event gets rotated
in, this gets corrected, leading to a non-monotonic timeflow.
perf_event_read_local() made things even worse, it can request time at
any point, suffering all the problems perf_event_update_userpage()
does and more. Because while perf_event_update_userpage() is limited
by the context being active, perf_event_read_local() users have no
such constraint.
Therefore, completely overhaul things and do away with
perf_event::shadow_ctx_time. Instead have regular context time updates
keep track of this offset directly and provide perf_event_time_now()
to complement perf_event_time().
perf_event_time_now() will, in adition to being context wide, also
take into account if the context is active. For inactive context, it
will not advance time.
This latter property means the cgroup perf_cgroup_info context needs
to grow addition state to track this.
Additionally, since all this is strictly per-cpu, we can use barrier()
to order context activity vs context time.
Fixes: 7d9285e82d ("perf/bpf: Extend the perf_event_read_local() interface, a.k.a. "bpf: perf event change needed for subsequent bpf helpers"")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Song Liu <song@kernel.org>
Tested-by: Namhyung Kim <namhyung@kernel.org>
Link: https://lkml.kernel.org/r/YcB06DasOBtU0b00@hirez.programming.kicks-ass.net
Pull module updates from Luis Chamberlain:
"The biggest change here is in-kernel support for module decompression.
This change is being made to help support LSMs like LoadPin as
otherwise it loses link between the source of kernel module on the
disk and binary blob that is being loaded into the kernel.
kmod decompression is still done by userspace even with this is done,
both because there are no measurable gains in not doing so and as it
adds a secondary extra check for validating the module before loading
it into the kernel.
The rest of the changes are minor, the only other change worth
mentionin there is Jessica Yu is now bowing out of maintenance of
modules as she's taking a break from work.
While there were other changes posted for modules, those have not yet
received much review of testing so I'm not yet comfortable in merging
any of those changes yet."
* 'modules-next' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux:
module: fix signature check failures when using in-kernel decompression
kernel: Fix spelling mistake "compresser" -> "compressor"
MAINTAINERS: add mailing lists for kmod and modules
module.h: allow #define strings to work with MODULE_IMPORT_NS
module: add in-kernel support for decompressing
MAINTAINERS: Remove myself as modules maintainer
module: Remove outdated comment
Pull signal/exit/ptrace updates from Eric Biederman:
"This set of changes deletes some dead code, makes a lot of cleanups
which hopefully make the code easier to follow, and fixes bugs found
along the way.
The end-game which I have not yet reached yet is for fatal signals
that generate coredumps to be short-circuit deliverable from
complete_signal, for force_siginfo_to_task not to require changing
userspace configured signal delivery state, and for the ptrace stops
to always happen in locations where we can guarantee on all
architectures that the all of the registers are saved and available on
the stack.
Removal of profile_task_ext, profile_munmap, and profile_handoff_task
are the big successes for dead code removal this round.
A bunch of small bug fixes are included, as most of the issues
reported were small enough that they would not affect bisection so I
simply added the fixes and did not fold the fixes into the changes
they were fixing.
There was a bug that broke coredumps piped to systemd-coredump. I
dropped the change that caused that bug and replaced it entirely with
something much more restrained. Unfortunately that required some
rebasing.
Some successes after this set of changes: There are few enough calls
to do_exit to audit in a reasonable amount of time. The lifetime of
struct kthread now matches the lifetime of struct task, and the
pointer to struct kthread is no longer stored in set_child_tid. The
flag SIGNAL_GROUP_COREDUMP is removed. The field group_exit_task is
removed. Issues where task->exit_code was examined with
signal->group_exit_code should been examined were fixed.
There are several loosely related changes included because I am
cleaning up and if I don't include them they will probably get lost.
The original postings of these changes can be found at:
https://lkml.kernel.org/r/87a6ha4zsd.fsf@email.froward.int.ebiederm.orghttps://lkml.kernel.org/r/87bl1kunjj.fsf@email.froward.int.ebiederm.orghttps://lkml.kernel.org/r/87r19opkx1.fsf_-_@email.froward.int.ebiederm.org
I trimmed back the last set of changes to only the obviously correct
once. Simply because there was less time for review than I had hoped"
* 'signal-for-v5.17' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (44 commits)
ptrace/m68k: Stop open coding ptrace_report_syscall
ptrace: Remove unused regs argument from ptrace_report_syscall
ptrace: Remove second setting of PT_SEIZED in ptrace_attach
taskstats: Cleanup the use of task->exit_code
exit: Use the correct exit_code in /proc/<pid>/stat
exit: Fix the exit_code for wait_task_zombie
exit: Coredumps reach do_group_exit
exit: Remove profile_handoff_task
exit: Remove profile_task_exit & profile_munmap
signal: clean up kernel-doc comments
signal: Remove the helper signal_group_exit
signal: Rename group_exit_task group_exec_task
coredump: Stop setting signal->group_exit_task
signal: Remove SIGNAL_GROUP_COREDUMP
signal: During coredumps set SIGNAL_GROUP_EXIT in zap_process
signal: Make coredump handling explicit in complete_signal
signal: Have prepare_signal detect coredumps using signal->core_state
signal: Have the oom killer detect coredumps using signal->core_state
exit: Move force_uaccess back into do_exit
exit: Guarantee make_task_dead leaks the tsk when calling do_task_exit
...
-----BEGIN PGP SIGNATURE-----
iQFHBAABCAAxFiEEIbPD0id6easf0xsudhRwX5BBoF4FAmHhw7oTHHdlaS5saXVA
a2VybmVsLm9yZwAKCRB2FHBfkEGgXrjSB/979LV4Dn1PMcFYsSdlFEMeHcjzJdw/
kFnLPXMaPJyfg6QPuf83jxzw9uxw8fcePMdVq/FFBtmVV9fJMAv62B8jaGS1p58c
WnAg+7zsTN+xEoJn+tskSSon8BNMWVrl41zP3K4Ged+5j8UEBk62GB8Orz1qkpwL
fTh3/+xAvczJeD4zZb1dAm4WnmcQJ4vhg45p07jX6owvnwQAikMFl45aSW54I5o8
vAxGzFgdsZ2NtExnRNKh3b3DozA8JUE89KckBSZnDtq4rH8Fyy6Wij56Hc6v6Cml
SUohiNbHX7hsNwit/lxL8wuF97IiA0pQSABobEg3rxfTghTUep51LlaN
=/m4A
-----END PGP SIGNATURE-----
Merge tag 'hyperv-next-signed-20220114' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux
Pull hyperv updates from Wei Liu:
- More patches for Hyper-V isolation VM support (Tianyu Lan)
- Bug fixes and clean-up patches from various people
* tag 'hyperv-next-signed-20220114' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux:
scsi: storvsc: Fix storvsc_queuecommand() memory leak
x86/hyperv: Properly deal with empty cpumasks in hyperv_flush_tlb_multi()
Drivers: hv: vmbus: Initialize request offers message for Isolation VM
scsi: storvsc: Fix unsigned comparison to zero
swiotlb: Add CONFIG_HAS_IOMEM check around swiotlb_mem_remap()
x86/hyperv: Fix definition of hv_ghcb_pg variable
Drivers: hv: Fix definition of hypercall input & output arg variables
net: netvsc: Add Isolation VM support for netvsc driver
scsi: storvsc: Add Isolation VM support for storvsc driver
hyper-v: Enable swiotlb bounce buffer for Isolation VM
x86/hyper-v: Add hyperv Isolation VM check in the cc_platform_has()
swiotlb: Add swiotlb bounce buffer remap function for HV IVM
New:
- The Real Time Linux Analysis (RTLA) tool is added to the tools directory.
- Can safely filter on user space pointers with: field.ustring ~ "match-string"
- eprobes can now be filtered like any other event.
- trace_marker(_raw) now uses stream_open() to allow multiple threads to safely
write to it. Note, this could possibly break existing user space, but we will
not know until we hear about it, and then can revert the change if need be.
- New field in events to display when bottom halfs are disabled.
- Sorting of the ftrace functions are now done at compile time instead of
at bootup.
Infrastructure changes to support future efforts:
- Added __rel_loc type for trace events. Similar to __data_loc but the offset
to the dynamic data is based off of the location of the descriptor and not
the beginning of the event. Needed for user defined events.
- Some simplification of event trigger code.
- Make synthetic events process its callback better to not hinder other
event callbacks that are registered. Needed for user defined events.
And other small fixes and clean ups.
-
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCYeGvcxQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qrZtAP9ICjJxX54MTErElhhUL/NFLV7wqhJi
OIAgmp6jGVRqPAD+JxQtBnGH+3XMd71ioQkTfQ1rp+jBz2ERBj2DmELUAg0=
=zmda
-----END PGP SIGNATURE-----
Merge tag 'trace-v5.17' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing updates from Steven Rostedt:
"New:
- The Real Time Linux Analysis (RTLA) tool is added to the tools
directory.
- Can safely filter on user space pointers with: field.ustring ~
"match-string"
- eprobes can now be filtered like any other event.
- trace_marker(_raw) now uses stream_open() to allow multiple threads
to safely write to it. Note, this could possibly break existing
user space, but we will not know until we hear about it, and then
can revert the change if need be.
- New field in events to display when bottom halfs are disabled.
- Sorting of the ftrace functions are now done at compile time
instead of at bootup.
Infrastructure changes to support future efforts:
- Added __rel_loc type for trace events. Similar to __data_loc but
the offset to the dynamic data is based off of the location of the
descriptor and not the beginning of the event. Needed for user
defined events.
- Some simplification of event trigger code.
- Make synthetic events process its callback better to not hinder
other event callbacks that are registered. Needed for user defined
events.
And other small fixes and cleanups"
* tag 'trace-v5.17' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (50 commits)
tracing: Add ustring operation to filtering string pointers
rtla: Add rtla timerlat hist documentation
rtla: Add rtla timerlat top documentation
rtla: Add rtla timerlat documentation
rtla: Add rtla osnoise hist documentation
rtla: Add rtla osnoise top documentation
rtla: Add rtla osnoise man page
rtla: Add Documentation
rtla/timerlat: Add timerlat hist mode
rtla: Add timerlat tool and timelart top mode
rtla/osnoise: Add the hist mode
rtla/osnoise: Add osnoise top mode
rtla: Add osnoise tool
rtla: Helper functions for rtla
rtla: Real-Time Linux Analysis tool
tracing/osnoise: Properly unhook events if start_per_cpu_kthreads() fails
tracing: Remove duplicate warnings when calling trace_create_file()
tracing/kprobes: 'nmissed' not showed correctly for kretprobe
tracing: Add test for user space strings when filtering on string pointers
tracing: Have syscall trace events use trace_event_buffer_lock_reserve()
...
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEESH4wyp42V4tXvYsjUqAMR0iAlPIFAmHhbrwACgkQUqAMR0iA
lPJkAg//ZSvlc/dB7LdAiVmc2JBnDC49yxn7OguCW6t5T8RmVuaurPa/ZNDcZomw
tbT7QSjndg0LHAGiWRrjGzQZ89D7/cbTzOxz821F0LLFDjiEMbUB+tCYznHud1cP
5+83QMwAkpME+RAj471SAHlhGAGJ3YKgFBbUP9sHSbtQqXzLGHBl4MluMh1wpVPq
K1lyVG8TaOCXhWtpuO8MVrz7Q52qqi/85fHZKbQq9Tm5x+nDxVfGw24QcrkQB/Il
4B2T5UP9rzbHQYemc0hascLMJc1mxcFMtK8WKH48cPEL+kQarvsRn6fvVlCxdNUe
ZgQfhixDQPQyC0/CaN7qM9EOpKbwoY8+Lc36kq06YMAJj/NPa4LE/1j2DyRvaein
lJ2vy/E/TdaR71pApL02P5BeU4XAmhm10D/Qpj0pgRuERYVayNE/7wqzyrae7mqm
8zgpUDftGS6BXnwmRWmgFYI5bZWVfzwGjs+JjKIDXPhWHZInpOCkHGCehvabd+4U
02jwCAmKmpkhAxslm2579CZ0+VFK52z5mUwrWwsxhflUD1q3dNj7CAtPKYLqw5Xe
iz2rrulDlwVll5pdbRf18A/+I2YSdz8pDwCpI6WKggompd6r+du3Y/GTGTBhoRcD
L9Q/rENx1gpJgdZYL5zefKlrjMSQhdwp3CRk34DEronTTkty5mc=
=fTdn
-----END PGP SIGNATURE-----
Merge tag 'livepatching-for-5.17' of git://git.kernel.org/pub/scm/linux/kernel/git/livepatching/livepatching
Pull livepatching updates from Petr Mladek:
- Correctly handle kobjects when a livepatch init fails
- Avoid CPU hogging when searching for many livepatched symbols
- Add livepatch API page into documentation
* tag 'livepatching-for-5.17' of git://git.kernel.org/pub/scm/linux/kernel/git/livepatching/livepatching:
livepatch: Avoid CPU hogging with cond_resched
livepatch: Fix missing unlock on error in klp_enable_patch()
livepatch: Fix kobject refcount bug on klp_init_patch_early failure path
Documentation: livepatch: Add livepatch API page
Merge misc updates from Andrew Morton:
"146 patches.
Subsystems affected by this patch series: kthread, ia64, scripts,
ntfs, squashfs, ocfs2, vfs, and mm (slab-generic, slab, kmemleak,
dax, kasan, debug, pagecache, gup, shmem, frontswap, memremap,
memcg, selftests, pagemap, dma, vmalloc, memory-failure, hugetlb,
userfaultfd, vmscan, mempolicy, oom-kill, hugetlbfs, migration, thp,
ksm, page-poison, percpu, rmap, zswap, zram, cleanups, hmm, and
damon)"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (146 commits)
mm/damon: hide kernel pointer from tracepoint event
mm/damon/vaddr: hide kernel pointer from damon_va_three_regions() failure log
mm/damon/vaddr: use pr_debug() for damon_va_three_regions() failure logging
mm/damon/dbgfs: remove an unnecessary variable
mm/damon: move the implementation of damon_insert_region to damon.h
mm/damon: add access checking for hugetlb pages
Docs/admin-guide/mm/damon/usage: update for schemes statistics
mm/damon/dbgfs: support all DAMOS stats
Docs/admin-guide/mm/damon/reclaim: document statistics parameters
mm/damon/reclaim: provide reclamation statistics
mm/damon/schemes: account how many times quota limit has exceeded
mm/damon/schemes: account scheme actions that successfully applied
mm/damon: remove a mistakenly added comment for a future feature
Docs/admin-guide/mm/damon/usage: update for kdamond_pid and (mk|rm)_contexts
Docs/admin-guide/mm/damon/usage: mention tracepoint at the beginning
Docs/admin-guide/mm/damon/usage: remove redundant information
Docs/admin-guide/mm/damon/usage: update for scheme quotas and watermarks
mm/damon: convert macro functions to static inline functions
mm/damon: modify damon_rand() macro to static inline function
mm/damon: move damon_rand() definition into damon.h
...
cpumask_first() is a more effective analogue of 'next' version if n == -1
(which means start == 0). This patch replaces 'next' with 'first' where
things look trivial.
There's no cpumask_first_zero() function, so create it.
Signed-off-by: Yury Norov <yury.norov@gmail.com>
Tested-by: Wolfram Sang <wsa+renesas@sang-engineering.com>
Currently three dma atomic pools are initialized as long as the relevant
kernel codes are built in. While in kdump kernel of x86_64, this is not
right when trying to create atomic_pool_dma, because there's no managed
pages in DMA zone. In the case, DMA zone only has low 1M memory
presented and locked down by memblock allocator. So no pages are added
into buddy of DMA zone. Please check commit f1d4d47c58 ("x86/setup:
Always reserve the first 1M of RAM").
Then in kdump kernel of x86_64, it always prints below failure message:
DMA: preallocated 128 KiB GFP_KERNEL pool for atomic allocations
swapper/0: page allocation failure: order:5, mode:0xcc1(GFP_KERNEL|GFP_DMA), nodemask=(null),cpuset=/,mems_allowed=0
CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.13.0-0.rc5.20210611git929d931f2b40.42.fc35.x86_64 #1
Hardware name: Dell Inc. PowerEdge R910/0P658H, BIOS 2.12.0 06/04/2018
Call Trace:
dump_stack+0x7f/0xa1
warn_alloc.cold+0x72/0xd6
__alloc_pages_slowpath.constprop.0+0xf29/0xf50
__alloc_pages+0x24d/0x2c0
alloc_page_interleave+0x13/0xb0
atomic_pool_expand+0x118/0x210
__dma_atomic_pool_init+0x45/0x93
dma_atomic_pool_init+0xdb/0x176
do_one_initcall+0x67/0x320
kernel_init_freeable+0x290/0x2dc
kernel_init+0xa/0x111
ret_from_fork+0x22/0x30
Mem-Info:
......
DMA: failed to allocate 128 KiB GFP_KERNEL|GFP_DMA pool for atomic allocation
DMA: preallocated 128 KiB GFP_KERNEL|GFP_DMA32 pool for atomic allocations
Here, let's check if DMA zone has managed pages, then create
atomic_pool_dma if yes. Otherwise just skip it.
Link: https://lkml.kernel.org/r/20211223094435.248523-3-bhe@redhat.com
Fixes: 6f599d8423 ("x86/kdump: Always reserve the low 1M when the crashkernel option is specified")
Signed-off-by: Baoquan He <bhe@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: John Donnelly <john.p.donnelly@oracle.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Laight <David.Laight@ACULAB.COM>
Cc: David Rientjes <rientjes@google.com>
Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
For embedded systems with low total memory, having to run applications
with relatively large memory requirements, 10% max limitation for
watermark_scale_factor poses an issue of triggering direct reclaim every
time such application is started. This results in slow application
startup times and bad end-user experience.
By increasing watermark_scale_factor max limit we allow vendors more
flexibility to choose the right level of kswapd aggressiveness for their
device and workload requirements.
Link: https://lkml.kernel.org/r/20211124193604.2758863-1-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Lukas Middendorf <kernel@tuxforce.de>
Cc: Antti Palosaari <crope@iki.fi>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Zhang Yi <yi.zhang@huawei.com>
Cc: Fengfei Xi <xi.fengfei@h3c.com>
Cc: Mike Rapoport <rppt@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The patch to add anonymous vma names causes a build failure in some
configurations:
include/linux/mm_types.h: In function 'is_same_vma_anon_name':
include/linux/mm_types.h:924:37: error: implicit declaration of function 'strcmp' [-Werror=implicit-function-declaration]
924 | return name && vma_name && !strcmp(name, vma_name);
| ^~~~~~
include/linux/mm_types.h:22:1: note: 'strcmp' is defined in header '<string.h>'; did you forget to '#include <string.h>'?
This should not really be part of linux/mm_types.h in the first place,
as that header is meant to only contain structure defintions and need a
minimum set of indirect includes itself.
While the header clearly includes more than it should at this point,
let's not make it worse by including string.h as well, which would pull
in the expensive (compile-speed wise) fortify-string logic.
Move the new functions into a separate header that only needs to be
included in a couple of locations.
Link: https://lkml.kernel.org/r/20211207125710.2503446-1-arnd@kernel.org
Fixes: "mm: add a field to store names for private anonymous memory"
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Colin Cross <ccross@google.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In many userspace applications, and especially in VM based applications
like Android uses heavily, there are multiple different allocators in
use. At a minimum there is libc malloc and the stack, and in many cases
there are libc malloc, the stack, direct syscalls to mmap anonymous
memory, and multiple VM heaps (one for small objects, one for big
objects, etc.). Each of these layers usually has its own tools to
inspect its usage; malloc by compiling a debug version, the VM through
heap inspection tools, and for direct syscalls there is usually no way
to track them.
On Android we heavily use a set of tools that use an extended version of
the logic covered in Documentation/vm/pagemap.txt to walk all pages
mapped in userspace and slice their usage by process, shared (COW) vs.
unique mappings, backing, etc. This can account for real physical
memory usage even in cases like fork without exec (which Android uses
heavily to share as many private COW pages as possible between
processes), Kernel SamePage Merging, and clean zero pages. It produces
a measurement of the pages that only exist in that process (USS, for
unique), and a measurement of the physical memory usage of that process
with the cost of shared pages being evenly split between processes that
share them (PSS).
If all anonymous memory is indistinguishable then figuring out the real
physical memory usage (PSS) of each heap requires either a pagemap
walking tool that can understand the heap debugging of every layer, or
for every layer's heap debugging tools to implement the pagemap walking
logic, in which case it is hard to get a consistent view of memory
across the whole system.
Tracking the information in userspace leads to all sorts of problems.
It either needs to be stored inside the process, which means every
process has to have an API to export its current heap information upon
request, or it has to be stored externally in a filesystem that somebody
needs to clean up on crashes. It needs to be readable while the process
is still running, so it has to have some sort of synchronization with
every layer of userspace. Efficiently tracking the ranges requires
reimplementing something like the kernel vma trees, and linking to it
from every layer of userspace. It requires more memory, more syscalls,
more runtime cost, and more complexity to separately track regions that
the kernel is already tracking.
This patch adds a field to /proc/pid/maps and /proc/pid/smaps to show a
userspace-provided name for anonymous vmas. The names of named
anonymous vmas are shown in /proc/pid/maps and /proc/pid/smaps as
[anon:<name>].
Userspace can set the name for a region of memory by calling
prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME, start, len, (unsigned long)name)
Setting the name to NULL clears it. The name length limit is 80 bytes
including NUL-terminator and is checked to contain only printable ascii
characters (including space), except '[',']','\','$' and '`'.
Ascii strings are being used to have a descriptive identifiers for vmas,
which can be understood by the users reading /proc/pid/maps or
/proc/pid/smaps. Names can be standardized for a given system and they
can include some variable parts such as the name of the allocator or a
library, tid of the thread using it, etc.
The name is stored in a pointer in the shared union in vm_area_struct
that points to a null terminated string. Anonymous vmas with the same
name (equivalent strings) and are otherwise mergeable will be merged.
The name pointers are not shared between vmas even if they contain the
same name. The name pointer is stored in a union with fields that are
only used on file-backed mappings, so it does not increase memory usage.
CONFIG_ANON_VMA_NAME kernel configuration is introduced to enable this
feature. It keeps the feature disabled by default to prevent any
additional memory overhead and to avoid confusing procfs parsers on
systems which are not ready to support named anonymous vmas.
The patch is based on the original patch developed by Colin Cross, more
specifically on its latest version [1] posted upstream by Sumit Semwal.
It used a userspace pointer to store vma names. In that design, name
pointers could be shared between vmas. However during the last
upstreaming attempt, Kees Cook raised concerns [2] about this approach
and suggested to copy the name into kernel memory space, perform
validity checks [3] and store as a string referenced from
vm_area_struct.
One big concern is about fork() performance which would need to strdup
anonymous vma names. Dave Hansen suggested experimenting with
worst-case scenario of forking a process with 64k vmas having longest
possible names [4]. I ran this experiment on an ARM64 Android device
and recorded a worst-case regression of almost 40% when forking such a
process.
This regression is addressed in the followup patch which replaces the
pointer to a name with a refcounted structure that allows sharing the
name pointer between vmas of the same name. Instead of duplicating the
string during fork() or when splitting a vma it increments the refcount.
[1] https://lore.kernel.org/linux-mm/20200901161459.11772-4-sumit.semwal@linaro.org/
[2] https://lore.kernel.org/linux-mm/202009031031.D32EF57ED@keescook/
[3] https://lore.kernel.org/linux-mm/202009031022.3834F692@keescook/
[4] https://lore.kernel.org/linux-mm/5d0358ab-8c47-2f5f-8e43-23b89d6a8e95@intel.com/
Changes for prctl(2) manual page (in the options section):
PR_SET_VMA
Sets an attribute specified in arg2 for virtual memory areas
starting from the address specified in arg3 and spanning the
size specified in arg4. arg5 specifies the value of the attribute
to be set. Note that assigning an attribute to a virtual memory
area might prevent it from being merged with adjacent virtual
memory areas due to the difference in that attribute's value.
Currently, arg2 must be one of:
PR_SET_VMA_ANON_NAME
Set a name for anonymous virtual memory areas. arg5 should
be a pointer to a null-terminated string containing the
name. The name length including null byte cannot exceed
80 bytes. If arg5 is NULL, the name of the appropriate
anonymous virtual memory areas will be reset. The name
can contain only printable ascii characters (including
space), except '[',']','\','$' and '`'.
This feature is available only if the kernel is built with
the CONFIG_ANON_VMA_NAME option enabled.
[surenb@google.com: docs: proc.rst: /proc/PID/maps: fix malformed table]
Link: https://lkml.kernel.org/r/20211123185928.2513763-1-surenb@google.com
[surenb: rebased over v5.15-rc6, replaced userpointer with a kernel copy,
added input sanitization and CONFIG_ANON_VMA_NAME config. The bulk of the
work here was done by Colin Cross, therefore, with his permission, keeping
him as the author]
Link: https://lkml.kernel.org/r/20211019215511.3771969-2-surenb@google.com
Signed-off-by: Colin Cross <ccross@google.com>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jan Glauber <jan.glauber@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rob Landley <rob@landley.net>
Cc: "Serge E. Hallyn" <serge.hallyn@ubuntu.com>
Cc: Shaohua Li <shli@fusionio.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add a new helper function kthread_run_on_cpu(), which includes
kthread_create_on_cpu/wake_up_process().
In some cases, use kthread_run_on_cpu() directly instead of
kthread_create_on_node/kthread_bind/wake_up_process() or
kthread_create_on_cpu/wake_up_process() or
kthreadd_create/kthread_bind/wake_up_process() to simplify the code.
[akpm@linux-foundation.org: export kthread_create_on_cpu to modules]
Link: https://lkml.kernel.org/r/20211022025711.3673-2-caihuoqing@baidu.com
Signed-off-by: Cai Huoqing <caihuoqing@baidu.com>
Cc: Bernard Metzler <bmt@zurich.ibm.com>
Cc: Cai Huoqing <caihuoqing@baidu.com>
Cc: Daniel Bristot de Oliveira <bristot@kernel.org>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Doug Ledford <dledford@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: "Paul E . McKenney" <paulmck@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
task_pt_regs() can return NULL on powerpc for kernel threads. This is
then used in __bpf_get_stack() to check for user mode, resulting in a
kernel oops. Guard against this by checking return value of
task_pt_regs() before trying to obtain the call chain.
Fixes: fa28dcb82a ("bpf: Introduce helper bpf_get_task_stack()")
Cc: stable@vger.kernel.org # v5.9+
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/d5ef83c361cc255494afd15ff1b4fb02a36e1dcf.1641468127.git.naveen.n.rao@linux.vnet.ibm.com
The new flag MODULE_INIT_COMPRESSED_FILE unintentionally trips check in
module_sig_check(). The check was supposed to catch case when version
info or magic was removed from a signed module, making signature
invalid, but it was coded too broadly and was catching this new flag as
well.
Change the check to only test the 2 particular flags affecting signature
validity.
Fixes: b1ae6dc41e ("module: add in-kernel support for decompressing")
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Reviewed-by: Douglas Anderson <dianders@chromium.org>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Since referencing user space pointers is special, if the user wants to
filter on a field that is a pointer to user space, then they need to
specify it.
Add a ".ustring" attribute to the field name for filters to state that the
field is pointing to user space such that the kernel can take the
appropriate action to read that pointer.
Link: https://lore.kernel.org/all/yt9d8rvmt2jq.fsf@linux.ibm.com/
Fixes: 77360f9bbc ("tracing: Add test for user space strings when filtering on string pointers")
Tested-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
If start_per_cpu_kthreads() called from osnoise_workload_start() returns
error, event hooks are left in broken state: unhook_irq_events() called
but unhook_thread_events() and unhook_softirq_events() not called, and
trace_osnoise_callback_enabled flag not cleared.
On the next tracer enable, hooks get not installed due to
trace_osnoise_callback_enabled flag.
And on the further tracer disable an attempt to remove non-installed
hooks happened, hitting a WARN_ON_ONCE() in tracepoint_remove_func().
Fix the error path by adding the missing part of cleanup.
While at this, introduce osnoise_unhook_events() to avoid code
duplication between this error path and normal tracer disable.
Link: https://lkml.kernel.org/r/20220109153459.3701773-1-nikita.yushchenko@virtuozzo.com
Cc: stable@vger.kernel.org
Fixes: bce29ac9ce ("trace: Add osnoise tracer")
Acked-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: Nikita Yushchenko <nikita.yushchenko@virtuozzo.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Since the same warning message is already printed in the
trace_create_file() function, there is no need to print it again.
Link: https://lkml.kernel.org/r/20220109162232.361747-1-ytcoode@gmail.com
Signed-off-by: Yuntao Wang <ytcoode@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The 'nmissed' column of the 'kprobe_profile' file for kretprobe is
not showed correctly, kretprobe can be skipped by two reasons,
shortage of kretprobe_instance which is counted by tk->rp.nmissed,
and kprobe itself is missed by some reason, so to show the sum.
Link: https://lkml.kernel.org/r/20220107150242.5019-1-xyz.sun.ok@gmail.com
Cc: stable@vger.kernel.org
Fixes: 4a846b443b ("tracing/kprobes: Cleanup kprobe tracer code")
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Xiangyang Zhang <xyz.sun.ok@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Pingfan reported that the following causes a fault:
echo "filename ~ \"cpu\"" > events/syscalls/sys_enter_openat/filter
echo 1 > events/syscalls/sys_enter_at/enable
The reason is that trace event filter treats the user space pointer
defined by "filename" as a normal pointer to compare against the "cpu"
string. The following bug happened:
kvm-03-guest16 login: [72198.026181] BUG: unable to handle page fault for address: 00007fffaae8ef60
#PF: supervisor read access in kernel mode
#PF: error_code(0x0001) - permissions violation
PGD 80000001008b7067 P4D 80000001008b7067 PUD 2393f1067 PMD 2393ec067 PTE 8000000108f47867
Oops: 0001 [#1] PREEMPT SMP PTI
CPU: 1 PID: 1 Comm: systemd Kdump: loaded Not tainted 5.14.0-32.el9.x86_64 #1
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
RIP: 0010:strlen+0x0/0x20
Code: 48 89 f9 74 09 48 83 c1 01 80 39 00 75 f7 31 d2 44 0f b6 04 16 44 88 04 11
48 83 c2 01 45 84 c0 75 ee c3 0f 1f 80 00 00 00 00 <80> 3f 00 74 10 48 89 f8
48 83 c0 01 80 38 00 75 f7 48 29 f8 c3 31
RSP: 0018:ffffb5b900013e48 EFLAGS: 00010246
RAX: 0000000000000018 RBX: ffff8fc1c49ede00 RCX: 0000000000000000
RDX: 0000000000000020 RSI: ffff8fc1c02d601c RDI: 00007fffaae8ef60
RBP: 00007fffaae8ef60 R08: 0005034f4ddb8ea4 R09: 0000000000000000
R10: ffff8fc1c02d601c R11: 0000000000000000 R12: ffff8fc1c8a6e380
R13: 0000000000000000 R14: ffff8fc1c02d6010 R15: ffff8fc1c00453c0
FS: 00007fa86123db40(0000) GS:ffff8fc2ffd00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fffaae8ef60 CR3: 0000000102880001 CR4: 00000000007706e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
PKRU: 55555554
Call Trace:
filter_pred_pchar+0x18/0x40
filter_match_preds+0x31/0x70
ftrace_syscall_enter+0x27a/0x2c0
syscall_trace_enter.constprop.0+0x1aa/0x1d0
do_syscall_64+0x16/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7fa861d88664
The above happened because the kernel tried to access user space directly
and triggered a "supervisor read access in kernel mode" fault. Worse yet,
the memory could not even be loaded yet, and a SEGFAULT could happen as
well. This could be true for kernel space accessing as well.
To be even more robust, test both kernel and user space strings. If the
string fails to read, then simply have the filter fail.
Note, TASK_SIZE is used to determine if the pointer is user or kernel space
and the appropriate strncpy_from_kernel/user_nofault() function is used to
copy the memory. For some architectures, the compare to TASK_SIZE may always
pick user space or kernel space. If it gets it wrong, the only thing is that
the filter will fail to match. In the future, this needs to be fixed to have
the event denote which should be used. But failing a filter is much better
than panicing the machine, and that can be solved later.
Link: https://lore.kernel.org/all/20220107044951.22080-1-kernelfans@gmail.com/
Link: https://lkml.kernel.org/r/20220110115532.536088fd@gandalf.local.home
Cc: stable@vger.kernel.org
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Tom Zanussi <zanussi@kernel.org>
Reported-by: Pingfan Liu <kernelfans@gmail.com>
Tested-by: Pingfan Liu <kernelfans@gmail.com>
Fixes: 87a342f5db ("tracing/filters: Support filtering for char * strings")
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Currently, the syscall trace events call trace_buffer_lock_reserve()
directly, which means that it misses out on some of the filtering
optimizations provided by the helper function
trace_event_buffer_lock_reserve(). Have the syscall trace events call that
instead, as it was missed when adding the update to use the temp buffer
when filtering.
Link: https://lkml.kernel.org/r/20220107225839.823118570@goodmis.org
Cc: stable@vger.kernel.org
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Tom Zanussi <zanussi@kernel.org>
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Fixes: 0fc1b09ff1 ("tracing: Use temp buffer when filtering events")
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Now that ftrace function pointers are sorted at compile time, add a test
that makes sure they are sorted at run time. This test is only run if it is
configured in.
Link: https://lkml.kernel.org/r/20211206151858.4d21a24d@gandalf.local.home
Cc: Yinan Liu <yinan@linux.alibaba.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
When the kernel starts, the initialization of ftrace takes
up a portion of the time (approximately 6~8ms) to sort mcount
addresses. We can save this time by moving mcount-sorting to
compile time.
Link: https://lkml.kernel.org/r/20211212113358.34208-2-yinan@linux.alibaba.com
Signed-off-by: Yinan Liu <yinan@linux.alibaba.com>
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: kernel test robot <oliver.sang@intel.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
kstrndup() is a memory allocation-related function, it returns NULL when
some internal memory errors happen. It is better to check the return
value of it so to catch the memory error in time.
Link: https://lkml.kernel.org/r/tencent_4D6E270731456EB88712ED7F13883C334906@qq.com
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Fixes: a42e3c4de9 ("tracing/probe: Add immediate string parameter support")
Signed-off-by: Xiaoke Wang <xkernel.wang@foxmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
kstrdup() returns NULL when some internal memory errors happen, it is
better to check the return value of it so to catch the memory error in
time.
Link: https://lkml.kernel.org/r/tencent_3C2E330722056D7891D2C83F29C802734B06@qq.com
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Fixes: 33ea4b2427 ("perf/core: Implement the 'perf_uprobe' PMU")
Signed-off-by: Xiaoke Wang <xkernel.wang@foxmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Disabling only bottom halves via local_bh_disable() disables also
preemption but this remains invisible to tracing. On a CONFIG_PREEMPT
kernel one might wonder why there is no scheduling happening despite the
N flag in the trace. The reason might be the a rcu_read_lock_bh()
section.
Add a 'b' to the tracing output if in task context with disabled bottom
halves.
Link: https://lkml.kernel.org/r/YbcbtdtC/bjCKo57@linutronix.de
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Treewide cleanup and consolidation of MSI interrupt handling in
preparation for further changes in this area which are necessary to:
- address existing shortcomings in the VFIO area
- support the upcoming Interrupt Message Store functionality which
decouples the message store from the PCI config/MMIO space
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmHf+SETHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYobzGD/wNEFl5qQo5mNZ9thP6JSJFOItm7zMc
2QgzCYOqNwAv4jL6Dqo+EHtbShYqDyWzKdKccgqNjmdIqgW8q7/fubN1OPzRsClV
CZG997AsXDGXYlQcE3tXZjkeCWnWEE2AGLnygSkFV1K/r9ALAtFfTBJAWB+UD+Zc
1P8Kxo0q0Jg+DQAMAA5bWfSSjo/Pmpr/1AFjY7+GA8BBeJJgWOyW7H1S+GYEWVOE
RaQP81Sbd6x1JkopxkNqSJ/lbNJfnPJxi2higB56Y0OYn5CuSarYbZUM7oQ2V61t
jN7pcEEvTpjLd6SJ93ry8WOcJVMTbccCklVfD0AfEwwGUGw2VM6fSyNrZfnrosUN
tGBEO8eflBJzGTAwSkz1EhiGKna4o1NBDWpr0sH2iUiZC5G6V2hUDbM+0PQJhDa8
bICwguZElcUUPOprwjS0HXhymnxghTmNHyoEP1yxGoKLTrwIqkH/9KGustWkcBmM
hNtOCwQNqxcOHg/r3MN0KxttTASgoXgNnmFliAWA7XwseRpLWc95XPQFa5sptRhc
EzwumEz17EW1iI5/NyZQcY+jcZ9BdgCqgZ9ECjZkyN4U+9G6iACUkxVaHUUs77jl
a0ISSEHEvJisFOsOMYyFfeWkpIKGIKP/bpLOJEJ6kAdrUWFvlRGF3qlav3JldXQl
ypFjPapDeB5guw==
=vKzd
-----END PGP SIGNATURE-----
Merge tag 'irq-msi-2022-01-13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull MSI irq updates from Thomas Gleixner:
"Rework of the MSI interrupt infrastructure.
This is a treewide cleanup and consolidation of MSI interrupt handling
in preparation for further changes in this area which are necessary
to:
- address existing shortcomings in the VFIO area
- support the upcoming Interrupt Message Store functionality which
decouples the message store from the PCI config/MMIO space"
* tag 'irq-msi-2022-01-13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (94 commits)
genirq/msi: Populate sysfs entry only once
PCI/MSI: Unbreak pci_irq_get_affinity()
genirq/msi: Convert storage to xarray
genirq/msi: Simplify sysfs handling
genirq/msi: Add abuse prevention comment to msi header
genirq/msi: Mop up old interfaces
genirq/msi: Convert to new functions
genirq/msi: Make interrupt allocation less convoluted
platform-msi: Simplify platform device MSI code
platform-msi: Let core code handle MSI descriptors
bus: fsl-mc-msi: Simplify MSI descriptor handling
soc: ti: ti_sci_inta_msi: Remove ti_sci_inta_msi_domain_free_irqs()
soc: ti: ti_sci_inta_msi: Rework MSI descriptor allocation
NTB/msi: Convert to msi_on_each_desc()
PCI: hv: Rework MSI handling
powerpc/mpic_u3msi: Use msi_for_each-desc()
powerpc/fsl_msi: Use msi_for_each_desc()
powerpc/pasemi/msi: Convert to msi_on_each_dec()
powerpc/cell/axon_msi: Convert to msi_on_each_desc()
powerpc/4xx/hsta: Rework MSI handling
...
Core:
- Make the clocksource watchdog more robust by better validation checks
of the measurement.
Drivers:
- New drivers for MStar and SSD20xd SOCs
- The usual cleanups and improvements all over the place
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmHf+n0THHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoQ2lD/9WCp+fGTmOt5zb8dOyuyLFjDljStPZ
zNi4d4Iu3gcBIRcjACtbSI2rAPK5gQyM38c9nlmtFv3zihfmz5bQkMTQ1N7O84Nu
c1iEuTW69l/ZvykSJWApsGIY8zgA41efoLYzhg/dCpQGE2fINiRDyU5ZxbJXmwMW
ipjBCf3F9/WLWoTgvl3cTayd/l+7fnpeM6w9MfujHLyCXCwz484KW/7UIMkTCcxF
b7Y3bTLxP4a/iT/ltFDqvLUjUuJWdmCh6gihcEL+9PD/h6KmQnND+p9KB7tbMRy/
DUOBTCi5gY66RQeGRJPVe+Cx/Wi+8vCiyfXUuSoQGqE39HVYOUzMwWOjOncjLad4
fXSzzCIKRwsB3qKw+2GnDeEx1hIw1/K88V2tA+OgQjdWIginOClzy0jb0dkBRbo5
H1U6mPxb+CTKAl1hXAkfDDCenLTiiGBFbvJUydiJYMcFEZYM166e/jA53xIKHNAz
WEphVRAPA269uIxYBXJU7pA6M5bYqbHhhmrxyWOBbhhZGGj3x685PA1wioeNayMp
SMA7s7kZaOBDuTtjRY/dFDkd/27HKWDkxjZCbbslRRKKO0Zz7qixzspV5LETnABO
NzR5TcNimCyvfKEzSG1PFmzx9P/cnspyLvWj560xL0Z9x1MnsHtiUpibJ8a/Gb45
riPKWGedog8BgQ==
=7vCU
-----END PGP SIGNATURE-----
Merge tag 'timers-core-2022-01-13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer updates from Thomas Gleixner:
"Updates for the time(r) subsystem:
Core:
- Make the clocksource watchdog more robust by better validation
checks of the measurement.
Drivers:
- New drivers for MStar and SSD20xd SOCs
- The usual cleanups and improvements all over the place"
* tag 'timers-core-2022-01-13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
dt-bindings: timer: Add Mstar MSC313e timer devicetree bindings documentation
clocksource/drivers/msc313e: Add support for ssd20xd-based platforms
clocksource/drivers: Add MStar MSC313e timer support
clocksource/drivers/pistachio: Fix -Wunused-but-set-variable warning
clocksource/drivers/timer-imx-sysctr: Set cpumask to cpu_possible_mask
clocksource/drivers/imx-sysctr: Mark two variable with __ro_after_init
clocksource/drivers/renesas,ostm: Make RENESAS_OSTM symbol visible
clocksource/drivers/renesas-ostm: Add RZ/G2L OSTM support
dt-bindings: timer: renesas: ostm: Document Renesas RZ/G2L OSTM
clocksource/drivers/exynos_mct: Fix silly typo resulting in checkpatch warning
clocksource: Reduce the default clocksource_watchdog() retries to 2
clocksource: Avoid accidental unstable marking of clocksources
dt-bindings: timer: tpm-timer: Add imx8ulp compatible string
reset: Add of_reset_control_get_optional_exclusive()
clocksource/drivers/exynos_mct: Refactor resources allocation
dt-bindings: timer: remove rockchip,rk3066-timer compatible string from rockchip,rk-timer.yaml
dt-bindings: timer: cadence_ttc: Add power-domains
Core:
- Provide a new interface for affinity hints to provide a separation
between hint and actual affinity change which has become a hidden
property of the current interface
- Fix up the in tree usage of the affinity hint interfaces
Drivers:
- No new irqchip drivers!
- Fix GICv3 redistributor table reservation with RT across kexec
- Fix GICv4.1 redistributor view of the VPE table across kexec
- Add support for extra interrupts on spear-shirq
- Make obtaining some interrupts optional for the Renesas drivers
- Various cleanups and bug fixes
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmHf9v0THHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoRK6D/9bQmyITmJ4KLn0HZ1DsvkuR/GB7I8v
yTF99FxIi/F0jlJ7+87Hdm68cfYPMahpiHqSlsf/QE2kkuWYDJmMaPUao14XMdG6
jxrJ1OZtZXeDXyAWkB/gjmiuqyW/e/Myndg0UNUrJ66GqKfxfxtz1/4GfLjgDpIu
TfZQdojvo6T7NTVnU8aAkgKUhM2jL/HxPiR3VUJ+VneSfwKLHzr3+lTY9zkSvJ8s
ATqqGn6+GugJmDWaCI13IJcmBhPU/Gvs+Eqnwz7Xez/6wJftYvJh7vGec3ixS9pw
skjPDnwuHcPl+h0mYMv7ySN7WuqTr0iqYIepdvLUfq6D1WjnHvF5XNcV4W7EzPJN
B/pBosJ97ZAiHgrWsb35/S3bJ0mnB3Ib4WOOIcnRM36JUdNZrnKJntCsyrrmUsYA
s6J1og9Ut7it+F9OFvsuZ2pUv25U8BlzhgfJen8Z0fzV1/2f5LQN0gQGVxqVpwkg
3Cmd5Rmy5h2vlcKKHklLxIP24+UMIb2WyhsDiZ/qYH3zSFFnQPUJ6fvmZIxN/fPx
exU5O8kgsXSwauXWHJJBb+qhKNcUNvUwKGHNMAvM9mh1xytU6ZowjTqqOlCfBWlg
dRXT2xI0ex7liXek6yXa4lN1tabIdnvmYTmueUoFiOCqbUPBO8LTutjdehsUMa4d
xV0a8WEzuk9Q/A==
=myJA
-----END PGP SIGNATURE-----
Merge tag 'irq-core-2022-01-13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq updates from Thomas Gleixner:
"Updates for the interrupt subsystem:
Core:
- Provide a new interface for affinity hints to provide a separation
between hint and actual affinity change which has become a hidden
property of the current interface
- Fix up the in tree usage of the affinity hint interfaces
Drivers:
- No new irqchip drivers!
- Fix GICv3 redistributor table reservation with RT across kexec
- Fix GICv4.1 redistributor view of the VPE table across kexec
- Add support for extra interrupts on spear-shirq
- Make obtaining some interrupts optional for the Renesas drivers
- Various cleanups and bug fixes"
* tag 'irq-core-2022-01-13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (25 commits)
irqchip/renesas-intc-irqpin: Use platform_get_irq_optional() to get the interrupt
irqchip/renesas-irqc: Use platform_get_irq_optional() to get the interrupt
irqchip/gic-v4: Disable redistributors' view of the VPE table at boot time
irqchip/ingenic-tcu: Use correctly sized arguments for bit field
irqchip/gic-v2m: Add const to of_device_id
irqchip/imx-gpcv2: Mark imx_gpcv2_instance with __ro_after_init
irqchip/spear-shirq: Add support for IRQ 0..6
irqchip/gic-v3-its: Limit memreserve cpuhp state lifetime
irqchip/gic-v3-its: Postpone LPI pending table freeing and memreserve
irqchip/gic-v3-its: Give the percpu rdist struct its own flags field
net/mlx4: Use irq_update_affinity_hint()
net/mlx5: Use irq_set_affinity_and_hint()
hinic: Use irq_set_affinity_and_hint()
scsi: lpfc: Use irq_set_affinity()
mailbox: Use irq_update_affinity_hint()
ixgbe: Use irq_update_affinity_hint()
be2net: Use irq_update_affinity_hint()
enic: Use irq_update_affinity_hint()
RDMA/irdma: Use irq_update_affinity_hint()
scsi: mpt3sas: Use irq_set_affinity_and_hint()
...
There is a spelling mistake in a pr_err error message. Fix it.
Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
"Cleanup of the perf/kvm interaction."
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmHdvbkACgkQEsHwGGHe
VUrX7w/9FwKUm0WlGcQIAOSdWk85N2qAVH3brYcQHNpTCVe68TOqTCrxCDrGgyUq
2XnCOim99MUlnsVU6QRZqF4yJ8S1tGrc0COJ/qR4SGntucu0oYuDe2aMVq+mWUD7
/IThA0oMRfhki9WwAyUuyCrXzk4blZdlrXyYIRMJGl9xeGNy3cvUtU8f68Kiy22E
OcmQ/o9Etsr38dueAMU1KYEmgSTvG47rS8nfyRUu3QpJHbyLmRXH32PQrm3tduxS
Bw3gMAH5vqq1UDZJ8ZvsPsO0vFX7dtnKEwEKz4qdtRWk9gi8oLGHIwIXC+VtNqpf
mCmX33Jw8uFz9h3JhE84J0j/CgsWHoU6MOs0MOch4Tb69/BfCjQnw1enImhejG8q
YEIDjJf/vgRNaw9PYshiTHT+EJTe9inT3S4eK/ynLRDUEslAqyWZZm7bUE/XrEDi
yRyGIxry/hNZVvRkXT9QBw32fpgnIH2NAMPLEjJSGCRxT89Tfqz0aRDfacCuHTTh
P8pAeiDuy/6RkDlQckOZJWOFFh2IHsykX2l3IJcHqVRqt4ob9b+SZB5qoH/Mv9qb
MSAqdFUupYZFC+6XuPAeX5/Mo+wSkP+pYYSbWNxjUa0yNiYecOjE7/8T2SB2y6Mx
lk2L0ypsZUYSmpHSfvOdPmf6ucj19/5B4+VCX6PQfcNJTnvvhTE=
=tU5G
-----END PGP SIGNATURE-----
Merge tag 'perf_core_for_v5.17_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf updates from Borislav Petkov:
"Cleanup of the perf/kvm interaction."
* tag 'perf_core_for_v5.17_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf: Drop guest callback (un)register stubs
KVM: arm64: Drop perf.c and fold its tiny bits of code into arm.c
KVM: arm64: Hide kvm_arm_pmu_available behind CONFIG_HW_PERF_EVENTS=y
KVM: arm64: Convert to the generic perf callbacks
KVM: x86: Move Intel Processor Trace interrupt handler to vmx.c
KVM: Move x86's perf guest info callbacks to generic KVM
KVM: x86: More precisely identify NMI from guest when handling PMI
KVM: x86: Drop current_vcpu for kvm_running_vcpu + kvm_arch_vcpu variable
perf/core: Use static_call to optimize perf_guest_info_callbacks
perf: Force architectures to opt-in to guest callbacks
perf: Add wrappers for invoking guest callbacks
perf/core: Rework guest callbacks to prepare for static_call support
perf: Drop dead and useless guest "support" from arm, csky, nds32 and riscv
perf: Stop pretending that perf can handle multiple guest callbacks
KVM: x86: Register Processor Trace interrupt hook iff PT enabled in guest
KVM: x86: Register perf callbacks after calling vendor's hardware_setup()
perf: Protect perf_guest_cbs with RCU
The commit 1f1562fcd0 ("cgroup/cpuset: Don't let child cpusets
restrict parent in default hierarchy") inteded to relax the check only
on the default hierarchy (or v2 mode) but it dropped the check in v1
too.
This patch returns and separates the legacy-only validations so that
they can be considered only in the v1 mode, which should enforce the old
constraints for the sake of compatibility.
Fixes: 1f1562fcd0 ("cgroup/cpuset: Don't let child cpusets restrict parent in default hierarchy")
Suggested-by: Waiman Long <longman@redhat.com>
Signed-off-by: Michal Koutný <mkoutny@suse.com>
Reviewed-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Here is the set of changes for the driver core for 5.17-rc1.
Lots of little things here, including:
- kobj_type cleanups
- auxiliary_bus documentation updates
- auxiliary_device conversions for some drivers (relevant
subsystems all have provided acks for these)
- kernfs lock contention reduction for some workloads
- other tiny cleanups and changes.
All of these have been in linux-next for a while with no reported
issues.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCYd7deA8cZ3JlZ0Brcm9h
aC5jb20ACgkQMUfUDdst+ym8ngCgw0ANwrRPE5b1dthEmfU2f8Knk5kAn0pHQv6R
VRZJypgNfU/Pt0ykstZD
=CO9J
-----END PGP SIGNATURE-----
Merge tag 'driver-core-5.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core updates from Greg KH:
"Here is the set of changes for the driver core for 5.17-rc1.
Lots of little things here, including:
- kobj_type cleanups
- auxiliary_bus documentation updates
- auxiliary_device conversions for some drivers (relevant subsystems
all have provided acks for these)
- kernfs lock contention reduction for some workloads
- other tiny cleanups and changes.
All of these have been in linux-next for a while with no reported
issues"
* tag 'driver-core-5.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (43 commits)
kobject documentation: remove default_attrs information
drivers/firmware: Add missing platform_device_put() in sysfb_create_simplefb
debugfs: lockdown: Allow reading debugfs files that are not world readable
driver core: Make bus notifiers in right order in really_probe()
driver core: Move driver_sysfs_remove() after driver_sysfs_add()
firmware: edd: remove empty default_attrs array
firmware: dmi-sysfs: use default_groups in kobj_type
qemu_fw_cfg: use default_groups in kobj_type
firmware: memmap: use default_groups in kobj_type
sh: sq: use default_groups in kobj_type
headers/uninline: Uninline single-use function: kobject_has_children()
devtmpfs: mount with noexec and nosuid
driver core: Simplify async probe test code by using ktime_ms_delta()
nilfs2: use default_groups in kobj_type
kobject: remove kset from struct kset_uevent_ops callbacks
driver core: make kobj_type constant.
driver core: platform: document registration-failure requirement
vdpa/mlx5: Use auxiliary_device driver data helpers
net/mlx5e: Use auxiliary_device driver data helpers
soundwire: intel: Use auxiliary_device driver data helpers
...
- refactor the dma-direct coherent allocator
- turn an macro into an inline in scatterlist.h (Logan Gunthorpe)
-----BEGIN PGP SIGNATURE-----
iQI/BAABCgApFiEEgdbnc3r/njty3Iq9D55TZVIEUYMFAmHcfpILHGhjaEBsc3Qu
ZGUACgkQD55TZVIEUYP0MA/+N8Vs2ZgemWunnTyZxCOlia+bq9QPXOMlM+GLLYwQ
pMizhYXKmiM6HHkLDpVOtzOQTBztqBmWzt7a5z/xlwtS3SIT4ktS1MkrPwW56l3F
Ml8u1OLv3rTXF/6K9PHU1XvXBIdkFkAnoVLPn1k6V9C8MiEUzo43Nx01p5ooNq/u
+cptbN119BKQ19EAyOWM5QC9hhQoUNRTeWYd9AQjxTMC6mLPF1VKD1KjUM7gwNB5
TxqGReYTJNNXbEsN+s6JvkOPId8Ps9JpwSys6Kx9lcXYqNchTCpo+HApvV1hcQrm
hWBY2u94TPRuBIIp6xe3JBAPhbRTwF5gLEOwJxS+UTj4vcioKUdYnRxEt+JfD5AA
XjlFyAbCwOPw4qXe00+EefQpuSim1nkUb0FXUJC/zRRJjnjmPxxsnZEnLaxlmyOd
FlYwyY1x/TqiFr6ZfVr67msnPFy4tRRyXa/W9zcXKwPgI6/3nbYjkpbnK/RocZx+
4vo8LK0zaHEstrCcCbAwRZk5YgOHRFzrnDhRYAifqJqzlvKqEeF9S8hIRSypcIyG
GD4ViVK4RJQwtKG3XfG8GPhKLTKoL1NQgbNjszSgc1wIt5E+CJ5PUK2TuBVc9I85
xScxWeGVfgV7a8hDZ+fyc07VV+1El7Bk3qN88Uassu1HhMbcOJeVJ2dV7u/VJRf5
vIQ=
=YkF5
-----END PGP SIGNATURE-----
Merge tag 'dma-mapping-5.17' of git://git.infradead.org/users/hch/dma-mapping
Pull dma-mapping updates from Christoph Hellwig:
- refactor the dma-direct coherent allocator
- turn an macro into an inline in scatterlist.h (Logan Gunthorpe)
* tag 'dma-mapping-5.17' of git://git.infradead.org/users/hch/dma-mapping:
lib/scatterlist: cleanup macros into static inline functions
dma-direct: add a dma_direct_use_pool helper
dma-direct: factor the swiotlb code out of __dma_direct_alloc_pages
dma-direct: drop two CONFIG_DMA_RESTRICTED_POOL conditionals
dma-direct: warn if there is no pool for force unencrypted allocations
dma-direct: fail allocations that can't be made coherent
dma-direct: refactor the !coherent checks in dma_direct_alloc
dma-direct: factor out a helper for DMA_ATTR_NO_KERNEL_MAPPING allocations
dma-direct: clean up the remapping checks in dma_direct_alloc
dma-direct: always leak memory that can't be re-encrypted
dma-direct: don't call dma_set_decrypted for remapped allocations
dma-direct: factor out dma_set_{de,en}crypted helpers
Current scheme of having userspace decompress kernel modules before
loading them into the kernel runs afoul of LoadPin security policy, as
it loses link between the source of kernel module on the disk and binary
blob that is being loaded into the kernel. To solve this issue let's
implement decompression in kernel, so that we can pass a file descriptor
of compressed module file into finit_module() which will keep LoadPin
happy.
To let userspace know what compression/decompression scheme kernel
supports it will create /sys/module/compression attribute. kmod can read
this attribute and decide if it can pass compressed file to
finit_module(). New MODULE_INIT_COMPRESSED_DATA flag indicates that the
kernel should attempt to decompress the data read from file descriptor
prior to trying load the module.
To simplify things kernel will only implement single decompression
method matching compression method selected when generating modules.
This patch implements gzip and xz; more can be added later,
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Since commit e513cc1c07 ("module: Remove stop_machine from module
unloading") this comment is no longer correct. Remove it.
Signed-off-by: Yu Chen <chen.yu@easystack.cn>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
"Lots of cleanups and preparation; highlights:
- futex: Cleanup and remove runtime futex_cmpxchg detection
- rtmutex: Some fixes for the PREEMPT_RT locking infrastructure
- kcsan: Share owner_on_cpu() between mutex,rtmutex and rwsem and
annotate the racy owner->on_cpu access *once*.
- atomic64: Dead-Code-Elemination"
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmHdvssACgkQEsHwGGHe
VUrbBg//VQvz5BwddIJDj9utt5AvSixNcTF5mJyFKCSIqO0S4J8nCNcvJjZ2bs4S
w1YmInFbp0WFGUhaIZiw0e6KWJUoINTng4MfHDZosS1doT2of53ZaQqXs3i81jDz
87w8ADVHL0x4+BNjdsIwbcuPSDTmJFoyFOdeXTIl9hv9ZULT8m4Mt+LJuUHNZ+vF
rS1jyseVPWkcm5y+Yie0rhip+ygzbfbt0ArsLfRcrBJsKr6oxLxV2DDF+2djXuuP
d2OgGT7VkbgAhoKpzVXUiHsT6ppR5Mn5TLSa4EZ4bPPCUFldOhKuCAImF3T6yVIa
44iX5vQN9v5VHBy6ocPbdOIBuYBYVGCMurh1t7pbpB6G+mmSxMiyta5MY37POwjv
K2JT9mC2A6a4d17gue5FT3mnJMBB4eHwVaDfAwCZs/5rRNuoTz4aY5Xy04Mq0ltI
39uarwBd5hwSugBWg44AS5E9h52E654FQ7g6iS4NtUvJuuaXBTl43EcZWx2+mnPL
zY+iOMVMgg33VIVcm/mlf/6zWL0LXPmILUiA1fp4Q9/n8u1EuOOyeA/GsC9Pl3wO
HY3KpYJA5eQpIk/JEnzKm5ZE3pCrUdH6VDC/SB4owQtafQG6OxyQVP1Gj7KYxZsD
NqqpJ4nkKooc5f5DqVEN8wrjyYsnVxEfriEG09OoR6wI3MqyUA4=
=vrYy
-----END PGP SIGNATURE-----
Merge tag 'locking_core_for_v5.17_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking updates from Borislav Petkov:
"Lots of cleanups and preparation. Highlights:
- futex: Cleanup and remove runtime futex_cmpxchg detection
- rtmutex: Some fixes for the PREEMPT_RT locking infrastructure
- kcsan: Share owner_on_cpu() between mutex,rtmutex and rwsem and
annotate the racy owner->on_cpu access *once*.
- atomic64: Dead-Code-Elemination"
[ Description above by Peter Zijlstra ]
* tag 'locking_core_for_v5.17_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
locking/atomic: atomic64: Remove unusable atomic ops
futex: Fix additional regressions
locking: Allow to include asm/spinlock_types.h from linux/spinlock_types_raw.h
x86/mm: Include spinlock_t definition in pgtable.
locking: Mark racy reads of owner->on_cpu
locking: Make owner_on_cpu() into <linux/sched.h>
lockdep/selftests: Adapt ww-tests for PREEMPT_RT
lockdep/selftests: Skip the softirq related tests on PREEMPT_RT
lockdep/selftests: Unbalanced migrate_disable() & rcu_read_lock().
lockdep/selftests: Avoid using local_lock_{acquire|release}().
lockdep: Remove softirq accounting on PREEMPT_RT.
locking/rtmutex: Add rt_mutex_lock_nest_lock() and rt_mutex_lock_killable().
locking/rtmutex: Squash self-deadlock check for ww_rt_mutex.
locking: Remove rt_rwlock_is_contended().
sched: Trigger warning if ->migration_disabled counter underflows.
futex: Fix sparc32/m68k/nds32 build regression
futex: Remove futex_cmpxchg detection
futex: Ensure futex_atomic_cmpxchg_inatomic() is present
kernel/locking: Use a pointer in ww_mutex_trylock().
"Mostly minor things this time; some highlights:
- core-sched: Add 'Forced Idle' accounting; this allows to track how
much CPU time is 'lost' due to core scheduling constraints.
- psi: Fix for MEM_FULL; a task running reclaim would be counted as a
runnable task and prevent MEM_FULL from being reported.
- cpuacct: Long standing fixes for some cgroup accounting issues.
- rt: Bandwidth timer could, under unusual circumstances, be failed to
armed, leading to indefinite throttling."
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmHdvGkACgkQEsHwGGHe
VUq3tQ/9GdaCpbo+WgtM20vo3FqzoRCWAtZZRLWm87g9G7FKE6tD1JCZ+cXn63jR
wz4nuTMGg0lHkrmMiHoeTWoRo7Brw3vPdKTbFBxRaPS3gi3qyz8gaDHSKzAHTJSx
L3j5XaTLcZnXwXV0MOphbK8ZD2W0f9PJZJjwYy1HFUrXh1AFT0WaMXL3aXuaZr8M
jYZoB8r5qXsTBgzNZR8unq5bSUXgvoDAqupFU8gvQWYvNFV4NGK9WFQLlznG1ZhE
aE7oHRbpCnb4avbv9xIm/QgLEHeCVTb/4kLBPk57nrW+aXTHX4ZTHuFtFs0nfDHS
yHSgie3hthr5lFQ/c2G4a5bi5EfPcyURmgNHpWrs2zWWtWzVtqy1WAQ//m8twd14
9cMeefQzttPUbOjykj5QNCJPqkkGgKlblz3p9j8NwUBYUBtBIejsEP0UFPoVgZuL
DjeGhPuGGeTqkVEhLD/pb9kSzUsi1ptTJtnzT9EvtBOi+EpnZnFC6jB98qcuRT19
jhlXwlFNH+SNnMrCniTjLhQK5gVEbvzbU86/nj9CHWDTNdu6DFeJv1S+ZBsjRHUe
f8dV9+laXdLK5QJKAeAubq8ciMvacW8pTf/5PJfaFCJHHDs8rgmx/Ip6TxCZzVEG
XEhNqOmMNnvbkj+9a1yk6SyD9QkVmitZrvRiqeoGayQMjsphT3E=
=H0vR
-----END PGP SIGNATURE-----
Merge tag 'sched_core_for_v5.17_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Borislav Petkov:
"Mostly minor things this time; some highlights:
- core-sched: Add 'Forced Idle' accounting; this allows to track how
much CPU time is 'lost' due to core scheduling constraints.
- psi: Fix for MEM_FULL; a task running reclaim would be counted as a
runnable task and prevent MEM_FULL from being reported.
- cpuacct: Long standing fixes for some cgroup accounting issues.
- rt: Bandwidth timer could, under unusual circumstances, be failed
to armed, leading to indefinite throttling."
[ Description above by Peter Zijlstra ]
* tag 'sched_core_for_v5.17_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/fair: Replace CFS internal cpu_util() with cpu_util_cfs()
sched/fair: Cleanup task_util and capacity type
sched/rt: Try to restart rt period timer when rt runtime exceeded
sched/fair: Document the slow path and fast path in select_task_rq_fair
sched/fair: Fix per-CPU kthread and wakee stacking for asym CPU capacity
sched/fair: Fix detection of per-CPU kthreads waking a task
sched/cpuacct: Make user/system times in cpuacct.stat more precise
sched/cpuacct: Fix user/system in shown cpuacct.usage*
cpuacct: Convert BUG_ON() to WARN_ON_ONCE()
cputime, cpuacct: Include guest time in user time in cpuacct.stat
psi: Fix PSI_MEM_FULL state when tasks are in memstall and doing reclaim
sched/core: Forced idle accounting
psi: Add a missing SPDX license header
psi: Remove repeated verbose comment
-----BEGIN PGP SIGNATURE-----
iQJIBAABCAAyFiEES0KozwfymdVUl37v6iDy2pc3iXMFAmHceicUHHBhdWxAcGF1
bC1tb29yZS5jb20ACgkQ6iDy2pc3iXO6GxAApAREftcLXl3mpn2vmqmEw3/kUeXt
0OPTUU9vb/eJCFAcWV3RpMMH5ThUC+yVS6is2XX0+Bekj5xgUWsfsriH61JAqEN9
juWuApWDuwdceSJTUzaz1JXgRMYbkBbkOyvSnIQWLTUQ/mrvx/g3K1SJWBss/ftN
ngYeB2V1E7XHofRW4pi5W2ZXb7Lcb0OPSY3prVjaeVivg/GKeW5gogoOAQ1D4JLu
NFwmeH0GD1dNUnonfDzNMzQWbWZ+DUE/g0PKwdaVfEOq3bU3Y2DDTwnOvwpqjT2B
EsWJxp49ArpR+OyiJGRfcMB/TpuvPaljZviGaJrpgh/EiVxTDHOzH3MTsRpdVcmk
MxzN7TT47zFzf2LzVrgcsK6RjlGXiieqFdYLD1VjXfwJboKALSwBE9Mb/3120xxV
NINfaaAN3ow8c2jwnfL/WwePK5JBr62d/xwv2O+8WrydoEmC9E1veM+l5zLN3cdC
Y5Wm6JJbVYaydG+6MROr0yPHn9IuUSnUrybpryv/Cez3VM1k6hCpmqJn5SXqYGEu
3sXQi42sPBUU4gfSct8EhnPBNeFyLWGO/7BvaLWWXl09RMvfRyZtExtfNThBosSc
C8WTCicCB3yuQweehaKVzBFAuuE7hdGgdkxh7B50uFbAgGHgViXtAa2WBcemwwwM
wWOG8HOJJPh/DjY=
=kz+D
-----END PGP SIGNATURE-----
Merge tag 'audit-pr-20220110' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit
Pull audit updates from Paul Moore:
"Four audit patches for v5.17:
- Harden the code through additional use of the struct_size() macro
and zero-length arrays to flexible-array conversions.
- Ensure that processes which generate userspace audit records are
not exempt from the kernel's audit throttling when the audit queues
are being overrun"
* tag 'audit-pr-20220110' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit:
audit: replace zero-length array with flexible-array member
audit: use struct_size() helper in audit_[send|make]_reply()
audit: ensure userspace is penalized the same as the kernel when under pressure
audit: use struct_size() helper in kmalloc()
-----BEGIN PGP SIGNATURE-----
iQJIBAABCAAyFiEES0KozwfymdVUl37v6iDy2pc3iXMFAmHcekAUHHBhdWxAcGF1
bC1tb29yZS5jb20ACgkQ6iDy2pc3iXNMAQ//TlHhQPEkPYZwD0EPXQIkl5IxGZfR
DSg24OMzrh+x/jKVYhMXuW6BlnF10wd2ocG0DNAt9weF22PLrbDsxbzAAnv+MLXM
wCzWLmgKpPpbaG57oa17LMcswDjVr8YbxXPOvyZJ/G3YsWDarH8Ezot8iNlVUkOh
oqjjXTy0dyLUB/FsW7sIGWa3O18SkbI4pDQxL2vcpdDbvPtAY94kNG3bKTONK/kn
OxLrjKscTtu6EuSdFhMqwcUxpfqvqUDrEfiSdNruMlT0/DixHgFlJA8WHXjn7Wpq
XTbqdFrClfpmIiTrPSSszLrZAceegTdefDAf5wgfTxmcfYKiPDF9kPu5UPX1rAgO
hSoXvGvPjez+RzyXmv9S292lkhV4tz/Wg0YlxZ0LUWif/CSZEyeGGoFZs8080Ukl
1C/oNrByriaGGRLVwKNGOg5x/UkP1ipnorNnAiiIhw+xkzfXm12RdisUyUgLh9+w
WsfsC9BJkXMpuvfkgya5PJ677wVNou4ZtJX+2MV8CfGEKTAJp/X/HKh3tWkQFYKx
35QLVO1LD6gJZlSZZTsjZDxUyPwSd8e55GSFn1qKIHuC2jKjSkwCE7JfErIrI/W0
js6HKO2ak7oMTWjxRizANyKI/yva/DeMl+mKCC4QV0xmwlm5JsOe7mYSCNGws7AV
A2qYX7S1xKbLTGI=
=8H+p
-----END PGP SIGNATURE-----
Merge tag 'selinux-pr-20220110' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux
Pull selinux updates from Paul Moore:
"Nothing too significant, but five SELinux patches for v5.17 that do
the following:
- Harden the code through additional use of the struct_size() macro
- Plug some memory leaks
- Clean up the code via removal of the security_add_mnt_opt() LSM
hook and minor tweaks to selinux_add_opt()
- Rename security_task_getsecid_subj() to better reflect its actual
behavior/use - now called security_current_getsecid_subj()"
* tag 'selinux-pr-20220110' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux:
selinux: minor tweaks to selinux_add_opt()
selinux: fix potential memleak in selinux_add_opt()
security,selinux: remove security_add_mnt_opt()
selinux: Use struct_size() helper in kmalloc()
lsm: security_task_getsecid_subj() -> security_current_getsecid_subj()
env->scratched_stack_slots is a 64-bit value, we should use ULL
instead of UL literal values.
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Christy Lee <christylee@fb.com>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/r/20220108005854.658596-1-christylee@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
This series provides KCSAN fixes and also the ability to take memory
barriers into account for weakly-ordered systems. This last can increase
the probability of detecting certain types of data races.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEbK7UrM+RBIrCoViJnr8S83LZ+4wFAmHbuRwTHHBhdWxtY2tA
a2VybmVsLm9yZwAKCRCevxLzctn7jKDPEACWuzYnd/u/02AHyRd3PIF3Px9uFKlK
TFwaXX95oYSFCXcrmO42YtDUlZm4QcbwNb85KMCu1DvckRtIsNw0rkBU7BGyqv3Z
ZoJEfMNpmC0x9+IFBOeseBHySPVT0x7GmYus05MSh0OLfkbCfyImmxRzgoKJGL+A
ADF9EQb4z2feWjmVEoN8uRaarCAD4f77rSXiX6oTCNDuKrHarqMld/TmoXFrJbu2
QtfwHeyvraKBnZdUoYfVbGVenyKb1vMv4bUlvrOcuJEgIi/J/th4FupR3XCGYulI
aWJTl2TQTGnMoE8VnFHgI27I841w3k5PVL+Y1hr/S4uN1/rIoQQuBzCtlnFeCksa
BiBXsHIchN8N0Dwh8zD8NMd2uxV4t3fmpxXTDAwaOm7vs5hA8AJ0XNu6Sz94Lyjv
wk2CxX41WWUNJVo3gh6SrS4mL6lC8+VvHF1hbIap++jrevj58gj1jAR1fdx4ANlT
e2qA00EeoMngEogDNZH42/Fxs3H9zxrBta2ZbkPkwzIqTHH+4pIQDCy2xO3T3oxc
twdGPYpjYdkf79EGsG4I4R/VA/IfcS09VIWTce8xSDeSnqkgFhcG37r1orJe8hTB
tH+ODkNOsz5HaEoa8OoAL4ko2l0fL99p2AtMPpuQfHjRj7aorF+dJIrqCizASxwx
37PjQgOmHeDHgQ==
=Q5fg
-----END PGP SIGNATURE-----
Merge tag 'kcsan.2022.01.09a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu
Pull KCSAN updates from Paul McKenney:
"This provides KCSAN fixes and also the ability to take memory barriers
into account for weakly-ordered systems. This last can increase the
probability of detecting certain types of data races"
* tag 'kcsan.2022.01.09a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu: (29 commits)
kcsan: Only test clear_bit_unlock_is_negative_byte if arch defines it
kcsan: Avoid nested contexts reading inconsistent reorder_access
kcsan: Turn barrier instrumentation into macros
kcsan: Make barrier tests compatible with lockdep
kcsan: Support WEAK_MEMORY with Clang where no objtool support exists
compiler_attributes.h: Add __disable_sanitizer_instrumentation
objtool, kcsan: Remove memory barrier instrumentation from noinstr
objtool, kcsan: Add memory barrier instrumentation to whitelist
sched, kcsan: Enable memory barrier instrumentation
mm, kcsan: Enable barrier instrumentation
x86/qspinlock, kcsan: Instrument barrier of pv_queued_spin_unlock()
x86/barriers, kcsan: Use generic instrumentation for non-smp barriers
asm-generic/bitops, kcsan: Add instrumentation for barriers
locking/atomics, kcsan: Add instrumentation for barriers
locking/barriers, kcsan: Support generic instrumentation
locking/barriers, kcsan: Add instrumentation for barriers
kcsan: selftest: Add test case to check memory barrier instrumentation
kcsan: Ignore GCC 11+ warnings about TSan runtime support
kcsan: test: Add test cases for memory barrier instrumentation
kcsan: test: Match reordered or normal accesses
...
This pull request contains the following branches:
doc.2021.11.30c: Documentation updates, perhaps most notably Neil Brown's
writeup of the reference-counting analogy to RCU.
exp.2021.12.07a: Expedited grace-period cleanups.
fastnohz.2021.11.30c: Remove CONFIG_RCU_FAST_NO_HZ due to lack of valid
users. I have asked around, posted a blog entry, and sent this
series to LKML without result.
fixes.2021.11.30c: Miscellaneous fixes.
nocb.2021.12.09a: RCU callback offloading updates, perhaps most notably
Frederic Weisbecker's updates allowing CPUs booted in the
de-offloaded state to be offloaded at runtime.
nolibc.2021.11.30c: nolibc fixes from Willy Tarreau and Anmar Faizi, but
also including Mark Brown's addition of gettid().
tasks.2021.12.09a: RCU Tasks Trace fixes, including changes that increase
the scalability of call_rcu_tasks_trace() for the BPF folks
(Martin Lau and KP Singh).
torture.2021.12.07a: Various fixes including those from Wander Lairson
Costa and Li Zhijian.
torturescript.2021.11.30c: Fixes plus addition of tests for the increased
call_rcu_tasks_trace() scalability.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEbK7UrM+RBIrCoViJnr8S83LZ+4wFAmHbtukTHHBhdWxtY2tA
a2VybmVsLm9yZwAKCRCevxLzctn7jAX3D/4mrDqAPhAWLWKp7klRhvwypDxj0cxd
/TuGNcZN+YdvNfwozcrog+8yiPxcxhNW1pMESi7SolAhRwuk1JEjiclY+7ORYd6a
/dmJB/lQBezGAdgVabRaJjfLKikpQ+/EnzKee3jjTS1XhJRJe/hDwlVP2B6IROfy
iko5yi+hxfhQdPW6UcpTPCl/4Jn63d9+2SIlW16H0LhzlJeYYsWz4tqOEKYeiHeB
Zxq90InCVmb3YYJzOtk/G7pGQ2RxKPR6/ilm87yzAfJD0Dawd2pgYeDoGvzx94S6
CmhvA6GmwO3JOL6lH891AQVXskCODSJdosP/7otm9u36XJT+5lNOeLRsLbS0Sd9t
BrJKfC7wBFuuIug8j5k3+QSXiKB7Q5JpXEhOjH4BIrkSL0Z0jSVsrZwCSbiUkjZZ
CdF19bL+4h4x5ZL3pndsplX+9BDXsKEgGHWeuzzB4rmsUMtBg84HyfbPp8mLxm6B
i7a1hNVQ5rFWYj6TpI1ZgOBIX07i21OyMAUbXn5JSWUmOyPp2V6D4Sp1zdlvRM0r
hKkIg73NP6ah9QZQTp7T1rIjVmFc2KjbmNZQegjR2pHykPCChT6xnlFix4InV4Ma
BDtigP6vhWz1YfKPjek5WESzHmMRoxdpFjqDY//Uj8/bKBccldO0osERKWtdDlDL
bwMNjny3PPLRng==
=K6AN
-----END PGP SIGNATURE-----
Merge tag 'rcu.2022.01.09a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu
Pull RCU updates from Paul McKenney:
- Documentation updates, perhaps most notably Neil Brown's writeup of
the reference-counting analogy to RCU.
- Expedited grace-period cleanups.
- Remove CONFIG_RCU_FAST_NO_HZ due to lack of valid users. I have asked
around, posted a blog entry, and sent this series to LKML without
result.
- Miscellaneous fixes.
- RCU callback offloading updates, perhaps most notably Frederic
Weisbecker's updates allowing CPUs booted in the de-offloaded state
to be offloaded at runtime.
- nolibc fixes from Willy Tarreau and Anmar Faizi, but also including
Mark Brown's addition of gettid().
- RCU Tasks Trace fixes, including changes that increase the
scalability of call_rcu_tasks_trace() for the BPF folks (Martin Lau
and KP Singh).
- Various fixes including those from Wander Lairson Costa and Li
Zhijian.
- Fixes plus addition of tests for the increased call_rcu_tasks_trace()
scalability.
* tag 'rcu.2022.01.09a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu: (87 commits)
rcu/nocb: Merge rcu_spawn_cpu_nocb_kthread() and rcu_spawn_one_nocb_kthread()
rcu/nocb: Allow empty "rcu_nocbs" kernel parameter
rcu/nocb: Create kthreads on all CPUs if "rcu_nocbs=" or "nohz_full=" are passed
rcu/nocb: Optimize kthreads and rdp initialization
rcu/nocb: Prepare nocb_cb_wait() to start with a non-offloaded rdp
rcu/nocb: Remove rcu_node structure from nocb list when de-offloaded
rcu-tasks: Use fewer callbacks queues if callback flood ends
rcu-tasks: Use separate ->percpu_dequeue_lim for callback dequeueing
rcu-tasks: Use more callback queues if contention encountered
rcu-tasks: Avoid raw-spinlocked wakeups from call_rcu_tasks_generic()
rcu-tasks: Count trylocks to estimate call_rcu_tasks() contention
rcu-tasks: Add rcupdate.rcu_task_enqueue_lim to set initial queueing
rcu-tasks: Make rcu_barrier_tasks*() handle multiple callback queues
rcu-tasks: Use workqueues for multiple rcu_tasks_invoke_cbs() invocations
rcu-tasks: Abstract invocations of callbacks
rcu-tasks: Abstract checking of callback lists
rcu-tasks: Add a ->percpu_enqueue_lim to the rcu_tasks structure
rcu-tasks: Inspect stalled task's trc state in locked state
rcu-tasks: Use spin_lock_rcu_node() and friends
rcutorture: Combine n_max_cbs from all kthreads in a callback flood
...
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEESH4wyp42V4tXvYsjUqAMR0iAlPIFAmHcFXQACgkQUqAMR0iA
lPIYCw/7BbaC04tqwwznlL97rDW5LpbxJp/9bheJ3tE9mjsIVm2ExU/RLwXaeAhM
K+aT0FB9s6u03OXOV3Dc2aFzwTXUamhHwgjp4uPs3+Colin1zGW1GMNTaSxGVm8Y
zJfdrasjhJeucxNda+VtuGBkKwzs3dZjlNOUi1RzhJNGJvQtLDAZp04P9gAv6RDD
gA/59YOx1q8m/Mig04PVWzKwiRl/zQLgGaXrgMYID2VrR2d/ZXSVY+itpq1GVOr7
JWycMveIhAle8aoupbacfZofX5vHShaigfkJRI6MtVHBq7dVyfpTWElqdnjOJaZy
AxwyTbcm0USfCK2K1Yl3L6TcU1DMaryGWJpJH/HvlR7yqEa8RVIeP2rLIlBTNHlP
Caouz0Hrvy1pbLsdd39NXZohnY50DWgml4I0wwhf03IaVU9XaGtdAaKdXL45NefI
ghHTNLCO9e4X1/lQT39EJp33/nZmSoN5vr6+0PQqlWXNgcUbMUAnA/K01jl3NKXP
fBcb7WigIW0CEReMqB7WyJ42at0uk8Px4JEuYGmxpDSQIj+HqaQAiruUjEP9tQ6K
b+LwRKgE7c2eNsHMBPNQ7txjen7rN6D8tVUEgF3Vp45uCm/wdp2kIZ27nOYWYRt3
BVY/6GOIQZ7wJQPztIIaYfd1+3m9DIoq6YsqYniknz88oU4KN5k=
=k1A+
-----END PGP SIGNATURE-----
Merge tag 'printk-for-5.17' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux
Pull printk updates from Petr Mladek:
- Remove some twists in the console registration code. It does not
change the existing behavior except for one corner case. The proper
default console (with tty binding) will be registered again even when
it has been removed in the meantime. It is actually a bug fix.
Anyway, this modified behavior requires some manual interaction.
- Optimize gdb extension for huge ring buffers.
- Do not use atomic operations for a local bitmap variable.
- Update git links in MAINTAINERS.
* tag 'printk-for-5.17' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux:
MAINTAIERS/printk: Add link to printk git
MAINTAINERS/vsprintf: Update link to printk git tree
scripts/gdb: lx-dmesg: read records individually
printk/console: Clean up boot console handling in register_console()
printk/console: Remove need_default_console variable
printk/console: Remove unnecessary need_default_console manipulation
printk/console: Rename has_preferred_console to need_default_console
printk/console: Split out code that enables default console
vsprintf: Use non-atomic bitmap API when applicable
Pull workqueue updates from Tejun Heo:
- The code around workqueue scheduler hooks got reorganized early 2019
which unfortuately introdued a couple subtle and rare race conditions
where preemption can mangle internal workqueue state triggering a
WARN and possibly causing a stall or at least delay in execution.
Frederic fixed both early December and the fixes were sitting in
for-5.16-fixes which I forgot to push. They are here now. I'll
forward them to stable after they land.
- The scheduler hook reorganization has more implicatoins for workqueue
code in that the hooks are now more strictly synchronized and thus
the interacting operations can become more straight-forward.
Lai is in the process of simplifying workqueue code and this pull
request contains some of the patches.
* 'for-5.17' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
workqueue: Remove the cacheline_aligned for nr_running
workqueue: Move the code of waking a worker up in unbind_workers()
workqueue: Remove schedule() in unbind_workers()
workqueue: Remove outdated comment about exceptional workers in unbind_workers()
workqueue: Remove the advanced kicking of the idle workers in rebind_workers()
workqueue: Remove the outdated comment before wq_worker_sleeping()
workqueue: Fix unbind_workers() VS wq_worker_sleeping() race
workqueue: Fix unbind_workers() VS wq_worker_running() race
workqueue: Upgrade queue_work_on() comment
Pull cgroup updates from Tejun Heo:
"Nothing too interesting. The only two noticeable changes are a subtle
cpuset behavior fix and trace event id field being expanded to u64
from int. Most others are code cleanups"
* 'for-5.17' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cpuset: convert 'allowed' in __cpuset_node_allowed() to be boolean
cgroup/rstat: check updated_next only for root
cgroup: rstat: explicitly put loop variant in while
cgroup: return early if it is already on preloaded list
cgroup/cpuset: Don't let child cpusets restrict parent in default hierarchy
cgroup: Trace event cgroup id fields should be u64
cgroup: fix a typo in comment
cgroup: get the wrong css for css_alloc() during cgroup_init_subsys()
cgroup: rstat: Mark benign data race to silence KCSAN
We noticed our tc ebpf tools can't start after we upgrade our in-house kernel
version from 4.19 to 5.10. That is because of the behaviour change in bpffs
caused by commit d2935de7e4 ("vfs: Convert bpf to use the new mount API").
In our tc ebpf tools, we do strict environment check. If the environment is
not matched, we won't allow to start the ebpf progs. One of the check is whether
bpffs is properly mounted. The mount information of bpffs in kernel-4.19 and
kernel-5.10 are as follows:
- kernel 4.19
$ mount -t bpf bpffs /sys/fs/bpf
$ mount -t bpf
bpffs on /sys/fs/bpf type bpf (rw,relatime)
- kernel 5.10
$ mount -t bpf bpffs /sys/fs/bpf
$ mount -t bpf
none on /sys/fs/bpf type bpf (rw,relatime)
The device name in kernel-5.10 is displayed as none instead of bpffs, then our
environment check fails. Currently we modify the tools to adopt to the kernel
behaviour change, but I think we'd better change the kernel code to keep the
behavior consistent.
After this change, the mount information will be displayed the same with the
behavior in kernel-4.19, for example:
$ mount -t bpf bpffs /sys/fs/bpf
$ mount -t bpf
bpffs on /sys/fs/bpf type bpf (rw,relatime)
Fixes: d2935de7e4 ("vfs: Convert bpf to use the new mount API")
Suggested-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Link: https://lore.kernel.org/bpf/20220108134623.32467-1-laoar.shao@gmail.com
- Add new P-state driver for AMD processors (Huang Rui).
- Fix initialization of min and max frequency QoS requests in the
cpufreq core (Rafael Wysocki).
- Fix EPP handling on Alder Lake in intel_pstate (Srinivas Pandruvada).
- Make intel_pstate update cpuinfo.max_freq when notified of HWP
capabilities changes and drop a redundant function call from that
driver (Rafael Wysocki).
- Improve IRQ support in the Qcom cpufreq driver (Ard Biesheuvel,
Stephen Boyd, Vladimir Zapolskiy).
- Fix double devm_remap() in the Mediatek cpufreq driver (Hector Yuan).
- Introduce thermal pressure helpers for cpufreq CPU cooling (Lukasz
Luba).
- Make cpufreq use default_groups in kobj_type (Greg Kroah-Hartman).
- Make cpuidle use default_groups in kobj_type (Greg Kroah-Hartman).
- Fix two comments in cpuidle code (Jason Wang, Yang Li).
- Allow model-specific normal EPB value to be used in the intel_epb
sysfs attribute handling code (Srinivas Pandruvada).
- Simplify locking in pm_runtime_put_suppliers() (Rafael Wysocki).
- Add safety net to supplier device release in the runtime PM core
code (Rafael Wysocki).
- Capture device status before disabling runtime PM for it (Rafael
Wysocki).
- Add new macros for declaring PM operations to allow drivers to
avoid guarding them with CONFIG_PM #ifdefs or __maybe_unused and
update some drivers to use these macros (Paul Cercueil).
- Allow ACPI hardware signature to be honoured during restore from
hibernation (David Woodhouse).
- Update outdated operating performance points (OPP) documentation
(Tang Yizhou).
- Reduce log severity for informative message regarding frequency
transition failures in devfreq (Tzung-Bi Shih).
- Add DRAM frequency controller devfreq driver for Allwinner sunXi
SoCs (Samuel Holland).
- Add missing COMMON_CLK dependency to sun8i devfreq driver (Arnd
Bergmann).
- Add support for new layout of Psys PowerLimit Register on SPR to
the Intel RAPL power capping driver (Zhang Rui).
- Fix typo in a comment in idle_inject.c (Jason Wang).
- Remove unused function definition from the DTPM (Dynamit Thermal
Power Management) power capping framework (Daniel Lezcano).
- Reduce DTPM trace verbosity (Daniel Lezcano).
-----BEGIN PGP SIGNATURE-----
iQJGBAABCAAwFiEE4fcc61cGeeHD/fCwgsRv/nhiVHEFAmHcgkgSHHJqd0Byand5
c29ja2kubmV0AAoJEILEb/54YlRxs34P/3kFhRk7qrwEekx6F11im6caLKT9+Qap
PuGVqfTbK7TupVQDVGFBEjTjgKY7Ph7Fcr4bqn6wvNOp96cjXyOSk/c1fcpS3Bpr
b1PYsFsb9diNKE462sGGYClyCT3X5qQqtpxzOl3g4I1PWKTC1mKFm4Jm2m6S6cFq
DKhsgYKFzQSZNb1wJM4JjHS9c3BRygqp4nfEAmifu5b9tLZf7stWnFHhbGq63M9m
OwHOrEEnzhf4pOXGZTvIXeczgE6IcuDdlGkIg7XMHnmKSNvj1HqhEgi2lfSRb98z
5eI4S6JymCJGVK+gr8iVCq1iJ+LKqV3YPXRqvI35/+NqIKYxMt2ZivQQf5s3aQLe
26gUulD3O6Pz5tMlwcDElD4/tcClfg35PCD/VzpRR8TAo8vLBb63kZ5v6+HM34ZJ
6QbLTNZJTnGmEqxMccUxP+HhZz8ssqpLAC+R2sE5yXbNpIZq8CbPiGb65RGiX3SG
CmRKqH/xQVNKBYP0ChjmUyhKcBxOnx1Xu8AhsN7gRAy0aht7j7OdjTnJuGiX6gu3
Q5WxvVvkekyfhuFQ5TST9y/fzvMJWzeaA6GhVIr6RoBmshNQGTb0H4HXARxS3Ah5
qjd7ao7BFLa898FCHaHIpmFWp0wF5iljwCJQVP3I2qUpPvDJxEtsxc4CF/AZzyNR
VudoFqLoIV5C
=1egI
-----END PGP SIGNATURE-----
Merge tag 'pm-5.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management updates from Rafael Wysocki:
"The most signigicant change here is the addition of a new cpufreq
'P-state' driver for AMD processors as a better replacement for the
venerable acpi-cpufreq driver.
There are also other cpufreq updates (in the core, intel_pstate, ARM
drivers), PM core updates (mostly related to adding new macros for
declaring PM operations which should make the lives of driver
developers somewhat easier), and a bunch of assorted fixes and
cleanups.
Summary:
- Add new P-state driver for AMD processors (Huang Rui).
- Fix initialization of min and max frequency QoS requests in the
cpufreq core (Rafael Wysocki).
- Fix EPP handling on Alder Lake in intel_pstate (Srinivas
Pandruvada).
- Make intel_pstate update cpuinfo.max_freq when notified of HWP
capabilities changes and drop a redundant function call from that
driver (Rafael Wysocki).
- Improve IRQ support in the Qcom cpufreq driver (Ard Biesheuvel,
Stephen Boyd, Vladimir Zapolskiy).
- Fix double devm_remap() in the Mediatek cpufreq driver (Hector
Yuan).
- Introduce thermal pressure helpers for cpufreq CPU cooling (Lukasz
Luba).
- Make cpufreq use default_groups in kobj_type (Greg Kroah-Hartman).
- Make cpuidle use default_groups in kobj_type (Greg Kroah-Hartman).
- Fix two comments in cpuidle code (Jason Wang, Yang Li).
- Allow model-specific normal EPB value to be used in the intel_epb
sysfs attribute handling code (Srinivas Pandruvada).
- Simplify locking in pm_runtime_put_suppliers() (Rafael Wysocki).
- Add safety net to supplier device release in the runtime PM core
code (Rafael Wysocki).
- Capture device status before disabling runtime PM for it (Rafael
Wysocki).
- Add new macros for declaring PM operations to allow drivers to
avoid guarding them with CONFIG_PM #ifdefs or __maybe_unused and
update some drivers to use these macros (Paul Cercueil).
- Allow ACPI hardware signature to be honoured during restore from
hibernation (David Woodhouse).
- Update outdated operating performance points (OPP) documentation
(Tang Yizhou).
- Reduce log severity for informative message regarding frequency
transition failures in devfreq (Tzung-Bi Shih).
- Add DRAM frequency controller devfreq driver for Allwinner sunXi
SoCs (Samuel Holland).
- Add missing COMMON_CLK dependency to sun8i devfreq driver (Arnd
Bergmann).
- Add support for new layout of Psys PowerLimit Register on SPR to
the Intel RAPL power capping driver (Zhang Rui).
- Fix typo in a comment in idle_inject.c (Jason Wang).
- Remove unused function definition from the DTPM (Dynamit Thermal
Power Management) power capping framework (Daniel Lezcano).
- Reduce DTPM trace verbosity (Daniel Lezcano)"
* tag 'pm-5.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (53 commits)
x86, sched: Fix undefined reference to init_freq_invariance_cppc() build error
cpufreq: amd-pstate: Fix Kconfig dependencies for AMD P-State
cpufreq: amd-pstate: Fix struct amd_cpudata kernel-doc comment
cpuidle: use default_groups in kobj_type
x86: intel_epb: Allow model specific normal EPB value
MAINTAINERS: Add AMD P-State driver maintainer entry
Documentation: amd-pstate: Add AMD P-State driver introduction
cpufreq: amd-pstate: Add AMD P-State performance attributes
cpufreq: amd-pstate: Add AMD P-State frequencies attributes
cpufreq: amd-pstate: Add boost mode support for AMD P-State
cpufreq: amd-pstate: Add trace for AMD P-State module
cpufreq: amd-pstate: Introduce the support for the processors with shared memory solution
cpufreq: amd-pstate: Add fast switch function for AMD P-State
cpufreq: amd-pstate: Introduce a new AMD P-State driver to support future processors
ACPI: CPPC: Add CPPC enable register function
ACPI: CPPC: Check present CPUs for determining _CPC is valid
ACPI: CPPC: Implement support for SystemIO registers
x86/msr: Add AMD CPPC MSR definitions
x86/cpufeatures: Add AMD Collaborative Processor Performance Control feature flag
cpufreq: use default_groups in kobj_type
...
Core
----
- Defer freeing TCP skbs to the BH handler, whenever possible,
or at least perform the freeing outside of the socket lock section
to decrease cross-CPU allocator work and improve latency.
- Add netdevice refcount tracking to locate sources of netdevice
and net namespace refcount leaks.
- Make Tx watchdog less intrusive - avoid pausing Tx and restarting
all queues from a single CPU removing latency spikes.
- Various small optimizations throughout the stack from Eric Dumazet.
- Make netdev->dev_addr[] constant, force modifications to go via
appropriate helpers to allow us to keep addresses in ordered data
structures.
- Replace unix_table_lock with per-hash locks, improving performance
of bind() calls.
- Extend skb drop tracepoint with a drop reason.
- Allow SO_MARK and SO_PRIORITY setsockopt under CAP_NET_RAW.
BPF
---
- New helpers:
- bpf_find_vma(), find and inspect VMAs for profiling use cases
- bpf_loop(), runtime-bounded loop helper trading some execution
time for much faster (if at all converging) verification
- bpf_strncmp(), improve performance, avoid compiler flakiness
- bpf_get_func_arg(), bpf_get_func_ret(), bpf_get_func_arg_cnt()
for tracing programs, all inlined by the verifier
- Support BPF relocations (CO-RE) in the kernel loader.
- Further the support for BTF_TYPE_TAG annotations.
- Allow access to local storage in sleepable helpers.
- Convert verifier argument types to a composable form with different
attributes which can be shared across types (ro, maybe-null).
- Prepare libbpf for upcoming v1.0 release by cleaning up APIs,
creating new, extensible ones where missing and deprecating those
to be removed.
Protocols
---------
- WiFi (mac80211/cfg80211):
- notify user space about long "come back in N" AP responses,
allow it to react to such temporary rejections
- allow non-standard VHT MCS 10/11 rates
- use coarse time in airtime fairness code to save CPU cycles
- Bluetooth:
- rework of HCI command execution serialization to use a common
queue and work struct, and improve handling errors reported
in the middle of a batch of commands
- rework HCI event handling to use skb_pull_data, avoiding packet
parsing pitfalls
- support AOSP Bluetooth Quality Report
- SMC:
- support net namespaces, following the RDMA model
- improve connection establishment latency by pre-clearing buffers
- introduce TCP ULP for automatic redirection to SMC
- Multi-Path TCP:
- support ioctls: SIOCINQ, OUTQ, and OUTQNSD
- support socket options: IP_TOS, IP_FREEBIND, IP_TRANSPARENT,
IPV6_FREEBIND, and IPV6_TRANSPARENT, TCP_CORK and TCP_NODELAY
- support cmsgs: TCP_INQ
- improvements in the data scheduler (assigning data to subflows)
- support fastclose option (quick shutdown of the full MPTCP
connection, similar to TCP RST in regular TCP)
- MCTP (Management Component Transport) over serial, as defined by
DMTF spec DSP0253 - "MCTP Serial Transport Binding".
Driver API
----------
- Support timestamping on bond interfaces in active/passive mode.
- Introduce generic phylink link mode validation for drivers which
don't have any quirks and where MAC capability bits fully express
what's supported. Allow PCS layer to participate in the validation.
Convert a number of drivers.
- Add support to set/get size of buffers on the Rx rings and size of
the tx copybreak buffer via ethtool.
- Support offloading TC actions as first-class citizens rather than
only as attributes of filters, improve sharing and device resource
utilization.
- WiFi (mac80211/cfg80211):
- support forwarding offload (ndo_fill_forward_path)
- support for background radar detection hardware
- SA Query Procedures offload on the AP side
New hardware / drivers
----------------------
- tsnep - FPGA based TSN endpoint Ethernet MAC used in PLCs with
real-time requirements for isochronous communication with protocols
like OPC UA Pub/Sub.
- Qualcomm BAM-DMUX WWAN - driver for data channels of modems
integrated into many older Qualcomm SoCs, e.g. MSM8916 or
MSM8974 (qcom_bam_dmux).
- Microchip LAN966x multi-port Gigabit AVB/TSN Ethernet Switch
driver with support for bridging, VLANs and multicast forwarding
(lan966x).
- iwlmei driver for co-operating between Intel's WiFi driver and
Intel's Active Management Technology (AMT) devices.
- mse102x - Vertexcom MSE102x Homeplug GreenPHY chips
- Bluetooth:
- MediaTek MT7921 SDIO devices
- Foxconn MT7922A
- Realtek RTL8852AE
Drivers
-------
- Significantly improve performance in the datapaths of:
lan78xx, ax88179_178a, lantiq_xrx200, bnxt.
- Intel Ethernet NICs:
- igb: support PTP/time PEROUT and EXTTS SDP functions on
82580/i354/i350 adapters
- ixgbevf: new PF -> VF mailbox API which avoids the risk of
mailbox corruption with ESXi
- iavf: support configuration of VLAN features of finer granularity,
stacked tags and filtering
- ice: PTP support for new E822 devices with sub-ns precision
- ice: support firmware activation without reboot
- Mellanox Ethernet NICs (mlx5):
- expose control over IRQ coalescing mode (CQE vs EQE) via ethtool
- support TC forwarding when tunnel encap and decap happen between
two ports of the same NIC
- dynamically size and allow disabling various features to save
resources for running in embedded / SmartNIC scenarios
- Broadcom Ethernet NICs (bnxt):
- use page frag allocator to improve Rx performance
- expose control over IRQ coalescing mode (CQE vs EQE) via ethtool
- Other Ethernet NICs:
- amd-xgbe: add Ryzen 6000 (Yellow Carp) Ethernet support
- Microsoft cloud/virtual NIC (mana):
- add XDP support (PASS, DROP, TX)
- Mellanox Ethernet switches (mlxsw):
- initial support for Spectrum-4 ASICs
- VxLAN with IPv6 underlay
- Marvell Ethernet switches (prestera):
- support flower flow templates
- add basic IP forwarding support
- NXP embedded Ethernet switches (ocelot & felix):
- support Per-Stream Filtering and Policing (PSFP)
- enable cut-through forwarding between ports by default
- support FDMA to improve packet Rx/Tx to CPU
- Other embedded switches:
- hellcreek: improve trapping management (STP and PTP) packets
- qca8k: support link aggregation and port mirroring
- Qualcomm 802.11ax WiFi (ath11k):
- qca6390, wcn6855: enable 802.11 power save mode in station mode
- BSS color change support
- WCN6855 hw2.1 support
- 11d scan offload support
- scan MAC address randomization support
- full monitor mode, only supported on QCN9074
- qca6390/wcn6855: report signal and tx bitrate
- qca6390: rfkill support
- qca6390/wcn6855: regdb.bin support
- Intel WiFi (iwlwifi):
- support SAR GEO Offset Mapping (SGOM) and Time-Aware-SAR (TAS)
in cooperation with the BIOS
- support for Optimized Connectivity Experience (OCE) scan
- support firmware API version 68
- lots of preparatory work for the upcoming Bz device family
- MediaTek WiFi (mt76):
- Specific Absorption Rate (SAR) support
- mt7921: 160 MHz channel support
- RealTek WiFi (rtw88):
- Specific Absorption Rate (SAR) support
- scan offload
- Other WiFi NICs
- ath10k: support fetching (pre-)calibration data from nvmem
- brcmfmac: configure keep-alive packet on suspend
- wcn36xx: beacon filter support
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmHbkZAACgkQMUZtbf5S
IruYkQ//XX7BggcwBfukPK83j0dONolClijqKcKR08g4vB5L8GXvv6OErKIWrh4k
h8JanCH352ZkbCSw3MvFdm825UYQv8vPMd6Qks/LJ4aSKqCuy4MIlAo+yOw4Km3O
i7++lRfma6DqHHI59wvLjWoxZSPu8lL+rI8UsZ5qMOlnNlGAOXsNrzRjaqQ3FddY
AMxZeBUtrPqUCCQZFq3U8apkYzUp7CA/3XR9zRcja3uPbrtOV2G+4whRF90qGNWz
Tm/QvJ9F/Ab292cbhxR4KuaQ3hUhaCQyDjbZk3+FZzZpAVhYTVqcNjny6+yXmbiP
NXRtwemnl1NlWKMnJM8lEeY48u626tRIkxA/Wtd61uoO5uKUSxfGP+UpUi+DfXbF
yIw50VQ7L2bpxXP/HjtmhVgZDaWKYyh22Zw4Hp/muMJz0hgUB0KODY3tf2jUWbjJ
0oEgocWyzhhwMQKqupTDCIaRgIs2ewYr4ZrFDhI3HnHC/vv1VjoPRUPIyxwppD2N
cXvZb3B1sWK8iX5gCbISGzyU4bB7I0rvJSTU42ueti7n6NqRFZ79qHQpYnnY+JdO
z1qOwY/d/yWfBoXVKRtRg2qz6CdEt5BQklwAgVEBgrFpf58gp694EwGMb1htY14J
r/k9bVpmyIFpUnBH2CPMRfBVA3tUTqzyzzFV4AMw40NYLKmhLdo=
=KLm3
-----END PGP SIGNATURE-----
Merge tag '5.17-net-next' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from Jakub Kicinski:
"Core
----
- Defer freeing TCP skbs to the BH handler, whenever possible, or at
least perform the freeing outside of the socket lock section to
decrease cross-CPU allocator work and improve latency.
- Add netdevice refcount tracking to locate sources of netdevice and
net namespace refcount leaks.
- Make Tx watchdog less intrusive - avoid pausing Tx and restarting
all queues from a single CPU removing latency spikes.
- Various small optimizations throughout the stack from Eric Dumazet.
- Make netdev->dev_addr[] constant, force modifications to go via
appropriate helpers to allow us to keep addresses in ordered data
structures.
- Replace unix_table_lock with per-hash locks, improving performance
of bind() calls.
- Extend skb drop tracepoint with a drop reason.
- Allow SO_MARK and SO_PRIORITY setsockopt under CAP_NET_RAW.
BPF
---
- New helpers:
- bpf_find_vma(), find and inspect VMAs for profiling use cases
- bpf_loop(), runtime-bounded loop helper trading some execution
time for much faster (if at all converging) verification
- bpf_strncmp(), improve performance, avoid compiler flakiness
- bpf_get_func_arg(), bpf_get_func_ret(), bpf_get_func_arg_cnt()
for tracing programs, all inlined by the verifier
- Support BPF relocations (CO-RE) in the kernel loader.
- Further the support for BTF_TYPE_TAG annotations.
- Allow access to local storage in sleepable helpers.
- Convert verifier argument types to a composable form with different
attributes which can be shared across types (ro, maybe-null).
- Prepare libbpf for upcoming v1.0 release by cleaning up APIs,
creating new, extensible ones where missing and deprecating those
to be removed.
Protocols
---------
- WiFi (mac80211/cfg80211):
- notify user space about long "come back in N" AP responses,
allow it to react to such temporary rejections
- allow non-standard VHT MCS 10/11 rates
- use coarse time in airtime fairness code to save CPU cycles
- Bluetooth:
- rework of HCI command execution serialization to use a common
queue and work struct, and improve handling errors reported in
the middle of a batch of commands
- rework HCI event handling to use skb_pull_data, avoiding packet
parsing pitfalls
- support AOSP Bluetooth Quality Report
- SMC:
- support net namespaces, following the RDMA model
- improve connection establishment latency by pre-clearing buffers
- introduce TCP ULP for automatic redirection to SMC
- Multi-Path TCP:
- support ioctls: SIOCINQ, OUTQ, and OUTQNSD
- support socket options: IP_TOS, IP_FREEBIND, IP_TRANSPARENT,
IPV6_FREEBIND, and IPV6_TRANSPARENT, TCP_CORK and TCP_NODELAY
- support cmsgs: TCP_INQ
- improvements in the data scheduler (assigning data to subflows)
- support fastclose option (quick shutdown of the full MPTCP
connection, similar to TCP RST in regular TCP)
- MCTP (Management Component Transport) over serial, as defined by
DMTF spec DSP0253 - "MCTP Serial Transport Binding".
Driver API
----------
- Support timestamping on bond interfaces in active/passive mode.
- Introduce generic phylink link mode validation for drivers which
don't have any quirks and where MAC capability bits fully express
what's supported. Allow PCS layer to participate in the validation.
Convert a number of drivers.
- Add support to set/get size of buffers on the Rx rings and size of
the tx copybreak buffer via ethtool.
- Support offloading TC actions as first-class citizens rather than
only as attributes of filters, improve sharing and device resource
utilization.
- WiFi (mac80211/cfg80211):
- support forwarding offload (ndo_fill_forward_path)
- support for background radar detection hardware
- SA Query Procedures offload on the AP side
New hardware / drivers
----------------------
- tsnep - FPGA based TSN endpoint Ethernet MAC used in PLCs with
real-time requirements for isochronous communication with protocols
like OPC UA Pub/Sub.
- Qualcomm BAM-DMUX WWAN - driver for data channels of modems
integrated into many older Qualcomm SoCs, e.g. MSM8916 or MSM8974
(qcom_bam_dmux).
- Microchip LAN966x multi-port Gigabit AVB/TSN Ethernet Switch driver
with support for bridging, VLANs and multicast forwarding
(lan966x).
- iwlmei driver for co-operating between Intel's WiFi driver and
Intel's Active Management Technology (AMT) devices.
- mse102x - Vertexcom MSE102x Homeplug GreenPHY chips
- Bluetooth:
- MediaTek MT7921 SDIO devices
- Foxconn MT7922A
- Realtek RTL8852AE
Drivers
-------
- Significantly improve performance in the datapaths of: lan78xx,
ax88179_178a, lantiq_xrx200, bnxt.
- Intel Ethernet NICs:
- igb: support PTP/time PEROUT and EXTTS SDP functions on
82580/i354/i350 adapters
- ixgbevf: new PF -> VF mailbox API which avoids the risk of
mailbox corruption with ESXi
- iavf: support configuration of VLAN features of finer
granularity, stacked tags and filtering
- ice: PTP support for new E822 devices with sub-ns precision
- ice: support firmware activation without reboot
- Mellanox Ethernet NICs (mlx5):
- expose control over IRQ coalescing mode (CQE vs EQE) via ethtool
- support TC forwarding when tunnel encap and decap happen between
two ports of the same NIC
- dynamically size and allow disabling various features to save
resources for running in embedded / SmartNIC scenarios
- Broadcom Ethernet NICs (bnxt):
- use page frag allocator to improve Rx performance
- expose control over IRQ coalescing mode (CQE vs EQE) via ethtool
- Other Ethernet NICs:
- amd-xgbe: add Ryzen 6000 (Yellow Carp) Ethernet support
- Microsoft cloud/virtual NIC (mana):
- add XDP support (PASS, DROP, TX)
- Mellanox Ethernet switches (mlxsw):
- initial support for Spectrum-4 ASICs
- VxLAN with IPv6 underlay
- Marvell Ethernet switches (prestera):
- support flower flow templates
- add basic IP forwarding support
- NXP embedded Ethernet switches (ocelot & felix):
- support Per-Stream Filtering and Policing (PSFP)
- enable cut-through forwarding between ports by default
- support FDMA to improve packet Rx/Tx to CPU
- Other embedded switches:
- hellcreek: improve trapping management (STP and PTP) packets
- qca8k: support link aggregation and port mirroring
- Qualcomm 802.11ax WiFi (ath11k):
- qca6390, wcn6855: enable 802.11 power save mode in station mode
- BSS color change support
- WCN6855 hw2.1 support
- 11d scan offload support
- scan MAC address randomization support
- full monitor mode, only supported on QCN9074
- qca6390/wcn6855: report signal and tx bitrate
- qca6390: rfkill support
- qca6390/wcn6855: regdb.bin support
- Intel WiFi (iwlwifi):
- support SAR GEO Offset Mapping (SGOM) and Time-Aware-SAR (TAS)
in cooperation with the BIOS
- support for Optimized Connectivity Experience (OCE) scan
- support firmware API version 68
- lots of preparatory work for the upcoming Bz device family
- MediaTek WiFi (mt76):
- Specific Absorption Rate (SAR) support
- mt7921: 160 MHz channel support
- RealTek WiFi (rtw88):
- Specific Absorption Rate (SAR) support
- scan offload
- Other WiFi NICs
- ath10k: support fetching (pre-)calibration data from nvmem
- brcmfmac: configure keep-alive packet on suspend
- wcn36xx: beacon filter support"
* tag '5.17-net-next' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2048 commits)
tcp: tcp_send_challenge_ack delete useless param `skb`
net/qla3xxx: Remove useless DMA-32 fallback configuration
rocker: Remove useless DMA-32 fallback configuration
hinic: Remove useless DMA-32 fallback configuration
lan743x: Remove useless DMA-32 fallback configuration
net: enetc: Remove useless DMA-32 fallback configuration
cxgb4vf: Remove useless DMA-32 fallback configuration
cxgb4: Remove useless DMA-32 fallback configuration
cxgb3: Remove useless DMA-32 fallback configuration
bnx2x: Remove useless DMA-32 fallback configuration
et131x: Remove useless DMA-32 fallback configuration
be2net: Remove useless DMA-32 fallback configuration
vmxnet3: Remove useless DMA-32 fallback configuration
bna: Simplify DMA setting
net: alteon: Simplify DMA setting
myri10ge: Simplify DMA setting
qlcnic: Simplify DMA setting
net: allwinner: Fix print format
page_pool: remove spinlock in page_pool_refill_alloc_cache()
amt: fix wrong return type of amt_send_membership_update()
...
Pull random number generator updates from Jason Donenfeld:
"These a bit more numerous than usual for the RNG, due to folks
resubmitting patches that had been pending prior and generally renewed
interest.
There are a few categories of patches in here:
1) Dominik Brodowski and I traded a series back and forth for a some
weeks that fixed numerous issues related to seeds being provided
at extremely early boot by the firmware, before other parts of the
kernel or of the RNG have been initialized, both fixing some
crashes and addressing correctness around early boot randomness.
One of these is marked for stable.
2) I replaced the RNG's usage of SHA-1 with BLAKE2s in the entropy
extractor, and made the construction a bit safer and more
standard. This was sort of a long overdue low hanging fruit, as we
were supposed to have phased out SHA-1 usage quite some time ago
(even if all we needed here was non-invertibility). Along the way
it also made extraction 131% faster. This required a bit of
Kconfig and symbol plumbing to make things work well with the
crypto libraries, which is one of the reasons why I'm sending you
this pull early in the cycle.
3) I got rid of a truly superfluous call to RDRAND in the hot path,
which resulted in a whopping 370% increase in performance.
4) Sebastian Andrzej Siewior sent some patches regarding PREEMPT_RT,
the full series of which wasn't ready yet, but the first two
preparatory cleanups were good on their own. One of them touches
files in kernel/irq/, which is the other reason why I'm sending
you this pull early in the cycle.
5) Other assorted correctness fixes from Eric Biggers, Jann Horn,
Mark Brown, Dominik Brodowski, and myself"
* 'random-5.17-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/crng/random:
random: don't reset crng_init_cnt on urandom_read()
random: avoid superfluous call to RDRAND in CRNG extraction
random: early initialization of ChaCha constants
random: use IS_ENABLED(CONFIG_NUMA) instead of ifdefs
random: harmonize "crng init done" messages
random: mix bootloader randomness into pool
random: do not throw away excess input to crng_fast_load
random: do not re-init if crng_reseed completes before primary init
random: fix crash on multiple early calls to add_bootloader_randomness()
random: do not sign extend bytes for rotation when mixing
random: use BLAKE2s instead of SHA1 in extraction
lib/crypto: blake2s: include as built-in
random: fix data race on crng init time
random: fix data race on crng_node_pool
irq: remove unused flags argument from __handle_irq_event_percpu()
random: remove unused irq_flags argument from add_interrupt_randomness()
random: document add_hwgenerator_randomness() with other input functions
MAINTAINERS: add git tree for random.c
accesing it in order to prevent any potential data races, and convert
all users to those new accessors
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmHcgFoACgkQEsHwGGHe
VUqXeRAAvcNEfFw6BvXeGfFTxKmOrsRtu2WCkAkjvamyhXMCrjBqqHlygLJFCH5i
2mc6HBohzo4vBFcgi3R5tVkGazqlthY1KUM9Jpk7rUuUzi0phTH7n/MafZOm9Es/
BHYcAAyT/NwZRbCN0geccIzBtbc4xr8kxtec7vkRfGDx8B9/uFN86xm7cKAaL62G
UDs0IquDPKEns3A7uKNuvKztILtuZWD1WcSkbOULJzXgLkb+cYKO1Lm9JK9rx8Ds
8tjezrJgOYGLQyyv0i3pWelm3jCZOKUChPslft0opvVUbrNd8piehvOm9CWopHcB
QsYOWchnULTE9o4ZAs/1PkxC0LlFEWZH8bOLxBMTDVEY+xvmDuj1PdBUpncgJbOh
dunHzsvaWproBSYUXA9nKhZWTVGl+CM8Ks7jXjl3IPynLd6cpYZ/5gyBVWEX7q3e
8htG95NzdPPo7doxMiNSKGSmSm0Np1TJ/i89vsYeGfefsvsq53Fyjhu7dIuTWHmU
2YUe6qHs6dF9x1bkHAAZz6T9Hs4BoGQBcXUnooT9JbzVdv2RfTPsrawdu8dOnzV1
RhwCFdFcll0AIEl0T9fCYzUI/Ga8ZS0roXs5NZ4wl0lwr0BGFwiU8WC1FUdGsZo9
0duaa0Tpv0OWt6rIMMB/E9QsqCDsQ4CMHuQpVVw+GOO5ux9kMms=
=v6Xn
-----END PGP SIGNATURE-----
Merge tag 'core_entry_for_v5.17_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull thread_info flag accessor helper updates from Borislav Petkov:
"Add a set of thread_info.flags accessors which snapshot it before
accesing it in order to prevent any potential data races, and convert
all users to those new accessors"
* tag 'core_entry_for_v5.17_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
powerpc: Snapshot thread flags
powerpc: Avoid discarding flags in system_call_exception()
openrisc: Snapshot thread flags
microblaze: Snapshot thread flags
arm64: Snapshot thread flags
ARM: Snapshot thread flags
alpha: Snapshot thread flags
sched: Snapshot thread flags
entry: Snapshot thread flags
x86: Snapshot thread flags
thread_info: Add helpers to snapshot thread flags
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmHcE9wACgkQEsHwGGHe
VUoMmw//QEFHlmmWH9rIz23/2xv6Y+mR0G9JXm3YHtraId7BA8Kr9VI+4fxBjNxA
aeCrgBxFEOq32b44GpeKBj4U1QWNi3lHzg0lYmw8zTab0Lxkuh8tw/2Qc15iLfBv
sJZ7VQZ39TxR70ng3Q7I7Dox1Gu4nlP9d8nO2boxkepTXxx6UFPYCRPgZoMm2EYH
fw9CauJIb5j6Ka2EOU1wWn+IKaxGz/Moe6FEQY7CH8OaJW3zcXyWL4GFdc2sDB41
hyi29mRte3PT1G4RAcakLDh7ME781bGCo2xHqtyaCBiCvRkex23TyZ9FEC35xcDb
gs/0AMeC4z1XVX/quEqJWcQRHjvDQY3nMvWnS0vfCOtuqBSTdSffe6j0wvFS4cc8
ks3JbePCeJNYoBG+71RAG8+J0ZsfOm1NS42vHCQ8PuGU0V67ca0r6oJqMGwZSwfl
+iYWb68pvQkYX/kX7E7S3qe5PlXvB7XTo2WhOXeZbONwZf23qiqvjdGoQUhhX/ry
G8bI3mojJG8U8bvViZYfVhnFscLnuLEqs3GJJGndwHyI7y7xIC/VsR7CZrfufDO9
2r8NXQXI7PYmLsBKMPRFfIvmTXhZwIqNwsdBmrU0mGoFaMrWdrIMlQV8qCrPg2hT
7YIxDzldmSlovYCUh7sh1y8fZmkhTXPgb3URluBkEPuzzi59Trg=
=brI6
-----END PGP SIGNATURE-----
Merge tag 'core_core_for_v5.17_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull notifier fix from Borislav Petkov:
"Return an error when a notifier callback has been registered already"
* tag 'core_core_for_v5.17_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
notifier: Return an error when a callback has already been registered
The MSI entries for multi-MSI are populated en bloc for the MSI descriptor,
but the current code invokes the population inside the per interrupt loop
which triggers a warning in the sysfs code and causes the interrupt
allocation to fail.
Move it outside of the loop so it works correctly for single and multi-MSI.
Fixes: bf5e758f02 ("genirq/msi: Simplify sysfs handling")
Reported-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/87leznqx2a.ffs@tglx
for-5.16-fixes contains two subtle race conditions which were introduced by
scheduler side code cleanups. The branch didn't get pushed out, so merge
into for-5.17.
Merge cpuidle updates, PM core updates and one hiberation-related
update for 5.17-rc1:
- Make cpuidle use default_groups in kobj_type (Greg Kroah-Hartman).
- Fix two comments in cpuidle code (Jason Wang, Yang Li).
- Simplify locking in pm_runtime_put_suppliers() (Rafael Wysocki).
- Add safety net to supplier device release in the runtime PM core
code (Rafael Wysocki).
- Capture device status before disabling runtime PM for it (Rafael
Wysocki).
- Add new macros for declaring PM operations to allow drivers to
avoid guarding them with CONFIG_PM #ifdefs or __maybe_unused and
update some drivers to use these macros (Paul Cercueil).
- Allow ACPI hardware signature to be honoured during restore from
hibernation (David Woodhouse).
* pm-cpuidle:
cpuidle: use default_groups in kobj_type
cpuidle: Fix cpuidle_remove_state_sysfs() kerneldoc comment
cpuidle: menu: Fix typo in a comment
* pm-core:
PM: runtime: Simplify locking in pm_runtime_put_suppliers()
mmc: mxc: Use the new PM macros
mmc: jz4740: Use the new PM macros
PM: runtime: Add safety net to supplier device release
PM: runtime: Capture device status before disabling runtime PM
PM: core: Add new *_PM_OPS macros, deprecate old ones
PM: core: Redefine pm_ptr() macro
r8169: Avoid misuse of pm_ptr() macro
* pm-sleep:
PM: hibernate: Allow ACPI hardware signature to be honoured
- KCSAN enabled for arm64.
- Additional kselftests to exercise the syscall ABI w.r.t. SVE/FPSIMD.
- Some more SVE clean-ups and refactoring in preparation for SME support
(scalable matrix extensions).
- BTI clean-ups (SYM_FUNC macros etc.)
- arm64 atomics clean-up and codegen improvements.
- HWCAPs for FEAT_AFP (alternate floating point behaviour) and
FEAT_RPRESS (increased precision of reciprocal estimate and reciprocal
square root estimate).
- Use SHA3 instructions to speed-up XOR.
- arm64 unwind code refactoring/unification.
- Avoid DC (data cache maintenance) instructions when DCZID_EL0.DZP == 1
(potentially set by a hypervisor; user-space already does this).
- Perf updates for arm64: support for CI-700, HiSilicon PCIe PMU,
Marvell CN10K LLC-TAD PMU, miscellaneous clean-ups.
- Other fixes and clean-ups; highlights: fix the handling of erratum
1418040, correct the calculation of the nomap region boundaries,
introduce io_stop_wc() mapped to the new DGH instruction (data
gathering hint).
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE5RElWfyWxS+3PLO2a9axLQDIXvEFAmHXNtYACgkQa9axLQDI
XvHBGw/+OVGdbORxwrU+uRb7N6qIJkrW/mmM4x1KLo1i+REZLb8/VlXm0xC60FG+
39x6FSVkRr+lLDfTqpQsOez5FpdsvOe9Fc4L3bwniDg+EPo7x65VmP2dw/Ae2q0i
87xyWCczx5hFEPF/1sb1R1pm3bTXjeklBkdv+OXhwflLOwpCp1J8z8WJK8qJVFX6
CmuE6Q4fDQr0ghl9Nf8DiAr20mHDh8wMKNUJOg4waaQOOCta6q1oJ3qfz6E9z1eW
zEE3dfZgBCx7HCRc3KGgzT7H4Ces3BYvhBYP6bJRliVI88XdPiM4MfdGL4UIb27Q
NLAdr+FVzk/YLzMHtxSfkT10nBqoOPWUTckLu9jIIl5cpBX73Wiz7jfzBvqFmC/y
opSFMZ3lwQPM5WAPtAlZptA3GPPySeInVmvUgB7IQ+1Q1T1n8ri1y5hzTYC4Sc/g
amJI1rXf1Al8+2zFBggr6Up+EOnfV9nAwrzLXkRlASsfmvY4dnVWg3NWfBqtEHAq
VuZCecSgawxuSlpmJ4VGbLrBFaz18bn9EzujR5fFvi5Qcg1CMFOROi2+6IynopNV
IS0R8j6fwgQPA5lcnNIPeJRRkQoqO4l8bPDzeXEny0BSw313EgBSo9aQtnjyIJbp
BTuDHARKs+/NvDPvd8GQkxNPgwJnVOL9pdgNAolEu1/k7JtnIS0=
=ecyi
-----END PGP SIGNATURE-----
Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 updates from Catalin Marinas:
- KCSAN enabled for arm64.
- Additional kselftests to exercise the syscall ABI w.r.t. SVE/FPSIMD.
- Some more SVE clean-ups and refactoring in preparation for SME
support (scalable matrix extensions).
- BTI clean-ups (SYM_FUNC macros etc.)
- arm64 atomics clean-up and codegen improvements.
- HWCAPs for FEAT_AFP (alternate floating point behaviour) and
FEAT_RPRESS (increased precision of reciprocal estimate and
reciprocal square root estimate).
- Use SHA3 instructions to speed-up XOR.
- arm64 unwind code refactoring/unification.
- Avoid DC (data cache maintenance) instructions when DCZID_EL0.DZP ==
1 (potentially set by a hypervisor; user-space already does this).
- Perf updates for arm64: support for CI-700, HiSilicon PCIe PMU,
Marvell CN10K LLC-TAD PMU, miscellaneous clean-ups.
- Other fixes and clean-ups; highlights: fix the handling of erratum
1418040, correct the calculation of the nomap region boundaries,
introduce io_stop_wc() mapped to the new DGH instruction (data
gathering hint).
* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (81 commits)
arm64: Use correct method to calculate nomap region boundaries
arm64: Drop outdated links in comments
arm64: perf: Don't register user access sysctl handler multiple times
drivers: perf: marvell_cn10k: fix an IS_ERR() vs NULL check
perf/smmuv3: Fix unused variable warning when CONFIG_OF=n
arm64: errata: Fix exec handling in erratum 1418040 workaround
arm64: Unhash early pointer print plus improve comment
asm-generic: introduce io_stop_wc() and add implementation for ARM64
arm64: Ensure that the 'bti' macro is defined where linkage.h is included
arm64: remove __dma_*_area() aliases
docs/arm64: delete a space from tagged-address-abi
arm64: Enable KCSAN
kselftest/arm64: Add pidbench for floating point syscall cases
arm64/fp: Add comments documenting the usage of state restore functions
kselftest/arm64: Add a test program to exercise the syscall ABI
kselftest/arm64: Allow signal tests to trigger from a function
kselftest/arm64: Parameterise ptrace vector length information
arm64/sve: Minor clarification of ABI documentation
arm64/sve: Generalise vector length configuration prctl() for SME
arm64/sve: Make sysctl interface for SVE reusable by SME
...
The event_command.parse() callback is responsible for parsing and
registering triggers. The existing command implementions for this
callback duplicate a lot of the same code, so to clean up and
consolidate those implementations, introduce a handful of helper
functions for implementors to use.
This also makes it easier for new commands to be implemented and
allows them to focus more on the customizations they provide rather
than obscuring and complicating it with boilerplate code.
Link: https://lkml.kernel.org/r/c1ff71f594d45177706571132bd3119491097221.1641823001.git.zanussi@kernel.org
Signed-off-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The name of the func() callback on event_trigger_ops is too generic
and is easily confused with other callbacks with that name, so change
it to something that reflects its actual purpose.
In this case, the main purpose of the callback is to implement an
event trigger, so call it trigger() instead.
Also add some more documentation to event_trigger_ops describing the
callbacks a bit better.
Link: https://lkml.kernel.org/r/36ab812e3ee74ee03ae0043fda41a858ee728c00.1641823001.git.zanussi@kernel.org
Signed-off-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The name of the func() callback on event_command is too generic and is
easily confused with other callbacks with that name, so change it to
something that reflects its actual purpose.
In this case, the main purpose of the callback is to parse an event
command, so call it parse() instead.
Link: https://lkml.kernel.org/r/7784e321840752ed88aac0b349c0c685fc9247b1.1641823001.git.zanussi@kernel.org
Signed-off-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Pull clocksource watchdog updates from Paul McKenney:
- Avoid accidental unstable marking of clocksources by rejecting
clocksource measurements where the source of the skew is the delay
reading reference clocksource itself. This change avoids many of the
current false positives caused by epic cache-thrashing workloads.
- Reduce the default clocksource_watchdog() retries to 2, thus offsetting
the increased overhead due to #1 above rereading the reference
clocksource.
Link: https://lore.kernel.org/lkml/20220105001723.GA536708@paulmck-ThinkPad-P17-Gen-1
In the function bacct_add_task the code reading task->exit_code was
introduced in commit f3cef7a994 ("[PATCH] csa: basic accounting over
taskstats"), and it is not entirely clear what the taskstats interface
is trying to return as only returning the exit_code of the first task
in a process doesn't make a lot of sense.
As best as I can figure the intent is to return task->exit_code after
a task exits. The field is returned with per task fields, so the
exit_code of the entire process is not wanted. Only the value of the
first task is returned so this is not a useful way to get the per task
ptrace stop code. The ordinary case of returning this value is
returning after a task exits, which also precludes use for getting
a ptrace value.
It is common to for the first task of a process to also be the last
task of a process so this field may have done something reasonable by
accident in testing.
Make ac_exitcode a reliable per task value by always returning it for
every exited task.
Setting ac_exitcode in a sensible mannter makes it possible to continue
to provide this value going forward.
Cc: Balbir Singh <bsingharora@gmail.com>
Fixes: f3cef7a994 ("[PATCH] csa: basic accounting over taskstats")
Link: https://lkml.kernel.org/r/20220103213312.9144-5-ebiederm@xmission.com
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
The function wait_task_zombie is defined to always returns the process not
thread exit status. Unfortunately when process group exit support
was added to wait_task_zombie the WNOWAIT case was overlooked.
Usually tsk->exit_code and tsk->signal->group_exit_code will be in sync
so fixing this is bug probably has no effect in practice. But fix
it anyway so that people aren't scratching their heads about why
the two code paths are different.
History-Tree: https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git
Fixes: 2c66151cbc2c ("[PATCH] sys_exit() threading improvements, BK-curr")
Link: https://lkml.kernel.org/r/20220103213312.9144-3-ebiederm@xmission.com
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
The comment about coredumps not reaching do_group_exit and the
corresponding BUG_ON are bogus.
What happens and has happened for years is that get_signal calls
do_coredump (which sets SIGNAL_GROUP_EXIT and group_exit_code) and
then do_group_exit passing the signal number. Then do_group_exit
ignores the exit_code it is passed and uses signal->group_exit_code
from the coredump.
The comment and BUG_ON were correct when they were added during the
2.5 development cycle, but became obsolete and incorrect when
get_signal was changed to fall through to do_group_exit after
do_coredump in 2.6.10-rc2.
So remove the stale comment and BUG_ON
Fixes: 63bd6144f191 ("[PATCH] Invalid BUG_ONs in signal.c")
History-Tree: https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git
Link: https://lkml.kernel.org/r/20220103213312.9144-2-ebiederm@xmission.com
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
All profile_handoff_task does is notify the task_free_notifier chain.
The helpers task_handoff_register and task_handoff_unregister are used
to add and delete entries from that chain and are never called.
So remove the dead code and make it much easier to read and reason
about __put_task_struct.
Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Link: https://lkml.kernel.org/r/87fspyw6m0.fsf@email.froward.int.ebiederm.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
When I say remove I mean remove. All profile_task_exit and
profile_munmap do is call a blocking notifier chain. The helpers
profile_task_register and profile_task_unregister are not called
anywhere in the tree. Which means this is all dead code.
So remove the dead code and make it easier to read do_exit.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lkml.kernel.org/r/20220103213312.9144-1-ebiederm@xmission.com
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Fix kernel-doc warnings in kernel/signal.c:
kernel/signal.c:1830: warning: Function parameter or member 'force_coredump' not described in 'force_sig_seccomp'
kernel/signal.c:2873: warning: missing initial short description on line:
* signal_delivered -
Also add a closing parenthesis to the comments in signal_delivered().
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Richard Weinberger <richard@nod.at>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Marco Elver <elver@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20211222031027.29694-1-rdunlap@infradead.org
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
This helper is misleading. It tests for an ongoing exec as well as
the process having received a fatal signal.
Sometimes it is appropriate to treat an on-going exec differently than
a process that is shutting down due to a fatal signal. In particular
taking the fast path out of exit_signals instead of retargeting
signals is not appropriate during exec, and not changing the the exit
code in do_group_exit during exec.
Removing the helper makes it more obvious what is going on as both
cases must be coded for explicitly.
While removing the helper fix the two cases where I have observed
using signal_group_exit resulted in the wrong result.
In exit_signals only test for SIGNAL_GROUP_EXIT so that signals are
retargetted during an exec.
In do_group_exit use 0 as the exit code during an exec as de_thread
does not set group_exit_code. As best as I can determine
group_exit_code has been is set to 0 most of the time during
de_thread. During a thread group stop group_exit_code is set to the
stop signal and when the thread group receives SIGCONT group_exit_code
is reset to 0.
Link: https://lkml.kernel.org/r/20211213225350.27481-8-ebiederm@xmission.com
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
The only remaining user of group_exit_task is exec. Rename the field
so that it is clear which part of the code uses it.
Update the comment above the definition of group_exec_task to document
how it is currently used.
Link: https://lkml.kernel.org/r/20211213225350.27481-7-ebiederm@xmission.com
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>