In the following commit:
7675104990 ("sched: Implement lockless wake-queues")
we gained lockless wake-queues.
The -RT kernel managed to lockup itself with those. There could be multiple
attempts for task X to enqueue it for a wakeup _even_ if task X is already
running.
The reason is that task X could be runnable but not yet on CPU. The the
task performing the wakeup did not leave the CPU it could performe
multiple wakeups.
With the proper timming task X could be running and enqueued for a
wakeup. If this happens while X is performing a fork() then its its
child will have a !NULL `wake_q` member copied.
This is not a problem as long as the child task does not participate in
lockless wakeups :)
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Davidlohr Bueso <dbueso@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 7675104990 ("sched: Implement lockless wake-queues")
Link: http://lkml.kernel.org/r/20151221171710.GA5499@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Make 'r' 64-bit type to avoid overflow in 'r * LOAD_AVG_MAX'
on 32-bit systems:
UBSAN: Undefined behaviour in kernel/sched/fair.c:2785:18
signed integer overflow:
87950 * 47742 cannot be represented in type 'int'
The most likely effect of this bug are bad load average numbers
resulting in weird scheduling. It's also likely that this can
persist for a longer time - until the system goes idle for
a long time so that all load avg numbers get reset.
[ This is the CFS load average metric, not the procfs output, which
is separate. ]
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 9d89c257df ("sched/fair: Rewrite runnable load and utilization average tracking")
Link: http://lkml.kernel.org/r/1450097243-30137-1-git-send-email-aryabinin@virtuozzo.com
[ Improved the changelog. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
There's a race on CPU unplug where we free the swevent hash array
while it can still have events on. This will result in a
use-after-free which is BAD.
Simply do not free the hash array on unplug. This leaves the thing
around and no use-after-free takes place.
When the last swevent dies, we do a for_each_possible_cpu() iteration
anyway to clean these up, at which time we'll free it, so no leakage
will occur.
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Tested-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
1. The recordmcount change had an output that used sprintf() (incorrectly)
when it should have been a fprintf() to stderr.
2. The printk_formats file could crash if someone added a trace_printk()
in the core kernel, and also added one in a module. This does not
affect production kernels. Only kernels where developers add trace_printk()
for debugging can crash.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJWi+jEAAoJEKKk/i67LK/8at0IAIMkbBMbRVjCM0tWTMM/2rZr
nD+2UOVZHDI7rSdhOC0yfRL041uiu/wI4DV6FAJZX8D5BumS7Wwv/GItwwNU3+TD
ZI6OG9f/6OxoC1jFUY8CvpSqAeV6uoro4heSzjprirSUsGwrFlTuHMt2NyEl0FvO
985HIzfTcb3yVFvsjm7Uyv1SsOdPL+BldDc46mgo8fXv3VYvvbqTP5NMkx7YyMdm
Dlo90b1nQ8bk3bjG4RvYmlnfK+HfbB2TD+rz3xJ+YaFRoJIov0/BzimeZaI3Aw/R
9TjLqwBN8ASVxc3A+/AQdUEserzXl7RSJHT/92YIQc8FkaS50cXX80Xk7ez0JVk=
=nb9G
-----END PGP SIGNATURE-----
Merge tag 'trace-v4.4-rc4-3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fixes from Steven Rostedt:
"Two more fixes:
1. The recordmcount change had an output that used sprintf()
(incorrectly) when it should have been a fprintf() to stderr.
2. The printk_formats file could crash if someone added a
trace_printk() in the core kernel, and also added one in a module.
This does not affect production kernels. Only kernels where
developers add trace_printk() for debugging can crash"
* tag 'trace-v4.4-rc4-3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Fix setting of start_index in find_next()
ftrace/scripts: Fix incorrect use of sprintf in recordmcount
Some sysfs attributes in /sys/power/ should really be read-only,
so add support for that, convert those attributes to read-only
and drop the stub .show() routines from them.
Original-by: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
When we do cat /sys/kernel/debug/tracing/printk_formats, we hit kernel
panic at t_show.
general protection fault: 0000 [#1] PREEMPT SMP
CPU: 0 PID: 2957 Comm: sh Tainted: G W O 3.14.55-x86_64-01062-gd4acdc7 #2
RIP: 0010:[<ffffffff811375b2>]
[<ffffffff811375b2>] t_show+0x22/0xe0
RSP: 0000:ffff88002b4ebe80 EFLAGS: 00010246
RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000004
RDX: 0000000000000004 RSI: ffffffff81fd26a6 RDI: ffff880032f9f7b1
RBP: ffff88002b4ebe98 R08: 0000000000001000 R09: 000000000000ffec
R10: 0000000000000000 R11: 000000000000000f R12: ffff880004d9b6c0
R13: 7365725f6d706400 R14: ffff880004d9b6c0 R15: ffffffff82020570
FS: 0000000000000000(0000) GS:ffff88003aa00000(0063) knlGS:00000000f776bc40
CS: 0010 DS: 002b ES: 002b CR0: 0000000080050033
CR2: 00000000f6c02ff0 CR3: 000000002c2b3000 CR4: 00000000001007f0
Call Trace:
[<ffffffff811dc076>] seq_read+0x2f6/0x3e0
[<ffffffff811b749b>] vfs_read+0x9b/0x160
[<ffffffff811b7f69>] SyS_read+0x49/0xb0
[<ffffffff81a3a4b9>] ia32_do_call+0x13/0x13
---[ end trace 5bd9eb630614861e ]---
Kernel panic - not syncing: Fatal exception
When the first time find_next calls find_next_mod_format, it should
iterate the trace_bprintk_fmt_list to find the first print format of
the module. However in current code, start_index is smaller than *pos
at first, and code will not iterate the list. Latter container_of will
get the wrong address with former v, which will cause mod_fmt be a
meaningless object and so is the returned mod_fmt->fmt.
This patch will fix it by correcting the start_index. After fixed,
when the first time calls find_next_mod_format, start_index will be
equal to *pos, and code will iterate the trace_bprintk_fmt_list to
get the right module printk format, so is the returned mod_fmt->fmt.
Link: http://lkml.kernel.org/r/5684B900.9000309@intel.com
Cc: stable@vger.kernel.org # 3.12+
Fixes: 102c9323c3 "tracing: Add __tracepoint_string() to export string pointers"
Signed-off-by: Qiu Peiyang <peiyangx.qiu@intel.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
A _lot_ of ->write() instances were open-coding it; some are
converted to memdup_user_nul(), a lot more remain...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Both htab_map_update_elem() and htab_map_delete_elem() can be
called from eBPF program, and they may be in kernel hot path,
so it isn't efficient to use a per-hashtable lock in this two
helpers.
The per-hashtable spinlock is used for protecting bucket's
hlist, and per-bucket lock is just enough. This patch converts
the per-hashtable lock into per-bucket spinlock, so that
contention can be decreased a lot.
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
The spinlock is just used for protecting the per-bucket
hlist, so it isn't needed for selecting bucket.
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Preparing for removing global per-hashtable lock, so
the counter need to be defined as aotmic_t first.
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The posix_clock_poll function is supposed to return a bit mask of
POLLxxx values. However, in case the hardware has disappeared (due to
hot plugging for example) this code returns -ENODEV in a futile
attempt to throw an error at the file descriptor level. The kernel's
file_operations interface does not accept such error codes from the
poll method. Instead, this function aught to return POLLERR.
The value -ENODEV does, in fact, contain the POLLERR bit (and almost
all the other POLLxxx bits as well), but only by chance. This patch
fixes code to return a proper bit mask.
Credit goes to Markus Elfring for pointing out the suspicious
signed/unsigned mismatch.
Reported-by: Markus Elfring <elfring@users.sourceforge.net>
igned-off-by: Richard Cochran <richardcochran@gmail.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Julia Lawall <julia.lawall@lip6.fr>
Link: http://lkml.kernel.org/r/1450819198-17420-1-git-send-email-richardcochran@gmail.com
Cc: stable@vger.kernel.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Make the inode argument of the inode_getsecid hook non-const so that we
can use it to revalidate invalid security labels.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Acked-by: Stephen Smalley <sds@tycho.nsa.gov>
Signed-off-by: Paul Moore <pmoore@redhat.com>
The file tracing_enable is obsolete and does not exist anymore. Replace
the comment that references it with the proper tracing_on file.
Link: http://lkml.kernel.org/r/1450787141-45544-1-git-send-email-chuhu@redhat.com
Signed-off-by: Chuyu Hu <chuhu@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The start and end variables were only used when ftrace_module_init() was
split up into multiple functions. No need to keep them around after the
merger.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
This bpf_verifier_ops structure is never modified, like the other
bpf_verifier_ops structures, so declare it as const.
Done with the help of Coccinelle.
Link: http://lkml.kernel.org/r/1449855359-13724-1-git-send-email-Julia.Lawall@lip6.fr
Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Jiri Olsa noted that the change to replace the control_ops did not update
the trampoline for when running perf on a single CPU and with CONFIG_PREEMPT
disabled (where dynamic ops, like perf, can use trampolines directly). The
result was that perf function could be called when RCU is not watching as
well as not handle the ftrace_local_disable().
Modify the ftrace_ops_get_func() to also check the RCU and PER_CPU ops flags
and use the recursive function if they are set. The recursive function is
modified to check those flags and execute the appropriate checks if they are
set.
Link: http://lkml.kernel.org/r/20151201134213.GA14155@krava.brq.redhat.com
Reported-by: Jiri Olsa <jolsa@redhat.com>
Patch-fixed-up-by: Jiri Olsa <jolsa@redhat.com>
Tested-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Currently perf has its own list function within the ftrace infrastructure
that seems to be used only to allow for it to have per-cpu disabling as well
as a check to make sure that it's not called while RCU is not watching. It
uses something called the "control_ops" which is used to iterate over ops
under it with the control_list_func().
The problem is that this control_ops and control_list_func unnecessarily
complicates the code. By replacing FTRACE_OPS_FL_CONTROL with two new flags
(FTRACE_OPS_FL_RCU and FTRACE_OPS_FL_PER_CPU) we can remove all the code
that is special with the control ops and add the needed checks within the
generic ftrace_list_func().
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
When showing all tramps registered to a ftrace record in the file
enabled_functions, it exits the loop with ops == NULL. But then it is
suppose to show the function on the ops->trampoline and
add_trampoline_func() is called with the given ops. But because ops is now
NULL (to exit the loop), it always shows the static trampoline instead of
the one that is really registered to the record.
The call to add_trampoline_func() that shows the trampoline for the given
ops needs to be called at every iteration.
Fixes: 39daa7b9e8 "ftrace: Show all tramps registered to a record on ftrace_bug()"
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Add include asm/paravirt.h to cputime.c, as steal_account_process_tick
calls paravirt_steal_clock, which is defined in asm/paravirt.h.
The ifdef CONFIG_PARAVIRT is necessary because not all archs have an
asm/paravirt.h to include.
The reason why currently cputime.c compiles, even though include
<asm/paravirt.h> is missing, is that on x86 asm/paravirt.h is included
by one of the other headers included in kernel/sched/cputime.c:
On arm and arm64, where I am about to introduce asm/paravirt.h and
stolen time support, without #include <asm/paravirt.h> in cputime.c, I
would get an error.
Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Since there will be several places checking if fwnode.type
is equal FWNODE_IRQCHIP, this patch adds a convenient function
for this purpose.
Acked-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Suravee Suthikulpanit <Suravee.Suthikulpanit@amd.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
While reviewing Michael Kerrisk's recent futex manpage update, I noticed
that we allow the FUTEX_CLOCK_REALTIME flag for FUTEX_WAIT_BITSET but
not for FUTEX_WAIT.
FUTEX_WAIT is treated as a simple version for FUTEX_WAIT_BITSET
internally (with a bitmask of FUTEX_BITSET_MATCH_ANY). As such, I cannot
come up with a reason for this exclusion for FUTEX_WAIT.
This change does modify the behavior of the futex syscall, changing a
call with FUTEX_WAIT | FUTEX_CLOCK_REALTIME from returning -ENOSYS, to be
equivalent to FUTEX_WAIT_BITSET | FUTEX_CLOCK_REALTIME with a bitset of
FUTEX_BITSET_MATCH_ANY.
Reported-by: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Darren Hart <dvhart@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Link: http://lkml.kernel.org/r/9f3bdc116d79d23f5ee72ceb9a2a857f5ff8fa29.1450474525.git.dvhart@linux.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
out_unlock: does not only drop the locks, it also drops the refcount
on the pi_state. Really intuitive.
Move the label after the put_pi_state() call and use 'break' in the
error handling path of the requeue loop.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Darren Hart <darren@dvhart.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Bhuvanesh_Surachari@mentor.com
Cc: Andy Lowe <Andy_Lowe@mentor.com>
Link: http://lkml.kernel.org/r/20151219200607.526665141@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
In the error handling cases we neither have pi_state nor a reference
to it. Remove the pointless code.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Darren Hart <darren@dvhart.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Bhuvanesh_Surachari@mentor.com
Cc: Andy Lowe <Andy_Lowe@mentor.com>
Link: http://lkml.kernel.org/r/20151219200607.432780944@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Documentation of the pi_state refcounting in the requeue code is non
existent. Add it.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Darren Hart <darren@dvhart.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Bhuvanesh_Surachari@mentor.com
Cc: Andy Lowe <Andy_Lowe@mentor.com>
Link: http://lkml.kernel.org/r/20151219200607.335938312@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
free_pi_state() is confusing as it is in fact only freeing/caching the
pi state when the last reference is gone. Rename it to put_pi_state()
which reflects better what it is doing.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Darren Hart <darren@dvhart.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Bhuvanesh_Surachari@mentor.com
Cc: Andy Lowe <Andy_Lowe@mentor.com>
Link: http://lkml.kernel.org/r/20151219200607.259636467@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
If the proxy lock in the requeue loop acquires the rtmutex for a
waiter then it acquired also refcount on the pi_state related to the
futex, but the waiter side does not drop the reference count.
Add the missing free_pi_state() call.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Darren Hart <darren@dvhart.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Bhuvanesh_Surachari@mentor.com
Cc: Andy Lowe <Andy_Lowe@mentor.com>
Link: http://lkml.kernel.org/r/20151219200607.178132067@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
The clocksource validation which makes sure that the newly read value
is not smaller than the last value only works if the clocksource mask
is 64bit, i.e. the counter is 64bit wide. But we want to use that
mechanism also for clocksources which are less than 64bit wide.
So instead of checking whether bit 63 is set, we check whether the
most significant bit of the clocksource mask is set in the delta
result. If it is set, we return 0.
[ tglx: Simplified the implementation, added a comment and massaged
the commit message ]
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Cc: <linux-arm-kernel@lists.infradead.org>
Link: http://lkml.kernel.org/r/56349607.6070708@huawei.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Pull the MSI wire bridge implementation from Marc Zyngier along with
the first user of it. This is infrastructure to support a wired
interrupt to MSI interrupt brigde. The first user is mbigen found in
Hisilicon ARM SoCs.
Get the core time(keeping) updates from John Stultz
- NTP robustness tweaks
- Another signed overflow nailed down
- More y2038 changes
- Stop alarmtimer after resume
- MAINTAINERS update
- Selftest fixes
Currently, panic() and crash_kexec() can be called at the same time.
For example (x86 case):
CPU 0:
oops_end()
crash_kexec()
mutex_trylock() // acquired
nmi_shootdown_cpus() // stop other CPUs
CPU 1:
panic()
crash_kexec()
mutex_trylock() // failed to acquire
smp_send_stop() // stop other CPUs
infinite loop
If CPU 1 calls smp_send_stop() before nmi_shootdown_cpus(), kdump
fails.
In another case:
CPU 0:
oops_end()
crash_kexec()
mutex_trylock() // acquired
<NMI>
io_check_error()
panic()
crash_kexec()
mutex_trylock() // failed to acquire
infinite loop
Clearly, this is an undesirable result.
To fix this problem, this patch changes crash_kexec() to exclude others
by using the panic_cpu atomic.
Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: kexec@lists.infradead.org
Cc: linux-doc@vger.kernel.org
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Minfei Huang <mnfhuang@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: x86-ml <x86@kernel.org>
Link: http://lkml.kernel.org/r/20151210014630.25437.94161.stgit@softrs
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Currently, kdump_nmi_shootdown_cpus(), a subroutine of crash_kexec(),
sends an NMI IPI to CPUs which haven't called panic() to stop them,
save their register information and do some cleanups for crash dumping.
However, if such a CPU is infinitely looping in NMI context, we fail to
save its register information into the crash dump.
For example, this can happen when unknown NMIs are broadcast to all
CPUs as follows:
CPU 0 CPU 1
=========================== ==========================
receive an unknown NMI
unknown_nmi_error()
panic() receive an unknown NMI
spin_trylock(&panic_lock) unknown_nmi_error()
crash_kexec() panic()
spin_trylock(&panic_lock)
panic_smp_self_stop()
infinite loop
kdump_nmi_shootdown_cpus()
issue NMI IPI -----------> blocked until IRET
infinite loop...
Here, since CPU 1 is in NMI context, the second NMI from CPU 0 is
blocked until CPU 1 executes IRET. However, CPU 1 never executes IRET,
so the NMI is not handled and the callback function to save registers is
never called.
In practice, this can happen on some servers which broadcast NMIs to all
CPUs when the NMI button is pushed.
To save registers in this case, we need to:
a) Return from NMI handler instead of looping infinitely
or
b) Call the callback function directly from the infinite loop
Inherently, a) is risky because NMI is also used to prevent corrupted
data from being propagated to devices. So, we chose b).
This patch does the following:
1. Move the infinite looping of CPUs which haven't called panic() in NMI
context (actually done by panic_smp_self_stop()) outside of panic() to
enable us to refer pt_regs. Please note that panic_smp_self_stop() is
still used for normal context.
2. Call a callback of kdump_nmi_shootdown_cpus() directly to save
registers and do some cleanups after setting waiting_for_crash_ipi which
is used for counting down the number of CPUs which handled the callback
Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Aaron Tomlin <atomlin@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: David Hildenbrand <dahi@linux.vnet.ibm.com>
Cc: Don Zickus <dzickus@redhat.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Gobinda Charan Maji <gobinda.cemk07@gmail.com>
Cc: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
Cc: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Javi Merino <javi.merino@arm.com>
Cc: Jiang Liu <jiang.liu@linux.intel.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: kexec@lists.infradead.org
Cc: linux-doc@vger.kernel.org
Cc: lkml <linux-kernel@vger.kernel.org>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Nicolas Iooss <nicolas.iooss_linux@m4x.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Stefan Lippers-Hollmann <s.l-h@gmx.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ulrich Obergfell <uobergfe@redhat.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Link: http://lkml.kernel.org/r/20151210014628.25437.75256.stgit@softrs
[ Cleanup comments, fixup formatting. ]
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
If panic on NMI happens just after panic() on the same CPU, panic() is
recursively called. Kernel stalls, as a result, after failing to acquire
panic_lock.
To avoid this problem, don't call panic() in NMI context if we've
already entered panic().
For that, introduce nmi_panic() macro to reduce code duplication. In
the case of panic on NMI, don't return from NMI handlers if another CPU
already panicked.
Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Aaron Tomlin <atomlin@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
Cc: David Hildenbrand <dahi@linux.vnet.ibm.com>
Cc: Don Zickus <dzickus@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Gobinda Charan Maji <gobinda.cemk07@gmail.com>
Cc: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Javi Merino <javi.merino@arm.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: kexec@lists.infradead.org
Cc: linux-doc@vger.kernel.org
Cc: lkml <linux-kernel@vger.kernel.org>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Nicolas Iooss <nicolas.iooss_linux@m4x.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ulrich Obergfell <uobergfe@redhat.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Link: http://lkml.kernel.org/r/20151210014626.25437.13302.stgit@softrs
[ Cleanup comments, fixup formatting. ]
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Back in the days where eBPF (or back then "internal BPF" ;->) was not
exposed to user space, and only the classic BPF programs internally
translated into eBPF programs, we missed the fact that for classic BPF
A and X needed to be cleared. It was fixed back then via 83d5b7ef99
("net: filter: initialize A and X registers"), and thus classic BPF
specifics were added to the eBPF interpreter core to work around it.
This added some confusion for JIT developers later on that take the
eBPF interpreter code as an example for deriving their JIT. F.e. in
f75298f5c3 ("s390/bpf: clear correct BPF accumulator register"), at
least X could leak stack memory. Furthermore, since this is only needed
for classic BPF translations and not for eBPF (verifier takes care
that read access to regs cannot be done uninitialized), more complexity
is added to JITs as they need to determine whether they deal with
migrations or native eBPF where they can just omit clearing A/X in
their prologue and thus reduce image size a bit, see f.e. cde66c2d88
("s390/bpf: Only clear A and X for converted BPF programs"). In other
cases (x86, arm64), A and X is being cleared in the prologue also for
eBPF case, which is unnecessary.
Lets move this into the BPF migration in bpf_convert_filter() where it
actually belongs as long as the number of eBPF JITs are still few. It
can thus be done generically; allowing us to remove the quirk from
__bpf_prog_run() and to slightly reduce JIT image size in case of eBPF,
while reducing code duplication on this matter in current(/future) eBPF
JITs.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Tested-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Cc: Zi Shen Lim <zlim.lnx@gmail.com>
Cc: Yang Shi <yang.shi@linaro.org>
Acked-by: Yang Shi <yang.shi@linaro.org>
Acked-by: Zi Shen Lim <zlim.lnx@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/geneve.c
Here we had an overlapping change, where in 'net' the extraneous stats
bump was being removed whilst in 'net-next' the final argument to
udp_tunnel6_xmit_skb() was being changed.
Signed-off-by: David S. Miller <davem@davemloft.net>
The Cavium guys reported a soft lockup on their arm64 machine, caused by
commit c55a6ffa62 ("locking/osq: Relax atomic semantics"):
mutex_optimistic_spin+0x9c/0x1d0
__mutex_lock_slowpath+0x44/0x158
mutex_lock+0x54/0x58
kernfs_iop_permission+0x38/0x70
__inode_permission+0x88/0xd8
inode_permission+0x30/0x6c
link_path_walk+0x68/0x4d4
path_openat+0xb4/0x2bc
do_filp_open+0x74/0xd0
do_sys_open+0x14c/0x228
SyS_openat+0x3c/0x48
el0_svc_naked+0x24/0x28
This is because in osq_lock we initialise the node for the current CPU:
node->locked = 0;
node->next = NULL;
node->cpu = curr;
and then publish the current CPU in the lock tail:
old = atomic_xchg_acquire(&lock->tail, curr);
Once the update to lock->tail is visible to another CPU, the node is
then live and can be both read and updated by concurrent lockers.
Unfortunately, the ACQUIRE semantics of the xchg operation mean that
there is no guarantee the contents of the node will be visible before
lock tail is updated. This can lead to lock corruption when, for
example, a concurrent locker races to set the next field.
Fixes: c55a6ffa62 ("locking/osq: Relax atomic semantics"):
Reported-by: David Daney <ddaney@caviumnetworks.com>
Reported-by: Andrew Pinski <andrew.pinski@caviumnetworks.com>
Tested-by: Andrew Pinski <andrew.pinski@caviumnetworks.com>
Acked-by: Davidlohr Bueso <dave@stgolabs.net>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1449856001-21177-1-git-send-email-will.deacon@arm.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Thus its been occasionally noted that users have seen
confusing warnings like:
Adjusting tsc more than 11% (5941981 vs 7759439)
We try to limit the maximum total adjustment to 11% (10% tick
adjustment + 0.5% frequency adjustment). But this is done by
bounding the requested adjustment values, and the internal
steering that is done by tracking the error from what was
requested and what was applied, does not have any such limits.
This is usually not problematic, but in some cases has a risk
that an adjustment could cause the clocksource mult value to
overflow, so its an indication things are outside of what is
expected.
It ends up most of the reports of this 11% warning are on systems
using chrony, which utilizes the adjtimex() ADJ_TICK interface
(which allows a +-10% adjustment). The original rational for
ADJ_TICK unclear to me but my assumption it was originally added
to allow broken systems to get a big constant correction at boot
(see adjtimex userspace package for an example) which would allow
the system to work w/ ntpd's 0.5% adjustment limit.
Chrony uses ADJ_TICK to make very aggressive short term corrections
(usually right at startup). Which push us close enough to the max
bound that a few late ticks can cause the internal steering to push
past the max adjust value (tripping the warning).
Thus this patch adds some extra logic to enforce the max adjustment
cap in the internal steering.
Note: This has the potential to slow corrections when the ADJ_TICK
value is furthest away from the default value. So it would be good to
get some testing from folks using chrony, to make sure we don't
cause any troubles there.
Cc: Miroslav Lichvar <mlichvar@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Tested-by: Miroslav Lichvar <mlichvar@redhat.com>
Reported-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
The function "second_overflow" uses "unsign long"
as its input parameter type which will overflow after
year 2106 on 32bit systems.
Thus this patch replaces it with time64_t type.
While the 64-bit division is expensive, "next_ntp_leap_sec"
has been calculated already, so we can just re-use it in the
TIME_INS/DEL cases, allowing one expensive division per
leapsecond instead of re-doing the divsion once a second after
the leap flag has been set.
Signed-off-by: DengChao <chao.deng@linaro.org>
[jstultz: Tweaked commit message]
Signed-off-by: John Stultz <john.stultz@linaro.org>
The type of static variant "time_reftime" and the call of
get_seconds in ntp are both not y2038 safe.
So change the type of time_reftime to time64_t and replace
get_seconds with __ktime_get_real_seconds.
The local variant "secs" in ntp_update_offset represents
seconds between now and last ntp adjustment, it seems impossible
that this time will last more than 68 years, so keep its type as
"long".
Reviewed-by: John Stultz <john.stultz@linaro.org>
Signed-off-by: DengChao <chao.deng@linaro.org>
[jstultz: Tweaked commit message]
Signed-off-by: John Stultz <john.stultz@linaro.org>
In order to fix Y2038 issues in the ntp code we will need replace
get_seconds() with ktime_get_real_seconds() but as the ntp code uses
the timekeeping lock which is also used by ktime_get_real_seconds(),
we need a version without locking.
Add a new function __ktime_get_real_seconds() in timekeeping to
do this.
Reviewed-by: John Stultz <john.stultz@linaro.org>
Signed-off-by: DengChao <chao.deng@linaro.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
To be able to allocate interrupts from the MSI layer down,
add a new msi_domain_populate_irqs entry point.
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
The .prepare callbacks are so far only called from msi_domain_alloc_irqs.
In order to reuse that code, split that code and create a
msi_domain_prepare_irqs function that the existing code can call into.
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
We are soon going to need the MSI layer to call into the domain
allocators. Instead of open coding this, make the standard
irq_domain_alloc_irqs_recursive function available to the MSI
layer.
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
bd1060a1d6 ("sock, cgroup: add sock->sk_cgroup") added global
spinlock cgroup_sk_update_lock but erroneously skipped initializer
leading to uninitialized spinlock warning. Fix it by using
DEFINE_SPINLOCK().
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Dexuan Cui <decui@microsoft.com>
Fixes: bd1060a1d6 ("sock, cgroup: add sock->sk_cgroup")
Signed-off-by: David S. Miller <davem@davemloft.net>
The new VMD device driver needs to iterate over a list of
"demultiplexing" interrupts. Protecting that list with a lock is not
possible because the list is also required in code pathes which hold
irq descriptor lock. Therefor the demultiplexing interrupt handler
would create a lock inversion scenario if it calls a demux handler
with the list protection lock held.
A solution for this is to free the irq descriptor via RCU, so the
list can be walked with rcu read lock held.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Keith Busch <keith.busch@intel.com>
If a interrupt chip utilizes chip->buslock then free_irq() can
deadlock in the following way:
CPU0 CPU1
interrupt(X) (Shared or spurious)
free_irq(X) interrupt_thread(X)
chip_bus_lock(X)
irq_finalize_oneshot(X)
chip_bus_lock(X)
synchronize_irq(X)
synchronize_irq() waits for the interrupt thread to complete,
i.e. forever.
Solution is simple: Drop chip_bus_lock() before calling
synchronize_irq() as we do with the irq_desc lock. There is nothing to
be protected after the point where irq_desc lock has been released.
This adds chip_bus_lock/unlock() to the remove_irq() code path, but
that's actually correct in the case where remove_irq() is called on
such an interrupt. The current users of remove_irq() are not affected
as none of those interrupts is on a chip which requires buslock.
Reported-by: Fredrik Markström <fredrik.markstrom@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Jan Stancek reported that I wrecked things for him by fixing things for
Vladimir :/
His report was due to an UNINTERRUPTIBLE wait getting -EINTR, which
should not be possible, however my previous patch made this possible by
unconditionally checking signal_pending().
We cannot use current->state as was done previously, because the
instruction after the store to that variable it can be changed. We must
instead pass the initial state along and use that.
Fixes: 68985633bc ("sched/wait: Fix signal handling in bit wait helpers")
Reported-by: Jan Stancek <jstancek@redhat.com>
Reported-by: Chris Mason <clm@fb.com>
Tested-by: Jan Stancek <jstancek@redhat.com>
Tested-by: Vladimir Murzin <vladimir.murzin@arm.com>
Tested-by: Chris Mason <clm@fb.com>
Reviewed-by: Paul Turner <pjt@google.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: tglx@linutronix.de
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: hpa@zytor.com
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add support for renaming and hard links to the fs. Most of this can be
implemented by using simple library operations under the same constraints
that we don't use a reserved name like elsewhere. Linking can be useful
to share/manage things like maps across subsystem users. It works within
the file system boundary, but is not allowed for directories.
Symbolic links are explicitly not implemented here, as it can be better
done already by doing bind mounts inside bpf fs to set up shared directories
f.e. useful when using volumes in docker containers that map a private
working directory into /sys/fs/bpf/ which contains itself a bind mounted
path from the host's /sys/fs/bpf/ mount that is shared among multiple
containers. For single maps instead of whole directory, hard links can
be easily used to do the same.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently the full stop_machine() routine is only enabled on SMP if
module unloading is enabled, or if the CPUs are hotpluggable. This
leads to configurations where stop_machine() is broken as it will then
only run the callback on the local CPU with irqs disabled, and not stop
the other CPUs or run the callback on them.
For example, this breaks MTRR setup on x86 in certain configs since
ea8596bb2d ("kprobes/x86: Remove unused text_poke_smp() and
text_poke_smp_batch() functions") as the MTRR is only established on the
boot CPU.
This patch removes the Kconfig option for STOP_MACHINE and uses the SMP
and HOTPLUG_CPU config options to compile the correct stop_machine() for
the architecture, removing the false dependency on MODULE_UNLOAD in the
process.
Link: https://lkml.org/lkml/2014/10/8/124
References: https://bugs.freedesktop.org/show_bug.cgi?id=84794
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Acked-by: Ingo Molnar <mingo@kernel.org>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Pranith Kumar <bobby.prani@gmail.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: H. Peter Anvin <hpa@linux.intel.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Iulia Manda <iulia.manda21@gmail.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Chuck Ebbert <cebbert.lkml@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
For adjtimex()'s ADJ_SETOFFSET, make sure the tv_usec value is
sane. We might multiply them later which can cause an overflow
and undefined behavior.
This patch introduces new helper functions to simplify the
checking code and adds comments to clarify
Orginally this patch was by Sasha Levin, but I've basically
rewritten it, so he should get credit for finding the issue
and I should get the blame for any mistakes made since.
Also, credit to Richard Cochran for the phrasing used in the
comment for what is considered valid here.
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
We need to make sure that the offset is valid before manipulating it,
otherwise it might overflow on the multiplication.
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
[jstultz: Reworked one of the checks so it makes more sense]
Signed-off-by: John Stultz <john.stultz@linaro.org>
In cgroup v1, dealing with cgroup membership was difficult because the
number of membership associations was unbound. As a result, cgroup v1
grew several controllers whose primary purpose is either tagging
membership or pull in configuration knobs from other subsystems so
that cgroup membership test can be avoided.
net_cls and net_prio controllers are examples of the latter. They
allow configuring network-specific attributes from cgroup side so that
network subsystem can avoid testing cgroup membership; unfortunately,
these are not only cumbersome but also problematic.
Both net_cls and net_prio aren't properly hierarchical. Both inherit
configuration from the parent on creation but there's no interaction
afterwards. An ancestor doesn't restrict the behavior in its subtree
in anyway and configuration changes aren't propagated downwards.
Especially when combined with cgroup delegation, this is problematic
because delegatees can mess up whatever network configuration
implemented at the system level. net_prio would allow the delegatees
to set whatever priority value regardless of CAP_NET_ADMIN and net_cls
the same for classid.
While it is possible to solve these issues from controller side by
implementing hierarchical allowable ranges in both controllers, it
would involve quite a bit of complexity in the controllers and further
obfuscate network configuration as it becomes even more difficult to
tell what's actually being configured looking from the network side.
While not much can be done for v1 at this point, as membership
handling is sane on cgroup v2, it'd be better to make cgroup matching
behave like other network matches and classifiers than introducing
further complications.
In preparation, this patch updates sock->sk_cgrp_data handling so that
it points to the v2 cgroup that sock was created in until either
net_prio or net_cls is used. Once either of the two is used,
sock->sk_cgrp_data reverts to its previous role of carrying prioidx
and classid. This is to avoid adding yet another cgroup related field
to struct sock.
As the mode switching can happen at most once per boot, the switching
mechanism is aimed at lowering hot path overhead. It may leak a
finite, likely small, number of cgroup refs and report spurious
prioidx or classid on switching; however, dynamic updates of prioidx
and classid have always been racy and lossy - socks between creation
and fd installation are never updated, config changes don't update
existing sockets at all, and prioidx may index with dead and recycled
cgroup IDs. Non-critical inaccuracies from small race windows won't
make any noticeable difference.
This patch doesn't make use of the pointer yet. The following patch
will implement netfilter match for cgroup2 membership.
v2: Use sock_cgroup_data to avoid inflating struct sock w/ another
cgroup specific field.
v3: Add comments explaining why sock_data_prioidx() and
sock_data_classid() use different fallback values.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Daniel Wagner <daniel.wagner@bmw-carit.de>
CC: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull cgroup fixes from Tejun Heo:
"More change than I'd have liked at this stage. The pids controller
and the changes made to cgroup core to support it introduced and
revealed several important issues.
- Assigning membership to a newly created task and migrating it can
race leading to incorrect accounting. Oleg fixed it by widening
threadgroup synchronization. It looks like we'll be able to merge
it with a different percpu rwsem which is used in fork path making
things simpler and cheaper.
- The recent change to extend cgroup membership to zombies (so that
pid accounting can extend till the pid is actually released) missed
pinning the underlying data structures leading to use-after-free.
Fixed.
- v2 hierarchy was calling subsystem callbacks with the wrong target
cgroup_subsys_state based on the incorrect assumption that they
share the same target. pids is the first controller affected by
this. Subsys callbacks updated so that they can deal with
multi-target migrations"
* 'for-4.4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup_pids: don't account for the root cgroup
cgroup: fix handling of multi-destination migration from subtree_control enabling
cgroup_freezer: simplify propagation of CGROUP_FROZEN clearing in freezer_attach()
cgroup: pids: kill pids_fork(), simplify pids_can_fork() and pids_cancel_fork()
cgroup: pids: fix race between cgroup_post_fork() and cgroup_migrate()
cgroup: make css_set pin its css's to avoid use-afer-free
cgroup: fix cftype->file_offset handling
Pull perf fixes from Ingo Molnar:
"This tree includes four core perf fixes for misc bugs, three fixes to
x86 PMU drivers, and two updates to old email addresses"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf: Do not send exit event twice
perf/x86/intel: Fix INTEL_FLAGS_UEVENT_CONSTRAINT_DATALA_NA macro
perf/x86/intel: Make L1D_PEND_MISS.FB_FULL not constrained on Haswell
perf: Fix PERF_EVENT_IOC_PERIOD deadlock
treewide: Remove old email address
perf/x86: Fix LBR call stack save/restore
perf: Update email address in MAINTAINERS
perf/core: Robustify the perf_cgroup_from_task() RCU checks
perf/core: Fix RCU problem with cgroup context switching code
Workqueue stalls can happen from a variety of usage bugs such as
missing WQ_MEM_RECLAIM flag or concurrency managed work item
indefinitely staying RUNNING. These stalls can be extremely difficult
to hunt down because the usual warning mechanisms can't detect
workqueue stalls and the internal state is pretty opaque.
To alleviate the situation, this patch implements workqueue lockup
detector. It periodically monitors all worker_pools periodically and,
if any pool failed to make forward progress longer than the threshold
duration, triggers warning and dumps workqueue state as follows.
BUG: workqueue lockup - pool cpus=0 node=0 flags=0x0 nice=0 stuck for 31s!
Showing busy workqueues and worker pools:
workqueue events: flags=0x0
pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=17/256
pending: monkey_wrench_fn, e1000_watchdog, cache_reap, vmstat_shepherd, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, cgroup_release_agent
workqueue events_power_efficient: flags=0x80
pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=2/256
pending: check_lifetime, neigh_periodic_work
workqueue cgroup_pidlist_destroy: flags=0x0
pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/1
pending: cgroup_pidlist_destroy_work_fn
...
The detection mechanism is controller through kernel parameter
workqueue.watchdog_thresh and can be updated at runtime through the
sysfs module parameter file.
v2: Decoupled from softlockup control knobs.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Don Zickus <dzickus@redhat.com>
Cc: Ulrich Obergfell <uobergfe@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Chris Mason <clm@fb.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
touch_softlockup_watchdog() is used to tell watchdog that scheduler
stall is expected. One group of usage is from paths where the task
may not be able to yield for a long time such as performing slow PIO
to finicky device and coming out of suspend. The other is to account
for scheduler and timer going idle.
For scheduler softlockup detection, there's no reason to distinguish
the two cases; however, workqueue lockup detector is planned and it
can use the same signals from the former group while the latter would
spuriously prevent detection. This patch introduces a new function
touch_softlockup_watchdog_sched() and convert the latter group to call
it instead. For now, it just calls touch_softlockup_watchdog() and
there's no functional difference.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ulrich Obergfell <uobergfe@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Task or work item involved in memory reclaim trying to flush a
non-WQ_MEM_RECLAIM workqueue or one of its work items can lead to
deadlock. Trigger WARN_ONCE() if such conditions are detected.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Certain interrupt controller drivers have a register set that does not
make it easy to save/restore the mask of enabled/disabled interrupts
at suspend/resume time. At resume time, such drivers rely on the core
kernel irq subsystem to tell whether such or such interrupt is enabled
or not, in order to restore the proper state in the interrupt
controller register.
While the irqd_irq_disabled() provides the relevant information for
global interrupts, there is no similar function to query the
enabled/disabled state of a per-CPU interrupt.
Therefore, this commit complements the percpu_irq API with an
irq_percpu_is_enabled() function.
[ tglx: Simplified the implementation and added kerneldoc ]
Signed-off-by: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
Cc: linux-arm-kernel@lists.infradead.org
Cc: Tawfik Bayouk <tawfik@marvell.com>
Cc: Nadav Haklai <nadavh@marvell.com>
Cc: Lior Amsalem <alior@marvell.com>
Cc: Andrew Lunn <andrew@lunn.ch>
Cc: Sebastian Hesselbarth <sebastian.hesselbarth@gmail.com>
Cc: Gregory Clement <gregory.clement@free-electrons.com>
Cc: Jason Cooper <jason@lakedaemon.net>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Link: http://lkml.kernel.org/r/1445347435-2333-2-git-send-email-thomas.petazzoni@free-electrons.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJWZMgaAAoJEHm+PkMAQRiGGcIH+gNS/hbN2DKW7wphl1QuaV7C
1fror8AvpwbGa/o0yuxovaVuZzAR0TF31vn+gAemF4U/hnM25xqxEHXYZEVv8OWw
mbz4/z+jbVk3SiS5AiZPIZgL4W6RZnG5QYfiTVGPlBHuBznW2ITlNlClBOmBL45o
uhb3bjTzi70AZ7Gh6i9sHgJoHg6D9u/ZxLaLcWnM79BzyTMHTf2t0wnrQmh66lEE
hp7Rn9wXv9bk/e3iH7CVUb97P4IWhhkmfqcoturqAg9+C/M26b0VmvQp9Sy8S6Pd
FVQ+SUIZllj5ZDKe9mOcs37czlxTr0keEFqzWeMh/7y4iuI3RaRp/qb+7mX5sIE=
=WGZ1
-----END PGP SIGNATURE-----
Back merge tag 'v4.4-rc4' into drm-next
We've picked up a few conflicts and it would be nice
to resolve them before we move onwards.
The return value from rcu_gp_init() is always used as a bool, so
this commit makes it be a bool.
Reported-by: Iftekhar Ahmed <ahmedi@oregonstate.edu>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This patch removes a potential deadlock hazard by moving the
wake_up_process() in rcu_spawn_gp_kthread() out from under rnp->lock.
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This commit replaces a local_irq_save()/local_irq_restore() pair with
a lockdep assertion that interrupts are already disabled. This should
remove the corresponding overhead from the interrupt entry/exit fastpaths.
This change was inspired by the fact that Iftekhar Ahmed's mutation
testing showed that removing rcu_irq_enter()'s call to local_ird_restore()
had no effect, which might indicate that interrupts were always enabled
anyway.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The cpu_needs_another_gp() function is currently of type int, but only
returns zero or one. Bow to reality and make it be of type bool.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Now that the rcu_state structure's ->rda field is compile-time initialized,
there is no need to pass the per-CPU rcu_data structure into rcu_init_one().
This commit therefore eliminates this now-unused parameter.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The rcu_expedited, rcu_normal, and rcu_normal_after_boot kernel boot
parameters are pointless in the case of TINY_RCU because in that case
synchronous grace periods, both expedited and normal, are no-ops.
However, these three symbols contribute several hundred bytes of bloat.
This commit therefore uses CPP directives to avoid compiling this code
in TINY_RCU kernels.
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
The clocksource watchdog reporting was improved by 0b046b217a.
I want to add the info of CPU where the watchdog detects a
deviation because it is necessary to identify the trouble spot
if the clocksource is TSC.
Signed-off-by: Seiichi Ikarashi <s.ikarashi@jp.fujitsu.com>
[jstultz: Tweaked commit message]
Signed-off-by: John Stultz <john.stultz@linaro.org>
1e75fa8 "time: Condense timekeeper.xtime into xtime_sec" replaced a call to
clocksource_cyc2ns() from timekeeping_get_ns() with an open-coded version
of the same logic to avoid keeping a semi-redundant struct timespec
in struct timekeeper.
However, the commit also introduced a subtle semantic change - where
clocksource_cyc2ns() uses purely unsigned math, the new version introduces
a signed temporary, meaning that if (delta * tk->mult) has a 63-bit
overflow the following shift will still give a negative result. The
choice of 'maxsec' in __clocksource_updatefreq_scale() means this will
generally happen if there's a ~10 minute pause in examining the
clocksource.
This can be triggered on a powerpc KVM guest by stopping it from qemu for
a bit over 10 minutes. After resuming time has jumped backwards several
minutes causing numerous problems (jiffies does not advance, msleep()s can
be extended by minutes..). It doesn't happen on x86 KVM guests, because
the guest TSC is effectively frozen while the guest is stopped, which is
not the case for the powerpc timebase.
Obviously an unsigned (64 bit) overflow will only take twice as long as a
signed, 63-bit overflow. I don't know the time code well enough to know
if that will still cause incorrect calculations, or if a 64-bit overflow
is avoided elsewhere.
Still, an incorrect forwards clock adjustment will cause less trouble than
time going backwards. So, this patch removes the potential for
intermediate signed overflow.
Cc: stable@vger.kernel.org (3.7+)
Suggested-by: Laurent Vivier <lvivier@redhat.com>
Tested-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: John Stultz <john.stultz@linaro.org>
The following commit which went into mainline through networking tree
3b13758f51 ("cgroups: Allow dynamically changing net_classid")
conflicts in net/core/netclassid_cgroup.c with the following pending
fix in cgroup/for-4.4-fixes.
1f7dd3e5a6 ("cgroup: fix handling of multi-destination migration from subtree_control enabling")
The former separates out update_classid() from cgrp_attach() and
updates it to walk all fds of all tasks in the target css so that it
can be used from both migration and config change paths. The latter
drops @css from cgrp_attach().
Resolve the conflict by making cgrp_attach() call update_classid()
with the css from the first task. We can revive @tset walking in
cgrp_attach() but given that net_cls is v1 only where there always is
only one target css during migration, this is fine.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Nina Schiff <ninasc@fb.com>
Pull scheduler fixes from Thomas Gleixner:
"This updates contains the following changes:
- Fix a signal handling regression in the bit wait functions.
- Avoid false positive warnings in the wakeup path.
- Initialize the scheduler root domain properly.
- Handle gtime calculations in proc/$PID/stat proper.
- Add more documentation for the barriers in try_to_wake_up().
- Fix a subtle race in try_to_wake_up() which might cause a task to
be scheduled on two cpus
- Compile static helper function only when it is used"
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/core: Fix an SMP ordering race in try_to_wake_up() vs. schedule()
sched/core: Better document the try_to_wake_up() barriers
sched/cputime: Fix invalid gtime in proc
sched/core: Clear the root_domain cpumasks in init_rootdomain()
sched/core: Remove false-positive warning from wake_up_process()
sched/wait: Fix signal handling in bit wait helpers
sched/rt: Hide the push_irq_work_func() declaration
Various functions implement the same pattern to send IPIs to an
event's CPU. Collapse the easy ones in a common helper function to
reduce duplication.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
In case we monitor events system wide, we get EXIT event
(when configured) twice for each task that exited.
Note doubled lines with same pid/tid in following example:
$ sudo ./perf record -a
^C[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.480 MB perf.data (2518 samples) ]
$ sudo ./perf report -D | grep EXIT
0 60290687567581 0x59910 [0x38]: PERF_RECORD_EXIT(1250:1250):(1250:1250)
0 60290687568354 0x59948 [0x38]: PERF_RECORD_EXIT(1250:1250):(1250:1250)
0 60290687988744 0x59ad8 [0x38]: PERF_RECORD_EXIT(1250:1250):(1250:1250)
0 60290687989198 0x59b10 [0x38]: PERF_RECORD_EXIT(1250:1250):(1250:1250)
1 60290692567895 0x62af0 [0x38]: PERF_RECORD_EXIT(1253:1253):(1253:1253)
1 60290692568322 0x62b28 [0x38]: PERF_RECORD_EXIT(1253:1253):(1253:1253)
2 60290692739276 0x69a18 [0x38]: PERF_RECORD_EXIT(1252:1252):(1252:1252)
2 60290692739910 0x69a50 [0x38]: PERF_RECORD_EXIT(1252:1252):(1252:1252)
The reason is that the cpu contexts are processes each time
we call perf_event_task. I'm changing the perf_event_aux logic
to serve task_ctx and cpu contexts separately, which ensure we
don't get EXIT event generated twice on same cpu context.
This does not affect other auxiliary events, as they don't
use task_ctx at all.
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Link: http://lkml.kernel.org/r/1446649205-5822-1-git-send-email-jolsa@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Currently, ->gp_state is printed as an integer, which slows debugging.
This commit therefore prints a symbolic name in addition to the integer.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
[ paulmck: Updated to fix relational operator called out by Dan Carpenter. ]
[ paulmck: More "const", as suggested by Josh Triplett. ]
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Currently, rcu_torture_writer_state is printed as an integer, which slows
debugging. This commit therefore prints a symbolic name in addition to
the integer.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
[ paulmck: More "const", as suggested by Josh Triplett. ]
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
This commit increases debug information in the case where the grace-period
kthread is being prevented from running by dumping that kthread's stack.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
[ paulmck: Split into prior commit and this commit, as suggested by
Josh Triplett. ]
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Currently, if the RCU grace-period kthread has not yet been created,
in which case the starvation-check code will print zero for the state,
which maps to TASK_RUNNING. This could clearly be quite confusing, so
this commit prints ~0, which does not map to any legal ->state value.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Calling set_memory_rw() and set_memory_ro() for every iteration of the
loop in klp_write_object_relocations() is messy, inefficient, and
error-prone.
Change all the read-only pages to read-write before the loop and convert
them back to read-only again afterwards.
Suggested-by: Miroslav Benes <mbenes@suse.cz>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
As agreed with Rusty, we're taking a current module-next pile through
livepatching.git, as it contains solely patches that are pre-requisity
for module page protection cleanups in livepatching. Rusty will be
restarting module-next from scratch.
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Currently, percpu symbols from .data..percpu ELF section of a module are
not copied over and stored in final symtab array of struct module.
Consequently such symbol cannot be returned via kallsyms API (for
example kallsyms_lookup_name). This can be especially confusing when the
percpu symbol is exported. Only its __ksymtab et al. are present in its
symtab.
The culprit is in layout_and_allocate() function where SHF_ALLOC flag is
dropped for .data..percpu section. There is in fact no need to copy the
section to final struct module, because kernel module loader allocates
extra percpu section by itself. Unfortunately only symbols from
SHF_ALLOC sections are copied due to a check in is_core_symbol().
The patch changes is_core_symbol() function to copy over also percpu
symbols (their st_shndx points to .data..percpu ELF section). We do it
only if CONFIG_KALLSYMS_ALL is set to be consistent with the rest of the
function (ELF section is SHF_ALLOC but !SHF_EXECINSTR). Finally
elf_type() returns type 'a' for a percpu symbol because its address is
absolute.
Signed-off-by: Miroslav Benes <mbenes@suse.cz>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Modules have three sections: text, rodata and writable data. The code
handled the case where these overlapped, however they never can:
debug_align() ensures they are always page-aligned.
This is why we got away with manually traversing the pages in
set_all_modules_text_rw() without rounding.
We create three helper functions: frob_text(), frob_rodata() and
frob_writable_data(). We then call these explicitly at every point,
so it's clear what we're doing.
We also expose module_enable_ro() and module_disable_ro() for
livepatch to use.
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Makes it easier to handle init vs core cleanly, though the change is
fairly invasive across random architectures.
It simplifies the rbtree code immediately, however, while keeping the
core data together in the same cachline (now iff the rbtree code is
enabled).
Acked-by: Peter Zijlstra <peterz@infradead.org>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
An exact mapping would be within_module_core(), but at this stage
(MODULE_STATE_GOING) the init section is empty, and this is clearer.
Reviewed-by: Peter Oberparleiter <oberpar@linux.vnet.ibm.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
When setting a module's RO and NX permissions, set_section_ro_nx() is
used, but when clearing them, unset_module_{init,core}_ro_nx() are used.
The unset functions don't have the same checks the set function has for
partial page protections. It's probably harmless, but it's still
confusingly asymmetrical.
Instead, use the same logic to do both. Also add some new
set_module_{init,core}_ro_nx() helper functions for more symmetry with
the unset functions.
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
We need the scheduler's fastpaths to be, well, fast, and unnecessarily
disabling and re-enabling interrupts is not necessarily consistent with
this goal. Especially given that there are regions of the scheduler that
already have interrupts disabled.
This commit therefore moves the call to rcu_note_context_switch()
to one of the interrupts-disabled regions of the scheduler, and
removes the now-redundant disabling and re-enabling of interrupts from
rcu_note_context_switch() and the functions it calls.
Reported-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
[ paulmck: Shift rcu_note_context_switch() to avoid deadlock, as suggested
by Peter Zijlstra. ]
Currently, rcu_prepare_for_idle() checks for tick_nohz_active, even on
individual NOCBs CPUs, unless all CPUs are marked as NOCBs CPUs at build
time. This check is pointless on NOCBs CPUs because they never have any
callbacks posted, given that all of their callbacks are handed off to the
corresponding rcuo kthread. There is a check for individually designated
NOCBs CPUs, but it pointelessly follows the check for tick_nohz_active.
This commit therefore moves the check for individually designated NOCBs
CPUs up with the check for CONFIG_RCU_NOCB_CPU_ALL.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This function no longer has #ifdefs, so this commit removes the
header comment calling them out.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Several releases have come and gone without the warning triggering,
so remove the lock-acquisition loop. Retain the WARN_ON_ONCE()
out of sheer paranoia.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This commit applies an early-exit approach to rcu_sched_qs(), reducing
the nesting level and saving a line of code.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The Kconfig currently controlling compilation of this code is:
init/Kconfig:config TREE_RCU_TRACE
init/Kconfig: def_bool RCU_TRACE && ( TREE_RCU || PREEMPT_RCU )
...meaning that it currently is not being built as a module by anyone.
Lets remove the modular code that is essentially orphaned, so that
when reading the file there is no doubt it is builtin-only.
Since module_init translates to device_initcall in the non-modular
case, the init ordering remains unchanged with this commit. We could
consider moving this to an earlier initcall if desired.
We don't replace module.h with init.h since the file already has that.
We also delete the moduleparam.h include that is left over from
commit 64db4cfff9 (""Tree RCU": scalable
classic RCU implementation") since it is not needed here either.
We morph some tags like MODULE_AUTHOR into the comments at the top of
the file for documentation purposes.
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Currently, the rcu_node_class[], rcu_fqs_class[], and rcu_exp_class[]
arrays needlessly pollute the global namespace within tree.c. This
commit therefore converts them to static local variables within
rcu_init_one().
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Expedited grace periods can speed up boot, but are undesirable in
aggressive real-time systems. This commit therefore introduces a
kernel parameter rcupdate.rcu_normal_after_boot that disables
expedited grace periods just before init is spawned.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Although expedited grace periods can be quite useful, and although their
OS jitter has been greatly reduced, they can still pose problems for
extreme real-time workloads. This commit therefore adds a rcu_normal
kernel boot parameter (which can also be manipulated via sysfs)
to suppress expedited grace periods, that is, to treat requests for
expedited grace periods as if they were requests for normal grace periods.
If both rcu_expedited and rcu_normal are specified, rcu_normal wins.
This means that if you are relying on expedited grace periods to speed up
boot, you will want to specify rcu_expedited on the kernel command line,
and then specify rcu_normal via sysfs once boot completes.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This commit adds print statements that check the rcu_node structure to
find which ->expmask bits and which ->exp_tasks structures are blocking
the current expedited grace period.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Currently, if a grace period ends just as the stall-warning timeout
fires, an empty stall warning will be printed. This is not helpful,
so this commit avoids these useless warnings by rechecking completion
after awakening in synchronize_sched_expedited_wait().
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Currently, the piggybacked-work checks carried out by sync_exp_work_done()
atomically increment a small set of variables (the ->expedited_workdone0,
->expedited_workdone1, ->expedited_workdone2, ->expedited_workdone3
fields in the rcu_state structure), which will form a memory-contention
bottleneck given a sufficiently large number of CPUs concurrently invoking
either synchronize_rcu_expedited() or synchronize_sched_expedited().
This commit therefore moves these for fields to the per-CPU rcu_data
structure, eliminating the memory contention. The show_rcuexp() function
also changes to sum up each field in the rcu_data structures.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This commit saves a couple lines of code and reduces indentation
by inverting the sense of an "if" statement in the function
sync_rcu_exp_select_cpus().
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The memory barrier in rcu_seq_snap() is needed only for grace periods,
so this commit moves it to the grace-period-oriented wrapper
rcu_exp_gp_seq_snap().
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Analogy with the ->qsmaskinitnext field might lead one to believe that
->expmaskinitnext tracks online CPUs. This belief is incorrect: Any CPU
that has ever been online will have its bit set in the ->expmaskinitnext
field. This commit therefore adds a comment to make this clear, at
least to people who read comments.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
If there is only one CPU, then invoking synchronize_sched_expedited()
is by definition a grace period. This commit checks for this condition
and does a short-circuit return in that case.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
In an overcommitted guest where some vCPUs have to be halted to make
forward progress in other areas, it is highly likely that a vCPU later
in the spinlock queue will be spinning while the ones earlier in the
queue would have been halted. The spinning in the later vCPUs is then
just a waste of precious CPU cycles because they are not going to
get the lock soon as the earlier ones have to be woken up and take
their turn to get the lock.
This patch implements an adaptive spinning mechanism where the vCPU
will call pv_wait() if the previous vCPU is not running.
Linux kernel builds were run in KVM guest on an 8-socket, 4
cores/socket Westmere-EX system and a 4-socket, 8 cores/socket
Haswell-EX system. Both systems are configured to have 32 physical
CPUs. The kernel build times before and after the patch were:
Westmere Haswell
Patch 32 vCPUs 48 vCPUs 32 vCPUs 48 vCPUs
----- -------- -------- -------- --------
Before patch 3m02.3s 5m00.2s 1m43.7s 3m03.5s
After patch 3m03.0s 4m37.5s 1m43.0s 2m47.2s
For 32 vCPUs, this patch doesn't cause any noticeable change in
performance. For 48 vCPUs (over-committed), there is about 8%
performance improvement.
Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Douglas Hatch <doug.hatch@hpe.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Scott J Norton <scott.norton@hpe.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1447114167-47185-8-git-send-email-Waiman.Long@hpe.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This patch allows one attempt for the lock waiter to steal the lock
when entering the PV slowpath. To prevent lock starvation, the pending
bit will be set by the queue head vCPU when it is in the active lock
spinning loop to disable any lock stealing attempt. This helps to
reduce the performance penalty caused by lock waiter preemption while
not having much of the downsides of a real unfair lock.
The pv_wait_head() function was renamed as pv_wait_head_or_lock()
as it was modified to acquire the lock before returning. This is
necessary because of possible lock stealing attempts from other tasks.
Linux kernel builds were run in KVM guest on an 8-socket, 4
cores/socket Westmere-EX system and a 4-socket, 8 cores/socket
Haswell-EX system. Both systems are configured to have 32 physical
CPUs. The kernel build times before and after the patch were:
Westmere Haswell
Patch 32 vCPUs 48 vCPUs 32 vCPUs 48 vCPUs
----- -------- -------- -------- --------
Before patch 3m15.6s 10m56.1s 1m44.1s 5m29.1s
After patch 3m02.3s 5m00.2s 1m43.7s 3m03.5s
For the overcommited case (48 vCPUs), this patch is able to reduce
kernel build time by more than 54% for Westmere and 44% for Haswell.
Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Douglas Hatch <doug.hatch@hpe.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Scott J Norton <scott.norton@hpe.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1447190336-53317-1-git-send-email-Waiman.Long@hpe.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This patch enables the accumulation of kicking and waiting related
PV qspinlock statistics when the new QUEUED_LOCK_STAT configuration
option is selected. It also enables the collection of data which
enable us to calculate the kicking and wakeup latencies which have
a heavy dependency on the CPUs being used.
The statistical counters are per-cpu variables to minimize the
performance overhead in their updates. These counters are exported
via the debugfs filesystem under the qlockstat directory. When the
corresponding debugfs files are read, summation and computing of the
required data are then performed.
The measured latencies for different CPUs are:
CPU Wakeup Kicking
--- ------ -------
Haswell-EX 63.6us 7.4us
Westmere-EX 67.6us 9.3us
The measured latencies varied a bit from run-to-run. The wakeup
latency is much higher than the kicking latency.
A sample of statistical counters after system bootup (with vCPU
overcommit) was:
pv_hash_hops=1.00
pv_kick_unlock=1148
pv_kick_wake=1146
pv_latency_kick=11040
pv_latency_wake=194840
pv_spurious_wakeup=7
pv_wait_again=4
pv_wait_head=23
pv_wait_node=1129
Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Douglas Hatch <doug.hatch@hpe.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Scott J Norton <scott.norton@hpe.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1447114167-47185-6-git-send-email-Waiman.Long@hpe.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Currently, the update_tg_load_avg() function attempts to update the
tg's load_avg value whenever the load changes even for root_task_group
where the load_avg value will never be used. This patch will disable
the load_avg update when the given task group is the root_task_group.
Running a Java benchmark with noautogroup and a 4.3 kernel on a
16-socket IvyBridge-EX system, the amount of CPU time (as reported by
perf) consumed by task_tick_fair() which includes update_tg_load_avg()
decreased from 0.71% to 0.22%, a more than 3X reduction. The Max-jOPs
results also increased slightly from 983015 to 986449.
Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Ben Segall <bsegall@google.com>
Cc: Douglas Hatch <doug.hatch@hpe.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Scott J Norton <scott.norton@hpe.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Yuyang Du <yuyang.du@intel.com>
Link: http://lkml.kernel.org/r/1449081710-20185-4-git-send-email-Waiman.Long@hpe.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
If a system with large number of sockets was driven to full
utilization, it was found that the clock tick handling occupied a
rather significant proportion of CPU time when fair group scheduling
and autogroup were enabled.
Running a java benchmark on a 16-socket IvyBridge-EX system, the perf
profile looked like:
10.52% 0.00% java [kernel.vmlinux] [k] smp_apic_timer_interrupt
9.66% 0.05% java [kernel.vmlinux] [k] hrtimer_interrupt
8.65% 0.03% java [kernel.vmlinux] [k] tick_sched_timer
8.56% 0.00% java [kernel.vmlinux] [k] update_process_times
8.07% 0.03% java [kernel.vmlinux] [k] scheduler_tick
6.91% 1.78% java [kernel.vmlinux] [k] task_tick_fair
5.24% 5.04% java [kernel.vmlinux] [k] update_cfs_shares
In particular, the high CPU time consumed by update_cfs_shares()
was mostly due to contention on the cacheline that contained the
task_group's load_avg statistical counter. This cacheline may also
contains variables like shares, cfs_rq & se which are accessed rather
frequently during clock tick processing.
This patch moves the load_avg variable into another cacheline
separated from the other frequently accessed variables. It also
creates a cacheline aligned kmemcache for task_group to make sure
that all the allocated task_group's are cacheline aligned.
By doing so, the perf profile became:
9.44% 0.00% java [kernel.vmlinux] [k] smp_apic_timer_interrupt
8.74% 0.01% java [kernel.vmlinux] [k] hrtimer_interrupt
7.83% 0.03% java [kernel.vmlinux] [k] tick_sched_timer
7.74% 0.00% java [kernel.vmlinux] [k] update_process_times
7.27% 0.03% java [kernel.vmlinux] [k] scheduler_tick
5.94% 1.74% java [kernel.vmlinux] [k] task_tick_fair
4.15% 3.92% java [kernel.vmlinux] [k] update_cfs_shares
The %cpu time is still pretty high, but it is better than before. The
benchmark results before and after the patch was as follows:
Before patch - Max-jOPs: 907533 Critical-jOps: 134877
After patch - Max-jOPs: 916011 Critical-jOps: 142366
Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Douglas Hatch <doug.hatch@hpe.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Scott J Norton <scott.norton@hpe.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Yuyang Du <yuyang.du@intel.com>
Link: http://lkml.kernel.org/r/1449081710-20185-3-git-send-email-Waiman.Long@hpe.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Part of the responsibility of the update_sg_lb_stats() function is to
update the idle_cpus statistical counter in struct sg_lb_stats. This
check is done by calling idle_cpu(). The idle_cpu() function, in
turn, checks a number of fields within the run queue structure such
as rq->curr and rq->nr_running.
With the current layout of the run queue structure, rq->curr and
rq->nr_running are in separate cachelines. The rq->curr variable is
checked first followed by nr_running. As nr_running is also accessed
by update_sg_lb_stats() earlier, it makes no sense to load another
cacheline when nr_running is not 0 as idle_cpu() will always return
false in this case.
This patch eliminates this redundant cacheline load by checking the
cached nr_running before calling idle_cpu().
Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Douglas Hatch <doug.hatch@hpe.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Scott J Norton <scott.norton@hpe.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1448478580-26467-2-git-send-email-Waiman.Long@hpe.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When building a kernel with a gcc 6 snapshot the compiler complains
about unused const static variables for prio_to_weight and prio_to_mult
for multiple scheduler files (all but core.c and autogroup.c)
The way the array is currently declared it will be duplicated in
every scheduler file that includes sched.h, which seems rather wasteful.
Move the array out of line into core.c. I also added a sched_ prefix
to avoid any potential name space collisions.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1448859583-3252-1-git-send-email-andi@firstfloor.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The cputime can only be updated by the current task itself, even in
vtime case. So we can safely use seqcount instead of seqlock as there
is no writer concurrency involved.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul E . McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1447948054-28668-8-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Readers need to know if vtime runs at all on some CPU somewhere, this
is a fast-path check to determine if we need to check further the need
to add up any tickless cputime delta.
This fast path check uses context tracking state because vtime is tied
to context tracking as of now. This check appears to be confusing though
so lets use a vtime function that deals with context tracking details
in vtime implementation instead.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul E . McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1447948054-28668-7-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
vtime_accounting_enabled() checks if vtime is running on the current CPU
and is as such a misnomer. Lets rename it to a function that reflect its
locality. We are going to need the current name for a function that tells
if vtime runs at all on some CPU.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul E . McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1447948054-28668-6-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When a task runs on a housekeeper (a CPU running with the periodic tick
with neighbours running tickless), it doesn't account cputime using vtime
but relies on the tick. Such a task has its vtime_snap_whence value set
to VTIME_INACTIVE.
Readers won't handle that correctly though. As long as vtime is running
on some CPU, readers incorretly assume that vtime runs on all CPUs and
always compute the tickless cputime delta, which is only junk on
housekeepers.
So lets fix this with checking that the target runs on a vtime CPU through
the appropriate state check before computing the tickless delta.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul E . McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1447948054-28668-5-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
VTIME_SLEEPING state happens either when:
1) The task is sleeping and no tickless delta is to be added on the task
cputime stats.
2) The CPU isn't running vtime at all, so the same properties of 1) applies.
Lets rename the vtime symbol to reflect both states.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul E . McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1447948054-28668-4-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
There is an extra cost in task_cputime() and task_cputime_scaled() when
nohz_full is not activated. When vtime accounting is not enabled, we
don't need to get deltas of utime and stime under vtime seqlock.
This patch removes that cost with adding a shortcut route if vtime
accounting is not enabled.
Use context_tracking_is_enabled() to check if vtime is accounting on
some cpu, in which case only we need to check the tickless cputime delta.
Signed-off-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul E . McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1447948054-28668-3-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The current code accounts for the time a task was absent from the fair
class (per ATTACH_AGE_LOAD). However it does not work correctly when a
task got migrated or moved to another cgroup while outside of the fair
class.
This patch tries to address that by aging on migration. We locklessly
read the 'last_update_time' stamp from both the old and new cfs_rq,
ages the load upto the old time, and sets it to the new time.
These timestamps should in general not be more than 1 tick apart from
one another, so there is a definite bound on things.
Signed-off-by: Byungchul Park <byungchul.park@lge.com>
[ Changelog, a few edits and !SMP build fix ]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1445616981-29904-2-git-send-email-byungchul.park@lge.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
These are some notes on the scheduler locking and how it provides
program order guarantees on SMP systems.
( This commit is in the locking tree, because the new documentation
refers to a newly introduced locking primitive. )
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Introduce smp_cond_acquire() which combines a control dependency and a
read barrier to form acquire semantics.
This primitive has two benefits:
- it documents control dependencies,
- its typically cheaper than using smp_load_acquire() in a loop.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
So we want to change a locking API, but the scheduler uses it, and a conflict
is generated by a recent scheduler fix.
Pick up the pending scheduler fixes to make life easier.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Oleg noticed that its possible to falsely observe p->on_cpu == 0 such
that we'll prematurely continue with the wakeup and effectively run p on
two CPUs at the same time.
Even though the overlap is very limited; the task is in the middle of
being scheduled out; it could still result in corruption of the
scheduler data structures.
CPU0 CPU1
set_current_state(...)
<preempt_schedule>
context_switch(X, Y)
prepare_lock_switch(Y)
Y->on_cpu = 1;
finish_lock_switch(X)
store_release(X->on_cpu, 0);
try_to_wake_up(X)
LOCK(p->pi_lock);
t = X->on_cpu; // 0
context_switch(Y, X)
prepare_lock_switch(X)
X->on_cpu = 1;
finish_lock_switch(Y)
store_release(Y->on_cpu, 0);
</preempt_schedule>
schedule();
deactivate_task(X);
X->on_rq = 0;
if (X->on_rq) // false
if (t) while (X->on_cpu)
cpu_relax();
context_switch(X, ..)
finish_lock_switch(X)
store_release(X->on_cpu, 0);
Avoid the load of X->on_cpu being hoisted over the X->on_rq load.
Reported-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Explain how the control dependency and smp_rmb() end up providing
ACQUIRE semantics and pair with smp_store_release() in
finish_lock_switch().
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
/proc/stats shows invalid gtime when the thread is running in guest.
When vtime accounting is not enabled, we cannot get a valid delta.
The delta is calculated with now - tsk->vtime_snap, but tsk->vtime_snap
is only updated when vtime accounting is runtime enabled.
This patch makes task_gtime() just return gtime without computing the
buggy non-existing tickless delta when vtime accounting is not enabled.
Use context_tracking_is_enabled() to check if vtime is accounting on
some cpu, in which case only we need to check the tickless delta. This
way we fix the gtime value regression on machines not running nohz full.
The kernel config contains CONFIG_VIRT_CPU_ACCOUNTING_GEN=y and
CONFIG_NO_HZ_FULL_ALL=n and boot without nohz_full.
I ran and stop a busy loop in VM and see the gtime in host.
Dump the 43rd field which shows the gtime in every second:
# while :; do awk '{print $3" "$43}' /proc/3955/task/4014/stat; sleep 1; done
S 4348
R 7064566
R 7064766
R 7064967
R 7065168
S 4759
S 4759
During running busy loop, it returns large value.
After applying this patch, we can see right gtime.
# while :; do awk '{print $3" "$43}' /proc/10913/task/10956/stat; sleep 1; done
S 5338
R 5365
R 5465
R 5566
R 5666
S 5726
S 5726
Signed-off-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul E . McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1447948054-28668-2-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
root_domain::rto_mask allocated through alloc_cpumask_var()
contains garbage data, this may cause problems. For instance,
When doing pull_rt_task(), it may do useless iterations if
rto_mask retains some extra garbage bits. Worse still, this
violates the isolated domain rule for clustered scheduling
using cpuset, because the tasks(with all the cpus allowed)
belongs to one root domain can be pulled away into another
root domain.
The patch cleans the garbage by using zalloc_cpumask_var()
instead of alloc_cpumask_var() for root_domain::rto_mask
allocation, thereby addressing the issues.
Do the same thing for root_domain's other cpumask memembers:
dlo_mask, span, and online.
Signed-off-by: Xunlei Pang <xlpang@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <stable@vger.kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1449057179-29321-1-git-send-email-xlpang@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Because wakeups can (fundamentally) be late, a task might not be in
the expected state. Therefore testing against a task's state is racy,
and can yield false positives.
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: oleg@redhat.com
Fixes: 9067ac85d5 ("wake_up_process() should be never used to wakeup a TASK_STOPPED/TRACED task")
Link: http://lkml.kernel.org/r/1448933660-23082-1-git-send-email-sasha.levin@oracle.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Vladimir reported getting RCU stall warnings and bisected it back to
commit:
743162013d ("sched: Remove proliferation of wait_on_bit() action functions")
That commit inadvertently reversed the calls to schedule() and signal_pending(),
thereby not handling the case where the signal receives while we sleep.
Reported-by: Vladimir Murzin <vladimir.murzin@arm.com>
Tested-by: Vladimir Murzin <vladimir.murzin@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: mark.rutland@arm.com
Cc: neilb@suse.de
Cc: oleg@redhat.com
Fixes: 743162013d ("sched: Remove proliferation of wait_on_bit() action functions")
Fixes: cbbce82209 ("SCHED: add some "wait..on_bit...timeout()" interfaces.")
Link: http://lkml.kernel.org/r/20151201130404.GL3816@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Before the system go to suspend (S3), if user create a timer
with clockid CLOCK_REALTIME_ALARM/CLOCK_BOOTTIME_ALARM and set a
"large" timeout value to this timer. The function
alarmtimer_suspend will be called to setup a timeout value to
RTC timer to avoid the system sleep over time. However, if the
system wakeup early than RTC timeout, the RTC timer will not be
cleared. And this will cause the hpet_rtc_interrupt come
unexpectedly until the RTC timeout. To fix this problem, just
adding alarmtimer_resume to cancel the RTC timer.
This was noticed because the HPET RTC emulation fires an
interrupt every 16ms(=1/2^DEFAULT_RTC_SHIFT) up to the point
where the alarm time is reached.
This program always hits this situation
(https://lkml.org/lkml/2015/11/8/326), if system wake up earlier
than alarm time.
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: John Stultz <john.stultz@linaro.org>
Signed-off-by: Zhuo-hao Lee <zhuo-hao.lee@intel.com>
[jstultz: Tweak commit subject & formatting slightly]
Signed-off-by: John Stultz <john.stultz@linaro.org>
Conflicts:
drivers/net/ethernet/renesas/ravb_main.c
kernel/bpf/syscall.c
net/ipv4/ipmr.c
All three conflicts were cases of overlapping changes.
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull networking fixes from David Miller:
"A lot of Thanksgiving turkey leftovers accumulated, here goes:
1) Fix bluetooth l2cap_chan object leak, from Johan Hedberg.
2) IDs for some new iwlwifi chips, from Oren Givon.
3) Fix rtlwifi lockups on boot, from Larry Finger.
4) Fix memory leak in fm10k, from Stephen Hemminger.
5) We have a route leak in the ipv6 tunnel infrastructure, fix from
Paolo Abeni.
6) Fix buffer pointer handling in arm64 bpf JIT,f rom Zi Shen Lim.
7) Wrong lockdep annotations in tcp md5 support, fix from Eric
Dumazet.
8) Work around some middle boxes which prevent proper handling of TCP
Fast Open, from Yuchung Cheng.
9) TCP repair can do huge kmalloc() requests, build paged SKBs
instead. From Eric Dumazet.
10) Fix msg_controllen overflow in scm_detach_fds, from Daniel
Borkmann.
11) Fix device leaks on ipmr table destruction in ipv4 and ipv6, from
Nikolay Aleksandrov.
12) Fix use after free in epoll with AF_UNIX sockets, from Rainer
Weikusat.
13) Fix double free in VRF code, from Nikolay Aleksandrov.
14) Fix skb leaks on socket receive queue in tipc, from Ying Xue.
15) Fix ifup/ifdown crach in xgene driver, from Iyappan Subramanian.
16) Fix clearing of persistent array maps in bpf, from Daniel
Borkmann.
17) In TCP, for the cross-SYN case, we don't initialize tp->copied_seq
early enough. From Eric Dumazet.
18) Fix out of bounds accesses in bpf array implementation when
updating elements, from Daniel Borkmann.
19) Fill gaps in RCU protection of np->opt in ipv6 stack, from Eric
Dumazet.
20) When dumping proxy neigh entries, we have to accomodate NULL
device pointers properly, from Konstantin Khlebnikov.
21) SCTP doesn't release all ipv6 socket resources properly, fix from
Eric Dumazet.
22) Prevent underflows of sch->q.qlen for multiqueue packet
schedulers, also from Eric Dumazet.
23) Fix MAC and unicast list handling in bnxt_en driver, from Jeffrey
Huang and Michael Chan.
24) Don't actively scan radar channels, from Antonio Quartulli"
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (110 commits)
net: phy: reset only targeted phy
bnxt_en: Setup uc_list mac filters after resetting the chip.
bnxt_en: enforce proper storing of MAC address
bnxt_en: Fixed incorrect implementation of ndo_set_mac_address
net: lpc_eth: remove irq > NR_IRQS check from probe()
net_sched: fix qdisc_tree_decrease_qlen() races
openvswitch: fix hangup on vxlan/gre/geneve device deletion
ipv4: igmp: Allow removing groups from a removed interface
ipv6: sctp: implement sctp_v6_destroy_sock()
arm64: bpf: add 'store immediate' instruction
ipv6: kill sk_dst_lock
ipv6: sctp: add rcu protection around np->opt
net/neighbour: fix crash at dumping device-agnostic proxy entries
sctp: use GFP_USER for user-controlled kmalloc
sctp: convert sack_needed and sack_generation to bits
ipv6: add complete rcu protection around np->opt
bpf: fix allocation warnings in bpf maps and integer overflow
mvebu: dts: enable IP checksum with jumbo frames for Armada 38x on Port0
net: mvneta: enable setting custom TX IP checksum limit
net: mvneta: fix error path for building skb
...
events on pids. It filters all events where only tasks with their pid in that
file exists. It also handles the sched_switch and sched_wakeup trace events
where the current task does not have its pid in the file, but the task
either being switched to or awaken does.
Unfortunately, I forgot about sched_wakeup_new and sched_waking. Both of
these tracepoints use the same class as the sched_wakeup tracepoint, and
they too should be included in what gets filtered by the set_event_pid file.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJWX7XpAAoJEKKk/i67LK/8G6gH/0W9QxCt/iS+J+x3gAPPs+/9
jtwZgAOWjq+118ZpWORtgRcoH2r/sJUNwlXqhMojAHsPlwZsr6TXkWJkgyNdZZ7B
QdUtZrr+egGYvd7TE0ONi/XrLTe9VLtBQsh5pN7l9fF9TjxYUmu5V9LplH9z0RxW
Hw8EzqGzG2iZnXYCnErtu5jRLmr18f2u9aUptPAc4bYPLVUUw9M9MqRV/ZwQxsaX
1mfIoR5SVC5IWW/R07qjULlbFpvNXkVJ56HwXMVBN44mYz3eUGYBKzjyAJ0Ugymf
CNDPzh4HgVFsEqDedr0D8T5WZNJSUErdbHVSWze+CCUNfYikvU7gNmxNJ89Q3P4=
=LsnU
-----END PGP SIGNATURE-----
Merge tag 'trace-v4.4-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fix from Steven Rostedt:
"During the merge window I added a new file that is used to filter
trace events on pids. It filters all events where only tasks with
their pid in that file exists. It also handles the sched_switch and
sched_wakeup trace events where the current task does not have its pid
in the file, but the task either being switched to or awaken does.
Unfortunately, I forgot about sched_wakeup_new and sched_waking. Both
of these tracepoints use the same class as the sched_wakeup
tracepoint, and they too should be included in what gets filtered by
the set_event_pid file"
* tag 'trace-v4.4-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Add sched_wakeup_new and sched_waking tracepoints for pid filter
The following directory structure will allow for cases when the same
function name exists in a single object.
/sys/kernel/livepatch/<patch>/<object>/<function,sympos>
The sympos number corresponds to the nth occurrence of the symbol name in
kallsyms for the patched object.
An example of patching multiple symbols can be found here:
https://github.com/dynup/kpatch/issues/493
Signed-off-by: Chris J Arges <chris.j.arges@canonical.com>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
In cases of duplicate symbols, sympos will be used to disambiguate instead
of val. By default sympos will be 0, and patching will only succeed if
the symbol is unique. Specifying a positive value will ensure that
occurrence of the symbol in kallsyms for the patched object will be used
for patching if it is valid. For external relocations sympos is not
supported.
Remove klp_verify_callback, klp_verify_args and klp_verify_vmlinux_symbol
as they are no longer used.
From the klp_reloc structure remove val, as it can be refactored as a
local variable in klp_write_object_relocations.
Signed-off-by: Chris J Arges <chris.j.arges@canonical.com>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Currently, patching objects with duplicate symbol names fail because the
creation of the sysfs function directory collides with the previous
attempt. Appending old_addr to the function name is problematic as it
reveals the address of the function being patch to a normal user. Using
the symbol's occurrence in kallsyms to postfix the function name in the
sysfs directory solves the issue of having consistent unique names and
ensuring that the address is not exposed to a normal user.
In addition, using the symbol position as the user's method to disambiguate
symbols instead of addr allows for disambiguating symbols in modules as
well for both function addresses and for relocs. This also simplifies much
of the code. Special handling for kASLR is no longer needed and can be
removed. The klp_find_verify_func_addr function can be replaced by
klp_find_object_symbol, and klp_verify_vmlinux_symbol and its callback can
be removed completely.
In cases of duplicate symbols, old_sympos will be used to disambiguate
instead of old_addr. By default old_sympos will be 0, and patching will
only succeed if the symbol is unique. Specifying a positive value will
ensure that occurrence of the symbol in kallsyms for the patched object
will be used for patching if it is valid.
In addition, make old_addr an internal structure field not to be specified
by the user. Finally, remove klp_find_verify_func_addr as it can be
replaced by klp_find_object_symbol directly.
Support for symbol position disambiguation for relocations is added in the
next patch in this series.
Signed-off-by: Chris J Arges <chris.j.arges@canonical.com>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Now that nobody use the "priv" arg passed to can_fork/cancel_fork/fork we can
kill CGROUP_CANFORK_COUNT/SUBSYS_TAG/etc and cgrp_ss_priv[] in copy_process().
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Because accounting resources for the root cgroup sometimes incurs
measureable overhead for workloads which don't care about cgroup and
often ends up calculating a number which is available elsewhere in a
slightly different form, cgroup is not in the business of providing
system-wide statistics. The pids controller which was introduced
recently was exposing "pids.current" at the root. This patch disable
accounting for root cgroup and removes the file from the root
directory.
While this is a userland visible behavior change, pids has been
available only in one version and was badly broken there, so I don't
think this will be noticeable. If it turns out to be a problem, we
can reinstate it for v1 hierarchies.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Consider the following v2 hierarchy.
P0 (+memory) --- P1 (-memory) --- A
\- B
P0 has memory enabled in its subtree_control while P1 doesn't. If
both A and B contain processes, they would belong to the memory css of
P1. Now if memory is enabled on P1's subtree_control, memory csses
should be created on both A and B and A's processes should be moved to
the former and B's processes the latter. IOW, enabling controllers
can cause atomic migrations into different csses.
The core cgroup migration logic has been updated accordingly but the
controller migration methods haven't and still assume that all tasks
migrate to a single target css; furthermore, the methods were fed the
css in which subtree_control was updated which is the parent of the
target csses. pids controller depends on the migration methods to
move charges and this made the controller attribute charges to the
wrong csses often triggering the following warning by driving a
counter negative.
WARNING: CPU: 1 PID: 1 at kernel/cgroup_pids.c:97 pids_cancel.constprop.6+0x31/0x40()
Modules linked in:
CPU: 1 PID: 1 Comm: systemd Not tainted 4.4.0-rc1+ #29
...
ffffffff81f65382 ffff88007c043b90 ffffffff81551ffc 0000000000000000
ffff88007c043bc8 ffffffff810de202 ffff88007a752000 ffff88007a29ab00
ffff88007c043c80 ffff88007a1d8400 0000000000000001 ffff88007c043bd8
Call Trace:
[<ffffffff81551ffc>] dump_stack+0x4e/0x82
[<ffffffff810de202>] warn_slowpath_common+0x82/0xc0
[<ffffffff810de2fa>] warn_slowpath_null+0x1a/0x20
[<ffffffff8118e031>] pids_cancel.constprop.6+0x31/0x40
[<ffffffff8118e0fd>] pids_can_attach+0x6d/0xf0
[<ffffffff81188a4c>] cgroup_taskset_migrate+0x6c/0x330
[<ffffffff81188e05>] cgroup_migrate+0xf5/0x190
[<ffffffff81189016>] cgroup_attach_task+0x176/0x200
[<ffffffff8118949d>] __cgroup_procs_write+0x2ad/0x460
[<ffffffff81189684>] cgroup_procs_write+0x14/0x20
[<ffffffff811854e5>] cgroup_file_write+0x35/0x1c0
[<ffffffff812e26f1>] kernfs_fop_write+0x141/0x190
[<ffffffff81265f88>] __vfs_write+0x28/0xe0
[<ffffffff812666fc>] vfs_write+0xac/0x1a0
[<ffffffff81267019>] SyS_write+0x49/0xb0
[<ffffffff81bcef32>] entry_SYSCALL_64_fastpath+0x12/0x76
This patch fixes the bug by removing @css parameter from the three
migration methods, ->can_attach, ->cancel_attach() and ->attach() and
updating cgroup_taskset iteration helpers also return the destination
css in addition to the task being migrated. All controllers are
updated accordingly.
* Controllers which don't care whether there are one or multiple
target csses can be converted trivially. cpu, io, freezer, perf,
netclassid and netprio fall in this category.
* cpuset's current implementation assumes that there's single source
and destination and thus doesn't support v2 hierarchy already. The
only change made by this patchset is how that single destination css
is obtained.
* memory migration path already doesn't do anything on v2. How the
single destination css is obtained is updated and the prep stage of
mem_cgroup_can_attach() is reordered to accomodate the change.
* pids is the only controller which was affected by this bug. It now
correctly handles multi-destination migrations and no longer causes
counter underflow from incorrect accounting.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-and-tested-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
Cc: Aleksa Sarai <cyphar@cyphar.com>
If one or more tasks get moved into a frozen css, the frozen state is
cleared up from the destination css so that it can be reasserted once
the migrated tasks are frozen. freezer_attach() implements this in
two separate steps - clearing CGROUP_FROZEN on the target css while
processing each task and propagating the clearing upwards after the
task loop is done if necessary.
This patch merges the two steps. Propagation now takes place inside
the task loop. This simplifies the code and prepares it for the fix
of multi-destination migration.
Signed-off-by: Tejun Heo <tj@kernel.org>
For large map->value_size the user space can trigger memory allocation warnings like:
WARNING: CPU: 2 PID: 11122 at mm/page_alloc.c:2989
__alloc_pages_nodemask+0x695/0x14e0()
Call Trace:
[< inline >] __dump_stack lib/dump_stack.c:15
[<ffffffff82743b56>] dump_stack+0x68/0x92 lib/dump_stack.c:50
[<ffffffff81244ec9>] warn_slowpath_common+0xd9/0x140 kernel/panic.c:460
[<ffffffff812450f9>] warn_slowpath_null+0x29/0x30 kernel/panic.c:493
[< inline >] __alloc_pages_slowpath mm/page_alloc.c:2989
[<ffffffff81554e95>] __alloc_pages_nodemask+0x695/0x14e0 mm/page_alloc.c:3235
[<ffffffff816188fe>] alloc_pages_current+0xee/0x340 mm/mempolicy.c:2055
[< inline >] alloc_pages include/linux/gfp.h:451
[<ffffffff81550706>] alloc_kmem_pages+0x16/0xf0 mm/page_alloc.c:3414
[<ffffffff815a1c89>] kmalloc_order+0x19/0x60 mm/slab_common.c:1007
[<ffffffff815a1cef>] kmalloc_order_trace+0x1f/0xa0 mm/slab_common.c:1018
[< inline >] kmalloc_large include/linux/slab.h:390
[<ffffffff81627784>] __kmalloc+0x234/0x250 mm/slub.c:3525
[< inline >] kmalloc include/linux/slab.h:463
[< inline >] map_update_elem kernel/bpf/syscall.c:288
[< inline >] SYSC_bpf kernel/bpf/syscall.c:744
To avoid never succeeding kmalloc with order >= MAX_ORDER check that
elem->value_size and computed elem_size are within limits for both hash and
array type maps.
Also add __GFP_NOWARN to kmalloc(value_size | elem_size) to avoid OOM warnings.
Note kmalloc(key_size) is highly unlikely to trigger OOM, since key_size <= 512,
so keep those kmalloc-s as-is.
Large value_size can cause integer overflows in elem_size and map.pages
formulas, so check for that as well.
Fixes: aaac3ba95e ("bpf: charge user for creation of BPF maps and programs")
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
During own review but also reported by Dmitry's syzkaller [1] it has been
noticed that we trigger a heap out-of-bounds access on eBPF array maps
when updating elements. This happens with each map whose map->value_size
(specified during map creation time) is not multiple of 8 bytes.
In array_map_alloc(), elem_size is round_up(attr->value_size, 8) and
used to align array map slots for faster access. However, in function
array_map_update_elem(), we update the element as ...
memcpy(array->value + array->elem_size * index, value, array->elem_size);
... where we access 'value' out-of-bounds, since it was allocated from
map_update_elem() from syscall side as kmalloc(map->value_size, GFP_USER)
and later on copied through copy_from_user(value, uvalue, map->value_size).
Thus, up to 7 bytes, we can access out-of-bounds.
Same could happen from within an eBPF program, where in worst case we
access beyond an eBPF program's designated stack.
Since 1be7f75d16 ("bpf: enable non-root eBPF programs") didn't hit an
official release yet, it only affects priviledged users.
In case of array_map_lookup_elem(), the verifier prevents eBPF programs
from accessing beyond map->value_size through check_map_access(). Also
from syscall side map_lookup_elem() only copies map->value_size back to
user, so nothing could leak.
[1] http://github.com/google/syzkaller
Fixes: 28fbcfa08d ("bpf: add array type of eBPF maps")
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>