* If idr init fails, cgroup_load_subsys() cleared dummytop->subsys[]
before calilng ->destroy() making CSS inaccessible to the callback,
and didn't unlink ss->sibling. As no modular controller uses
->use_id, this doesn't cause any actual problems.
* cgroup_unload_subsys() was forgetting to free idr, call
->pre_destroy() and clear ->active. As there currently is no
modular controller which uses ->use_id, ->pre_destroy() or ->active,
this doesn't cause any actual problems.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Make cgroup_init_subsys() grab cgroup_mutex while initializing a
subsystem so that all helpers and callbacks are called under the
context they expect. This isn't strictly necessary as
cgroup_init_subsys() doesn't race with anybody but will allow adding
lockdep assertions.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Consistently use @css and @dummytop in these two functions instead of
referring to them indirectly.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Currently, CSS_* flags are defined as bit positions and manipulated
using atomic bitops. There's no reason to use atomic bitops for them
and bit positions are clunkier to deal with than bit masks. Make
CSS_* bit masks instead and use the usual C bitwise operators to
access them.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
cgroup->dentry is marked and used as a RCU pointer; however, it isn't
one - the final dentry put doesn't go through call_rcu(). cgroup and
dentry share the same RCU freeing rule via synchronize_rcu() in
cgroup_diput() (kfree_rcu() used on cgrp is unnecessary). If cgrp is
accessible under RCU read lock, so is its dentry and dereferencing
cgrp->dentry doesn't need any further RCU protection or annotation.
While not being accurate, before the previous patch, the RCU accessors
served a purpose as memory barriers - cgroup->dentry used to be
assigned after the cgroup was made visible to cgroup_path(), so the
assignment and dereferencing in cgroup_path() needed the memory
barrier pair. Now that list_add_tail_rcu() happens after
cgroup->dentry is assigned, this no longer is necessary.
Remove the now unnecessary and misleading RCU annotations from
cgroup->dentry. To make up for the removal of rcu_dereference_check()
in cgroup_path(), add an explicit rcu_lockdep_assert(), which asserts
the dereference rule of @cgrp, not cgrp->dentry.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
While creating a new cgroup, cgroup_create() links the newly allocated
cgroup into various places before trying to create its directory.
Because cgroup life-cycle is tied to the vfs objects, this makes it
impossible to use cgroup_rmdir() for rolling back creation - the
removal logic depends on having full vfs objects.
This patch moves directory creation above linking and collect linking
operations to one place. This allows directory creation failure to
share error exit path with css allocation failures and any failure
sites afterwards (to be added later) can use cgroup_rmdir() logic to
undo creation.
Note that this also makes the memory barriers around cgroup->dentry,
which currently is misleadingly using RCU operations, unnecessary.
This will be handled in the next patch.
While at it, locking BUG_ON() on i_mutex is converted to
lockdep_assert_held().
v2: Patch originally removed %NULL dentry check in cgroup_path();
however, Li pointed out that this patch doesn't make it
unnecessary as ->create() may call cgroup_path(). Drop the
change for now.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
The operation order of cgroup creation is about to change and
cgroup_create_dir() is more of a hindrance than a proper abstraction.
Open-code it by moving the parent nlink adjustment next to self nlink
adjustment in cgroup_create_file() and the rest to cgroup_create().
This patch doesn't introduce any behavior change.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Not strictly necessary but it's annoying to have uninitialized
list_head around.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
vtime_account() is only called from irq entry. irqs
are always disabled at this point so we can safely
remove the irq disabling guards on that function.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
On ia64 and powerpc, vtime context switch only consists
in flushing system and user pending time, plus a few
arch housekeeping.
Consolidate that into a generic implementation. s390 is
a special case because pending user and system time accounting
there is hard to dissociate. So it's keeping its own implementation.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Prepending irq-unsafe vtime APIs with underscores was actually
a bad idea as the result is a big mess in the API namespace that
is even waiting to be further extended. Also these helpers
are always called from irq safe callers except kvm. Just
provide a vtime_account_system_irqsafe() for this specific
case so that we can remove the underscore prefix on other
vtime functions.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Now that we have been through every permission check in the kernel
having uid == 0 and gid == 0 in your local user namespace no
longer adds any special privileges. Even having a full set
of caps in your local user namespace is safe because capabilies
are relative to your local user namespace, and do not confer
unexpected privileges.
Over the long term this should allow much more of the kernels
functionality to be safely used by non-root users. Functionality
like unsharing the mount namespace that is only unsafe because
it can fool applications whose privileges are raised when they
are executed. Since those applications have no privileges in
a user namespaces it becomes safe to spoof and confuse those
applications all you want.
Those capabilities will still need to be enabled carefully because
we may still need things like rlimits on the number of unprivileged
mounts but that is to avoid DOS attacks not to avoid fooling root
owned processes.
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
This will allow for support for unprivileged mounts in a new user namespace.
Acked-by: "Serge E. Hallyn" <serge@hallyn.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Unsharing of the pid namespace unlike unsharing of other namespaces
does not take affect immediately. Instead it affects the children
created with fork and clone. The first of these children becomes the init
process of the new pid namespace, the rest become oddball children
of pid 0. From the point of view of the new pid namespace the process
that created it is pid 0, as it's pid does not map.
A couple of different semantics were considered but this one was
settled on because it is easy to implement and it is usable from
pam modules. The core reasons for the existence of unshare.
I took a survey of the callers of pam modules and the following
appears to be a representative sample of their logic.
{
setup stuff include pam
child = fork();
if (!child) {
setuid()
exec /bin/bash
}
waitpid(child);
pam and other cleanup
}
As you can see there is a fork to create the unprivileged user
space process. Which means that the unprivileged user space
process will appear as pid 1 in the new pid namespace. Further
most login processes do not cope with extraneous children which
means shifting the duty of reaping extraneous child process to
the creator of those extraneous children makes the system more
comprehensible.
The practical reason for this set of pid namespace semantics is
that it is simple to implement and verify they work correctly.
Whereas an implementation that requres changing the struct
pid on a process comes with a lot more races and pain. Not
the least of which is that glibc caches getpid().
These semantics are implemented by having two notions
of the pid namespace of a proces. There is task_active_pid_ns
which is the pid namspace the process was created with
and the pid namespace that all pids are presented to
that process in. The task_active_pid_ns is stored
in the struct pid of the task.
Then there is the pid namespace that will be used for children
that pid namespace is stored in task->nsproxy->pid_ns.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Instead of setting child_reaper and SIGNAL_UNKILLABLE one way
for the system init process, and another way for pid namespace
init processes test pid->nr == 1 and use the same code for both.
For the global init this results in SIGNAL_UNKILLABLE being set
much earlier in the initialization process.
This is a small cleanup and it paves the way for allowing unshare and
enter of the pid namespace as that path like our global init also will
not set CLONE_NEWPID.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
- Pid namespaces are designed to be inescapable so verify that the
passed in pid namespace is a child of the currently active
pid namespace or the currently active pid namespace itself.
Allowing the currently active pid namespace is important so
the effects of an earlier setns can be cancelled.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
task_active_pid_ns(current) != current->ns_proxy->pid_ns will
soon be allowed to support unshare and setns.
The definition of creating a child pid namespace when
task_active_pid_ns(current) != current->ns_proxy->pid_ns could be that
we create a child pid namespace of current->ns_proxy->pid_ns. However
that leads to strange cases like trying to have a single process be
init in multiple pid namespaces, which is racy and hard to think
about.
The definition of creating a child pid namespace when
task_active_pid_ns(current) != current->ns_proxy->pid_ns could be that
we create a child pid namespace of task_active_pid_ns(current). While
that seems less racy it does not provide any utility.
Therefore define the semantics of creating a child pid namespace when
task_active_pid_ns(current) != current->ns_proxy->pid_ns to be that the
pid namespace creation fails. That is easy to implement and easy
to think about.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Looking at pid_ns->nr_hashed is a bit simpler and it works for
disjoint process trees that an unshare or a join of a pid_namespace
may create.
Acked-by: "Serge E. Hallyn" <serge@hallyn.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Set nr_hashed to -1 just before we schedule the work to cleanup proc.
Test nr_hashed just before we hash a new pid and if nr_hashed is < 0
fail.
This guaranteees that processes never enter a pid namespaces after we
have cleaned up the state to support processes in a pid namespace.
Currently sending SIGKILL to all of the process in a pid namespace as
init exists gives us this guarantee but we need something a little
stronger to support unsharing and joining a pid namespace.
Acked-by: "Serge E. Hallyn" <serge@hallyn.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Track the number of pids in the proc hash table. When the number of
pids goes to 0 schedule work to unmount the kernel mount of proc.
Move the mount of proc into alloc_pid when we allocate the pid for
init.
Remove the surprising calls of pid_ns_release proc in fork and
proc_flush_task. Those code paths really shouldn't know about proc
namespace implementation details and people have demonstrated several
times that finding and understanding those code paths is difficult and
non-obvious.
Because of the call path detach pid is alwasy called with the
rtnl_lock held free_pid is not allowed to sleep, so the work to
unmounting proc is moved to a work queue. This has the side benefit
of not blocking the entire world waiting for the unnecessary
rcu_barrier in deactivate_locked_super.
In the process of making the code clear and obvious this fixes a bug
reported by Gao feng <gaofeng@cn.fujitsu.com> where we would leak a
mount of proc during clone(CLONE_NEWPID|CLONE_NEWNET) if copy_pid_ns
succeeded and copy_net_ns failed.
Acked-by: "Serge E. Hallyn" <serge@hallyn.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
The expressions tsk->nsproxy->pid_ns and task_active_pid_ns
aka ns_of_pid(task_pid(tsk)) should have the same number of
cache line misses with the practical difference that
ns_of_pid(task_pid(tsk)) is released later in a processes life.
Furthermore by using task_active_pid_ns it becomes trivial
to write an unshare implementation for the the pid namespace.
So I have used task_active_pid_ns everywhere I can.
In fork since the pid has not yet been attached to the
process I use ns_of_pid, to achieve the same effect.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
- Capture the the user namespace that creates the pid namespace
- Use that user namespace to test if it is ok to write to
/proc/sys/kernel/ns_last_pid.
Zhao Hongjiang <zhaohongjiang@huawei.com> noticed I was missing a put_user_ns
in when destroying a pid_ns. I have foloded his patch into this one
so that bisects will work properly.
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
The user namespace which creates a new network namespace owns that
namespace and all resources created in it. This way we can target
capability checks for privileged operations against network resources to
the user_ns which created the network namespace in which the resource
lives. Privilege to the user namespace which owns the network
namespace, or any parent user namespace thereof, provides the same
privilege to the network resource.
This patch is reworked from a version originally by
Serge E. Hallyn <serge.hallyn@canonical.com>
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
The user namespace which creates a new network namespace owns that
namespace and all resources created in it. This way we can target
capability checks for privileged operations against network resources to
the user_ns which created the network namespace in which the resource
lives. Privilege to the user namespace which owns the network
namespace, or any parent user namespace thereof, provides the same
privilege to the network resource.
This patch is reworked from a version originally by
Serge E. Hallyn <serge.hallyn@canonical.com>
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
klogd is woken up asynchronously from the tick in order
to do it safely.
However if printk is called when the tick is stopped, the reader
won't be woken up until the next interrupt, which might not fire
for a while. As a result, the user may miss some message.
To fix this, lets implement the printk tick using a lazy irq work.
This subsystem takes care of the timer tick state and can
fix up accordingly.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
On irq work initialization, let the user choose to define it
as "lazy" or not. "Lazy" means that we don't want to send
an IPI (provided the arch can anyway) when we enqueue this
work but we rather prefer to wait for the next timer tick
to execute our work if possible.
This is going to be a benefit for non-urgent enqueuers
(like printk in the future) that may prefer not to raise
an IPI storm in case of frequent enqueuing on short periods
of time.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
If we are in nohz and there's still irq_work to be done when the idle
task is about to go offline, give a nasty warning. Everything should
have been flushed from the CPU_DYING notifier already. Further attempts
to enqueue an irq_work are buggy because irqs are disabled by
__cpu_disable(). The best we can do is to report the issue to the user.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
In order not to offline a CPU with pending irq works, flush the
queue from CPU_DYING. The notifier is called by stop_machine on
the CPU that is going down. The code will not be called from irq context
(so things like get_irq_regs() wont work) but I'm not sure what the
requirements are for irq_work in that regard (Peter?). But irqs are
disabled and the CPU is about to go offline. Might as well flush the work.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Don't stop the tick if we have pending irq works on the
queue, otherwise if the arch can't raise self-IPIs, we may not
find an opportunity to execute the pending works for a while.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
We need some quick way to check if the CPU has stopped
its tick. This will be useful to implement the printk tick
using the irq work subsystem.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Currently, callback invocations from callback-free CPUs are accounted to
the CPU that registered the callback, but using the same field that is
used for normal callbacks. This makes it impossible to determine from
debugfs output whether callbacks are in fact being diverted. This commit
therefore adds a separate ->n_nocbs_invoked field in the rcu_data structure
in which diverted callback invocations are counted. RCU's debugfs tracing
still displays normal callback invocations using ci=, but displayed
diverted callbacks with nci=.
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
RCU callback execution can add significant OS jitter and also can
degrade both scheduling latency and, in asymmetric multiprocessors,
energy efficiency. This commit therefore adds the ability for selected
CPUs ("rcu_nocbs=" boot parameter) to have their callbacks offloaded
to kthreads. If the "rcu_nocb_poll" boot parameter is also specified,
these kthreads will do polling, removing the need for the offloaded
CPUs to do wakeups. At least one CPU must be doing normal callback
processing: currently CPU 0 cannot be selected as a no-CBs CPU.
In addition, attempts to offline the last normal-CBs CPU will fail.
This feature was inspired by Jim Houston's and Joe Korty's JRCU, and
this commit includes fixes to problems located by Fengguang Wu's
kbuild test robot.
[ paulmck: Added gfp.h include file as suggested by Fengguang Wu. ]
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
urgent.2012.10.27a: Fix for RCU user-mode transition (already in -tip).
doc.2012.11.08a: Documentation updates, most notably codifying the
memory-barrier guarantees inherent to grace periods.
fixes.2012.11.13a: Miscellaneous fixes.
srcu.2012.10.27a: Allow statically allocated and initialized srcu_struct
structures (courtesy of Lai Jiangshan).
stall.2012.11.13a: Add more diagnostic information to RCU CPU stall
warnings, also decrease from 60 seconds to 21 seconds.
hotplug.2012.11.08a: Minor updates to CPU hotplug handling.
tracing.2012.11.08a: Improved debugfs tracing, courtesy of Michael Wang.
idle.2012.10.24a: Updates to RCU idle/adaptive-idle handling, including
a boot parameter that maps normal grace periods to expedited.
Resolved conflict in kernel/rcutree.c due to side-by-side change.
This was always racy, but 268720903f
"uprobes: Rework register_for_each_vma() to make it O(n)" should be
blamed anyway, it made everything worse and I didn't notice.
register/unregister call build_map_info() and then do install/remove
breakpoint for every mm which mmaps inode/offset. This can obviously
race with fork()->dup_mmap() in between and we can miss the child.
uprobe_register() could be easily fixed but unregister is much worse,
the new mm inherits "int3" from parent and there is no way to detect
this if uprobe goes away.
So this patch simply adds percpu_down_read/up_read around dup_mmap(),
and percpu_down_write/up_write into register_for_each_vma().
This adds 2 new hooks into dup_mmap() but we can kill uprobe_dup_mmap()
and fold it into uprobe_end_dup_mmap().
Reported-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Trace buffer size is now per-cpu, so that there are the following two
patterns in resizing of buffers.
(1) resize per-cpu buffers to same given size
(2) resize per-cpu buffers to another trace_array's buffer size
for each CPU (such as preparing the max_tr which is equivalent
to the global_trace's size)
__tracing_resize_ring_buffer() can be used for (1), and had
implemented (2) inside it for resetting the global_trace to the
original size.
(2) was also implemented in another place. So this patch assembles
them in a new function - resize_buffer_duplicate_size().
Link: http://lkml.kernel.org/r/20121017025616.2627.91226.stgit@falsita
Signed-off-by: Hiraku Toyooka <hiraku.toyooka.gu@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
There is a typo here where '&' is used instead of '|' and it turns the
statement into a noop. The original code is equivalent to:
iter->flags &= ~((1 << 2) & (1 << 4));
Link: http://lkml.kernel.org/r/20120609161027.GD6488@elgon.mountain
Cc: stable@vger.kernel.org # all of them
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Since the software suspend extents are organized in an rbtree, use rb_entry
instead of container_of, as it is semantically more appropriate in order to
get a node as it is iterated.
Signed-off-by: Davidlohr Bueso <dave@gnu.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Replace strict_strtoul() with kstrtoul() in pm_async_store() and
pm_qos_power_write().
[rjw: Modified subject and changelog.]
Signed-off-by: Daniel Walter <sahne@0x90.at>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The prediction for future is difficult and when the cpuidle governor prediction
fails and govenor possibly choose the shallower C-state than it should. How to
quickly notice and find the failure becomes important for power saving.
cpuidle menu governor has a method to predict the repeat pattern if there are 8
C-states residency which are continuous and the same or very close, so it will
predict the next C-states residency will keep same residency time.
There is a real case that turbostat utility (tools/power/x86/turbostat)
at kernel 3.3 or early. turbostat utility will read 10 registers one by one at
Sandybridge, so it will generate 10 IPIs to wake up idle CPUs. So cpuidle menu
governor will predict it is repeat mode and there is another IPI wake up idle
CPU soon, so it keeps idle CPU stay at C1 state even though CPU is totally
idle. However, in the turbostat, following 10 registers reading is sleep 5
seconds by default, so the idle CPU will keep at C1 for a long time though it is
idle until break event occurs.
In a idle Sandybridge system, run "./turbostat -v", we will notice that deep
C-state dangles between "70% ~ 99%". After patched the kernel, we will notice
deep C-state stays at >99.98%.
In the patch, a timer is added when menu governor detects a repeat mode and
choose a shallow C-state. The timer is set to a time out value that greater
than predicted time, and we conclude repeat mode prediction failure if timer is
triggered. When repeat mode happens as expected, the timer is not triggered
and CPU waken up from C-states and it will cancel the timer initiatively.
When repeat mode does not happen, the timer will be time out and menu governor
will quickly notice that the repeat mode prediction fails and then re-evaluates
deeper C-states possibility.
Below is another case which will clearly show the patch much benefit:
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
#include <signal.h>
#include <sys/time.h>
#include <time.h>
#include <pthread.h>
volatile int * shutdown;
volatile long * count;
int delay = 20;
int loop = 8;
void usage(void)
{
fprintf(stderr,
"Usage: idle_predict [options]\n"
" --help -h Print this help\n"
" --thread -n Thread number\n"
" --loop -l Loop times in shallow Cstate\n"
" --delay -t Sleep time (uS)in shallow Cstate\n");
}
void *simple_loop() {
int idle_num = 1;
while (!(*shutdown)) {
*count = *count + 1;
if (idle_num % loop)
usleep(delay);
else {
/* sleep 1 second */
usleep(1000000);
idle_num = 0;
}
idle_num++;
}
}
static void sighand(int sig)
{
*shutdown = 1;
}
int main(int argc, char *argv[])
{
sigset_t sigset;
int signum = SIGALRM;
int i, c, er = 0, thread_num = 8;
pthread_t pt[1024];
static char optstr[] = "n:l:t:h:";
while ((c = getopt(argc, argv, optstr)) != EOF)
switch (c) {
case 'n':
thread_num = atoi(optarg);
break;
case 'l':
loop = atoi(optarg);
break;
case 't':
delay = atoi(optarg);
break;
case 'h':
default:
usage();
exit(1);
}
printf("thread=%d,loop=%d,delay=%d\n",thread_num,loop,delay);
count = malloc(sizeof(long));
shutdown = malloc(sizeof(int));
*count = 0;
*shutdown = 0;
sigemptyset(&sigset);
sigaddset(&sigset, signum);
sigprocmask (SIG_BLOCK, &sigset, NULL);
signal(SIGINT, sighand);
signal(SIGTERM, sighand);
for(i = 0; i < thread_num ; i++)
pthread_create(&pt[i], NULL, simple_loop, NULL);
for (i = 0; i < thread_num; i++)
pthread_join(pt[i], NULL);
exit(0);
}
Get powertop V2 from git://github.com/fenrus75/powertop, build powertop.
After build the above test application, then run it.
Test plaform can be Intel Sandybridge or other recent platforms.
#./idle_predict -l 10 &
#./powertop
We will find that deep C-state will dangle between 40%~100% and much time spent
on C1 state. It is because menu governor wrongly predict that repeat mode
is kept, so it will choose the C1 shallow C-state even though it has chance to
sleep 1 second in deep C-state.
While after patched the kernel, we find that deep C-state will keep >99.6%.
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Youquan Song <youquan.song@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Even if acpi_processor_handle_eject() offlines cpu, there is a chance
to online the cpu after that. So the patch closes the window by using
get/put_online_cpus().
Why does the patch change _cpu_up() logic?
The patch cares the race of hot-remove cpu and _cpu_up(). If the patch
does not change it, there is the following race.
hot-remove cpu | _cpu_up()
------------------------------------- ------------------------------------
call acpi_processor_handle_eject() |
call cpu_down() |
call get_online_cpus() |
| call cpu_hotplug_begin() and stop here
call arch_unregister_cpu() |
call acpi_unmap_lsapic() |
call put_online_cpus() |
| start and continue _cpu_up()
return acpi_processor_remove() |
continue hot-remove the cpu |
So _cpu_up() can continue to itself. And hot-remove cpu can also continue
itself. If the patch changes _cpu_up() logic, the race disappears as below:
hot-remove cpu | _cpu_up()
-----------------------------------------------------------------------
call acpi_processor_handle_eject() |
call cpu_down() |
call get_online_cpus() |
| call cpu_hotplug_begin() and stop here
call arch_unregister_cpu() |
call acpi_unmap_lsapic() |
cpu's cpu_present is set |
to false by set_cpu_present()|
call put_online_cpus() |
| start _cpu_up()
| check cpu_present() and return -EINVAL
return acpi_processor_remove() |
continue hot-remove the cpu |
Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Reviewed-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Reviewed-by: Toshi Kani <toshi.kani@hp.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
cpu_hotplug_pm_callback should have higher priority than
bsp_pm_callback which depends on cpu_hotplug_pm_callback to disable cpu hotplug
to avoid race during bsp online checking.
This is to hightlight the priorities between the two callbacks in case people
may overlook the order.
Ideally the priorities should be defined in macro/enum instead of fixed values.
To do that, a seperate patchset may be pushed which will touch serveral other
generic files and is out of scope of this patchset.
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Link: http://lkml.kernel.org/r/1352835171-3958-7-git-send-email-fenghua.yu@intel.com
Reviewed-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Flush the cache so that the instructions written to the XOL area are
visible.
Signed-off-by: Rabin Vincent <rabin@rab.in>
Acked-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Work claiming wants to be SMP-safe.
And by the time we try to claim a work, if it is already executing
concurrently on another CPU, we want to succeed the claiming and queue
the work again because the other CPU may have missed the data we wanted
to handle in our work if it's about to complete there.
This scenario is summarized below:
CPU 1 CPU 2
----- -----
(flags = 0)
cmpxchg(flags, 0, IRQ_WORK_FLAGS)
(flags = 3)
[...]
xchg(flags, IRQ_WORK_BUSY)
(flags = 2)
func()
if (flags & IRQ_WORK_PENDING)
(not true)
cmpxchg(flags, flags, IRQ_WORK_FLAGS)
(flags = 3)
[...]
cmpxchg(flags, IRQ_WORK_BUSY, 0);
(fail, pending on CPU 2)
This state machine is synchronized using [cmp]xchg() on the flags.
As such, the early IRQ_WORK_PENDING check in CPU 2 above is racy.
By the time we check it, we may be dealing with a stale value because
we aren't using an atomic accessor. As a result, CPU 2 may "see"
that the work is still pending on another CPU while it may be
actually completing the work function exection already, leaving
our data unprocessed.
To fix this, we start by speculating about the value we wish to be
in the work->flags but we only make any conclusion after the value
returned by the cmpxchg() call that either claims the work or let
the current owner handle the pending work for us.
Changelog-heavily-inspired-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Anish Kumar <anish198519851985@gmail.com>
The IRQ_WORK_BUSY flag is set right before we execute the
work. Once this flag value is set, the work enters a
claimable state again.
So if we have specific data to compute in our work, we ensure it's
either handled by another CPU or locally by enqueuing the work again.
This state machine is guanranteed by atomic operations on the flags.
So when we set IRQ_WORK_BUSY without using an xchg-like operation,
we break this guarantee as in the following summarized scenario:
CPU 1 CPU 2
----- -----
(flags = 0)
old_flags = flags;
(flags = 0)
cmpxchg(flags, old_flags,
old_flags | IRQ_WORK_FLAGS)
(flags = 3)
[...]
flags = IRQ_WORK_BUSY
(flags = 2)
func()
(sees flags = 3)
cmpxchg(flags, old_flags,
old_flags | IRQ_WORK_FLAGS)
(give up)
cmpxchg(flags, 2, 0);
(flags = 0)
CPU 1 claims a work and executes it, so it sets IRQ_WORK_BUSY and
the work is again in a claimable state. Now CPU 2 has new data to process
and try to claim that work but it may see a stale value of the flags
and think the work is still pending somewhere that will handle our data.
This is because CPU 1 doesn't set IRQ_WORK_BUSY atomically.
As a result, the data expected to be handle by CPU 2 won't get handled.
To fix this, use xchg() to set IRQ_WORK_BUSY, this way we ensure the CPU 2
will see the correct value with cmpxchg() using the expected ordering.
Changelog-heavily-inspired-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Anish Kumar <anish198519851985@gmail.com>
The rcu_is_cpu_rrupt_from_idle() needs to allow for one interrupt level
from the idle loop, but TINY_RCU checks for a call from the idle loop
itself. This commit fixes this issue.
Reported-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This commit explicitly states the memory-ordering properties of the
RCU grace-period primitives. Although these properties were in some
sense implied by the fundmental property of RCU ("a grace period must
wait for all pre-existing RCU read-side critical sections to complete"),
stating it explicitly will be a great labor-saving device.
Reported-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Several new rcutorture module parameters have been added, but are not
printed to the console at the beginning and end of tests, which makes
it difficult to reproduce a prior test. This commit therefore adds
these new module parameters to the list printed at the beginning and
the end of the tests.
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Commit 29c00b4a1d (rcu: Add event-tracing for RCU callback
invocation) added a regression in rcu_do_batch()
Under stress, RCU is supposed to allow to process all items in queue,
instead of a batch of 10 items (blimit), but an integer overflow makes
the effective limit being 1. So, unless there is frequent idle periods
(during which RCU ignores batch limits), RCU can be forced into a
state where it cannot keep up with the callback-generation rate,
eventually resulting in OOM.
This commit therefore converts a few variables in rcu_do_batch() from
int to long to fix this problem, along with the module parameters
controlling the batch limits.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: <stable@vger.kernel.org> # 3.2 +
Show raw time stamp values for stats per cpu if you choose counter or tsc mode
for trace_clock. Although a unit of tracing time stamp is nsec in local or global mode,
the units in counter and TSC mode are tracing counter and cycles respectively.
Link: http://lkml.kernel.org/r/1352837903-32191-3-git-send-email-dhsharp@google.com
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Yoshihiro YUNOMAE <yoshihiro.yunomae.ez@hitachi.com>
Signed-off-by: David Sharp <dhsharp@google.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
In order to promote interoperability between userspace tracers and ftrace,
add a trace_clock that reports raw TSC values which will then be recorded
in the ring buffer. Userspace tracers that also record TSCs are then on
exactly the same time base as the kernel and events can be unambiguously
interlaced.
Tested: Enabled a tracepoint and the "tsc" trace_clock and saw very large
timestamp values.
v2:
Move arch-specific bits out of generic code.
v3:
Rename "x86-tsc", cleanups
v7:
Generic arch bits in Kbuild.
Google-Bug-Id: 6980623
Link: http://lkml.kernel.org/r/1352837903-32191-1-git-send-email-dhsharp@google.com
Acked-by: Ingo Molnar <mingo@kernel.org>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@linux.intel.com>
Signed-off-by: David Sharp <dhsharp@google.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Now that timekeeping is protected by its own locks, rename
the xtime_lock to jifffies_lock to better describe what it
protects.
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Eric Dumazet <eric.dumazet@gmail.com>
CC: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Commit f1b8274 ("clocksource: Cleanup clocksource selection") removed all
external references to clocksource_jiffies so there is no need to have the
symbol globally visible.
Fixes the following sparse warning:
CHECK kernel/time/jiffies.c kernel/time/jiffies.c:61:20: warning: symbol 'clocksource_jiffies' was not declared. Should it be static?
Signed-off-by: Lars-Peter Clausen <lars@metafoo.de>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Pull futex fix from Thomas Gleixner:
"Single fix for a long standing futex race when taking over a futex
whose owner died. You can end up with two owners, which violates
quite some rules."
* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
futex: Handle futex_pi OWNER_DIED take over correctly
Sankara reported that the genirq core code fails to adjust the
affinity of an interrupt thread in several cases:
1) On request/setup_irq() the call to setup_affinity() happens before
the new action is registered, so the new thread is not notified.
2) For secondary shared interrupts nothing notifies the new thread to
change its affinity.
3) Interrupts which have the IRQ_NO_BALANCE flag set are not moving
the thread either.
Fix this by setting the thread affinity flag right on thread creation
time. This ensures that under all circumstances the thread moves to
the right place. Requires a check in irq_thread_check_affinity for an
existing affinity mask (CONFIG_CPU_MASK_OFFSTACK=y)
Reported-and-tested-by: Sankara Muthukrishnan <sankara.m@gmail.com>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1209041738200.2754@ionos
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Conflicts:
drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
Minor conflict between the BCM_CNIC define removal in net-next
and a bug fix added to net. Based upon a conflict resolution
patch posted by Stephen Rothwell.
Signed-off-by: David S. Miller <davem@davemloft.net>
Up until now, cgroup_freezer didn't implement hierarchy properly.
cgroups could be arranged in hierarchy but it didn't make any
difference in how each cgroup_freezer behaved. They all operated
separately.
This patch implements proper hierarchy support. If a cgroup is
frozen, all its descendants are frozen. A cgroup is thawed iff it and
all its ancestors are THAWED. freezer.self_freezing shows the current
freezing state for the cgroup itself. freezer.parent_freezing shows
whether the cgroup is freezing because any of its ancestors is
freezing.
freezer_post_create() locks the parent and new cgroup and inherits the
parent's state and freezer_change_state() applies new state top-down
using cgroup_for_each_descendant_pre() which guarantees that no child
can escape its parent's state. update_if_frozen() uses
cgroup_for_each_descendant_post() to propagate frozen states
bottom-up.
Synchronization could be coarser and easier by using a single mutex to
protect all hierarchy operations. Finer grained approach was used
because it wasn't too difficult for cgroup_freezer and I think it's
beneficial to have an example implementation and cgroup_freezer is
rather simple and can serve a good one.
As this makes cgroup_freezer properly hierarchical,
freezer_subsys.broken_hierarchy marking is removed.
Note that this patch changes userland visible behavior - freezing a
cgroup now freezes all its descendants too. This behavior change is
intended and has been warned via .broken_hierarchy.
v2: Michal spotted a bug in freezer_change_state() - descendants were
inheriting from the wrong ancestor. Fixed.
v3: Documentation/cgroups/freezer-subsystem.txt updated.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
A cgroup is online and visible to iteration between ->post_create()
and ->pre_destroy(). This patch introduces CGROUP_FREEZER_ONLINE and
toggles it from the newly added freezer_post_create() and
freezer_pre_destroy() while holding freezer->lock such that a
cgroup_freezer can be reilably distinguished to be online. This will
be used by full hierarchy support.
ONLINE test is added to freezer_apply_state() but it currently doesn't
make any difference as freezer_write() can only be called for an
online cgroup.
Adjusting system_freezing_cnt on destruction is moved from
freezer_destroy() to the new freezer_pre_destroy() for consistency.
This patch doesn't introduce any noticeable behavior change.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Introduce FREEZING_SELF and FREEZING_PARENT and make FREEZING OR of
the two flags. This is to prepare for full hierarchy support.
freezer_apply_date() is updated such that it can handle setting and
clearing of both flags. The two flags are also exposed to userland
via read-only files self_freezing and parent_freezing.
Other than the added cgroupfs files, this patch doesn't introduce any
behavior change.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
freezer->state was an enum value - one of THAWED, FREEZING and FROZEN.
As the scheduled full hierarchy support requires more than one
freezing condition, switch it to mask of flags. If FREEZING is not
set, it's thawed. FREEZING is set if freezing or frozen. If frozen,
both FREEZING and FROZEN are set. Now that tasks can be attached to
an already frozen cgroup, this also makes freezing condition checks
more natural.
This patch doesn't introduce any behavior change.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
* Make freezer_change_state() take bool @freeze instead of enum
freezer_state.
* Separate out freezer_apply_state() out of freezer_change_state().
This makes freezer_change_state() a rather silly thin wrapper. It
will be filled with hierarchy handling later on.
This patch doesn't introduce any behavior change.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
* Clean-up indentation and line-breaks. Drop the invalid comment
about freezer->lock.
* Make all internal functions take @freezer instead of both @cgroup
and @freezer.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Currently, cgroup doesn't provide any generic helper for walking a
given cgroup's children or descendants. This patch adds the following
three macros.
* cgroup_for_each_child() - walk immediate children of a cgroup.
* cgroup_for_each_descendant_pre() - visit all descendants of a cgroup
in pre-order tree traversal.
* cgroup_for_each_descendant_post() - visit all descendants of a
cgroup in post-order tree traversal.
All three only require the user to hold RCU read lock during
traversal. Verifying that each iterated cgroup is online is the
responsibility of the user. When used with proper synchronization,
cgroup_for_each_descendant_pre() can be used to propagate state
updates to descendants in reliable way. See comments for details.
v2: s/config/state/ in commit message and comments per Michal. More
documentation on synchronization rules.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujisu.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Li Zefan <lizefan@huawei.com>
Use RCU safe list operations for cgroup->children. This will be used
to implement cgroup children / descendant walking which can be used by
controllers.
Note that cgroup_create() now puts a new cgroup at the end of the
->children list instead of head. This isn't strictly necessary but is
done so that the iteration order is more conventional.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Currently, there's no way for a controller to find out whether a new
cgroup finished all ->create() allocatinos successfully and is
considered "live" by cgroup.
This becomes a problem later when we add generic descendants walking
to cgroup which can be used by controllers as controllers don't have a
synchronization point where it can synchronize against new cgroups
appearing in such walks.
This patch adds ->post_create(). It's called after all ->create()
succeeded and the cgroup is linked into the generic cgroup hierarchy.
This plays the counterpart of ->pre_destroy().
When used in combination with the to-be-added generic descendant
iterators, ->post_create() can be used to implement reliable state
inheritance. It will be explained with the descendant iterators.
v2: Added a paragraph about its future use w/ descendant iterators per
Michal.
v3: Forgot to add ->post_create() invocation to cgroup_load_subsys().
Fixed.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Glauber Costa <glommer@parallels.com>
This commit adds a per-RCU-flavor "rcuexp" file that dumps out
statistics for synchonize_sched_expedited().
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This commit removes the old debugfs interfaces, so that the new
directory-per-RCU-flavor versions remain. Because the RCU flavor is
given by the directory name, there is no need to print it out, so remove
the name from the printout.
Signed-off-by: Michael Wang <wangyun@linux.vnet.ibm.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This patch add new 'rcuhier' to each flavor's folder, now we could use:
'cat /debugfs/rcu/rsp/rcuhier'
to get the selected rsp info.
Signed-off-by: Michael Wang <wangyun@linux.vnet.ibm.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This patch add new 'rcugp' to each flavor's folder, now we could use:
'cat /debugfs/rcu/rsp/rcugp'
to get the selected rsp info.
Signed-off-by: Michael Wang <wangyun@linux.vnet.ibm.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This patch add new 'rcuboost' to each flavor's folder, now we could use:
'cat /debugfs/rcu/rsp/rcuboost'
to get the selected rsp info.
Signed-off-by: Michael Wang <wangyun@linux.vnet.ibm.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This patch add new 'rcubarrier' to each flavor's folder, now we could use:
'cat /debugfs/rcu/rsp/rcubarrier'
to get the selected rsp info.
Signed-off-by: Michael Wang <wangyun@linux.vnet.ibm.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The rcu_state structure's ->completed field is unsigned long, so this
commit adjusts show_one_rcugp()'s printf() format to suit. Also add
the required ACCESS_ONCE() directives while we are in this function.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This patch removes the interface "rcudata.csv" since it is apparently
not used.
Signed-off-by: Michael Wang <wangyun@linux.vnet.ibm.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This patch removed the old RCU debugfs interface and replaced it with
the new one.
Signed-off-by: Michael Wang <wangyun@linux.vnet.ibm.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This patch implements the new 'rcu_pending' interface under each rsp
directory, by using the 'CPU units sequence reading', thus avoiding loss
of tracing data.
Signed-off-by: Michael Wang <wangyun@linux.vnet.ibm.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This patch implements the new 'rcudata.csv' interface under each rsp
directory, by using the 'CPU units sequence reading', thus avoiding loss
of tracing data.
Signed-off-by: Michael Wang <wangyun@linux.vnet.ibm.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This patch implements the new 'rcudata' interface under each rsp
directory, by using the 'CPU units sequence reading', thus avoiding loss
of tracing data.
Signed-off-by: Michael Wang <wangyun@linux.vnet.ibm.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This patch add the fundamental facility used by the following patches, so we
can implement the 'CPU units sequence reading' later.
This helps us avoid losing data when there are too many CPUs and too
small of a buffer, since this new approach allows userspace to read out
the data one CPU at a time. Thus, if the buffer is not large enough,
userspace will get whatever CPUs fit, and can then issue another read
for the remainder of the data.
Signed-off-by: Michael Wang <wangyun@linux.vnet.ibm.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This patch will create subdirectory according to each flavor of rcu, the new
structure will be:
/debugfs/rcu/ -> rsp_0
-> rsp_1
-> ...
So we can go to '/debugfs/rcu/rsp_0' and get the cpu info of rsp_0 there.
The flavors of RCU are currently rcu_bh, rcu_preempt, and rcu_sched.
Signed-off-by: Michael Wang <wangyun@linux.vnet.ibm.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This commit adds the counters to rcu_state and updates them in
synchronize_rcu_expedited() to provide the data needed for debugfs
tracing.
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tracing (debugfs) of expedited RCU primitives is required, which in turn
requires that the relevant data be located where the tracing code can find
it, not in its current static global variables in kernel/rcutree.c.
This commit therefore moves sync_sched_expedited_started and
sync_sched_expedited_done to the rcu_state structure, as fields
->expedited_start and ->expedited_done, respectively.
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
There is a counter scheme similar to ticket locking that
synchronize_sched_expedited() uses to service multiple concurrent
callers with the same expedited grace period. Upon entry, a
sync_sched_expedited_started variable is atomically incremented,
and upon completion of a expedited grace period a separate
sync_sched_expedited_done variable is atomically incremented.
However, if a synchronize_sched_expedited() is delayed while
in try_stop_cpus(), concurrent invocations will increment the
sync_sched_expedited_started counter, which will eventually overflow.
If the original synchronize_sched_expedited() resumes execution just
as the counter overflows, a concurrent invocation could incorrectly
conclude that an expedited grace period elapsed in zero time, which
would be bad. One could rely on counter size to prevent this from
happening in practice, but the goal is to formally validate this
code, so it needs to be fixed anyway.
This commit therefore checks the gap between the two counters before
incrementing sync_sched_expedited_started, and if the gap is too
large, does a normal grace period instead. Overflow is thus only
possible if there are more than about 3.5 billion threads on 32-bit
systems, which can be excluded until such time as task_struct fits
into a single byte and 4G/4G patches are accepted into mainline.
It is also easy to encode this limitation into mechanical theorem
provers.
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The ->onofflock field in the rcu_state structure at one time synchronized
CPU-hotplug operations for RCU. However, its scope has decreased over time
so that it now only protects the lists of orphaned RCU callbacks. This
commit therefore renames it to ->orphan_lock to reflect its current use.
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
'start' is set to buf + buflen and do the '--' immediately.
Just set it to 'buf + buflen - 1' directly.
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Li Zefan <lizefan@huawei.com>
Pull rmdir updates into for-3.8 so that further callback updates can
be put on top. This pull created a trivial conflict between the
following two commits.
8c7f6edbda ("cgroup: mark subsystems with broken hierarchy support and whine if cgroups are nested for them")
ed95779340 ("cgroup: kill cgroup_subsys->__DEPRECATED_clear_css_refs")
The former added a field to cgroup_subsys and the latter removed one
from it. They happen to be colocated causing the conflict. Keeping
what's added and removing what's removed resolves the conflict.
Signed-off-by: Tejun Heo <tj@kernel.org>
All ->pre_destory() implementations return 0 now, which is the only
allowed return value. Make it return void.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
CGRP_WAIT_ON_RMDIR is another kludge which was added to make cgroup
destruction rollback somewhat working. cgroup_rmdir() used to drain
CSS references and CGRP_WAIT_ON_RMDIR and the associated waitqueue and
helpers were used to allow the task performing rmdir to wait for the
next relevant event.
Unfortunately, the wait is visible to controllers too and the
mechanism got exposed to memcg by 887032670d ("cgroup avoid permanent
sleep at rmdir").
Now that the draining and retries are gone, CGRP_WAIT_ON_RMDIR is
unnecessary. Remove it and all the mechanisms supporting it. Note
that memcontrol.c changes are essentially revert of 887032670d
("cgroup avoid permanent sleep at rmdir").
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Because ->pre_destroy() could fail and can't be called under
cgroup_mutex, cgroup destruction did something very ugly.
1. Grab cgroup_mutex and verify it can be destroyed; fail otherwise.
2. Release cgroup_mutex and call ->pre_destroy().
3. Re-grab cgroup_mutex and verify it can still be destroyed; fail
otherwise.
4. Continue destroying.
In addition to being ugly, it has been always broken in various ways.
For example, memcg ->pre_destroy() expects the cgroup to be inactive
after it's done but tasks can be attached and detached between #2 and
#3 and the conditions that memcg verified in ->pre_destroy() might no
longer hold by the time control reaches #3.
Now that ->pre_destroy() is no longer allowed to fail. We can switch
to the following.
1. Grab cgroup_mutex and verify it can be destroyed; fail otherwise.
2. Deactivate CSS's and mark the cgroup removed thus preventing any
further operations which can invalidate the verification from #1.
3. Release cgroup_mutex and call ->pre_destroy().
4. Re-grab cgroup_mutex and continue destroying.
After this change, controllers can safely assume that ->pre_destroy()
will only be called only once for a given cgroup and, once
->pre_destroy() is called, the cgroup will stay dormant till it's
destroyed.
This removes the only reason ->pre_destroy() can fail - new task being
attached or child cgroup being created inbetween. Error out path is
removed and ->pre_destroy() invocation is open coded in
cgroup_rmdir().
v2: cgroup_call_pre_destroy() removal moved to this patch per Michal.
Commit message updated per Glauber.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Glauber Costa <glommer@parallels.com>
This patch makes cgroup_create() fail if @parent is marked removed.
This is to prepare for further updates to cgroup_rmdir() path.
Note that this change isn't strictly necessary. cgroup can only be
created via mkdir and the removed marking and dentry removal happen
without releasing cgroup_mutex, so cgroup_create() can never race with
cgroup_rmdir(). Even after the scheduled updates to cgroup_rmdir(),
cgroup_mkdir() and cgroup_rmdir() are synchronized by i_mutex
rendering the added liveliness check unnecessary.
Do it anyway such that locking is contained inside cgroup proper and
we don't get nasty surprises if we ever grow another caller of
cgroup_create().
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Li Zefan <lizefan@huawei.com>
CSS_REMOVED is one of the several contortions which were necessary to
support css reference draining on cgroup removal. All css->refcnts
which need draining should be deactivated and verified to equal zero
atomically w.r.t. css_tryget(). If any one isn't zero, all refcnts
needed to be re-activated and css_tryget() shouldn't fail in the
process.
This was achieved by letting css_tryget() busy-loop until either the
refcnt is reactivated (failed removal attempt) or CSS_REMOVED is set
(committing to removal).
Now that css refcnt draining is no longer used, there's no need for
atomic rollback mechanism. css_tryget() simply can look at the
reference count and fail if it's deactivated - it's never getting
re-activated.
This patch removes CSS_REMOVED and updates __css_tryget() to fail if
the refcnt is deactivated. As deactivation and removal are a single
step now, they no longer need to be protected against css_tryget()
happening from irq context. Remove local_irq_disable/enable() from
cgroup_rmdir().
Note that this removes css_is_removed() whose only user is VM_BUG_ON()
in memcontrol.c. We can replace it with a check on the refcnt but
given that the only use case is a debug assert, I think it's better to
simply unexport it.
v2: Comment updated and explanation on local_irq_disable/enable()
added per Michal Hocko.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Balbir Singh <bsingharora@gmail.com>
2ef37d3fe4 ("memcg: Simplify mem_cgroup_force_empty_list error
handling") removed the last user of __DEPRECATED_clear_css_refs. This
patch removes __DEPRECATED_clear_css_refs and mechanisms to support
it.
* Conditionals dependent on __DEPRECATED_clear_css_refs removed.
* cgroup_clear_css_refs() can no longer fail. All that needs to be
done are deactivating refcnts, setting CSS_REMOVED and putting the
base reference on each css. Remove cgroup_clear_css_refs() and the
failure path, and open-code the loops into cgroup_rmdir().
This patch keeps the two for_each_subsys() loops separate while open
coding them. They can be merged now but there are scheduled changes
which need them to be separate, so keep them separate to reduce the
amount of churn.
local_irq_save/restore() from cgroup_clear_css_refs() are replaced
with local_irq_disable/enable() for simplicity. This is safe as
cgroup_rmdir() is always called with IRQ enabled. Note that this IRQ
switching is necessary to ensure that css_tryget() isn't called from
IRQ context on the same CPU while lower context is between CSS
deactivation and setting CSS_REMOVED as css_tryget() would hang
forever in such cases waiting for CSS to be re-activated or
CSS_REMOVED set. This will go away soon.
v2: cgroup_call_pre_destroy() removal dropped per Michal. Commit
message updated to explain local_irq_disable/enable() conversion.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Li Zefan <lizefan@huawei.com>
No functional changes.
powerpc is the only user of arch_uprobe_enable/disable_step() helpers,
but they should die. They can not be used correctly, every arch needs
its own implementation (like x86 does). And they do not really help
even as initial-and-almost-working code, arch_uprobe_*_xol() hooks can
easily use user_enable/disable_single_step() directly.
Change arch_uprobe_*_step() to do nothing, and convert powerpc to use
ptrace helpers. This is equally wrong, powerpc needs the arch-specific
fixes.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Add trace_options to the kernel command line parameter to be able to
set options at early boot. For example, to enable stack dumps of
events, add the following:
trace_options=stacktrace
This along with the trace_event option, you can get not only
traces of the events but also the stack dumps with them.
Requested-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Have the ring buffer commit function use the irq_work infrastructure to
wake up any waiters waiting on the ring buffer for new data. The irq_work
was created for such a purpose, where doing the actual wake up at the
time of adding data is too dangerous, as an event or function trace may
be in the midst of the work queue locks and cause deadlocks. The irq_work
will either delay the action to the next timer interrupt, or trigger an IPI
to itself forcing an interrupt to do the work (in a safe location).
With irq_work, all ring buffer commits can safely do wakeups, removing
the need for the ring buffer commit "nowake" variants, which were used
by events and function tracing. All commits can now safely use the
normal commit, and the "nowake" variants can be removed.
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The tracing_enabled file was used as a quick way to stop
tracers, and try to bring down overhead for things like
the latency tracers (irqsoff, wakeup, etc). But it didn't
work that well.
The tracing_on file was created as a really fast way to
stop recording into the ftrace ring buffer and can interact
with the kernel. That is a tracing_off() call in the kernel
can disable recording of events, and then from userspace one
could echo 1 into the tracing_on file to continue it. The
tracing_enabled function did too much to allow for this.
The tracing_on has taken over as a way to start and stop tracing
and the tracing_enabled file should not be used. But because of
its existance, it still confuses people. Over a year ago the
following commit was added:
commit 6752ab4a9c
Author: Steven Rostedt <srostedt@redhat.com>
Date: Tue Feb 8 13:54:06 2011 -0500
tracing: Deprecate tracing_enabled for tracing_on
This commit added a WARN_ON() if the tracing_enabled file's variable
was changed. After this was added, only LatencyTop complained, and
they soon fixed their tool as there was no reason that LatencyTop
should touch this file as it was using the perf ring buffers which
this file does not interact with. But since that time no one else
has complained about this WARN_ON(). Thus it is safe to assume that
this file is no longer needed. Time to get rid of it.
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The tracing_enabled file has been deprecated as it never was able
to serve its purpose well. The tracing_on file has taken over.
Instead of having code to keep tracing_enabled, have the tracing_enabled
file just set tracing_on, and remove the tracing_enabled variable.
This allows us to remove the tracing_enabled file. The reason that
the remove is in a different change set and not removed here is
in case we find some lonely userspace tool that requires the file
to exist. Then the removal patch will get reverted, but this one
will not.
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The function register_tracer() is only used by kernel core code,
that never needs to remove the tracer. As trace_events have become
the main way to add new tracing to the kernel, the need to
unregister a tracer has diminished. Remove the unused function
unregister_tracer(). If a need arises where we need it, then we
can always add it back.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The open function used by available_events is the same as set_event even
though it uses different seq functions. This causes a side effect of
writing into available_events clearing all events, even though
available_events is suppose to be read only.
There's no reason to keep a single function for just the open and have
both use different functions for everything else. It is a little
confusing and causes strange behavior. Just have each have their own
function.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
ring_buffer_oldest_event_ts() should return a value of u64 type, because
ring_buffer_per_cpu->buffer_page->buffer_data_page->time_stamp is u64 type.
Link: http://lkml.kernel.org/r/1349998076-15495-5-git-send-email-dhsharp@google.com
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Vaibhav Nagarnaik <vnagarnaik@google.com>
Signed-off-by: Yoshihiro YUNOMAE <yoshihiro.yunomae.ez@hitachi.com>
Signed-off-by: David Sharp <dhsharp@google.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Because the "tsc" clock isn't in nanoseconds, the ring buffer must be
reset when changing clocks so that incomparable timestamps don't end up
in the same trace.
Tested: Confirmed switching clocks resets the trace buffer.
Google-Bug-Id: 6980623
Link: http://lkml.kernel.org/r/1349998076-15495-3-git-send-email-dhsharp@google.com
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: David Sharp <dhsharp@google.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
This patch removes the timecompare code from the kernel. The top five
reasons to do this are:
1. There are no more users of this code.
2. The original idea was a bit weak.
3. The original author has disappeared.
4. The code was not general purpose but tuned to a particular hardware,
5. There are better ways to accomplish clock synchronization.
Signed-off-by: Richard Cochran <richardcochran@gmail.com>
Acked-by: John Stultz <john.stultz@linaro.org>
Tested-by: Bob Liu <lliubbo@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In the comments of function tick_sched_timer(), the sentence
"timer->base->cpu_base->lock held" is not right.
In function __run_hrtimer(), before call timer->function(),
the cpu_base->lock has been unlocked.
Signed-off-by: liu chuansheng <chuansheng.liu@intel.com>
Cc: fei.li@intel.com
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1351098455.15558.1421.camel@cliu38-desktop-build
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
As irq_thread_check_affinity is called ONLY inside the while loop in
the irq thread, the core affinity is set only when an interrupt
occurs. This patch sets the core affinity right after the irq thread
is created and before it waits for interrupts. In real-tiime targets
that do not typically change the core affinity of irqs during
run-time, this patch will save additional latency of an irq thread in
setting the core affinity during the first interrupt occurrence for
that irq.
Signed-off-by: Sankara S Muthukrishnan <sankara.m@ni.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/CAFQPvXeVZ858WFYimEU5uvLNxLDd6bJMmqWihFmbCf3ntokz0A@mail.gmail.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Attempts to retrigger nested threaded IRQs currently fail because they
have no primary handler. In order to support retrigger of nested
IRQs, the parent IRQ needs to be retriggered.
To fix, when an IRQ needs to be resent, if the interrupt has a parent
IRQ and runs in the context of the parent IRQ, then resend the parent.
Also, handle_nested_irq() needs to clear the replay flag like the
other handlers, otherwise check_irq_resend() will set it and it will
never be cleared. Without clearing, it results in the first resend
working fine, but check_irq_resend() returning early on subsequent
resends because the replay flag is still set.
Problem discovered on ARM/OMAP platforms where a nested IRQ that's
also a wakeup IRQ happens late in suspend and needed to be retriggered
during the resume process.
[khilman@ti.com: changelog edits, clear IRQS_REPLAY in handle_nested_irq()]
Reported-by: Kevin Hilman <khilman@ti.com>
Tested-by: Kevin Hilman <khilman@ti.com>
Cc: linux-arm-kernel@lists.infradead.org
Link: http://lkml.kernel.org/r/1350425269-11489-1-git-send-email-khilman@deeprootsystems.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Siddhesh analyzed a failure in the take over of pi futexes in case the
owner died and provided a workaround.
See: http://sourceware.org/bugzilla/show_bug.cgi?id=14076
The detailed problem analysis shows:
Futex F is initialized with PTHREAD_PRIO_INHERIT and
PTHREAD_MUTEX_ROBUST_NP attributes.
T1 lock_futex_pi(F);
T2 lock_futex_pi(F);
--> T2 blocks on the futex and creates pi_state which is associated
to T1.
T1 exits
--> exit_robust_list() runs
--> Futex F userspace value TID field is set to 0 and
FUTEX_OWNER_DIED bit is set.
T3 lock_futex_pi(F);
--> Succeeds due to the check for F's userspace TID field == 0
--> Claims ownership of the futex and sets its own TID into the
userspace TID field of futex F
--> returns to user space
T1 --> exit_pi_state_list()
--> Transfers pi_state to waiter T2 and wakes T2 via
rt_mutex_unlock(&pi_state->mutex)
T2 --> acquires pi_state->mutex and gains real ownership of the
pi_state
--> Claims ownership of the futex and sets its own TID into the
userspace TID field of futex F
--> returns to user space
T3 --> observes inconsistent state
This problem is independent of UP/SMP, preemptible/non preemptible
kernels, or process shared vs. private. The only difference is that
certain configurations are more likely to expose it.
So as Siddhesh correctly analyzed the following check in
futex_lock_pi_atomic() is the culprit:
if (unlikely(ownerdied || !(curval & FUTEX_TID_MASK))) {
We check the userspace value for a TID value of 0 and take over the
futex unconditionally if that's true.
AFAICT this check is there as it is correct for a different corner
case of futexes: the WAITERS bit became stale.
Now the proposed change
- if (unlikely(ownerdied || !(curval & FUTEX_TID_MASK))) {
+ if (unlikely(ownerdied ||
+ !(curval & (FUTEX_TID_MASK | FUTEX_WAITERS)))) {
solves the problem, but it's not obvious why and it wreckages the
"stale WAITERS bit" case.
What happens is, that due to the WAITERS bit being set (T2 is blocked
on that futex) it enforces T3 to go through lookup_pi_state(), which
in the above case returns an existing pi_state and therefor forces T3
to legitimately fight with T2 over the ownership of the pi_state (via
pi_state->mutex). Probelm solved!
Though that does not work for the "WAITERS bit is stale" problem
because if lookup_pi_state() does not find existing pi_state it
returns -ERSCH (due to TID == 0) which causes futex_lock_pi() to
return -ESRCH to user space because the OWNER_DIED bit is not set.
Now there is a different solution to that problem. Do not look at the
user space value at all and enforce a lookup of possibly available
pi_state. If pi_state can be found, then the new incoming locker T3
blocks on that pi_state and legitimately races with T2 to acquire the
rt_mutex and the pi_state and therefor the proper ownership of the
user space futex.
lookup_pi_state() has the correct order of checks. It first tries to
find a pi_state associated with the user space futex and only if that
fails it checks for futex TID value = 0. If no pi_state is available
nothing can create new state at that point because this happens with
the hash bucket lock held.
So the above scenario changes to:
T1 lock_futex_pi(F);
T2 lock_futex_pi(F);
--> T2 blocks on the futex and creates pi_state which is associated
to T1.
T1 exits
--> exit_robust_list() runs
--> Futex F userspace value TID field is set to 0 and
FUTEX_OWNER_DIED bit is set.
T3 lock_futex_pi(F);
--> Finds pi_state and blocks on pi_state->rt_mutex
T1 --> exit_pi_state_list()
--> Transfers pi_state to waiter T2 and wakes it via
rt_mutex_unlock(&pi_state->mutex)
T2 --> acquires pi_state->mutex and gains ownership of the pi_state
--> Claims ownership of the futex and sets its own TID into the
userspace TID field of futex F
--> returns to user space
This covers all gazillion points on which T3 might come in between
T1's exit_robust_list() clearing the TID field and T2 fixing it up. It
also solves the "WAITERS bit stale" problem by forcing the take over.
Another benefit of changing the code this way is that it makes it less
dependent on untrusted user space values and therefor minimizes the
possible wreckage which might be inflicted.
As usual after staring for too long at the futex code my brain hurts
so much that I really want to ditch that whole optimization of
avoiding the syscall for the non contended case for PI futexes and rip
out the maze of corner case handling code. Unfortunately we can't as
user space relies on that existing behaviour, but at least thinking
about it helps me to preserve my mental sanity. Maybe we should
nevertheless :)
Reported-and-tested-by: Siddhesh Poyarekar <siddhesh.poyarekar@gmail.com>
Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1210232138540.2756@ionos
Acked-by: Darren Hart <dvhart@linux.intel.com>
Cc: stable@vger.kernel.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The functions defined in include/trace/syscalls.h are not used directly
since struct ftrace_event_class was introduced. Remove them from the
header file and rearrange the ftrace_event_class declarations in
trace_syscalls.c.
Link: http://lkml.kernel.org/r/1339112785-21806-2-git-send-email-vnagarnaik@google.com
Signed-off-by: Vaibhav Nagarnaik <vnagarnaik@google.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Remove ftrace_format_syscall() declaration; it is neither defined nor
used. Also update a comment and formatting.
Link: http://lkml.kernel.org/r/1339112785-21806-1-git-send-email-vnagarnaik@google.com
Signed-off-by: David Sharp <dhsharp@google.com>
Signed-off-by: Vaibhav Nagarnaik <vnagarnaik@google.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Whenever an event is registered, the comm of tasks are saved at
every task switch instead of saving them at every event. But if
an event isn't executed much, the comm cache will be filled up
by tasks that did not record the event and you lose out on the comms
that did.
Here's an example, if you enable the following events:
echo 1 > /debug/tracing/events/kvm/kvm_cr/enable
echo 1 > /debug/tracing/events/net/net_dev_xmit/enable
Note, there's no kvm running on this machine so the first event will
never be triggered, but because it is enabled, the storing of comms
will continue. If we now disable the network event:
echo 0 > /debug/tracing/events/net/net_dev_xmit/enable
and look at the trace:
cat /debug/tracing/trace
sshd-2672 [001] ..s2 375.731616: net_dev_xmit: dev=eth0 skbaddr=ffff88005cbb6de0 len=242 rc=0
sshd-2672 [001] ..s1 375.731617: net_dev_xmit: dev=br0 skbaddr=ffff88005cbb6de0 len=242 rc=0
sshd-2672 [001] ..s2 375.859356: net_dev_xmit: dev=eth0 skbaddr=ffff88005cbb6de0 len=242 rc=0
sshd-2672 [001] ..s1 375.859357: net_dev_xmit: dev=br0 skbaddr=ffff88005cbb6de0 len=242 rc=0
sshd-2672 [001] ..s2 375.947351: net_dev_xmit: dev=eth0 skbaddr=ffff88005cbb6de0 len=242 rc=0
sshd-2672 [001] ..s1 375.947352: net_dev_xmit: dev=br0 skbaddr=ffff88005cbb6de0 len=242 rc=0
sshd-2672 [001] ..s2 376.035383: net_dev_xmit: dev=eth0 skbaddr=ffff88005cbb6de0 len=242 rc=0
sshd-2672 [001] ..s1 376.035383: net_dev_xmit: dev=br0 skbaddr=ffff88005cbb6de0 len=242 rc=0
sshd-2672 [001] ..s2 377.563806: net_dev_xmit: dev=eth0 skbaddr=ffff88005cbb6de0 len=226 rc=0
sshd-2672 [001] ..s1 377.563807: net_dev_xmit: dev=br0 skbaddr=ffff88005cbb6de0 len=226 rc=0
sshd-2672 [001] ..s2 377.563834: net_dev_xmit: dev=eth0 skbaddr=ffff88005cbb6be0 len=114 rc=0
sshd-2672 [001] ..s1 377.563842: net_dev_xmit: dev=br0 skbaddr=ffff88005cbb6be0 len=114 rc=0
We see that process 2672 which triggered the events has the comm "sshd".
But if we run hackbench for a bit and look again:
cat /debug/tracing/trace
<...>-2672 [001] ..s2 375.731616: net_dev_xmit: dev=eth0 skbaddr=ffff88005cbb6de0 len=242 rc=0
<...>-2672 [001] ..s1 375.731617: net_dev_xmit: dev=br0 skbaddr=ffff88005cbb6de0 len=242 rc=0
<...>-2672 [001] ..s2 375.859356: net_dev_xmit: dev=eth0 skbaddr=ffff88005cbb6de0 len=242 rc=0
<...>-2672 [001] ..s1 375.859357: net_dev_xmit: dev=br0 skbaddr=ffff88005cbb6de0 len=242 rc=0
<...>-2672 [001] ..s2 375.947351: net_dev_xmit: dev=eth0 skbaddr=ffff88005cbb6de0 len=242 rc=0
<...>-2672 [001] ..s1 375.947352: net_dev_xmit: dev=br0 skbaddr=ffff88005cbb6de0 len=242 rc=0
<...>-2672 [001] ..s2 376.035383: net_dev_xmit: dev=eth0 skbaddr=ffff88005cbb6de0 len=242 rc=0
<...>-2672 [001] ..s1 376.035383: net_dev_xmit: dev=br0 skbaddr=ffff88005cbb6de0 len=242 rc=0
<...>-2672 [001] ..s2 377.563806: net_dev_xmit: dev=eth0 skbaddr=ffff88005cbb6de0 len=226 rc=0
<...>-2672 [001] ..s1 377.563807: net_dev_xmit: dev=br0 skbaddr=ffff88005cbb6de0 len=226 rc=0
<...>-2672 [001] ..s2 377.563834: net_dev_xmit: dev=eth0 skbaddr=ffff88005cbb6be0 len=114 rc=0
<...>-2672 [001] ..s1 377.563842: net_dev_xmit: dev=br0 skbaddr=ffff88005cbb6be0 len=114 rc=0
The stored "sshd" comm has been flushed out and we get a useless "<...>".
But by only storing comms after a trace event occurred, we can run
hackbench all day and still get the same output.
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The functon tracing_sched_wakeup_trace() does an open coded unlock
commit and save stack. This is what the trace_nowake_buffer_unlock_commit()
is for.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
If comm recording is not enabled when trace_printk() is used then
you just get this type of output:
[ adding trace_printk("hello! %d", irq); in do_IRQ ]
<...>-2843 [001] d.h. 80.812300: do_IRQ: hello! 14
<...>-2734 [002] d.h2 80.824664: do_IRQ: hello! 14
<...>-2713 [003] d.h. 80.829971: do_IRQ: hello! 14
<...>-2814 [000] d.h. 80.833026: do_IRQ: hello! 14
By enabling the comm recorder when trace_printk is enabled:
hackbench-6715 [001] d.h. 193.233776: do_IRQ: hello! 21
sshd-2659 [001] d.h. 193.665862: do_IRQ: hello! 21
<idle>-0 [001] d.h1 193.665996: do_IRQ: hello! 21
Suggested-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Since tracing is not used by 99% of Linux users, even though tracing
may be configured in, it does not make sense to allocate 1.4 Megs
per CPU for the ring buffers if they are not used. Thus, on boot up
the ring buffers are set to a minimal size until something needs the
and they are expanded.
This works well for events and tracers (function, etc), but for the
asynchronous use of trace_printk() which can write to the ring buffer
at any time, does not expand the buffers.
On boot up a check is made to see if any trace_printk() is used to
see if the trace_printk() temp buffer pages should be allocated. This
same code can be used to expand the buffers as well.
Suggested-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The existing 'overrun' counter is incremented when the ring
buffer wraps around, with overflow on (the default). We wanted
a way to count requests lost from the buffer filling up with
overflow off, too. I decided to add a new counter instead
of retro-fitting the existing one because it seems like a
different statistic to count conceptually, and also because
of how the code was structured.
Link: http://lkml.kernel.org/r/1310765038-26399-1-git-send-email-slavapestov@google.com
Signed-off-by: Slava Pestov <slavapestov@google.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
print_max and use_max_tr in struct tracer are "int" variables and
used like flags. This is wasteful, so change the type to "bool".
Link: http://lkml.kernel.org/r/20121002082710.9807.86393.stgit@falsita
Signed-off-by: Hiraku Toyooka <hiraku.toyooka.gu@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
There's times during debugging that it is helpful to see traces of early
boot functions. But the tracers are initialized at device_initcall()
which is quite late during the boot process. Setting the kernel command
line parameter ftrace=function will not show anything until the function
tracer is initialized. This prevents being able to trace functions before
device_initcall().
There's no reason that the tracers need to be initialized so late in the
boot process. Move them up to core_initcall() as they still need to come
after early_initcall() which initializes the tracing buffers.
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Masaki found and patched a kallsyms issue: the last symbol in a
module's symtab wasn't transferred. This is because we manually copy
the zero'th entry (which is always empty) then copy the rest in a loop
starting at 1, though from src[0]. His fix was minimal, I prefer to
rewrite the loops in more standard form.
There are two loops: one to get the size, and one to copy. Make these
identical: always count entry 0 and any defined symbol in an allocated
non-init section.
This bug exists since the following commit was introduced.
module: reduce symbol table for loaded modules (v2)
commit: 4a4962263f
LKML: http://lkml.org/lkml/2012/10/24/27
Reported-by: Masaki Kimura <masaki.kimura.kz@hitachi.com>
Cc: stable@kernel.org
Due to these two commits:
8323f26ce3 sched: Fix race in task_group()
800d4d30c8 sched, autogroup: Stop going ahead if autogroup is disabled
... autogroup scheduling's dynamic knobs are wrecked.
With both patches applied, all you have to do to crash a box is
disable autogroup during boot up, then reboot.. boom, NULL pointer
dereference due to 800d4d30 not allowing autogroup to move things,
and 8323f26ce making that the only way to switch runqueues.
Remove most of the (dysfunctional) knobs and turn the remaining
sched_autogroup_enabled knob readonly.
If the user fiddles with cgroups hereafter, once tasks
are moved, autogroup won't mess with them again unless
they call setsid().
No knobs, no glitz, nada, just a cute little thing folks can
turn on if they don't want to muck about with cgroups and/or
systemd.
Signed-off-by: Mike Galbraith <efault@gmx.de>
Cc: Xiaotian Feng <xtfeng@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Xiaotian Feng <dannyfeng@tencent.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: <stable@vger.kernel.org> # v3.6
Link: http://lkml.kernel.org/r/1351451963.4999.8.camel@maggy.simpson.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
I've been trying to get hardware breakpoints with perf to work
on POWER7 but I'm getting the following:
% perf record -e mem:0x10000000 true
Error: sys_perf_event_open() syscall returned with 28 (No space left on device). /bin/dmesg may provide additional information.
Fatal: No CONFIG_PERF_EVENTS=y kernel support configured?
true: Terminated
(FWIW adding -a and it works fine)
Debugging it seems that __reserve_bp_slot() is returning ENOSPC
because it thinks there are no free breakpoint slots on this
CPU.
I have a 2 CPUs, so perf userspace is doing two perf_event_open
syscalls to add a counter to each CPU [1]. The first syscall
succeeds but the second is failing.
On this second syscall, fetch_bp_busy_slots() sets slots.pinned
to be 1, despite there being no breakpoint on this CPU. This is
because the call the task_bp_pinned, checks all CPUs, rather
than just the current CPU. POWER7 only has one hardware
breakpoint per CPU (ie. HBP_NUM=1), so we return ENOSPC.
The following patch fixes this by checking the associated CPU
for each breakpoint in task_bp_pinned. I'm not familiar with
this code, so it's provided as a reference to the above issue.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Michael Ellerman <michael@ellerman.id.au>
Cc: Jovi Zhang <bookjovi@gmail.com>
Cc: K Prasad <prasad@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/r/1351268936-2956-1-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
vtime_account() doesn't have the same role in
CONFIG_VIRT_CPU_ACCOUNTING and CONFIG_IRQ_TIME_ACCOUNTING.
In the first case it handles time accounting in any context. In
the second case it only handles irq time accounting.
So when vtime_account() is called from outside vtime_account_irq_*()
this call is pointless to CONFIG_IRQ_TIME_ACCOUNTING.
To fix the confusion, change vtime_account() to irqtime_account_irq()
in CONFIG_IRQ_TIME_ACCOUNTING. This way we ensure future account_vtime()
calls won't waste useless cycles in the irqtime APIs.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
With CONFIG_VIRT_CPU_ACCOUNTING, when vtime_account()
is called in irq entry/exit, we perform a check on the
context: if we are interrupting the idle task we
account the pending cputime to idle, otherwise account
to system time or its sub-areas: tsk->stime, hardirq time,
softirq time, ...
However this check for idle only concerns the hardirq entry
and softirq entry:
* Hardirq may directly interrupt the idle task, in which case
we need to flush the pending CPU time to idle.
* The idle task may be directly interrupted by a softirq if
it calls local_bh_enable(). There is probably no such call
in any idle task but we need to cover every case. Ksoftirqd
is not concerned because the idle time is flushed on context
switch and softirq in the end of hardirq have the idle time
already flushed from the hardirq entry.
In the other cases we always account to system/irq time:
* On hardirq exit we account the time to hardirq time.
* On softirq exit we account the time to softirq time.
To optimize this and avoid the indirect call to vtime_account()
and the checks it performs, specialize the vtime irq APIs and
only perform the check on irq entry. Irq exit can directly call
vtime_account_system().
CONFIG_IRQ_TIME_ACCOUNTING behaviour doesn't change and directly
maps to its own vtime_account() implementation. One may want
to take benefits from the new APIs to optimize irq time accounting
as well in the future.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
vtime_account_system() currently has only one caller with
vtime_account() which is irq safe.
Now we are going to call it from other places like kvm where
irqs are not always disabled by the time we account the cputime.
So let's make it irqsafe. The arch implementation part is now
prefixed with "__".
vtime_account_idle() arch implementation is prefixed accordingly
to stay consistent.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Use DEFINE_STATIC_SRCU() to simplify the rcutorture.c SRCU test code.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
try_to_freeze_tasks() and cgroup_freezer rely on scheduler locks
to ensure that a task doing STOPPED/TRACED -> RUNNING transition
can't escape freezing. This mostly works, but ptrace_stop() does
not necessarily call schedule(), it can change task->state back to
RUNNING and check freezing() without any lock/barrier in between.
We could add the necessary barrier, but this patch changes
ptrace_stop() and do_signal_stop() to use freezable_schedule().
This fixes the race, freezer_count() and freezer_should_skip()
carefully avoid the race.
And this simplifies the code, try_to_freeze_tasks/update_if_frozen
no longer need to use task_is_stopped_or_traced() checks with the
non trivial assumptions. We can rely on the mechanism which was
specially designed to mark the sleeping task as "frozen enough".
v2: As Tejun pointed out, we can also change get_signal_to_deliver()
and move try_to_freeze() up before 'relock' label.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Merge misc fixes from Andrew Morton:
"18 total. 15 fixes and some updates to a device_cgroup patchset which
bring it up to date with the version which I should have merged in the
first place."
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (18 patches)
fs/compat_ioctl.c: VIDEO_SET_SPU_PALETTE missing error check
gen_init_cpio: avoid stack overflow when expanding
drivers/rtc/rtc-imxdi.c: add missing spin lock initialization
mm, numa: avoid setting zone_reclaim_mode unless a node is sufficiently distant
pidns: limit the nesting depth of pid namespaces
drivers/dma/dw_dmac: make driver's endianness configurable
mm/mmu_notifier: allocate mmu_notifier in advance
tools/testing/selftests/epoll/test_epoll.c: fix build
UAPI: fix tools/vm/page-types.c
mm/page_alloc.c:alloc_contig_range(): return early for err path
rbtree: include linux/compiler.h for definition of __always_inline
genalloc: stop crashing the system when destroying a pool
backlight: ili9320: add missing SPI dependency
device_cgroup: add proper checking when changing default behavior
device_cgroup: stop using simple_strtoul()
device_cgroup: rename deny_all to behavior
cgroup: fix invalid rcu dereference
mm: fix XFS oops due to dirty pages without buffers on s390
If one includes documentation for an external tool, it should be
correct. This is not:
1. Overriding the input to rngd should typically be neither
necessary nor desired. This is especially so since newer
versions of rngd support a number of different *types* of sources.
2. The default kernel-exported device is called /dev/hwrng not
/dev/hwrandom nor /dev/hw_random (both of which were used in the
past; however, kernel and udev seem to have converged on
/dev/hwrng.)
Overall it is better if the documentation for rngd is kept with rngd
rather than in a kernel Makefile.
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Jeff Garzik <jgarzik@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
'struct pid' is a "variable sized struct" - a header with an array of
upids at the end.
The size of the array depends on a level (depth) of pid namespaces. Now a
level of pidns is not limited, so 'struct pid' can be more than one page.
Looks reasonable, that it should be less than a page. MAX_PIS_NS_LEVEL is
not calculated from PAGE_SIZE, because in this case it depends on
architectures, config options and it will be reduced, if someone adds a
new fields in struct pid or struct upid.
I suggest to set MAX_PIS_NS_LEVEL = 32, because it saves ability to expand
"struct pid" and it's more than enough for all known for me use-cases.
When someone finds a reasonable use case, we can add a config option or a
sysctl parameter.
In addition it will reduce the effect of another problem, when we have
many nested namespaces and the oldest one starts dying.
zap_pid_ns_processe will be called for each namespace and find_vpid will
be called for each process in a namespace. find_vpid will be called
minimum max_level^2 / 2 times. The reason of that is that when we found a
bit in pidmap, we can't determine this pidns is top for this process or it
isn't.
vpid is a heavy operation, so a fork bomb, which create many nested
namespace, can make a system inaccessible for a long time. For example my
system becomes inaccessible for a few minutes with 4000 processes.
[akpm@linux-foundation.org: return -EINVAL in response to excessive nesting, not -ENOMEM]
Signed-off-by: Andrew Vagin <avagin@openvz.org>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There don't have any 'r' prefix in uprobe event naming, remove it.
Signed-off-by: Jovi Zhang <bookjovi@gmail.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Pull cgroup fixes from Tejun Heo:
"This pull request contains three fixes.
Two are reverts of task_lock() removal in cgroup fork path. The
optimizations incorrectly assumed that threadgroup_lock can protect
process forks (as opposed to thread creations) too. Further cleanup
of cgroup fork path is scheduled.
The third fixes cgroup emptiness notification loss."
* 'for-3.7-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
Revert "cgroup: Remove task_lock() from cgroup_post_fork()"
Revert "cgroup: Drop task_lock(parent) on cgroup_fork()"
cgroup: notify_on_release may not be triggered in some cases
Pull workqueue fix from Tejun Heo:
"This pull request contains one patch from Dan Magenheimer to fix
cancel_delayed_work() regression introduced by its reimplementation
using try_to_grab_pending(). The reimplementation made it incorrectly
return %true when the work item is idle.
There aren't too many consumers of the return value but it broke at
least ramster."
* 'for-3.7-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
workqueue: cancel_delayed_work() should return %false if work item is idle
57b30ae77b ("workqueue: reimplement cancel_delayed_work() using
try_to_grab_pending()") made cancel_delayed_work() always return %true
unless someone else is also trying to cancel the work item, which is
broken - if the target work item is idle, the return value should be
%false.
try_to_grab_pending() indicates that the target work item was idle by
zero return value. Use it for return. Note that this brings
cancel_delayed_work() in line with __cancel_work_timer() in return
value handling.
Signed-off-by: Dan Magenheimer <dan.magenheimer@oracle.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
LKML-Reference: <444a6439-b1a4-4740-9e7e-bc37267cfe73@default>
Not a big deal, but since other __get_key_name() callers
use it lets be consistent.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20121020190519.GH25467@moon
Signed-off-by: Ingo Molnar <mingo@kernel.org>
While per-entity load-tracking is generally useful, beyond computing shares
distribution, e.g. runnable based load-balance (in progress), governors,
power-management, etc.
These facilities are not yet consumers of this data. This may be trivially
reverted when the information is required; but avoid paying the overhead for
calculations we will not use until then.
Signed-off-by: Paul Turner <pjt@google.com>
Reviewed-by: Ben Segall <bsegall@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20120823141507.422162369@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
__update_entity_runnable_avg forms the core of maintaining an entity's runnable
load average. In this function we charge the accumulated run-time since last
update and handle appropriate decay. In some cases, e.g. a waking task, this
time interval may be much larger than our period unit.
Fortunately we can exploit some properties of our series to perform decay for a
blocked update in constant time and account the contribution for a running
update in essentially-constant* time.
[*]: For any running entity they should be performing updates at the tick which
gives us a soft limit of 1 jiffy between updates, and we can compute up to a
32 jiffy update in a single pass.
C program to generate the magic constants in the arrays:
#include <math.h>
#include <stdio.h>
#define N 32
#define WMULT_SHIFT 32
const long WMULT_CONST = ((1UL << N) - 1);
double y;
long runnable_avg_yN_inv[N];
void calc_mult_inv() {
int i;
double yn = 0;
printf("inverses\n");
for (i = 0; i < N; i++) {
yn = (double)WMULT_CONST * pow(y, i);
runnable_avg_yN_inv[i] = yn;
printf("%2d: 0x%8lx\n", i, runnable_avg_yN_inv[i]);
}
printf("\n");
}
long mult_inv(long c, int n) {
return (c * runnable_avg_yN_inv[n]) >> WMULT_SHIFT;
}
void calc_yn_sum(int n)
{
int i;
double sum = 0, sum_fl = 0, diff = 0;
/*
* We take the floored sum to ensure the sum of partial sums is never
* larger than the actual sum.
*/
printf("sum y^n\n");
printf(" %8s %8s %8s\n", "exact", "floor", "error");
for (i = 1; i <= n; i++) {
sum = (y * sum + y * 1024);
sum_fl = floor(y * sum_fl+ y * 1024);
printf("%2d: %8.0f %8.0f %8.0f\n", i, sum, sum_fl,
sum_fl - sum);
}
printf("\n");
}
void calc_conv(long n) {
long old_n;
int i = -1;
printf("convergence (LOAD_AVG_MAX, LOAD_AVG_MAX_N)\n");
do {
old_n = n;
n = mult_inv(n, 1) + 1024;
i++;
} while (n != old_n);
printf("%d> %ld\n", i - 1, n);
printf("\n");
}
void main() {
y = pow(0.5, 1/(double)N);
calc_mult_inv();
calc_conv(1024);
calc_yn_sum(N);
}
[ Compile with -lm ]
Signed-off-by: Paul Turner <pjt@google.com>
Reviewed-by: Ben Segall <bsegall@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20120823141507.277808946@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Now that our measurement intervals are small (~1ms) we can amortize the posting
of update_shares() to be about each period overflow. This is a large cost
saving for frequently switching tasks.
Signed-off-by: Paul Turner <pjt@google.com>
Reviewed-by: Ben Segall <bsegall@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20120823141507.200772172@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Now that running entities maintain their own load-averages the work we must do
in update_shares() is largely restricted to the periodic decay of blocked
entities. This allows us to be a little less pessimistic regarding our
occupancy on rq->lock and the associated rq->clock updates required.
Signed-off-by: Paul Turner <pjt@google.com>
Reviewed-by: Ben Segall <bsegall@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20120823141507.133999170@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Now that the machinery in place is in place to compute contributed load in a
bottom up fashion; replace the shares distribution code within update_shares()
accordingly.
Signed-off-by: Paul Turner <pjt@google.com>
Reviewed-by: Ben Segall <bsegall@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20120823141507.061208672@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
With bandwidth control tracked entities may cease execution according to user
specified bandwidth limits. Charging this time as either throttled or blocked
however, is incorrect and would falsely skew in either direction.
What we actually want is for any throttled periods to be "invisible" to
load-tracking as they are removed from the system for that interval and
contribute normally otherwise.
Do this by moderating the progression of time to omit any periods in which the
entity belonged to a throttled hierarchy.
Signed-off-by: Paul Turner <pjt@google.com>
Reviewed-by: Ben Segall <bsegall@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20120823141506.998912151@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Entities of equal weight should receive equitable distribution of cpu time.
This is challenging in the case of a task_group's shares as execution may be
occurring on multiple cpus simultaneously.
To handle this we divide up the shares into weights proportionate with the load
on each cfs_rq. This does not however, account for the fact that the sum of
the parts may be less than one cpu and so we need to normalize:
load(tg) = min(runnable_avg(tg), 1) * tg->shares
Where runnable_avg is the aggregate time in which the task_group had runnable
children.
Signed-off-by: Paul Turner <pjt@google.com>
Reviewed-by: Ben Segall <bsegall@google.com>.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20120823141506.930124292@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Unlike task entities who have a fixed weight, group entities instead own a
fraction of their parenting task_group's shares as their contributed weight.
Compute this fraction so that we can correctly account hierarchies and shared
entity nodes.
Signed-off-by: Paul Turner <pjt@google.com>
Reviewed-by: Ben Segall <bsegall@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20120823141506.855074415@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Maintain a global running sum of the average load seen on each cfs_rq belonging
to each task group so that it may be used in calculating an appropriate
shares:weight distribution.
Signed-off-by: Paul Turner <pjt@google.com>
Reviewed-by: Ben Segall <bsegall@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20120823141506.792901086@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When a running entity blocks we migrate its tracked load to
cfs_rq->blocked_runnable_avg. In the sleep case this occurs while holding
rq->lock and so is a natural transition. Wake-ups however, are potentially
asynchronous in the presence of migration and so special care must be taken.
We use an atomic counter to track such migrated load, taking care to match this
with the previously introduced decay counters so that we don't migrate too much
load.
Signed-off-by: Paul Turner <pjt@google.com>
Reviewed-by: Ben Segall <bsegall@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20120823141506.726077467@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Since we are now doing bottom up load accumulation we need explicit
notification when a task has been re-parented so that the old hierarchy can be
updated.
Adds: migrate_task_rq(struct task_struct *p, int next_cpu)
(The alternative is to do this out of __set_task_cpu, but it was suggested that
this would be a cleaner encapsulation.)
Signed-off-by: Paul Turner <pjt@google.com>
Reviewed-by: Ben Segall <bsegall@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20120823141506.660023400@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We are currently maintaining:
runnable_load(cfs_rq) = \Sum task_load(t)
For all running children t of cfs_rq. While this can be naturally updated for
tasks in a runnable state (as they are scheduled); this does not account for
the load contributed by blocked task entities.
This can be solved by introducing a separate accounting for blocked load:
blocked_load(cfs_rq) = \Sum runnable(b) * weight(b)
Obviously we do not want to iterate over all blocked entities to account for
their decay, we instead observe that:
runnable_load(t) = \Sum p_i*y^i
and that to account for an additional idle period we only need to compute:
y*runnable_load(t).
This means that we can compute all blocked entities at once by evaluating:
blocked_load(cfs_rq)` = y * blocked_load(cfs_rq)
Finally we maintain a decay counter so that when a sleeping entity re-awakens
we can determine how much of its load should be removed from the blocked sum.
Signed-off-by: Paul Turner <pjt@google.com>
Reviewed-by: Ben Segall <bsegall@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20120823141506.585389902@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
For a given task t, we can compute its contribution to load as:
task_load(t) = runnable_avg(t) * weight(t)
On a parenting cfs_rq we can then aggregate:
runnable_load(cfs_rq) = \Sum task_load(t), for all runnable children t
Maintain this bottom up, with task entities adding their contributed load to
the parenting cfs_rq sum. When a task entity's load changes we add the same
delta to the maintained sum.
Signed-off-by: Paul Turner <pjt@google.com>
Reviewed-by: Ben Segall <bsegall@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20120823141506.514678907@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Since runqueues do not have a corresponding sched_entity we instead embed a
sched_avg structure directly.
Signed-off-by: Ben Segall <bsegall@google.com>
Reviewed-by: Paul Turner <pjt@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20120823141506.442637130@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Instead of tracking averaging the load parented by a cfs_rq, we can track
entity load directly. With the load for a given cfs_rq then being the sum
of its children.
To do this we represent the historical contribution to runnable average
within each trailing 1024us of execution as the coefficients of a
geometric series.
We can express this for a given task t as:
runnable_sum(t) = \Sum u_i * y^i, runnable_avg_period(t) = \Sum 1024 * y^i
load(t) = weight_t * runnable_sum(t) / runnable_avg_period(t)
Where: u_i is the usage in the last i`th 1024us period (approximately 1ms)
~ms and y is chosen such that y^k = 1/2. We currently choose k to be 32 which
roughly translates to about a sched period.
Signed-off-by: Paul Turner <pjt@google.com>
Reviewed-by: Ben Segall <bsegall@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20120823141506.372695337@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
In the comments of function tick_sched_timer(), the sentence
"timer->base->cpu_base->lock held" is not right.
In function __run_hrtimer(), before call timer->function(),
the cpu_base->lock has been unlocked.
Signed-off-by: liu chuansheng <chuansheng.liu@intel.com>
Cc: fei.li@intel.com
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1351098455.15558.1421.camel@cliu38-desktop-build
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Instead of BUG_ON(in_interrupt()), since that doesn't check for all
the newfangled stuff like preempt.
Note that this is valid since the console_sem is essentially used like
a real mutex with only two twists:
- we allow trylock from hardirq context
- across suspend/resume we lock the logical console_lock, but drop the
semaphore protecting the locking state.
Now that doesn't guarantee that no one is playing tricks in
single-thread atomic contexts at suspend/resume/boot time, but
- I couldn't find anything suspicious with some grepping,
- might_sleep shouldn't die,
- and I think the upside of catching more potential issues is worth
the risk of getting a might_sleep backtrace that would have been
save (and then dealing with that fallout).
Cc: Dave Airlie <airlied@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Pull perf fixes from Ingo Molnar:
"Most of these are uprobes race fixes from Oleg, and their preparatory
cleanups. (It's larger than what I'd normally send for an -rc kernel,
but they looked significant enough to not delay them.)
There's also an oprofile fix and an uncore PMU fix."
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (22 commits)
perf/x86: Disable uncore on virtualized CPUs
oprofile, x86: Fix wrapping bug in op_x86_get_ctrl()
ring-buffer: Check for uninitialized cpu buffer before resizing
uprobes: Fix the racy uprobe->flags manipulation
uprobes: Fix prepare_uprobe() race with itself
uprobes: Introduce prepare_uprobe()
uprobes: Fix handle_swbp() vs unregister() + register() race
uprobes: Do not delete uprobe if uprobe_unregister() fails
uprobes: Don't return success if alloc_uprobe() fails
uprobes/x86: Only rep+nop can be emulated correctly
uprobes: Simplify is_swbp_at_addr(), remove stale comments
uprobes: Kill set_orig_insn()->is_swbp_at_addr()
uprobes: Introduce copy_opcode(), kill read_opcode()
uprobes: Kill set_swbp()->is_swbp_at_addr()
uprobes: Restrict valid_vma(false) to skip VM_SHARED vmas
uprobes: Change valid_vma() to demand VM_MAYEXEC rather than VM_EXEC
uprobes: Change write_opcode() to use FOLL_FORCE
uprobes: Move clear_thread_flag(TIF_UPROBE) to uprobe_notify_resume()
uprobes: Kill UTASK_BP_HIT state
uprobes: Fix UPROBE_SKIP_SSTEP checks in handle_swbp()
...
In theory, if a grace period manages to get started despite there being
no callbacks on any of the CPUs, all CPUs could go into dyntick-idle
mode, so that the grace period would never end. This commit updates
the RCU CPU stall warning messages to detect this condition by summing
up the number of callbacks on all CPUs.
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This commit causes the last grace period started and completed to be
printed on RCU CPU stall warning messages in order to aid diagnosis.
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The RCU CPU stall warnings rely on trigger_all_cpu_backtrace() to
do NMI-based dump of the stack traces of all CPUs. Unfortunately, a
number of architectures do not implement trigger_all_cpu_backtrace(), in
which case RCU falls back to just dumping the stack of the running CPU.
This is unhelpful in the case where the running CPU has detected that
some other CPU has stalled.
This commit therefore makes the running CPU dump the stacks of the
tasks running on the stalled CPUs.
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Because process_srcu() will be used in DEFINE_SRCU(), which is a macro
that could be expanded pretty much anywhere, it can no longer be static.
Note that process_srcu() is still internal to srcu.h.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Lai Jiangshan rewrote SRCU, so this commit ensures that he gets his
proper share of blame^Wcredit.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The fix introduced by a10d206e (rcu: Fix day-one dyntick-idle
stall-warning bug) has a C-language precedence error. It turns out
that this error is harmless in that the same result is computed for all
inputs, but the code is nevertheless a potential source of confusion.
This commit therefore introduces parentheses in order to force the
execution of the code to reflect the intent.
Reported-by: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
There have been some embedded applications that would benefit from
use of expedited grace-period primitives. In some ways, this is
similar to synchronize_net() doing either a normal or an expedited
grace period depending on lock state, but with control outside of
the kernel.
This commit therefore adds rcu_expedited boot and sysfs parameters
that cause the kernel to substitute expedited primitives for the
normal grace-period primitives.
[ paulmck: Add trace/event/rcu.h to kernel/srcu.c to avoid build error.
Get rid of infinite loop through contention path.]
Signed-off-by: Antti P Miettinen <amiettinen@nvidia.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
It's only there to call rcu_user_hooks_switch(). Let's
just call rcu_user_hooks_switch() directly, we don't need this
function in the middle.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Richard Weinberger <richard@nod.at>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This commit causes rcutorture to print the errno if cpu_down() fails
when the rcutorture "verbose" module parameter is specified.
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
In the old days, _rcu_barrier() acquired ->onofflock to exclude
rcu_send_cbs_to_orphanage(), which allowed the latter to avoid memory
barriers in callback handling. However, _rcu_barrier() recently started
doing get_online_cpus() to lock out CPU-hotplug operations entirely, which
means that the comment in rcu_send_cbs_to_orphanage() that talks about
->onofflock is now obsolete. This commit therefore fixes the comment.
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Dave Airlie recently discovered a locking bug in the fbcon layer,
where a timer_del_sync (for the blinking cursor) deadlocks with the
timer itself, since both (want to) hold the console_lock:
https://lkml.org/lkml/2012/8/21/36
Unfortunately the console_lock isn't a plain mutex and hence has no
lockdep support. Which resulted in a few days wasted of tracking down
this bug (complicated by the fact that printk doesn't show anything
when the console is locked) instead of noticing the bug much earlier
with the lockdep splat.
Hence I've figured I need to fix that for the next deadlock involving
console_lock - and with kms/drm growing ever more complex locking
that'll eventually happen.
Now the console_lock has rather funky semantics, so after a quick irc
discussion with Thomas Gleixner and Dave Airlie I've quickly ditched
the original idead of switching to a real mutex (since it won't work)
and instead opted to annotate the console_lock with lockdep
information manually.
There are a few special cases:
- The console_lock state is protected by the console_sem, and usually
grabbed/dropped at _lock/_unlock time. But the suspend/resume code
drops the semaphore without dropping the console_lock (see
suspend_console/resume_console). But since the same thread that did
the suspend will do the resume, we don't need to fix up anything.
- In the printk code there's a special trylock, only used to kick off
the logbuffer printk'ing in console_unlock. But all that happens
while lockdep is disable (since printk does a few other evil
tricks). So no issue there, either.
- The console_lock can also be acquired form irq context (but only
with a trylock). lockdep already handles that.
This all leaves us with annotating the normal console_lock, _unlock
and _trylock functions.
And yes, it works - simply unloading a drm kms driver resulted in
lockdep complaining about the deadlock in fbcon_deinit:
======================================================
[ INFO: possible circular locking dependency detected ]
3.6.0-rc2+ #552 Not tainted
-------------------------------------------------------
kms-reload/3577 is trying to acquire lock:
((&info->queue)){+.+...}, at: [<ffffffff81058c70>] wait_on_work+0x0/0xa7
but task is already holding lock:
(console_lock){+.+.+.}, at: [<ffffffff81264686>] bind_con_driver+0x38/0x263
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #1 (console_lock){+.+.+.}:
[<ffffffff81087440>] lock_acquire+0x95/0x105
[<ffffffff81040190>] console_lock+0x59/0x5b
[<ffffffff81209cb6>] fb_flashcursor+0x2e/0x12c
[<ffffffff81057c3e>] process_one_work+0x1d9/0x3b4
[<ffffffff810584a2>] worker_thread+0x1a7/0x24b
[<ffffffff8105ca29>] kthread+0x7f/0x87
[<ffffffff813b1204>] kernel_thread_helper+0x4/0x10
-> #0 ((&info->queue)){+.+...}:
[<ffffffff81086cb3>] __lock_acquire+0x999/0xcf6
[<ffffffff81087440>] lock_acquire+0x95/0x105
[<ffffffff81058cab>] wait_on_work+0x3b/0xa7
[<ffffffff81058dd6>] __cancel_work_timer+0xbf/0x102
[<ffffffff81058e33>] cancel_work_sync+0xb/0xd
[<ffffffff8120a3b3>] fbcon_deinit+0x11c/0x1dc
[<ffffffff81264793>] bind_con_driver+0x145/0x263
[<ffffffff81264a45>] unbind_con_driver+0x14f/0x195
[<ffffffff8126540c>] store_bind+0x1ad/0x1c1
[<ffffffff8127cbb7>] dev_attr_store+0x13/0x1f
[<ffffffff8116d884>] sysfs_write_file+0xe9/0x121
[<ffffffff811145b2>] vfs_write+0x9b/0xfd
[<ffffffff811147b7>] sys_write+0x3e/0x6b
[<ffffffff813b0039>] system_call_fastpath+0x16/0x1b
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(console_lock);
lock((&info->queue));
lock(console_lock);
lock((&info->queue));
*** DEADLOCK ***
v2: Mark the lockdep_map static, noticed by Jani Nikula.
Cc: Dave Airlie <airlied@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Introduce struct pm_qos_flags_request and struct pm_qos_flags
representing PM QoS flags request type and PM QoS flags constraint
type, respectively. With these definitions the data structures
will be arranged so that the list member of a struct pm_qos_flags
object will contain the head of a list of struct pm_qos_flags_request
objects representing all of the "flags" requests present for the
given device. Then, the effective_flags member of a struct
pm_qos_flags object will contain the bitwise OR of the flags members
of all the struct pm_qos_flags_request objects in the list.
Additionally, introduce helper function pm_qos_update_flags()
allowing the caller to manage the list of struct pm_qos_flags_request
pointed to by the list member of struct pm_qos_flags.
The flags are of type s32 so that the request's "value" field
is always of the same type regardless of what kind of request it
is (latency requests already have value fields of type s32).
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Jean Pihet <j-pihet@ti.com>
Acked-by: mark gross <markgross@thegnar.org>
Fix the warning:
kernel/module_signing.c:195:2: warning: format '%lu' expects type 'long unsigned int', but argument 3 has type 'size_t'
by using the proper 'z' modifier for printing a size_t.
Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
freezer_read/write() used cgroup_lock_live_group() to synchronize
against task migration into and out of the target cgroup.
cgroup_lock_live_group() grabs the internal cgroup lock and using it
from outside cgroup core leads to complex and fragile locking
dependency issues which are difficult to resolve.
Now that freezer_can_attach() is replaced with freezer_attach() and
update_if_frozen() updated, nothing requires excluding migration
against freezer state reads and changes.
This patch removes cgroup_lock_live_group() and the matching
cgroup_unlock() usages. The prone-to-bitrot, already outdated and
unnecessary global lock hierarchy documentation is replaced with
documentation in local scope.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Li Zefan <lizefan@huawei.com>
Locking will change such that migration can happen while
freezer_read/write() is in progress. This means that
update_if_frozen() can no longer assume that all tasks in the cgroup
coform to the current freezer state - newly migrated tasks which
haven't finished freezer_attach() yet might be in any state.
This patch updates update_if_frozen() such that it no longer verifies
task states against freezer state. It now simply decides whether
FREEZING stage is complete.
This removal of verification makes it meaningless to call from
freezer_change_state(). Drop it and move the fast exit test from
freezer_read() - the only left caller - to update_if_frozen().
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Li Zefan <lizefan@huawei.com>
cgroup_freezer is one of the few users of cgroup_subsys->can_attach()
and uses it to prevent tasks from being migrated into or out of a
frozen cgroup. This makes cgroup_freezer cumbersome to use especially
when co-mounted with other controllers.
->can_attach() is problematic in general as it can make co-mounting
multiple cgroups difficult - migrating tasks may fail for reasons
completely irrelevant for other controllers. freezer_can_attach() in
particular is more problematic because it messes with cgroup internal
locking to ensure that the state verification performed at
freezer_can_attach() stays valid until migration is complete.
This patch replaces freezer_can_attach() with freezer_attach() so that
tasks are always allowed to migrate - they are nudged into the
conforming state from freezer_attach(). This means that there can be
tasks which are being migrated which don't conform to the current
cgroup_freezer state until freezer_attach() is complete. Under the
current locking scheme, the only such place is freezer_fork() which is
updated to handle such window.
While this patch doesn't remove the use of internal cgroup locking
from freezer_read/write() paths, it removes the requirement to keep
the freezer state constant while migrating and enables such change.
Note that this creates a userland visible behavior change - FROZEN
cgroup can no longer be used to lock migrations in and out of the
cgroup. This behavior change is intended. I don't think the feature
is necessary - userland should coordinate accesses to cgroup fs anyway
- and even if the feature is needed cgroup_freezer is the completely
wrong place to implement it.
Signed-off-by: Tejun Heo <tj@kernel.org>
LKML-Reference: <1350426526-14254-1-git-send-email-tj@kernel.org>
Cc: Matt Helsley <matthltc@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Li Zefan <lizefan@huawei.com>
Because grace-period initialization is carried out by a separate
kthread, it might happen on a different CPU than the one that
had the callback needing a grace period -- which is where the
callback acceleration needs to happen.
Fortunately, rcu_start_gp() holds the root rcu_node structure's
->lock, which prevents a new grace period from starting. This
allows this function to safely determine that a grace period has
not yet started, which in turn allows it to fully accelerate any
callbacks that it has pending. This commit adds this acceleration.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The min/max call needed to have explicit types on some architectures
(e.g. mn10300). Use clamp_t instead to avoid the warning:
kernel/sys.c: In function 'override_release':
kernel/sys.c:1287:10: warning: comparison of distinct pointer types lacks a cast [enabled by default]
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Emit the magic string that indicates a module has a signature after the
signature data instead of before it. This allows module_sig_check() to
be made simpler and faster by the elimination of the search for the
magic string. Instead we just need to do a single memcmp().
This works because at the end of the signature data there is the
fixed-length signature information block. This block then falls
immediately prior to the magic number.
From the contents of the information block, it is trivial to calculate
the size of the signature data and thus the size of the actual module
data.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This reverts commit 7e3aa30ac8.
The commit incorrectly assumed that fork path always performed
threadgroup_change_begin/end() and depended on that for
synchronization against task exit and cgroup migration paths instead
of explicitly grabbing task_lock().
threadgroup_change is not locked when forking a new process (as
opposed to a new thread in the same process) and even if it were it
wouldn't be effective as different processes use different threadgroup
locks.
Revert the incorrect optimization.
Signed-off-by: Tejun Heo <tj@kernel.org>
LKML-Reference: <20121008020000.GB2575@localhost>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: stable@vger.kernel.org
This reverts commit 7e381b0eb1.
The commit incorrectly assumed that fork path always performed
threadgroup_change_begin/end() and depended on that for
synchronization against task exit and cgroup migration paths instead
of explicitly grabbing task_lock().
threadgroup_change is not locked when forking a new process (as
opposed to a new thread in the same process) and even if it were it
wouldn't be effective as different processes use different threadgroup
locks.
Revert the incorrect optimization.
Signed-off-by: Tejun Heo <tj@kernel.org>
LKML-Reference: <20121008020000.GB2575@localhost>
Acked-by: Li Zefan <lizefan@huawei.com>
Bitterly-Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: stable@vger.kernel.org
free_pid_ns() operates in a recursive fashion:
free_pid_ns(parent)
put_pid_ns(parent)
kref_put(&ns->kref, free_pid_ns);
free_pid_ns
thus if there was a huge nesting of namespaces the userspace may trigger
avalanche calling of free_pid_ns leading to kernel stack exhausting and a
panic eventually.
This patch turns the recursion into an iterative loop.
Based on a patch by Andrew Vagin.
[akpm@linux-foundation.org: export put_pid_ns() to modules]
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Andrew Vagin <avagin@openvz.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Calling uname() with the UNAME26 personality set allows a leak of kernel
stack contents. This fixes it by defensively calculating the length of
copy_to_user() call, making the len argument unsigned, and initializing
the stack buffer to zero (now technically unneeded, but hey, overkill).
CVE-2012-0957
Reported-by: PaX Team <pageexec@freemail.hu>
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: PaX Team <pageexec@freemail.hu>
Cc: Brad Spengler <spender@grsecurity.net>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The console_cpu_notify() function runs with interrupts disabled in the
CPU_DYING case. It therefore cannot block, for example, as will happen
when it calls console_lock(). Therefore, remove the CPU_DYING leg of
the switch statement to avoid this problem.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
notify_on_release must be triggered when the last process in a cgroup is
move to another. But if the first(and only) process in a cgroup is moved to
another, notify_on_release is not triggered.
# mkdir /cgroup/cpu/SRC
# mkdir /cgroup/cpu/DST
#
# echo 1 >/cgroup/cpu/SRC/notify_on_release
# echo 1 >/cgroup/cpu/DST/notify_on_release
#
# sleep 300 &
[1] 8629
#
# echo 8629 >/cgroup/cpu/SRC/tasks
# echo 8629 >/cgroup/cpu/DST/tasks
-> notify_on_release for /SRC must be triggered at this point,
but it isn't.
This is because put_css_set() is called before setting CGRP_RELEASABLE
in cgroup_task_migrate(), and is a regression introduce by the
commit:74a1166d(cgroups: make procs file writable), which was merged
into v3.0.
Cc: Ben Blum <bblum@andrew.cmu.edu>
Cc: <stable@vger.kernel.org> # v3.0.x and later
Acked-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: Tejun Heo <tj@kernel.org>
cgroup_freezer doesn't transition from FREEZING to FROZEN if the
cgroup contains PF_NOFREEZE tasks or tasks sleeping with
PF_FREEZER_SKIP set.
Only kernel tasks can be non-freezable (PF_NOFREEZE) and there's
nothing cgroup_freezer or userland can do about or to it. It's
pointless to stall the transition for PF_NOFREEZE tasks.
PF_FREEZER_SKIP indicates that the task can be skipped when
determining whether frozen state is reached. A task with
PF_FREEZER_SKIP is guaranteed to perform try_to_freeze() after it
wakes up and can be considered frozen much like stopped or traced
tasks. Note that a vfork parent uses PF_FREEZER_SKIP while waiting
for the child.
This updates update_if_frozen() such that it only considers freezable
tasks and treats %true freezer_should_skip() tasks as frozen.
This allows cgroups w/ kthreads and vfork parents successfully reach
FROZEN state.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Rafael J. Wysocki <rjw@sisk.pl>
try_to_freeze_cgroup() has condition checks which are intended to fail
the write operation to freezer.state if there are tasks which can't be
frozen. The condition checks have been broken for quite some time
now. freeze_task() returns %false if the target task can't be frozen,
so num_cant_freeze_now is never incremented.
In addition, strangely, cgroup freezing proceeds even after the write
is failed, which is rather broken.
This patch rips out the non-working code intended to fail the write to
freezer.state when the cgroup contains non-freezable tasks and makes
it official that writes to freezer.state succeed whether there are
non-freezable tasks in the cgroup or not.
This leaves is_task_frozen_enough() with only one user -
upste_if_frozen(). Collapse it into the caller. Note that this
removes an extra call to freezing().
This doesn't cause any userland behavior changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Rafael J. Wysocki <rjw@sisk.pl>
cgroup core has a bug which violates a basic rule about event
notifications - when a new entity needs to be added, you add that to
the notification list first and then make the new entity conform to
the current state. If done in the reverse order, an event happening
inbetween will be lost.
cgroup_subsys->fork() is invoked way before the new task is added to
the css_set. Currently, cgroup_freezer is the only user of ->fork()
and uses it to make new tasks conform to the current state of the
freezer. If FROZEN state is requested while fork is in progress
between cgroup_fork_callbacks() and cgroup_post_fork(), the child
could escape freezing - the cgroup isn't frozen when ->fork() is
called and the freezer couldn't see the new task on the css_set.
This patch moves cgroup_subsys->fork() invocation to
cgroup_post_fork() after the new task is added to the css_set.
cgroup_fork_callbacks() is removed.
Because now a task may be migrated during cgroup_subsys->fork(),
freezer_fork() is updated so that it adheres to the usual RCU locking
and the rather pointless comment on why locking can be different there
is removed (if it doesn't make anything simpler, why even bother?).
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Rafael J. Wysocki <rjw@sisk.pl>
Cc: stable@vger.kernel.org
As per the recent discussion with Mike and Linus, make it easier to
test with/without this feature. No change in default behavior.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/n/tip-izoxq4haeg4mTognnDbwcevt@git.kernel.org
This optimize a bit the high res tick sched handler.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Besides unifying code, this also adds the idle check before
processing idle accounting specifics on the low res handler.
This way we also generalize this part of the nohz code for
!CONFIG_HIGH_RES_TIMERS to prepare for the adaptive tickless
features.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Unify the duplicated timekeeping handling code of low and high res tick
sched handlers.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Pull module signing support from Rusty Russell:
"module signing is the highlight, but it's an all-over David Howells frenzy..."
Hmm "Magrathea: Glacier signing key". Somebody has been reading too much HHGTTG.
* 'modules-next' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux: (37 commits)
X.509: Fix indefinite length element skip error handling
X.509: Convert some printk calls to pr_devel
asymmetric keys: fix printk format warning
MODSIGN: Fix 32-bit overflow in X.509 certificate validity date checking
MODSIGN: Make mrproper should remove generated files.
MODSIGN: Use utf8 strings in signer's name in autogenerated X.509 certs
MODSIGN: Use the same digest for the autogen key sig as for the module sig
MODSIGN: Sign modules during the build process
MODSIGN: Provide a script for generating a key ID from an X.509 cert
MODSIGN: Implement module signature checking
MODSIGN: Provide module signing public keys to the kernel
MODSIGN: Automatically generate module signing keys if missing
MODSIGN: Provide Kconfig options
MODSIGN: Provide gitignore and make clean rules for extra files
MODSIGN: Add FIPS policy
module: signature checking hook
X.509: Add a crypto key parser for binary (DER) X.509 certificates
MPILIB: Provide a function to read raw data into an MPI
X.509: Add an ASN.1 decoder
X.509: Add simple ASN.1 grammar compiler
...
Cleanups
Clean up compile warnings in kgdboc.c and x86/kernel/kgdb.c
Add module event hooks for simplified debugging with gdb
Fixes
Fix kdb to stop paging with 'q' on bta and dmesg
Fix for data that scrolls off the vga console due to line wrapping
when using the kdb pager
New
The debug core registers for kernel module events which allows a
kernel aware gdb to automatically load symbols and break on entry
to a kernel module
Allow kgdboc=kdb to setup kdb on the vga console
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
iQIcBAABAgAGBQJQeB8KAAoJEIciOldedpOjpbIP/j+LXEkzXKKfi/3m79VQ87DB
5iUmTS84t84pomHamXX175AC0gA/2mC0FbbcHpqjlhxF4awXcviCNIiTdtSOTbbu
G102naLHY8i77X+XbHuN2utJeaRLw8rsfMMZGmjJnjfpc4LtsaH0YTkUzbt3qvba
N6/QvknadzIrmoCJvHipdOdsSmL0YmTS22+koG4es9B5jvOqVH/W7jZs1qRlVw96
VxG5Psx4LPB+RI+ZwF1WwbGxbtqKGwkVvkcGG1XIW7FQojHmjw+vUERQCjoFueJ5
NkKfus98j85/+MvSTkWx3L1K46MHMCFbtJs9RWftJ8GtoNNnm7GDxasoIG2bJKyG
HFD3IGPuKAokE/equF3eGTRHeEM0IUGwT3EnBqdKd73zud27WsHaSqC/1CPR+74v
ojLQ2ft1QF+pEkGrhRTdQpLyVnvEmxu8q+j9z9n/HlGEVv8kZ6LGxDPjWB+um/Yi
Cs0XAryYrL5gE5O+Vwna61luughtIYJwR7+DeVxnQYJ43x/0MtN/SoURnwvrCTEo
9FeoMgZm1nLh6EW29ahIT/hMu4f0sM91Kiwrmc/zEWZgoB++wo1n470qQmUUrOx4
CPD7zdmDrf6YxDG2QTHjCtVErO4aJ5zN4Dq0+YyodV545SZVn3t4qBDTVvKhq4Y6
NIhZAxrv5RKABwtLcP9E
=uf0L
-----END PGP SIGNATURE-----
Merge tag 'for_linus-3.7' of git://git.kernel.org/pub/scm/linux/kernel/git/jwessel/kgdb
Pull KGDB/KDB fixes and cleanups from Jason Wessel:
"Cleanups
- Clean up compile warnings in kgdboc.c and x86/kernel/kgdb.c
- Add module event hooks for simplified debugging with gdb
Fixes
- Fix kdb to stop paging with 'q' on bta and dmesg
- Fix for data that scrolls off the vga console due to line wrapping
when using the kdb pager
New
- The debug core registers for kernel module events which allows a
kernel aware gdb to automatically load symbols and break on entry
to a kernel module
- Allow kgdboc=kdb to setup kdb on the vga console"
* tag 'for_linus-3.7' of git://git.kernel.org/pub/scm/linux/kernel/git/jwessel/kgdb:
tty/console: fix warnings in drivers/tty/serial/kgdboc.c
kdb,vt_console: Fix missed data due to pager overruns
kdb: Fix dmesg/bta scroll to quit with 'q'
kgdboc: Accept either kbd or kdb to activate the vga + keyboard kdb shell
kgdb,x86: fix warning about unused variable
mips,kgdb: fix recursive page fault with CONFIG_KPROBES
kgdb: Add module event hooks
Pull perf updates from Ingo Molnar:
"This tree includes some late late perf items that missed the first
round:
tools:
- Bash auto completion improvements, now we can auto complete the
tools long options, tracepoint event names, etc, from Namhyung Kim.
- Look up thread using tid instead of pid in 'perf sched'.
- Move global variables into a perf_kvm struct, from David Ahern.
- Hists refactorings, preparatory for improved 'diff' command, from
Jiri Olsa.
- Hists refactorings, preparatory for event group viewieng work, from
Namhyung Kim.
- Remove double negation on optional feature macro definitions, from
Namhyung Kim.
- Remove several cases of needless global variables, on most
builtins.
- misc fixes
kernel:
- sysfs support for IBS on AMD CPUs, from Robert Richter.
- Support for an upcoming Intel CPU, the Xeon-Phi / Knights Corner
HPC blade PMU, from Vince Weaver.
- misc fixes"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (46 commits)
perf: Fix perf_cgroup_switch for sw-events
perf: Clarify perf_cpu_context::active_pmu usage by renaming it to ::unique_pmu
perf/AMD/IBS: Add sysfs support
perf hists: Add more helpers for hist entry stat
perf hists: Move he->stat.nr_events initialization to a template
perf hists: Introduce struct he_stat
perf diff: Removing the total_period argument from output code
perf tool: Add hpp interface to enable/disable hpp column
perf tools: Removing hists pair argument from output path
perf hists: Separate overhead and baseline columns
perf diff: Refactor diff displacement possition info
perf hists: Add struct hists pointer to struct hist_entry
perf tools: Complete tracepoint event names
perf/x86: Add support for Intel Xeon-Phi Knights Corner PMU
perf evlist: Remove some unused methods
perf evlist: Introduce add_newtp method
perf kvm: Move global variables into a perf_kvm struct
perf tools: Convert to BACKTRACE_SUPPORT
perf tools: Long option completion support for each subcommands
perf tools: Complete long option names of perf command
...
Pull third pile of kernel_execve() patches from Al Viro:
"The last bits of infrastructure for kernel_thread() et.al., with
alpha/arm/x86 use of those. Plus sanitizing the asm glue and
do_notify_resume() on alpha, fixing the "disabled irq while running
task_work stuff" breakage there.
At that point the rest of kernel_thread/kernel_execve/sys_execve work
can be done independently for different architectures. The only
pending bits that do depend on having all architectures converted are
restrictred to fs/* and kernel/* - that'll obviously have to wait for
the next cycle.
I thought we'd have to wait for all of them done before we start
eliminating the longjump-style insanity in kernel_execve(), but it
turned out there's a very simple way to do that without flagday-style
changes."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal:
alpha: switch to saner kernel_execve() semantics
arm: switch to saner kernel_execve() semantics
x86, um: convert to saner kernel_execve() semantics
infrastructure for saner ret_from_kernel_thread semantics
make sure that kernel_thread() callbacks call do_exit() themselves
make sure that we always have a return path from kernel_execve()
ppc: eeh_event should just use kthread_run()
don't bother with kernel_thread/kernel_execve for launching linuxrc
alpha: get rid of switch_stack argument of do_work_pending()
alpha: don't bother passing switch_stack separately from regs
alpha: take SIGPENDING/NOTIFY_RESUME loop into signal.c
alpha: simplify TIF_NEED_RESCHED handling
Pull third pile of VFS updates from Al Viro:
"Stuff from Jeff Layton, mostly. Sanitizing interplay between audit
and namei, removing a lot of insanity from audit_inode() mess and
getting things ready for his ESTALE patchset."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
procfs: don't need a PATH_MAX allocation to hold a string representation of an int
vfs: embed struct filename inside of names_cache allocation if possible
audit: make audit_inode take struct filename
vfs: make path_openat take a struct filename pointer
vfs: turn do_path_lookup into wrapper around struct filename variant
audit: allow audit code to satisfy getname requests from its names_list
vfs: define struct filename and have getname() return it
vfs: unexport getname and putname symbols
acct: constify the name arg to acct_on
vfs: allocate page instead of names_cache buffer in mount_block_root
audit: overhaul __audit_inode_child to accomodate retrying
audit: optimize audit_compare_dname_path
audit: make audit_compare_dname_path use parent_len helper
audit: remove dirlen argument to audit_compare_dname_path
audit: set the name_len in audit_inode for parent lookups
audit: add a new "type" field to audit_names struct
audit: reverse arguments to audit_inode_child
audit: no need to walk list in audit_inode if name is NULL
audit: pass in dentry to audit_copy_inode wherever possible
audit: remove unnecessary NULL ptr checks from do_path_lookup
Keep a pointer to the audit_names "slot" in struct filename.
Have all of the audit_inode callers pass a struct filename ponter to
audit_inode instead of a string pointer. If the aname field is already
populated, then we can skip walking the list altogether and just use it
directly.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
...and fix up the callers. For do_file_open_root, just declare a
struct filename on the stack and fill out the .name field. For
do_filp_open, make it also take a struct filename pointer, and fix up its
callers to call it appropriately.
For filp_open, add a variant that takes a struct filename pointer and turn
filp_open into a wrapper around it.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Currently, if we call getname() on a userland string more than once,
we'll get multiple copies of the string and multiple audit_names
records.
Add a function that will allow the audit_names code to satisfy getname
requests using info from the audit_names list, avoiding a new allocation
and audit_names records.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
getname() is intended to copy pathname strings from userspace into a
kernel buffer. The result is just a string in kernel space. It would
however be quite helpful to be able to attach some ancillary info to
the string.
For instance, we could attach some audit-related info to reduce the
amount of audit-related processing needed. When auditing is enabled,
we could also call getname() on the string more than once and not
need to recopy it from userspace.
This patchset converts the getname()/putname() interfaces to return
a struct instead of a string. For now, the struct just tracks the
string in kernel space and the original userland pointer for it.
Later, we'll add other information to the struct as it becomes
convenient.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* allow kernel_execve() leave the actual return to userland to
caller (selected by CONFIG_GENERIC_KERNEL_EXECVE). Callers
updated accordingly.
* architecture that does select GENERIC_KERNEL_EXECVE in its
Kconfig should have its ret_from_kernel_thread() do this:
call schedule_tail
call the callback left for it by copy_thread(); if it ever
returns, that's because it has just done successful kernel_execve()
jump to return from syscall
IOW, its only difference from ret_from_fork() is that it does call the
callback.
* such an architecture should also get rid of ret_from_kernel_execve()
and __ARCH_WANT_KERNEL_EXECVE
This is the last part of infrastructure patches in that area - from
that point on work on different architectures can live independently.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Pull timer core update from Thomas Gleixner:
- Bug fixes (one for a longstanding dead loop issue)
- Rework of time related vsyscalls
- Alarm timer updates
- Jiffies updates to remove compile time dependencies
* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
timekeeping: Cast raw_interval to u64 to avoid shift overflow
timers: Fix endless looping between cascade() and internal_add_timer()
time/jiffies: bring back unconditional LATCH definition
time: Convert x86_64 to using new update_vsyscall
time: Only do nanosecond rounding on GENERIC_TIME_VSYSCALL_OLD systems
time: Introduce new GENERIC_TIME_VSYSCALL
time: Convert CONFIG_GENERIC_TIME_VSYSCALL to CONFIG_GENERIC_TIME_VSYSCALL_OLD
time: Move update_vsyscall definitions to timekeeper_internal.h
time: Move timekeeper structure to timekeeper_internal.h for vsyscall changes
jiffies: Remove compile time assumptions about CLOCK_TICK_RATE
jiffies: Kill unused TICK_USEC_TO_NSEC
alarmtimer: Rename alarmtimer_remove to alarmtimer_dequeue
alarmtimer: Remove unused helpers & defines
alarmtimer: Use hrtimer per-alarm instead of per-base
alarmtimer: Implement minimum alarm interval for allowing suspend
Pull scheduler fixes from Ingo Molnar:
"A CPU hotplug related crash fix and a nohz accounting fixlet."
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched: Update sched_domains_numa_masks[][] when new cpus are onlined
sched: Ensure 'sched_domains_numa_levels' is safe to use in other functions
nohz: Fix one jiffy count too far in idle cputime
Pull RCU fixes from Ingo Molnar:
"This tree includes a shutdown/cpu-hotplug deadlock fix and a
documentation fix."
* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
rcu: Advise most users not to enable RCU user mode
rcu: Grace-period initialization excludes only RCU notifier
It is possible to miss data when using the kdb pager. The kdb pager
does not pay attention to the maximum column constraint of the screen
or serial terminal. This result is not incrementing the shown lines
correctly and the pager will print more lines that fit on the screen.
Obviously that is less than useful when using a VGA console where you
cannot scroll back.
The pager will now look at the kdb_buffer string to see how many
characters are printed. It might not be perfect considering you can
output ASCII that might move the cursor position, but it is a
substantially better approximation for viewing dmesg and trace logs.
This also means that the vt screen needs to set the kdb COLUMNS
variable.
Cc: <stable@vger.kernel.org>
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
If you press 'q' the pager should exit instead of printing everything
from dmesg which can really bog down a 9600 baud serial link.
The same is true for the bta command.
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
Allow gdb to auto load kernel modules when it is attached,
which makes it trivially easy to debug module init functions
or pre-set breakpoints in a kernel module that has not loaded yet.
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
In order to accomodate retrying path-based syscalls, we need to add a
new "type" argument to audit_inode_child. This will tell us whether
we're looking for a child entry that represents a create or a delete.
If we find a parent, don't automatically assume that we need to create a
new entry. Instead, use the information we have to try to find an
existing entry first. Update it if one is found and create a new one if
not.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
In the cases where we already know the length of the parent, pass it as
a parm so we don't need to recompute it. In the cases where we don't
know the length, pass in AUDIT_NAME_FULL (-1) to indicate that it should
be determined.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Currently, this gets set mostly by happenstance when we call into
audit_inode_child. While that might be a little more efficient, it seems
wrong. If the syscall ends up failing before audit_inode_child ever gets
called, then you'll have an audit_names record that shows the full path
but has the parent inode info attached.
Fix this by passing in a parent flag when we call audit_inode that gets
set to the value of LOOKUP_PARENT. We can then fix up the pathname for
the audit entry correctly from the get-go.
While we're at it, clean up the no-op macro for audit_inode in the
!CONFIG_AUDITSYSCALL case.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
For now, we just have two possibilities:
UNKNOWN: for a new audit_names record that we don't know anything about yet
NORMAL: for everything else
In later patches, we'll add other types so we can distinguish and update
records created under different circumstances.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Most of the callers get called with an inode and dentry in the reverse
order. The compiler then has to reshuffle the arg registers and/or
stack in order to pass them on to audit_inode_child.
Reverse those arguments for a micro-optimization.
Reported-by: Eric Paris <eparis@redhat.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
If name is NULL then the condition in the loop will never be true. Also,
with this change, we can eliminate the check for n->name == NULL since
the equivalence check will never be true if it is.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
In some cases, we were passing in NULL even when we have a dentry.
Reported-by: Eric Paris <eparis@redhat.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
- Complement the Nomadik pinctrl driver with alternate Cx functions
so it handles all oddities.
- A patch to the IRQdomain to reform the simple irqdomain to handle
IRQ descriptor allocation dynamically.
- Use the above feature in the Nomadik pin controller.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.10 (GNU/Linux)
iQIcBAABAgAGBQJQdonzAAoJEEEQszewGV1zpAoP/RfvLFoZe5q6FXFCUG+CbXmg
PKSe58YR3iLkCPDgv0t/zpddmKkulg92LMvrJK1Rv5tuWODia9fQTRbqXGWehoPi
0jnAIvjuBkDDYuHD+mr9vd+WO8Ts6pKasFwNLLZSMmu5vuV3rQvkPyMkC47amB8j
ncMl16M5efxxfgEJo49TkaKCCJOp3aNRQdZlY9aCqDzGqGmLizOJituN5FAfzT60
0IZpUC3tZwn4eMlMZy3C0WkNDpiUy8U10vXafHVapQ/y2t1lgRnMyncbioH/cOIQ
jXbbHI9mKOoXf4sXWEzikEreB+WAnPVcfiLNzdHzv3SoW6UrJjY0FumGJ85MItIg
HKwtcF2HHuJ1MaQI+DkLlhyWszXXjKP/zfRioBf0SkMZOtbvDA5aMmrSza6nqIF1
zCHu33ywc8AJbEBgHfVYZlAfvqkMNnI+oerrAdodtbYY0+8hey8EKeHkTJH3grk4
mCtVFtFGhbyNmoqM2YKgLqS8TqxDMfYhj1e3GX0kCgqbQEWbX6gCyqXOeDMl+gst
9kHPfHhaqKvBShWspU0yOU88M72KWlLt+CwiB1WA1eAW/lBwFiWl21PUe6RKAjpt
E0hX77+UdNm5Af9yVETC/K5q77lQnkjBdCDXbioRcCh2ifKFjyCtMQiW5FIw3Qc3
7UGdkdWTf7vhtPqmIxgF
=UKY/
-----END PGP SIGNATURE-----
Merge tag 'pinctrl-for-3.7-late' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl
Pull second set of pinctrl patches from Linus Walleij:
"Here is a late pinctrl pull request with stuff that wasn't quite
tested at the first pull request.
The main reason to not hold off is that the modifications to
irq_domain_add_simple() as reviewed by Rob Herring introduce new
infrastructure for irqdomains that will be useful for the next cycle:
instead of sprinkling irq descriptor allocation all over the kernel
wherever a "legacy" domain is registered, which is necessary for any
platform using sparse IRQs, and many irq chips are say GPIO
controllers which may be used with several systems, some with sparse
IRQs some not, we push this into the irq_domain_add_simple() so we can
atleast do mistakes in one place.
The irq_domain_add_simple() is currently unused in the kernel, so I
need to provide a user. The Nomadik stuff that goes with are changes
to the driver I use day-to-day to make use of this facility (and a
dependency), so see it as a way to eat my own dogfood: if this blows
up the egg hits my face.
A second round of pinctrl patches for v3.7:
- Complement the Nomadik pinctrl driver with alternate Cx functions
so it handles all oddities.
- A patch to the IRQdomain to reform the simple irqdomain to handle
IRQ descriptor allocation dynamically.
- Use the above feature in the Nomadik pin controller."
* tag 'pinctrl-for-3.7-late' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl:
pinctrl/nomadik: use simple or linear IRQ domain
irqdomain: augment add_simple() to allocate descs
pinctrl/nomadik: support other alternate-C functions
Pull pile 2 of vfs updates from Al Viro:
"Stuff in this one - assorted fixes, lglock tidy-up, death to
lock_super().
There'll be a VFS pile tomorrow (with patches from Jeff Layton,
sanitizing getname() and related parts of audit and preparing for
ESTALE fixes), but I'd rather push the stuff in this one ASAP - some
of the bugs closed here are quite unpleasant."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
vfs: bogus warnings in fs/namei.c
consitify do_mount() arguments
lglock: add DEFINE_STATIC_LGLOCK()
lglock: make the per_cpu locks static
lglock: remove unused DEFINE_LGLOCK_LOCKDEP()
MAX_LFS_FILESIZE definition for 64bit needs LL...
tmpfs,ceph,gfs2,isofs,reiserfs,xfs: fix fh_len checking
vfs: drop lock/unlock super
ufs: drop lock/unlock super
sysv: drop lock/unlock super
hpfs: drop lock/unlock super
fat: drop lock/unlock super
ext3: drop lock/unlock super
exofs: drop lock/unlock super
dup3: Return an error when oldfd == newfd.
fs: handle failed audit_log_start properly
fs: prevent use after free in auditing when symlink following was denied
Pull pile 2 of execve and kernel_thread unification work from Al Viro:
"Stuff in there: kernel_thread/kernel_execve/sys_execve conversions for
several more architectures plus assorted signal fixes and cleanups.
There'll be more (in particular, real fixes for the alpha
do_notify_resume() irq mess)..."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal: (43 commits)
alpha: don't open-code trace_report_syscall_{enter,exit}
Uninclude linux/freezer.h
m32r: trim masks
avr32: trim masks
tile: don't bother with SIGTRAP in setup_frame
microblaze: don't bother with SIGTRAP in setup_rt_frame()
mn10300: don't bother with SIGTRAP in setup_frame()
frv: no need to raise SIGTRAP in setup_frame()
x86: get rid of duplicate code in case of CONFIG_VM86
unicore32: remove pointless test
h8300: trim _TIF_WORK_MASK
parisc: decide whether to go to slow path (tracesys) based on thread flags
parisc: don't bother looping in do_signal()
parisc: fix double restarts
bury the rest of TIF_IRET
sanitize tsk_is_polling()
bury _TIF_RESTORE_SIGMASK
unicore32: unobfuscate _TIF_WORK_MASK
mips: NOTIFY_RESUME is not needed in TIF masks
mips: merge the identical "return from syscall" per-ABI code
...
Conflicts:
arch/arm/include/asm/thread_info.h
Most of them never returned anyway - only two functions had to be
changed. That allows to simplify their callers a whole lot.
Note that this does *not* apply to kthread_run() callbacks - all of
those had been called from the same kernel_thread() callback, which
did do_exit() already. This is strictly about very few low-level
kernel_thread() callbacks (there are only 6 of those, mostly as part
of kthread.h and kmod.h exported mechanisms, plus kernel_init()
itself).
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
With a system where, num_present_cpus < num_possible_cpus, even if all
CPUs are online, non-present CPUs don't have per_cpu buffers allocated.
If per_cpu/<cpu>/buffer_size_kb is modified for such a CPU, it can cause
a panic due to NULL dereference in ring_buffer_resize().
To fix this, resize operation is allowed only if the per-cpu buffer has
been initialized.
Link: http://lkml.kernel.org/r/1349912427-6486-1-git-send-email-vnagarnaik@google.com
Cc: stable@vger.kernel.org # 3.5+
Signed-off-by: Vaibhav Nagarnaik <vnagarnaik@google.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
It doesn't, because the clean targets don't include kernel/Makefile, and
because two files were missing from the list.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Place an indication that the certificate should use utf8 strings into the
x509.genkey template generated by kernel/Makefile.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Use the same digest type for the autogenerated key signature as for the module
signature so that the hash algorithm is guaranteed to be present in the kernel.
Without this, the X.509 certificate loader may reject the X.509 certificate so
generated because it was self-signed and the signature will be checked against
itself - but this won't work if the digest algorithm must be loaded as a
module.
The symptom is that the key fails to load with the following message emitted
into the kernel log:
MODSIGN: Problem loading in-kernel X.509 certificate (-65)
the error in brackets being -ENOPKG. What you should see is something like:
MODSIGN: Loaded cert 'Magarathea: Glacier signing key: 9588321144239a119d3406d4c4cf1fbae1836fa0'
Note that this doesn't apply to certificates that are not self-signed as we
don't check those currently as they require the parent CA certificate to be
available.
Reported-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Check the signature on the module against the keys compiled into the kernel or
available in a hardware key store.
Currently, only RSA keys are supported - though that's easy enough to change,
and the signature is expected to contain raw components (so not a PGP or
PKCS#7 formatted blob).
The signature blob is expected to consist of the following pieces in order:
(1) The binary identifier for the key. This is expected to match the
SubjectKeyIdentifier from an X.509 certificate. Only X.509 type
identifiers are currently supported.
(2) The signature data, consisting of a series of MPIs in which each is in
the format of a 2-byte BE word sizes followed by the content data.
(3) A 12 byte information block of the form:
struct module_signature {
enum pkey_algo algo : 8;
enum pkey_hash_algo hash : 8;
enum pkey_id_type id_type : 8;
u8 __pad;
__be32 id_length;
__be32 sig_length;
};
The three enums are defined in crypto/public_key.h.
'algo' contains the public-key algorithm identifier (0->DSA, 1->RSA).
'hash' contains the digest algorithm identifier (0->MD4, 1->MD5, 2->SHA1,
etc.).
'id_type' contains the public-key identifier type (0->PGP, 1->X.509).
'__pad' should be 0.
'id_length' should contain in the binary identifier length in BE form.
'sig_length' should contain in the signature data length in BE form.
The lengths are in BE order rather than CPU order to make dealing with
cross-compilation easier.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> (minor Kconfig fix)
Include a PGP keyring containing the public keys required to perform module
verification in the kernel image during build and create a special keyring
during boot which is then populated with keys of crypto type holding the public
keys found in the PGP keyring.
These can be seen by root:
[root@andromeda ~]# cat /proc/keys
07ad4ee0 I----- 1 perm 3f010000 0 0 crypto modsign.0: RSA 87b9b3bd []
15c7f8c3 I----- 1 perm 1f030000 0 0 keyring .module_sign: 1/4
...
It is probably worth permitting root to invalidate these keys, resulting in
their removal and preventing further modules from being loaded with that key.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Automatically generate keys for module signing if they're absent so that
allyesconfig doesn't break. The builder should consider generating their own
key and certificate, however, so that the keys are appropriately named.
The private key for the module signer should be placed in signing_key.priv
(unencrypted!) and the public key in an X.509 certificate as signing_key.x509.
If a transient key is desired for signing the modules, a config file for
'openssl req' can be placed in x509.genkey, looking something like the
following:
[ req ]
default_bits = 4096
distinguished_name = req_distinguished_name
prompt = no
x509_extensions = myexts
[ req_distinguished_name ]
O = Magarathea
CN = Glacier signing key
emailAddress = slartibartfast@magrathea.h2g2
[ myexts ]
basicConstraints=critical,CA:FALSE
keyUsage=digitalSignature
subjectKeyIdentifier=hash
authorityKeyIdentifier=hash
The build process will use this to configure:
openssl req -new -nodes -utf8 -sha1 -days 36500 -batch \
-x509 -config x509.genkey \
-outform DER -out signing_key.x509 \
-keyout signing_key.priv
to generate the key.
Note that it is required that the X.509 certificate have a subjectKeyIdentifier
and an authorityKeyIdentifier. Without those, the certificate will be
rejected. These can be used to check the validity of a certificate.
Note that 'make distclean' will remove signing_key.{priv,x509} and x509.genkey,
whether or not they were generated automatically.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
If we're in FIPS mode, we should panic if we fail to verify the signature on a
module or we're asked to load an unsigned module in signature enforcing mode.
Possibly FIPS mode should automatically enable enforcing mode.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We do a very simple search for a particular string appended to the module
(which is cache-hot and about to be SHA'd anyway). There's both a config
option and a boot parameter which control whether we accept or fail with
unsigned modules and modules that are signed with an unknown key.
If module signing is enabled, the kernel will be tainted if a module is
loaded that is unsigned or has a signature for which we don't have the
key.
(Useful feedback and tweaks by David Howells <dhowells@redhat.com>)
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Currently we rely on all IRQ chip instances to dynamically
allocate their IRQ descriptors unless they use the linear
IRQ domain. So for irqdomain_add_legacy() and
irqdomain_add_simple() the caller need to make sure that
descriptors are allocated.
Let's slightly augment the yet unused irqdomain_add_simple()
to also allocate descriptors as a means to simplify usage
and avoid code duplication throughout the kernel.
We warn if descriptors cannot be allocated, e.g. if a
platform has the bad habit of hogging descriptors at boot
time.
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Grant Likely <grant.likely@secretlab.ca>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Lee Jones <lee.jones@linaro.org>
Reviewed-by: Rob Herring <rob.herring@calxeda.com>
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
audit_log_start() may return NULL, this is unchecked by the caller in
audit_log_link_denied() and could cause a NULL ptr deref.
Introduced by commit a51d9eaa ("fs: add link restriction audit reporting").
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Pull generic execve() changes from Al Viro:
"This introduces the generic kernel_thread() and kernel_execve()
functions, and switches x86, arm, alpha, um and s390 over to them."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal: (26 commits)
s390: convert to generic kernel_execve()
s390: switch to generic kernel_thread()
s390: fold kernel_thread_helper() into ret_from_fork()
s390: fold execve_tail() into start_thread(), convert to generic sys_execve()
um: switch to generic kernel_thread()
x86, um/x86: switch to generic sys_execve and kernel_execve
x86: split ret_from_fork
alpha: introduce ret_from_kernel_execve(), switch to generic kernel_execve()
alpha: switch to generic kernel_thread()
alpha: switch to generic sys_execve()
arm: get rid of execve wrapper, switch to generic execve() implementation
arm: optimized current_pt_regs()
arm: introduce ret_from_kernel_execve(), switch to generic kernel_execve()
arm: split ret_from_fork, simplify kernel_thread() [based on patch by rmk]
generic sys_execve()
generic kernel_execve()
new helper: current_pt_regs()
preparation for generic kernel_thread()
um: kill thread->forking
um: let signal_delivered() do SIGTRAP on singlestepping into handler
...
We fixed a bunch of integer overflows in timekeeping code during the 3.6
cycle. I did an audit based on that and found this potential overflow.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: John Stultz <johnstul@us.ibm.com>
Link: http://lkml.kernel.org/r/20121009071823.GA19159@elgon.mountain
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Adding two (or more) timers with large values for "expires" (they have
to reside within tv5 in the same list) leads to endless looping
between cascade() and internal_add_timer() in case CONFIG_BASE_SMALL
is one and jiffies are crossing the value 1 << 18. The bug was
introduced between 2.6.11 and 2.6.12 (and survived for quite some
time).
This patch ensures that when cascade() is called timers within tv5 are
not added endlessly to their own list again, instead they are added to
the next lower tv level tv4 (as expected).
Signed-off-by: Christian Hildner <christian.hildner@siemens.com>
Reviewed-by: Jan Kiszka <jan.kiszka@siemens.com>
Link: http://lkml.kernel.org/r/98673C87CB31274881CFFE0B65ECC87B0F5FC1963E@DEFTHW99EA4MSX.ww902.siemens.net
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
In order to allow sleeping during invalidate_page mmu notifier calls, we
need to avoid calling when holding the PT lock. In addition to its direct
calls, invalidate_page can also be called as a substitute for a change_pte
call, in case the notifier client hasn't implemented change_pte.
This patch drops the invalidate_page call from change_pte, and instead
wraps all calls to change_pte with invalidate_range_start and
invalidate_range_end calls.
Note that change_pte still cannot sleep after this patch, and that clients
implementing change_pte should not take action on it in case the number of
outstanding invalidate_range_start calls is larger than one, otherwise
they might miss a later invalidation.
Signed-off-by: Haggai Eran <haggaie@mellanox.com>
Cc: Andrea Arcangeli <andrea@qumranet.com>
Cc: Sagi Grimberg <sagig@mellanox.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Cc: Or Gerlitz <ogerlitz@mellanox.com>
Cc: Haggai Eran <haggaie@mellanox.com>
Cc: Shachar Raindel <raindel@mellanox.com>
Cc: Liran Liss <liranl@mellanox.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Avi Kivity <avi@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Update the generic interval tree code that was introduced in "mm: replace
vma prio_tree with an interval tree".
Changes:
- fixed 'endpoing' typo noticed by Andrew Morton
- replaced include/linux/interval_tree_tmpl.h, which was used as a
template (including it automatically defined the interval tree
functions) with include/linux/interval_tree_generic.h, which only
defines a preprocessor macro INTERVAL_TREE_DEFINE(), which itself
defines the interval tree functions when invoked. Now that is a very
long macro which is unfortunate, but it does make the usage sites
(lib/interval_tree.c and mm/interval_tree.c) a bit nicer than previously.
- make use of RB_DECLARE_CALLBACKS() in the INTERVAL_TREE_DEFINE() macro,
instead of duplicating that code in the interval tree template.
- replaced vma_interval_tree_add(), which was actually handling the
nonlinear and interval tree cases, with vma_interval_tree_insert_after()
which handles only the interval tree case and has an API that is more
consistent with the other interval tree handling functions.
The nonlinear case is now handled explicitly in kernel/fork.c dup_mmap().
Signed-off-by: Michel Lespinasse <walken@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Daniel Santos <daniel.santos@pobox.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Implement an interval tree as a replacement for the VMA prio_tree. The
algorithms are similar to lib/interval_tree.c; however that code can't be
directly reused as the interval endpoints are not explicitly stored in the
VMA. So instead, the common algorithm is moved into a template and the
details (node type, how to get interval endpoints from the node, etc) are
filled in using the C preprocessor.
Once the interval tree functions are available, using them as a
replacement to the VMA prio tree is a relatively simple, mechanical job.
Signed-off-by: Michel Lespinasse <walken@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The deprecated /proc/<pid>/oom_adj is scheduled for removal this month.
Signed-off-by: Davidlohr Bueso <dave@gnu.org>
Acked-by: David Rientjes <rientjes@google.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
A long time ago, in v2.4, VM_RESERVED kept swapout process off VMA,
currently it lost original meaning but still has some effects:
| effect | alternative flags
-+------------------------+---------------------------------------------
1| account as reserved_vm | VM_IO
2| skip in core dump | VM_IO, VM_DONTDUMP
3| do not merge or expand | VM_IO, VM_DONTEXPAND, VM_HUGETLB, VM_PFNMAP
4| do not mlock | VM_IO, VM_DONTEXPAND, VM_HUGETLB, VM_PFNMAP
This patch removes reserved_vm counter from mm_struct. Seems like nobody
cares about it, it does not exported into userspace directly, it only
reduces total_vm showed in proc.
Thus VM_RESERVED can be replaced with VM_IO or pair VM_DONTEXPAND | VM_DONTDUMP.
remap_pfn_range() and io_remap_pfn_range() set VM_IO|VM_DONTEXPAND|VM_DONTDUMP.
remap_vmalloc_range() set VM_DONTEXPAND | VM_DONTDUMP.
[akpm@linux-foundation.org: drivers/vfio/pci/vfio_pci.c fixup]
Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Carsten Otte <cotte@de.ibm.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Eric Paris <eparis@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Morris <james.l.morris@oracle.com>
Cc: Jason Baron <jbaron@redhat.com>
Cc: Kentaro Takeda <takedakn@nttdata.co.jp>
Cc: Matt Helsley <matthltc@us.ibm.com>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Robert Richter <robert.richter@amd.com>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Venkatesh Pallipadi <venki@google.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently the kernel sets mm->exe_file during sys_execve() and then tracks
number of vmas with VM_EXECUTABLE flag in mm->num_exe_file_vmas, as soon
as this counter drops to zero kernel resets mm->exe_file to NULL. Plus it
resets mm->exe_file at last mmput() when mm->mm_users drops to zero.
VMA with VM_EXECUTABLE flag appears after mapping file with flag
MAP_EXECUTABLE, such vmas can appears only at sys_execve() or after vma
splitting, because sys_mmap ignores this flag. Usually binfmt module sets
mm->exe_file and mmaps executable vmas with this file, they hold
mm->exe_file while task is running.
comment from v2.6.25-6245-g925d1c4 ("procfs task exe symlink"),
where all this stuff was introduced:
> The kernel implements readlink of /proc/pid/exe by getting the file from
> the first executable VMA. Then the path to the file is reconstructed and
> reported as the result.
>
> Because of the VMA walk the code is slightly different on nommu systems.
> This patch avoids separate /proc/pid/exe code on nommu systems. Instead of
> walking the VMAs to find the first executable file-backed VMA we store a
> reference to the exec'd file in the mm_struct.
>
> That reference would prevent the filesystem holding the executable file
> from being unmounted even after unmapping the VMAs. So we track the number
> of VM_EXECUTABLE VMAs and drop the new reference when the last one is
> unmapped. This avoids pinning the mounted filesystem.
exe_file's vma accounting is hooked into every file mmap/unmmap and vma
split/merge just to fix some hypothetical pinning fs from umounting by mm,
which already unmapped all its executable files, but still alive.
Seems like currently nobody depends on this behaviour. We can try to
remove this logic and keep mm->exe_file until final mmput().
mm->exe_file is still protected with mm->mmap_sem, because we want to
change it via new sys_prctl(PR_SET_MM_EXE_FILE). Also via this syscall
task can change its mm->exe_file and unpin mountpoint explicitly.
Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Carsten Otte <cotte@de.ibm.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Eric Paris <eparis@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Morris <james.l.morris@oracle.com>
Cc: Jason Baron <jbaron@redhat.com>
Cc: Kentaro Takeda <takedakn@nttdata.co.jp>
Cc: Matt Helsley <matthltc@us.ibm.com>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Robert Richter <robert.richter@amd.com>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Venkatesh Pallipadi <venki@google.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Some security modules and oprofile still uses VM_EXECUTABLE for retrieving
a task's executable file. After this patch they will use mm->exe_file
directly. mm->exe_file is protected with mm->mmap_sem, so locking stays
the same.
Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Acked-by: Chris Metcalf <cmetcalf@tilera.com> [arch/tile]
Acked-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> [tomoyo]
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Carsten Otte <cotte@de.ibm.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Eric Paris <eparis@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Acked-by: James Morris <james.l.morris@oracle.com>
Cc: Jason Baron <jbaron@redhat.com>
Cc: Kentaro Takeda <takedakn@nttdata.co.jp>
Cc: Matt Helsley <matthltc@us.ibm.com>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Robert Richter <robert.richter@amd.com>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Venkatesh Pallipadi <venki@google.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The synchronization between CPU hotplug readers and writers is achieved
by means of refcounting, safeguarded by the cpu_hotplug.lock.
get_online_cpus() increments the refcount, whereas put_online_cpus()
decrements it. If we ever hit an imbalance between the two, we end up
compromising the guarantees of the hotplug synchronization i.e, for
example, an extra call to put_online_cpus() can end up allowing a
hotplug reader to execute concurrently with a hotplug writer.
So, add a WARN_ON() in put_online_cpus() to detect such cases where the
refcount can go negative, and also attempt to fix it up, so that we can
continue to run.
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Reviewed-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Introduce SYSCTL_EXCEPTION_TRACE config option and selec it in the
architectures requiring support for the "exception-trace" debug_table
entry in kernel/sysctl.c.
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Kirill noted the following deadlock cycle on shutdown involving padata:
> With commit 755609a908 I've got deadlock on
> poweroff.
>
> It guess it happens because of race for cpu_hotplug.lock:
>
> CPU A CPU B
> disable_nonboot_cpus()
> _cpu_down()
> cpu_hotplug_begin()
> mutex_lock(&cpu_hotplug.lock);
> __cpu_notify()
> padata_cpu_callback()
> __padata_remove_cpu()
> padata_replace()
> synchronize_rcu()
> rcu_gp_kthread()
> get_online_cpus();
> mutex_lock(&cpu_hotplug.lock);
It would of course be good to eliminate grace-period delays from
CPU-hotplug notifiers, but that is a separate issue. Deadlock is
not an appropriate diagnostic for excessive CPU-hotplug latency.
Fortunately, grace-period initialization does not actually need to
exclude all of the CPU-hotplug operation, but rather only RCU's own
CPU_UP_PREPARE and CPU_DEAD CPU-hotplug notifiers. This commit therefore
introduces a new per-rcu_state onoff_mutex that provides the required
concurrency control in place of the get_online_cpus() that was previously
in rcu_gp_init().
Reported-by: "Kirill A. Shutemov" <kirill@shutemov.name>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Kirill A. Shutemov <kirill@shutemov.name>
Multiple threads can manipulate uprobe->flags, this is obviously
unsafe. For example mmap can set UPROBE_COPY_INSN while register
tries to set UPROBE_RUN_HANDLER, the latter can also race with
can_skip_sstep() which clears UPROBE_SKIP_SSTEP.
Change this code to use bitops.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
install_breakpoint() is called under mm->mmap_sem, this protects
set_swbp() but not prepare_uprobe(). Two or more different tasks
can call install_breakpoint()->prepare_uprobe() at the same time,
this leads to numerous problems if UPROBE_COPY_INSN is not set.
Just for example, the second copy_insn() can corrupt the already
analyzed/fixuped uprobe->arch.insn and race with handle_swbp().
This patch simply adds uprobe->copy_mutex to serialize this code.
We could probably reuse ->consumer_rwsem, but this would mean that
consumer->handler() can not use mm->mmap_sem, not good.
Note: this is another temporary ugly hack until we move this logic
into uprobe_register().
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Preparation. Extract the copy_insn/arch_uprobe_analyze_insn code
from install_breakpoint() into the new helper, prepare_uprobe().
And move uprobe->flags defines from uprobes.h to uprobes.c, nobody
else can use them anyway.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Strictly speaking this race was added by me in 56bb4cf6. However
I think that this bug is just another indication that we should
move copy_insn/uprobe_analyze_insn code from install_breakpoint()
to uprobe_register(), there are a lot of other reasons for that.
Until then, add a hack to close the race.
A task can hit uprobe U1, but before it calls find_uprobe() this
uprobe can be unregistered *AND* another uprobe U2 can be added to
uprobes_tree at the same inode/offset. In this case handle_swbp()
will use the not-fully-initialized U2, in particular its arch.insn
for xol.
Add the additional !UPROBE_COPY_INSN check into handle_swbp(),
if this flag is not set we simply restart as if the new uprobe was
not inserted yet. This is not very nice, we need barriers, but we
will remove this hack when we change uprobe_register().
Note: with or without this patch install_breakpoint() can race with
itself, yet another reson to kill UPROBE_COPY_INSN altogether. And
even the usage of uprobe->flags is not safe. See the next patches.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
delete_uprobe() must not be called if register_for_each_vma(false)
fails to remove all breakpoints, __uprobe_unregister() is correct.
The problem is that register_for_each_vma(false) always returns 0
and thus this logic does not work.
1. Change verify_opcode() to return 0 rather than -EINVAL when
unregister detects the !is_swbp insn, we can treat this case
as success and currently unregister paths ignore the error
code anyway.
2. Change remove_breakpoint() to propagate the error code from
write_opcode().
3. Change register_for_each_vma(is_register => false) to remove
as much breakpoints as possible but return non-zero if
remove_breakpoint() fails at least once.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Pull virtio changes from Rusty Russell:
"New workflow: same git trees pulled by linux-next get sent straight to
Linus. Git is awkward at shuffling patches compared with quilt or mq,
but that doesn't happen often once things get into my -next branch."
* 'virtio-next' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux: (24 commits)
lguest: fix occasional crash in example launcher.
virtio-blk: Disable callback in virtblk_done()
virtio_mmio: Don't attempt to create empty virtqueues
virtio_mmio: fix off by one error allocating queue
drivers/virtio/virtio_pci.c: fix error return code
virtio: don't crash when device is buggy
virtio: remove CONFIG_VIRTIO_RING
virtio: add help to CONFIG_VIRTIO option.
virtio: support reserved vqs
virtio: introduce an API to set affinity for a virtqueue
virtio-ring: move queue_index to vring_virtqueue
virtio_balloon: not EXPERIMENTAL any more.
virtio-balloon: dependency fix
virtio-blk: fix NULL checking in virtblk_alloc_req()
virtio-blk: Add REQ_FLUSH and REQ_FUA support to bio path
virtio-blk: Add bio-based IO path for virtio-blk
virtio: console: fix error handling in init() function
tools: Fix pthread flag for Makefile of trace-agent used by virtio-trace
tools: Add guest trace agent as a user tool
virtio/console: Allocate scatterlist according to the current pipe size
...
and no longer use its debugfs knobs. The change slightly touches
kernel/trace directory, but it got the needed ack from Steven Rostedt:
http://lkml.org/lkml/2012/8/21/688
2. Added maintainers entry;
3. A bunch of fixes, nothing special.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.19 (GNU/Linux)
iQIcBAABAgAGBQJQbjoFAAoJEGgI9fZJve1bZgUP/A/ZwFGfdnochDgRhK5p7ljY
baRZpgSh2B+BxIDTEPLfVh6HbOivmYJ8WF0unD9kKTzCCS71ZUMiLB25G/bV4lnZ
fawAOhGfOLG3rXmldxf6nJllHr9JpoSmVEHypvjFNcbjYZ04zhe7jM+YsaWmBw68
eHXQkOSdfPPpKXZ2B0Eef/EoGWhORW0kTD7xFlorsxYAkksSheY0PC0nYgCFhvCZ
168y9pi4T4lucr4s44x8AJ/r/5BQ1jEQAY/A2qUE/iBfRFP4XyE1Oao4OHtVDYdU
KjVPA1VYmwkKSfnkiVFrpb/94IyrKslblgR8nX0kK3L/ccFYjQix4nd9jR+n857s
xfAuj9nfhUO6fI5qoaVSOBufxKyPp1S7X8INEAJ7WQ0c9VoMv00biK9M77ifDGZg
ll/Ecq1CADtcbOnQXf6qwGwRKmpR+qgPkIzpNXcuGMuM4AEPwtckOhCyXFr37Txk
6ZoGM8IIaBJ0yXxHkfpUA7l9ZF0gXR+qHMQCwpUS8tIMx35On+IbybEaKbniKEi1
AURgQ7ZimVYAHPi0Y0L00+EKI3IPVQJvCFH7SG+wUfLWcbEtNbTv3MAer5o3DANJ
GMnWBwNw9ClTydWKI0GMNmnWpFukWhd4OXleyl2+q4qRJi3HhNacrok3s/2r+CnT
QRg8i/0SDvxGuXazrTZT
=1HAE
-----END PGP SIGNATURE-----
Merge tag 'for-v3.7' of git://git.infradead.org/users/cbou/linux-pstore
Pull pstore changes from Anton Vorontsov:
1) We no longer ad-hoc to the function tracer "high level"
infrastructure and no longer use its debugfs knobs. The change
slightly touches kernel/trace directory, but it got the needed ack
from Steven Rostedt:
http://lkml.org/lkml/2012/8/21/688
2) Added maintainers entry;
3) A bunch of fixes, nothing special.
* tag 'for-v3.7' of git://git.infradead.org/users/cbou/linux-pstore:
pstore: Avoid recursive spinlocks in the oops_in_progress case
pstore/ftrace: Convert to its own enable/disable debugfs knob
pstore/ram: Add missing platform_device_unregister
MAINTAINERS: Add pstore maintainers
pstore/ram: Mark ramoops_pstore_write_buf() as notrace
pstore/ram: Fix printk format warning
pstore/ram: Fix possible NULL dereference
Using a recursive call add a non-conflicting region in
__reserve_region_with_split() could result in a stack overflow in the case
that the recursive calls are too deep. Convert the recursive calls to an
iterative loop to avoid the problem.
Tested on a machine containing 135 regions. The kernel no longer panicked
with stack overflow.
Also tested with code arbitrarily adding regions with no conflict,
embedding two consecutive conflicts and embedding two non-consecutive
conflicts.
Signed-off-by: T Makphaibulchoke <tmac@hp.com>
Reviewed-by: Ram Pai <linuxram@us.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@gmail.com>
Cc: Wei Yang <weiyang@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If prepare_reply() succeeds we have allocated memory for 'rep_skb'. If
nla_reserve() then subsequently fails and returns NULL we fail to release
the memory we allocated, thus causing a leak.
Signed-off-by: Jesper Juhl <jj@chaosbits.net>
Cc: Balbir Singh <bsingharora@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The inclusion of <generated/utsrelease.h> is unnecessary.
Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Reviewed-by: Simon Horman <horms@verge.net.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is a preparatory patch for the introduction of NT_SIGINFO elf note.
With this patch we pass "siginfo_t *siginfo" instead of "int signr" to
do_coredump() and put it into coredump_params. It will be used by the
next patch. Most changes are simple s/signr/siginfo->si_signo/.
Signed-off-by: Denys Vlasenko <vda.linux@googlemail.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Cc: Amerigo Wang <amwang@redhat.com>
Cc: "Jonathan M. Foote" <jmfoote@cert.org>
Cc: Roland McGrath <roland@hack.frob.com>
Cc: Pedro Alves <palves@redhat.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Create a new header file, fs/coredump.h, which contains functions only
used by the new coredump.c. It also moves do_coredump to the
include/linux/coredump.h header file, for consistency.
Signed-off-by: Alex Kelly <alex.page.kelly@gmail.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Acked-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Adds an expert Kconfig option, CONFIG_COREDUMP, which allows disabling of
core dump. This saves approximately 2.6k in the compiled kernel, and
complements CONFIG_ELF_CORE, which now depends on it.
CONFIG_COREDUMP also disables coredump-related sysctls, except for
suid_dumpable and related functions, which are necessary for ptrace.
[akpm@linux-foundation.org: fix binfmt_aout.c build]
Signed-off-by: Alex Kelly <alex.page.kelly@gmail.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
orderly_poweroff is trying to poweroff platform in two steps:
step 1: Call user space application to poweroff
step 2: If user space poweroff fail, then do a force power off if force param
is set.
The bug here is, step 1 is always successful with param UMH_NO_WAIT, which obey
the design goal of orderly_poweroff.
We have two choices here:
UMH_WAIT_EXEC which means wait for the exec, but not the process;
UMH_WAIT_PROC which means wait for the process to complete.
we need to trade off the two choices:
If using UMH_WAIT_EXEC, there is potential issue comments by Serge E.
Hallyn: The exec will have started, but may for whatever (very unlikely)
reason fail.
If using UMH_WAIT_PROC, there is potential issue comments by Eric W.
Biederman: If the caller is not running in a kernel thread then we can
easily get into a case where the user space caller will block waiting for
us when we are waiting for the user space caller.
Thanks for their excellent ideas, based on the above discussion, we
finally choose UMH_WAIT_EXEC, which is much more safe, if the user
application really fails, we just complain the application itself, it
seems a better choice here.
Signed-off-by: Feng Hong <hongfeng@marvell.com>
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: "Rafael J. Wysocki" <rjw@sisk.pl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
As kernel_power_off() calls disable_nonboot_cpus(), we may also want to
have kernel_restart() call disable_nonboot_cpus(). Doing so can help
machines that require boot cpu be the last alive cpu during reboot to
survive with kernel restart.
This fixes one reboot issue seen on imx6q (Cortex-A9 Quad). The machine
requires that the restart routine be run on the primary cpu rather than
secondary ones. Otherwise, the secondary core running the restart
routine will fail to come to online after reboot.
Signed-off-by: Shawn Guo <shawn.guo@linaro.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jiri reported that he could trigger the WARN_ON_ONCE() in
perf_cgroup_switch() using sw-events. This is because sw-events share
a cpuctx with multiple PMUs.
Use the ->unique_pmu pointer to limit the pmu iteration to unique
cpuctx instances.
Reported-and-Tested-by: Jiri Olsa <jolsa@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/n/tip-so7wi2zf3jjzrwcutm2mkz0j@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Stephane thought the perf_cpu_context::active_pmu name confusing and
suggested using 'unique_pmu' instead.
This pointer is a pointer to a 'random' pmu sharing the cpuctx
instance, therefore limiting a for_each_pmu loop to those where
cpuctx->unique_pmu matches the pmu we get a loop over unique cpuctx
instances.
Suggested-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/n/tip-kxyjqpfj2fn9gt7kwu5ag9ks@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Once array sched_domains_numa_masks[] []is defined, it is never updated.
When a new cpu on a new node is onlined, the coincident member in
sched_domains_numa_masks[][] is not initialized, and all the masks are 0.
As a result, the build_overlap_sched_groups() will initialize a NULL
sched_group for the new cpu on the new node, which will lead to kernel panic:
[ 3189.403280] Call Trace:
[ 3189.403286] [<ffffffff8106c36f>] warn_slowpath_common+0x7f/0xc0
[ 3189.403289] [<ffffffff8106c3ca>] warn_slowpath_null+0x1a/0x20
[ 3189.403292] [<ffffffff810b1d57>] build_sched_domains+0x467/0x470
[ 3189.403296] [<ffffffff810b2067>] partition_sched_domains+0x307/0x510
[ 3189.403299] [<ffffffff810b1ea2>] ? partition_sched_domains+0x142/0x510
[ 3189.403305] [<ffffffff810fcc93>] cpuset_update_active_cpus+0x83/0x90
[ 3189.403308] [<ffffffff810b22a8>] cpuset_cpu_active+0x38/0x70
[ 3189.403316] [<ffffffff81674b87>] notifier_call_chain+0x67/0x150
[ 3189.403320] [<ffffffff81664647>] ? native_cpu_up+0x18a/0x1b5
[ 3189.403328] [<ffffffff810a044e>] __raw_notifier_call_chain+0xe/0x10
[ 3189.403333] [<ffffffff81070470>] __cpu_notify+0x20/0x40
[ 3189.403337] [<ffffffff8166663e>] _cpu_up+0xe9/0x131
[ 3189.403340] [<ffffffff81666761>] cpu_up+0xdb/0xee
[ 3189.403348] [<ffffffff8165667c>] store_online+0x9c/0xd0
[ 3189.403355] [<ffffffff81437640>] dev_attr_store+0x20/0x30
[ 3189.403361] [<ffffffff8124aa63>] sysfs_write_file+0xa3/0x100
[ 3189.403368] [<ffffffff811ccbe0>] vfs_write+0xd0/0x1a0
[ 3189.403371] [<ffffffff811ccdb4>] sys_write+0x54/0xa0
[ 3189.403375] [<ffffffff81679c69>] system_call_fastpath+0x16/0x1b
[ 3189.403377] ---[ end trace 1e6cf85d0859c941 ]---
[ 3189.403398] BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
This patch registers a new notifier for cpu hotplug notify chain, and
updates sched_domains_numa_masks every time a new cpu is onlined or offlined.
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
[ fixed compile warning ]
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1348578751-16904-3-git-send-email-tangchen@cn.fujitsu.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We should temporarily reset 'sched_domains_numa_levels' to 0 after
it is reset to 'level' in sched_init_numa(). If it fails to allocate
memory for array sched_domains_numa_masks[][], the array will contain
less then 'level' members. This could be dangerous when we use it to
iterate array sched_domains_numa_masks[][] in other functions.
This patch set sched_domains_numa_levels to 0 before initializing
array sched_domains_numa_masks[][], and reset it to 'level' when
sched_domains_numa_masks[][] is fully initialized.
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1348578751-16904-2-git-send-email-tangchen@cn.fujitsu.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When we stop the tick in idle, we save the current jiffies value
in ts->idle_jiffies. This snapshot is substracted from the later
value of jiffies when the tick is restarted and the resulting
delta is accounted as idle cputime. This is how we handle the
idle cputime accounting without the tick.
But sometimes we need to schedule the next tick to some time in
the future instead of completely stopping it. In this case, a
tick may happen before we restart the periodic behaviour and
from that tick we account one jiffy to idle cputime as usual but
we also increment the ts->idle_jiffies snapshot by one so that
when we compute the delta to account, we substract the one jiffy
we just accounted.
To prepare for stopping the tick outside idle, we introduced a
check that prevents from fixing up that ts->idle_jiffies if we
are not running the idle task. But we use idle_cpu() for that
and this is a problem if we run the tick while another CPU
remotely enqueues a ttwu to our runqueue:
CPU 0: CPU 1:
tick_sched_timer() { ttwu_queue_remote()
if (idle_cpu(CPU 0))
ts->idle_jiffies++;
}
Here, idle_cpu() notes that &rq->wake_list is not empty and
hence won't consider the CPU as idle. As a result,
ts->idle_jiffies won't be incremented. But this is wrong because
we actually account the current jiffy to idle cputime. And that
jiffy won't get substracted from the nohz time delta. So in the
end, this jiffy is accounted twice.
Fix this by changing idle_cpu(smp_processor_id()) with
is_idle_task(current). This way the jiffy is substracted
correctly even if a ttwu operation is enqueued on the CPU.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: <stable@vger.kernel.org> # 3.5+
Link: http://lkml.kernel.org/r/1349308004-3482-1-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)
iQIcBAABAgAGBQJQbY/2AAoJEI7yEDeUysxlymQQAIv5svpAI/FUe3FhvBi3IW2h
WWMIpbdhHyocaINT18qNp8prO0iwoaBfgsnU8zuB34MrbdUgiwSHgM6T4Ff4NGa+
R4u+gpyKYwxNQYKeJyj04luXra/krxwHL1u9OwN7o44JuQXAmzrw2tZ9ad1ArvL3
eoZ6kGsPcdHPZMZWw2jN5xzBsRtqybm0GPPQh1qPXdn8UlPPd1X7owvbaud2y4+e
StVIpGY6wrsO36f7UcA4Gm1EP/1E6Lm5KMXJyHgM9WBRkEfp92jTY5+XKv91vK8Z
VKUd58QMdZE5NCNBkAR9U5N9aH0oSXnFU/g8hgiwGvrhS3IsSkKUePE6sVyMVTIO
VptKRYe0AdmD/g25p6ApJsguV7ITlgoCPaE4rMmRcW9/bw8+iY098r7tO7w11H8M
TyFOXihc3B+rlH8WdzOblwxHMC4yRuiPIktaA3WwbX7eA7Xv/ZRtdidifXKtgsVE
rtubVqwGyYcHoX1Y+JiByIW1NN0pYncJhPEdc8KbRe2wKs3amA9rio1mUpBYYBPO
B0ygcITftyXbhcTtssgcwBDGXB0AAGqI7wqdtJhFeIrKwHXD7fNeAGRwO8oKxmlj
0aPwo9fDtpI+e6BFTohEgjZBocRvXXNWLnDSFB0E7xDR31bACck2FG5FAp1DxdS7
lb/nbAsXf9UJLgGir4I1
=kN6V
-----END PGP SIGNATURE-----
Merge tag 'kvm-3.7-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM updates from Avi Kivity:
"Highlights of the changes for this release include support for vfio
level triggered interrupts, improved big real mode support on older
Intels, a streamlines guest page table walker, guest APIC speedups,
PIO optimizations, better overcommit handling, and read-only memory."
* tag 'kvm-3.7-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (138 commits)
KVM: s390: Fix vcpu_load handling in interrupt code
KVM: x86: Fix guest debug across vcpu INIT reset
KVM: Add resampling irqfds for level triggered interrupts
KVM: optimize apic interrupt delivery
KVM: MMU: Eliminate pointless temporary 'ac'
KVM: MMU: Avoid access/dirty update loop if all is well
KVM: MMU: Eliminate eperm temporary
KVM: MMU: Optimize is_last_gpte()
KVM: MMU: Simplify walk_addr_generic() loop
KVM: MMU: Optimize pte permission checks
KVM: MMU: Update accessed and dirty bits after guest pagetable walk
KVM: MMU: Move gpte_access() out of paging_tmpl.h
KVM: MMU: Optimize gpte_access() slightly
KVM: MMU: Push clean gpte write protection out of gpte_access()
KVM: clarify kvmclock documentation
KVM: make processes waiting on vcpu mutex killable
KVM: SVM: Make use of asm.h
KVM: VMX: Make use of asm.h
KVM: VMX: Make lto-friendly
KVM: x86: lapic: Clean up find_highest_vector() and count_vectors()
...
Conflicts:
arch/s390/include/asm/processor.h
arch/x86/kvm/i8259.c
Pull security subsystem updates from James Morris:
"Highlights:
- Integrity: add local fs integrity verification to detect offline
attacks
- Integrity: add digital signature verification
- Simple stacking of Yama with other LSMs (per LSS discussions)
- IBM vTPM support on ppc64
- Add new driver for Infineon I2C TIS TPM
- Smack: add rule revocation for subject labels"
Fixed conflicts with the user namespace support in kernel/auditsc.c and
security/integrity/ima/ima_policy.c.
* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (39 commits)
Documentation: Update git repository URL for Smack userland tools
ima: change flags container data type
Smack: setprocattr memory leak fix
Smack: implement revoking all rules for a subject label
Smack: remove task_wait() hook.
ima: audit log hashes
ima: generic IMA action flag handling
ima: rename ima_must_appraise_or_measure
audit: export audit_log_task_info
tpm: fix tpm_acpi sparse warning on different address spaces
samples/seccomp: fix 31 bit build on s390
ima: digital signature verification support
ima: add support for different security.ima data types
ima: add ima_inode_setxattr/removexattr function and calls
ima: add inode_post_setattr call
ima: replace iint spinblock with rwlock/read_lock
ima: allocating iint improvements
ima: add appraise action keywords and default rules
ima: integrity appraisal extension
vfs: move ima_file_free before releasing the file
...
Pull vfs update from Al Viro:
- big one - consolidation of descriptor-related logics; almost all of
that is moved to fs/file.c
(BTW, I'm seriously tempted to rename the result to fd.c. As it is,
we have a situation when file_table.c is about handling of struct
file and file.c is about handling of descriptor tables; the reasons
are historical - file_table.c used to be about a static array of
struct file we used to have way back).
A lot of stray ends got cleaned up and converted to saner primitives,
disgusting mess in android/binder.c is still disgusting, but at least
doesn't poke so much in descriptor table guts anymore. A bunch of
relatively minor races got fixed in process, plus an ext4 struct file
leak.
- related thing - fget_light() partially unuglified; see fdget() in
there (and yes, it generates the code as good as we used to have).
- also related - bits of Cyrill's procfs stuff that got entangled into
that work; _not_ all of it, just the initial move to fs/proc/fd.c and
switch of fdinfo to seq_file.
- Alex's fs/coredump.c spiltoff - the same story, had been easier to
take that commit than mess with conflicts. The rest is a separate
pile, this was just a mechanical code movement.
- a few misc patches all over the place. Not all for this cycle,
there'll be more (and quite a few currently sit in akpm's tree)."
Fix up trivial conflicts in the android binder driver, and some fairly
simple conflicts due to two different changes to the sock_alloc_file()
interface ("take descriptor handling from sock_alloc_file() to callers"
vs "net: Providing protocol type via system.sockprotoname xattr of
/proc/PID/fd entries" adding a dentry name to the socket)
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (72 commits)
MAX_LFS_FILESIZE should be a loff_t
compat: fs: Generic compat_sys_sendfile implementation
fs: push rcu_barrier() from deactivate_locked_super() to filesystems
btrfs: reada_extent doesn't need kref for refcount
coredump: move core dump functionality into its own file
coredump: prevent double-free on an error path in core dumper
usb/gadget: fix misannotations
fcntl: fix misannotations
ceph: don't abuse d_delete() on failure exits
hypfs: ->d_parent is never NULL or negative
vfs: delete surplus inode NULL check
switch simple cases of fget_light to fdget
new helpers: fdget()/fdput()
switch o2hb_region_dev_write() to fget_light()
proc_map_files_readdir(): don't bother with grabbing files
make get_file() return its argument
vhost_set_vring(): turn pollstart/pollstop into bool
switch prctl_set_mm_exe_file() to fget_light()
switch xfs_find_handle() to fget_light()
switch xfs_swapext() to fget_light()
...
* Improved system suspend/resume and runtime PM handling for the SH TMU, CMT
and MTU2 clock event devices (also used by ARM/shmobile).
* Generic PM domains framework extensions related to cpuidle support and
domain objects lookup using names.
* ARM/shmobile power management updates including improved support for the
SH7372's A4S power domain containing the CPU core.
* cpufreq changes related to AMD CPUs support from Matthew Garrett, Andre
Przywara and Borislav Petkov.
* cpu0 cpufreq driver from Shawn Guo.
* cpufreq governor fixes related to the relaxing of limit from Michal Pecio.
* OMAP cpufreq updates from Axel Lin and Richard Zhao.
* cpuidle ladder governor fixes related to the disabling of states from
Carsten Emde and me.
* Runtime PM core updates related to the interactions with the system suspend
core from Alan Stern and Kevin Hilman.
* Wakeup sources modification allowing more helper functions to be called from
interrupt context from John Stultz and additional diagnostic code from Todd
Poynor.
* System suspend error code path fix from Feng Hong.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.18 (GNU/Linux)
iQIcBAABAgAGBQJQa1rRAAoJEKhOf7ml8uNsYZ0P/2RZ71sgLWcUCfr0yHaiZeOd
2GxEYSZ+9BZJHADgoAK/bHRTv8crm40Y2RkbaWbxPDRNuE4SutbvNTGTlJSAguSD
yHkU/6AFC7u8Jwq+afsWIdGX7eHd78zPpj6EVtVtjHM903WDwbMU2vUz7tQ+fFa+
ZZ7eydq9j0ec0OoH3UeNhet7JSOpT5BSLgjmIkHMBgIvTxNVDbkB31QUxnUxocxn
k6S2wQaUSJJWGMLksRRNrhwLq+cGYwTsaOtG/KzRLH1raUyn33B5pcZr0aqhOkjg
ClaCks3V8o3vRghSwOPB5aVXzjBKvM3UnSyJNIl+FeCeyWuwSNbkEFdA/e7oPuxG
UsW6dcHiuVo6Ir4+zhd9+lN+/AcPTChO5b7lbU8qRF4ce04czWlUY/KzJjaM+YOE
CKGq6eX9AHwFjE+h4+VcCXgmzcioiS8Y/CPz13u8N1y0zzwW+ftjb12K+7lVBEG1
fhrePKHgLw3kJ9LqGpR+4vVur7C+rCf6WwCReTY2vXXVYJ+SuKWTRI4zAjTPXtHa
i9dpMRASpF+ScRYBcgwIpv789WuHATFKqdBSinZUKBaxQZ5flJ2qIrfqN5VeAejh
oQs/zZCdIuAtFKqVycQ0L42YxFNKgPFKQErUCSu3M5OuZLlLVLu7yQvIo2Xmo9qf
Hcrpvo5K+w29YkiwGP9e
=rbCk
-----END PGP SIGNATURE-----
Merge tag 'pm-for-3.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management updates from Rafael J Wysocki:
- Improved system suspend/resume and runtime PM handling for the SH
TMU, CMT and MTU2 clock event devices (also used by ARM/shmobile).
- Generic PM domains framework extensions related to cpuidle support
and domain objects lookup using names.
- ARM/shmobile power management updates including improved support for
the SH7372's A4S power domain containing the CPU core.
- cpufreq changes related to AMD CPUs support from Matthew Garrett,
Andre Przywara and Borislav Petkov.
- cpu0 cpufreq driver from Shawn Guo.
- cpufreq governor fixes related to the relaxing of limit from Michal
Pecio.
- OMAP cpufreq updates from Axel Lin and Richard Zhao.
- cpuidle ladder governor fixes related to the disabling of states from
Carsten Emde and me.
- Runtime PM core updates related to the interactions with the system
suspend core from Alan Stern and Kevin Hilman.
- Wakeup sources modification allowing more helper functions to be
called from interrupt context from John Stultz and additional
diagnostic code from Todd Poynor.
- System suspend error code path fix from Feng Hong.
Fixed up conflicts in cpufreq/powernow-k8 that stemmed from the
workqueue fixes conflicting fairly badly with the removal of support for
hardware P-state chips. The changes were independent but somewhat
intertwined.
* tag 'pm-for-3.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (76 commits)
Revert "PM QoS: Use spinlock in the per-device PM QoS constraints code"
PM / Runtime: let rpm_resume() succeed if RPM_ACTIVE, even when disabled, v2
cpuidle: rename function name "__cpuidle_register_driver", v2
cpufreq: OMAP: Check IS_ERR() instead of NULL for omap_device_get_by_hwmod_name
cpuidle: remove some empty lines
PM: Prevent runtime suspend during system resume
PM QoS: Use spinlock in the per-device PM QoS constraints code
PM / Sleep: use resume event when call dpm_resume_early
cpuidle / ACPI : move cpuidle_device field out of the acpi_processor_power structure
ACPI / processor: remove pointless variable initialization
ACPI / processor: remove unused function parameter
cpufreq: OMAP: remove loops_per_jiffy recalculate for smp
sections: fix section conflicts in drivers/cpufreq
cpufreq: conservative: update frequency when limits are relaxed
cpufreq / ondemand: update frequency when limits are relaxed
properly __init-annotate pm_sysrq_init()
cpufreq: Add a generic cpufreq-cpu0 driver
PM / OPP: Initialize OPP table from device tree
ARM: add cpufreq transiton notifier to adjust loops_per_jiffy for smp
cpufreq: Remove support for hardware P-state chips from powernow-k8
...
Pull networking changes from David Miller:
1) GRE now works over ipv6, from Dmitry Kozlov.
2) Make SCTP more network namespace aware, from Eric Biederman.
3) TEAM driver now works with non-ethernet devices, from Jiri Pirko.
4) Make openvswitch network namespace aware, from Pravin B Shelar.
5) IPV6 NAT implementation, from Patrick McHardy.
6) Server side support for TCP Fast Open, from Jerry Chu and others.
7) Packet BPF filter supports MOD and XOR, from Eric Dumazet and Daniel
Borkmann.
8) Increate the loopback default MTU to 64K, from Eric Dumazet.
9) Use a per-task rather than per-socket page fragment allocator for
outgoing networking traffic. This benefits processes that have very
many mostly idle sockets, which is quite common.
From Eric Dumazet.
10) Use up to 32K for page fragment allocations, with fallbacks to
smaller sizes when higher order page allocations fail. Benefits are
a) less segments for driver to process b) less calls to page
allocator c) less waste of space.
From Eric Dumazet.
11) Allow GRO to be used on GRE tunnels, from Eric Dumazet.
12) VXLAN device driver, one way to handle VLAN issues such as the
limitation of 4096 VLAN IDs yet still have some level of isolation.
From Stephen Hemminger.
13) As usual there is a large boatload of driver changes, with the scale
perhaps tilted towards the wireless side this time around.
Fix up various fairly trivial conflicts, mostly caused by the user
namespace changes.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1012 commits)
hyperv: Add buffer for extended info after the RNDIS response message.
hyperv: Report actual status in receive completion packet
hyperv: Remove extra allocated space for recv_pkt_list elements
hyperv: Fix page buffer handling in rndis_filter_send_request()
hyperv: Fix the missing return value in rndis_filter_set_packet_filter()
hyperv: Fix the max_xfer_size in RNDIS initialization
vxlan: put UDP socket in correct namespace
vxlan: Depend on CONFIG_INET
sfc: Fix the reported priorities of different filter types
sfc: Remove EFX_FILTER_FLAG_RX_OVERRIDE_IP
sfc: Fix loopback self-test with separate_tx_channels=1
sfc: Fix MCDI structure field lookup
sfc: Add parentheses around use of bitfield macro arguments
sfc: Fix null function pointer in efx_sriov_channel_type
vxlan: virtual extensible lan
igmp: export symbol ip_mc_leave_group
netlink: add attributes to fdb interface
tg3: unconditionally select HWMON support when tg3 is enabled.
Revert "net: ti cpsw ethernet: allow reading phy interface mode from DT"
gre: fix sparse warning
...
Make the session keyring per-thread rather than per-process, but still
inherited from the parent thread to solve a problem with PAM and gdm.
The problem is that join_session_keyring() will reject attempts to change the
session keyring of a multithreaded program but gdm is now multithreaded before
it gets to the point of starting PAM and running pam_keyinit to create the
session keyring. See:
https://bugs.freedesktop.org/show_bug.cgi?id=49211
The reason that join_session_keyring() will only change the session keyring
under a single-threaded environment is that it's hard to alter the other
thread's credentials to effect the change in a multi-threaded program. The
problems are such as:
(1) How to prevent two threads both running join_session_keyring() from
racing.
(2) Another thread's credentials may not be modified directly by this process.
(3) The number of threads is uncertain whilst we're not holding the
appropriate spinlock, making preallocation slightly tricky.
(4) We could use TIF_NOTIFY_RESUME and key_replace_session_keyring() to get
another thread to replace its keyring, but that means preallocating for
each thread.
A reasonable way around this is to make the session keyring per-thread rather
than per-process and just document that if you want a common session keyring,
you must get it before you spawn any threads - which is the current situation
anyway.
Whilst we're at it, we can the process keyring behave in the same way. This
means we can clean up some of the ickyness in the creds code.
Basically, after this patch, the session, process and thread keyrings are about
inheritance rules only and not about sharing changes of keyring.
Reported-by: Mantas M. <grawity@gmail.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Tested-by: Ray Strode <rstrode@redhat.com>
Pull user namespace changes from Eric Biederman:
"This is a mostly modest set of changes to enable basic user namespace
support. This allows the code to code to compile with user namespaces
enabled and removes the assumption there is only the initial user
namespace. Everything is converted except for the most complex of the
filesystems: autofs4, 9p, afs, ceph, cifs, coda, fuse, gfs2, ncpfs,
nfs, ocfs2 and xfs as those patches need a bit more review.
The strategy is to push kuid_t and kgid_t values are far down into
subsystems and filesystems as reasonable. Leaving the make_kuid and
from_kuid operations to happen at the edge of userspace, as the values
come off the disk, and as the values come in from the network.
Letting compile type incompatible compile errors (present when user
namespaces are enabled) guide me to find the issues.
The most tricky areas have been the places where we had an implicit
union of uid and gid values and were storing them in an unsigned int.
Those places were converted into explicit unions. I made certain to
handle those places with simple trivial patches.
Out of that work I discovered we have generic interfaces for storing
quota by projid. I had never heard of the project identifiers before.
Adding full user namespace support for project identifiers accounts
for most of the code size growth in my git tree.
Ultimately there will be work to relax privlige checks from
"capable(FOO)" to "ns_capable(user_ns, FOO)" where it is safe allowing
root in a user names to do those things that today we only forbid to
non-root users because it will confuse suid root applications.
While I was pushing kuid_t and kgid_t changes deep into the audit code
I made a few other cleanups. I capitalized on the fact we process
netlink messages in the context of the message sender. I removed
usage of NETLINK_CRED, and started directly using current->tty.
Some of these patches have also made it into maintainer trees, with no
problems from identical code from different trees showing up in
linux-next.
After reading through all of this code I feel like I might be able to
win a game of kernel trivial pursuit."
Fix up some fairly trivial conflicts in netfilter uid/git logging code.
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (107 commits)
userns: Convert the ufs filesystem to use kuid/kgid where appropriate
userns: Convert the udf filesystem to use kuid/kgid where appropriate
userns: Convert ubifs to use kuid/kgid
userns: Convert squashfs to use kuid/kgid where appropriate
userns: Convert reiserfs to use kuid and kgid where appropriate
userns: Convert jfs to use kuid/kgid where appropriate
userns: Convert jffs2 to use kuid and kgid where appropriate
userns: Convert hpfs to use kuid and kgid where appropriate
userns: Convert btrfs to use kuid/kgid where appropriate
userns: Convert bfs to use kuid/kgid where appropriate
userns: Convert affs to use kuid/kgid wherwe appropriate
userns: On alpha modify linux_to_osf_stat to use convert from kuids and kgids
userns: On ia64 deal with current_uid and current_gid being kuid and kgid
userns: On ppc convert current_uid from a kuid before printing.
userns: Convert s390 getting uid and gid system calls to use kuid and kgid
userns: Convert s390 hypfs to use kuid and kgid where appropriate
userns: Convert binder ipc to use kuids
userns: Teach security_path_chown to take kuids and kgids
userns: Add user namespace support to IMA
userns: Convert EVM to deal with kuids and kgids in it's hmac computation
...
Pull cgroup hierarchy update from Tejun Heo:
"Currently, different cgroup subsystems handle nested cgroups
completely differently. There's no consistency among subsystems and
the behaviors often are outright broken.
People at least seem to agree that the broken hierarhcy behaviors need
to be weeded out if any progress is gonna be made on this front and
that the fallouts from deprecating the broken behaviors should be
acceptable especially given that the current behaviors don't make much
sense when nested.
This patch makes cgroup emit warning messages if cgroups for
subsystems with broken hierarchy behavior are nested to prepare for
fixing them in the future. This was put in a separate branch because
more related changes were expected (didn't make it this round) and the
memory cgroup wanted to pull in this and make changes on top."
* 'for-3.7-hierarchy' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup: mark subsystems with broken hierarchy support and whine if cgroups are nested for them
Pull cgroup updates from Tejun Heo:
- xattr support added. The implementation is shared with tmpfs. The
usage is restricted and intended to be used to manage per-cgroup
metadata by system software. tmpfs changes are routed through this
branch with Hugh's permission.
- cgroup subsystem ID handling simplified.
* 'for-3.7' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup: Define CGROUP_SUBSYS_COUNT according the configuration
cgroup: Assign subsystem IDs during compile time
cgroup: Do not depend on a given order when populating the subsys array
cgroup: Wrap subsystem selection macro
cgroup: Remove CGROUP_BUILTIN_SUBSYS_COUNT
cgroup: net_prio: Do not define task_netpioidx() when not selected
cgroup: net_cls: Do not define task_cls_classid() when not selected
cgroup: net_cls: Move sock_update_classid() declaration to cls_cgroup.h
cgroup: trivial fixes for Documentation/cgroups/cgroups.txt
xattr: mark variable as uninitialized to make both gcc and smatch happy
fs: add missing documentation to simple_xattr functions
cgroup: add documentation on extended attributes usage
cgroup: rename subsys_bits to subsys_mask
cgroup: add xattr support
cgroup: revise how we re-populate root directory
xattr: extract simple_xattr code from tmpfs
Pull workqueue changes from Tejun Heo:
"This is workqueue updates for v3.7-rc1. A lot of activities this
round including considerable API and behavior cleanups.
* delayed_work combines a timer and a work item. The handling of the
timer part has always been a bit clunky leading to confusing
cancelation API with weird corner-case behaviors. delayed_work is
updated to use new IRQ safe timer and cancelation now works as
expected.
* Another deficiency of delayed_work was lack of the counterpart of
mod_timer() which led to cancel+queue combinations or open-coded
timer+work usages. mod_delayed_work[_on]() are added.
These two delayed_work changes make delayed_work provide interface
and behave like timer which is executed with process context.
* A work item could be executed concurrently on multiple CPUs, which
is rather unintuitive and made flush_work() behavior confusing and
half-broken under certain circumstances. This problem doesn't
exist for non-reentrant workqueues. While non-reentrancy check
isn't free, the overhead is incurred only when a work item bounces
across different CPUs and even in simulated pathological scenario
the overhead isn't too high.
All workqueues are made non-reentrant. This removes the
distinction between flush_[delayed_]work() and
flush_[delayed_]_work_sync(). The former is now as strong as the
latter and the specified work item is guaranteed to have finished
execution of any previous queueing on return.
* In addition to the various bug fixes, Lai redid and simplified CPU
hotplug handling significantly.
* Joonsoo introduced system_highpri_wq and used it during CPU
hotplug.
There are two merge commits - one to pull in IRQ safe timer from
tip/timers/core and the other to pull in CPU hotplug fixes from
wq/for-3.6-fixes as Lai's hotplug restructuring depended on them."
Fixed a number of trivial conflicts, but the more interesting conflicts
were silent ones where the deprecated interfaces had been used by new
code in the merge window, and thus didn't cause any real data conflicts.
Tejun pointed out a few of them, I fixed a couple more.
* 'for-3.7' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: (46 commits)
workqueue: remove spurious WARN_ON_ONCE(in_irq()) from try_to_grab_pending()
workqueue: use cwq_set_max_active() helper for workqueue_set_max_active()
workqueue: introduce cwq_set_max_active() helper for thaw_workqueues()
workqueue: remove @delayed from cwq_dec_nr_in_flight()
workqueue: fix possible stall on try_to_grab_pending() of a delayed work item
workqueue: use hotcpu_notifier() for workqueue_cpu_down_callback()
workqueue: use __cpuinit instead of __devinit for cpu callbacks
workqueue: rename manager_mutex to assoc_mutex
workqueue: WORKER_REBIND is no longer necessary for idle rebinding
workqueue: WORKER_REBIND is no longer necessary for busy rebinding
workqueue: reimplement idle worker rebinding
workqueue: deprecate __cancel_delayed_work()
workqueue: reimplement cancel_delayed_work() using try_to_grab_pending()
workqueue: use mod_delayed_work() instead of __cancel + queue
workqueue: use irqsafe timer for delayed_work
workqueue: clean up delayed_work initializers and add missing one
workqueue: make deferrable delayed_work initializer names consistent
workqueue: cosmetic whitespace updates for macro definitions
workqueue: deprecate system_nrt[_freezable]_wq
workqueue: deprecate flush[_delayed]_work_sync()
...
This fixes two issues that could cause incompatibility between
kernel versions:
- If a tracer uses SECCOMP_RET_TRACE to select a syscall number
higher than the largest known syscall, emulate the unknown
vsyscall by returning -ENOSYS. (This is unlikely to make a
noticeable difference on x86-64 due to the way the system call
entry works.)
- On x86-64 with vsyscall=emulate, skipped vsyscalls were buggy.
This updates the documentation accordingly.
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Acked-by: Will Drewry <wad@chromium.org>
Signed-off-by: James Morris <james.l.morris@oracle.com>
As we skipped the merge window for 3.6-rc1 for the tty tree, everything
is now settled down and working properly, so we are ready for 3.7-rc1.
Here's the patchset, it's big, but the large changes are removing a
firmware file and adding a staging tty driver (it depended on the tty
core changes, so it's going through this tree instead of the staging
tree.)
All of these patches have been in the linux-next tree for a while.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.19 (GNU/Linux)
iEYEABECAAYFAlBp36oACgkQMUfUDdst+yk4WgCdEy13hot8fI2Lqnc7W0LKu7GX
4p8AoLTjzrXhLosxdijskDQ9X1OtjrxU
=S5Ng
-----END PGP SIGNATURE-----
Merge tag 'tty-3.6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty
Pull TTY changes from Greg Kroah-Hartman:
"As we skipped the merge window for 3.6-rc1 for the tty tree,
everything is now settled down and working properly, so we are ready
for 3.7-rc1. Here's the patchset, it's big, but the large changes are
removing a firmware file and adding a staging tty driver (it depended
on the tty core changes, so it's going through this tree instead of
the staging tree.)
All of these patches have been in the linux-next tree for a while.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>"
Fix up more-or-less trivial conflicts in
- drivers/char/pcmcia/synclink_cs.c:
tty NULL dereference fix vs tty_port_cts_enabled() helper function
- drivers/staging/{Kconfig,Makefile}:
add-add conflict (dgrp driver added close to other staging drivers)
- drivers/staging/ipack/devices/ipoctal.c:
"split ipoctal_channel from iopctal" vs "TTY: use tty_port_register_device"
* tag 'tty-3.6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty: (235 commits)
tty/serial: Add kgdb_nmi driver
tty/serial/amba-pl011: Quiesce interrupts in poll_get_char
tty/serial/amba-pl011: Implement poll_init callback
tty/serial/core: Introduce poll_init callback
kdb: Turn KGDB_KDB=n stubs into static inlines
kdb: Implement disable_nmi command
kernel/debug: Mask KGDB NMI upon entry
serial: pl011: handle corruption at high clock speeds
serial: sccnxp: Make 'default' choice in switch last
serial: sccnxp: Remove mask termios caps for SW flow control
serial: sccnxp: Report actual baudrate back to core
serial: samsung: Add poll_get_char & poll_put_char
Powerpc 8xx CPM_UART setting MAXIDL register proportionaly to baud rate
Powerpc 8xx CPM_UART maxidl should not depend on fifo size
Powerpc 8xx CPM_UART too many interrupts
Powerpc 8xx CPM_UART desynchronisation
serial: set correct baud_base for EXSYS EX-41092 Dual 16950
serial: omap: fix the reciever line error case
8250: blacklist Winbond CIR port
8250_pnp: do pnp probe before legacy probe
...
Features currently supported:
- 39-bit address space for user and kernel (each)
- 4KB and 64KB page configurations
- Compat (32-bit) user applications (ARMv7, EABI only)
- Flattened Device Tree (mandated for all AArch64 platforms)
- ARM generic timers
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (GNU/Linux)
iQIcBAABAgAGBQJQabRiAAoJEGvWsS0AyF7xXgcQAK+FTXt0ikdQYMkV5AIZXb9i
xHRhuiZWx2vKyk0mCqpyGLY58GSmSb6uTBg/2P2Ej7vXdH/RB2goPzjlspfjkDL4
o8RJp7eQ07Uz3KRDYEJgMP8xKZid6KFG93RJ6TjjpKZLuDBdwiG1GP1vb0jVcWfo
ttZrj/aI8lMcqrh3Vq5qefP7GWP1OVATqeaGTiT7oo38pXwF3t237xfBr2iDGFBp
ZgIRddrxpa7JYUesfJDDDdGHvLq7Vh2jJV+io9qasBZDrtppGJIhZ0vUni2DgIi7
r4i1LcynDN4JaG0maZ4U/YQm74TCD4BqxV8GJ7zwLPTWeN+of+skjhPSLOkA+0fp
I+sWjXlv200gDfJZ9qnUld2kFpoDfJi2b7fNDouSDd2OhmVOVWG3jnVP4Z7meVSb
O8BYzWDdsAiabuwciUY3OsmW6424lT93b2v86Vncs4unKMvEjOPxYZbUxhqX8f2j
gsmWwwD/yS4THx2B6OyW9VT3I5J6miqs2Glt/GG6vPWT5AKQJn9jCxKaBGhPMPIs
xe5/GycBYjdk/Y8qRjegxFbEqzQuiRzmkeFn5jwjmBLqpGNbZDpvMaL6adhAKM5/
v6UIKa91ra4fC9N0h6G61pOc9N9DbT8wPbCbdYY0RMTMRuLDZDgAM3Bvz0r2APdD
96leNy6vx684hbkCSLJs
=buJB
-----END PGP SIGNATURE-----
Merge tag 'arm64-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/cmarinas/linux-aarch64
Pull arm64 support from Catalin Marinas:
"Linux support for the 64-bit ARM architecture (AArch64)
Features currently supported:
- 39-bit address space for user and kernel (each)
- 4KB and 64KB page configurations
- Compat (32-bit) user applications (ARMv7, EABI only)
- Flattened Device Tree (mandated for all AArch64 platforms)
- ARM generic timers"
* tag 'arm64-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/cmarinas/linux-aarch64: (35 commits)
arm64: ptrace: remove obsolete ptrace request numbers from user headers
arm64: Do not set the SMP/nAMP processor bit
arm64: MAINTAINERS update
arm64: Build infrastructure
arm64: Miscellaneous header files
arm64: Generic timers support
arm64: Loadable modules
arm64: Miscellaneous library functions
arm64: Performance counters support
arm64: Add support for /proc/sys/debug/exception-trace
arm64: Debugging support
arm64: Floating point and SIMD
arm64: 32-bit (compat) applications support
arm64: User access library functions
arm64: Signal handling support
arm64: VDSO support
arm64: System calls handling
arm64: ELF definitions
arm64: SMP support
arm64: DMA mapping API
...
Pull x86/asm changes from Ingo Molnar:
"The one change that stands out is the alternatives patching change
that prevents us from ever patching back instructions from SMP to UP:
this simplifies things and speeds up CPU hotplug.
Other than that it's smaller fixes, cleanups and improvements."
* 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86: Unspaghettize do_trap()
x86_64: Work around old GAS bug
x86: Use REP BSF unconditionally
x86: Prefer TZCNT over BFS
x86/64: Adjust types of temporaries used by ffs()/fls()/fls64()
x86: Drop unnecessary kernel_eflags variable on 64-bit
x86/smp: Don't ever patch back to UP if we unplug cpus
Pull scheduler changes from Ingo Molnar:
"Continued quest to clean up and enhance the cputime code by Frederic
Weisbecker, in preparation for future tickless kernel features.
Other than that, smallish changes."
Fix up trivial conflicts due to additions next to each other in arch/{x86/}Kconfig
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (24 commits)
cputime: Make finegrained irqtime accounting generally available
cputime: Gather time/stats accounting config options into a single menu
ia64: Reuse system and user vtime accounting functions on task switch
ia64: Consolidate user vtime accounting
vtime: Consolidate system/idle context detection
cputime: Use a proper subsystem naming for vtime related APIs
sched: cpu_power: enable ARCH_POWER
sched/nohz: Clean up select_nohz_load_balancer()
sched: Fix load avg vs. cpu-hotplug
sched: Remove __ARCH_WANT_INTERRUPTS_ON_CTXSW
sched: Fix nohz_idle_balance()
sched: Remove useless code in yield_to()
sched: Add time unit suffix to sched sysctl knobs
sched/debug: Limit sd->*_idx range on sysctl
sched: Remove AFFINE_WAKEUPS feature flag
s390: Remove leftover account_tick_vtime() header
cputime: Consolidate vtime handling on context switch
sched: Move cputime code to its own file
cputime: Generalize CONFIG_VIRT_CPU_ACCOUNTING
tile: Remove SD_PREFER_LOCAL leftover
...
Pull perf update from Ingo Molnar:
"Lots of changes in this cycle as well, with hundreds of commits from
over 30 contributors. Most of the activity was on the tooling side.
Higher level changes:
- New 'perf kvm' analysis tool, from Xiao Guangrong.
- New 'perf trace' system-wide tracing tool
- uprobes fixes + cleanups from Oleg Nesterov.
- Lots of patches to make perf build on Android out of box, from
Irina Tirdea
- Extend ftrace function tracing utility to be more dynamic for its
users. It allows for data passing to the callback functions, as
well as reading regs as if a breakpoint were to trigger at function
entry.
The main goal of this patch series was to allow kprobes to use
ftrace as an optimized probe point when a probe is placed on an
ftrace nop. With lots of help from Masami Hiramatsu, and going
through lots of iterations, we finally came up with a good
solution.
- Add cpumask for uncore pmu, use it in 'stat', from Yan, Zheng.
- Various tracing updates from Steve Rostedt
- Clean up and improve 'perf sched' performance by elliminating lots
of needless calls to libtraceevent.
- Event group parsing support, from Jiri Olsa
- UI/gtk refactorings and improvements from Namhyung Kim
- Add support for non-tracepoint events in perf script python, from
Feng Tang
- Add --symbols to 'script', similar to the one in 'report', from
Feng Tang.
Infrastructure enhancements and fixes:
- Convert the trace builtins to use the growing evsel/evlist
tracepoint infrastructure, removing several open coded constructs
like switch like series of strcmp to dispatch events, etc.
Basically what had already been showcased in 'perf sched'.
- Add evsel constructor for tracepoints, that uses libtraceevent just
to parse the /format events file, use it in a new 'perf test' to
make sure the libtraceevent format parsing regressions can be more
readily caught.
- Some strange errors were happening in some builds, but not on the
next, reported by several people, problem was some parser related
files, generated during the build, didn't had proper make deps, fix
from Eric Sandeen.
- Introduce struct and cache information about the environment where
a perf.data file was captured, from Namhyung Kim.
- Fix handling of unresolved samples when --symbols is used in
'report', from Feng Tang.
- Add union member access support to 'probe', from Hyeoncheol Lee.
- Fixups to die() removal, from Namhyung Kim.
- Render fixes for the TUI, from Namhyung Kim.
- Don't enable annotation in non symbolic view, from Namhyung Kim.
- Fix pipe mode in 'report', from Namhyung Kim.
- Move related stats code from stat to util/, will be used by the
'stat' kvm tool, from Xiao Guangrong.
- Remove die()/exit() calls from several tools.
- Resolve vdso callchains, from Jiri Olsa
- Don't pass const char pointers to basename, so that we can
unconditionally use libgen.h and thus avoid ifdef BIONIC lines,
from David Ahern
- Refactor hist formatting so that it can be reused with the GTK
browser, From Namhyung Kim
- Fix build for another rbtree.c change, from Adrian Hunter.
- Make 'perf diff' command work with evsel hists, from Jiri Olsa.
- Use the only field_sep var that is set up: symbol_conf.field_sep,
fix from Jiri Olsa.
- .gitignore compiled python binaries, from Namhyung Kim.
- Get rid of die() in more libtraceevent places, from Namhyung Kim.
- Rename libtraceevent 'private' struct member to 'priv' so that it
works in C++, from Steven Rostedt
- Remove lots of exit()/die() calls from tools so that the main perf
exit routine can take place, from David Ahern
- Fix x86 build on x86-64, from David Ahern.
- {int,str,rb}list fixes from Suzuki K Poulose
- perf.data header fixes from Namhyung Kim
- Allow user to indicate objdump path, needed in cross environments,
from Maciek Borzecki
- Fix hardware cache event name generation, fix from Jiri Olsa
- Add round trip test for sw, hw and cache event names, catching the
problem Jiri fixed, after Jiri's patch, the test passes
successfully.
- Clean target should do clean for lib/traceevent too, fix from David
Ahern
- Check the right variable for allocation failure, fix from Namhyung
Kim
- Set up evsel->tp_format regardless of evsel->name being set
already, fix from Namhyung Kim
- Oprofile fixes from Robert Richter.
- Remove perf_event_attr needless version inflation, from Jiri Olsa
- Introduce libtraceevent strerror like error reporting facility,
from Namhyung Kim
- Add pmu mappings to perf.data header and use event names from cmd
line, from Robert Richter
- Fix include order for bison/flex-generated C files, from Ben
Hutchings
- Build fixes and documentation corrections from David Ahern
- Assorted cleanups from Robert Richter
- Let O= makes handle relative paths, from Steven Rostedt
- perf script python fixes, from Feng Tang.
- Initial bash completion support, from Frederic Weisbecker
- Allow building without libelf, from Namhyung Kim.
- Support DWARF CFI based unwind to have callchains when %bp based
unwinding is not possible, from Jiri Olsa.
- Symbol resolution fixes, while fixing support PPC64 files with an
.opt ELF section was the end goal, several fixes for code that
handles all architectures and cleanups are included, from Cody
Schafer.
- Assorted fixes for Documentation and build in 32 bit, from Robert
Richter
- Cache the libtraceevent event_format associated to each evsel
early, so that we avoid relookups, i.e. calling pevent_find_event
repeatedly when processing tracepoint events.
[ This is to reduce the surface contact with libtraceevents and
make clear what is that the perf tools needs from that lib: so
far parsing the common and per event fields. ]
- Don't stop the build if the audit libraries are not installed, fix
from Namhyung Kim.
- Fix bfd.h/libbfd detection with recent binutils, from Markus
Trippelsdorf.
- Improve warning message when libunwind devel packages not present,
from Jiri Olsa"
* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (282 commits)
perf trace: Add aliases for some syscalls
perf probe: Print an enum type variable in "enum variable-name" format when showing accessible variables
perf tools: Check libaudit availability for perf-trace builtin
perf hists: Add missing period_* fields when collapsing a hist entry
perf trace: New tool
perf evsel: Export the event_format constructor
perf evsel: Introduce rawptr() method
perf tools: Use perf_evsel__newtp in the event parser
perf evsel: The tracepoint constructor should store sys:name
perf evlist: Introduce set_filter() method
perf evlist: Renane set_filters method to apply_filters
perf test: Add test to check we correctly parse and match syscall open parms
perf evsel: Handle endianity in intval method
perf evsel: Know if byte swap is needed
perf tools: Allow handling a NULL cpu_map as meaning "all cpus"
perf evsel: Improve tracepoint constructor setup
tools lib traceevent: Fix error path on pevent_parse_event
perf test: Fix build failure
trace: Move trace event enable from fs_initcall to core_initcall
tracing: Add an option for disabling markers
...
Pull core locking changes from Ingo Molnar:
"It includes a lockdep improvement plus a spinlock inlining Kconfig
cleanup."
* 'core-locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
locking: Adjust spin lock inlining Kconfig options
lockdep: Check if nested lock is actually held
Pull core kernel fixes from Ingo Molnar:
"This is a complex task_work series from Oleg that fixes the bug that
this VFS commit tried to fix:
d35abdb288 hold task_lock around checks in keyctl
but solves the problem without the lockup regression that d35abdb288
introduced in v3.6.
This series came late in v3.6 and I did not feel confident about it so
late in the cycle. Might be worth backporting to -stable if it proves
itself upstream."
* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
task_work: Simplify the usage in ptrace_notify() and get_signal_to_deliver()
task_work: Revert "hold task_lock around checks in keyctl"
task_work: task_work_add() should not succeed after exit_task_work()
task_work: Make task_work_add() lockless
Make default just return 0. The current default (checking
TIF_POLLING_NRFLAG) is taken to architectures that need it;
ones that don't do polling in their idle threads don't need
to defined TIF_POLLING_NRFLAG at all.
ia64 defined both TS_POLLING (used by its tsk_is_polling())
and TIF_POLLING_NRFLAG (not used at all). Killed the latter...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Let architectures select GENERIC_KERNEL_THREAD and have their copy_thread()
treat NULL regs as "it came from kernel_thread(), sp argument contains
the function new thread will be calling and stack_size - the argument for
that function". Switching the architectures begins shortly...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
After the previous change is_swbp_at_addr() is always called with
current->mm. Remove this check and move it close to its single caller.
Also, remove the obsolete comment about is_swbp_at_addr() and
uprobe_state.count.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Unlike set_swbp(), set_orig_insn()->is_swbp_at_addr() makes sense,
although it can't prevent all confusions.
But the usage of is_swbp_at_addr() is equally confusing, and it adds
the extra get_user_pages() we can avoid.
This patch removes set_orig_insn()->is_swbp_at_addr() but changes
write_opcode() to do the necessary checks before replace_page().
Perhaps it also makes sense to ensure PAGE_MAPPING_ANON in unregister
case.
find_active_uprobe() becomes the only user of is_swbp_at_addr(),
we can change its semantics.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
No functional changes, preparations.
1. Extract the kmap-and-memcpy code from read_opcode() into the
new trivial helper, copy_opcode(). The next patch will add
another user.
2. read_opcode() becomes really trivial, fold it into its single
caller, is_swbp_at_addr().
3. Remove "auprobe" argument from write_opcode(), it is not used
since f403072c6.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
A separate patch for better documentation.
set_swbp()->is_swbp_at_addr() is not needed for correctness, it is
harmless to do the unnecessary __replace_page(old_page, new_page)
when these 2 pages are identical.
And it can not be counted as optimization. mmap/register races are
very unlikely, while in the likely case is_swbp_at_addr() adds the
extra get_user_pages() even if the caller is uprobe_mmap(current->mm)
and returns false.
Note also that the semantics/usage of is_swbp_at_addr() in uprobe.c
is confusing. set_swbp() uses it to detect the case when this insn
was already modified by uprobes, that is why it should always compare
the opcode with UPROBE_SWBP_INSN even if the hardware (like powerpc)
has other trap insns. It doesn't matter if this breakpoint was in fact
installed by gdb or application itself, we are going to "steal" this
breakpoint anyway and execute the original insn from vm_file even if
it no longer matches the memory.
OTOH, handle_swbp()->find_active_uprobe() uses is_swbp_at_addr() to
figure out whether we need to send SIGTRAP or not if we can not find
uprobe, so in this case it should return true for all trap variants,
not only for UPROBE_SWBP_INSN.
This patch removes set_swbp()->is_swbp_at_addr(), the next patches
will remove it from set_orig_insn() which is similar to set_swbp()
in this respect. So the only caller will be handle_swbp() and we
can make its semantics clear.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
valid_vma(false) ignores ->vm_flags, this is not actually right.
We should never try to write into MAP_SHARED mapping, this can
confuse an apllication which actually writes to ->vm_file.
With this patch valid_vma(false) ignores VM_WRITE only but checks
other (immutable) bits checked by valid_vma(true). This can also
speedup uprobe_munmap() and uprobe_unregister().
Note: even after this patch _unregister can confuse the probed
application if it does mprotect(PROT_WRITE) after _register and
installs "int3", but this is hardly possible to avoid and this
doesn't differ from gdb case.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
uprobe_register() or uprobe_mmap() requires VM_READ | VM_EXEC, this
is not right. An apllication can do mprotect(PROT_EXEC) later and
execute this code.
Change valid_vma(is_register => true) to check VM_MAYEXEC instead.
No need to check VM_MAYREAD, it is always set.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
write_opcode()->get_user_pages() needs FOLL_FORCE to ensure we can
read the page even if the probed task did mprotect(PROT_NONE) after
uprobe_register(). Without FOLL_WRITE, FOLL_FORCE doesn't have any
side effect but allows to read the !VM_READ memory.
Otherwiese the subsequent uprobe_unregister()->set_orig_insn() fails
and we leak "int3". If that task does mprotect(PROT_READ | EXEC) and
execute the probed insn later it will be killed.
Note: in fact this is also needed for _register, see the next patch.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Kill UTASK_BP_HIT state, it buys nothing but complicates the code.
It is only used in uprobe_notify_resume() to decide who should be
called, we can check utask->active_uprobe != NULL instead. And this
allows us to simplify handle_swbp(), no need to clear utask->state.
Likewise we could kill UTASK_SSTEP, but UTASK_BP_HIT is worse and
imho should die. The problem is, it creates the special case when
task->utask is NULL, we can't distinguish RUNNING and BP_HIT. With
this patch utask == NULL always means RUNNING.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
If handle_swbp()->add_utask() fails but UPROBE_SKIP_SSTEP is set,
cleanup_ret: path do not restart the insn, this is wrong. Remove
this check and add the additional label for can_skip_sstep() = T
case.
Note also that UPROBE_SKIP_SSTEP can be false positive, we simply
can not trust it unless arch_uprobe_skip_sstep() was already called.
Also, move another UPROBE_SKIP_SSTEP check before can_skip_sstep()
into this helper, this looks more clean and understandable.
Note: probably we should rename "skip" to "emulate" and I think
that "clear UPROBE_SKIP_SSTEP" should be moved to arch_can_skip.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
handle_swbp() sets utask->active_uprobe before handler_chain(),
and UTASK_SSTEP before pre_ssout(). This complicates the code
for no reason, arch_ hooks or consumer->handler() should not
(and can't) use this info.
Change handle_swbp() to initialize them after pre_ssout(), and
remove the no longer needed cleanup-utask code.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
cked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
If handle_swbp()->find_active_uprobe() fails we return with
utask->state = UTASK_BP_HIT.
Change handle_swbp() to reset utask->state at the start. Note
that we do this unconditionally, see the next patch(es).
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Conflicts:
drivers/net/team/team.c
drivers/net/usb/qmi_wwan.c
net/batman-adv/bat_iv_ogm.c
net/ipv4/fib_frontend.c
net/ipv4/route.c
net/l2tp/l2tp_netlink.c
The team, fib_frontend, route, and l2tp_netlink conflicts were simply
overlapping changes.
qmi_wwan and bat_iv_ogm were of the "use HEAD" variety.
With help from Antonio Quartulli.
Signed-off-by: David S. Miller <davem@davemloft.net>
Use generic steal operation on pipe buffer to allow stealing
ring buffer's read page from pipe buffer.
Note that this could reduce the performance of splice on the
splice_write side operation without affinity setting.
Since the ring buffer's read pages are allocated on the
tracing-node, but the splice user does not always execute
splice write side operation on the same node. In this case,
the page will be accessed from the another node.
Thus, it is strongly recommended to assign the splicing
thread to corresponding node.
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
The original module-init-tools module loader used a fnctl lock on the
.ko file to avoid attempts to simultaneously load a module.
Unfortunately, you can't get an exclusive fcntl lock on a read-only
fd, making this not work for read-only mounted filesystems.
module-init-tools has a hacky sleep-and-loop for this now.
It's not that hard to wait in the kernel, and only return -EEXIST once
the first module has finished loading (or continue loading the module
if the first one failed to initialize for some reason). It's also
consistent with what we do for dependent modules which are still loading.
Suggested-by: Lucas De Marchi <lucas.demarchi@profusion.mobi>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We use resolve_symbol_wait(), which blocks if the module containing
the symbol is still loading. However:
1) The module_wq we use is only woken after calling the modules' init
function, but there are other failure paths after the module is
placed in the linked list where we need to do the same thing.
2) wake_up() only wakes one waiter, and our waitqueue is shared by all
modules, so we need to wake them all.
3) wake_up_all() doesn't imply a memory barrier: I feel happier calling
it after we've grabbed and dropped the module_mutex, not just after
the state assignment.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Use the mapping of Elf_[SPE]hdr, Elf_Addr, Elf_Sym, Elf_Dyn, Elf_Rel/Rela,
ELF_R_TYPE() and ELF_R_SYM() to either the 32-bit version or the 64-bit version
into asm-generic/module.h for all arches bar MIPS.
Also, use the generic definition mod_arch_specific where possible.
To this end, I've defined three new config bools:
(*) HAVE_MOD_ARCH_SPECIFIC
Arches define this if they don't want to use the empty generic
mod_arch_specific struct.
(*) MODULES_USE_ELF_RELA
Arches define this if their modules can contain RELA records. This causes
the Elf_Rela mapping to be emitted and allows apply_relocate_add() to be
defined by the arch rather than have the core emit an error message.
(*) MODULES_USE_ELF_REL
Arches define this if their modules can contain REL records. This causes
the Elf_Rel mapping to be emitted and allows apply_relocate() to be
defined by the arch rather than have the core emit an error message.
Note that it is possible to allow both REL and RELA records: m68k and mips are
two arches that do this.
With this, some arch asm/module.h files can be deleted entirely and replaced
with a generic-y marker in the arch Kbuild file.
Additionally, I have removed the bits from m32r and score that handle the
unsupported type of relocation record as that's now handled centrally.
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Sam Ravnborg <sam@ravnborg.org>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Cloudlinux have a product called lve that includes a kernel module. This
was previously GPLed but is now under a proprietary license, but the
module continues to declare MODULE_LICENSE("GPL") and makes use of some
EXPORT_SYMBOL_GPL symbols. Forcibly taint it in order to avoid this.
Signed-off-by: Matthew Garrett <mjg59@srcf.ucam.org>
Cc: Alex Lyashkov <umka@cloudlinux.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: stable@kernel.org
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.18 (GNU/Linux)
iQEcBAABAgAGBQJQX7MuAAoJEHm+PkMAQRiG0h0IAJURkrMCAQUxA+Ik66ReH89s
LQcVd0U9uL4UUOi7f5WR64Vf9Cfu6VVGX9ZKSvjpNskvlQaUQPMIt4pMe6g4X4dI
u0bApEy4XZz3nGabUAghIU8jJ8cDmhCG6kPpSiS7pi7KHc0yIa4WFtJRrIpGaIWT
xuK38YOiOHcSDRlLyWZzainMncQp/ixJdxnqVMTonkVLk0q0b84XzOr4/qlLE5lU
i+TsK3PRKdQXgvZ4CebL+srPBwWX1dmgP3VkeBloQbSSenSeELICbFWavn2ml+sF
GXi4dO93oNquL/Oy5SwI666T4uNcrRPaS+5X+xSZgBW/y2aQVJVJuNZg6ZP/uWk=
=0v2l
-----END PGP SIGNATURE-----
Merge tag 'v3.6-rc7' into next
Linux 3.6-rc7
Requested by David Howells so he can merge his key susbsystem work into
my tree with requisite -linus changesets.
descriptor-related parts of daemonize, done right. As the
result we simplify the locking rules for ->files - we
hold task_lock in *all* cases when we modify ->files.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
This command disables NMI-entry. If NMI source has been previously shared
with a serial console ("debug port"), this effectively releases the port
from KDB exclusive use, and makes the console available for normal use.
Of course, NMI can be reenabled, enable_nmi modparam is used for that:
echo 1 > /sys/module/kdb/parameters/enable_nmi
Signed-off-by: Anton Vorontsov <anton.vorontsov@linaro.org>
Acked-by: Jason Wessel <jason.wessel@windriver.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The new arch callback should manage NMIs that usually cause KGDB to
enter. That is, not all NMIs should be enabled/disabled, but only
those that issue kgdb_handle_exception().
We must mask it as serial-line interrupt can be used as an NMI, so
if the original KGDB-entry cause was say a breakpoint, then every
input to KDB console will cause KGDB to reenter, which we don't want.
Signed-off-by: Anton Vorontsov <anton.vorontsov@linaro.org>
Acked-by: Jason Wessel <jason.wessel@windriver.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Checking "user" before "is_idle_task()" allows better optimizations
in cases where inlining is possible. Also, "bool" should be passed
"true" or "false" rather than "1" or "0". This commit therefore makes
these changes, as noted in Josh's review.
Reported-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Provide a config option that enables the userspace
RCU extended quiescent state on every CPUs by default.
This is for testing purpose.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Alessio Igor Bogani <abogani@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Avi Kivity <avi@redhat.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Kevin Hilman <khilman@ti.com>
Cc: Max Krasnyansky <maxk@qualcomm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
When exceptions or irq are about to resume userspace, if
the task needs to be rescheduled, the arch low level code
calls schedule() directly.
If we call it, it is because we have the TIF_RESCHED flag:
- It can be set after random local calls to set_need_resched()
(RCU, drm, ...)
- A wake up happened and the CPU needs preemption. This can
happen in several ways:
* Remotely: the remote waking CPU has set TIF_RESCHED and send the
wakee an IPI to schedule the new task.
* Remotely enqueued: the remote waking CPU sends an IPI to the target
and the wake up is made by the target.
* Locally: waking CPU == wakee CPU and the wakeup is done locally.
set_need_resched() is called without IPI.
In the case of local and remotely enqueued wake ups, the tick can
be restarted when we enqueue the new task and RCU can exit the
extended quiescent state at the same time. Then by the time we reach
irq exit path and we call schedule, we are not in RCU user mode.
But if we call schedule() only because something called set_need_resched(),
RCU may still be in user mode when we reach schedule.
Also if a wake up is done remotely, the CPU might see the TIF_RESCHED
flag and call schedule while the IPI has not yet happen to restart the
tick and exit RCU user mode.
We need to manually protect against these corner cases.
Create a new API schedule_user() that calls schedule() inside
rcu_user_exit()-rcu_user_enter() in order to protect it. Archs
will need to rely on it now to implement user preemption safely.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Alessio Igor Bogani <abogani@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Avi Kivity <avi@redhat.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Kevin Hilman <khilman@ti.com>
Cc: Max Krasnyansky <maxk@qualcomm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
When an exception or an irq exits, and we are going to resume into
interrupted kernel code, the low level architecture code calls
preempt_schedule_irq() if there is a need to reschedule.
If the interrupt/exception occured between a call to rcu_user_enter()
(from syscall exit, exception exit, do_notify_resume exit, ...) and
a real resume to userspace (iret,...), preempt_schedule_irq() can be
called whereas RCU thinks we are in userspace. But preempt_schedule_irq()
is going to run kernel code and may be some RCU read side critical
section. We must exit the userspace extended quiescent state before
we call it.
To solve this, just call rcu_user_exit() in the beginning of
preempt_schedule_irq().
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Alessio Igor Bogani <abogani@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Avi Kivity <avi@redhat.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Kevin Hilman <khilman@ti.com>
Cc: Max Krasnyansky <maxk@qualcomm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Clear the syscalls hook of a task when it's scheduled out so that if
the task migrates, it doesn't run the syscall slow path on a CPU
that might not need it.
Also set the syscalls hook on the next task if needed.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Alessio Igor Bogani <abogani@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Avi Kivity <avi@redhat.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Kevin Hilman <khilman@ti.com>
Cc: Max Krasnyansky <maxk@qualcomm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
By default we don't want to enter into RCU extended quiescent
state while in userspace because doing this produces some overhead
(eg: use of syscall slowpath). Set it off by default and ready to
run when some feature like adaptive tickless need it.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Alessio Igor Bogani <abogani@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Avi Kivity <avi@redhat.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Kevin Hilman <khilman@ti.com>
Cc: Max Krasnyansky <maxk@qualcomm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Allow calls to rcu_user_enter() even if we are already
in userspace (as seen by RCU) and allow calls to rcu_user_exit()
even if we are already in the kernel.
This makes the APIs more flexible to be called from architectures.
Exception entries for example won't need to know if they come from
userspace before calling rcu_user_exit().
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Alessio Igor Bogani <abogani@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Avi Kivity <avi@redhat.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Kevin Hilman <khilman@ti.com>
Cc: Max Krasnyansky <maxk@qualcomm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Create a new config option under the RCU menu that put
CPUs under RCU extended quiescent state (as in dynticks
idle mode) when they run in userspace. This require
some contribution from architectures to hook into kernel
and userspace boundaries.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Alessio Igor Bogani <abogani@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Avi Kivity <avi@redhat.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Kevin Hilman <khilman@ti.com>
Cc: Max Krasnyansky <maxk@qualcomm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
The current implementation of RCU_FAST_NO_HZ tries reasonably hard to rid
the current CPU of RCU callbacks. This is appropriate when the CPU is
entering idle, where it doesn't have much useful to do anyway, but is most
definitely not what you want when transitioning to user-mode execution.
This commit therefore detects the adaptive-tick case, and refrains from
burning CPU time getting rid of RCU callbacks in that case.
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
In some cases, it is necessary to enter or exit userspace-RCU-idle mode
from an interrupt handler, for example, if some other CPU sends this
CPU a resched IPI. In this case, the current CPU would enter the IPI
handler in userspace-RCU-idle mode, but would need to exit the IPI handler
after having exited that mode.
To allow this to work, this commit adds two new APIs to TREE_RCU:
- rcu_user_enter_after_irq(). This must be called from an interrupt between
rcu_irq_enter() and rcu_irq_exit(). After the irq calls rcu_irq_exit(),
the irq handler will return into an RCU extended quiescent state.
In theory, this interrupt is never a nested interrupt, but in practice
it might interrupt softirq, which looks to RCU like a nested interrupt.
- rcu_user_exit_after_irq(). This must be called from a non-nesting
interrupt, interrupting an RCU extended quiescent state, also
between rcu_irq_enter() and rcu_irq_exit(). After the irq calls
rcu_irq_exit(), the irq handler will return in an RCU non-quiescent
state.
[ Combined with "Allow calls to rcu_exit_user_irq from nesting irqs." ]
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
RCU currently insists that only idle tasks can enter RCU idle mode, which
prohibits an adaptive tickless kernel (AKA nohz cpusets), which in turn
would mean that usermode execution would always take scheduling-clock
interrupts, even when there is only one task runnable on the CPU in
question.
This commit therefore adds rcu_user_enter() and rcu_user_exit(), which
allow non-idle tasks to enter RCU idle mode. These are quite similar
to rcu_idle_enter() and rcu_idle_exit(), respectively, except that they
omit the idle-task checks.
[ Updated to use "user" flag rather than separate check functions. ]
[ paulmck: Updated to drop exports of new functions based on Josh's patch
getting rid of the need for them. ]
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Alessio Igor Bogani <abogani@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Avi Kivity <avi@redhat.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@ti.com>
Cc: Max Krasnyansky <maxk@qualcomm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
The conflicts between kernel/rcutree.h and kernel/rcutree_plugin.h
were due to adjacent insertions and deletions, which were resolved
by simply accepting the changes on both branches.
Move the code that finds out to which context we account the
cputime into generic layer.
Archs that consider the whole time spent in the idle task as idle
time (ia64, powerpc) can rely on the generic vtime_account()
and implement vtime_account_system() and vtime_account_idle(),
letting the generic code to decide when to call which API.
Archs that have their own meaning of idle time, such as s390
that only considers the time spent in CPU low power mode as idle
time, can just override vtime_account().
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Use a naming based on vtime as a prefix for virtual based
cputime accounting APIs:
- account_system_vtime() -> vtime_account()
- account_switch_vtime() -> vtime_task_switch()
It makes it easier to allow for further declension such
as vtime_account_system(), vtime_account_idle(), ... if we
want to find out the context we account to from generic code.
This also make it better to know on which subsystem these APIs
refer to.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
bigrt.2012.09.23a contains additional commits to reduce scheduling latency
from RCU on huge systems (many hundrends or thousands of CPUs).
doctorture.2012.09.23a contains documentation changes and rcutorture fixes.
fixes.2012.09.23a contains miscellaneous fixes.
hotplug.2012.09.23a contains CPU-hotplug-related changes.
idle.2012.09.23a fixes architectures for which RCU no longer considered
the idle loop to be a quiescent state due to earlier
adaptive-dynticks changes. Affected architectures are alpha,
cris, frv, h8300, m32r, m68k, mn10300, parisc, score, xtensa,
and ia64.
We currently use a per socket order-0 page cache for tcp_sendmsg()
operations.
This page is used to build fragments for skbs.
Its done to increase probability of coalescing small write() into
single segments in skbs still in write queue (not yet sent)
But it wastes a lot of memory for applications handling many mostly
idle sockets, since each socket holds one page in sk->sk_sndmsg_page
Its also quite inefficient to build TSO 64KB packets, because we need
about 16 pages per skb on arches where PAGE_SIZE = 4096, so we hit
page allocator more than wanted.
This patch adds a per task frag allocator and uses bigger pages,
if available. An automatic fallback is done in case of memory pressure.
(up to 32768 bytes per frag, thats order-3 pages on x86)
This increases TCP stream performance by 20% on loopback device,
but also benefits on other network devices, since 8x less frags are
mapped on transmit and unmapped on tx completion. Alexander Duyck
mentioned a probable performance win on systems with IOMMU enabled.
Its possible some SG enabled hardware cant cope with bigger fragments,
but their ndo_start_xmit() should already handle this, splitting a
fragment in sub fragments, since some arches have PAGE_SIZE=65536
Successfully tested on various ethernet devices.
(ixgbe, igb, bnx2x, tg3, mellanox mlx4)
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Ben Hutchings <bhutchings@solarflare.com>
Cc: Vijay Subramanian <subramanian.vijay@gmail.com>
Cc: Alexander Duyck <alexander.h.duyck@intel.com>
Tested-by: Vijay Subramanian <subramanian.vijay@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch splits trace event initialization in two stages:
* ftrace enable
* sysfs event entry creation
This allows to capture trace events from an earlier point
by using 'trace_event' kernel parameter and is important
to trace boot-up allocations.
Note that, in order to enable events at core_initcall,
it's necessary to move init_ftrace_syscalls() from
core_initcall to early_initcall.
Link: http://lkml.kernel.org/r/1347461277-25302-1-git-send-email-elezegarcia@gmail.com
Signed-off-by: Ezequiel Garcia <elezegarcia@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
In our application, we have trace markers spread through user-space.
We have markers in GL, X, etc. These are super handy for Chrome's
about:tracing feature (Chrome + system + kernel trace view), but
can be very distracting when you're trying to debug a kernel issue.
I normally, use "grep -v tracing_mark_write" but it would be nice
if I could just temporarily disable markers all together.
Link: http://lkml.kernel.org/r/1347066739-26285-1-git-send-email-msb@chromium.org
CC: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Mandeep Singh Baines <msb@chromium.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
We only do rounding to the next nanosecond so we don't see minor
1ns inconsistencies in the vsyscall implementations. Since we're
changing the vsyscall implementations to avoid this, conditionalize
the rounding only to the GENERIC_TIME_VSYSCALL_OLD architectures.
Cc: Tony Luck <tony.luck@intel.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Now that we moved everyone over to GENERIC_TIME_VSYSCALL_OLD,
introduce the new declaration and config option for the new
update_vsyscall method.
Cc: Tony Luck <tony.luck@intel.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <john.stultz@linaro.org>
To help migrate archtectures over to the new update_vsyscall method,
redfine CONFIG_GENERIC_TIME_VSYSCALL as CONFIG_GENERIC_TIME_VSYSCALL_OLD
Cc: Tony Luck <tony.luck@intel.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Since users will need to include timekeeper_internal.h, move
update_vsyscall definitions to timekeeper_internal.h.
Cc: Tony Luck <tony.luck@intel.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <john.stultz@linaro.org>
We're going to need to access the timekeeper in update_vsyscall,
so make the structure available for those who need it.
Cc: Tony Luck <tony.luck@intel.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <john.stultz@linaro.org>
CLOCK_TICK_RATE is used to accurately caclulate exactly how
a tick will be at a given HZ.
This is useful, because while we'd expect NSEC_PER_SEC/HZ,
the underlying hardware will have some granularity limit,
so we won't be able to have exactly HZ ticks per second.
This slight error can cause timekeeping quality problems
when using the jiffies or other jiffies driven clocksources.
Thus we currently use compile time CLOCK_TICK_RATE value to
generate SHIFTED_HZ and NSEC_PER_JIFFIES, which we then use
to adjust the jiffies clocksource to correct this error.
Unfortunately though, since CLOCK_TICK_RATE is a compile
time value, and the jiffies clocksource is registered very
early during boot, there are a number of cases where there
are different possible hardware timers that have different
tick rates. This causes problems in cases like ARM where
there are numerous different types of hardware, each having
their own compile-time CLOCK_TICK_RATE, making it hard to
accurately support different hardware with a single kernel.
For the most part, this doesn't matter all that much, as not
too many systems actually utilize the jiffies or jiffies driven
clocksource. Usually there are other highres clocksources
who's granularity error is negligable.
Even so, we have some complicated calcualtions that we do
everywhere to handle these edge cases.
This patch removes the compile time SHIFTED_HZ value, and
introduces a register_refined_jiffies() function. This results
in the default jiffies clock as being assumed a perfect HZ
freq, and allows archtectures that care about jiffies accuracy
to call register_refined_jiffies() with the tick rate, specified
dynamically at boot.
This allows us, where necessary, to not have a compile time
CLOCK_TICK_RATE constant, simplifies the jiffies code, and
still provides a way to have an accurate jiffies clock.
NOTE: Since this patch does not add register_refinied_jiffies()
calls for every arch, it may cause time quality regressions
in some cases. Its likely these will not be noticable, but
if they are an issue, adding the following to the end of
setup_arch() should resolve the regression:
register_refinied_jiffies(CLOCK_TICK_RATE)
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Now that alarmtimer_remove has been simplified, change
its name to _dequeue to better match its paired _enqueue
function.
Cc: Arve Hjønnevåg <arve@android.com>
Cc: Colin Cross <ccross@android.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Arve Hjønnevåg reported numerous crashes from the
"BUG_ON(timer->state != HRTIMER_STATE_CALLBACK)" check
in __run_hrtimer after it called alarmtimer_fired.
It ends up the alarmtimer code was not properly handling
possible failures of hrtimer_try_to_cancel, and because
these faulres occur when the underlying base hrtimer is
being run, this limits the ability to properly handle
modifications to any alarmtimers on that base.
Because much of the logic duplicates the hrtimer logic,
it seems that we might as well have a per-alarmtimer
hrtimer, and avoid the extra complextity of trying to
multiplex many alarmtimers off of one hrtimer.
Thus this patch moves the hrtimer to the alarm structure
and simplifies the management logic.
Changelog:
v2:
* Includes a fix for double alarm_start calls found by
Arve
Cc: Arve Hjønnevåg <arve@android.com>
Cc: Colin Cross <ccross@android.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Reported-by: Arve Hjønnevåg <arve@android.com>
Tested-by: Arve Hjønnevåg <arve@android.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
alarmtimer suspend return -EBUSY if the next alarm will fire in less
than 2 seconds. This allows one RTC seconds tick to occur subsequent
to this check before the alarm wakeup time is set, ensuring the wakeup
time is still in the future (assuming the RTC does not tick one more
second prior to setting the alarm).
If suspend is rejected due to an imminent alarm, hold a wakeup source
for 2 seconds to process the alarm prior to reattempting suspend.
If setting the alarm incurs an -ETIME for an alarm set in the past,
or any other problem setting the alarm, abort suspend and hold a
wakelock for 1 second while the alarm is allowed to be serviced or
other hopefully transient conditions preventing the alarm clear up.
Signed-off-by: Todd Poynor <toddpoynor@google.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Rabik and Paul reported two different issues related to the same few
lines of code.
Rabik's issue is that the nr_uninterruptible migration code is wrong in
that he sees artifacts due to this (Rabik please do expand in more
detail).
Paul's issue is that this code as it stands relies on us using
stop_machine() for unplug, we all would like to remove this assumption
so that eventually we can remove this stop_machine() usage altogether.
The only reason we'd have to migrate nr_uninterruptible is so that we
could use for_each_online_cpu() loops in favour of
for_each_possible_cpu() loops, however since nr_uninterruptible() is the
only such loop and its using possible lets not bother at all.
The problem Rabik sees is (probably) caused by the fact that by
migrating nr_uninterruptible we screw rq->calc_load_active for both rqs
involved.
So don't bother with fancy migration schemes (meaning we now have to
keep using for_each_possible_cpu()) and instead fold any nr_active delta
after we migrate all tasks away to make sure we don't have any skewed
nr_active accounting.
[ paulmck: Move call to calc_load_migration to CPU_DEAD to avoid
miscounting noted by Rakib. ]
Reported-by: Rakib Mullick <rakib.mullick@gmail.com>
Reported-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Posting a callback after the CPU_DEAD notifier effectively leaks
that callback unless/until that CPU comes back online. Silence is
unhelpful when attempting to track down such leaks, so this commit emits
a WARN_ON_ONCE() and unconditionally leaks the callback when an offline
CPU attempts to register a callback. The rdp->nxttail[RCU_NEXT_TAIL] is
set to NULL in the CPU_DEAD notifier and restored in the CPU_UP_PREPARE
notifier, allowing _call_rcu() to determine exactly when posting callbacks
is illegal.
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Currently, _rcu_barrier() relies on preempt_disable() to prevent
any CPU from going offline, which in turn depends on CPU hotplug's
use of __stop_machine().
This patch therefore makes _rcu_barrier() use get_online_cpus() to
block CPU-hotplug operations. This has the added benefit of removing
the need for _rcu_barrier() to adopt callbacks: Because CPU-hotplug
operations are excluded, there can be no callbacks to adopt. This
commit simplifies the code accordingly.
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
The print_cpu_stall_fast_no_hz() function attempts to print -1 when
the ->idle_gp_timer is not pending, but unsigned arithmetic causes it
to instead print ULONG_MAX, which is 4294967295 on 32-bit systems and
18446744073709551615 on 64-bit systems. Neither of these are the most
reader-friendly values, so this commit instead causes "timer not pending"
to be printed when ->idle_gp_timer is not pending.
Reported-by: Paul Walmsley <paul@pwsan.com>
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
TINY_RCU's rcu_idle_enter_common() invokes rcu_sched_qs() in order
to inform the RCU core of the quiescent state implied by idle entry.
Of course, idle is also an extended quiescent state, so that the call
to rcu_sched_qs() speeds up RCU's invoking of any callbacks that might
be queued. This speed-up is important when entering into dyntick-idle
mode -- if there are no further scheduling-clock interrupts, the callbacks
might never be invoked, which could result in a system hang.
However, processing callbacks does event tracing, which in turn
implies RCU read-side critical sections, which are illegal in extended
quiescent states. This patch therefore moves the call to rcu_sched_qs()
so that it precedes the point at which we inform lockdep that RCU has
entered an extended quiescent state.
Signed-off-by: Li Zhong <zhong@linux.vnet.ibm.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The can_stop_idle_tick() function complains if a softirq vector is
raised too late in the idle-entry process, presumably in order to
prevent dangling softirq invocations from being delayed across the
full idle period, which might be indefinitely long -- and if softirq
was asserted any later than the call to this function, such a delay
might well happen.
However, RCU needs to be able to use softirq to stop idle entry in
order to be able to drain RCU callbacks from the current CPU, which in
turn enables faster entry into dyntick-idle mode, which in turn reduces
power consumption. Because RCU takes this action at a well-defined
point in the idle-entry path, it is safe for RCU to take this approach.
This commit therefore silences the error message that is sometimes
produced when the going-idle CPU suddenly finds that it has an RCU_SOFTIRQ
to process. The error message will continue to be issued for other
softirq vectors.
Reported-by: Sedat Dilek <sedat.dilek@gmail.com>
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Sedat Dilek <sedat.dilek@gmail.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
The use of raw_local_irq_save() is unnecessary, given that local_irq_save()
really does disable interrupts. Also, it appears to interfere with lockdep.
Therefore, this commit moves to local_irq_save().
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Fengguang Wu <fengguang.wu@intel.com>
The first memory barrier in __call_rcu() is supposed to order any
updates done beforehand by the caller against the actual queuing
of the callback. However, the second memory barrier (which is intended
to order incrementing the queue lengths before queuing the callback)
is also between the caller's updates and the queuing of the callback.
The second memory barrier can therefore serve both purposes.
This commit therefore removes the first memory barrier.
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
If a given CPU avoids the idle loop but also avoids starting a new
RCU grace period for a full minute, RCU can issue spurious RCU CPU
stall warnings. This commit fixes this issue by adding a check for
ongoing grace period to avoid these spurious stall warnings.
Reported-by: Becky Bruce <bgillbruce@gmail.com>
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
The print_other_cpu_stall() function accesses a number of rcu_node
fields without protection from the ->lock. In theory, this is not
a problem because the fields accessed are all integers, but in
practice the compiler can get nasty. Therefore, the commit extends
the existing critical section to cover the entire loop body.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The rcu_print_detail_task_stall_rnp() function invokes
rcu_preempt_blocked_readers_cgp() to verify that there are some preempted
RCU readers blocking the current grace period outside of the protection
of the rcu_node structure's ->lock. This means that the last blocked
reader might exit its RCU read-side critical section and remove itself
from the ->blkd_tasks list before the ->lock is acquired, resulting in
a segmentation fault when the subsequent code attempts to dereference
the now-NULL gp_tasks pointer.
This commit therefore moves the test under the lock. This will not
have measurable effect on lock contention because this code is invoked
only when printing RCU CPU stall warnings, in other words, in the common
case, never.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The increment_cpu_stall_ticks() function listed each RCU flavor
explicitly, with an ifdef to handle preemptible RCU. This commit
therefore applies for_each_rcu_flavor() to save a line of code.
Because this commit switches from a code-based enumeration of the
flavors of RCU to an rcu_state-list-based enumeration, it is no longer
possible to apply __get_cpu_var() to the per-CPU rcu_data structures.
We instead use __this_cpu_var() on the rcu_state structure's ->rda field
that references the corresponding rcu_data structures.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Commit 1217ed1b (rcu: permit rcu_read_unlock() to be called while holding
runqueue locks) made rcu_initiate_boost() restore irq state when releasing
the rcu_node structure's ->lock, but failed to update the header comment
accordingly. This commit therefore brings the header comment up to date.
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
The rcu_implicit_offline_qs() function implicitly assumed that execution
would progress predictably when interrupts are disabled, which is of course
not guaranteed when running on a hypervisor. Furthermore, this function
is short, and is called from one place only in a short function.
This commit therefore ensures that the timing is checked before
checking the condition, which guarantees correct behavior even given
indefinite delays. It also inlines rcu_implicit_offline_qs() into
rcu_implicit_dynticks_qs().
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
The rcu_preempt_offline_tasks() moves all tasks queued on a given leaf
rcu_node structure to the root rcu_node, which is done when the last CPU
corresponding the the leaf rcu_node structure goes offline. Now that
RCU-preempt's synchronize_rcu_expedited() implementation blocks CPU-hotplug
operations during the initialization of each rcu_node structure's
->boost_tasks pointer, rcu_preempt_offline_tasks() can do a better job
of setting the root rcu_node's ->boost_tasks pointer.
The key point is that rcu_preempt_offline_tasks() runs as part of the
CPU-hotplug process, so that a concurrent synchronize_rcu_expedited()
is guaranteed to either have not started on the one hand (in which case
there is no boosting on behalf of the expedited grace period) or to be
completely initialized on the other (in which case, in the absence of
other priority boosting, all ->boost_tasks pointers will be initialized).
Therefore, if rcu_preempt_offline_tasks() finds that the ->boost_tasks
pointer is equal to the ->exp_tasks pointer, it can be sure that it is
correctly placed.
In the case where there was boosting ongoing at the time that the
synchronize_rcu_expedited() function started, different nodes might start
boosting the tasks blocking the expedited grace period at different times.
In this mixed case, the root node will either be boosting tasks for
the expedited grace period already, or it will start as soon as it gets
done boosting for the normal grace period -- but in this latter case,
the root node's tasks needed to be boosted in any case.
This commit therefore adds a check of the ->boost_tasks pointer against
the ->exp_tasks pointer to the list that prevents updating ->boost_tasks.
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
There is a need to use RCU from interrupt context, but either before
rcu_irq_enter() is called or after rcu_irq_exit() is called. If the
interrupt occurs from idle, then lockdep-RCU will complain about such
uses, as they appear to be illegal uses of RCU from the idle loop.
In other environments, RCU_NONIDLE() could be used to properly protect
the use of RCU, but RCU_NONIDLE() currently cannot be invoked except
from process context.
This commit therefore modifies RCU_NONIDLE() to permit its use more
globally.
Reported-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
When rcu_preempt_offline_tasks() clears tasks from a leaf rcu_node
structure, it does not NULL out the structure's ->boost_tasks field.
This commit therefore fixes this issue.
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Because TINY_RCU's idle detection keys directly off of the nesting
level, rather than from a separate variable as in TREE_RCU, the
TINY_RCU dyntick-idle tracing on transition to idle must happen
before the change to the nesting level. This commit therefore makes
this change by passing the desired new value (rather than the old value)
of the nesting level in to rcu_idle_enter_common().
[ paulmck: Add fix for wrong-variable bug spotted by
Michael Wang <wangyun@linux.vnet.ibm.com>. ]
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
There have been some recent bugs that were triggered only when
preemptible RCU's __rcu_read_unlock() was preempted just after setting
->rcu_read_lock_nesting to INT_MIN, which is a low-probability event.
Therefore, reproducing those bugs (to say nothing of gaining confidence
in alleged fixes) was quite difficult. This commit therefore creates
a new debug-only RCU kernel config option that forces a short delay
in __rcu_read_unlock() to increase the probability of those sorts of
bugs occurring.
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
When you do something like "t = kthread_run(...)", it is possible that
the kthread will start running before the assignment to "t" happens.
If the child kthread expects to find a pointer to its task_struct in "t",
it will then be fatally disappointed. This commit therefore switches
such cases to kthread_create() followed by wake_up_process(), guaranteeing
that the assignment happens before the child kthread starts running.
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Drop a few characters by switching kernel/rcutorture.c from
"printk(KERN_ALERT" to "pr_alert(".
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Many rcutorture runs include CPU-hotplug operations in their stress
testing. This commit accumulates statistics on the durations of these
operations in deference to the recent concern about the overhead and
latency of these operations.
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
A number of new features have been added to rcutorture over the years, but
the defaults have not been updated to include them. This commit therefore
turns on a couple of them that have proven helpful and trustworthy, namely
periodic progress reports and testing of NO_HZ.
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Currently, rcu_init_geometry() only reshapes RCU's combining trees
if the leaf fanout is changed at boot time. This means that by
default, kernels compiled with (say) NR_CPUS=4096 will keep oversized
data structures, even when running on systems with (say) four CPUs.
This commit therefore checks to see if the maximum number of CPUs on
the actual running system (nr_cpu_ids) differs from NR_CPUS, and if so
reshapes the combining trees accordingly.
Reported-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
If CONFIG_RCU_FANOUT_EXACT=y, if there are not enough CPUs (according
to nr_cpu_ids) to require more than a single rcu_node structure, but if
NR_CPUS is larger than would fit into a single rcu_node structure, then
the current rcu_init_levelspread() code is subject to integer overflow
in the eight-bit ->levelspread[] array in the rcu_state structure.
In this case, the solution is -not- to increase the size of the
elements in this array because the values in that array should be
constrained to the number of bits in an unsigned long. Instead, this
commit replaces NR_CPUS with nr_cpu_ids in the rcu_init_levelspread()
function's initialization of the cprv local variable. This results in
all of the arithmetic being consistently based off of the nr_cpu_ids
value, thus avoiding the overflow, which was caused by the mixing of
nr_cpu_ids and NR_CPUS.
Reported-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The current quiescent-state detection algorithm is needlessly
complex. It records the grace-period number corresponding to
the quiescent state at the time of the quiescent state, which
works, but it seems better to simply erase any record of previous
quiescent states at the time that the CPU notices the new grace
period. This has the further advantage of removing another piece
of RCU for which lockless reasoning is required.
Therefore, this commit makes this change.
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
The synchronize_rcu_expedited() function disables interrupts across a
scan of all leaf rcu_node structures, which is not good for real-time
scheduling latency on large systems (hundreds or especially thousands
of CPUs). This commit therefore holds off CPU-hotplug operations using
get_online_cpus(), and removes the prior acquisiion of the ->onofflock
(which required disabling interrupts).
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
In the C language, signed overflow is undefined. It is true that
twos-complement arithmetic normally comes to the rescue, but if the
compiler can subvert this any time it has any information about the values
being compared. For example, given "if (a - b > 0)", if the compiler
has enough information to realize that (for example) the value of "a"
is positive and that of "b" is negative, the compiler is within its
rights to optimize to a simple "if (1)", which might not be what you want.
This commit therefore converts synchronize_rcu_expedited()'s work-done
detection counter from signed to unsigned.
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Now that the rcu_node structures' ->completed fields are unconditionally
assigned at grace-period cleanup time, they should already have the
correct value for the new grace period at grace-period initialization
time. This commit therefore inserts a WARN_ON_ONCE() to verify this
invariant.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Preemption greatly raised the probability of certain types of race
conditions, so this commit adds an anti-heisenbug to greatly increase
the collision cross section, also known as the probability of occurrence.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
The current approach to grace-period initialization is vulnerable to
extremely low-probability races. These races stem from the fact that
the old grace period is marked completed on the same traversal through
the rcu_node structure that is marking the start of the new grace period.
This means that some rcu_node structures will believe that the old grace
period is still in effect at the same time that other rcu_node structures
believe that the new grace period has already started.
These sorts of disagreements can result in too-short grace periods,
as shown in the following scenario:
1. CPU 0 completes a grace period, but needs an additional
grace period, so starts initializing one, initializing all
the non-leaf rcu_node structures and the first leaf rcu_node
structure. Because CPU 0 is both completing the old grace
period and starting a new one, it marks the completion of
the old grace period and the start of the new grace period
in a single traversal of the rcu_node structures.
Therefore, CPUs corresponding to the first rcu_node structure
can become aware that the prior grace period has completed, but
CPUs corresponding to the other rcu_node structures will see
this same prior grace period as still being in progress.
2. CPU 1 passes through a quiescent state, and therefore informs
the RCU core. Because its leaf rcu_node structure has already
been initialized, this CPU's quiescent state is applied to the
new (and only partially initialized) grace period.
3. CPU 1 enters an RCU read-side critical section and acquires
a reference to data item A. Note that this CPU believes that
its critical section started after the beginning of the new
grace period, and therefore will not block this new grace period.
4. CPU 16 exits dyntick-idle mode. Because it was in dyntick-idle
mode, other CPUs informed the RCU core of its extended quiescent
state for the past several grace periods. This means that CPU 16
is not yet aware that these past grace periods have ended. Assume
that CPU 16 corresponds to the second leaf rcu_node structure --
which has not yet been made aware of the new grace period.
5. CPU 16 removes data item A from its enclosing data structure
and passes it to call_rcu(), which queues a callback in the
RCU_NEXT_TAIL segment of the callback queue.
6. CPU 16 enters the RCU core, possibly because it has taken a
scheduling-clock interrupt, or alternatively because it has
more than 10,000 callbacks queued. It notes that the second
most recent grace period has completed (recall that because it
corresponds to the second as-yet-uninitialized rcu_node structure,
it cannot yet become aware that the most recent grace period has
completed), and therefore advances its callbacks. The callback
for data item A is therefore in the RCU_NEXT_READY_TAIL segment
of the callback queue.
7. CPU 0 completes initialization of the remaining leaf rcu_node
structures for the new grace period, including the structure
corresponding to CPU 16.
8. CPU 16 again enters the RCU core, again, possibly because it has
taken a scheduling-clock interrupt, or alternatively because
it now has more than 10,000 callbacks queued. It notes that
the most recent grace period has ended, and therefore advances
its callbacks. The callback for data item A is therefore in
the RCU_DONE_TAIL segment of the callback queue.
9. All CPUs other than CPU 1 pass through quiescent states. Because
CPU 1 already passed through its quiescent state, the new grace
period completes. Note that CPU 1 is still in its RCU read-side
critical section, still referencing data item A.
10. Suppose that CPU 2 wais the last CPU to pass through a quiescent
state for the new grace period, and suppose further that CPU 2
did not have any callbacks queued, therefore not needing an
additional grace period. CPU 2 therefore traverses all of the
rcu_node structures, marking the new grace period as completed,
but does not initialize a new grace period.
11. CPU 16 yet again enters the RCU core, yet again possibly because
it has taken a scheduling-clock interrupt, or alternatively
because it now has more than 10,000 callbacks queued. It notes
that the new grace period has ended, and therefore advances
its callbacks. The callback for data item A is therefore in
the RCU_DONE_TAIL segment of the callback queue. This means
that this callback is now considered ready to be invoked.
12. CPU 16 invokes the callback, freeing data item A while CPU 1
is still referencing it.
This scenario represents a day-zero bug for TREE_RCU. This commit
therefore ensures that the old grace period is marked completed in
all leaf rcu_node structures before a new grace period is marked
started in any of them.
That said, it would have been insanely difficult to force this race to
happen before the grace-period initialization process was preemptible.
Therefore, this commit is not a candidate for -stable.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Conflicts:
kernel/rcutree.c
The module parameters blimit, qhimark, and qlomark (and more
recently, rcu_fanout_leaf) have permission masks of zero, so
that their values are not visible from sysfs. This is unnecessary
and inconvenient to administrators who might like an easy way to
see what these values are on a running system. This commit therefore
sets their permission masks to 0444, allowing them to be read but
not written.
Reported-by: Rusty Russell <rusty@ozlabs.org>
Reported-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Although almost everyone is well-served by the defaults, some uses of RCU
benefit from shorter grace periods, while others benefit more from the
greater efficiency provided by longer grace periods. Situations requiring
a large number of grace periods to elapse (and wireshark startup has
been called out as an example of this) are helped by lower-latency
grace periods. Furthermore, in some embedded applications, people are
willing to accept a small degradation in update efficiency (due to there
being more of the shorter grace-period operations) in order to gain the
lower latency.
In contrast, those few systems with thousands of CPUs need longer grace
periods because the CPU overhead of a grace period rises roughly
linearly with the number of CPUs. Such systems normally do not make
much use of facilities that require large numbers of grace periods to
elapse, so this is a good tradeoff.
Therefore, this commit allows the durations to be controlled from sysfs.
There are two sysfs parameters, one named "jiffies_till_first_fqs" that
specifies the delay in jiffies from the end of grace-period initialization
until the first attempt to force quiescent states, and the other named
"jiffies_till_next_fqs" that specifies the delay (again in jiffies)
between subsequent attempts to force quiescent states. They both default
to three jiffies, which is compatible with the old hard-coded behavior.
At some future time, it may be possible to automatically increase the
grace-period length with the number of CPUs, but we do not yet have
sufficient data to do a good job. Preliminary data indicates that we
should add an addiitonal jiffy to each of the delays for every 200 CPUs
in the system, but more experimentation is needed. For now, the number
of systems with more than 1,000 CPUs is small enough that this can be
relegated to boot-time hand tuning.
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Large systems running RCU_FAST_NO_HZ kernels see extreme memory
contention on the rcu_state structure's ->fqslock field. This
can be avoided by disabling RCU_FAST_NO_HZ, either at compile time
or at boot time (via the nohz kernel boot parameter), but large
systems will no doubt become sensitive to energy consumption.
This commit therefore uses a combining-tree approach to spread the
memory contention across new cache lines in the leaf rcu_node structures.
This can be thought of as a tournament lock that has only a try-lock
acquisition primitive.
The effect on small systems is minimal, because such systems have
an rcu_node "tree" consisting of a single node. In addition, this
functionality is not used on fastpaths.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Moving quiescent-state forcing into a kthread dispenses with the need
for the ->n_rp_need_fqs field, so this commit removes it.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
RCU quiescent-state forcing is currently carried out without preemption
points, which can result in excessive latency spikes on large systems
(many hundreds or thousands of CPUs). This patch therefore inserts
a voluntary preemption point into force_qs_rnp(), which should greatly
reduce the magnitude of these spikes.
Reported-by: Mike Galbraith <mgalbraith@suse.de>
Reported-by: Dimitri Sivanich <sivanich@sgi.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
As the first step towards allowing quiescent-state forcing to be
preemptible, this commit moves RCU quiescent-state forcing into the
same kthread that is now used to initialize and clean up after grace
periods. This is yet another step towards keeping scheduling
latency down to a dull roar.
Updated to change from raw_spin_lock_irqsave() to raw_spin_lock_irq()
and to remove the now-unused rcu_state structure fields as suggested by
Peter Zijlstra.
Reported-by: Mike Galbraith <mgalbraith@suse.de>
Reported-by: Dimitri Sivanich <sivanich@sgi.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The fields in the rcu_state structure that are protected by the
root rcu_node structure's ->lock can share a cache line with the
fields protected by ->onofflock. This can result in excessive
memory contention on large systems, so this commit applies
____cacheline_internodealigned_in_smp to the ->onofflock field in
order to segregate them.
Signed-off-by: Dimitri Sivanich <sivanich@sgi.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Dimitri Sivanich <sivanich@sgi.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
In kernels built with CONFIG_RCU_FAST_NO_HZ=y, CPUs can accumulate a
large number of lazy callbacks, which as the name implies will be slow
to be invoked. This can be a problem on small-memory systems, where the
default 6-second sleep for CPUs having only lazy RCU callbacks could well
be fatal. This commit therefore installs an OOM hander that ensures that
every CPU with lazy callbacks has at least one non-lazy callback, in turn
ensuring timely advancement for these callbacks.
Updated to fix bug that disabled OOM killing, noted by Lai Jiangshan.
Updated to push the for_each_rcu_flavor() loop into rcu_oom_notify_cpu(),
thus reducing the number of IPIs, as suggested by Steven Rostedt. Also
to make the for_each_online_cpu() loop be preemptible. (Later, it might
be good to use smp_call_function(), as suggested by Peter Zijlstra.)
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Sasha Levin <levinsasha928@gmail.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Earlier versions of RCU invoked the RCU core from the CPU_DYING notifier
in order to note a quiescent state for the outgoing CPU. Because the
CPU is marked "offline" during the execution of the CPU_DYING notifiers,
the RCU core had to tolerate being invoked from an offline CPU. However,
commit b1420f1c (Make rcu_barrier() less disruptive) left only tracing
code in the CPU_DYING notifier, so the RCU core need no longer execute
on offline CPUs. This commit therefore enforces this restriction.
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Then rcu_gp_kthread() function is too large and furthermore needs to
have the force_quiescent_state() code pulled in. This commit therefore
breaks up rcu_gp_kthread() into rcu_gp_init() and rcu_gp_cleanup().
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
RCU grace-period cleanup is currently carried out with interrupts
disabled, which can result in excessive latency spikes on large systems
(many hundreds or thousands of CPUs). This patch therefore makes the
RCU grace-period cleanup be preemptible, including voluntary preemption
points, which should eliminate those latency spikes. Similar spikes from
forcing of quiescent states will be dealt with similarly by later patches.
Updated to replace uses of spin_lock_irqsave() with spin_lock_irq(), as
suggested by Peter Zijlstra.
Reported-by: Mike Galbraith <mgalbraith@suse.de>
Reported-by: Dimitri Sivanich <sivanich@sgi.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
As a first step towards allowing grace-period cleanup to be preemptible,
this commit moves the RCU grace-period cleanup into the same kthread
that is now used to initialize grace periods. This is needed to keep
scheduling latency down to a dull roar.
[ paulmck: Get rid of stray spin_lock_irqsave() calls. ]
Reported-by: Mike Galbraith <mgalbraith@suse.de>
Reported-by: Dimitri Sivanich <sivanich@sgi.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
RCU grace-period initialization is currently carried out with interrupts
disabled, which can result in 200-microsecond latency spikes on systems
on which RCU has been configured for 4096 CPUs. This patch therefore
makes the RCU grace-period initialization be preemptible, which should
eliminate those latency spikes. Similar spikes from grace-period cleanup
and the forcing of quiescent states will be dealt with similarly by later
patches.
Reported-by: Mike Galbraith <mgalbraith@suse.de>
Reported-by: Dimitri Sivanich <sivanich@sgi.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
The next step in reducing RCU's grace-period initialization latency on
large systems will make this initialization preemptible. Unfortunately,
making the grace-period initialization subject to interrupts (let alone
preemption) exposes the following race on systems whose rcu_node tree
contains more than one node:
1. CPU 31 starts initializing the grace period, including the
first leaf rcu_node structures, and is then preempted.
2. CPU 0 refers to the first leaf rcu_node structure, and notes
that a new grace period has started. It passes through a
quiescent state shortly thereafter, and informs the RCU core
of this rite of passage.
3. CPU 0 enters an RCU read-side critical section, acquiring
a pointer to an RCU-protected data item.
4. CPU 31 takes an interrupt whose handler removes the data item
referenced by CPU 0 from the data structure, and registers an
RCU callback in order to free it.
5. CPU 31 resumes initializing the grace period, including its
own rcu_node structure. In invokes rcu_start_gp_per_cpu(),
which advances all callbacks, including the one registered
in #4 above, to be handled by the current grace period.
6. The remaining CPUs pass through quiescent states and inform
the RCU core, but CPU 0 remains in its RCU read-side critical
section, still referencing the now-removed data item.
7. The grace period completes and all the callbacks are invoked,
including the one that frees the data item that CPU 0 is still
referencing. Oops!!!
One way to avoid this race is to remove grace-period acceleration from
rcu_start_gp_per_cpu(). Now, the only reason for this acceleration was
to allow CPUs bringing RCU out of idle state to have their callbacks
invoked after only one grace period, rather than the two grace periods
that would otherwise be required. But this acceleration does not
work when RCU grace-period initialization is moved to a kthread because
the CPU posting the callback is no longer necessarily the CPU that is
initializing the resulting grace period.
This commit therefore removes this now-pointless (and soon to be dangerous)
grace-period acceleration, thus avoiding the above race.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
As the first step towards allowing grace-period initialization to be
preemptible, this commit moves the RCU grace-period initialization
into its own kthread. This is needed to keep large-system scheduling
latency at reasonable levels.
Also change raw_spin_lock_irqsave() to raw_spin_lock_irq() as suggested
by Peter Zijlstra in review comments.
Reported-by: Mike Galbraith <mgalbraith@suse.de>
Reported-by: Dimitri Sivanich <sivanich@sgi.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Each grace period is supposed to have at least one callback waiting
for that grace period to complete. However, if CONFIG_NO_HZ=n, an
extra callback-free grace period is no big problem -- it will chew up
a tiny bit of CPU time, but it will complete normally. In contrast,
CONFIG_NO_HZ=y kernels have the potential for all the CPUs to go to
sleep indefinitely, in turn indefinitely delaying completion of the
callback-free grace period. Given that nothing is waiting on this grace
period, this is also not a problem.
That is, unless RCU CPU stall warnings are also enabled, as they are
in recent kernels. In this case, if a CPU wakes up after at least one
minute of inactivity, an RCU CPU stall warning will result. The reason
that no one noticed until quite recently is that most systems have enough
OS noise that they will never remain absolutely idle for a full minute.
But there are some embedded systems with cut-down userspace configurations
that consistently get into this situation.
All this begs the question of exactly how a callback-free grace period
gets started in the first place. This can happen due to the fact that
CPUs do not necessarily agree on which grace period is in progress.
If a CPU still believes that the grace period that just completed is
still ongoing, it will believe that it has callbacks that need to wait for
another grace period, never mind the fact that the grace period that they
were waiting for just completed. This CPU can therefore erroneously
decide to start a new grace period. Note that this can happen in
TREE_RCU and TREE_PREEMPT_RCU even on a single-CPU system: Deadlock
considerations mean that the CPU that detected the end of the grace
period is not necessarily officially informed of this fact for some time.
Once this CPU notices that the earlier grace period completed, it will
invoke its callbacks. It then won't have any callbacks left. If no
other CPU has any callbacks, we now have a callback-free grace period.
This commit therefore makes CPUs check more carefully before starting a
new grace period. This new check relies on an array of tail pointers
into each CPU's list of callbacks. If the CPU is up to date on which
grace periods have completed, it checks to see if any callbacks follow
the RCU_DONE_TAIL segment, otherwise it checks to see if any callbacks
follow the RCU_WAIT_TAIL segment. The reason that this works is that
the RCU_WAIT_TAIL segment will be promoted to the RCU_DONE_TAIL segment
as soon as the CPU is officially notified that the old grace period
has ended.
This change is to cpu_needs_another_gp(), which is called in a number
of places. The only one that really matters is in rcu_start_gp(), where
the root rcu_node structure's ->lock is held, which prevents any
other CPU from starting or completing a grace period, so that the
comparison that determines whether the CPU is missing the completion
of a grace period is stable.
Reported-by: Becky Bruce <bgillbruce@gmail.com>
Reported-by: Subodh Nijsure <snijsure@grid-net.com>
Reported-by: Paul Walmsley <paul@pwsan.com>
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Paul Walmsley <paul@pwsan.com> # OMAP3730, OMAP4430
Cc: stable@vger.kernel.org
Pull timer fix from Ingo Molnar:
"One more timekeeping fix for v3.6"
* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
time: Fix timeekeping_get_ns overflow on 32bit systems
e0aecdd874 ("workqueue: use irqsafe timer for delayed_work") made
try_to_grab_pending() safe to use from irq context but forgot to
remove WARN_ON_ONCE(in_irq()). Remove it.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Pull workqueue / powernow-k8 fix from Tejun Heo:
"This is the fix for the bug where cpufreq/powernow-k8 was tripping
BUG_ON() in try_to_wake_up_local() by migrating workqueue worker to a
different CPU.
https://bugzilla.kernel.org/show_bug.cgi?id=47301
As discussed, the fix is now two parts - one to reimplement
work_on_cpu() so that it doesn't create a new kthread each time and
the actual fix which makes powernow-k8 use work_on_cpu() instead of
performing manual migration.
While pretty late in the merge cycle, both changes are on the safer
side. Jiri and I verified two existing users of work_on_cpu() and
Duncan confirmed that the powernow-k8 fix survived about 18 hours of
testing."
* 'for-3.6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
cpufreq/powernow-k8: workqueue user shouldn't migrate the kworker to another CPU
workqueue: reimplement work_on_cpu() using system_wq
workqueue_set_max_active() may increase ->max_active without
activating delayed works and may make the activation order differ from
the queueing order. Both aren't strictly bugs but the resulting
behavior could be a bit odd.
To make things more consistent, use cwq_set_max_active() helper which
immediately makes use of the newly increased max_mactive if there are
delayed work items and also keeps the activation order.
tj: Slight update to description.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Using a helper instead of open code makes thaw_workqueues() clearer.
The helper will also be used by the next patch.
tj: Slight update to comment and description.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
The existing work_on_cpu() implementation is hugely inefficient. It
creates a new kthread, execute that single function and then let the
kthread die on each invocation.
Now that system_wq can handle concurrent executions, there's no
advantage of doing this. Reimplement work_on_cpu() using system_wq
which makes it simpler and way more efficient.
stable: While this isn't a fix in itself, it's needed to fix a
workqueue related bug in cpufreq/powernow-k8. AFAICS, this
shouldn't break other existing users.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Jiri Kosina <jkosina@suse.cz>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Len Brown <lenb@kernel.org>
Cc: Rafael J. Wysocki <rjw@sisk.pl>
Cc: stable@vger.kernel.org
@delayed is now always false for all callers, remove it.
tj: Updated description.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Currently, when try_to_grab_pending() grabs a delayed work item, it
leaves its linked work items alone on the delayed_works. The linked
work items are always NO_COLOR and will cause future
cwq_activate_first_delayed() increase cwq->nr_active incorrectly, and
may cause the whole cwq to stall. For example,
state: cwq->max_active = 1, cwq->nr_active = 1
one work in cwq->pool, many in cwq->delayed_works.
step1: try_to_grab_pending() removes a work item from delayed_works
but leaves its NO_COLOR linked work items on it.
step2: Later on, cwq_activate_first_delayed() activates the linked
work item increasing ->nr_active.
step3: cwq->nr_active = 1, but all activated work items of the cwq are
NO_COLOR. When they finish, cwq->nr_active will not be
decreased due to NO_COLOR, and no further work items will be
activated from cwq->delayed_works. the cwq stalls.
Fix it by ensuring the target work item is activated before stealing
PENDING in try_to_grab_pending(). This ensures that all the linked
work items are activated without incorrectly bumping cwq->nr_active.
tj: Updated comment and description.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: stable@kernel.org
workqueue_cpu_down_callback() is used only if HOTPLUG_CPU=y, so
hotcpu_notifier() fits better than cpu_notifier().
When HOTPLUG_CPU=y, hotcpu_notifier() and cpu_notifier() are the same.
When HOTPLUG_CPU=n, if we use cpu_notifier(),
workqueue_cpu_down_callback() will be called during boot to do
nothing, and the memory of workqueue_cpu_down_callback() and
gcwq_unbind_fn() will be discarded after boot.
If we use hotcpu_notifier(), we can avoid the no-op call of
workqueue_cpu_down_callback() and the memory of
workqueue_cpu_down_callback() and gcwq_unbind_fn() will be discard at
build time:
$ ls -l kernel/workqueue.o.cpu_notifier kernel/workqueue.o.hotcpu_notifier
-rw-rw-r-- 1 laijs laijs 484080 Sep 15 11:31 kernel/workqueue.o.cpu_notifier
-rw-rw-r-- 1 laijs laijs 478240 Sep 15 11:31 kernel/workqueue.o.hotcpu_notifier
$ size kernel/workqueue.o.cpu_notifier kernel/workqueue.o.hotcpu_notifier
text data bss dec hex filename
18513 2387 1221 22121 5669 kernel/workqueue.o.cpu_notifier
18082 2355 1221 21658 549a kernel/workqueue.o.hotcpu_notifier
tj: Updated description.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
For workqueue hotplug callbacks, it makes less sense to use __devinit
which discards the memory after boot if !HOTPLUG. __cpuinit, which
discards the memory after boot if !HOTPLUG_CPU fits better.
tj: Updated description.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Now that manager_mutex's role has changed from synchronizing manager
role to excluding hotplug against manager, the name is misleading.
As it is protecting the CPU-association of the gcwq now, rename it to
assoc_mutex.
This patch is pure rename and doesn't introduce any functional change.
tj: Updated comments and description.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Now both worker destruction and idle rebinding remove the worker from
idle list while it's still idle, so list_empty(&worker->entry) can be
used to test whether either is pending and WORKER_DIE to distinguish
between the two instead making WORKER_REBIND unnecessary.
Use list_empty(&worker->entry) to determine whether destruction or
rebinding is pending. This simplifies worker state transitions.
WORKER_REBIND is not needed anymore. Remove it.
tj: Updated comments and description.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Because the old unbind/rebinding implementation wasn't atomic w.r.t.
GCWQ_DISASSOCIATED manipulation which is protected by
global_cwq->lock, we had to use two flags, WORKER_UNBOUND and
WORKER_REBIND, to avoid incorrectly losing all NOT_RUNNING bits with
back-to-back CPU hotplug operations; otherwise, completion of
rebinding while another unbinding is in progress could clear UNBIND
prematurely.
Now that both unbind/rebinding are atomic w.r.t. GCWQ_DISASSOCIATED,
there's no need to use two flags. Just one is enough. Don't use
WORKER_REBIND for busy rebinding.
tj: Updated description.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Currently rebind_workers() uses rebinds idle workers synchronously
before proceeding to requesting busy workers to rebind. This is
necessary because all workers on @worker_pool->idle_list must be bound
before concurrency management local wake-ups from the busy workers
take place.
Unfortunately, the synchronous idle rebinding is quite complicated.
This patch reimplements idle rebinding to simplify the code path.
Rather than trying to make all idle workers bound before rebinding
busy workers, we simply remove all to-be-bound idle workers from the
idle list and let them add themselves back after completing rebinding
(successful or not).
As only workers which finished rebinding can on on the idle worker
list, the idle worker list is guaranteed to have only bound workers
unless CPU went down again and local wake-ups are safe.
After the change, @worker_pool->nr_idle may deviate than the actual
number of idle workers on @worker_pool->idle_list. More specifically,
nr_idle may be non-zero while ->idle_list is empty. All users of
->nr_idle and ->idle_list are audited. The only affected one is
too_many_workers() which is updated to check %false if ->idle_list is
empty regardless of ->nr_idle.
After this patch, rebind_workers() no longer performs the nasty
idle-rebind retries which require temporary release of gcwq->lock, and
both unbinding and rebinding are atomic w.r.t. global_cwq->lock.
worker->idle_rebind and global_cwq->rebind_hold are now unnecessary
and removed along with the definition of struct idle_rebind.
Changed from V1:
1) remove unlikely from too_many_workers(), ->idle_list can be empty
anytime, even before this patch, no reason to use unlikely.
2) fix a small rebasing mistake.
(which is from rebasing the orignal fixing patch to for-next)
3) add a lot of comments.
4) clear WORKER_REBIND unconditionaly in idle_worker_rebind()
tj: Updated comments and description.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Implement kprojid_t a cousin of the kuid_t and kgid_t.
The per user namespace mapping of project id values can be set with
/proc/<pid>/projid_map.
A full compliment of helpers is provided: make_kprojid, from_kprojid,
from_kprojid_munged, kporjid_has_mapping, projid_valid, projid_eq,
projid_eq, projid_lt.
Project identifiers are part of the generic disk quota interface,
although it appears only xfs implements project identifiers currently.
The xfs code allows anyone who has permission to set the project
identifier on a file to use any project identifier so when
setting up the user namespace project identifier mappings I do
not require a capability.
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
- When tracing capture the kuid.
- When displaying the data to user space convert the kuid into the
user namespace of the process that opened the report file.
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
BSD process accounting conveniently passes the file the accounting
records will be written into to do_acct_process. The file credentials
captured the user namespace of the opener of the file. Use the file
credentials to format the uid and the gid of the current process into
the user namespace of the user that started the bsd process
accounting.
Cc: Pavel Emelyanov <xemul@openvz.org>
Reviewed-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
- Explicitly limit exit task stat broadcast to the initial user and
pid namespaces, as it is already limited to the initial network
namespace.
- For broadcast task stats explicitly generate all of the idenitiers
in terms of the initial user namespace and the initial pid
namespace.
- For request stats report them in terms of the current user namespace
and the current pid namespace. Netlink messages are delivered
syncrhonously to the kernel allowing us to get the user namespace
and the pid namespace from the current task.
- Pass the namespaces for representing pids and uids and gids
into bacct_add_task.
Cc: Balbir Singh <bsingharora@gmail.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
- Explicitly format uids gids in audit messges in the initial user
namespace. This is safe because auditd is restrected to be in
the initial user namespace.
- Convert audit_sig_uid into a kuid_t.
- Enable building the audit code and user namespaces at the same time.
The net result is that the audit subsystem now uses kuid_t and kgid_t whenever
possible making it almost impossible to confuse a raw uid_t with a kuid_t
preventing bugs.
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Eric Paris <eparis@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Always store audit loginuids in type kuid_t.
Print loginuids by converting them into uids in the appropriate user
namespace, and then printing the resulting uid.
Modify audit_get_loginuid to return a kuid_t.
Modify audit_set_loginuid to take a kuid_t.
Modify /proc/<pid>/loginuid on read to convert the loginuid into the
user namespace of the opener of the file.
Modify /proc/<pid>/loginud on write to convert the loginuid
rom the user namespace of the opener of the file.
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Eric Paris <eparis@redhat.com>
Cc: Paul Moore <paul@paul-moore.com> ?
Cc: David Miller <davem@davemloft.net>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
The audit filter code guarantees that uid are always compared with
uids and gids are always compared with gids, as the comparason
operations are type specific. Take advantage of this proper to define
audit_uid_comparator and audit_gid_comparator which use the type safe
comparasons from uidgid.h.
Build on audit_uid_comparator and audit_gid_comparator and replace
audit_compare_id with audit_compare_uid and audit_compare_gid. This
is one of those odd cases where being type safe and duplicating code
leads to simpler shorter and more concise code.
Don't allow bitmask operations in uid and gid comparisons in
audit_data_to_entry. Bitmask operations are already denined in
audit_rule_to_entry.
Convert constants in audit_rule_to_entry and audit_data_to_entry into
kuids and kgids when appropriate.
Convert the uid and gid field in struct audit_names to be of type
kuid_t and kgid_t respectively, so that the new uid and gid comparators
can be applied in a type safe manner.
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Eric Paris <eparis@redhat.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
The only place we use the uid and the pid that we calculate in
audit_receive_msg is in audit_log_common_recv_msg so move the
calculation of these values into the audit_log_common_recv_msg.
Simplify the calcuation of the current pid and uid by
reading them from current instead of reading them from
NETLINK_CREDS.
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Eric Paris <eparis@redhat.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
For user generated audit messages set the portid field in the netlink
header to the netlink port where the user generated audit message came
from. Reporting the process id in a port id field was just nonsense.
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Eric Paris <eparis@redhat.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Use current instead of looking up the current up the current task by
process identifier. Netlink requests are processed in trhe context of
the sending task so this is safe.
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Eric Paris <eparis@redhat.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Now that netlink messages are processed in the context of the sender
tty_audit_push_task can be called directly and audit_prepare_user_tty
which only added looking up the task of the tty by process id is
not needed.
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Eric Paris <eparis@redhat.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Get caller process uid and gid and pid values from the current task
instead of the NETLINK_CB. This is simpler than passing NETLINK_CREDS
from from audit_receive_msg to audit_filter_user_rules and avoid the
chance of being hit by the occassional bugs in netlink uid/gid
credential passing. This is a safe changes because all netlink
requests are processed in the task of the sending process.
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Eric Paris <eparis@redhat.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
This allows the code to safely make the assumption that all of the
uids gids and pids that need to be send in audit messages are in the
initial namespaces.
If someone cares we may lift this restriction someday but start with
limiting access so at least the code is always correct.
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Eric Paris <eparis@redhat.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
This merge is necessary as Lai's CPU hotplug restructuring series
depends on the CPU hotplug bug fixes in for-3.6-fixes.
The merge creates one trivial conflict between the following two
commits.
96e65306b8 "workqueue: UNBOUND -> REBIND morphing in rebind_workers() should be atomic"
e2b6a6d570 "workqueue: use system_highpri_wq for highpri workers in rebind_workers()"
Both add local variable definitions to the same block and can be
merged in any order.
Signed-off-by: Tejun Heo <tj@kernel.org>
Pull another workqueue fix from Tejun Heo:
"Unfortunately, yet another late fix. This too is discovered and fixed
by Lai. This bug was introduced during this merge window by commit
25511a4776 ("workqueue: reimplement CPU online rebinding to handle
idle workers") which started using WORKER_REBIND flag for idle rebind
too.
The bug is relatively easy to trigger if the CPU rapidly goes through
off, on and then off (and stay off). The fix is on the safer side.
This hasn't been on linux-next yet but I'm pushing early so that it
can get more exposure before v3.6 release."
* 'for-3.6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
workqueue: always clear WORKER_REBIND in busy_worker_rebind_fn()
The kernel doesn't check the pid for negative values, so if you try to
write -2 to /proc/sys/kernel/ns_last_pid, you will get a kernel panic.
The crash happens because the next pid is -1, and alloc_pidmap() will
try to access to a nonexistent pidmap.
map = &pid_ns->pidmap[pid/BITS_PER_PAGE];
Signed-off-by: Andrew Vagin <avagin@openvz.org>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* pm-sleep:
properly __init-annotate pm_sysrq_init()
PM / wakeup: Use irqsave/irqrestore for events_lock
PM / Freezer: Fix small typo "regrigerator"
PM / Sleep: Print name of wakeup source that aborts suspend
* pm-timers:
PM: Do not use the syscore flag for runtime PM
sh: MTU2: Basic runtime PM support
sh: CMT: Basic runtime PM support
sh: TMU: Basic runtime PM support
PM / Domains: Do not measure start time for "irq safe" devices
PM / Domains: Move syscore flag from subsys data to struct device
PM / Domains: Rename the always_on device flag to syscore
PM / Runtime: Allow helpers to be called by early platform drivers
PM: Reorganize device PM initialization
sh: MTU2: Introduce clock events suspend/resume routines
sh: CMT: Introduce clocksource/clock events suspend/resume routines
sh: TMU: Introduce clocksource/clock events suspend/resume routines
timekeeping: Add suspend and resume of clock event devices
PM / Domains: Add power off/on function for system core suspend stage
PM / Domains: Introduce simplified power on routine for system resume
This patch allows setting of the show_unhandled_signals variable via
/proc/sys/debug/exception-trace. The default value is currently 1
showing unhandled user faults (undefined instructions, data aborts) and
invalid signal stack frames.
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: Tony Lindgren <tony@atomide.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Nicolas Pitre <nico@linaro.org>
Acked-by: Olof Johansson <olof@lixom.net>
Acked-by: Santosh Shilimkar <santosh.shilimkar@ti.com>
This reverts commit 970e178985.
Nikolay Ulyanitsky reported thatthe 3.6-rc5 kernel has a 15-20%
performance drop on PostgreSQL 9.2 on his machine (running "pgbench").
Borislav Petkov was able to reproduce this, and bisected it to this
commit 970e178985 ("sched: Improve scalability via 'CPU buddies' ...")
apparently because the new single-idle-buddy model simply doesn't find
idle CPU's to reschedule on aggressively enough.
Mike Galbraith suspects that it is likely due to the user-mode spinlocks
in PostgreSQL not reacting well to preemption, but we don't really know
the details - I'll just revert the commit for now.
There are hopefully other approaches to improve scheduler scalability
without it causing these kinds of downsides.
Reported-by: Nikolay Ulyanitsky <lystor@gmail.com>
Bisected-by: Borislav Petkov <bp@alien8.de>
Acked-by: Mike Galbraith <efault@gmx.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Conflicts:
net/netfilter/nfnetlink_log.c
net/netfilter/xt_LOG.c
Rather easy conflict resolution, the 'net' tree had bug fixes to make
sure we checked if a socket is a time-wait one or not and elide the
logging code if so.
Whereas on the 'net-next' side we are calculating the UID and GID from
the creds using different interfaces due to the user namespace changes
from Eric Biederman.
Signed-off-by: David S. Miller <davem@davemloft.net>
As Oleg pointed out in [0] uprobe should not use the ptrace interface
for enabling/disabling single stepping.
[0] http://lkml.kernel.org/r/20120730141638.GA5306@redhat.com
Add the new "__weak arch" helpers which simply call user_*_single_step()
as a preparation. This is only needed to not break the powerpc port, we
will fold this logic into arch_uprobe_pre/post_xol() hooks later.
We should also change handle_singlestep(), _disable_step(&uprobe->arch)
should be called before put_uprobe().
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
The wrong MMF_HAS_UPROBES doesn't really hurt, just it triggers
the "slow" and unnecessary handle_swbp() path if the task hits
the non-uprobe breakpoint.
So this patch changes find_active_uprobe() to check every valid
vma and clear MMF_HAS_UPROBES if no uprobes were found. This is
adds the slow O(n) path, but it is only called in unlikely case
when the task hits the normal breakpoint first time after
uprobe_unregister().
Note the "not strictly accurate" comment in mmf_recalc_uprobes().
We can fix this, we only need to teach vma_has_uprobes() to return
a bit more more info, but I am not sure this worth the trouble.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Add the new MMF_RECALC_UPROBES flag, it means that MMF_HAS_UPROBES
can be false positive after remove_breakpoint() or uprobe_munmap().
It is also set by uprobe_dup_mmap(), this is not optimal but simple.
We could add the new hook, uprobe_dup_vma(), to set MMF_HAS_UPROBES
only if the new mm actually has uprobes, but I don't think this
makes sense.
The next patch will use this flag to clear MMF_HAS_UPROBES.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Nobody plays with uprobes_tree/uprobes_treelock in interrupt context,
no need to disable irqs.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Pull perf fixes from Ingo Molnar:
"This tree includes various fixes"
Ingo really needs to improve on the whole "explain git pull" part.
"Various fixes" indeed.
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/hwpb: Invoke __perf_event_disable() if interrupts are already disabled
perf/x86: Enable Intel Cedarview Atom suppport
perf_event: Switch to internal refcount, fix race with close()
oprofile, s390: Fix uninitialized memory access when writing to oprofilefs
perf/x86: Fix microcode revision check for SNB-PEBS
Currently, cgroup hierarchy support is a mess. cpu related subsystems
behave correctly - configuration, accounting and control on a parent
properly cover its children. blkio and freezer completely ignore
hierarchy and treat all cgroups as if they're directly under the root
cgroup. Others show yet different behaviors.
These differing interpretations of cgroup hierarchy make using cgroup
confusing and it impossible to co-mount controllers into the same
hierarchy and obtain sane behavior.
Eventually, we want full hierarchy support from all subsystems and
probably a unified hierarchy. Users using separate hierarchies
expecting completely different behaviors depending on the mounted
subsystem is deterimental to making any progress on this front.
This patch adds cgroup_subsys.broken_hierarchy and sets it to %true
for controllers which are lacking in hierarchy support. The goal of
this patch is two-fold.
* Move users away from using hierarchy on currently non-hierarchical
subsystems, so that implementing proper hierarchy support on those
doesn't surprise them.
* Keep track of which controllers are broken how and nudge the
subsystems to implement proper hierarchy support.
For now, start with a single warning message. We can whine louder
later on.
v2: Fixed a typo spotted by Michal. Warning message updated.
v3: Updated memcg part so that it doesn't generate warning in the
cases where .use_hierarchy=false doesn't make the behavior
different from root.use_hierarchy=true. Fixed a typo spotted by
Glauber.
v4: Check ->broken_hierarchy after cgroup creation is complete so that
->create() can affect the result per Michal. Dropped unnecessary
memcg root handling per Michal.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Li Zefan <lizefan@huawei.com>
Acked-by: Serge E. Hallyn <serue@us.ibm.com>
Cc: Glauber Costa <glommer@parallels.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paul Turner <pjt@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Thomas Graf <tgraf@suug.ch>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Cc: Neil Horman <nhorman@tuxdriver.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
WARNING: With this change it is impossible to load external built
controllers anymore.
In case where CONFIG_NETPRIO_CGROUP=m and CONFIG_NET_CLS_CGROUP=m is
set, corresponding subsys_id should also be a constant. Up to now,
net_prio_subsys_id and net_cls_subsys_id would be of the type int and
the value would be assigned during runtime.
By switching the macro definition IS_SUBSYS_ENABLED from IS_BUILTIN
to IS_ENABLED, all *_subsys_id will have constant value. That means we
need to remove all the code which assumes a value can be assigned to
net_prio_subsys_id and net_cls_subsys_id.
A close look is necessary on the RCU part which was introduces by
following patch:
commit f845172531
Author: Herbert Xu <herbert@gondor.apana.org.au> Mon May 24 09:12:34 2010
Committer: David S. Miller <davem@davemloft.net> Mon May 24 09:12:34 2010
cls_cgroup: Store classid in struct sock
Tis code was added to init_cgroup_cls()
/* We can't use rcu_assign_pointer because this is an int. */
smp_wmb();
net_cls_subsys_id = net_cls_subsys.subsys_id;
respectively to exit_cgroup_cls()
net_cls_subsys_id = -1;
synchronize_rcu();
and in module version of task_cls_classid()
rcu_read_lock();
id = rcu_dereference(net_cls_subsys_id);
if (id >= 0)
classid = container_of(task_subsys_state(p, id),
struct cgroup_cls_state, css)->classid;
rcu_read_unlock();
Without an explicit explaination why the RCU part is needed. (The
rcu_deference was fixed by exchanging it to rcu_derefence_index_check()
in a later commit, but that is a minor detail.)
So here is my pondering why it was introduced and why it safe to
remove it now. Note that this code was copied over to net_prio the
reasoning holds for that subsystem too.
The idea behind the RCU use for net_cls_subsys_id is to make sure we
get a valid pointer back from task_subsys_state(). task_subsys_state()
is just blindly accessing the subsys array and returning the
pointer. Obviously, passing in -1 as id into task_subsys_state()
returns an invalid value (out of lower bound).
So this code makes sure that only after module is loaded and the
subsystem registered, the id is assigned.
Before unregistering the module all old readers must have left the
critical section. This is done by assigning -1 to the id and issuing a
synchronized_rcu(). Any new readers wont call task_subsys_state()
anymore and therefore it is safe to unregister the subsystem.
The new code relies on the same trick, but it looks at the subsys
pointer return by task_subsys_state() (remember the id is constant
and therefore we allways have a valid index into the subsys
array).
No precautions need to be taken during module loading
module. Eventually, all CPUs will get a valid pointer back from
task_subsys_state() because rebind_subsystem() which is called after
the module init() function will assigned subsys[net_cls_subsys_id] the
newly loaded module subsystem pointer.
When the subsystem is about to be removed, rebind_subsystem() will
called before the module exit() function. In this case,
rebind_subsys() will assign subsys[net_cls_subsys_id] a NULL pointer
and then it calls synchronize_rcu(). All old readers have left by then
the critical section. Any new reader wont access the subsystem
anymore. At this point we are safe to unregister the subsystem. No
synchronize_rcu() call is needed.
Signed-off-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Gao feng <gaofeng@cn.fujitsu.com>
Cc: Glauber Costa <glommer@parallels.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: John Fastabend <john.r.fastabend@intel.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: netdev@vger.kernel.org
Cc: cgroups@vger.kernel.org
The *_subsys_id will be used as index to access the subsys. Therefore
we need to care we populate the subsystem at the correct position by
using designated initialization.
With this change we are able to interleave builtin and modules in the subsys
array.
Signed-off-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Cc: Gao feng <gaofeng@cn.fujitsu.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: John Fastabend <john.r.fastabend@intel.com>
Cc: netdev@vger.kernel.org
Cc: cgroups@vger.kernel.org
Before we are able to define all subsystem ids at compile time we need
a more fine grained control what gets defined when we include
cgroup_subsys.h. For example we define the enums for the subsystems or
to declare for struct cgroup_subsys (builtin subsystem) by including
cgroup_subsys.h and defining SUBSYS accordingly.
Currently, the decision if a subsys is used is defined inside the
header by testing if CONFIG_*=y is true. By moving this test outside
of cgroup_subsys.h we are able to control it on the include level.
This is done by introducing IS_SUBSYS_ENABLED which then is defined
according the task, e.g. is CONFIG_*=y or CONFIG_*=m.
Signed-off-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Cc: Gao feng <gaofeng@cn.fujitsu.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: John Fastabend <john.r.fastabend@intel.com>
Cc: netdev@vger.kernel.org
Cc: cgroups@vger.kernel.org
CGROUP_BUILTIN_SUBSYS_COUNT is used as start index or stop index when
looping over the subsys array looking either at the builtin or the
module subsystems. Since all the builtin subsystems have an id which
is lower then CGROUP_BUILTIN_SUBSYS_COUNT we know that any module will
have an id larger than CGROUP_BUILTIN_SUBSYS_COUNT. In short the ids
are sorted.
We are about to change id assignment to happen only at compile time
later in this series. That means we can't rely on the above trick
since all ids will always be defined at compile time. Furthermore,
ordering the builtin subsystems and the module subsystems is not
really necessary.
So we need a different way to know which subsystem is a builtin or a
module one. We can use the subsys[]->module pointer for this. Any
place where we need to know if a subsys is module we just check for
the pointer. If it is NULL then the subsystem is a builtin one.
With this we are able to drop the CGROUP_BUILTIN_SUBSYS_COUNT
enum. Though we need to introduce a temporary placeholder so that we
don't get a compilation error when only CONFIG_CGROUP is selected and
no single controller. An empty enum definition is not valid. Later in
this series we are able to remove the placeholder again.
And with this change we get a fix for this:
kernel/cgroup.c: In function ‘cgroup_load_subsys’:
kernel/cgroup.c:4326:38: warning: array subscript is below array bounds [-Warray-bounds]
when CONFIG_CGROUP=y and no built in controller was enabled.
Signed-off-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Cc: Gao feng <gaofeng@cn.fujitsu.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: John Fastabend <john.r.fastabend@intel.com>
Cc: netdev@vger.kernel.org
Cc: cgroups@vger.kernel.org
Fix kprobes/x86 to support jprobes on ftrace-based kprobes.
Because of -mfentry support of ftrace, ftrace is now put
on the beginning of function where jprobes are put.
Originally ftrace-based kprobes doesn't support jprobe
because it will change regs->ip and ftrace doesn't support
changing IP and ftrace itself doesn't conflict jprobe.
However, ftrace -mfentry support moves mcount call on the
top of functions where jprobes are put. This means that
jprobe always conflicts with ftrace-based kprobe and fails.
This patch allows ftrace-based kprobes to support jprobes
by allowing to modify regs->ip and kprobes breakpoint
handler also allows to skip singlestepping because there
is a ftrace call (not an original instruction).
Link: http://lkml.kernel.org/r/20120905143125.10329.90836.stgit@localhost.localdomain
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Commit 56449f437 "tracing: make the trace clocks available generally",
in April 2009, made trace_clock available unconditionally, since
CONFIG_X86_DS used it too.
Commit faa4602e47 "x86, perf, bts, mm: Delete the never used BTS-ptrace code",
in March 2010, removed CONFIG_X86_DS, and now only CONFIG_RING_BUFFER (split
out from CONFIG_TRACING for general use) has a dependency on trace_clock. So,
only compile in trace_clock with CONFIG_RING_BUFFER or CONFIG_TRACING
enabled.
Link: http://lkml.kernel.org/r/20120903024513.GA19583@leaf
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Break out the DEBUG_SPINLOCK dependency (requires moving up
UNINLINE_SPIN_UNLOCK, as this was the only one in that block not
depending on that option).
Avoid putting values not selected into the resulting .config -
they are not useful for anything, make the output less legible,
and just consume space: Use "depends on" rather than directly
setting the default from the combined dependency values.
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/504DF2AC020000780009A2DF@nat28.tlf.novell.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Daniel Lezcano reported seeing multi-second stalls from
keyboard input on his T61 laptop when NOHZ and CPU_IDLE
were enabled on a 32bit kernel.
He bisected the problem down to commit
1e75fa8be9 ("time: Condense timekeeper.xtime into xtime_sec").
After reproducing this issue, I narrowed the problem down
to the fact that timekeeping_get_ns() returns a 64bit
nsec value that hasn't been accumulated. In some cases
this value was being then stored in timespec.tv_nsec
(which is a long).
On 32bit systems, with idle times larger then 4 seconds
(or less, depending on the value of xtime_nsec), the
returned nsec value would overflow 32bits. This limited
kept time from increasing, causing timers to not expire.
The fix is to make sure we don't directly store the
result of timekeeping_get_ns() into a tv_nsec field,
instead using a 64bit nsec value which can then be
added into the timespec via timespec_add_ns().
Reported-and-bisected-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Tested-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Acked-by: Prarit Bhargava <prarit@redhat.com>
Cc: Richard Cochran <richardcochran@gmail.com>
Link: http://lkml.kernel.org/r/1347405963-35715-1-git-send-email-john.stultz@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Steve Rostedt asked for the merge of a single commit, into both
the RCU and the perf/tracing tree:
| Josh made a change to the tracing code that affects both the
| work Paul McKenney and I are currently doing. At the last
| Kernel Summit back in August, Linus said when such a case
| exists, it is best to make a separate branch based off of his
| tree and place the change there. This way, the repositories
| that need to share the change can both pull them in and the
| SHA1 will match for both. Whichever branch is pulled in first
| by Linus will also pull in the necessary change for the other
| branch as well.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
It is considered good form to lock the lock you claim to be nested in.
Signed-off-by: Maarten Lankhorst <maarten.lankhorst@canonical.com>
[ removed nest_lock arg to print_lock_nested_lock_not_held in favour
of hlock->nest_lock, also renamed the lock arg to hlock since its
a held_lock type ]
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/5051A9E7.5040501@canonical.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Heteregeneous ARM platform uses arch_scale_freq_power function
to reflect the relative capacity of each core
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1341826026-6504-6-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
There is no load_balancer to be selected now. It just sets the
state of the nohz tick to stop.
So rename the function, pass the 'cpu' as a parameter and then
remove the useless call from tick_nohz_restart_sched_tick().
[ s/set_nohz_tick_stopped/nohz_balance_enter_idle/g
s/clear_nohz_tick_stopped/nohz_balance_exit_idle/g ]
Signed-off-by: Alex Shi <alex.shi@intel.com>
Acked-by: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Venkatesh Pallipadi <venki@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1347261059-24747-1-git-send-email-alex.shi@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Commit f319da0c68 ("sched: Fix load avg vs cpu-hotplug") was an
incomplete fix:
In particular, the problem is that at the point it calls
calc_load_migrate() nr_running := 1 (the stopper thread), so move the
call to CPU_DEAD where we're sure that nr_running := 0.
Also note that we can call calc_load_migrate() without serialization, we
know the state of rq is stable since its cpu is dead, and we modify the
global state using appropriate atomic ops.
Suggested-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1346882630.2600.59.camel@twins
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Now that the last architecture to use this has stopped doing so (ARM,
thanks Catalin!) we can remove this complexity from the scheduler
core.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Link: http://lkml.kernel.org/n/tip-g9p2a1w81xxbrze25v9zpzbf@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
On tickless systems, one CPU runs load balance for all idle CPUs.
The cpu_load of this CPU is updated before starting the load balance
of each other idle CPUs. We should instead update the cpu_load of
the balance_cpu.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Venkatesh Pallipadi <venki@google.com>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Link: http://lkml.kernel.org/r/1347509486-8688-1-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
ptrace_notify() and get_signal_to_deliver() do unnecessary things
before task_work_run():
1. smp_mb__after_clear_bit() is not needed, test_and_clear_bit()
implies mb().
2. And we do not need the barrier at all, in this case we only
care about the "synchronous" works added by the task itself.
3. No need to clear TIF_NOTIFY_RESUME, and we should not assume
task_works is the only user of this flag.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20120826191217.GA4238@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
ed3e694d "move exit_task_work() past exit_files() et.al" destroyed
the add/exit synchronization we had, the caller itself should ensure
task_work_add() can't race with the exiting task.
However, this is not convenient/simple, and the only user which tries
to do this is buggy (see the next patch). Unless the task is current,
there is simply no way to do this in general.
Change exit_task_work()->task_work_run() to use the dummy "work_exited"
entry to let task_work_add() know it should fail.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20120826191211.GA4228@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Change task_work's to use llist-like code to avoid pi_lock
in task_work_add(), this makes it useable under rq->lock.
task_work_cancel() and task_work_run() still use pi_lock
to synchronize with each other.
(This is in preparation for a deadlock fix.)
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20120826191209.GA4221@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
At the suggestion of eparis@redhat.com, move this chunk of task
logging from audit_log_exit to audit_log_task_info and export this
function so it's usuable elsewhere in the kernel.
This patch is against
git://git.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity#next-ima-appraisal
Changelog v2:
- add empty audit_log_task_info if CONFIG_AUDITSYSCALL isn't set.
Changelog v1:
- Initial post.
Signed-off-by: Peter Moody <pmoody@google.com>
Signed-off-by: Mimi Zohar <zohar@linux.vnet.ibm.com>
Pull workqueue fixes from Tejun Heo:
"It's later than I'd like but well the timing just didn't work out this
time.
There are three bug fixes. One from before 3.6-rc1 and two from the
new CPU hotplug code. Kudos to Lai for discovering all of them and
providing fixes.
* Atomicity bug when clearing a flag and setting another. The two
operation should have been atomic but wasn't. This bug has existed
for a long time but is unlikely to have actually happened. Fix is
safe. Marked for -stable.
* If CPU hotplug cycles happen back-to-back before workers finish the
previous cycle, the states could get out of sync and it could get
stuck. Fixed by waiting for workers to complete before finishing
hotplug cycle.
* While CPU hotplug is in progress, idle workers could be depleted
which can then lead to deadlock. I think both happening together
is highly unlikely but still better to fix it and the fix isn't too
scary.
There's another workqueue related regression which reported a few days
ago:
https://bugzilla.kernel.org/show_bug.cgi?id=47301
It's a bit of head scratcher but there is a semi-reliable reproduce
case, so I'm hoping to resolve it soonish."
* 'for-3.6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
workqueue: fix possible idle worker depletion across CPU hotplug
workqueue: restore POOL_MANAGING_WORKERS
workqueue: fix possible deadlock in idle worker rebinding
workqueue: move WORKER_REBIND clearing in rebind_workers() to the end of the function
workqueue: UNBOUND -> REBIND morphing in rebind_workers() should be atomic
It is a frequent mistake to confuse the netlink port identifier with a
process identifier. Try to reduce this confusion by renaming fields
that hold port identifiers portid instead of pid.
I have carefully avoided changing the structures exported to
userspace to avoid changing the userspace API.
I have successfully built an allyesconfig kernel with this change.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
To simplify both normal and CPU hotplug paths, worker management is
prevented while CPU hoplug is in progress. This is achieved by CPU
hotplug holding the same exclusion mechanism used by workers to ensure
there's only one manager per pool.
If someone else seems to be performing the manager role, workers
proceed to execute work items. CPU hotplug using the same mechanism
can lead to idle worker depletion because all workers could proceed to
execute work items while CPU hotplug is in progress and CPU hotplug
itself wouldn't actually perform the worker management duty - it
doesn't guarantee that there's an idle worker left when it releases
management.
This idle worker depletion, under extreme circumstances, can break
forward-progress guarantee and thus lead to deadlock.
This patch fixes the bug by using separate mechanisms for manager
exclusion among workers and hotplug exclusion. For manager exclusion,
POOL_MANAGING_WORKERS which was restored by the previous patch is
used. pool->manager_mutex is now only used for exclusion between the
elected manager and CPU hotplug. The elected manager won't proceed
without holding pool->manager_mutex.
This ensures that the worker which won the manager position can't skip
managing while CPU hotplug is in progress. It will block on
manager_mutex and perform management after CPU hotplug is complete.
Note that hotplug may happen while waiting for manager_mutex. A
manager isn't either on idle or busy list and thus the hoplug code
can't unbind/rebind it. Make the manager handle its own un/rebinding.
tj: Updated comment and description.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
This patch restores POOL_MANAGING_WORKERS which was replaced by
pool->manager_mutex by 6037315269 "workqueue: use mutex for global_cwq
manager exclusion".
There's a subtle idle worker depletion bug across CPU hotplug events
and we need to distinguish an actual manager and CPU hotplug
preventing management. POOL_MANAGING_WORKERS will be used for the
former and manager_mutex the later.
This patch just lays POOL_MANAGING_WORKERS on top of the existing
manager_mutex and doesn't introduce any synchronization changes. The
next patch will update it.
Note that this patch fixes a non-critical anomaly where
too_many_workers() may return %true spuriously while CPU hotplug is in
progress. While the issue could schedule idle timer spuriously, it
didn't trigger any actual misbehavior.
tj: Rewrote patch description.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
This patch defines netlink_kernel_create as a wrapper function of
__netlink_kernel_create to hide the struct module *me parameter
(which seems to be THIS_MODULE in all existing netlink subsystems).
Suggested by David S. Miller.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
pm_qos_get_value don't return a return code in all cases. It's sure that
anything interesting happend after BUG() but this prevent any compilation
warning.
[rjw: Chaneged the new return value to PM_QOS_DEFAULT_VALUE.]
Signed-off-by: Luis Gonzalez Fernandez <luisgf@gmail.com>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
With this patch we no longer reuse function tracer infrastructure, now
we register our own tracer back-end via a debugfs knob.
It's a bit more code, but that is the only downside. On the bright side we
have:
- Ability to make persistent_ram module removable (when needed, we can
move ftrace_ops struct into a module). Note that persistent_ram is still
not removable for other reasons, but with this patch it's just one
thing less to worry about;
- Pstore part is more isolated from the generic function tracer. We tried
it already by registering our own tracer in available_tracers, but that
way we're loosing ability to see the traces while we record them to
pstore. This solution is somewhere in the middle: we only register
"internal ftracer" back-end, but not the "front-end";
- When there is only pstore tracing enabled, the kernel will only write
to the pstore buffer, omitting function tracer buffer (which, of course,
still can be enabled via 'echo function > current_tracer').
Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Anton Vorontsov <anton.vorontsov@linaro.org>
Currently, rebind_workers() and idle_worker_rebind() are two-way
interlocked. rebind_workers() waits for idle workers to finish
rebinding and rebound idle workers wait for rebind_workers() to finish
rebinding busy workers before proceeding.
Unfortunately, this isn't enough. The second wait from idle workers
is implemented as follows.
wait_event(gcwq->rebind_hold, !(worker->flags & WORKER_REBIND));
rebind_workers() clears WORKER_REBIND, wakes up the idle workers and
then returns. If CPU hotplug cycle happens again before one of the
idle workers finishes the above wait_event(), rebind_workers() will
repeat the first part of the handshake - set WORKER_REBIND again and
wait for the idle worker to finish rebinding - and this leads to
deadlock because the idle worker would be waiting for WORKER_REBIND to
clear.
This is fixed by adding another interlocking step at the end -
rebind_workers() now waits for all the idle workers to finish the
above WORKER_REBIND wait before returning. This ensures that all
rebinding steps are complete on all idle workers before the next
hotplug cycle can happen.
This problem was diagnosed by Lai Jiangshan who also posted a patch to
fix the issue, upon which this patch is based.
This is the minimal fix and further patches are scheduled for the next
merge window to simplify the CPU hotplug path.
Signed-off-by: Tejun Heo <tj@kernel.org>
Original-patch-by: Lai Jiangshan <laijs@cn.fujitsu.com>
LKML-Reference: <1346516916-1991-3-git-send-email-laijs@cn.fujitsu.com>
This doesn't make any functional difference and is purely to help the
next patch to be simpler.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
The compiler may compile the following code into TWO write/modify
instructions.
worker->flags &= ~WORKER_UNBOUND;
worker->flags |= WORKER_REBIND;
so the other CPU may temporarily see worker->flags which doesn't have
either WORKER_UNBOUND or WORKER_REBIND set and perform local wakeup
prematurely.
Fix it by using single explicit assignment via ACCESS_ONCE().
Because idle workers have another WORKER_NOT_RUNNING flag, this bug
doesn't exist for them; however, update it to use the same pattern for
consistency.
tj: Applied the change to idle workers too and updated comments and
patch description a bit.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: stable@vger.kernel.org
While debugging a warning message on PowerPC while using hardware
breakpoints, it was discovered that when perf_event_disable is invoked
through hw_breakpoint_handler function with interrupts disabled, a
subsequent IPI in the code path would trigger a WARN_ON_ONCE message in
smp_call_function_single function.
This patch calls __perf_event_disable() when interrupts are already
disabled, instead of perf_event_disable().
Reported-by: Edjunior Barbosa Machado <emachado@linux.vnet.ibm.com>
Signed-off-by: K.Prasad <Prasad.Krishnan@gmail.com>
[naveen.n.rao@linux.vnet.ibm.com: v3: Check to make sure we target current task]
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20120802081635.5811.17737.stgit@localhost.localdomain
[ Fixed build error on MIPS. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Don't mess with file refcounts (or keep a reference to file, for
that matter) in perf_event. Use explicit refcount of its own
instead. Deal with the race between the final reference to event
going away and new children getting created for it by use of
atomic_long_inc_not_zero() in inherit_event(); just have the
latter free what it had allocated and return NULL, that works
out just fine (children of siblings of something doomed are
created as singletons, same as if the child of leader had been
created and immediately killed).
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Cc: stable@kernel.org
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20120820135925.GG23464@ZenIV.linux.org.uk
Signed-off-by: Ingo Molnar <mingo@kernel.org>
It's impossible to enter the else branch if we have set
skip_clock_update in task_yield_fair(), as yield_to_task_fair()
will directly return true after invoke task_yield_fair().
Signed-off-by: Michael Wang <wangyun@linux.vnet.ibm.com>
Acked-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/4FF2925A.9060005@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Unlike others, sched_migration_cost, sched_time_avg and
sched_shares_window doesn't have time unit as suffix. Add them.
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1345083330-19486-1-git-send-email-namhyung@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Various sd->*_idx's are used for refering the rq's load average table
when selecting a cpu to run. However they can be set to any number
with sysctl knobs so that it can crash the kernel if something bad is
given. Fix it by limiting them into the actual range.
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1345104204-8317-1-git-send-email-namhyung@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Commit beac4c7e4a ("sched: Remove AFFINE_WAKEUPS feature") removed
use of the flag but left the definition. Get rid of it.
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Link: http://lkml.kernel.org/r/1345090865-20851-1-git-send-email-namhyung@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Fix two kernel-doc warnings in kernel/sched/fair.c:
Warning(kernel/sched/fair.c:3660): Excess function parameter 'cpus' description in 'update_sg_lb_stats'
Warning(kernel/sched/fair.c:3806): Excess function parameter 'cpus' description in 'update_sd_lb_stats'
Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/50303714.3090204@xenotime.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
migrate_tasks() uses _pick_next_task_rt() to get tasks from the
real-time runqueues to be migrated. When rt_rq is throttled
_pick_next_task_rt() won't return anything, in which case
migrate_tasks() can't move all threads over and gets stuck in an
infinite loop.
Instead unthrottle rt runqueues before migrating tasks.
Additionally: move unthrottle_offline_cfs_rqs() to rq_offline_fair()
Signed-off-by: Peter Boonstoppel <pboonstoppel@nvidia.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Turner <pjt@google.com>
Link: http://lkml.kernel.org/r/5FBF8E85CA34454794F0F7ECBA79798F379D3648B7@HQMAIL04.nvidia.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Azat Khuzhin reported high loadavg in Linux v3.6
After checking the upstream scheduler code, I found Peter's commit:
5167e8d541 sched/nohz: Rewrite and fix load-avg computation -- again
not fully applied, missing the call to calc_load_exit_idle().
After that idle exit in sampling window will always be calculated
to non-idle, and the load will be higher than normal.
This patch adds the missing call to calc_load_exit_idle().
Signed-off-by: Charles Wang <muming.wq@taobao.com>
Cc: stable@kernel.org
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1345449754-27130-1-git-send-email-muming.wq@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Rabik and Paul reported two different issues related to the same few
lines of code.
Rabik's issue is that the nr_uninterruptible migration code is wrong in
that he sees artifacts due to this (Rabik please do expand in more
detail).
Paul's issue is that this code as it stands relies on us using
stop_machine() for unplug, we all would like to remove this assumption
so that eventually we can remove this stop_machine() usage altogether.
The only reason we'd have to migrate nr_uninterruptible is so that we
could use for_each_online_cpu() loops in favour of
for_each_possible_cpu() loops, however since nr_uninterruptible() is the
only such loop and its using possible lets not bother at all.
The problem Rabik sees is (probably) caused by the fact that by
migrating nr_uninterruptible we screw rq->calc_load_active for both rqs
involved.
So don't bother with fancy migration schemes (meaning we now have to
keep using for_each_possible_cpu()) and instead fold any nr_active delta
after we migrate all tasks away to make sure we don't have any skewed
nr_active accounting.
Reported-by: Rakib Mullick <rakib.mullick@gmail.com>
Reported-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1345454817.23018.27.camel@twins
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Some clock event devices, for example such that belong to PM domains,
need to be handled in a spcial way during the timekeeping suspend
and resume (which takes place in the system core, or "syscore",
stages of system power transitions) in analogy with clock sources.
Introduce .suspend() and .resume() callbacks for clock event devices
that will be executed by timekeeping_suspend/_resume(), respectively,
next the the clock sources' .suspend() and .resume() callbacks.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Introduce function pm_genpd_syscore_switch() and two wrappers around
it, pm_genpd_syscore_poweroff() and pm_genpd_syscore_poweron(),
allowing the callers to let the generic PM domains framework know
that the given device is not necessary any more and its PM domain
can be turned off (the former) or that the given device will be
required immediately, so its PM domain has to be turned on (the
latter) during the system core (syscore) stage of system suspend
(or hibernation) and resume.
These functions will be used for handling devices registered as
clock sources and clock event devices that belong to PM domains.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Andreas Bombe reported that the added ktime_t overflow checking added to
timespec_valid in commit 4e8b14526c ("time: Improve sanity checking of
timekeeping inputs") was causing problems with X.org because it caused
timeouts larger then KTIME_T to be invalid.
Previously, these large timeouts would be clamped to KTIME_MAX and would
never expire, which is valid.
This patch splits the ktime_t overflow checking into a new
timespec_valid_strict function, and converts the timekeeping codes
internal checking to use this more strict function.
Reported-and-tested-by: Andreas Bombe <aeb@debian.org>
Cc: Zhouping Liu <zliu@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Signed-off-by: John Stultz <john.stultz@linaro.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Nobody does set_orig_insn(verify => false), and I think nobody will.
Remove this argument. IIUC set_orig_insn(verify => false) was needed
to single-step without xol area.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Now that we have uprobe_dup_mmap() we can fold uprobe_reset_state()
into the new hook and remove it. mmput()->uprobe_clear_state() can't
be called before dup_mmap().
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Add the new MMF_HAS_UPROBES flag. It is set by install_breakpoint()
and it is copied by dup_mmap(), uprobe_pre_sstep_notifier() checks
it to avoid the slow path if the task was never probed. Perhaps it
makes sense to check it in valid_vma(is_register => false) as well.
This needs the new dup_mmap()->uprobe_dup_mmap() hook. We can't use
uprobe_reset_state() or put MMF_HAS_UPROBES into MMF_INIT_MASK, we
need oldmm->mmap_sem to avoid the race with uprobe_register() or
mmap() from another thread.
Currently we never clear this bit, it can be false-positive after
uprobe_unregister() or uprobe_munmap() or if dup_mmap() hits the
probed VM_DONTCOPY vma. But this is fine correctness-wise and has
no effect unless the task hits the non-uprobe breakpoint.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
-EEXIST from install_breakpoint() no longer makes sense, all
callers should simply treat it as "success". Change the code
to return zero and simplify register_for_each_vma().
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Once install_breakpoint() fails uprobe_mmap() "ignores" all other
uprobes and returns the error.
It was never really needed to to stop after the first error, and
in fact it was always wrong at least in -ENOTSUPP case.
Change uprobe_mmap() to ignore the errors and always return 0.
This is not what we want in the long term, but until we teach
the callers to handle the failure it would be better to remove
the pointless complications. And this doesn't look too bad, the
only "reasonable" error is ENOMEM but in this case the caller
should be oom-killed in the likely case or the system has more
serious problems.
However it makes sense to stop if fatal_signal_pending() == T.
In particular this helps if the task was oom-killed.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
1. Kill dup_mmap()->uprobe_mmap(), it was only needed to calculate
new_mm->uprobes_state.count removed by the previous patch.
If the forking process has a pending uprobe (int3) in vma, it will
be copied by copy_page_range(), note that it checks vma->anon_vma
so "Don't copy ptes" is not possible after install_breakpoint()
which does anon_vma_prepare().
2. Remove is_swbp_at_addr() and "int count" in uprobe_mmap(). Again,
this was needed for uprobes_state.count.
As a side effect this fixes the bug pointed out by Srikar,
this code lacked the necessary put_uprobe().
3. uprobe_munmap() becomes a nop after the previous patch. Remove the
meaningless code but do not remove the helper, we will need it.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
uprobes_state->count is only needed to avoid the slow path in
uprobe_pre_sstep_notifier(). It is also checked in uprobe_munmap()
but ironically its only goal to decrement this counter. However,
it is very broken. Just some examples:
- uprobe_mmap() can race with uprobe_unregister() and wrongly
increment the counter if it hits the non-uprobe "int3". Note
that install_breakpoint() checks ->consumers first and returns
-EEXIST if it is NULL.
"atomic_sub() if error" in uprobe_mmap() looks obviously wrong
too.
- uprobe_munmap() can race with uprobe_register() and wrongly
decrement the counter by the same reason.
- Suppose an appication tries to increase the mmapped area via
sys_mremap(). vma_adjust() does uprobe_munmap(whole_vma) first,
this can nullify the counter temporarily and race with another
thread which can hit the bp, the application will be killed by
SIGTRAP.
- Suppose an application mmaps 2 consecutive areas in the same file
and one (or both) of these areas has uprobes. In the likely case
mmap_region()->vma_merge() suceeds. Like above, this leads to
uprobe_munmap/uprobe_mmap from vma_merge()->vma_adjust() but then
mmap_region() does another uprobe_mmap(resulting_vma) and doubles
the counter.
This patch only removes this counter and fixes the compile errors,
then we will try to cleanup the changed code and add something else
instead.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
by the time we get here (after we pass cleanup_ret) uprobe is always is
set. If it is NULL we leave very early in the code.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Since read_opcode() reads from the referenced page and doesnt modify
the page contents nor the page attributes, there is no need to lock
the page.
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Merging critical fixes from upstream required for development.
* upstream/master: (809 commits)
libata: Add a space to " 2GB ATA Flash Disk" DMA blacklist entry
Revert "powerpc: Update g5_defconfig"
powerpc/perf: Use pmc_overflow() to detect rolled back events
powerpc: Fix VMX in interrupt check in POWER7 copy loops
powerpc: POWER7 copy_to_user/copy_from_user patch applied twice
powerpc: Fix personality handling in ppc64_personality()
powerpc/dma-iommu: Fix IOMMU window check
powerpc: Remove unnecessary ifdefs
powerpc/kgdb: Restore current_thread_info properly
powerpc/kgdb: Bail out of KGDB when we've been triggered
powerpc/kgdb: Do not set kgdb_single_step on ppc
powerpc/mpic_msgr: Add missing includes
powerpc: Fix null pointer deref in perf hardware breakpoints
powerpc: Fixup whitespace in xmon
powerpc: Fix xmon dl command for new printk implementation
xfs: check for possible overflow in xfs_ioc_trim
xfs: unlock the AGI buffer when looping in xfs_dialloc
xfs: fix uninitialised variable in xfs_rtbuf_get()
powerpc/fsl: fix "Failed to mount /dev: No such device" errors
powerpc/fsl: update defconfigs
...
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
This is one of the items in the plumber's wish list.
For use cases:
>> What would the use case be for this?
>
> Attaching meta information to services, in an easily discoverable
> way. For example, in systemd we create one cgroup for each service, and
> could then store data like the main pid of the specific service as an
> xattr on the cgroup itself. That way we'd have almost all service state
> in the cgroupfs, which would make it possible to terminate systemd and
> later restart it without losing any state information. But there's more:
> for example, some very peculiar services cannot be terminated on
> shutdown (i.e. fakeraid DM stuff) and it would be really nice if the
> services in question could just mark that on their cgroup, by setting an
> xattr. On the more desktopy side of things there are other
> possibilities: for example there are plans defining what an application
> is along the lines of a cgroup (i.e. an app being a collection of
> processes). With xattrs one could then attach an icon or human readable
> program name on the cgroup.
>
> The key idea is that this would allow attaching runtime meta information
> to cgroups and everything they model (services, apps, vms), that doesn't
> need any complex userspace infrastructure, has good access control
> (i.e. because the file system enforces that anyway, and there's the
> "trusted." xattr namespace), notifications (inotify), and can easily be
> shared among applications.
>
> Lennart
v7:
- no changes
v6:
- remove user xattr namespace, only allow trusted and security
v5:
- check for capabilities before setting/removing xattrs
v4:
- no changes
v3:
- instead of config option, use mount option to enable xattr support
Original-patch-by: Li Zefan <lizefan@huawei.com>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: Lennart Poettering <lpoetter@redhat.com>
Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Aristeu Rozanski <aris@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
When remounting cgroupfs with some subsystems added to it and some
removed, cgroup will remove all the files in root directory and then
re-popluate it.
What I'm doing here is, only remove files which belong to subsystems that
are to be unbinded, and only create files for newly-added subsystems.
The purpose is to have all other files untouched.
This is a preparation for cgroup xattr support.
v7:
- checkpatch warnings fixed
v6:
- no changes
v5:
- no changes
v4:
- refactored cgroup_clear_directory() to not use cgroup_rm_file()
- instead of going thru the list of files, get the file list using the
subsystems
- use 'subsys_mask' instead of {added,removed}_bits and made
cgroup_populate_dir() to match the parameters with cgroup_clear_directory()
v3:
- refresh patches after recent refactoring
Original-patch-by: Li Zefan <lizefan@huawei.com>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: Lennart Poettering <lpoetter@redhat.com>
Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Aristeu Rozanski <aris@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
This is an initial merge in of Eric Biederman's work to start adding
user namespace support to the networking.
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull timer fixes from Thomas Gleixner:
"Mostly small fixes for the fallout of the timekeeping overhaul in 3.6
along with stable fixes to address an accumulation problem and missing
sanity checks for RTC readouts and user space provided values."
* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
time: Avoid making adjustments if we haven't accumulated anything
time: Avoid potential shift overflow with large shift values
time: Fix casting issue in timekeeping_forward_now
time: Ensure we normalize the timekeeper in tk_xtime_add
time: Improve sanity checking of timekeeping inputs
The function graph has a test to check if the frame pointer is
corrupted, which can happen with various options of gcc with mcount.
But this is not an issue with -mfentry as -mfentry does not need nor use
frame pointers for function graph tracing.
Link: http://lkml.kernel.org/r/20120807194059.773895870@goodmis.org
Acked-by: H. Peter Anvin <hpa@linux.intel.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Cc: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Thanks to Andi Kleen, gcc 4.6.0 now supports -mfentry for x86
(and hopefully soon for other archs). What this does is to have
the function profiler start at the beginning of the function
instead of after the stack is set up. As plain -pg (mcount) is
called after the stack is set up, and in some cases can have issues
with the function graph tracer. It also requires frame pointers to
be enabled.
The -mfentry now calls __fentry__ at the beginning of the function.
This allows for compiling without frame pointers and even has the
ability to access parameters if needed.
If the architecture and the compiler both support -mfentry then
use that instead.
Link: http://lkml.kernel.org/r/20120807194059.392617243@goodmis.org
Acked-by: H. Peter Anvin <hpa@linux.intel.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Cc: Michal Marek <mmarek@suse.cz>
Cc: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
We still patch SMP instructions to UP variants if we boot with a
single CPU, but not at any other time. In particular, not if we
unplug CPUs to return to a single cpu.
Paul McKenney points out:
mean offline overhead is 6251/48=130.2 milliseconds.
If I remove the alternatives_smp_switch() from the offline
path [...] the mean offline overhead is 550/42=13.1 milliseconds
Basically, we're never going to get those 120ms back, and the
code is pretty messy.
We get rid of:
1) The "smp-alt-once" boot option. It's actually "smp-alt-boot", the
documentation is wrong. It's now the default.
2) The skip_smp_alternatives flag used by suspend.
3) arch_disable_nonboot_cpus_begin() and arch_disable_nonboot_cpus_end()
which were only used to set this one flag.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: Paul McKenney <paul.mckenney@us.ibm.com>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/87vcgwwive.fsf@rustcorp.com.au
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Noticed when digging into a suspend issue in linux-next (next-20120821).
For more details see <http://marc.info/?t=134554708000002&r=1&w=2>.
Signed-off-by: Sedat Dilek <sedat.dilek@gmail.com>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
If update_wall_time() is called and the current offset isn't large
enough to accumulate, avoid re-calling timekeeping_adjust which may
change the clock freq and can cause 1ns inconsistencies with
CLOCK_REALTIME_COARSE/CLOCK_MONOTONIC_COARSE.
Signed-off-by: John Stultz <john.stultz@linaro.org>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/1345595449-34965-5-git-send-email-john.stultz@linaro.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Andreas Schwab noticed that the 1 << tk->shift could overflow if the
shift value was greater than 30, since 1 would be a 32bit long on
32bit architectures. This issue was introduced by 1e75fa8be (time:
Condense timekeeper.xtime into xtime_sec)
Use 1ULL instead to ensure we don't overflow on the shift.
Reported-by: Andreas Schwab <schwab@linux-m68k.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Link: http://lkml.kernel.org/r/1345595449-34965-4-git-send-email-john.stultz@linaro.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
arch_gettimeoffset returns a u32 value which when shifted by tk->shift
can overflow. This issue was introduced with 1e75fa8be (time: Condense
timekeeper.xtime into xtime_sec)
Cast it to u64 first.
Signed-off-by: Andreas Schwab <schwab@linux-m68k.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Link: http://lkml.kernel.org/r/1345595449-34965-3-git-send-email-john.stultz@linaro.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Andreas noticed problems with resume on specific hardware after commit
1e75fa8b (time: Condense timekeeper.xtime into xtime_sec) combined
with commit b44d50dca (time: Fix casting issue in tk_set_xtime and
tk_xtime_add)
After some digging I realized we aren't normalizing the timekeeper
after the add. Add the missing normalize call.
Reported-by: Andreas Schwab <schwab@linux-m68k.org>
Tested-by: Andreas Schwab <schwab@linux-m68k.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Link: http://lkml.kernel.org/r/1345595449-34965-2-git-send-email-john.stultz@linaro.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
cancel_delayed_work() can't be called from IRQ handlers due to its use
of del_timer_sync() and can't cancel work items which are already
transferred from timer to worklist.
Also, unlike other flush and cancel functions, a canceled delayed_work
would still point to the last associated cpu_workqueue. If the
workqueue is destroyed afterwards and the work item is re-used on a
different workqueue, the queueing code can oops trying to dereference
already freed cpu_workqueue.
This patch reimplements cancel_delayed_work() using
try_to_grab_pending() and set_work_cpu_and_clear_pending(). This
allows the function to be called from IRQ handlers and makes its
behavior consistent with other flush / cancel functions.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Up to now, for delayed_works, try_to_grab_pending() couldn't be used
from IRQ handlers because IRQs may happen while
delayed_work_timer_fn() is in progress leading to indefinite -EAGAIN.
This patch makes delayed_work use the new TIMER_IRQSAFE flag for
delayed_work->timer. This makes try_to_grab_pending() and thus
mod_delayed_work_on() safe to call from IRQ handlers.
Signed-off-by: Tejun Heo <tj@kernel.org>
Pull audit-tree fixes from Miklos Szeredi:
"The audit subsystem maintainers (Al and Eric) are not responding to
repeated resends. Eric did ack them a while ago, but no response
since then. So I'm sending these directly to you."
* 'audit-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
audit: clean up refcounting in audit-tree
audit: fix refcounting in audit-tree
audit: don't free_chunk() after fsnotify_add_mark()
It seems commit 4a9d4b024a ("switch fput to task_work_add") re-
introduced the problem addressed in 944be0b224 ("close_files(): add
scheduling point")
If a server process with a lot of files (say 2 million tcp sockets) is
killed, we can spend a lot of time in task_work_run() and trigger a soft
lockup.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Timer internals are protected with irq-safe locks but timer execution
isn't, so a timer being dequeued for execution and its execution
aren't atomic against IRQs. This makes it impossible to wait for its
completion from IRQ handlers and difficult to shoot down a timer from
IRQ handlers.
This issue caused some issues for delayed_work interface. Because
there's no way to reliably shoot down delayed_work->timer from IRQ
handlers, __cancel_delayed_work() can't share the logic to steal the
target delayed_work with cancel_delayed_work_sync(), and can only
steal delayed_works which are on queued on timer. Similarly, the
pending mod_delayed_work() can't be used from IRQ handlers.
This patch adds a new timer flag TIMER_IRQSAFE, which makes the timer
to be executed without enabling IRQ after dequeueing such that its
dequeueing and execution are atomic against IRQ handlers.
This makes it safe to wait for the timer's completion from IRQ
handlers, for example, using del_timer_sync(). It can never be
executing on the local CPU and if executing on other CPUs it won't be
interrupted until done.
This will enable simplifying delayed_work cancel/mod interface.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: torvalds@linux-foundation.org
Cc: peterz@infradead.org
Link: http://lkml.kernel.org/r/1344449428-24962-5-git-send-email-tj@kernel.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Over time, timer initializers became messy with unnecessarily
duplicated code which are inconsistently spread across timer.h and
timer.c.
This patch cleans up timer initializers.
* timer.c::__init_timer() is renamed to do_init_timer().
* __TIMER_INITIALIZER() added. It takes @flags and all initializers
are wrappers around it.
* init_timer[_on_stack]_key() now take @flags.
* __init_timer[_on_stack]() added. They take @flags and all init
macros are wrappers around them.
* __setup_timer[_on_stack]() added. It uses __init_timer() and takes
@flags. All setup macros are wrappers around the two.
Note that this patch doesn't add missing init/setup combinations -
e.g. init_timer_deferrable_on_stack(). Adding missing ones is
trivial.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: torvalds@linux-foundation.org
Cc: peterz@infradead.org
Link: http://lkml.kernel.org/r/1344449428-24962-4-git-send-email-tj@kernel.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
To prepare for addition of another flag, generalize timer->base flags
handling.
* Rename from TBASE_*_FLAG to TIMER_* and make them LU constants.
* Define and use TIMER_FLAG_MASK for flags masking so that multiple
flags can be handled correctly.
* Don't dereference timer->base directly even if
!tbase_get_deferrable(). All two such places are already passed in
@base, so use it instead.
* Make sure tvec_base's alignment is large enough for timer->base
flags using BUILD_BUG_ON().
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: torvalds@linux-foundation.org
Cc: peterz@infradead.org
Link: http://lkml.kernel.org/r/1344449428-24962-2-git-send-email-tj@kernel.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Export dummy_irq_chip to modules to allow them to do things such as
irq_set_chip_and_handler(virq,
&dummy_irq_chip,
handle_level_irq);
This fixes
ERROR: "dummy_irq_chip" [drivers/gpio/gpio-pcf857x.ko] undefined!
when gpio-pcf857x.c is being built as a module.
Signed-off-by: Kuninori Morimoto <kuninori.morimoto.gx@renesas.com>
Cc: Linus Walleij <linus.walleij@linaro.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Greg KH <gregkh@linuxfoundation.org>
Link: http://lkml.kernel.org/r/871ujstrp6.wl%25kuninori.morimoto.gx@renesas.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Export irq_set_chip_and_handler_name() to modules to allow them to
do things such as
irq_set_chip_and_handler(....);
This fixes
ERROR: "irq_set_chip_and_handler_name" \
[drivers/gpio/gpio-pcf857x.ko] undefined!
when gpio-pcf857x.c is being built as a module.
Signed-off-by: Kuninori Morimoto <kuninori.morimoto.gx@renesas.com>
Cc: Linus Walleij <linus.walleij@linaro.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Greg KH <gregkh@linuxfoundation.org>
Link: http://lkml.kernel.org/r/873948trpk.wl%25kuninori.morimoto.gx@renesas.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
This patch fixes:
https://bugzilla.redhat.com/show_bug.cgi?id=843640
If mmap_region()->uprobe_mmap() fails, unmap_and_free_vma path
does unmap_region() but does not remove the soon-to-be-freed vma
from rb tree. Actually there are more problems but this is how
William noticed this bug.
Perhaps we could do do_munmap() + return in this case, but in
fact it is simply wrong to abort if uprobe_mmap() fails. Until
at least we move the !UPROBE_COPY_INSN code from
install_breakpoint() to uprobe_register().
For example, uprobe_mmap()->install_breakpoint() can fail if the
probed insn is not supported (remember, uprobe_register()
succeeds if nobody mmaps inode/offset), mmap() should not fail
in this case.
dup_mmap()->uprobe_mmap() is wrong too by the same reason,
fork() can race with uprobe_register() and fail for no reason if
it wins the race and does install_breakpoint() first.
And, if nothing else, both mmap_region() and dup_mmap() return
success if uprobe_mmap() fails. Change them to ignore the error
code from uprobe_mmap().
Reported-and-tested-by: William Cohen <wcohen@redhat.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: <stable@vger.kernel.org> # v3.5
Cc: Anton Arapov <anton@redhat.com>
Cc: William Cohen <wcohen@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20120819171042.GB26957@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
. Fix include order for bison/flex-generated C files, from Ben Hutchings
. Build fixes and documentation corrections from David Ahern
. Group parsing support, from Jiri Olsa
. UI/gtk refactorings and improvements from Namhyung Kim
. NULL deref fix for perf script, from Namhyung Kim
. Assorted cleanups from Robert Richter
. Let O= makes handle relative paths, from Steven Rostedt
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.14 (GNU/Linux)
iQIcBAABAgAGBQJQMkGhAAoJENZQFvNTUqpAqjsQAJE5iD1LFogC8o/WjvRHz0TY
Y0x+sR/XfW61KYpeq5g+UaKuFU3P44ijCoyks3y5sza97DkYgUwMpEHlLXFSM8Pp
sNOapqY57s24nq3MLrhH1V9w+cSE+m2u/Gi5fGLCQekio9gkOBwYxNGk7vpKri/n
LBRsMozBu/mZjMy20uWOb7Uk8xsAToh+TFaAtjyQ9Snn9nNJj49NUAp37uN888H/
ducMLq32HN5v/6Zd3q6IWdDWgZsHLkIa3R5FIs/GNe3Dih07gtYLmDol4ktPbTFm
yoaWpP5wbtu/62EZlJwE393vMuoeqN/96394ZZQGFafhHVxN4+rcBhXbejBs0T2b
wk/0CzntW8bbUAI/cl3SB9aui//FWOxcjG9aDQ7PsmHzPw1Q4VD0F9Mcod4p+dRX
PsA9q/tST1eAiwzWYthDtj81U7iChINcXKhoZn2xn6+0+aMH+6FFNBmCH8MR5aCU
BvrXhTJjvau/Ym/sILl4Tf4wfssTq49yMsn/YKCwLJ0hg0XlTObWfQRy2MOayXH9
NJvUE+9GSXoTEKhmr1AfTYEG9vObaXZyFwAI74xvPPwUYojCb4ZjEKmG0egW+VGk
IJKFCaJZwwVsGau4aIbFAMP12/L8Qs/Ox91ddCJ0j5TIlSGMaqW5lbV1N1crzlTT
a0GsN49NvhbFttBXrcNX
=0a2X
-----END PGP SIGNATURE-----
Merge tag 'perf-core-for-mingo' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux into perf/core
Pull perf/core improvements and fixes from Arnaldo Carvalho de Melo:
* Fix include order for bison/flex-generated C files, from Ben Hutchings
* Build fixes and documentation corrections from David Ahern
* Group parsing support, from Jiri Olsa
* UI/gtk refactorings and improvements from Namhyung Kim
* NULL deref fix for perf script, from Namhyung Kim
* Assorted cleanups from Robert Richter
* Let O= makes handle relative paths, from Steven Rostedt
* perf script python fixes, from Feng Tang.
* Improve 'perf lock' error message when the needed tracepoints
are not present, from David Ahern.
* Initial bash completion support, from Frederic Weisbecker
* Allow building without libelf, from Namhyung Kim.
* Support DWARF CFI based unwind to have callchains when %bp
based unwinding is not possible, from Jiri Olsa.
* Symbol resolution fixes, while fixing support PPC64 files with an .opt ELF
section was the end goal, several fixes for code that handles all
architectures and cleanups are included, from Cody Schafer.
* Add a description for the JIT interface, from Andi Kleen.
* Assorted fixes for Documentation and build in 32 bit, from Robert Richter
* Add support for non-tracepoint events in perf script python, from Feng Tang
* Cache the libtraceevent event_format associated to each evsel early, so that we
avoid relookups, i.e. calling pevent_find_event repeatedly when processing
tracepoint events.
[ This is to reduce the surface contact with libtraceevents and make clear what
is that the perf tools needs from that lib: so far parsing the common and per
event fields. ]
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull ftrace updates from Steve Rostedt:
" This patch series extends ftrace function tracing utility to be
more dynamic for its users. It allows for data passing to the callback
functions, as well as reading regs as if a breakpoint were to trigger
at function entry.
The main goal of this patch series was to allow kprobes to use ftrace
as an optimized probe point when a probe is placed on an ftrace nop.
With lots of help from Masami Hiramatsu, and going through lots of
iterations, we finally came up with a good solution. "
Signed-off-by: Ingo Molnar <mingo@kernel.org>
system_nrt[_freezable]_wq are now spurious. Mark them deprecated and
convert all users to system[_freezable]_wq.
If you're cc'd and wondering what's going on: Now all workqueues are
non-reentrant, so there's no reason to use system_nrt[_freezable]_wq.
Please use system[_freezable]_wq instead.
This patch doesn't make any functional difference.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-By: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: David Airlie <airlied@linux.ie>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: David Howells <dhowells@redhat.com>
Now that all workqueues are non-reentrant, system[_freezable]_wq() are
equivalent to system_nrt[_freezable]_wq(). Replace the latter with
wrappers around system[_freezable]_wq(). The wrapping goes through
inline functions so that __deprecated can be added easily.
Signed-off-by: Tejun Heo <tj@kernel.org>
Now that all workqueues are non-reentrant, flush[_delayed]_work_sync()
are equivalent to flush[_delayed]_work(). Drop the separate
implementation and make them thin wrappers around
flush[_delayed]_work().
* start_flush_work() no longer takes @wait_executing as the only left
user - flush_work() - always sets it to %true.
* __cancel_work_timer() uses flush_work() instead of wait_on_work().
Signed-off-by: Tejun Heo <tj@kernel.org>
By default, each per-cpu part of a bound workqueue operates separately
and a work item may be executing concurrently on different CPUs. The
behavior avoids some cross-cpu traffic but leads to subtle weirdities
and not-so-subtle contortions in the API.
* There's no sane usefulness in allowing a single work item to be
executed concurrently on multiple CPUs. People just get the
behavior unintentionally and get surprised after learning about it.
Most either explicitly synchronize or use non-reentrant/ordered
workqueue but this is error-prone.
* flush_work() can't wait for multiple instances of the same work item
on different CPUs. If a work item is executing on cpu0 and then
queued on cpu1, flush_work() can only wait for the one on cpu1.
Unfortunately, work items can easily cross CPU boundaries
unintentionally when the queueing thread gets migrated. This means
that if multiple queuers compete, flush_work() can't even guarantee
that the instance queued right before it is finished before
returning.
* flush_work_sync() was added to work around some of the deficiencies
of flush_work(). In addition to the usual flushing, it ensures that
all currently executing instances are finished before returning.
This operation is expensive as it has to walk all CPUs and at the
same time fails to address competing queuer case.
Incorrectly using flush_work() when flush_work_sync() is necessary
is an easy error to make and can lead to bugs which are difficult to
reproduce.
* Similar problems exist for flush_delayed_work[_sync]().
Other than the cross-cpu access concern, there's no benefit in
allowing parallel execution and it's plain silly to have this level of
contortion for workqueue which is widely used from core code to
extremely obscure drivers.
This patch makes all workqueues non-reentrant. If a work item is
executing on a different CPU when queueing is requested, it is always
queued to that CPU. This guarantees that any given work item can be
executing on one CPU at maximum and if a work item is queued and
executing, both are on the same CPU.
The only behavior change which may affect workqueue users negatively
is that non-reentrancy overrides the affinity specified by
queue_work_on(). On a reentrant workqueue, the affinity specified by
queue_work_on() is always followed. Now, if the work item is
executing on one of the CPUs, the work item will be queued there
regardless of the requested affinity. I've reviewed all workqueue
users which request explicit affinity, and, fortunately, none seems to
be crazy enough to exploit parallel execution of the same work item.
This adds an additional busy_hash lookup if the work item was
previously queued on a different CPU. This shouldn't be noticeable
under any sane workload. Work item queueing isn't a very
high-frequency operation and they don't jump across CPUs all the time.
In a micro benchmark to exaggerate this difference - measuring the
time it takes for two work items to repeatedly jump between two CPUs a
number (10M) of times with busy_hash table densely populated, the
difference was around 3%.
While the overhead is measureable, it is only visible in pathological
cases and the difference isn't huge. This change brings much needed
sanity to workqueue and makes its behavior consistent with timer. I
think this is the right tradeoff to make.
This enables significant simplification of workqueue API.
Simplification patches will follow.
Signed-off-by: Tejun Heo <tj@kernel.org>
Fixed some checkpatch warnings.
tj: adapted to wq/for-3.7 and massaged pr_xxx() format strings a bit.
Signed-off-by: Valentin Ilie <valentin.ilie@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
LKML-Reference: <1345326762-21747-1-git-send-email-valentin.ilie@gmail.com>
Pull scheduler fixes from Ingo Molnar.
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched: Fix migration thread runtime bogosity
sched,rt: fix isolated CPUs leaving root_task_group indefinitely throttled
sched,cgroup: Fix up task_groups list
sched: fix divide by zero at {thread_group,task}_times
sched, cgroup: Reduce rq->lock hold times for large cgroup hierarchies
The archs that implement virtual cputime accounting all
flush the cputime of a task when it gets descheduled
and sometimes set up some ground initialization for the
next task to account its cputime.
These archs all put their own hooks in their context
switch callbacks and handle the off-case themselves.
Consolidate this by creating a new account_switch_vtime()
callback called in generic code right after a context switch
and that these archs must implement to flush the prev task
cputime and initialize the next task cputime related state.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Extract cputime code from the giant sched/core.c and
put it in its own file. This make it easier to deal with
this particular area and de-bloat a bit more core.c
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Merge alpha architecture update from Michael Cree:
"The Alpha Maintainer, Matt Turner, is currently unavailable, so I have
collected up patches that have been posted to the linux-alpha mailing
list over the last couple of months, and are forwarding them to you in
the hope that you are prepared to accept them via me.
The patches by Al Viro and myself I have been running against kernels
for two months now so have had quite a bit of testing. All except one
patch were intended for the 3.5 kernel but because of Matt's
unavailability never got forwarded to you."
* emailed patches from Michael Cree <mcree@orcon.net.nz>: (9 commits)
alpha: Fix fall-out from disintegrating asm/system.h
Redefine ATOMIC_INIT and ATOMIC64_INIT to drop the casts
alpha: fix fpu.h usage in userspace
alpha/mm/fault.c: Port OOM changes to do_page_fault
alpha: take kernel_execve() out of entry.S
alpha: take a bunch of syscalls into osf_sys.c
alpha: Use new generic strncpy_from_user() and strnlen_user()
alpha: Wire up cross memory attach syscalls
alpha: Don't export SOCK_NONBLOCK to user space.
New helper: current_thread_info(). Allows to do a bunch of odd syscalls
in C. While we are at it, there had never been a reason to do
osf_getpriority() in assembler. We also get "namespace"-aware (read:
consistent with getuid(2), etc.) behaviour from getx?id() syscalls now.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Michael Cree <mcree@orcon.net.nz>
Acked-by: Matt Turner <mattst88@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
syscall_get_nr can return -1 in the case that the task is not executing
a system call.
This patch fixes perf_syscall_{enter,exit} to check that the syscall
number is valid before using it as an index into a bitmap.
Link: http://lkml.kernel.org/r/1345137254-7377-1-git-send-email-will.deacon@arm.com
Cc: Jason Baron <jbaron@redhat.com>
Cc: Wade Farnsworth <wade_farnsworth@mentor.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
To speed cpu down processing up, use system_highpri_wq.
As scheduling priority of workers on it is higher than system_wq and
it is not contended by other normal works on this cpu, work on it
is processed faster than system_wq.
tj: CPU up/downs care quite a bit about latency these days. This
shouldn't hurt anything and makes sense.
Signed-off-by: Joonsoo Kim <js1304@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
In rebind_workers(), we do inserting a work to rebind to cpu for busy workers.
Currently, in this case, we use only system_wq. This makes a possible
error situation as there is mismatch between cwq->pool and worker->pool.
To prevent this, we should use system_highpri_wq for highpri worker
to match theses. This implements it.
tj: Rephrased comment a bit.
Signed-off-by: Joonsoo Kim <js1304@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Commit 3270476a6c ('workqueue: reimplement
WQ_HIGHPRI using a separate worker_pool') introduce separate worker pool
for HIGHPRI. When we handle busyworkers for gcwq, it can be normal worker
or highpri worker. But, we don't consider this difference in rebind_workers(),
we use just system_wq for highpri worker. It makes mismatch between
cwq->pool and worker->pool.
It doesn't make error in current implementation, but possible in the future.
Now, we introduce system_highpri_wq to use proper cwq for highpri workers
in rebind_workers(). Following patch fix this issue properly.
tj: Even apart from rebinding, having system_highpri_wq generally
makes sense.
Signed-off-by: Joonsoo Kim <js1304@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
We assign cpu id into work struct's data field in __queue_delayed_work_on().
In current implementation, when work is come in first time,
current running cpu id is assigned.
If we do __queue_delayed_work_on() with CPU A on CPU B,
__queue_work() invoked in delayed_work_timer_fn() go into
the following sub-optimal path in case of WQ_NON_REENTRANT.
gcwq = get_gcwq(cpu);
if (wq->flags & WQ_NON_REENTRANT &&
(last_gcwq = get_work_gcwq(work)) && last_gcwq != gcwq) {
Change lcpu to @cpu and rechange lcpu to local cpu if lcpu is WORK_CPU_UNBOUND.
It is sufficient to prevent to go into sub-optimal path.
tj: Slightly rephrased the comment.
Signed-off-by: Joonsoo Kim <js1304@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
When we do tracing workqueue_queue_work(), it records requested cpu.
But, if !(@wq->flag & WQ_UNBOUND) and @cpu is WORK_CPU_UNBOUND,
requested cpu is changed as local cpu.
In case of @wq->flag & WQ_UNBOUND, above change is not occured,
therefore it is reasonable to correct it.
Use temporary local variable for storing requested cpu.
Signed-off-by: Joonsoo Kim <js1304@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Commit 3270476a6c ('workqueue: reimplement
WQ_HIGHPRI using a separate worker_pool') introduce separate worker_pool
for HIGHPRI. Although there is NR_WORKER_POOLS enum value which represent
size of pools, definition of worker_pool in gcwq doesn't use it.
Using it makes code robust and prevent future mistakes.
So change code to use this enum value.
Signed-off-by: Joonsoo Kim <js1304@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Unexpected behavior could occur if the time is set to a value large
enough to overflow a 64bit ktime_t (which is something larger then the
year 2262).
Also unexpected behavior could occur if large negative offsets are
injected via adjtimex.
So this patch improves the sanity check timekeeping inputs by
improving the timespec_valid() check, and then makes better use of
timespec_valid() to make sure we don't set the time to an invalid
negative value or one that overflows ktime_t.
Note: This does not protect from setting the time close to overflowing
ktime_t and then letting natural accumulation cause the overflow.
Reported-by: CAI Qian <caiqian@redhat.com>
Reported-by: Sasha Levin <levinsasha928@gmail.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Zhouping Liu <zliu@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/1344454580-17031-1-git-send-email-john.stultz@linaro.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Drop the initial reference by fsnotify_init_mark early instead of
audit_tree_freeing_mark() at destroy time.
In the cases we destroy the mark before we drop the initial reference we need to
get rid of the get_mark that balances the put_mark in audit_tree_freeing_mark().
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Refcounting of fsnotify_mark in audit tree is broken. E.g:
refcount
create_chunk
alloc_chunk 1
fsnotify_add_mark 2
untag_chunk
fsnotify_get_mark 3
fsnotify_destroy_mark
audit_tree_freeing_mark 2
fsnotify_put_mark 1
fsnotify_put_mark 0
via destroy_list
fsnotify_mark_destroy -1
This was reported by various people as triggering Oops when stopping auditd.
We could just remove the put_mark from audit_tree_freeing_mark() but that would
break freeing via inode destruction. So this patch simply omits a put_mark
after calling destroy_mark or adds a get_mark before.
The additional get_mark is necessary where there's no other put_mark after
fsnotify_destroy_mark() since it assumes that the caller is holding a reference
(or the inode is keeping the mark pinned, not the case here AFAICS).
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Reported-by: Valentin Avram <aval13@gmail.com>
Reported-by: Peter Moody <pmoody@google.com>
Acked-by: Eric Paris <eparis@redhat.com>
CC: stable@vger.kernel.org
Don't do free_chunk() after fsnotify_add_mark(). That one does a delayed unref
via the destroy list and this results in use-after-free.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Acked-by: Eric Paris <eparis@redhat.com>
CC: stable@vger.kernel.org
There is a least one modular user so export free_pid_ns so modules can
capture and use the pid namespace on the very rare occasion when it
makes sense.
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Correct a long standing omission and use struct pid in the owner
field of struct ip6_flowlabel when the share type is IPV6_FL_S_PROCESS.
This guarantees we don't have issues when pid wraparound occurs.
Use a kuid_t in the owner field of struct ip6_flowlabel when the
share type is IPV6_FL_S_USER to add user namespace support.
In /proc/net/ip6_flowlabel capture the current pid namespace when
opening the file and release the pid namespace when the file is
closed ensuring we print the pid owner value that is meaning to
the reader of the file. Similarly use from_kuid_munged to print
uid values that are meaningful to the reader of the file.
This requires exporting pid_nr_ns so that ipv6 can continue to built
as a module. Yoiks what silliness
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Any operation which clears PENDING should be preceded by a wmb to
guarantee that the next PENDING owner sees all the changes made before
PENDING release.
There are only two places where PENDING is cleared -
set_work_cpu_and_clear_pending() and clear_work_data(). The caller of
the former already does smp_wmb() but the latter doesn't have any.
Move the wmb above set_work_cpu_and_clear_pending() into it and add
one to clear_work_data().
There hasn't been any report related to this issue, and, given how
clear_work_data() is used, it is extremely unlikely to have caused any
actual problems on any architecture.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
delayed_work encodes the workqueue to use and the last CPU in
delayed_work->work.data while it's on timer. The target CPU is
implicitly recorded as the CPU the timer is queued on and
delayed_work_timer_fn() queues delayed_work->work to the CPU it is
running on.
Unfortunately, this leaves flush_delayed_work[_sync]() no way to find
out which CPU the delayed_work was queued for when they try to
re-queue after killing the timer. Currently, it chooses the local CPU
flush is running on. This can unexpectedly move a delayed_work queued
on a specific CPU to another CPU and lead to subtle errors.
There isn't much point in trying to save several bytes in struct
delayed_work, which is already close to a hundred bytes on 64bit with
all debug options turned off. This patch adds delayed_work->cpu to
remember the CPU it's queued for.
Note that if the timer is migrated during CPU down, the work item
could be queued to the downed global_cwq after this change. As a
detached global_cwq behaves like an unbound one, this doesn't change
much for the delayed_work.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Since power saving code was removed from sched now, the implement
code is out of service in this function, and even pollute other logical.
like, 'want_sd' never has chance to be set '0', that remove the effect
of SD_WAKE_AFFINE here.
So, clean up the obsolete code, includes SD_PREFER_LOCAL.
Signed-off-by: Alex Shi <alex.shi@intel.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/5028F431.6000306@intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
As we already have dst_rq in lb_env, using or changing "this_rq" do not
make sense.
This patch will replace "this_rq" with dst_rq in load_balance, and we
don't need to change "this_rq" while process LBF_SOME_PINNED any more.
Signed-off-by: Michael Wang <wangyun@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/501F8357.3070102@linux.vnet.ibm.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
This patch adds a comment on top of the schedule() function to explain
to scheduler newbies how the main scheduler function is entered.
Acked-by: Randy Dunlap <rdunlap@xenotime.net>
Explained-by: Ingo Molnar <mingo@kernel.org>
Explained-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1344070187-2420-1-git-send-email-penberg@kernel.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
It should be sched_nr_latency so fix it before it annoys me more.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1344435364-18632-1-git-send-email-bp@amd64.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Make stop scheduler class do the same accounting as other classes,
Migration threads can be caught in the act while doing exec balancing,
leading to the below due to use of unmaintained ->se.exec_start. The
load that triggered this particular instance was an apparently out of
control heavily threaded application that does system monitoring in
what equated to an exec bomb, with one of the VERY frequently migrated
tasks being ps.
%CPU PID USER CMD
99.3 45 root [migration/10]
97.7 53 root [migration/12]
97.0 57 root [migration/13]
90.1 49 root [migration/11]
89.6 65 root [migration/15]
88.7 17 root [migration/3]
80.4 37 root [migration/8]
78.1 41 root [migration/9]
44.2 13 root [migration/2]
Signed-off-by: Mike Galbraith <mgalbraith@suse.de>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1344051854.6739.19.camel@marge.simpson.net
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Root task group bandwidth replenishment must service all CPUs, regardless of
where the timer was last started, and regardless of the isolation mechanism,
lest 'Quoth the Raven, "Nevermore"' become rt scheduling policy.
Signed-off-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1344326558.6968.25.camel@marge.simpson.net
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
With multiple instances of task_groups, for_each_rt_rq() is a noop,
no task groups having been added to the rt.c list instance. This
renders __enable/disable_runtime() and print_rt_stats() noop, the
user (non) visible effect being that rt task groups are missing in
/proc/sched_debug.
Signed-off-by: Mike Galbraith <efault@gmx.de>
Cc: stable@kernel.org # v3.3+
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1344308413.6846.7.camel@marge.simpson.net
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
On architectures where cputime_t is 64 bit type, is possible to trigger
divide by zero on do_div(temp, (__force u32) total) line, if total is a
non zero number but has lower 32 bit's zeroed. Removing casting is not
a good solution since some do_div() implementations do cast to u32
internally.
This problem can be triggered in practice on very long lived processes:
PID: 2331 TASK: ffff880472814b00 CPU: 2 COMMAND: "oraagent.bin"
#0 [ffff880472a51b70] machine_kexec at ffffffff8103214b
#1 [ffff880472a51bd0] crash_kexec at ffffffff810b91c2
#2 [ffff880472a51ca0] oops_end at ffffffff814f0b00
#3 [ffff880472a51cd0] die at ffffffff8100f26b
#4 [ffff880472a51d00] do_trap at ffffffff814f03f4
#5 [ffff880472a51d60] do_divide_error at ffffffff8100cfff
#6 [ffff880472a51e00] divide_error at ffffffff8100be7b
[exception RIP: thread_group_times+0x56]
RIP: ffffffff81056a16 RSP: ffff880472a51eb8 RFLAGS: 00010046
RAX: bc3572c9fe12d194 RBX: ffff880874150800 RCX: 0000000110266fad
RDX: 0000000000000000 RSI: ffff880472a51eb8 RDI: 001038ae7d9633dc
RBP: ffff880472a51ef8 R8: 00000000b10a3a64 R9: ffff880874150800
R10: 00007fcba27ab680 R11: 0000000000000202 R12: ffff880472a51f08
R13: ffff880472a51f10 R14: 0000000000000000 R15: 0000000000000007
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
#7 [ffff880472a51f00] do_sys_times at ffffffff8108845d
#8 [ffff880472a51f40] sys_times at ffffffff81088524
#9 [ffff880472a51f80] system_call_fastpath at ffffffff8100b0f2
RIP: 0000003808caac3a RSP: 00007fcba27ab6d8 RFLAGS: 00000202
RAX: 0000000000000064 RBX: ffffffff8100b0f2 RCX: 0000000000000000
RDX: 00007fcba27ab6e0 RSI: 000000000076d58e RDI: 00007fcba27ab6e0
RBP: 00007fcba27ab700 R8: 0000000000000020 R9: 000000000000091b
R10: 00007fcba27ab680 R11: 0000000000000202 R12: 00007fff9ca41940
R13: 0000000000000000 R14: 00007fcba27ac9c0 R15: 00007fff9ca41940
ORIG_RAX: 0000000000000064 CS: 0033 SS: 002b
Cc: stable@vger.kernel.org
Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20120808092714.GA3580@redhat.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Peter Portante reported that for large cgroup hierarchies (and or on
large CPU counts) we get immense lock contention on rq->lock and stuff
stops working properly.
His workload was a ton of processes, each in their own cgroup,
everybody idling except for a sporadic wakeup once every so often.
It was found that:
schedule()
idle_balance()
load_balance()
local_irq_save()
double_rq_lock()
update_h_load()
walk_tg_tree(tg_load_down)
tg_load_down()
Results in an entire cgroup hierarchy walk under rq->lock for every
new-idle balance and since new-idle balance isn't throttled this
results in a lot of work while holding the rq->lock.
This patch does two things, it removes the work from under rq->lock
based on the good principle of race and pray which is widely employed
in the load-balancer as a whole. And secondly it throttles the
update_h_load() calculation to max once per jiffy.
I considered excluding update_h_load() for new-idle balance
all-together, but purely relying on regular balance passes to update
this data might not work out under some rare circumstances where the
new-idle busiest isn't the regular busiest for a while (unlikely, but
a nightmare to debug if someone hits it and suffers).
Cc: pjt@google.com
Cc: Larry Woodman <lwoodman@redhat.com>
Cc: Mike Galbraith <efault@gmx.de>
Reported-by: Peter Portante <pportant@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/n/tip-aaarrzfpnaam7pqrekofu8a6@git.kernel.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Bring RCU into the new-age CPU-hotplug fold by modifying RCU's per-CPU
kthread code to use the new smp_hotplug_thread facility.
[ tglx: Adapted it to use callbacks and to the simplified rcu yield ]
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Namhyung Kim <namhyung@kernel.org>
Link: http://lkml.kernel.org/r/20120716103948.673354828@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Link: http://lkml.kernel.org/r/20120716103948.563736676@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
[ paulmck: Call rcu_note_context_switch() with interrupts enabled. ]
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Link: http://lkml.kernel.org/r/20120716103948.456416747@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Because kernel subsystems need their per-CPU kthreads on UP systems as
well as on SMP systems, the smpboot hotplug kthread functions must be
provided in UP builds as well as in SMP builds. This commit therefore
adds smpboot.c to UP builds and excludes irrelevant code via #ifdef.
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Provide a generic interface for setting up and tearing down percpu
threads.
On registration the threads for already online cpus are created and
started. On deregistration (modules) the threads are stoppped.
During hotplug operations the threads are created, started, parked and
unparked. The datastructure for registration provides a pointer to
percpu storage space and optional setup, cleanup, park, unpark
functions. These functions are called when the thread state changes.
Each implementation has to provide a function which is queried and
returns whether the thread should run and the thread function itself.
The core code handles all state transitions and avoids duplicated code
in the call sites.
[ paulmck: Preemption leak fix ]
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Reviewed-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Link: http://lkml.kernel.org/r/20120716103948.352501068@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
To avoid the full teardown/setup of per cpu kthreads in the case of
cpu hot(un)plug, provide a facility which allows to put the kthread
into a park position and unpark it when the cpu comes online again.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Reviewed-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/r/20120716103948.236618824@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The rcu_yield() code is amazing. It's there to avoid starvation of the
system when lots of (boosting) work is to be done.
Now looking at the code it's functionality is:
Make the thread SCHED_OTHER and very nice, i.e. get it out of the way
Arm a timer with 2 ticks
schedule()
Now if the system goes idle the rcu task returns, regains SCHED_FIFO
and plugs on. If the systems stays busy the timer fires and wakes a
per node kthread which in turn makes the per cpu thread SCHED_FIFO and
brings it back on the cpu. For the boosting thread the "make it FIFO"
bit is missing and it just runs some magic boost checks. Now this is a
lot of code with extra threads and complexity.
It's way simpler to let the tasks when they detect overload schedule
away for 2 ticks and defer the normal wakeup as long as they are in
yielded state and the cpu is not idle.
That solves the same problem and the only difference is that when the
cpu goes idle it's not guaranteed that the thread returns right away,
but it won't be longer out than two ticks, so no harm is done. If
that's an issue than it is way simpler just to wake the task from
idle as RCU has callbacks there anyway.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Namhyung Kim <namhyung@kernel.org>
Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/r/20120716103948.131256723@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
* Fix for two recent regressions in the generic PM domains framework.
* Revert of a commit that introduced a resume regression and is conceptually
incorrect in my opinion.
* Fix for a return value in pcc-cpufreq.c from Julia Lawall.
* RTC wakeup signaling fix from Neil Brown.
* Suppression of compiler warnings for CONFIG_PM_SLEEP unset in ACPI,
platform/x86 and TPM drivers.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.18 (GNU/Linux)
iQIcBAABAgAGBQJQJWxZAAoJEKhOf7ml8uNs1EcP/ApgCk1SfMo779Lcq8OQVVqq
2jbtoqnsuPMs/rl4VrW1adJspEkWb39KgE5XIlfg6tIKm5nuIauFtJEGskMq00w7
8PT7bQOSJdLKIOjsBEUugUtp+HZO0iUuGahciQf4V11eOAZKODqxtomL8Ry2mY3P
gDohYBa3J+xnkvRqKUY0k0OkSNDDlI3+y+WPr+tamjDzT5uqjWLR9LJ1+1eGtmou
6DrgjD3eOus/r53OXKlNldXc9HbzVdnmoZwMNtswlNTaCL7HkdpRnPClSWt+NvVi
cOviJ6F4d6FRmYRFvatFEaXmSAfpB9v/dt1C9VYtoLyZsZWs1sRGd/bxgCofYWnE
GZckKl8pI80u14345P9R+QF3CculV2itfbKBiXxWunmOeokBYIz5sWdTh4mNg/vy
VZdeO9jJy2542aF8P9Up9EE3IjkrEz7gEL0Sv4hfmEoHI1jKJDdAn/9/lmfrujPh
e3vpBeqlBmSTU0rKj97x/G8zwWhPscqJDPkDUEEe+wfS3oPvhymYesV1bF7OCNwr
WMMcFoDuSRzZ1lvEY7w4IWAKRDCqjaJ1kkBZvzoOEIC4gi4i3pAehpYEZMNFtFrf
RB2z5Jx1Z1w0LOgcz69TTMY274kZ8N/v7/SVUBk5+tSs1VNHo/p+WYGqW/8ExSvH
D4H8kQvz8uBK23g7ekVR
=lo6A
-----END PGP SIGNATURE-----
Merge tag 'pm-for-3.6-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management fixes from Rafael J. Wysocki:
- Fix for two recent regressions in the generic PM domains framework.
- Revert of a commit that introduced a resume regression and is
conceptually incorrect in my opinion.
- Fix for a return value in pcc-cpufreq.c from Julia Lawall.
- RTC wakeup signaling fix from Neil Brown.
- Suppression of compiler warnings for CONFIG_PM_SLEEP unset in ACPI,
platform/x86 and TPM drivers.
* tag 'pm-for-3.6-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
tpm_tis / PM: Fix unused function warning for CONFIG_PM_SLEEP
platform / x86 / PM: Fix unused function warnings for CONFIG_PM_SLEEP
ACPI / PM: Fix unused function warnings for CONFIG_PM_SLEEP
Revert "NMI watchdog: fix for lockup detector breakage on resume"
PM: Make dev_pm_get_subsys_data() always return 0 on success
drivers/cpufreq/pcc-cpufreq.c: fix error return code
RTC: Avoid races between RTC alarm wakeup and suspend.
While tracking down a weird buffer overflow issue in a program that
looked to be sane, I started double checking the length returned by
syslog(SYSLOG_ACTION_READ_ALL, ...) to make sure it wasn't overflowing
the buffer.
Sure enough, it was. I saw this in strace:
11339 syslog(SYSLOG_ACTION_READ_ALL, "<5>[244017.708129] REISERFS (dev"..., 8192) = 8279
It turns out that the loops that calculate how much space the entries
will take when they're copied don't include the newlines and prefixes
that will be included in the final output since prev flags is passed as
zero.
This patch properly accounts for it and fixes the overflow.
CC: stable@kernel.org
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Introducing following bits to the the perf_event_attr struct:
- exclude_callchain_kernel to filter out kernel callchain
from the sample dump
- exclude_callchain_user to filter out user callchain
from the sample dump
We need to be able to disable standard user callchain dump when we use
the dwarf cfi callchain mode, because frame pointer based user
callchains are useless in this mode.
Implementing also exclude_callchain_kernel to have complete set of
options.
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
[ Added kernel callchains filtering ]
Cc: "Frank Ch. Eigler" <fche@redhat.com>
Cc: Arun Sharma <asharma@fb.com>
Cc: Benjamin Redelings <benjamin.redelings@nescent.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Frank Ch. Eigler <fche@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Robert Richter <robert.richter@amd.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: Tom Zanussi <tzanussi@gmail.com>
Cc: Ulrich Drepper <drepper@gmail.com>
Link: http://lkml.kernel.org/r/1344345647-11536-7-git-send-email-jolsa@redhat.com
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Introducing PERF_SAMPLE_STACK_USER sample type bit to trigger the dump
of the user level stack on sample. The size of the dump is specified by
sample_stack_user value.
Being able to dump parts of the user stack, starting from the stack
pointer, will be useful to make a post mortem dwarf CFI based stack
unwinding.
Added HAVE_PERF_USER_STACK_DUMP config option to determine if the
architecture provides user stack dump on perf event samples. This needs
access to the user stack pointer which is not unified across
architectures. Enabling this for x86 architecture.
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Original-patch-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: "Frank Ch. Eigler" <fche@redhat.com>
Cc: Arun Sharma <asharma@fb.com>
Cc: Benjamin Redelings <benjamin.redelings@nescent.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Frank Ch. Eigler <fche@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Robert Richter <robert.richter@amd.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: Tom Zanussi <tzanussi@gmail.com>
Cc: Ulrich Drepper <drepper@gmail.com>
Link: http://lkml.kernel.org/r/1344345647-11536-6-git-send-email-jolsa@redhat.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Introducing perf_output_skip function to be able to skip data within the
perf ring buffer.
When writing data into perf ring buffer we first reserve needed place in
ring buffer and then copy the actual data.
There's a possibility we won't be able to fill all the reserved size
with data, so we need a way to skip the remaining bytes.
This is going to be useful when storing the user stack dump, where we
might end up with less data than we originally requested.
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: "Frank Ch. Eigler" <fche@redhat.com>
Cc: Arun Sharma <asharma@fb.com>
Cc: Benjamin Redelings <benjamin.redelings@nescent.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Frank Ch. Eigler <fche@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Robert Richter <robert.richter@amd.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: Tom Zanussi <tzanussi@gmail.com>
Cc: Ulrich Drepper <drepper@gmail.com>
Link: http://lkml.kernel.org/r/1344345647-11536-5-git-send-email-jolsa@redhat.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Adding a generic way to use __output_copy function with specific copy
function via DEFINE_PERF_OUTPUT_COPY macro.
Using this to add new __output_copy_user function, that provides output
copy from user pointers. For x86 the copy_from_user_nmi function is used
and __copy_from_user_inatomic for the rest of the architectures.
This new function will be used in user stack dump on sample, coming in
next patches.
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Cc: "Frank Ch. Eigler" <fche@redhat.com>
Cc: Arun Sharma <asharma@fb.com>
Cc: Benjamin Redelings <benjamin.redelings@nescent.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Frank Ch. Eigler <fche@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Robert Richter <robert.richter@amd.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: Tom Zanussi <tzanussi@gmail.com>
Cc: Ulrich Drepper <drepper@gmail.com>
Link: http://lkml.kernel.org/r/1344345647-11536-4-git-send-email-jolsa@redhat.com
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Introducing PERF_SAMPLE_REGS_USER sample type bit to trigger the dump of
user level registers on sample. Registers we want to dump are specified
by sample_regs_user bitmask.
Only user level registers are dumped at the moment. Meaning the register
values of the user space context as it was before the user entered the
kernel for whatever reason (syscall, irq, exception, or a PMI happening
in userspace).
The layout of the sample_regs_user bitmap is described in
asm/perf_regs.h for archs that support register dump.
This is going to be useful to bring Dwarf CFI based stack unwinding on
top of samples.
Original-patch-by: Frederic Weisbecker <fweisbec@gmail.com>
[ Dump registers ABI specification. ]
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Suggested-by: Stephane Eranian <eranian@google.com>
Cc: "Frank Ch. Eigler" <fche@redhat.com>
Cc: Arun Sharma <asharma@fb.com>
Cc: Benjamin Redelings <benjamin.redelings@nescent.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Frank Ch. Eigler <fche@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Robert Richter <robert.richter@amd.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: Tom Zanussi <tzanussi@gmail.com>
Cc: Ulrich Drepper <drepper@gmail.com>
Link: http://lkml.kernel.org/r/1344345647-11536-3-git-send-email-jolsa@redhat.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Revert commit 45226e9 (NMI watchdog: fix for lockup detector breakage
on resume) which breaks resume from system suspend on my SH7372
Mackerel board (by causing a NULL pointer dereference to happen) and
is generally wrong, because it abuses the CPU hotplug functionality
in a shamelessly blatant way.
The original issue should be addressed through appropriate syscore
resume callback instead.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Add missing initialization for ret variable. Its initialization
is based on the re_cnt variable, which is being set deep down
in the ftrace_function_filter_re function.
I'm not sure compilers would be smart enough to see this in near
future, so killing the warning this way.
Link: http://lkml.kernel.org/r/1340120894-9465-2-git-send-email-jolsa@redhat.com
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The warkeup_rt self test used msleep() calls to wait for real time
tasks to wake up and run. On bare-metal hardware, this was enough as
the scheduler should let the RT task run way before the non-RT task
wakes up from the msleep(). If it did not, then that would mean the
scheduler was broken.
But when dealing with virtual machines, this is a different story.
If the RT task wakes up on a VCPU, it's up to the host to decide when
that task gets to schedule, which can be far behind the time that the
non-RT task wakes up. In this case, the test would fail incorrectly.
As we are not testing the scheduler, but instead the wake up tracing,
we can use completions to wait and not depend on scheduler timings
to see if events happen on time.
Link: http://lkml.kernel.org/r/1343663105.3847.7.camel@fedora
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Tested-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Tetsuo Handa reported that sporadically the system clock starts
counting up too quickly which is enough to confuse the hangcheck
timer to print a bogus stall warning.
Commit 2a8c0883 "time: Move xtime_nsec adjustment underflow handling
timekeeping_adjust" overlooked this exit path:
} else
return;
which should really be a proper exit sequence, fixing the bug as a
side effect.
Also make the flow more readable by properly balancing curly
braces.
Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> wrote:
Tested-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> wrote:
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: john.stultz@linaro.org
Cc: a.p.zijlstra@chello.nl
Cc: richardcochran@gmail.com
Cc: prarit@redhat.com
Link: http://lkml.kernel.org/r/20120804192114.GA28347@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull futex fixes from Ingo Molnar:
"A couple of futex fixes from Darren Hart: two bugs reported by Dave
Jones (found with his trinity test) and Dan Carpenter through static
analysis. The third found while debugging the first two."
* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
futex: Forbid uaddr == uaddr2 in futex_wait_requeue_pi()
futex: Fix bug in WARN_ON for NULL q.pi_state
futex: Test for pi_mutex on fault in futex_wait_requeue_pi()
Pull timer fixes from Ingo Molnar:
"One regression fix, and a couple of cleanups that clean up the code
flow in areas that had high-profile bugs recently."
* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
time: Remove all direct references to timekeeper
time: Clean up offs_real/wall_to_mono and offs_boot/total_sleep_time updates
time: Clean up stray newlines
time/jiffies: Rename ACTHZ to SHIFTED_HZ
time/jiffies: Allow CLOCK_TICK_RATE to be undefined
time: Fix casting issue in tk_set_xtime and tk_xtime_add
Pull scheduler fixes from Ingo Molnar:
"Fixes and two late cleanups"
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/cleanups: Add load balance cpumask pointer to 'struct lb_env'
sched: Fix comment about PREEMPT_ACTIVE bit location
sched: Fix minor code style issues
sched: Use task_rq_unlock() in __sched_setscheduler()
sched/numa: Add SD_PERFER_SIBLING to CPU domain
Pull perf fixes from Ingo Molnar:
"Fix merge window fallout and fix sleep profiling (this was always
broken, so it's not a fix for the merge window - we can skip this one
from the head of the tree)."
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/trace: Add ability to set a target task for events
perf/x86: Fix USER/KERNEL tagging of samples properly
perf/x86/intel/uncore: Make UNCORE_PMU_HRTIMER_INTERVAL 64-bit
Pull irq fix from Ingo Molnar.
* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
genirq: Allow irq chips to mark themself oneshot safe
usb-dbgp - increase the controller wait time to come out of halt.
kdb - Remove unused KDB_FLAG_ONLY_DO_DUMP code and cpu in more prompt
debug core - pass NMI type on archs that provide NMI types
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
iQIcBAABAgAGBQJQG8wyAAoJEIciOldedpOjN2oP/ipaQSLnvoKUhutFl/qL2239
mMsxh9ga9rfKuCujpSkHZUwjo3VX7put7cnhVwETd4y2gN2YWPYg4OIKt+Y0AhNe
4NzHwB+lm6iGE33Q1x4uEHBH5aWLzWcOM/9n4avwY2DjtDfpecki5ChP/CVHK8qU
VVF2PfY8nbxcEonCbP1b/0KaD3xrPqwgZ70HdFi5eUuXiBajAyp9c9zqVUWJ6j+H
r+2PVkzn9NxRCkyq3tzK5gYk5SzoJPClkpB67CWugG35MiFLkz2csJNztFxtaInZ
t8HLkMTVLdCgLZqnw/ZEVWfqQA+q6N5NS6zs9j0siLg5HbEb+UtCebPwpChdQrmh
Sol+0vmT9Hi4Jm6onhDnQYchaDI7gMhynUC9sWAPhtSHS9e7D9c5IBLHQd3YbOHK
c8ELzxduszw8+jaiDJStkWM+tbQzJXD9bT1KpLVJd8t9BKAmuBX7ETSD+eKtjynJ
SgywSfVOdxXzEMvRWqeK3qgkLJYCWCfFsc+75hzJl18dRoM3NDyuOxKyOLWF9tFV
QUjaCvndIFz7CgM7FTToJZbACqxFHRGh4UUXHiXPUBPXvE5Zt34n4VsxAXYpJc95
1by/TBcYKWPd3hsOJOh1qMeKXD6TEwN/eEmsjBfnHTp8EBcgRtvue8ZHPMcfYnn/
Iauy66b/nHJLRsi1gWrn
=LAWq
-----END PGP SIGNATURE-----
Merge tag 'for_linux-3.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jwessel/kgdb
Pull KGDB/KDB/usb-dbgp fixes and cleanups from Jason Wessel:
"There are no new features, those will be delayed to the 3.7 window.
There are only fixes/cleanup against the usual kernel churn and we are
removing more lines than we add:
- usb-dbgp - increase the controller wait time to come out of halt.
- kdb - Remove unused KDB_FLAG_ONLY_DO_DUMP code and cpu in more prompt
- debug core - pass NMI type on archs that provide NMI types"
* tag 'for_linux-3.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jwessel/kgdb:
USB: echi-dbgp: increase the controller wait time to come out of halt.
kernel/debug: Make use of KGDB_REASON_NMI
kdb: Remove cpu from the more prompt
kdb: Remove unused KDB_FLAG_ONLY_DO_DUMP
Workqueue was lacking a mechanism to modify the timeout of an already
pending delayed_work. delayed_work users have been working around
this using several methods - using an explicit timer + work item,
messing directly with delayed_work->timer, and canceling before
re-queueing, all of which are error-prone and/or ugly.
This patch implements mod_delayed_work[_on]() which behaves similarly
to mod_timer() - if the delayed_work is idle, it's queued with the
given delay; otherwise, its timeout is modified to the new value.
Zero @delay guarantees immediate execution.
v2: Updated to reflect try_to_grab_pending() changes. Now safe to be
called from bh context.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
There can be two reasons try_to_grab_pending() can fail with -EAGAIN.
One is when someone else is queueing or deqeueing the work item. With
the previous patches, it is guaranteed that PENDING and queued state
will soon agree making it safe to busy-retry in this case.
The other is if multiple __cancel_work_timer() invocations are racing
one another. __cancel_work_timer() grabs PENDING and then waits for
running instances of the target work item on all CPUs while holding
PENDING and !queued. try_to_grab_pending() invoked from another task
will keep returning -EAGAIN while the current owner is waiting.
Not distinguishing the two cases is okay because __cancel_work_timer()
is the only user of try_to_grab_pending() and it invokes
wait_on_work() whenever grabbing fails. For the first case, busy
looping should be fine but wait_on_work() doesn't cause any critical
problem. For the latter case, the new contender usually waits for the
same condition as the current owner, so no unnecessarily extended
busy-looping happens. Combined, these make __cancel_work_timer()
technically correct even without irq protection while grabbing PENDING
or distinguishing the two different cases.
While the current code is technically correct, not distinguishing the
two cases makes it difficult to use try_to_grab_pending() for other
purposes than canceling because it's impossible to tell whether it's
safe to busy-retry grabbing.
This patch adds a mechanism to mark a work item being canceled.
try_to_grab_pending() now disables irq on success and returns -EAGAIN
to indicate that grabbing failed but PENDING and queued states are
gonna agree soon and it's safe to busy-loop. It returns -ENOENT if
the work item is being canceled and it may stay PENDING && !queued for
arbitrary amount of time.
__cancel_work_timer() is modified to mark the work canceling with
WORK_OFFQ_CANCELING after grabbing PENDING, thus making
try_to_grab_pending() fail with -ENOENT instead of -EAGAIN. Also, it
invokes wait_on_work() iff grabbing failed with -ENOENT. This isn't
necessary for correctness but makes it consistent with other future
users of try_to_grab_pending().
v2: try_to_grab_pending() was testing preempt_count() to ensure that
the caller has disabled preemption. This triggers spuriously if
!CONFIG_PREEMPT_COUNT. Use preemptible() instead. Reported by
Fengguang Wu.
v3: Updated so that try_to_grab_pending() disables irq on success
rather than requiring preemption disabled by the caller. This
makes busy-looping easier and will allow try_to_grap_pending() to
be used from bh/irq contexts.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Fengguang Wu <fengguang.wu@intel.com>
* Use bool @is_dwork instead of @timer and let try_to_grab_pending()
use to_delayed_work() to determine the delayed_work address.
* Move timer handling from __cancel_work_timer() to
try_to_grab_pending().
* Make try_to_grab_pending() use -EAGAIN instead of -1 for
busy-looping and drop the ret local variable.
* Add proper function comment to try_to_grab_pending().
This makes the code a bit easier to understand and will ease further
changes. This patch doesn't make any functional change.
v2: Use @is_dwork instead of @timer.
Signed-off-by: Tejun Heo <tj@kernel.org>
Low WORK_STRUCT_FLAG_BITS bits of work_struct->data contain
WORK_STRUCT_FLAG_* and flush color. If the work item is queued, the
rest point to the cpu_workqueue with WORK_STRUCT_CWQ set; otherwise,
WORK_STRUCT_CWQ is clear and the bits contain the last CPU number -
either a real CPU number or one of WORK_CPU_*.
Scheduled addition of mod_delayed_work[_on]() requires an additional
flag, which is used only while a work item is off queue. There are
more than enough bits to represent off-queue CPU number on both 32 and
64bits. This patch introduces WORK_OFFQ_FLAG_* which occupy the lower
part of the @work->data high bits while off queue. This patch doesn't
define any actual OFFQ flag yet.
Off-queue CPU number is now shifted by WORK_OFFQ_CPU_SHIFT, which adds
the number of bits used by OFFQ flags to WORK_STRUCT_FLAG_SHIFT, to
make room for OFFQ flags.
To avoid shift width warning with large WORK_OFFQ_FLAG_BITS, ulong
cast is added to WORK_STRUCT_NO_CPU and, just in case, BUILD_BUG_ON()
to check that there are enough bits to accomodate off-queue CPU number
is added.
This patch doesn't make any functional difference.
Signed-off-by: Tejun Heo <tj@kernel.org>
try_to_grab_pending() will be used by to-be-implemented
mod_delayed_work[_on](). Move try_to_grab_pending() and related
functions above queueing functions.
This patch only moves functions around.
Signed-off-by: Tejun Heo <tj@kernel.org>
If @delay is zero and the dealyed_work is idle, queue_delayed_work()
queues it for immediate execution; however, queue_delayed_work_on()
lacks this logic and always goes through timer regardless of @delay.
This patch moves 0 @delay handling logic from queue_delayed_work() to
queue_delayed_work_on() so that both functions behave the same.
Signed-off-by: Tejun Heo <tj@kernel.org>
Queueing functions have been using different methods to determine the
local CPU.
* queue_work() superflously uses get/put_cpu() to acquire and hold the
local CPU across queue_work_on().
* delayed_work_timer_fn() uses smp_processor_id().
* queue_delayed_work() calls queue_delayed_work_on() with -1 @cpu
which is interpreted as the local CPU.
* flush_delayed_work[_sync]() were using raw_smp_processor_id().
* __queue_work() interprets %WORK_CPU_UNBOUND as local CPU if the
target workqueue is bound one but nobody uses this.
This patch converts all functions to uniformly use %WORK_CPU_UNBOUND
to indicate local CPU and use the local binding feature of
__queue_work(). unlikely() is dropped from %WORK_CPU_UNBOUND handling
in __queue_work().
Signed-off-by: Tejun Heo <tj@kernel.org>
delayed_work->timer.function is currently initialized during
queue_delayed_work_on(). Export delayed_work_timer_fn() and set
delayed_work timer function during delayed_work initialization
together with other fields.
This ensures the timer function is always valid on an initialized
delayed_work. This is to help mod_delayed_work() implementation.
To detect delayed_work users which diddle with the internal timer,
trigger WARN if timer function doesn't match on queue.
Signed-off-by: Tejun Heo <tj@kernel.org>
Queueing operations use WORK_STRUCT_PENDING_BIT to synchronize access
to the target work item. They first try to claim the bit and proceed
with queueing only after that succeeds and there's a window between
PENDING being set and the actual queueing where the task can be
interrupted or preempted.
There's also a similar window in process_one_work() when clearing
PENDING. A work item is dequeued, gcwq->lock is released and then
PENDING is cleared and the worker might get interrupted or preempted
between releasing gcwq->lock and clearing PENDING.
cancel[_delayed]_work_sync() tries to claim or steal PENDING. The
function assumes that a work item with PENDING is either queued or in
the process of being [de]queued. In the latter case, it busy-loops
until either the work item loses PENDING or is queued. If canceling
coincides with the above described interrupts or preemptions, the
canceling task will busy-loop while the queueing or executing task is
preempted.
This patch keeps irq disabled across claiming PENDING and actual
queueing and moves PENDING clearing in process_one_work() inside
gcwq->lock so that busy looping from PENDING && !queued doesn't wait
for interrupted/preempted tasks. Note that, in process_one_work(),
setting last CPU and clearing PENDING got merged into single
operation.
This removes possible long busy-loops and will allow using
try_to_grab_pending() from bh and irq contexts.
v2: __queue_work() was testing preempt_count() to ensure that the
caller has disabled preemption. This triggers spuriously if
!CONFIG_PREEMPT_COUNT. Use preemptible() instead. Reported by
Fengguang Wu.
v3: Disable irq instead of preemption. IRQ will be disabled while
grabbing gcwq->lock later anyway and this allows using
try_to_grab_pending() from bh and irq contexts.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
WORK_STRUCT_PENDING is used to claim ownership of a work item and
process_one_work() releases it before starting execution. When
someone else grabs PENDING, all pre-release updates to the work item
should be visible and all updates made by the new owner should happen
afterwards.
Grabbing PENDING uses test_and_set_bit() and thus has a full barrier;
however, clearing doesn't have a matching wmb. Given the preceding
spin_unlock and use of clear_bit, I don't believe this can be a
problem on an actual machine and there hasn't been any related report
but it still is theretically possible for clear_pending to permeate
upwards and happen before work->entry update.
Add an explicit smp_wmb() before work_clear_pending().
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: stable@vger.kernel.org
All queueing functions return 1 on success, 0 if the work item was
already pending. Update them to return bool instead. This signifies
better that they don't return 0 / -errno.
This is cleanup and doesn't cause any functional difference.
While at it, fix comment opening for schedule_work_on().
Signed-off-by: Tejun Heo <tj@kernel.org>
Currently, queue/schedule[_delayed]_work_on() are located below the
counterpart without the _on postifx even though the latter is usually
implemented using the former. Swap them.
This is cleanup and doesn't cause any functional difference.
Signed-off-by: Tejun Heo <tj@kernel.org>
__ptrace_may_access() is used within only kernel/ptrace.c.
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: James Morris <james.l.morris@oracle.com>
Pull second vfs pile from Al Viro:
"The stuff in there: fsfreeze deadlock fixes by Jan (essentially, the
deadlock reproduced by xfstests 068), symlink and hardlink restriction
patches, plus assorted cleanups and fixes.
Note that another fsfreeze deadlock (emergency thaw one) is *not*
dealt with - the series by Fernando conflicts a lot with Jan's, breaks
userland ABI (FIFREEZE semantics gets changed) and trades the deadlock
for massive vfsmount leak; this is going to be handled next cycle.
There probably will be another pull request, but that stuff won't be
in it."
Fix up trivial conflicts due to unrelated changes next to each other in
drivers/{staging/gdm72xx/usb_boot.c, usb/gadget/storage_common.c}
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (54 commits)
delousing target_core_file a bit
Documentation: Correct s_umount state for freeze_fs/unfreeze_fs
fs: Remove old freezing mechanism
ext2: Implement freezing
btrfs: Convert to new freezing mechanism
nilfs2: Convert to new freezing mechanism
ntfs: Convert to new freezing mechanism
fuse: Convert to new freezing mechanism
gfs2: Convert to new freezing mechanism
ocfs2: Convert to new freezing mechanism
xfs: Convert to new freezing code
ext4: Convert to new freezing mechanism
fs: Protect write paths by sb_start_write - sb_end_write
fs: Skip atime update on frozen filesystem
fs: Add freezing handling to mnt_want_write() / mnt_drop_write()
fs: Improve filesystem freezing handling
switch the protection of percpu_counter list to spinlock
nfsd: Push mnt_want_write() outside of i_mutex
btrfs: Push mnt_want_write() outside of i_mutex
fat: Push mnt_want_write() outside of i_mutex
...
Round of refactoring and enhancements to irq_domain infrastructure. This
series starts the process of simplifying irqdomain. The ultimate goal is
to merge LEGACY, LINEAR and TREE mappings into a single system, but had
to back off from that after some last minute bugs. Instead it mainly
reorganizes the code and ensures that the reverse map gets populated
when the irq is mapped instead of the first time it is looked up.
Merging of the irq_domain types is deferred to v3.7
In other news, this series adds helpers for creating static mappings on
a linear or tree mapping.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
iQIcBAABAgAGBQJQGKKkAAoJEEFnBt12D9kB59gQAJnTjrihej1tr0OEkffIthGK
RyVI/DMo0jMgLs4K/rIo3Y+PdTSsNYd8x4R7ln8O7rNRQn8W6jE6NQgoMh51EvNc
FAltmTsBldq6hUNuz2FEnbmojBP4QklTzL8bAiXtX5EufWQsgMsP4guOuHXLCjEV
CkWYVk/slXEWJ8yYJc6GKVRvL+CNeiXVCTcOsYA0CI3ofN7O0rd+YAL314CRllIc
e5uARbWM+s9FJ/eXwCZP4+3jCmdI/CHJb284WldMc/mBD8Rbiqpb4kH6AZI+TH2O
CyiNEPWs6FG5eJPTID7HrOarXGzwYq/pvv8iG7Mh8NiKSae1C1HdkHelCjbLQ+pU
POya0fWF1Gvzlmw0gHik86dqaKjwb29btjj7SFg8KnQExWn2ifhsY70mM9wCTo3s
cwcQlssDIsARE83nttTFCoV/iAWh9AvTxafrXu/+9OKTjpsYlC8kgzdVjq5aAxON
JaAUK1OduTWRsd1TabKlh6naRXr9nRcLKikwKri2oYVKkj97wahBuib4ffzAcNqz
VklRBxTH6M+dz/t5NpcVyLXJpqzTN++QNdTAmeQG6LOnHJL4tpFTsx5sMa7ghmzX
LNpmp/AkVfP0MT7Drf0FUUx6iFA7sjANYzcepUVDrPGKHx0E3LyqbG5JKcC5LgM6
+UIoKAktF3vY7pdZJL9z
=ZUF/
-----END PGP SIGNATURE-----
Merge tag 'irqdomain-for-linus' of git://git.secretlab.ca/git/linux-2.6
Pull irqdomain changes from Grant Likely:
"Round of refactoring and enhancements to irq_domain infrastructure.
This series starts the process of simplifying irqdomain. The ultimate
goal is to merge LEGACY, LINEAR and TREE mappings into a single
system, but had to back off from that after some last minute bugs.
Instead it mainly reorganizes the code and ensures that the reverse
map gets populated when the irq is mapped instead of the first time it
is looked up.
Merging of the irq_domain types is deferred to v3.7
In other news, this series adds helpers for creating static mappings
on a linear or tree mapping."
* tag 'irqdomain-for-linus' of git://git.secretlab.ca/git/linux-2.6:
irqdomain: Improve diagnostics when a domain mapping fails
irqdomain: eliminate slow-path revmap lookups
irqdomain: Fix irq_create_direct_mapping() to test irq_domain type.
irqdomain: Eliminate dedicated radix lookup functions
irqdomain: Support for static IRQ mapping and association.
irqdomain: Always update revmap when setting up a virq
irqdomain: Split disassociating code into separate function
irq_domain: correct a minor wrong comment for linear revmap
irq_domain: Standardise legacy/linear domain selection
irqdomain: Make ops->map hook optional
irqdomain: Remove unnecessary test for IRQ_DOMAIN_MAP_LEGACY
irqdomain: Simple NUMA awareness.
devicetree: add helper inline for retrieving a node's full name
Merge Andrew's second set of patches:
- MM
- a few random fixes
- a couple of RTC leftovers
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (120 commits)
rtc/rtc-88pm80x: remove unneed devm_kfree
rtc/rtc-88pm80x: assign ret only when rtc_register_driver fails
mm: hugetlbfs: close race during teardown of hugetlbfs shared page tables
tmpfs: distribute interleave better across nodes
mm: remove redundant initialization
mm: warn if pg_data_t isn't initialized with zero
mips: zero out pg_data_t when it's allocated
memcg: gix memory accounting scalability in shrink_page_list
mm/sparse: remove index_init_lock
mm/sparse: more checks on mem_section number
mm/sparse: optimize sparse_index_alloc
memcg: add mem_cgroup_from_css() helper
memcg: further prevent OOM with too many dirty pages
memcg: prevent OOM with too many dirty pages
mm: mmu_notifier: fix freed page still mapped in secondary MMU
mm: memcg: only check anon swapin page charges for swap cache
mm: memcg: only check swap cache pages for repeated charging
mm: memcg: split swapin charge function into private and public part
mm: memcg: remove needless !mm fixup to init_mm when charging
mm: memcg: remove unneeded shmem charge type
...
from interrupts for /dev/random and /dev/urandom. The goal is to
addresses weaknesses discussed in the paper "Mining your Ps and Qs:
Detection of Widespread Weak Keys in Network Devices", by Nadia
Heninger, Zakir Durumeric, Eric Wustrow, J. Alex Halderman, which will
be published in the Proceedings of the 21st Usenix Security Symposium,
August 2012. (See https://factorable.net for more information and an
extended version of the paper.)
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)
iQIcBAABCAAGBQJQF/0DAAoJENNvdpvBGATwIowQAOep9QKtLrBvb2lwIRVmeiy8
lRf7V/tYZnz4FePbR0W92JQfKYkCV8yyOO0bmeRzWL3v4m+lRwDTSyA1DDyQMoH+
LOMzvDKSLJMSXTXdSOIr1WYACphViCR/9CrbMBCKSkYfZLJ1MdaEDxT3rcpTGD0T
6iknUweiSkHHhkerU5yQL7FKzD5kYUe0hsF47w7QVlHRHJsW2fsZqkFoh+RpnhNw
03u+djxNGBo9qV81vZ9D1b0vA9uRlEjoWOOEG2XE4M2iq6TUySueA72dQnCwunfi
3kG/u1Swv2dgq6aRrP3H7zdwhYSourGxziu3jNhEKwKEohrxYY7xjNX3RVeTqP67
AzlKsOTWpRLIDrzjSLlb8VxRQiZewu8Unex3e1G+eo20sbcIObHGrxNp7K00zZvd
QZiMHhOwItwFTe4lBO+XbqH2JKbL9/uJmwh5EipMpQTraKO9E6N3CJiUHjzBLo2K
iGDZxRMKf4gVJRwDxbbP6D70JPVu8ZJ09XVIpsXQ3Z1xNqaMF0QdCmP3ty56q1o0
NvkSXxPKrijZs8Sk0rVDqnJ3ll8PuDnXMv5eDtL42VT818I5WxESn9djjwEanGv0
TYxbFub/NRxmPEE5B2Js5FBpqsLf5f282OSMeS/5WLBbnHJR1OoPoAhGVpHvxntC
bi5FC1OolqhvzVIdsqgt
=u7KM
-----END PGP SIGNATURE-----
Merge tag 'random_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/random
Pull random subsystem patches from Ted Ts'o:
"This patch series contains a major revamp of how we collect entropy
from interrupts for /dev/random and /dev/urandom.
The goal is to addresses weaknesses discussed in the paper "Mining
your Ps and Qs: Detection of Widespread Weak Keys in Network Devices",
by Nadia Heninger, Zakir Durumeric, Eric Wustrow, J. Alex Halderman,
which will be published in the Proceedings of the 21st Usenix Security
Symposium, August 2012. (See https://factorable.net for more
information and an extended version of the paper.)"
Fix up trivial conflicts due to nearby changes in
drivers/{mfd/ab3100-core.c, usb/gadget/omap_udc.c}
* tag 'random_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/random: (33 commits)
random: mix in architectural randomness in extract_buf()
dmi: Feed DMI table to /dev/random driver
random: Add comment to random_initialize()
random: final removal of IRQF_SAMPLE_RANDOM
um: remove IRQF_SAMPLE_RANDOM which is now a no-op
sparc/ldc: remove IRQF_SAMPLE_RANDOM which is now a no-op
[ARM] pxa: remove IRQF_SAMPLE_RANDOM which is now a no-op
board-palmz71: remove IRQF_SAMPLE_RANDOM which is now a no-op
isp1301_omap: remove IRQF_SAMPLE_RANDOM which is now a no-op
pxa25x_udc: remove IRQF_SAMPLE_RANDOM which is now a no-op
omap_udc: remove IRQF_SAMPLE_RANDOM which is now a no-op
goku_udc: remove IRQF_SAMPLE_RANDOM which was commented out
uartlite: remove IRQF_SAMPLE_RANDOM which is now a no-op
drivers: hv: remove IRQF_SAMPLE_RANDOM which is now a no-op
xen-blkfront: remove IRQF_SAMPLE_RANDOM which is now a no-op
n2_crypto: remove IRQF_SAMPLE_RANDOM which is now a no-op
pda_power: remove IRQF_SAMPLE_RANDOM which is now a no-op
i2c-pmcmsp: remove IRQF_SAMPLE_RANDOM which is now a no-op
input/serio/hp_sdc.c: remove IRQF_SAMPLE_RANDOM which is now a no-op
mfd: remove IRQF_SAMPLE_RANDOM which is now a no-op
...
This is needed to allow network softirq packet processing to make use of
PF_MEMALLOC.
Currently softirq context cannot use PF_MEMALLOC due to it not being
associated with a task, and therefore not having task flags to fiddle with
- thus the gfp to alloc flag mapping ignores the task flags when in
interrupts (hard or soft) context.
Allowing softirqs to make use of PF_MEMALLOC therefore requires some
trickery. This patch borrows the task flags from whatever process happens
to be preempted by the softirq. It then modifies the gfp to alloc flags
mapping to not exclude task flags in softirq context, and modify the
softirq code to save, clear and restore the PF_MEMALLOC flag.
The save and clear, ensures the preempted task's PF_MEMALLOC flag doesn't
leak into the softirq. The restore ensures a softirq's PF_MEMALLOC flag
cannot leak back into the preempted process. This should be safe due to
the following reasons
Softirqs can run on multiple CPUs sure but the same task should not be
executing the same softirq code. Neither should the softirq
handler be preempted by any other softirq handler so the flags
should not leak to an unrelated softirq.
Softirqs re-enable hardware interrupts in __do_softirq() so can be
preempted by hardware interrupts so PF_MEMALLOC is inherited
by the hard IRQ. However, this is similar to a process in
reclaim being preempted by a hardirq. While PF_MEMALLOC is
set, gfp_to_alloc_flags() distinguishes between hard and
soft irqs and avoids giving a hardirq the ALLOC_NO_WATERMARKS
flag.
If the softirq is deferred to ksoftirq then its flags may be used
instead of a normal tasks but as the softirq cannot be preempted,
the PF_MEMALLOC flag does not leak to other code by accident.
[davem@davemloft.net: Document why PF_MEMALLOC is safe]
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: David Miller <davem@davemloft.net>
Cc: Neil Brown <neilb@suse.de>
Cc: Mike Christie <michaelc@cs.wisc.edu>
Cc: Eric B Munson <emunson@mgebm.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When hotadd_new_pgdat() is called to create new pgdat for a new node, a
fallback zonelist should be created for the new node. There's code to try
to achieve that in hotadd_new_pgdat() as below:
/*
* The node we allocated has no zone fallback lists. For avoiding
* to access not-initialized zonelist, build here.
*/
mutex_lock(&zonelists_mutex);
build_all_zonelists(pgdat, NULL);
mutex_unlock(&zonelists_mutex);
But it doesn't work as expected. When hotadd_new_pgdat() is called, the
new node is still in offline state because node_set_online(nid) hasn't
been called yet. And build_all_zonelists() only builds zonelists for
online nodes as:
for_each_online_node(nid) {
pg_data_t *pgdat = NODE_DATA(nid);
build_zonelists(pgdat);
build_zonelist_cache(pgdat);
}
Though we hope to create zonelist for the new pgdat, but it doesn't. So
add a new parameter "pgdat" the build_all_zonelists() to build pgdat for
the new pgdat too.
Signed-off-by: Jiang Liu <liuj97@gmail.com>
Signed-off-by: Xishi Qiu <qiuxishi@huawei.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Keping Chen <chenkeping@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since per-BDI flusher threads were introduced in 2.6, the pdflush
mechanism is not used any more. But the old interface exported through
/proc/sys/vm/nr_pdflush_threads still exists and is obviously useless.
For back-compatibility, printk warning information and return 2 to notify
the users that the interface is removed.
Signed-off-by: Wanpeng Li <liwp@linux.vnet.ibm.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
vm_stat_account() accounts the shared_vm, stack_vm and reserved_vm now.
But we can also account for total_vm in the vm_stat_account() which makes
the code tidy.
Even for mprotect_fixup(), we can get the right result in the end.
Signed-off-by: Huang Shijie <shijie8@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull perf updates from Ingo Molnar:
"The biggest changes are Intel Nehalem-EX PMU uncore support, uprobes
updates/cleanups/fixes from Oleg and diverse tooling updates (mostly
fixes) now that Arnaldo is back from vacation."
* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (40 commits)
uprobes: __replace_page() needs munlock_vma_page()
uprobes: Rename vma_address() and make it return "unsigned long"
uprobes: Fix register_for_each_vma()->vma_address() check
uprobes: Introduce vaddr_to_offset(vma, vaddr)
uprobes: Teach build_probe_list() to consider the range
uprobes: Remove insert_vm_struct()->uprobe_mmap()
uprobes: Remove copy_vma()->uprobe_mmap()
uprobes: Fix overflow in vma_address()/find_active_uprobe()
uprobes: Suppress uprobe_munmap() from mmput()
uprobes: Uprobe_mmap/munmap needs list_for_each_entry_safe()
uprobes: Clean up and document write_opcode()->lock_page(old_page)
uprobes: Kill write_opcode()->lock_page(new_page)
uprobes: __replace_page() should not use page_address_in_vma()
uprobes: Don't recheck vma/f_mapping in write_opcode()
perf/x86: Fix missing struct before structure name
perf/x86: Fix format definition of SNB-EP uncore QPI box
perf/x86: Make bitfield unsigned
perf/x86: Fix LLC-* and node-* events on Intel SandyBridge
perf/x86: Add Intel Nehalem-EX uncore support
perf/x86: Fix typo in format definition of uncore PCU filter
...
Ingo noted that the numerous timekeeper.value references made
the timekeeping code ugly and caused many long lines that
had to be broken up. He recommended replacing timekeeper.value
references with tk->value.
This patch provides a local tk value for all top level time
functions and sets it to &timekeeper. Then all timekeeper
access is done via a tk pointer.
Signed-off-by: John Stultz <john.stultz@linaro.org>
Cc: Prarit Bhargava <prarit@redhat.com>
Link: http://lkml.kernel.org/r/1343414893-45779-6-git-send-email-john.stultz@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
For performance reasons, we maintain ktime_t based duplicates of
wall_to_monotonic (offs_real) and total_sleep_time (offs_boot).
Since large problems could occur (such as the resume regression
on 3.5-rc7, or the leapsecond hrtimer issue) if these value
pairs were to be inconsistently updated, this patch this cleans
up how we modify these value pairs to ensure we are always
consistent.
As a side-effect this is also more efficient as we only
caulculate the duplicate values when they are changed,
rather then every update_wall_time call.
This also provides WARN_ONs to detect if future changes break
the invariants.
Signed-off-by: John Stultz <john.stultz@linaro.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Link: http://lkml.kernel.org/r/1343414893-45779-5-git-send-email-john.stultz@linaro.org
[ Cleaned up minor style issues. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Ingo noted that ACTHZ is a confusing name, and requested it
be renamed, so this patch renames ACTHZ to SHIFTED_HZ to
better describe it.
Signed-off-by: John Stultz <john.stultz@linaro.org>
Cc: Prarit Bhargava <prarit@redhat.com>
Link: http://lkml.kernel.org/r/1343414893-45779-3-git-send-email-john.stultz@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
A few events are interesting not only for a current task.
For example, sched_stat_* events are interesting for a task
which wakes up. For this reason, it will be good if such
events will be delivered to a target task too.
Now a target task can be set by using __perf_task().
The original idea and a draft patch belongs to Peter Zijlstra.
I need these events for profiling sleep times. sched_switch is used for
getting callchains and sched_stat_* is used for getting time periods.
These events are combined in user space, then it can be analyzed by
perf tools.
Inspired-by: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Arun Sharma <asharma@fb.com>
Signed-off-by: Andrew Vagin <avagin@openvz.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1342016098-213063-1-git-send-email-avagin@openvz.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
With this patch struct ld_env will have a pointer of the load balancing
cpumask and we don't need to pass a cpumask around anymore.
Signed-off-by: Michael Wang <wangyun@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/4FFE8665.3080705@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Add function tracer based kprobe optimization support
handlers on x86. This allows kprobes to use function
tracer for probing on mcount call.
Link: http://lkml.kernel.org/r/20120605102838.27845.26317.stgit@localhost.localdomain
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: "Frank Ch. Eigler" <fche@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
[ Updated to new port of ftrace save regs functions ]
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Introduce function trace based kprobes optimization.
With using ftrace optimization, kprobes on the mcount calling
address, use ftrace's mcount call instead of breakpoint.
Furthermore, this optimization works with preemptive kernel
not like as current jump-based optimization. Of cource,
this feature works only if the probe is on mcount call.
Only if kprobe.break_handler is set, that probe is not
optimized with ftrace (nor put on ftrace). The reason why this
limitation comes is that this break_handler may be used only
from jprobes which changes ip address (for fetching the function
arguments), but function tracer ignores modified ip address.
Changes in v2:
- Fix ftrace_ops registering right after setting its filter.
- Unregister ftrace_ops if there is no kprobe using.
- Remove notrace dependency from __kprobes macro.
Link: http://lkml.kernel.org/r/20120605102832.27845.63461.stgit@localhost.localdomain
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: "Frank Ch. Eigler" <fche@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Break a big critical region into fine-grained pieces at
registering kprobe path. This helps us to solve circular
locking dependency when introducing ftrace-based kprobes.
Link: http://lkml.kernel.org/r/20120605102826.27845.81689.stgit@localhost.localdomain
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: "Frank Ch. Eigler" <fche@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Currently module_mutex is taken before kprobe_mutex, but this
can cause issues when we have kprobes register ftrace, as the ftrace
mutex is taken before enabling a tracepoint, which currently takes
the module mutex.
If module_mutex is taken before kprobe_mutex, then we can not
have kprobes use the ftrace infrastructure.
There seems to be no reason that the kprobe_mutex can't be taken
before the module_mutex. Running lockdep shows that it is safe
among the kernels I've run.
Link: http://lkml.kernel.org/r/20120605102814.27845.21047.stgit@localhost.localdomain
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: "Frank Ch. Eigler" <fche@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Add a new filter update interface ftrace_set_filter_ip()
to set ftrace filter by ip address, not only glob pattern.
Link: http://lkml.kernel.org/r/20120605102808.27845.67952.stgit@localhost.localdomain
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: "Frank Ch. Eigler" <fche@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Add selftests to test the save-regs functionality of ftrace.
If the arch supports saving regs, then it will make sure that regs is
at least not NULL in the callback.
If the arch does not support saving regs, it makes sure that the
registering of the ftrace_ops that requests saving regs fails.
It then tests the registering of the ftrace_ops succeeds if the
'IF_SUPPORTED' flag is set. Then it makes sure that the regs passed to
the function is NULL.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Add selftests to test the function tracing recursion protection actually
does work. It also tests if a ftrace_ops states it will perform its own
protection. Although, even if the ftrace_ops states it will protect itself,
the ftrace infrastructure may still provide protection if the arch does
not support all features or another ftrace_ops is registered.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
As more users of the function tracer utility are being added, they do
not always add the necessary recursion protection. To protect from
function recursion due to tracing, if the callback ftrace_ops does not
specifically specify that it protects against recursion (by setting
the FTRACE_OPS_FL_RECURSION_SAFE flag), the list operation will be
called by the mcount trampoline which adds recursion protection.
If the flag is set, then the function will be called directly with no
extra protection.
Note, the list operation is called if more than one function callback
is registered, or if the arch does not support all of the function
tracer features.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Currently kernel never set KGDB_REASON_NMI. We do now, when we enter
KGDB/KDB from an NMI.
This is not to be confused with kgdb_nmicallback(), NMI callback is
an entry for the slave CPUs during CPUs roundup, but REASON_NMI is the
entry for the master CPU.
Signed-off-by: Anton Vorontsov <anton.vorontsov@linaro.org>
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
Having the CPU in the more prompt is completely redundent vs the
standard kdb prompt, and it also wastes 32 bytes on the stack.
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
This code cleanup was missed in the original kdb merge, and this code
is simply not used at all. The code that was previously used to set
the KDB_FLAG_ONLY_DO_DUMP was removed prior to the initial kdb merge.
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
When the requested range is outside of the root range the logic in
__reserve_region_with_split will cause an infinite recursion which will
overflow the stack as seen in the warning bellow.
This particular stack overflow was caused by requesting the
(100000000-107ffffff) range while the root range was (0-ffffffff). In
this case __request_resource would return the whole root range as
conflict range (i.e. 0-ffffffff). Then, the logic in
__reserve_region_with_split would continue the recursion requesting the
new range as (conflict->end+1, end) which incidentally in this case
equals the originally requested range.
This patch aborts looking for an usable range when the request does not
intersect with the root range. When the request partially overlaps with
the root range, it ajust the request to fall in the root range and then
continues with the new request.
When the request is modified or aborted errors and a stack trace are
logged to allow catching the errors in the upper layers.
[ 5.968374] WARNING: at kernel/sched.c:4129 sub_preempt_count+0x63/0x89()
[ 5.975150] Modules linked in:
[ 5.978184] Pid: 1, comm: swapper Not tainted 3.0.22-mid27-00004-gb72c817 #46
[ 5.985324] Call Trace:
[ 5.987759] [<c1039dfc>] ? console_unlock+0x17b/0x18d
[ 5.992891] [<c1039620>] warn_slowpath_common+0x48/0x5d
[ 5.998194] [<c1031758>] ? sub_preempt_count+0x63/0x89
[ 6.003412] [<c1039644>] warn_slowpath_null+0xf/0x13
[ 6.008453] [<c1031758>] sub_preempt_count+0x63/0x89
[ 6.013499] [<c14d60c4>] _raw_spin_unlock+0x27/0x3f
[ 6.018453] [<c10c6349>] add_partial+0x36/0x3b
[ 6.022973] [<c10c7c0a>] deactivate_slab+0x96/0xb4
[ 6.027842] [<c14cf9d9>] __slab_alloc.isra.54.constprop.63+0x204/0x241
[ 6.034456] [<c103f78f>] ? kzalloc.constprop.5+0x29/0x38
[ 6.039842] [<c103f78f>] ? kzalloc.constprop.5+0x29/0x38
[ 6.045232] [<c10c7dc9>] kmem_cache_alloc_trace+0x51/0xb0
[ 6.050710] [<c103f78f>] ? kzalloc.constprop.5+0x29/0x38
[ 6.056100] [<c103f78f>] kzalloc.constprop.5+0x29/0x38
[ 6.061320] [<c17b45e9>] __reserve_region_with_split+0x1c/0xd1
[ 6.067230] [<c17b4693>] __reserve_region_with_split+0xc6/0xd1
...
[ 7.179057] [<c17b4693>] __reserve_region_with_split+0xc6/0xd1
[ 7.184970] [<c17b4779>] reserve_region_with_split+0x30/0x42
[ 7.190709] [<c17a8ebf>] e820_reserve_resources_late+0xd1/0xe9
[ 7.196623] [<c17c9526>] pcibios_resource_survey+0x23/0x2a
[ 7.202184] [<c17cad8a>] pcibios_init+0x23/0x35
[ 7.206789] [<c17ca574>] pci_subsys_init+0x3f/0x44
[ 7.211659] [<c1002088>] do_one_initcall+0x72/0x122
[ 7.216615] [<c17ca535>] ? pci_legacy_init+0x3d/0x3d
[ 7.221659] [<c17a27ff>] kernel_init+0xa6/0x118
[ 7.226265] [<c17a2759>] ? start_kernel+0x334/0x334
[ 7.231223] [<c14d7482>] kernel_thread_helper+0x6/0x10
Signed-off-by: Octavian Purdila <octavian.purdila@intel.com>
Signed-off-by: Ram Pai <linuxram@us.ibm.com>
Cc: Jesse Barnes <jbarnes@virtuousgeek.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
register_sysctl_table() is a strange function, as it makes internal
allocations (a header) to register a sysctl_table. This header is a
handle to the table that is created, and can be used to unregister the
table. But if the table is permanent and never unregistered, the header
acts the same as a static variable.
Unfortunately, this allocation of memory that is never expected to be
freed fools kmemleak in thinking that we have leaked memory. For those
sysctl tables that are never unregistered, and have no pointer referencing
them, kmemleak will think that these are memory leaks:
unreferenced object 0xffff880079fb9d40 (size 192):
comm "swapper/0", pid 0, jiffies 4294667316 (age 12614.152s)
hex dump (first 32 bytes):
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
backtrace:
[<ffffffff8146b590>] kmemleak_alloc+0x73/0x98
[<ffffffff8110a935>] kmemleak_alloc_recursive.constprop.42+0x16/0x18
[<ffffffff8110b852>] __kmalloc+0x107/0x153
[<ffffffff8116fa72>] kzalloc.constprop.8+0xe/0x10
[<ffffffff811703c9>] __register_sysctl_paths+0xe1/0x160
[<ffffffff81170463>] register_sysctl_paths+0x1b/0x1d
[<ffffffff8117047d>] register_sysctl_table+0x18/0x1a
[<ffffffff81afb0a1>] sysctl_init+0x10/0x14
[<ffffffff81b05a6f>] proc_sys_init+0x2f/0x31
[<ffffffff81b0584c>] proc_root_init+0xa5/0xa7
[<ffffffff81ae5b7e>] start_kernel+0x3d0/0x40a
[<ffffffff81ae52a7>] x86_64_start_reservations+0xae/0xb2
[<ffffffff81ae53ad>] x86_64_start_kernel+0x102/0x111
[<ffffffffffffffff>] 0xffffffffffffffff
The sysctl_base_table used by sysctl itself is one such instance that
registers the table to never be unregistered.
Use kmemleak_not_leak() to suppress the kmemleak false positive.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The last line of vmcoreinfo note does not end with \n. Parsing all the
lines in note becomes easier if all lines end with \n instead of trying to
special case the last line.
I know at least one tool, vmcore-dmesg in kexec-tools tree which made the
assumption that all lines end with \n. I think it is a good idea to fix
it.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The function dup_task() may fail at the following function calls in the
following order.
0) alloc_task_struct_node()
1) alloc_thread_info_node()
2) arch_dup_task_struct()
Error by 0) is not a matter, it can just return. But error by 1) requires
releasing task_struct allocated by 0) before it returns. Likewise, error
by 2) requires releasing task_struct and thread_info allocated by 0) and
1).
The existing error handling calls free_task_struct() and
free_thread_info() which do not only release task_struct and thread_info,
but also call architecture specific arch_release_task_struct() and
arch_release_thread_info().
The problem is that task_struct and thread_info are not fully initialized
yet at this point, but arch_release_task_struct() and
arch_release_thread_info() are called with them.
For example, x86 defines its own arch_release_task_struct() that releases
a task_xstate. If alloc_thread_info_node() fails in dup_task(),
arch_release_task_struct() is called with task_struct which is just
allocated and filled with garbage in this error handling.
This actually happened with tools/testing/fault-injection/failcmd.sh
# env FAILCMD_TYPE=fail_page_alloc \
./tools/testing/fault-injection/failcmd.sh --times=100 \
--min-order=0 --ignore-gfp-wait=0 \
-- make -C tools/testing/selftests/ run_tests
In order to fix this issue, make free_{task_struct,thread_info}() not to
call arch_release_{task_struct,thread_info}() and call
arch_release_{task_struct,thread_info}() implicitly where needed.
Default arch_release_task_struct() and arch_release_thread_info() are
defined as empty by default. So this change only affects the
architectures which implement their own arch_release_task_struct() or
arch_release_thread_info() as listed below.
arch_release_task_struct(): x86, sh
arch_release_thread_info(): mn10300, tile
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Koichi Yasutake <yasutake.koichi@jp.panasonic.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Salman Qazi <sqazi@google.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
To make way for "fork: fix error handling in dup_task()", which fixes the
errors more completely.
Cc: Salman Qazi <sqazi@google.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The current code can be replaced by vma_pages(). So use it to simplify
the code.
[akpm@linux-foundation.org: initialise `len' at its definition site]
Signed-off-by: Huang Shijie <shijie8@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The system deadlocks (at least since 2.6.10) when
call_usermodehelper(UMH_WAIT_EXEC) request triggers
call_usermodehelper(UMH_WAIT_PROC) request.
This is because "khelper thread is waiting for the worker thread at
wait_for_completion() in do_fork() since the worker thread was created
with CLONE_VFORK flag" and "the worker thread cannot call complete()
because do_execve() is blocked at UMH_WAIT_PROC request" and "the khelper
thread cannot start processing UMH_WAIT_PROC request because the khelper
thread is waiting for the worker thread at wait_for_completion() in
do_fork()".
The easiest example to observe this deadlock is to use a corrupted
/sbin/hotplug binary (like shown below).
# : > /tmp/dummy
# chmod 755 /tmp/dummy
# echo /tmp/dummy > /proc/sys/kernel/hotplug
# modprobe whatever
call_usermodehelper("/tmp/dummy", UMH_WAIT_EXEC) is called from
kobject_uevent_env() in lib/kobject_uevent.c upon loading/unloading a
module. do_execve("/tmp/dummy") triggers a call to
request_module("binfmt-0000") from search_binary_handler() which in turn
calls call_usermodehelper(UMH_WAIT_PROC).
In order to avoid deadlock, as a for-now and easy-to-backport solution, do
not try to call wait_for_completion() in call_usermodehelper_exec() if the
worker thread was created by khelper thread with CLONE_VFORK flag. Future
and fundamental solution might be replacing singleton khelper thread with
some workqueue so that recursive calls up to max_active dependency loop
can be handled without deadlock.
[akpm@linux-foundation.org: add comment to kmod_thread_locker]
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: Tejun Heo <tj@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
vprintk_emit() prefix parsing should only be done for internal kernel
messages. This allows existing behavior to be kept in all cases.
Signed-off-by: Joe Perches <joe@perches.com>
Cc: Kay Sievers <kay@vrfy.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The current form of a KERN_<LEVEL> is "<.>".
Add printk_get_level and printk_skip_level functions to handle these
formats.
These functions centralize tests of KERN_<LEVEL> so a future modification
can change the KERN_<LEVEL> style and shorten the number of bytes consumed
by these headers.
[akpm@linux-foundation.org: fix build error and warning]
Signed-off-by: Joe Perches <joe@perches.com>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Cc: Wu Fengguang <wfg@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If argv_split() failed, the code will end up calling argv_free(NULL). Fix
it up and clean things up a bit.
Addresses Coverity report 703573.
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Serge Hallyn <serge.hallyn@canonical.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: WANG Cong <xiyou.wangcong@gmail.com>
Cc: Alan Cox <alan@linux.intel.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
On the suspend/resume path the boot CPU does not go though an
offline->online transition. This breaks the NMI detector post-resume
since it depends on PMU state that is lost when the system gets
suspended.
Fix this by forcing a CPU offline->online transition for the lockup
detector on the boot CPU during resume.
To provide more context, we enable NMI watchdog on Chrome OS. We have
seen several reports of systems freezing up completely which indicated
that the NMI watchdog was not firing for some reason.
Debugging further, we found a simple way of repro'ing system freezes --
issuing the command 'tasket 1 sh -c "echo nmilockup > /proc/breakme"'
after the system has been suspended/resumed one or more times.
With this patch in place, the system freeze result in panics, as
expected.
These panics provide a nice stack trace for us to debug the actual issue
causing the freeze.
[akpm@linux-foundation.org: fiddle with code comment]
[akpm@linux-foundation.org: make lockup_detector_bootcpu_resume() conditional on CONFIG_SUSPEND]
[akpm@linux-foundation.org: fix section errors]
Signed-off-by: Sameer Nanda <snanda@chromium.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Don Zickus <dzickus@redhat.com>
Cc: Mandeep Singh Baines <msb@chromium.org>
Cc: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>