We have a bug in the calculation of the next jiffie to trigger the RTC
synchronisation. The aim here is to run sync_cmos_clock() as close as
possible to the middle of a second. Which means we want this function to
be called less than or equal to half a jiffie away from when now.tv_nsec
equals 5e8 (500000000).
If this is not the case for a given call to the function, for this purpose
instead of updating the RTC we calculate the offset in nanoseconds to the
next point in time where now.tv_nsec will be equal 5e8. The calculated
offset is then converted to jiffies as these are the unit used by the
timer.
Hovewer timespec_to_jiffies() used here uses a ceil()-type rounding mode,
where the resulting value is rounded up. As a result the range of
now.tv_nsec when the timer will trigger is from 5e8 to 5e8 + TICK_NSEC
rather than the desired 5e8 - TICK_NSEC / 2 to 5e8 + TICK_NSEC / 2.
As a result if for example sync_cmos_clock() happens to be called at the
time when now.tv_nsec is between 5e8 + TICK_NSEC / 2 and 5e8 to 5e8 +
TICK_NSEC, it will simply be rescheduled HZ jiffies later, falling in the
same range of now.tv_nsec again. Similarly for cases offsetted by an
integer multiple of TICK_NSEC.
This change addresses the problem by subtracting TICK_NSEC / 2 from the
nanosecond offset to the next point in time where now.tv_nsec will be
equal 5e8, effectively shifting the following rounding in
timespec_to_jiffies() so that it produces a rounded-to-nearest result.
Signed-off-by: Maciej W. Rozycki <macro@linux-mips.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
I found that 2.6.27-rc5-mm1 does not compile with gcc 3.4.6.
The error is:
CC kernel/sched.o
kernel/sched.c: In function `start_rt_bandwidth':
kernel/sched.c:208: sorry, unimplemented: inlining failed in call to 'rt_bandwidth_enabled': function body not available
kernel/sched.c:214: sorry, unimplemented: called from here
make[1]: *** [kernel/sched.o] Error 1
make: *** [kernel] Error 2
It seems that the gcc 3.4.6 requires full inline definition before first usage.
The patch below fixes the compilation problem.
Signed-off-by: Krzysztof Helt <krzysztof.h1@wp.pl> (if needed>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Until the C1E patches arrived there where no users of periodic broadcast
before switching to oneshot mode. Now we need to trigger a possible
waiter for a periodic broadcast when switching to oneshot mode.
Otherwise we can starve them for ever.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Spencer reported a problem where utime and stime were going negative despite
the fixes in commit b27f03d4bd. The suspected
reason for the problem is that signal_struct maintains it's own utime and
stime (of exited tasks), these are not updated using the new task_utime()
routine, hence sig->utime can go backwards and cause the same problem
to occur (sig->utime, adds tsk->utime and not task_utime()). This patch
fixes the problem
TODO: using max(task->prev_utime, derived utime) works for now, but a more
generic solution is to implement cputime_max() and use the cputime_gt()
function for comparison.
Reported-by: spencer@bluehost.com
Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
If HLT stops the TSC, we'll fail to account idle time, thereby inflating the
actual process times. Fix this by re-calibrating the clock against GTOD when
leaving nohz mode.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Tested-by: Avi Kivity <avi@qumranet.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The C1E/HPET bug reports on AMDX2/RS690 systems where tracked down to a
too small value of the HPET minumum delta for programming an event.
The clockevents code needs to enforce an interrupt event on the clock event
device in some cases. The enforcement code was stupid and naive, as it just
added the minimum delta to the current time and tried to reprogram the device.
When the minimum delta is too small, then this loops forever.
Add a sanity check. Allow reprogramming to fail 3 times, then print a warning
and double the minimum delta value to make sure, that this does not happen again.
Use the same function for both tick-oneshot and tick-broadcast code.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
While chasing the C1E/HPET bugreports I went through the clock events
code inch by inch and found that the broadcast device can be initialized
and shutdown multiple times. Multiple shutdowns are not critical, but
useless waste of time. Multiple initializations are simply broken. Another
CPU might have the device in use already after the first initialization and
the second init could just render it unusable again.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
In tick_oneshot_setup we program the device to the given next_event,
but we do not check the return value. We need to make sure that the
device is programmed enforced so the interrupt handler engine starts
working. Split out the reprogramming function from tick_program_event()
and call it with the device, which was handed in to tick_setup_oneshot().
Set the force argument, so the devices is firing an interrupt.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The reprogramming of the periodic broadcast handler was broken,
when the first programming returned -ETIME. The clockevents code
stores the new expiry value in the clock events device next_event field
only when the programming time has not been elapsed yet. The loop in
question calculates the new expiry value from the next_event value
and therefor never increases.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
There is a ordering related problem with clockevents code, due to which
clockevents_register_device() called after tickless/highres switch
will not work. The new clockevent ends up with clockevents_handle_noop as
event handler, resulting in no timer activity.
The problematic path seems to be
* old device already has hrtimer_interrupt as the event_handler
* new clockevent device registers with a higher rating
* tick_check_new_device() is called
* clockevents_exchange_device() gets called
* old->event_handler is set to clockevents_handle_noop
* tick_setup_device() is called for the new device
* which sets new->event_handler using the old->event_handler which is noop.
Change the ordering so that new device inherits the proper handler.
This does not have any issue in normal case as most likely all the clockevent
devices are setup before the highres switch. But, can potentially be affecting
some corner case where HPET force detect happens after the highres switch.
This was a problem with HPET in MSI mode code that we have been experimenting
with.
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
add reserve_region_with_split() to not lose e820 reserved entries if
they overlap with existing IO regions:
with test case by extend 0xe0000000 - 0xeffffff to 0xdd800000 -
we get:
e0000000-efffffff : PCI MMCONFIG 0
e0000000-efffffff : reserved
and in /proc/iomem we get:
found conflict for reserved [dd800000, efffffff], try to reserve with split
__reserve_region_with_split: (PCI Bus #80) [dd000000, ddffffff], res: (reserved) [dd800000, efffffff]
__reserve_region_with_split: (PCI Bus #00) [de000000, dfffffff], res: (reserved) [de000000, efffffff]
initcall pci_subsys_init+0x0/0x121 returned 0 after 381 msecs
in dmesg
various fixes and improvements suggested by Linus.
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
We should've set refcount on the root sysctl table; otherwise we'll blow
up the first time we get down to zero dynamically registered sysctl
tables.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Tested-by: James Bottomley <James.Bottomley@HansenPartnership.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Make PM_QOS and CPU_IDLE play nicer when run with the RT-Preempt kernel.
The purpose of the patch is to remove the spin_lock around the read in the
function pm_qos_requirement - since spinlocks can sleep in -rt and this
function is called from idle.
CPU_IDLE polls the target_value's of some of the pm_qos parameters from
the idle loop causing sleeping locking warnings. Changing the
target_value to an atomic avoids this issue.
Remove the spinlock in pm_qos_requirement by making target_value an atomic
type.
Signed-off-by: mark gross <mgross@linux.intel.com>
Signed-off-by: John Kacur <jkacur@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We don't change pid_ns->child_reaper when the main thread of the
subnamespace init exits. As Robert Rex <robert.rex@exasol.com> pointed
out this is wrong.
Yes, the re-parenting itself works correctly, but if the reparented task
exits it needs ->parent->nsproxy->pid_ns in do_notify_parent(), and if the
main thread is zombie its ->nsproxy was already cleared by
exit_task_namespaces().
Introduce the new function, find_new_reaper(), which finds the new
->parent for the re-parenting and changes ->child_reaper if needed. Kill
the now unneeded exit_child_reaper().
Also move the changing of ->child_reaper from zap_pid_ns_processes() to
find_new_reaper(), this consolidates the games with ->child_reaper and
makes it stable under tasklist_lock.
Addresses http://bugzilla.kernel.org/show_bug.cgi?id=11391
Reported-by: Robert Rex <robert.rex@exasol.com>
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Acked-by: Pavel Emelyanov <xemul@openvz.org>
Acked-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
zap_pid_ns_processes() sets pid_ns->child_reaper = NULL, this is wrong.
Yes, we have already killed all tasks in this namespace, and sys_wait4()
doesn't see any child. But this doesn't mean ->children list is empty, we
may have EXIT_DEAD tasks which are not visible to do_wait(). In that case
the subsequent forget_original_parent() will crash the kernel because it
will try to re-parent these tasks to the NULL reaper.
Even if there are no childs, it is not good that forget_original_parent()
uses reaper == NULL.
Change the code to set ->child_reaper = init_pid_ns.child_reaper instead.
We could use pid_ns->parent->child_reaper as well, I think this does not
really matter. These EXIT_DEAD tasks are not visible to the new ->parent
after re-parenting, they will silently do release_task() eventually.
Note that we must change ->child_reaper, otherwise
forget_original_parent() will use reaper == father, and in that case we
will hit the (correct) BUG_ON(!list_empty(&father->children)).
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Acked-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Acked-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The recent commit 16d9679f33caf7e683471647d1472bfe133d858 changed
check_hung_task() to filter out the TASK_KILLABLE tasks. We can
move this check to the caller which has to test t->state anyway.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Acked-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix kernel-doc warning for new function:
Warning(linux-2.6.27-rc5-git2//kernel/resource.c:448): No description found for parameter 'root'
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
got rid of compilation warning:
ISO C90 forbids mixed declarations and code
Signed-off-by: Cordelia Sam <cordesam@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Not used anywhere yet, but this complements the existing plain
'insert_resource()' functionality with a version that can expand the
resource we are adding in order to fix up any conflicts it has with
existing resources.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pulling the ethernet cable on a 2.6.27-rc system with NFS mounts
currently leads to an ongoing flood of soft lockup detector backtraces
for all tasks blocked on the NFS mounts when the hickup takes
longer than 120s.
I don't think NFS problems should be all that noisy.
Luckily there's a reasonably easy way to distingush this case.
Don't report task softlockup warnings for tasks in TASK_KILLABLE
state, which is used by the network file systems.
I believe this patch is a 2.6.27 candidate.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In accordance with commit f42ac38c59
("ftrace: disable tracing for suspend to ram"), disable tracing
around the suspend code in hibernation code paths.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It fixes an accounting bug where we would continue accumulating runtime
even though the bandwidth control is disabled. This would lead to very long
throttle periods once bandwidth control gets turned on again.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
fix:
kernel/sched.c: In function '__rt_schedulable':
kernel/sched.c:8771: error: implicit declaration of function 'walk_tg_tree'
kernel/sched.c:8771: error: 'tg_nop' undeclared (first use in this function)
kernel/sched.c:8771: error: (Each undeclared identifier is reported only once
kernel/sched.c:8771: error: for each function it appears in.)
Signed-off-by: Ingo Molnar <mingo@elte.hu>
- During wake up of a new task, task_new_fair() can do a resched_task()
on the current task. Later in the code path, check_preempt_curr() also ends
up doing the same, which can be avoided. Check if TIF_NEED_RESCHED is
already set for the current task.
- task_new_fair() does a resched_task() on the current task unconditionally.
This can be done only in case when child runs before the parent.
So this is a small speedup.
Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
When sysctl_sched_rt_runtime is set to something other than -1 and the
CONFIG_RT_GROUP_SCHED kernel parameter is NOT enabled, we get into a state
where we see one or more CPUs idling forvever even though there are
real-time
tasks in their rt runqueue that are able to run (no longer throttled).
The sequence is:
- A real-time task is running when the timer sets the rt runqueue
to throttled, and the rt task is resched_task()ed and switched
out, and idle is switched in since there are no non-rt tasks to
run on that cpu.
- Eventually the do_sched_rt_period_timer() runs and un-throttles
the rt runqueue, but we just exit the timer interrupt and go back
to executing the idle task in the idle loop forever.
If we change the sched_rt_rq_enqueue() routine to use some of the code
from the CONFIG_RT_GROUP_SCHED enabled version of this same routine and
resched_task() the currently executing task (idle in our case) if it is
a lower priority task than the higher rt task in the now un-throttled
runqueue, the problem is no longer observed.
Signed-off-by: John Blackwood <john.blackwood@ccur.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
I've been painstakingly debugging the issue with suspend to ram and
ftraced. The 2.6.28 code does not have this issue, but since the mcount
recording is not going to be in 27, this must be solved for the ftrace
daemon version.
The resume from suspend to ram would reboot because it was triple
faulting. Debugging further, I found that calling the mcount function
itself was not an issue, but it would fault when it incremented
preempt_count. preempt_count is on the tasks info structure that is on the
low memory address of the task's stack. For some reason, it could not
write to it. Resuming out of suspend to ram does quite a lot of funny
tricks to get to work, so it is not surprising at all that simply doing a
preempt_disable() would cause a fault.
Thanks to Rafael for suggesting to add a "while (1);" to find the place in
resuming that is causing the fault. I would place the loop somewhere in
the code, compile and reboot and see if it would either reboot (hit the
fault) or simply hang (hit the loop). Doing this over and over again, I
narrowed it down that it was happening in enable_nonboot_cpus.
At this point, I found that it is easier to simply disable tracing around
the suspend code, instead of searching for the particular function that
can not handle doing a preempt_disable.
This patch disables the tracer as it suspends and reenables it on resume.
I tested this patch on my Laptop, and it can resume fine with the patch.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Acked-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
CC kernel/rcuclassic.o
kernel/rcuclassic.c: In function 'rcu_init_percpu_data':
kernel/rcuclassic.c:705: warning: comparison of distinct pointer types lacks a cast
kernel/rcuclassic.c:713: warning: comparison of distinct pointer types lacks a cast
flags should be unsigned long.
Signed-off-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
task->signal->notify_count is only initialized if
task->signal->group_exit_task is not NULL. Reorder a conditional so
that uninitialised memory is not used. Found by Valgrind.
Signed-off-by: Steve VanDeBogart <vandebo-lkml@nerdbox.net>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Fix bad contention counting in /proc/lock_stat.
/proc/lockstat tries to gather per-ip contention
statistics per-lock. This was failing due to
a garbage per-ip index selector being used.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Fix rounding error in /proc/lock_stat numerical output.
On occasion the two digit fractional part contains the three
digit value '100'. This is due to a bug in the rounding algorithm
which pushes values in the range '95..99' to '100' rather than
to '00' + an increment to the integer part. For example,
- 123456.100 old display
+ 123457.00 new display
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This patch adds kernel doc for the completion feature.
An error in the split-man.pl PERL snippet in kernel-doc-nano-HOWTO.txt is
also fixed.
Signed-off-by: Kevin Diggs <kevdig@hypersurf.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Have smp_call_function_single() return invalid CPU indicies and return
-ENXIO. This function is already executed inside a
get_cpu()..put_cpu() which locks out CPU removal, so rather than
having the higher layers doing another layer of locking to guard
against unplugged CPUs do the test here.
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
'load_module()' is a complex function that contains all the ELF section
logic, and inlining it is utterly insane. But gcc will do it, simply
because there is only one call-site. As a result, all the stack space
that is allocated for all the work to load the module will still be
active when we actually call the module init sequence, and the deep call
chain makes stack overflows happen.
And stack overflows are really hard to debug, because they not only
corrupt random pages below the stack, but also corrupt the thread_info
structure that is allocated under the stack.
In this case, Alan Brunelle reported some crazy oopses at bootup, after
loading the processor module that ends up doing complex ACPI stuff and
has quite a deep callchain. This should fix it, and is the sane thing
to do regardless.
Cc: Alan D. Brunelle <Alan.Brunelle@hp.com>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch lets the files using linux/version.h match the files that
#include it.
Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
wait_task_inactive() returns 1 when p->nvcsw == 0 || p->nvcsw == 1. This
means that two subsequent calls can return the same number while the task
was scheduled in between.
Change the code to return "nvcsw | LONG_MIN" instead of "nvcsw ?: 1", now
the overlap always needs LONG_MAX schedules.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
If wait_task_inactive() returns success the task was deactivated. In that
case schedule() always increments ->nvcsw which alone can be used as a
"generation counter".
If the next call returns the same number, we can be sure that the task was
unscheduled. Otherwise, because we know that .on_rq == 0 again, ->nvcsw
should have been changed in between.
Q: perhaps it is better to do "ncsw = (p->nvcsw << 1) | 1" ? This
decreases the possibility of "was it unscheduled" false positive when
->nvcsw == 0.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Change do_wait_for_common() to use signal_pending_state() instead of open
coding.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Some earlier tip/core/rcu patches caused RCU to incorrectly enable irqs
too early in boot. This caused Yinghai's repeated-kexec testing to
hit oopses, presumably due to so that device interrupts left over from
the prior kernel instance (which would oops the newly booting kernel
before it got a chance to reset said devices). This patch therefore
converts all the local_irq_disable()s in rcuclassic.c to local_irq_save().
Besides, I never did like local_irq_disable() anyway. ;-)
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
On the tickless system(CONFIG_NO_HZ=y and CONFIG_HIGH_RES_TIMERS=n), after
I made an offlined cpu online, I found this cpu's event handler was
tick_handle_periodic, not tick_nohz_handler.
After debuging, I found this bug was caused by the wrong tick mode. the
tick mode is not changed to NOHZ_MODE_INACTIVE when the cpu is offline.
This patch fixes this bug.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Fix RCU's synchronize_rcu() so that it looks like a C function, enabling
it to be recognized as a function with kernel-doc annotation.
Warning(linux-2.6.26-git11//kernel/rcupdate.c:81): No description found for parameter 'synchronize_rcu'
Warning(linux-2.6.26-git11//kernel/rcupdate.c:81): No description found for parameter 'call_rcu'
[akpm@linux-foundation.org: fix comment]
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Yanmin reported a significant regression on his 16-core machine due to:
commit 93b75217df
Author: Peter Zijlstra <a.p.zijlstra@chello.nl>
Date: Fri Jun 27 13:41:33 2008 +0200
Flip back to the old behaviour.
Reported-by: "Zhang, Yanmin" <yanmin_zhang@linux.intel.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
When user calls sys_setpriority(PRIO_PGRP ...) on a NPTL style multi-LWP
process, only the task leader of the process is affected, all other
sibling LWP threads didn't receive the setting. The problem was that the
iterator used in sys_setpriority() only iteartes over one task for each
process, ignoring all other sibling thread.
Introduce a new macro do_each_pid_thread / while_each_pid_thread to walk
each thread of a process. Convert 4 call sites in {set/get}priority and
ioprio_{set/get}.
Signed-off-by: Ken Chen <kenchen@google.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Roland McGrath <roland@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
I outwitted myself again in commit 2b2a1ff64a,
and broke the SA_NOCLDWAIT behavior so it leaks zombies. This fixes it.
Reported-by: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Roland McGrath <roland@redhat.com>
The last patch allows sysctl_sched_rt_runtime to disable bandwidth accounting
for the group scheduler - however it doesn't deal with sched_setscheduler(),
which will keep tasks out of groups that have no assigned runtime.
If we relax this, we get into the situation where RT tasks can get into a group
when we disable bandwidth control, and then starve them by enabling it again.
Rework the schedulability code to check for this condition and fail to turn
on bandwidth control with -EBUSY when this situation is found.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Extract walk_tg_tree() and make it a little more generic so we can use it
in the schedulablity test.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
More extensive disable of bandwidth control. It allows sysctl_sched_rt_runtime
to disable full group bandwidth control.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
It fixes an accounting bug where we would continue accumulating runtime
even though the bandwidth control is disabled. This would lead to very long
throttle periods once bandwidth control gets turned on again.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
fix:
CC kernel/rcuclassic.o
kernel/rcuclassic.c: In function '__rcu_process_callbacks':
kernel/rcuclassic.c:561: error: 'flags' undeclared (first use in this function)
kernel/rcuclassic.c:561: error: (Each undeclared identifier is reported only once
kernel/rcuclassic.c:561: error: for each function it appears in.)
Declare missing variable flags.
Signed-off-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Given that the rcp->lock is now acquired from call_rcu(), which can be
invoked from irq-disable regions, all acquisitions need to disable irqs.
The following patch fixes this.
Although I don't have any reason to believe that this is the cause of
Yinghai's oops, it does need to be fixed.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Remove the redundant definition of ACCESS_ONCE() from rcupreempt.c in
favor of the one in compiler.h. Also merge the comment header from
rcupreempt.c's definition into that in compiler.h.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Since f82b217e35 lockdep can output spurious
warnings related to hwirqs due to hardirq_off shrinkage from int to bit-sized
flag. Guard it with double negation to fix the warning.
Signed-off-by: Dmitry Baryshkov <dbaryshkov@gmail.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
lockdep: fix build if CONFIG_PROVE_LOCKING not defined
lockdep: use WARN() in kernel/lockdep.c
lockdep: spin_lock_nest_lock(), checkpatch fixes
lockdep: build fix
* 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
sched: scale sysctl_sched_shares_ratelimit with nr_cpus
sched: fix rt-bandwidth hotplug race
sched: fix the race between walk_tg_tree and sched_create_group
If CONFIG_PROVE_LOCKING not defined, then no dependency information
is available.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
David reported that his Niagra spend a little too much time in
tg_shares_up(), which considering he has a large cpu count makes sense.
So scale the ratelimit value with the number of cpus like we do for
other controls as well.
Reported-by: David Miller <davem@davemloft.net>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
In the initialization of the RCU trace module, if
rcupreempt_debugfs_init() fails, we never free the the trace buffer.
This patch frees the trace buffer in case the debugfs fails.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Reviewed-by: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
m68k fails to build with these functions inlined in completion.h. Move
them out of line into sched.c and export them to avoid this problem.
Signed-off-by: Dave Chinner <david@fromorbit.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Ftrace depends on some processor state that we destroyed during kexec and
restored by restore_processor_state(). So save_processor_state() and
restore_processor_state() are moved into machine_kexec() and ftrace is
restored after restore_processor_state().
Signed-off-by: Huang Ying <ying.huang@intel.com>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Rename KEXEC_CONTROL_CODE_SIZE to KEXEC_CONTROL_PAGE_SIZE, because control
page is used for not only code on some platform. For example in kexec
jump, it is used for data and stack too.
[akpm@linux-foundation.org: unbreak powerpc and arm, finish conversion]
Signed-off-by: Huang Ying <ying.huang@intel.com>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
kernel/kexec.c: In function 'kernel_kexec':
kernel/kexec.c:1506: warning: value computed is not used
Signed-off-by: Huang Ying <ying.huang@intel.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch simplifies the locking and memory-barrier usage in the Classic
RCU grace-period-detection mechanism, incorporating Lai Jiangshan's
feedback from the earlier version (http://lkml.org/lkml/2008/8/1/400
and http://lkml.org/lkml/2008/8/3/43). Passed 10 hours of
rcutorture concurrent with CPUs being put online and taken offline on
a 128-hardware-thread Power machine. My apologies to whoever in the
Eastern Hemisphere was planning to use this machine over the Western
Hemisphere night, but it was sitting idle and...
So this is ready for tip/core/rcu.
This patch is in preparation for moving to a hierarchical
algorithm to allow the very large SMP machines -- requested by some
people at OLS, and there seem to have been a few recent patches in the
4096-CPU direction as well. The general idea is to move to a much more
conservative concurrency design, then apply a hierarchy to reduce
contention on the global lock by a few orders of magnitude (larger
machines would see greater reductions). The reason for taking a
conservative approach is that this code isn't on any fast path.
Prototype in progress.
This patch is against the linux-tip git tree (tip/core/rcu). If you
wish to test this against 2.6.26, use the following set of patches:
http://www.rdrop.com/users/paulmck/patches/2.6.26-ljsimp-1.patchhttp://www.rdrop.com/users/paulmck/patches/2.6.26-ljsimpfix-3.patch
The first patch combines commits 5127bed588
and 3cac97cbb1 from Lai Jiangshan
<laijs@cn.fujitsu.com>, and the second patch contains my changes.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
One small change needed to keep from flooding the console when one
CPU notices that another is AWOL. Unless I am missing something subtle.
Otherwise the cleanups look good!
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
When we hot-unplug a cpu and rebuild the sched-domain, all cpus will be
detatched. Alex observed the case where a runqueue was stealing bandwidth
from an already disabled runqueue to satisfy its own needs.
Stop this by skipping over already disabled runqueues.
Reported-by: Alex Nixon <alex.nixon@citrix.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Tested-by: Alex Nixon <alex.nixon@citrix.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Fix the setting of PF_SUPERPRIV by __capable() as it could corrupt the flags
the target process if that is not the current process and it is trying to
change its own flags in a different way at the same time.
__capable() is using neither atomic ops nor locking to protect t->flags. This
patch removes __capable() and introduces has_capability() that doesn't set
PF_SUPERPRIV on the process being queried.
This patch further splits security_ptrace() in two:
(1) security_ptrace_may_access(). This passes judgement on whether one
process may access another only (PTRACE_MODE_ATTACH for ptrace() and
PTRACE_MODE_READ for /proc), and takes a pointer to the child process.
current is the parent.
(2) security_ptrace_traceme(). This passes judgement on PTRACE_TRACEME only,
and takes only a pointer to the parent process. current is the child.
In Smack and commoncap, this uses has_capability() to determine whether
the parent will be permitted to use PTRACE_ATTACH if normal checks fail.
This does not set PF_SUPERPRIV.
Two of the instances of __capable() actually only act on current, and so have
been changed to calls to capable().
Of the places that were using __capable():
(1) The OOM killer calls __capable() thrice when weighing the killability of a
process. All of these now use has_capability().
(2) cap_ptrace() and smack_ptrace() were using __capable() to check to see
whether the parent was allowed to trace any process. As mentioned above,
these have been split. For PTRACE_ATTACH and /proc, capable() is now
used, and for PTRACE_TRACEME, has_capability() is used.
(3) cap_safe_nice() only ever saw current, so now uses capable().
(4) smack_setprocattr() rejected accesses to tasks other than current just
after calling __capable(), so the order of these two tests have been
switched and capable() is used instead.
(5) In smack_file_send_sigiotask(), we need to allow privileged processes to
receive SIGIO on files they're manipulating.
(6) In smack_task_wait(), we let a process wait for a privileged process,
whether or not the process doing the waiting is privileged.
I've tested this with the LTP SELinux and syscalls testscripts.
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Acked-by: Casey Schaufler <casey@schaufler-ca.com>
Acked-by: Andrew G. Morgan <morgan@kernel.org>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: James Morris <jmorris@namei.org>
This is an updated version of my previous cpuset patch on top of
the latest mainline git.
The patch fixes CPU hotplug handling issues in the current cpusets code.
Namely circular locking in rebuild_sched_domains() and unsafe access to
the cpu_online_map in the cpuset cpu hotplug handler.
This version includes changes suggested by Paul Jackson (naming, comments,
style, etc). I also got rid of the separate workqueue thread because it is
now safe to call get_online_cpus() from workqueue callbacks.
Here are some more details:
rebuild_sched_domains() is the only way to rebuild sched domains
correctly based on the current cpuset settings. What this means
is that we need to be able to call it from different contexts,
like cpu hotplug for example.
Also latest scheduler code in -tip now calls rebuild_sched_domains()
directly from functions like arch_reinit_sched_domains().
In order to support that properly we need to rework cpuset locking
rules to avoid circular dependencies, which is what this patch does.
New lock nesting rules are explained in the comments.
We can now safely call rebuild_sched_domains() from virtually any
context. The only requirement is that it needs to be called under
get_online_cpus(). This allows cpu hotplug handlers and the scheduler
to call rebuild_sched_domains() directly.
The rest of the cpuset code now offloads sched domains rebuilds to
a workqueue (async_rebuild_sched_domains()).
This version of the patch addresses comments from the previous review.
I fixed all miss-formated comments and trailing spaces.
I also factored out the code that builds domain masks and split up CPU and
memory hotplug handling. This was needed to simplify locking, to avoid unsafe
access to the cpu_online_map from mem hotplug handler, and in general to make
things cleaner.
The patch passes moderate testing (building kernel with -j 16, creating &
removing domains and bringing cpus off/online at the same time) on the
quad-core2 based machine.
It passes lockdep checks, even with preemptable RCU enabled.
This time I also tested in with suspend/resume path and everything is working
as expected.
Signed-off-by: Max Krasnyansky <maxk@qualcomm.com>
Acked-by: Paul Jackson <pj@sgi.com>
Cc: menage@google.com
Cc: a.p.zijlstra@chello.nl
Cc: vegard.nossum@gmail.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Use WARN() instead of a printk+WARN_ON() pair; this way the message
becomes part of the warning section for better reporting/collection.
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
fix:
kernel/built-in.o: In function `lockdep_stats_show':
lockdep_proc.c:(.text+0x3cb2f): undefined reference to `lockdep_count_forward_deps'
kernel/built-in.o: In function `l_show':
lockdep_proc.c:(.text+0x3d02b): undefined reference to `lockdep_count_forward_deps'
lockdep_proc.c:(.text+0x3d047): undefined reference to `lockdep_count_backward_deps'
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Switch /proc/irq/*/smp_affinity , /proc/irq/default_smp_affinity to
seq_files.
cat(1) reads with 1024 chunks by default, with high enough NR_CPUS, there
will be -EINVAL.
As side effect, there are now two less users of the ->read_proc interface.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Mike Travis <travis@sgi.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
s390 doesn't support the additional_cpus kernel parameter anymore since a
long time. So we better update the code and documentation to reflect
that.
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'core-fixes-for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
generic-ipi: fix stack and rcu interaction bug in smp_call_function_mask(), fix
* git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-for-linus:
fix spinlock recursion in hvc_console
stop_machine: remove unused variable
modules: extend initcall_debug functionality to the module loader
export virtio_rng.h
lguest: use get_user_pages_fast() instead of get_user_pages()
mm: Make generic weak get_user_pages_fast and EXPORT_GPL it
lguest: don't set MAC address for guest unless specified
> > Nick Piggin (1):
> > generic-ipi: fix stack and rcu interaction bug in
> > smp_call_function_mask()
>
> I'm still not 100% sure that I have this patch right... I might have seen
> a lockup trace implicating the smp call function path... which may have
> been due to some other problem or a different bug in the new call function
> code, but if some more people can take a look at it before merging?
OK indeed it did have a couple of bugs. Firstly, I wasn't freeing the
data properly in the alloc && wait case. Secondly, I wasn't resetting
CSD_FLAG_WAIT in the for each cpu loop (so only the first CPU would
wait).
After those fixes, the patch boots and runs with the kmalloc commented
out (so it always executes the slowpath).
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The kernel has this really nice facility where if you put "initcall_debug"
on the kernel commandline, it'll print which function it's going to
execute just before calling an initcall, and then after the call completes
it will
1) print if it had an error code
2) checks for a few simple bugs (like leaving irqs off)
and
3) print how long the init call took in milliseconds.
While trying to optimize the boot speed of my laptop, I have been loving
number 3 to figure out what to optimize... ... and then I wished that
the same thing was done for module loading.
This patch makes the module loader use this exact same functionality; it's
a logical extension in my view (since modules are just sort of late
binding initcalls anyway) and so far I've found it quite useful in finding
where things are too slow in my boot.
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
* 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
sched, cpu hotplug: fix set_cpus_allowed() use in hotplug callbacks
sched: fix mysql+oltp regression
sched_clock: delay using sched_clock()
sched clock: couple local and remote clocks
sched clock: simplify __update_sched_clock()
sched: eliminate scd->prev_raw
sched clock: clean up sched_clock_cpu()
sched clock: revert various sched_clock() changes
sched: move sched_clock before first use
sched: test runtime rather than period in global_rt_runtime()
sched: fix SCHED_HRTICK dependency
sched: fix warning in hrtick_start_fair()
* 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
posix-timers: fix posix_timer_event() vs dequeue_signal() race
posix-timers: do_schedule_next_timer: fix the setting of ->si_overrun
When we enable DEBUG_LOCK_ALLOC but do not enable PROVE_LOCKING and or
LOCK_STAT, lock_alloc() and lock_release() turn into nops, even though
we should be doing hlock checking (check=1).
This causes a false warning and a lockdep self-disable.
Rectify this.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Mark Langsdorf reported:
> One of my co-workers noticed that the powernow-k8
> driver no longer restarts when a CPU core is
> hot-disabled and then hot-enabled on AMD quad-core
> systems.
>
> The following comands work fine on 2.6.26 and fail
> on 2.6.27-rc1:
>
> echo 0 > /sys/devices/system/cpu/cpu3/online
> echo 1 > /sys/devices/system/cpu/cpu3/online
> find /sys -name cpufreq
>
> For 2.6.26, the find will return a cpufreq
> directory for each processor. In 2.6.27-rc1,
> the cpu3 directory is missing.
>
> After digging through the code, the following
> logic is failing when the core is hot-enabled
> at runtime. The code works during the boot
> sequence.
>
> cpumask_t = current->cpus_allowed;
> set_cpus_allowed_ptr(current, &cpumask_of_cpu(cpu));
> if (smp_processor_id() != cpu)
> return -ENODEV;
So set the CPU active before calling the CPU_ONLINE notifier chain,
there are a handful of notifiers that use set_cpus_allowed().
This fix also solves the problem with x86-microcode. I've sent
alternative patches for microcode, but as this "rely on
set_cpus_allowed_ptr() being workable in cpu-hotplug(CPU_ONLINE, ...)"
assumption seems to be more broad than what we thought, perhaps this fix
should be applied.
With this patch we define that by the moment CPU_ONLINE is being sent,
a 'cpu' is online and ready for tasks to be migrated onto it.
Signed-off-by: Dmitry Adamushko <dmitry.adamushko@gmail.com>
Reported-by: Mark Langsdorf <mark.langsdorf@amd.com>
Tested-by: Mark Langsdorf <mark.langsdorf@amd.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* Venki Pallipadi <venkatesh.pallipadi@intel.com> wrote:
> Found a OOPS on a big SMP box during an overnight reboot test with
> upstream git.
>
> Suresh and I looked at the oops and looks like the root cause is in
> generic_smp_call_function_interrupt() and smp_call_function_mask() with
> wait parameter.
>
> The actual oops looked like
>
> [ 11.277260] BUG: unable to handle kernel paging request at ffff8802ffffffff
> [ 11.277815] IP: [<ffff8802ffffffff>] 0xffff8802ffffffff
> [ 11.278155] PGD 202063 PUD 0
> [ 11.278576] Oops: 0010 [1] SMP
> [ 11.279006] CPU 5
> [ 11.279336] Modules linked in:
> [ 11.279752] Pid: 0, comm: swapper Not tainted 2.6.27-rc2-00020-g685d87f #290
> [ 11.280039] RIP: 0010:[<ffff8802ffffffff>] [<ffff8802ffffffff>] 0xffff8802ffffffff
> [ 11.280692] RSP: 0018:ffff88027f1f7f70 EFLAGS: 00010086
> [ 11.280976] RAX: 00000000ffffffff RBX: 0000000000000000 RCX: 0000000000000000
> [ 11.281264] RDX: 0000000000004f4e RSI: 0000000000000001 RDI: 0000000000000000
> [ 11.281624] RBP: ffff88027f1f7f98 R08: 0000000000000001 R09: ffffffff802509af
> [ 11.281925] R10: ffff8800280c2780 R11: 0000000000000000 R12: ffff88027f097d48
> [ 11.282214] R13: ffff88027f097d70 R14: 0000000000000005 R15: ffff88027e571000
> [ 11.282502] FS: 0000000000000000(0000) GS:ffff88027f1c3340(0000) knlGS:0000000000000000
> [ 11.283096] CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
> [ 11.283382] CR2: ffff8802ffffffff CR3: 0000000000201000 CR4: 00000000000006e0
> [ 11.283760] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 11.284048] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> [ 11.284337] Process swapper (pid: 0, threadinfo ffff88027f1f2000, task ffff88027f1f0640)
> [ 11.284936] Stack: ffffffff80250963 0000000000000212 0000000000ee8c78 0000000000ee8a66
> [ 11.285802] ffff88027e571550 ffff88027f1f7fa8 ffffffff8021adb5 ffff88027f1f3e40
> [ 11.286599] ffffffff8020bdd6 ffff88027f1f3e40 <EOI> ffff88027f1f3ef8 0000000000000000
> [ 11.287120] Call Trace:
> [ 11.287768] <IRQ> [<ffffffff80250963>] ? generic_smp_call_function_interrupt+0x61/0x12c
> [ 11.288354] [<ffffffff8021adb5>] smp_call_function_interrupt+0x17/0x27
> [ 11.288744] [<ffffffff8020bdd6>] call_function_interrupt+0x66/0x70
> [ 11.289030] <EOI> [<ffffffff8024ab3b>] ? clockevents_notify+0x19/0x73
> [ 11.289380] [<ffffffff803b9b75>] ? acpi_idle_enter_simple+0x18b/0x1fa
> [ 11.289760] [<ffffffff803b9b6b>] ? acpi_idle_enter_simple+0x181/0x1fa
> [ 11.290051] [<ffffffff8053aeca>] ? cpuidle_idle_call+0x70/0xa2
> [ 11.290338] [<ffffffff80209f61>] ? cpu_idle+0x5f/0x7d
> [ 11.290723] [<ffffffff8060224a>] ? start_secondary+0x14d/0x152
> [ 11.291010]
> [ 11.291287]
> [ 11.291654] Code: Bad RIP value.
> [ 11.292041] RIP [<ffff8802ffffffff>] 0xffff8802ffffffff
> [ 11.292380] RSP <ffff88027f1f7f70>
> [ 11.292741] CR2: ffff8802ffffffff
> [ 11.310951] ---[ end trace 137c54d525305f1c ]---
>
> The problem is with the following sequence of events:
>
> - CPU A calls smp_call_function_mask() for CPU B with wait parameter
> - CPU A sets up the call_function_data on the stack and does an rcu add to
> call_function_queue
> - CPU A waits until the WAIT flag is cleared
> - CPU B gets the call function interrupt and starts going through the
> call_function_queue
> - CPU C also gets some other call function interrupt and starts going through
> the call_function_queue
> - CPU C, which is also going through the call_function_queue, starts referencing
> CPU A's stack, as that element is still in call_function_queue
> - CPU B finishes the function call that CPU A set up and as there are no other
> references to it, rcu deletes the call_function_data (which was from CPU A
> stack)
> - CPU B sees the wait flag and just clears the flag (no call_rcu to free)
> - CPU A which was waiting on the flag continues executing and the stack
> contents change
>
> - CPU C is still in rcu_read section accessing the CPU A's stack sees
> inconsistent call_funation_data and can try to execute
> function with some random pointer, causing stack corruption for A
> (by clearing the bits in mask field) and oops.
Nice debugging work.
I'd suggest something like the attached (boot tested) patch as the simple
fix for now.
I expect the benefits from the less synchronized, multiple-in-flight-data
global queue will still outweigh the costs of dynamic allocations. But
if worst comes to worst then we just go back to a globally synchronous
one-at-a-time implementation, but that would be pretty sad!
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Defer commit 6d299f1b53 to the next release.
Testing of the tip/sched/clock tree revealed a mysql+oltp regression
which bisection eventually traced back to this commit in mainline.
Pertinent test results: Three run sysbench averages, throughput units
in read/write requests/sec.
clients 1 2 4 8 16 32 64
6e0534f 9646 17876 34774 33868 32230 30767 29441
2.6.26.1 9112 17936 34652 33383 31929 30665 29232
6d299f1 9112 14637 28370 33339 32038 30762 29204
Note: subsequent commits hide the majority of this regression until you
apply the clock fixes, at which time it reemerges at full magnitude.
We cannot see anything bad about the change itself so we defer it to the
next release until this problem is fully analysed.
Signed-off-by: Mike Galbraith <efault@gmx.de>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Gregory Haskins <ghaskins@novell.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
this is a diagnostic patch for Classic RCU.
The approach is to record a timestamp at the beginning
of the grace period (in rcu_start_batch()), then have
rcu_check_callbacks() complain if:
1. it is running on a CPU that has holding up grace periods for
a long time (say one second). This will identify the culprit
assuming that the culprit has not disabled hardware irqs,
instruction execution, or some such.
2. it is running on a CPU that is not holding up grace periods,
but grace periods have been held up for an even longer time
(say two seconds).
It is enabled via the default-off CONFIG_DEBUG_RCU_STALL kernel parameter.
Rather than exponential backoff, it backs off to once per 30 seconds.
My feeling upon thinking on it was that if you have stalled RCU grace
periods for that long, a few extra printk() messages are probably the
least of your worries...
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Yinghai Lu <yhlu.kernel@gmail.com>
Cc: David Witbrodt <dawitbro@sbcglobal.net>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
the names were too generic:
drivers/uio/uio.c:87: error: expected identifier or '(' before 'do'
drivers/uio/uio.c:87: error: expected identifier or '(' before 'while'
drivers/uio/uio.c:113: error: 'map_release' undeclared here (not in a function)
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Solve this by marking the classes as unused and not printing information
about the unused classes.
Reported-by: Eric Sesterhenn <snakebyte@gmx.de>
Signed-off-by: Rabin Vincent <rabin@rab.in>
Acked-by: Huang Ying <ying.huang@intel.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Expose the new lock protection lock.
This can be used to annotate places where we take multiple locks of the
same class and avoid deadlocks by always taking another (top-level) lock
first.
NOTE: we're still bound to the MAX_LOCK_DEPTH (48) limit.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
On Fri, 2008-08-01 at 16:26 -0700, Linus Torvalds wrote:
> On Fri, 1 Aug 2008, David Miller wrote:
> >
> > Taking more than a few locks of the same class at once is bad
> > news and it's better to find an alternative method.
>
> It's not always wrong.
>
> If you can guarantee that anybody that takes more than one lock of a
> particular class will always take a single top-level lock _first_, then
> that's all good. You can obviously screw up and take the same lock _twice_
> (which will deadlock), but at least you cannot get into ABBA situations.
>
> So maybe the right thing to do is to just teach lockdep about "lock
> protection locks". That would have solved the multi-queue issues for
> networking too - all the actual network drivers would still have taken
> just their single queue lock, but the one case that needs to take all of
> them would have taken a separate top-level lock first.
>
> Never mind that the multi-queue locks were always taken in the same order:
> it's never wrong to just have some top-level serialization, and anybody
> who needs to take <n> locks might as well do <n+1>, because they sure as
> hell aren't going to be on _any_ fastpaths.
>
> So the simplest solution really sounds like just teaching lockdep about
> that one special case. It's not "nesting" exactly, although it's obviously
> related to it.
Do as Linus suggested. The lock protection lock is called nest_lock.
Note that we still have the MAX_LOCK_DEPTH (48) limit to consider, so anything
that spills that it still up shit creek.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Most the free-standing lock_acquire() usages look remarkably similar, sweep
them into a new helper.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Instead of using a per-rq lock class, use the regular nesting operations.
However, take extra care with double_lock_balance() as it can release the
already held rq->lock (and therefore change its nesting class).
So what can happen is:
spin_lock(rq->lock); // this rq subclass 0
double_lock_balance(rq, other_rq);
// release rq
// acquire other_rq->lock subclass 0
// acquire rq->lock subclass 1
spin_unlock(other_rq->lock);
leaving you with rq->lock in subclass 1
So a subsequent double_lock_balance() call can try to nest a subclass 1
lock while already holding a subclass 1 lock.
Fix this by introducing double_unlock_balance() which releases the other
rq's lock, but also re-sets the subclass for this rq's lock to 0.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
this can be used to reset a held lock's subclass, for arbitrary-depth
iterated data structures such as trees or lists which have per-node
locks.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Some arch's can't handle sched_clock() being called too early - delay
this until sched_clock_init() has been called.
Reported-by: Bill Gatliff <bgat@billgatliff.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Tested-by: Nishanth Aravamudan <nacc@us.ibm.com>
CC: Russell King - ARM Linux <linux@arm.linux.org.uk>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
A documentation cleanup patch. With a minor tweak to clarify units for
kbs.
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: mark gross <mgross@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
get_order() takes byte-sized input, not a page-granular one.
Irrespective of this fix I'm inclined to believe that this doesn't work
right anyway - bitmap_allocate_region() has an implicit assumption of
'pos' being suitable for 'order', which this function doesn't seem to
enforce (and since it's being called with a byte-granular value there's no
reason to believe that the callers would make sure device_addr is passed
accordingly - it's also not documented that way).
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Dmitry Baryshkov <dbaryshkov@gmail.com>
Cc: Jesse Barnes <jbarnes@virtuousgeek.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
While I'm glad to finally see the hole fixed whereby passing an invalid
IRQ trigger type to request_irq() would be ignored, the current diagnostic
isn't quite useful. Fixed by also listing the trigger type which was
rejected.
Signed-off-by: David Brownell <dbrownell@users.sourceforge.net>
Acked-by: Uwe Kleine-König <Uwe.Kleine-Koenig@digi.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Change __down_common() to use signal_pending_state() instead of open
coding.
The changes in kernel/semaphore.o are just artifacts, the state checks are
optimized away.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Matthew Wilcox <matthew@wil.cx>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In relay's current read implementation, if the buffer is completely full
but hasn't triggered the buffer-full condition (i.e. the last write
didn't cross the subbuffer boundary) and the last subbuffer is exactly
full, the subbuffer accounting code erroneously finds nothing available.
This patch fixes the problem.
Signed-off-by: Tom Zanussi <tzanussi@gmail.com>
Cc: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Jens Axboe <jens.axboe@oracle.com>
Cc: Mathieu Desnoyers <compudj@krystal.dyndns.org>
Cc: Andrea Righi <righi.andrea@gmail.com>
Cc: <stable@kernel.org> [2.6.25.x, 2.6.26.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'audit.b56' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/audit-current:
Re: [PATCH] Fix the kernel panic of audit_filter_task when key field is set
The "user" parameter to __sched_setscheduler indicates whether the
change is being done on behalf of a user process or not. If not, we
shouldn't apply any permissions checks, so don't call
security_task_setscheduler().
Signed-off-by: Jeremy Fitzhardinge <jeremy@goop.org>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Sorry, I miss a blank between if and "(".
And I add "unlikely" to check "ctx" in audit_match_perm() and audit_match_filetype().
This is a new patch for it.
Signed-off-by: Zhang Xiliang <zhangxiliang@cn.fujitsu.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
My commit 2b2a1ff64a introduced a regression
(sorry about that) for the odd case of exit_signal=0 (e.g. clone_flags=0).
This is not a normal use, but it's used by a case in the glibc test suite.
Dying with exit_signal=0 sends no signal, but it's supposed to wake up a
parent's blocked wait*() calls (unlike the delayed_group_leader case).
This fixes tracehook_notify_death() and its caller to distinguish a
"signal 0" wakeup from the delayed_group_leader case (with no wakeup).
Signed-off-by: Roland McGrath <roland@redhat.com>
Tested-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jwessel/linux-2.6-kgdb:
kgdb: fix gdb serial thread queries
kgdb: fix kgdb_validate_break_address to perform a mem write
kgdb: remove the requirement for CONFIG_FRAME_POINTER
When the "status_get->mask" is "AUDIT_STATUS_RATE_LIMIT || AUDIT_STATUS_BACKLOG_LIMIT".
If "audit_set_rate_limit" fails and "audit_set_backlog_limit" succeeds, the "err" value
will be greater than or equal to 0. It will miss the failure of rate set.
Signed-off-by: Zhang Xiliang <zhangxiliang@cn.fujitsu.com>
Acked-by: Eric Paris <eparis@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
When calling audit_filter_task(), it calls audit_filter_rules() with audit_context is NULL.
If the key field is set, the result in audit_filter_rules() will be set to 1 and
ctx->filterkey will be set to key.
But the ctx is NULL in this condition, so kernel will panic.
Signed-off-by: Zhang Xiliang <zhangxiliang@cn.fujitsu.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
> shouldn't these be using the "audit_get_loginuid(current)" and if we
> are going to output loginuid we also should be outputting sessionid
Thanks for your detailed explanation.
I have made a new patch for outputing "loginuid" and "sessionid" by audit_get_loginuid(current) and audit_get_sessionid(current).
If there are some deficiencies, please give me your indication.
Signed-off-by: Zhang Xiliang <zhangxiliang@cn.fujitsu.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Hello,
According to my understanding there is an off-by-one bug in the
function:
audit_string_contains_control()
in:
kernel/audit.c
Patch is included.
I do not know from how many places the function is called from, but for
example, SELinux Access Vector Cache tries to log untrusted filenames via
call path:
avc_audit()
audit_log_untrustedstring()
audit_log_n_untrustedstring()
audit_string_contains_control()
If audit_string_contains_control() detects control characters, then the
string is hex-encoded. But the hex=0x7f dec=127, DEL-character, is not
detected.
I guess this could have at least some minor security implications, since a
user can create a filename with 0x7f in it, causing logged filename to
possibly look different when someone reads it on the terminal.
Signed-off-by: Vesa-Matti Kari <vmkari@cc.helsinki.fi>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Makes the kernel audit subsystem collect information about the sending
process when that process sends SIGUSR2 to the userspace audit daemon.
SIGUSR2 is a new interesting signal to auditd telling auditd that it
should try to start logging to disk again and the error condition which
caused it to stop logging to disk (usually out of space) has been
rectified.
Signed-off-by: Eric Paris <eparis@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The command "info threads" did not work correctly with kgdb. It would
result in a silent kernel hang if used.
This patach addresses several problems.
- Fix use of deprecated NR_CPUS
- Fix kgdb to not walk linearly through the pid space
- Correctly implement shadow pids
- Change the threads per query to a #define
- Fix kgdb_hex2long to work with negated values
The threads 0 and -1 are reserved to represent the current task. That
means that CPU 0 will start with a shadow thread id of -2, and CPU 1
will have a shadow thread id of -3, etc...
From the debugger you can switch to a shadow thread to see what one of
the other cpus was doing, however it is not possible to execute run
control operations on any other cpu execept the cpu executing the
kgdb_handle_exception().
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
A regression to the kgdb core was found in the case of using the
CONFIG_DEBUG_RODATA kernel option. When this option is on, a breakpoint
cannot be written into any readonly memory page. When an external
debugger requests a breakpoint to get set, the
kgdb_validate_break_address() was only checking to see if the address
to place the breakpoint was readable and lacked a write check.
This patch changes the validate routine to try reading (via the
breakpoint set request) and also to try immediately writing the break
point. If either fails, an error is correctly returned and the
debugger behaves correctly. Then an end user can make the
descision to use hardware breakpoints.
Also update the documentation to reflect that using
CONFIG_DEBUG_RODATA will inhibit the use of software breakpoints.
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
While thinking about David's graph walk lockdep patch it _finally_
dawned on me that there is no reason we have a lock class per cpu ...
Sorry for being dense :-/
The below changes the annotation from a lock class per cpu, to a single
nested lock, as the scheduler never holds more that 2 rq locks at a time
anyway.
If there was code requiring holding all rq locks this would not work and
the original annotation would be the only option, but that not being the
case, this is a much lighter one.
Compiles and boots on a 2-way x86_64.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: David Miller <davem@davemloft.net>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
When we traverse the graph, either forwards or backwards, we
are interested in whether a certain property exists somewhere
in a node reachable in the graph.
Therefore it is never necessary to traverse through a node more
than once to get a correct answer to the given query.
Take advantage of this property using a global ID counter so that we
need not clear all the markers in all the lock_class entries before
doing a traversal. A new ID is choosen when we start to traverse, and
we continue through a lock_class only if it's ID hasn't been marked
with the new value yet.
This short-circuiting is essential especially for high CPU count
systems. The scheduler has a runqueue per cpu, and needs to take
two runqueue locks at a time, which leads to long chains of
backwards and forwards subgraphs from these runqueue lock nodes.
Without the short-circuit implemented here, a graph traversal on
a runqueue lock can take up to (1 << (N - 1)) checks on a system
with N cpus.
For anything more than 16 cpus or so, lockdep will eventually bring
the machine to a complete standstill.
Signed-off-by: David S. Miller <davem@davemloft.net>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
When taking the time of a remote CPU, use the opportunity to
couple (sync) the clocks to each other. (in a monotonic way)
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Mike Galbraith <efault@gmx.de>
- return the current clock instead of letting callers
fetch it from scd->clock
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Mike Galbraith <efault@gmx.de>
eliminate prev_raw and use tick_raw instead.
It's enough to base the current time on the scheduler tick timestamp
alone - the monotonicity and maximum checks will prevent any damage.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Mike Galbraith <efault@gmx.de>