Commit Graph

3665 Commits

Author SHA1 Message Date
Jens Axboe
d6b29d7cee splice: divorce the splice structure/function definitions from the pipe header
We need to move even more stuff into the header so that folks can use
the splice_to_pipe() implementation instead of open-coding a lot of
pipe knowledge (see relay implementation), so move to our own header
file finally.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-07-10 08:04:14 +02:00
Tom Zanussi
ebf9909343 splice: relay support
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-07-10 08:04:14 +02:00
Ingo Molnar
c31f2e8a42 sched: add CFS credits
add credits for recent major scheduler contributions:

  Con Kolivas, for pioneering the fair-scheduling approach
  Peter Williams, for smpnice
  Mike Galbraith, for interactivity tuning of CFS
  Srivatsa Vaddagiri, for group scheduling enhancements

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-07-09 18:52:01 +02:00
Ingo Molnar
0fec171cdb sched: clean up sleep_on() APIs
clean up the sleep_on() APIs:

 - do not use fastcall
 - replace fragile macro magic with proper inline functions

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-07-09 18:52:01 +02:00
Ingo Molnar
9761eea851 sched: style cleanups
4 small style cleanups to sched.c: checkpatch.pl is now happy about
the totality of sched.c [ignoring false positives] - yay! ;-)

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-07-09 18:52:00 +02:00
Ingo Molnar
23bdd703a5 sched: do not set softirqs to nice +19
do not set softirqs to nice +19. _If_ for whatever reason
we missed to process some high-prio softirq and woke up
ksoftirqd, we should give it a fair chance to actually
get some work done, even if the system is under load.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-07-09 18:52:00 +02:00
Ingo Molnar
43ae34cb4c sched: scheduler debugging, core
scheduler debugging core: implement /proc/sched_debug and
/proc/<PID>/sched files for scheduler debugging.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-07-09 18:52:00 +02:00
Ingo Molnar
77e54a1f88 sched: add CFS debug sysctls
add CFS debug sysctls: only tweakable if SCHED_DEBUG is enabled.
This allows for faster debugging of scheduler problems.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-07-09 18:52:00 +02:00
Ingo Molnar
b2cfba19f6 sched: remove unused rq types from sched.c
remove unused rq types from sched.c, now that we switched
over to CFS.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-07-09 18:52:00 +02:00
Ingo Molnar
634fa8c97c sched: remove interactivity types
remove now unused interactivity-heuristics related defined and
types of the old scheduler.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-07-09 18:52:00 +02:00
Ingo Molnar
dff06c157b sched: clean up include files in sched.c
clean up include files in sched.c, they were still old-style <asm/>.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-07-09 18:52:00 +02:00
Balbir Singh
172ba844a8 sched: update delay-accounting to use CFS's precise stats
update delay-accounting to use CFS's precise stats.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-07-09 18:52:00 +02:00
Ingo Molnar
1b9f19c212 sched: turn on the use of unstable events
make use of sched-clock-unstable events.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-07-09 18:51:59 +02:00
Ingo Molnar
bb29ab2686 sched: x86, track TSC-unstable events
track TSC-unstable events and propagate it to the scheduler code.
Also allow sched_clock() to be used when the TSC is unstable,
the rq_clock() wrapper creates a reliable clock out of it.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-07-09 18:51:59 +02:00
Ingo Molnar
dd41f596cd sched: cfs core code
apply the CFS core code.

this change switches over the scheduler core to CFS's modular
design and makes use of kernel/sched_fair/rt/idletask.c to implement
Linux's scheduling policies.

thanks to Andrew Morton and Thomas Gleixner for lots of detailed review
feedback and for fixlets.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Dmitry Adamushko <dmitry.adamushko@gmail.com>
Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
2007-07-09 18:51:59 +02:00
Ingo Molnar
f3479f10c5 sched: remove the sleep-bonus interactivity code
remove the sleep-bonus interactivity code from the core scheduler.

scheduling policy is implemented in the policy modules, and CFS does
not need such type of heuristics.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-07-09 18:51:59 +02:00
Ingo Molnar
c18a17329b sched: remove expired_starving()
remove the expired_starving() heuristics from the core scheduler.

CFS does not need it, and this did not really work well in practice
anyway, due to the rq->nr_running multiplier to STARVATION_LIMIT.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-07-09 18:51:59 +02:00
Ingo Molnar
f2ac58ee61 sched: remove sleep_type
remove the sleep_type heuristics from the core scheduler - scheduling
policy is implemented in the scheduling-policy modules. (and CFS does
not use this type of sleep-type heuristics)

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-07-09 18:51:59 +02:00
Ingo Molnar
45bf76df48 sched: cfs, add load-calculation methods
add the new load-calculation methods of CFS.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-07-09 18:51:59 +02:00
Ingo Molnar
14531189f0 sched: clean up __normal_prio() position
clean up: move __normal_prio() in head of normal_prio().

no code changed.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-07-09 18:51:59 +02:00
Ingo Molnar
71f8bd4600 sched: cleanup: move dequeue/enqueue_task()
cleanup: move dequeue/enqueue_task() to a more logical place, to
not split up __normal_prio()/normal_prio().

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-07-09 18:51:59 +02:00
Ingo Molnar
c24d20dbef sched: move around resched_task()
move resched_task()/resched_cpu() into the 'public interfaces'
section of sched.c, for use by kernel/sched_fair/rt/idletask.c

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-07-09 18:51:59 +02:00
Ingo Molnar
e05606d330 sched: clean up the rt priority macros
clean up the rt priority macros, pointed out by Andrew Morton.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-07-09 18:51:59 +02:00
Ingo Molnar
138a8aeb5b sched: add cfs_rq ops
add the set_task_cfs_rq() abstraction needed by CONFIG_FAIR_GROUP_SCHED.

(not activated yet)

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-07-09 18:51:58 +02:00
Ingo Molnar
41b86e9c51 sched: make posix-cpu-timers use CFS's accounting information
update the posix-cpu-timers code to use CFS's CPU accounting information.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-07-09 18:51:58 +02:00
Ingo Molnar
20d315d42a sched: add rq_clock()/__rq_clock()
add rq_clock()/__rq_clock(), a robust wrapper around sched_clock(),
used by CFS. It protects against common type of sched_clock() problems
(caused by hardware): time warps forwards and backwards.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-07-09 18:51:58 +02:00
Ingo Molnar
6aa645ea5f sched: cfs rq data types
add the CFS rq data types to sched.c.

(the old scheduler fields are still intact, they are removed
 by a later patch)

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-07-09 18:51:58 +02:00
Ingo Molnar
fa72e9e484 sched: cfs core, kernel/sched_idletask.c
add kernel/sched_idletask.c - which implements the idle thread
scheduling class. This further simplifies sched.c (under CFS),
for example a number of 'if (p == rq->idle)' type of special-cases
can be removed from sched.c, and schedule() gets simpler too.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-07-09 18:51:58 +02:00
Ingo Molnar
bb44e5d1c6 sched: cfs core, kernel/sched_rt.c
add kernel/sched_rt.c: SCHED_FIFO/SCHED_RR support. The behavior
and semantics of SCHED_FIFO/SCHED_RR tasks is unchanged.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-07-09 18:51:58 +02:00
Ingo Molnar
bf0f6f24a1 sched: cfs core, kernel/sched_fair.c
add kernel/sched_fair.c - which implements the bulk of CFS's
behavioral changes for SCHED_OTHER tasks.

see Documentation/sched-design-CFS.txt about details.

Authors:

 Ingo Molnar <mingo@elte.hu>
 Dmitry Adamushko <dmitry.adamushko@gmail.com>
 Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
 Mike Galbraith <efault@gmx.de>

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Dmitry Adamushko <dmitry.adamushko@gmail.com>
Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
2007-07-09 18:51:58 +02:00
Ingo Molnar
425e0968a2 sched: move code into kernel/sched_stats.h
create sched_stats.h and move sched.c schedstats code into it.
This cleans up sched.c a bit.

no code changes are caused by this patch.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-07-09 18:51:58 +02:00
Ingo Molnar
1df21055e3 sched: add init_idle_bootup_task()
add the init_idle_bootup_task() callback to the bootup thread,
unused at the moment. (CFS will use it to switch the scheduling
class of the boot thread to the idle class)

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-07-09 18:51:58 +02:00
Ingo Molnar
f64f61145a sched: remove sched_exit()
remove sched_exit(): the elaborate dance of us trying to recover
timeslices given to child tasks never really worked.

CFS does not need it either.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-07-09 18:51:58 +02:00
Ingo Molnar
c65cc87052 sched: uninline set_task_cpu()
uninline set_task_cpu(): CFS will add more code to it.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-07-09 18:51:58 +02:00
Ingo Molnar
0437e109e1 sched: zap the migration init / cache-hot balancing code
the SMP load-balancer uses the boot-time migration-cost estimation
code to attempt to improve the quality of balancing. The reason for
this code is that the discrete priority queues do not preserve
the order of scheduling accurately, so the load-balancer skips
tasks that were running on a CPU 'recently'.

this code is fundamental fragile: the boot-time migration cost detector
doesnt really work on systems that had large L3 caches, it caused boot
delays on large systems and the whole cache-hot concept made the
balancing code pretty undeterministic as well.

(and hey, i wrote most of it, so i can say it out loud that it sucks ;-)

under CFS the same purpose of cache affinity can be achieved without
any special cache-hot special-case: tasks are sorted in the 'timeline'
tree and the SMP balancer picks tasks from the left side of the
tree, thus the most cache-cold task is balanced automatically.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-07-09 18:51:57 +02:00
Ingo Molnar
d15bcfdbe1 sched: rename idle_type/SCHED_IDLE
enum idle_type (used by the load-balancer) clashes with the
SCHED_IDLE name that we want to introduce. 'CPU_IDLE' instead
of 'SCHED_IDLE' is more descriptive as well.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-07-09 18:51:57 +02:00
Thomas Gleixner
746976a301 NTP: remove clock_was_set() call to prevent deadlock
The clock_was_set() call in seconds_overflow() which happens only when
leap seconds are inserted / deleted is wrong in two aspects:

1. it results in a call to on_each_cpu() with interrupts disabled
2. it is potential deadlock source vs. call_lock in smp_call_function()

The only possible side effect of the removal might be, that an absolute
CLOCK_REALTIME timer fires 1 second too late, in the rare case of leap
second deletion and an absolute CLOCK_REALTIME timer which expires in
the affected time frame. It will never fire too early.

This was probably observed by the reporter of a June 30th -> July 1st
hang: http://lkml.org/lkml/2007/7/3/103

A similar problem was observed by Dave Jones, who provided a screen shot
with a lockdep back trace, which allowed to analyse the problem.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-03 13:54:27 -07:00
Rafael J. Wysocki
2391dae3e3 PM: introduce set_target method in pm_ops
Commit 52ade9b3b9 changed the suspend code
ordering to execute pm_ops->prepare() after the device model per-device
.suspend() calls in order to fix some ACPI-related issues.  Unfortunately, it
broke the at91 platform which assumed that pm_ops->prepare() would be called
before suspending devices.

at91 used pm_ops->prepare() to get notified of the target system sleep state,
so that it could use this information while suspending devices.  However, with
the current suspend code ordering pm_ops->prepare() is called too late for
this purpose.  Thus, at91 needs an additional method in 'struct pm_ops' that
will be used for notifying the platform of the target system sleep state.
Moreover, in the future such a method will also be needed by ACPI.

This patch adds the .set_target() method to 'struct pm_ops' and makes the
suspend code call it, if implemented, before executing the device model
per-device .suspend() calls.  It also modifies the at91 code to use
pm_ops->set_target() instead of pm_ops->prepare().

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: David Brownell <dbrownell@users.sourceforge.net>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: Len Brown <lenb@kernel.org>
Cc: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-01 12:29:44 -07:00
Masami Hiramatsu
a66e356c04 relayfs: fix overwrites
When I use relayfs with "overwrite" mode, read() still sets incorrect
number of consumed bytes.

Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Acked-by: Tom Zanussi <zanussi@us.ibm.com>
Acked-by: David Wilder <dwilder@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-06-28 11:38:18 -07:00
David Wilder
8d62fdebda relay file read: start-pos fix
Fix a bug in the relay read interface causing the number of consumed bytes
to be set incorrectly.

Signed-off-by: Tom Zanussi <zanussi@us.ibm.com>
Signed-off-by: David Wilder <dwilder@us.ibm.com>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-06-28 11:34:54 -07:00
Thomas Gleixner
a06381fec7 FUTEX: Restore the dropped ERSCH fix
The return value of futex_find_get_task() needs to be -ESRCH in case
that the search fails.  This was part of the original futex fixes and
got accidentally dropped, when the futex-tidy-up patch was split out.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Stable Team <stable@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-06-24 12:08:53 -07:00
Tony Jones
7b018b2888 audit: fix oops removing watch if audit disabled
Removing a watched file will oops if audit is disabled (auditctl -e 0).

To reproduce:
- auditctl -e 1
- touch /tmp/foo
- auditctl -w /tmp/foo
- auditctl -e 0
- rm /tmp/foo (or mv)

Signed-off-by: Tony Jones <tonyj@suse.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-06-24 08:59:12 -07:00
Christoph Lameter
92c4ca5c3a sched: fix next_interval determination in idle_balance()
The intervals of domains that do not have SD_BALANCE_NEWIDLE must be
considered for the calculation of the time of the next balance.  Otherwise
we may defer rebalancing forever.

Siddha also spotted that the conversion of the balance interval
to jiffies is missing. Fix that to.

From: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>

also continue the loop if !(sd->flags & SD_LOAD_BALANCE).

Tested-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

It did in fact trigger under all three of mainline, CFS, and -rt including CFS
-- see below for a couple of emails from last Friday giving results for these
three on the AMD box (where it happened) and on a single-quad NUMA-Q system
(where it did not, at least not with such severity).

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-06-24 08:59:11 -07:00
Cedric Le Goater
4e71e474c7 fix refcounting of nsproxy object when unshared
When a namespace is unshared, a refcount on the previous nsproxy is
abusively taken, leading to a memory leak of nsproxy objects.

Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
Cc: Badari Pulavarty <pbadari@us.ibm.com>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-06-24 08:59:10 -07:00
Thomas Gleixner
58229a1899 posix-timers: Prevent softirq starvation by small intervals and SIG_IGN
posix-timers which deliver an ignored signal are currently rearmed in
the timer softirq: This is necessary because the timer needs to be
delivered again when SIG_IGN is removed. This is not a problem, when
the interval is reasonable.

With high resolution timers enabled one might arm a posix timer with a
very small interval and ignore the signal. This might lead to a
softirq starvation when the interval is so small that the timer is
requeued onto the softirq pending list right away.

This problem was pointed out by Jan Kiszka. Thanks Jan !

The correct solution would be to stop the timer, when the signal is
ignored and rearm it when SIG_IGN is removed. Unfortunately this
requires modification in sigaction and involves non trivial sighand
locking. It's too late in the release cycle for such a change.

For now we just keep the timer running and enforce that the timer only
fires every jiffie. This does not break anything as we keep the
overrun counter correct. It adds a little inaccuracy to the
timer_gettime() interface, but...

The more complex change is necessary anyway to fix another short
coming of the current implementation, which I discovered while looking
at this problem: A pending signal is discarded when SIG_IGN is set. In
case that a posixtimer signal is pending then it is discarded as well,
but when SIG_IGN is removed later nothing rearms the timer. This is
not new, it's that way since posix timers have been merged. So nothing
to worry about right now.

I have a working solution to fix all of this, but the impact is too
large for both stable and 2.6.22. I'm going to send it out for review
in the next days.

This should go into 2.6.21.stable as well.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: Jan Kiszka <jan.kiszka@web.de>
Cc: Ulrich Drepper <drepper@redhat.com>
Cc: Stable Team <stable@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-06-21 15:57:04 -07:00
Linus Torvalds
fa490cfd15 Fix possible runqueue lock starvation in wait_task_inactive()
Miklos Szeredi reported very long pauses (several seconds, sometimes
more) on his T60 (with a Core2Duo) which he managed to track down to
wait_task_inactive()'s open-coded busy-loop.

He observed that an interrupt on one core tries to acquire the
runqueue-lock but does not succeed in doing so for a very long time -
while wait_task_inactive() on the other core loops waiting for the first
core to deschedule a task (which it wont do while spinning in an
interrupt handler).

This rewrites wait_task_inactive() to do all its waiting optimistically
without any locks taken at all, and then just double-check the end
result with the proper runqueue lock held over just a very short
section.  If there were races in the optimistic wait, of a preemption
event scheduled the process away, we simply re-synchronize, and start
over.

So the code now looks like this:

	repeat:
		/* Unlocked, optimistic looping! */
		rq = task_rq(p);
		while (task_running(rq, p))
			cpu_relax();

		/* Get the *real* values */
		rq = task_rq_lock(p, &flags);
		running = task_running(rq, p);
		array = p->array;
		task_rq_unlock(rq, &flags);

		/* Check them.. */
		if (unlikely(running)) {
			cpu_relax();
			goto repeat;
		}

		/* Preempted away? Yield if so.. */
		if (unlikely(array)) {
			yield();
			goto repeat;
		}

Basically, that first "while()" loop is done entirely without any
locking at all (and doesn't check for the case where the target process
might have been preempted away), and so it's possibly "incorrect", but
we don't really care.  Both the runqueue used, and the "task_running()"
check might be the wrong tests, but they won't oops - they just mean
that we could possibly get the wrong results due to lack of locking and
exit the loop early in the case of a race condition.

So once we've exited the loop, we then get the proper (and careful) rq
lock, and check the running/runnable state _safely_.  And if it turns
out that our quick-and-dirty and unsafe loop was wrong after all, we
just go back and try it all again.

(The patch also adds a lot of comments, which is the actual bulk of it
all, to make it more obvious why we can do these things without holding
the locks).

Thanks to Miklos for all the testing and tracking it down.

Tested-by: Miklos Szeredi <miklos@szeredi.hu>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-06-18 11:52:55 -07:00
Ingo Molnar
a0f98a1cb7 sched: fix SysRq-N (normalize RT tasks)
Gene Heskett reported the following problem while testing CFS: SysRq-N
is not always effective in normalizing tasks back to SCHED_OTHER.

The reason for that turns out to be the following bug:

 - normalize_rt_tasks() uses for_each_process() to iterate through all
   tasks in the system.  The problem is, this method does not iterate
   through all tasks, it iterates through all thread groups.

The proper mechanism to enumerate over all threads is to use a
do_each_thread() + while_each_thread() loop.

Reported-by: Gene Heskett <gene.heskett@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-06-18 11:52:55 -07:00
Benjamin Herrenschmidt
caec4e8dc8 Fix signalfd interaction with thread-private signals
Don't let signalfd dequeue private signals off other threads (in the
case of things like SIGILL or SIGSEGV, trying to do so would result
in undefined behaviour on who actually gets the signal, since they
are force unblocked).

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Acked-by: Davide Libenzi <davidel@xmailserver.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-06-18 10:18:32 -07:00
Thomas Gleixner
bd197234b0 Revert "futex_requeue_pi optimization"
This reverts commit d0aa7a70bf.

It not only introduced user space visible changes to the futex syscall,
it is also non-functional and there is no way to fix it proper before
the 2.6.22 release.

The breakage report ( http://lkml.org/lkml/2007/5/12/17 ) went
unanswered, and unfortunately it turned out that the concept is not
feasible at all.  It violates the rtmutex semantics badly by introducing
a virtual owner, which hacks around the coupling of the user-space
pi_futex and the kernel internal rt_mutex representation.

At the moment the only safe option is to remove it fully as it contains
user-space visible changes to broken kernel code, which we do not want
to expose in the 2.6.22 release.

The patch reverts the original patch mostly 1:1, but contains a couple
of trivial manual cleanups which were necessary due to patches, which
touched the same area of code later.

Verified against the glibc tests and my own PI futex tests.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Ulrich Drepper <drepper@redhat.com>
Cc: Pierre Peiffer <pierre.peiffer@bull.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-06-18 09:48:41 -07:00
Rafael J. Wysocki
2f41dddbbd swsusp: Fix userland interface
Fix oops caused by 'cat /dev/snapshot', reported by Arkadiusz Miskiewicz,
and make it impossible to thaw tasks with the help of the swsusp userland
interface while there is a snapshot image ready to save.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-06-16 13:16:15 -07:00
Paul Jackson
3e903e7b16 cpuset: zero malloc - fix for old cpusets
The cpuset code to present a list of tasks using a cpuset to user space could
write to an array that it had kmalloc'd, after a kmalloc request of zero size.

The problem was that the code didn't check for writes past the allocated end
of the array until -after- the first write.

This is a race condition that is likely rare -- it would only show up if a
cpuset went from being empty to having a task in it, during the brief time
between the allocation and the first write.

Prior to roughly 2.6.22 kernels, this was also a benign problem, because a
zero kmalloc returned a few usable bytes anyway, and no harm was done with the
bogus write.

With the 2.6.22 kernel changes to make issue a warning if code tries to write
to the location returned from a zero size allocation, this problem is no
longer benign.  This cpuset code would occassionally trigger that warning.

The fix is trivial -- check before storing into the array, not after, whether
the array is big enough to hold the store.

Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: "Serge E. Hallyn" <serue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Paul Menage <menage@google.com>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-06-16 13:16:15 -07:00
Alexey Kuznetsov
778e9a9c3e pi-futex: fix exit races and locking problems
1. New entries can be added to tsk->pi_state_list after task completed
   exit_pi_state_list(). The result is memory leakage and deadlocks.

2. handle_mm_fault() is called under spinlock. The result is obvious.

3. results in self-inflicted deadlock inside glibc.
   Sometimes futex_lock_pi returns -ESRCH, when it is not expected
   and glibc enters to for(;;) sleep() to simulate deadlock. This problem
   is quite obvious and I think the patch is right. Though it looks like
   each "if" in futex_lock_pi() got some stupid special case "else if". :-)

4. sometimes futex_lock_pi() returns -EDEADLK,
   when nobody has the lock. The reason is also obvious (see comment
   in the patch), but correct fix is far beyond my comprehension.
   I guess someone already saw this, the chunk:

                        if (rt_mutex_trylock(&q.pi_state->pi_mutex))
                                ret = 0;

   is obviously from the same opera. But it does not work, because the
   rtmutex is really taken at this point: wake_futex_pi() of previous
   owner reassigned it to us. My fix works. But it looks very stupid.
   I would think about removal of shift of ownership in wake_futex_pi()
   and making all the work in context of process taking lock.

From: Thomas Gleixner <tglx@linutronix.de>

Fix 1) Avoid the tasklist lock variant of the exit race fix by adding
    an additional state transition to the exit code.

    This fixes also the issue, when a task with recursive segfaults
    is not able to release the futexes.

Fix 2) Cleanup the lookup_pi_state() failure path and solve the -ESRCH
    problem finally.

Fix 3) Solve the fixup_pi_state_owner() problem which needs to do the fixup
    in the lock protected section by using the in_atomic userspace access
    functions.

    This removes also the ugly lock drop / unqueue inside of fixup_pi_state()

Fix 4) Fix a stale lock in the error path of futex_wake_pi()

Added some error checks for verification.

The -EDEADLK problem is solved by the rtmutex fixups.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ulrich Drepper <drepper@redhat.com>
Cc: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-06-08 17:23:34 -07:00
Thomas Gleixner
1a539a8728 rt-mutex: fix chain walk early wakeup bug
Alexey Kuznetsov found some problems in the pi-futex code.

One of the root causes is:

When a wakeup happens, we do not to stop the chain walk so we follow a not
longer relevant locking chain.

Drop out when this happens.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: Ulrich Drepper <drepper@redhat.com>
Cc: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-06-08 17:23:34 -07:00
Thomas Gleixner
c0d1d2bf5a rt-mutex: fix stale return value
Alexey Kuznetsov found some problems in the pi-futex code.

The major problem is a stale return value in rt_mutex_slowlock():

When the pi chain walk returns -EDEADLK, but the waiter was woken up during
the phases where the locks were dropped, the rtmutex could be acquired, but
due to the stale return value -EDEADLK returned to the caller.

Reset the return value in the retry path.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: Ulrich Drepper <drepper@redhat.com>
Cc: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-06-08 17:23:34 -07:00
Roland McGrath
b74d0deb96 Restrict clearing TIF_SIGPENDING
This patch should get a few birds.  It prevents sigaction calls from
clearing TIF_SIGPENDING in other threads, which could leak -ERESTART*.
And It fixes ptrace_stop not to clear it, which done at the syscall exit
stop could leak -ERESTART*.  It probably removes the harm from signalfd,
at least assuming it never calls dequeue_signal on kernel threads that
might have used block_all_signals.

Signed-off-by: Roland McGrath <roland@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-06-07 08:52:15 -07:00
Ingo Molnar
c1a834dc70 timer stats: speedups
Make timer-stats have almost zero overhead when enabled in the config but
not used.  (this way distros can enable it more easily)

Also update the documentation about overhead of timer_stats - it was
written for the first version which had a global lock and a linear list
walk based lookup ;-)

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-06-01 08:18:30 -07:00
Bjorn Steinbrink
9fcc15ec3c timer statistics: fix race
Fix two races in the timer stats lookup code.  One by ensuring that the
initialization of a new entry is finished upon insertion of that entry.
The other by cleaning up the hash table when the entries array is cleared,
so that we don't have any "pre-inserted" entries.

Thanks to Eric Dumazet for reminding me of the memory barriers.

Signed-off-by: Bjorn Steinbrink <B.Steinbrink@gmx.de>
Signed-off-by: Ian Kumlien <pomac@vapor.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: Eric Dumazet <dada1@cosmosbay.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-06-01 08:18:30 -07:00
Ulrich Drepper
f0ede66fca fix compat futex code for private futexes
When the private futex support was added the compat code wasn't changed.
The result is that code using compat code which fail, e.g., because the
timeout values are not correctly passed.  The following patch should fix
that.

Signed-off-by: Ulrich Drepper <drepper@redhat.com>
Cc: Eric Dumazet <dada1@cosmosbay.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-06-01 08:18:28 -07:00
Kyle McMartin
7a74fc4925 fix possible null ptr deref in kallsyms_lookup
ugh, this function gets called by our unwinder. recursive backtrace for
the win... bisection to find this one was "fun."

Signed-off-by: Kyle McMartin <kyle@parisc-linux.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-30 10:51:38 -07:00
Thomas Gleixner
eaad084bb0 NOHZ: prevent multiplication overflow - stop timer for huge timeouts
get_next_timer_interrupt() returns a delta of (LONG_MAX > 1) in case
there is no timer pending. On 64 bit machines this results in a
multiplication overflow in tick_nohz_stop_sched_tick().

Reported by: Dave Miller <davem@davemloft.net>

Make the return value a constant and limit the return value to a 32 bit
value.

When the max timeout value is returned, we can safely stop the tick
timer device. The max jiffies delta results in a 12 days timeout for
HZ=1000.

In the long term the get_next_timer_interrupt() code needs to be
reworked to return ktime instead of jiffies, but we have to wait until
the last users of the original NO_IDLE_HZ code are converted.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-29 18:11:10 -07:00
Linus Torvalds
92ea77275b Fix crash with irqpoll due to the IRQF_IRQPOLL flag testing
With irqpoll enabled, trying to test the IRQF_IRQPOLL flag in the
actions would cause a NULL pointer dereference if no action was
installed (for example, the driver might have been unloaded with
interrupts still pending).

So be a bit more careful about testing the flag by making sure to test
for that case.

(The actual _change_ is trivial, the patch is more than a one-liner
because I rewrote the testing to also be much more readable.

Original (discarded) bugfix by Bernhard Walle.

Cc: Bernhard Walle <bwalle@suse.de>
Tested-by: Vivek Goyal <vgoyal@in.ibm.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-24 08:37:14 -07:00
Thomas Gleixner
98d8256739 Prevent going idle with softirq pending
The NOHZ patch contains a check for softirqs pending when a CPU goes idle.
The BUG is unrelated to NOHZ, it just was made visible by the NOHZ patch.
The BUG showed up mainly on P4 / hyperthreading enabled machines which lead
the investigations into the wrong direction in the first place.  The real
cause is in cond_resched_softirq():

cond_resched_softirq() is enabling softirqs without invoking the softirq
daemon when softirqs are pending.  This leads to the warning message in the
NOHZ idle code:

t1 runs softirq disabled code on CPU#0
interrupt happens, softirq is raised, but deferred (softirqs disabled)
t1 calls cond_resched_softirq()
	enables softirqs via _local_bh_enable()
	calls schedule()
t2 runs
t1 is migrated to CPU#1
t2 is done and invokes idle()
NOHZ detects the pending softirq

Fix: change _local_bh_enable() to local_bh_enable() so the softirq
daemon is invoked.

Thanks to Anant Nitya for debugging this with great patience !

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-23 20:14:15 -07:00
OGAWA Hirofumi
6373da1fb7 power: Fix sizeof(PAGE_SIZE) typo
Fix sizeof(PAGE_SIZE) typo.  It should be just PAGE_SIZE for zeroing the
swsusp_header.

Signed-off-by: OGAWA Hirofumi <hogawa@miraclelinux.com>
Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-23 20:14:14 -07:00
Oleg Nesterov
14441960e8 simplify cleanup_workqueue_thread()
cleanup_workqueue_thread() and cwq_should_stop() are overcomplicated.

Convert the code to use kthread_should_stop/kthread_stop as was
suggested by Gautham and Srivatsa.

In particular this patch removes the (unlikely) busy-wait loop from the
exit path, it was a temporary and ugly kludge (if not a bug).

Note: the current code was designed to solve another old problem:
work->func can't share locks with hotplug callbacks.  I think this could
be done, see

	http://marc.info/?l=linux-kernel&m=116905366428633

but this needs some more complications to preserve CPU affinity of
cwq->thread during cpu_up().  A freezer-based hotplug looks more
appealing.

[akpm@linux-foundation.org: make it more tolerant of gcc borkenness]
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Zilvinas Valinskas <zilvinas@wilibox.com>
Cc: Gautham R Shenoy <ego@in.ibm.com>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-23 20:14:13 -07:00
Roland McGrath
7bb44adef3 recalc_sigpending_tsk fixes
Steve Hawkes discovered a problem where recalc_sigpending_tsk was called in
do_sigaction but no signal_wake_up call was made, preventing later signals
from waking up blocked threads with TIF_SIGPENDING already set.

In fact, the few other calls to recalc_sigpending_tsk outside the signals
code are also subject to this problem in other race conditions.

This change makes recalc_sigpending_tsk private to the signals code.  It
changes the outside calls, as well as do_sigaction, to use the new
recalc_sigpending_and_wake instead.

Signed-off-by: Roland McGrath <roland@redhat.com>
Cc: <Steve.Hawkes@motorola.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-23 20:14:12 -07:00
Thomas Gleixner
3528231606 NOHZ: Rate limit the local softirq pending warning output
The warning in the NOHZ code, which triggers when a CPU goes idle with
softirqs pending can fill up the logs quite quickly.  Rate limit the output
until we found the root cause of that problem.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-23 20:14:11 -07:00
Thomas Gleixner
72fcde9662 Ignore bogus ACPI info for offline CPUs
Booting a SMP kernel with maxcpus=1 on a SMP system leads to a hard hang,
because ACPI ignores the maxcpus setting and sends timer broadcast info for
the offline CPUs.  This results in a stuck for ever call to
smp_call_function_single() on an offline CPU.

Ignore the bogus information and print a kernel error to remind ACPI
folks to fix it.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-23 20:14:11 -07:00
Gautham R Shenoy
88f18ba028 freezer: move frozen_process() to kernel/power/process.c
Other than refrigerator, no one else calls frozen_process().  So move it from
include/linux/freezer.h to kernel/power/process.c.

Also, since a task can be marked as frozen by itself, we don't need to pass
the (struct task_struct *p) parameter to frozen_process().

Signed-off-by: Gautham R Shenoy <ego@in.ibm.com>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-23 20:14:11 -07:00
Oleg Nesterov
a076e4bca2 freezer: fix kthread_create vs freezer theoretical race
kthread() sleeps in TASK_INTERRUPTIBLE state waiting for the first wakeup.  In
theory, this wakeup may come from freeze_process()->signal_wake_up(), so the
task can disappear even before kthread_create() sets its ->comm.

Change kthread() to use TASK_UNINTERRUPTIBLE.

[akpm@linux-foundation.org: s/BUG_ON/WARN_ON+recover]
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Gautham R Shenoy <ego@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-23 20:14:11 -07:00
Rafael J. Wysocki
49b12d4f5e freezer: take kernel_execve into consideration
Kernel threads can become userland processes by calling kernel_execve().

In particular, this may happen right after the try_to_freeze_tasks()
called with FREEZER_USER_SPACE has returned, so try_to_freeze_tasks()
needs to take userspace processes into consideration even if it is
called with FREEZER_KERNEL_THREADS.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Cc: Gautham R Shenoy <ego@in.ibm.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-23 20:14:11 -07:00
Rafael J. Wysocki
ba96a0c880 freezer: fix vfork problem
Currently try_to_freeze_tasks() has to wait until all of the vforked processes
exit and for this reason every user can make it fail.  To fix this problem we
can introduce the additional process flag PF_FREEZER_SKIP to be used by tasks
that do not want to be counted as freezable by the freezer and want to have
TIF_FREEZE set nevertheless.  Then, this flag can be set by tasks using
sys_vfork() before they call wait_for_completion(&vfork) and cleared after
they have woken up.  After clearing it, the tasks should call try_to_freeze()
as soon as possible.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Gautham R Shenoy <ego@in.ibm.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-23 20:14:11 -07:00
Rafael J. Wysocki
33e1c288da freezer: close potential race between refrigerator and thaw_tasks
If the freezing of tasks fails and a task is preempted in refrigerator()
before calling frozen_process(), then thaw_tasks() may run before this task is
frozen.  In that case the task will freeze and no one will thaw it.

To fix this race we can call freezing(current) in refrigerator() along with
frozen_process(current) under the task_lock() which also should be taken in
the error path of try_to_freeze_tasks() as well as in thaw_process().
Moreover, if thaw_process() additionally clears TIF_FREEZE for tasks that are
not frozen, we can be sure that all tasks are thawed and there are no pending
"freeze" requests after thaw_tasks() has run.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Cc: Gautham R Shenoy <ego@in.ibm.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-23 20:14:10 -07:00
Alexey Dobriyan
e8edc6e03a Detach sched.h from mm.h
First thing mm.h does is including sched.h solely for can_do_mlock() inline
function which has "current" dereference inside. By dealing with can_do_mlock()
mm.h can be detached from sched.h which is good. See below, why.

This patch
a) removes unconditional inclusion of sched.h from mm.h
b) makes can_do_mlock() normal function in mm/mlock.c
c) exports can_do_mlock() to not break compilation
d) adds sched.h inclusions back to files that were getting it indirectly.
e) adds less bloated headers to some files (asm/signal.h, jiffies.h) that were
   getting them indirectly

Net result is:
a) mm.h users would get less code to open, read, preprocess, parse, ... if
   they don't need sched.h
b) sched.h stops being dependency for significant number of files:
   on x86_64 allmodconfig touching sched.h results in recompile of 4083 files,
   after patch it's only 3744 (-8.3%).

Cross-compile tested on

	all arm defconfigs, all mips defconfigs, all powerpc defconfigs,
	alpha alpha-up
	arm
	i386 i386-up i386-defconfig i386-allnoconfig
	ia64 ia64-up
	m68k
	mips
	parisc parisc-up
	powerpc powerpc-up
	s390 s390-up
	sparc sparc-up
	sparc64 sparc64-up
	um-x86_64
	x86_64 x86_64-up x86_64-defconfig x86_64-allnoconfig

as well as my two usual configs.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-21 09:18:19 -07:00
Rafael J. Wysocki
8d98a690f5 swsusp: fix sysfs interface
The sysfs files /sys/power/disk and /sys/power/state do not work as
documented, since they allow the user to write only a few initial
characters of the input string to trigger the option (eg.  'echo pl >
/sys/power/disk' activates the platform mode of hibernation).  Fix it.

Special thanks to Peter Moulder <Peter.Moulder@infotech.monash.edu.au> for
pointing out the problem.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-17 05:23:05 -07:00
Dan Aloni
71ce92f3fa make sysctl/kernel/core_pattern and fs/exec.c agree on maximum core filename size
Make sysctl/kernel/core_pattern and fs/exec.c agree on maximum core
filename size and change it to 128, so that extensive patterns such as
'/local/cores/%e-%h-%s-%t-%p.core' won't result in truncated filename
generation.

Signed-off-by: Dan Aloni <da-x@monatomic.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-17 05:23:05 -07:00
Christoph Lameter
a35afb830f Remove SLAB_CTOR_CONSTRUCTOR
SLAB_CTOR_CONSTRUCTOR is always specified. No point in checking it.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Jens Axboe <jens.axboe@oracle.com>
Cc: Steven French <sfrench@us.ibm.com>
Cc: Michael Halcrow <mhalcrow@us.ibm.com>
Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Roman Zippel <zippel@linux-m68k.org>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Dave Kleikamp <shaggy@austin.ibm.com>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Anton Altaparmakov <aia21@cantab.net>
Cc: Mark Fasheh <mark.fasheh@oracle.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jan Kara <jack@ucw.cz>
Cc: David Chinner <dgc@sgi.com>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-17 05:23:04 -07:00
Linus Torvalds
52ade9b3b9 Fix ACPI suspend / device suspend ordering problem
In commit e3c7db621b we fixed the resume
ordering, so that the ACPI low-level resume code was called before the
actual driver resume was called. However, that broke the nesting logic
of suspend and resume, and we continued to suspend the devices _after_
we the ACPI device suspend code was called.

That resulted in us saving PCI state for devices that had already been
changed by ACPI, and in some cases disabled entirely (causing the PCI
save_state to be all-ones).  Which in turn caused the wrong state to be
written back on resume.

This moves the ACPI device suspend to after the device model per-device
suspend() calls. This fixes the bogus state save.

Thanks to Lukáš Hejtmánek for testing.

Acked-by: Lukas Hejtmanek <xhejtman@ics.muni.cz>
Acked-by: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Len Brown <len.brown@intel.com>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Greg KH <greg@kroah.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-16 15:33:19 -07:00
Al Viro
327b9eebbf audit_match_signal() and friends are used only if CONFIG_AUDITSYSCALL is set
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-15 18:56:37 -07:00
Thomas Gleixner
8f89441b37 clocksource: fix lock order in the resume path
lockdep complains about the lock nesting of clocksource and watchdog lock
in the resume path.

Change the resume marker to a bit operation and remove the lock from this
path.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: john stultz <johnstul@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-15 08:54:00 -07:00
Thomas Gleixner
d10ff3fb62 timekeeping fix patch got mis-applied
The time keeping code move to kernel/time/timekeeping.c broke the
clocksource resume logic patch, which got applied to the old file by a
fuzzy application.  Fix it up and move the clocksource_resume() call to
the appropriate place.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
[ tssk, tssk, everybody should use --fuzz=0 ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-14 12:13:11 -07:00
Heiko Carstens
8df767dd75 compat signalfd and timerfd are cond syscalls
Add missing cond_syscall statements for compat_sys_signalfd and
compat_sys_timerfd.

Cc: Davide Libenzi <davidel@xmailserver.org>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-12 10:55:40 -07:00
Linus Torvalds
2a383c63ff Merge branch 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux-2.6
* 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux-2.6:
  [IA64] Quicklist support for IA64
  [IA64] fix Kprobes reentrancy
  [IA64] SN: validate smp_affinity mask on intr redirect
  [IA64] drivers/char/snsc_event.c:206: warning: unused variable `p'
  [IA64] mca.c:121: warning: 'cpe_poll_timer' defined but not used
  [IA64] Fix - Section mismatch: reference to .init.data:mvec_name
  [IA64] more warning cleanups
  [IA64] Wire up epoll_pwait and utimensat
  [IA64] Fix warnings resulting from type-checking in dev_dbg()
  [IA64] typo s/kenrel/kernel/
2007-05-11 12:53:21 -07:00
Linus Torvalds
853da00220 Merge branch 'audit.b38' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/audit-current
* 'audit.b38' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/audit-current:
  [PATCH] Abnormal End of Processes
  [PATCH] match audit name data
  [PATCH] complete message queue auditing
  [PATCH] audit inode for all xattr syscalls
  [PATCH] initialize name osid
  [PATCH] audit signal recipients
  [PATCH] add SIGNAL syscall class (v3)
  [PATCH] auditing ptrace
2007-05-11 09:57:16 -07:00
John Keller
25d61578da [IA64] SN: validate smp_affinity mask on intr redirect
On SN, only allow one bit to be set in the smp_affinty mask when
redirecting an interrupt.  Currently setting multiple bits is allowed, but
only the first bit is used in determining the CPU to redirect to.  This has
caused confusion among some customers.

[akpm@linux-foundation.org: fixes]
Signed-off-by: John Keller <jpk@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Tony Luck <tony.luck@intel.com>
2007-05-11 09:35:38 -07:00
Davide Libenzi
e1ad7468c7 signal/timer/event: eventfd core
This is a very simple and light file descriptor, that can be used as event
wait/dispatch by userspace (both wait and dispatch) and by the kernel
(dispatch only).  It can be used instead of pipe(2) in all cases where those
would simply be used to signal events.  Their kernel overhead is much lower
than pipes, and they do not consume two fds.  When used in the kernel, it can
offer an fd-bridge to enable, for example, functionalities like KAIO or
syslets/threadlets to signal to an fd the completion of certain operations.
But more in general, an eventfd can be used by the kernel to signal readiness,
in a POSIX poll/select way, of interfaces that would otherwise be incompatible
with it.  The API is:

int eventfd(unsigned int count);

The eventfd API accepts an initial "count" parameter, and returns an eventfd
fd.  It supports poll(2) (POLLIN, POLLOUT, POLLERR), read(2) and write(2).

The POLLIN flag is raised when the internal counter is greater than zero.

The POLLOUT flag is raised when at least a value of "1" can be written to the
internal counter.

The POLLERR flag is raised when an overflow in the counter value is detected.

The write(2) operation can never overflow the counter, since it blocks (unless
O_NONBLOCK is set, in which case -EAGAIN is returned).

But the eventfd_signal() function can do it, since it's supposed to not sleep
during its operation.

The read(2) function reads the __u64 counter value, and reset the internal
value to zero.  If the value read is equal to (__u64) -1, an overflow happened
on the internal counter (due to 2^64 eventfd_signal() posts that has never
been retired - unlickely, but possible).

The write(2) call writes an __u64 count value, and adds it to the current
counter.  The eventfd fd supports O_NONBLOCK also.

On the kernel side, we have:

struct file *eventfd_fget(int fd);
int eventfd_signal(struct file *file, unsigned int n);

The eventfd_fget() should be called to get a struct file* from an eventfd fd
(this is an fget() + check of f_op being an eventfd fops pointer).

The kernel can then call eventfd_signal() every time it wants to post an event
to userspace.  The eventfd_signal() function can be called from any context.
An eventfd() simple test and bench is available here:

http://www.xmailserver.org/eventfd-bench.c

This is the eventfd-based version of pipetest-4 (pipe(2) based):

http://www.xmailserver.org/pipetest-4.c

Not that performance matters much in the eventfd case, but eventfd-bench
shows almost as double as performance than pipetest-4.

[akpm@linux-foundation.org: fix i386 build]
[akpm@linux-foundation.org: add sys_eventfd to sys_ni.c]
Signed-off-by: Davide Libenzi <davidel@xmailserver.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-11 08:29:36 -07:00
Davide Libenzi
83f5d12669 signal/timer/event: timerfd compat code
This patch implements the necessary compat code for the timerfd system call.

Signed-off-by: Davide Libenzi <davidel@xmailserver.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-11 08:29:36 -07:00
Davide Libenzi
b215e28399 signal/timer/event: timerfd core
This patch introduces a new system call for timers events delivered though
file descriptors.  This allows timer event to be used with standard POSIX
poll(2), select(2) and read(2).  As a consequence of supporting the Linux
f_op->poll subsystem, they can be used with epoll(2) too.

The system call is defined as:

int timerfd(int ufd, int clockid, int flags, const struct itimerspec *utmr);

The "ufd" parameter allows for re-use (re-programming) of an existing timerfd
w/out going through the close/open cycle (same as signalfd).  If "ufd" is -1,
s new file descriptor will be created, otherwise the existing "ufd" will be
re-programmed.

The "clockid" parameter is either CLOCK_MONOTONIC or CLOCK_REALTIME.  The time
specified in the "utmr->it_value" parameter is the expiry time for the timer.

If the TFD_TIMER_ABSTIME flag is set in "flags", this is an absolute time,
otherwise it's a relative time.

If the time specified in the "utmr->it_interval" is not zero (.tv_sec == 0,
tv_nsec == 0), this is the period at which the following ticks should be
generated.

The "utmr->it_interval" should be set to zero if only one tick is requested.
Setting the "utmr->it_value" to zero will disable the timer, or will create a
timerfd without the timer enabled.

The function returns the new (or same, in case "ufd" is a valid timerfd
descriptor) file, or -1 in case of error.

As stated before, the timerfd file descriptor supports poll(2), select(2) and
epoll(2).  When a timer event happened on the timerfd, a POLLIN mask will be
returned.

The read(2) call can be used, and it will return a u32 variable holding the
number of "ticks" that happened on the interface since the last call to
read(2).  The read(2) call supportes the O_NONBLOCK flag too, and EAGAIN will
be returned if no ticks happened.

A quick test program, shows timerfd working correctly on my amd64 box:

http://www.xmailserver.org/timerfd-test.c

[akpm@linux-foundation.org: add sys_timerfd to sys_ni.c]
Signed-off-by: Davide Libenzi <davidel@xmailserver.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-11 08:29:36 -07:00
Davide Libenzi
fba2afaaec signal/timer/event: signalfd core
This patch series implements the new signalfd() system call.

I took part of the original Linus code (and you know how badly it can be
broken :), and I added even more breakage ;) Signals are fetched from the same
signal queue used by the process, so signalfd will compete with standard
kernel delivery in dequeue_signal().  If you want to reliably fetch signals on
the signalfd file, you need to block them with sigprocmask(SIG_BLOCK).  This
seems to be working fine on my Dual Opteron machine.  I made a quick test
program for it:

http://www.xmailserver.org/signafd-test.c

The signalfd() system call implements signal delivery into a file descriptor
receiver.  The signalfd file descriptor if created with the following API:

int signalfd(int ufd, const sigset_t *mask, size_t masksize);

The "ufd" parameter allows to change an existing signalfd sigmask, w/out going
to close/create cycle (Linus idea).  Use "ufd" == -1 if you want a brand new
signalfd file.

The "mask" allows to specify the signal mask of signals that we are interested
in.  The "masksize" parameter is the size of "mask".

The signalfd fd supports the poll(2) and read(2) system calls.  The poll(2)
will return POLLIN when signals are available to be dequeued.  As a direct
consequence of supporting the Linux poll subsystem, the signalfd fd can use
used together with epoll(2) too.

The read(2) system call will return a "struct signalfd_siginfo" structure in
the userspace supplied buffer.  The return value is the number of bytes copied
in the supplied buffer, or -1 in case of error.  The read(2) call can also
return 0, in case the sighand structure to which the signalfd was attached,
has been orphaned.  The O_NONBLOCK flag is also supported, and read(2) will
return -EAGAIN in case no signal is available.

If the size of the buffer passed to read(2) is lower than sizeof(struct
signalfd_siginfo), -EINVAL is returned.  A read from the signalfd can also
return -ERESTARTSYS in case a signal hits the process.  The format of the
struct signalfd_siginfo is, and the valid fields depends of the (->code &
__SI_MASK) value, in the same way a struct siginfo would:

struct signalfd_siginfo {
	__u32 signo;	/* si_signo */
	__s32 err;	/* si_errno */
	__s32 code;	/* si_code */
	__u32 pid;	/* si_pid */
	__u32 uid;	/* si_uid */
	__s32 fd;	/* si_fd */
	__u32 tid;	/* si_fd */
	__u32 band;	/* si_band */
	__u32 overrun;	/* si_overrun */
	__u32 trapno;	/* si_trapno */
	__s32 status;	/* si_status */
	__s32 svint;	/* si_int */
	__u64 svptr;	/* si_ptr */
	__u64 utime;	/* si_utime */
	__u64 stime;	/* si_stime */
	__u64 addr;	/* si_addr */
};

[akpm@linux-foundation.org: fix signalfd_copyinfo() on i386]
Signed-off-by: Davide Libenzi <davidel@xmailserver.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-11 08:29:36 -07:00
Sukadev Bhattiprolu
0800d30832 Use task_pgrp() task_session() in copy_process()
Use task_pgrp() and task_session() in copy_process(), and avoid find_pid()
call when attaching the task to its process group and session.

Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Serge Hallyn <serue@us.ibm.com>
Cc: <containers@lists.osdl.org>
Acked-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-11 08:29:36 -07:00
Sukadev Bhattiprolu
85868995d9 Use struct pid parameter in copy_process()
Modify copy_process() to take a struct pid * parameter instead of a pid_t.
This simplifies the code a bit and also avoids having to call find_pid() to
convert the pid_t to a struct pid.

Changelog:
	- Fixed Badari Pulavarty's comments and passed in &init_struct_pid
	  from fork_idle().
	- Fixed Eric Biederman's comments and simplified this patch and
	  used a new patch to remove the likely(pid) check.

Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Serge Hallyn <serue@us.ibm.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: <containers@lists.osdl.org>
Acked-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-11 08:29:35 -07:00
Sukadev Bhattiprolu
820e45db23 statically initialize struct pid for swapper
Statically initialize a struct pid for the swapper process (pid_t == 0) and
attach it to init_task.  This is needed so task_pid(), task_pgrp() and
task_session() interfaces work on the swapper process also.

Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Serge Hallyn <serue@us.ibm.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: <containers@lists.osdl.org>
Acked-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-11 08:29:35 -07:00
Sukadev Bhattiprolu
e713d0dab2 attach_pid() with struct pid parameter
attach_pid() currently takes a pid_t and then uses find_pid() to find the
corresponding struct pid.  Sometimes we already have the struct pid.  We can
then skip find_pid() if attach_pid() were to take a struct pid parameter.

Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Serge Hallyn <serue@us.ibm.com>
Cc: <containers@lists.osdl.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-11 08:29:35 -07:00
Daniel Walker
3e88c553db use defines in sys_getpriority/sys_setpriority
Switch to the defines for these two checks, instead of hard coding the
values.

[akpm@linux-foundation.org: add missing include]
Signed-off-by: Daniel Walker <dwalker@mvista.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-11 08:29:35 -07:00
Benjamin Herrenschmidt
a12bb44471 stop_machine() now uses hard_irq_disable
Add a call to hard_irq_disable() to stop_machine so that we make sure IRQs are
really disabled and not only lazy-disabled on archs like powerpc as some users
of stop_machine() may rely on that.

[akpm@linux-foundation.org: build fix]
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: Paul Mackerras <paulus@samba.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-11 08:29:34 -07:00
Eric Dumazet
6eaeeaba39 getrusage(): fill ru_inblock and ru_oublock fields if possible
If CONFIG_TASK_IO_ACCOUNTING is defined, we update io accounting counters for
each task.

This patch permits reporting of values using the well known getrusage()
syscall, filling ru_inblock and ru_oublock instead of null values.

As TASK_IO_ACCOUNTING currently counts bytes counts, we approximate blocks
count doing : nr_blocks = nr_bytes / 512

Example of use :
----------------------
After patch is applied, /usr/bin/time command can now give a good
approximation of IO that the process had to do.

$ /usr/bin/time grep tototo /usr/include/*
Command exited with non-zero status 1
0.00user 0.02system 0:02.11elapsed 1%CPU (0avgtext+0avgdata 0maxresident)k
24288inputs+0outputs (0major+259minor)pagefaults 0swaps

$ /usr/bin/time dd if=/dev/zero of=/tmp/testfile count=1000
1000+0 enregistrements lus
1000+0 enregistrements écrits
512000 octets (512 kB) copiés, 0,00326601 seconde, 157 MB/s
0.00user 0.00system 0:00.00elapsed 80%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+3000outputs (0major+299minor)pagefaults 0swaps

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-11 08:29:34 -07:00
Steve Grubb
0a4ff8c259 [PATCH] Abnormal End of Processes
Hi,

I have been working on some code that detects abnormal events based on audit
system events. One kind of event that we currently have no visibility for is
when a program terminates due to segfault - which should never happen on a
production machine. And if it did, you'd want to investigate it. Attached is a
patch that collects these events and sends them into the audit system.

Signed-off-by: Steve Grubb <sgrubb@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2007-05-11 05:38:26 -04:00
Amy Griffis
5712e88f2b [PATCH] match audit name data
Make more effort to detect previously collected names, so we don't log
multiple PATH records for a single filesystem object. Add
audit_inc_name_count() to reduce duplicate code.

Signed-off-by: Amy Griffis <amy.griffis@hp.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2007-05-11 05:38:26 -04:00
Amy Griffis
4fc03b9beb [PATCH] complete message queue auditing
Handle the edge cases for POSIX message queue auditing. Collect inode
info when opening an existing mq, and for send/receive operations. Remove
audit_inode_update() as it has really evolved into the equivalent of
audit_inode().

Signed-off-by: Amy Griffis <amy.griffis@hp.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2007-05-11 05:38:26 -04:00
Amy Griffis
e41e8bde43 [PATCH] initialize name osid
Audit contexts can be reused, so initialize a name's osid to the
default in audit_getname(). This ensures we don't log a bogus object
label when no inode data is collected for a name.

Signed-off-by: Amy Griffis <amy.griffis@hp.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2007-05-11 05:38:25 -04:00
Amy Griffis
e54dc2431d [PATCH] audit signal recipients
When auditing syscalls that send signals, log the pid and security
context for each target process. Optimize the data collection by
adding a counter for signal-related rules, and avoiding allocating an
aux struct unless we have more than one target process. For process
groups, collect pid/context data in blocks of 16. Move the
audit_signal_info() hook up in check_kill_permission() so we audit
attempts where permission is denied.

Signed-off-by: Amy Griffis <amy.griffis@hp.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2007-05-11 05:38:25 -04:00
Al Viro
a5cb013da7 [PATCH] auditing ptrace
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2007-05-11 05:38:25 -04:00
akpm@linux-foundation.org
e9910846fd timer: revert parenthesis fix in tbase_get_deferrable() etc
On 09-05-2007 21:10, Pallipadi, Venkatesh wrote:
...
> On a 64 bit system, converting pointer to int causes unnecessary
> compiler warning, and intermediate long conversion was to avoid that.
> I will have to rephrase my comment to remove 32 bit value and use int,
> as that is what the function returns.

So, this patch reverts all changes done by my previous patch.

I apologize for my wrong comment about "logical error" here.

Cc: "Pallipadi, Venkatesh" <venkatesh.pallipadi@intel.com>
Cc: Satyam Sharma <satyam.sharma@gmail.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Jarek Poplawski <jarkao2@o2.pl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-10 09:26:53 -07:00
Linus Torvalds
aabded9c3a Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc
* 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc:
  [POWERPC] Further fixes for the removal of 4level-fixup hack from ppc32
  [POWERPC] EEH: log all PCI-X and PCI-E AER registers
  [POWERPC] EEH: capture and log pci state on error
  [POWERPC] EEH: Split up long error msg
  [POWERPC] EEH: log error only after driver notification.
  [POWERPC] fsl_soc: Make mac_addr const in fs_enet_of_init().
  [POWERPC] Don't use SLAB/SLUB for PTE pages
  [POWERPC] Spufs support for 64K LS mappings on 4K kernels
  [POWERPC] Add ability to 4K kernel to hash in 64K pages
  [POWERPC] Introduce address space "slices"
  [POWERPC] Small fixes & cleanups in segment page size demotion
  [POWERPC] iSeries: Make HVC_ISERIES the default
  [POWERPC] iSeries: suppress build warning in lparmap.c
  [POWERPC] Mark pages that don't exist as nosave
  [POWERPC] swsusp: Introduce register_nosave_region_late
2007-05-09 12:56:01 -07:00
Linus Torvalds
9a9136e270 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bunk/trivial
* git://git.kernel.org/pub/scm/linux/kernel/git/bunk/trivial: (25 commits)
  sound: convert "sound" subdirectory to UTF-8
  MAINTAINERS: Add cxacru website/mailing list
  include files: convert "include" subdirectory to UTF-8
  general: convert "kernel" subdirectory to UTF-8
  documentation: convert the Documentation directory to UTF-8
  Convert the toplevel files CREDITS and MAINTAINERS to UTF-8.
  remove broken URLs from net drivers' output
  Magic number prefix consistency change to Documentation/magic-number.txt
  trivial: s/i_sem /i_mutex/
  fix file specification in comments
  drivers/base/platform.c: fix small typo in doc
  misc doc and kconfig typos
  Remove obsolete fat_cvf help text
  Fix occurrences of "the the "
  Fix minor typoes in kernel/module.c
  Kconfig: Remove reference to external mqueue library
  Kconfig: A couple of grammatical fixes in arch/i386/Kconfig
  Correct comments in genrtc.c to refer to correct /proc file.
  Fix more "deprecated" spellos.
  Fix "deprecated" typoes.
  ...

Fix trivial comment conflict in kernel/relay.c.
2007-05-09 12:54:17 -07:00
Roman Zippel
f7e4217b00 rename thread_info to stack
This finally renames the thread_info field in task structure to stack, so that
the assumptions about this field are gone and archs have more freedom about
placing the thread_info structure.

Nonbroken archs which have a proper thread pointer can do the access to both
current thread and task structure via a single pointer.

It'll allow for a few more cleanups of the fork code, from which e.g.  ia64
could benefit.

Signed-off-by: Roman Zippel <zippel@linux-m68k.org>
[akpm@linux-foundation.org: build fix]
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Ian Molton <spyro@f2s.com>
Cc: Haavard Skinnemoen <hskinnemoen@atmel.com>
Cc: Mikael Starvik <starvik@axis.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Hirokazu Takata <takata@linux-m32r.org>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Roman Zippel <zippel@linux-m68k.org>
Cc: Greg Ungerer <gerg@uclinux.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Kazumoto Kojima <kkojima@rr.iij4u.or.jp>
Cc: Richard Curnow <rc@rc0.org.uk>
Cc: William Lee Irwin III <wli@holomorphy.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
Cc: Miles Bader <uclinux-v850@lsi.nec.co.jp>
Cc: Andi Kleen <ak@muc.de>
Cc: Chris Zankel <chris@zankel.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:56 -07:00
Roman Zippel
c9f4f06d31 wrap access to thread_info
Recently a few direct accesses to the thread_info in the task structure snuck
back, so this wraps them with the appropriate wrapper.

Signed-off-by: Roman Zippel <zippel@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:56 -07:00
Thomas Gleixner
b52f52a093 clocksource: fix resume logic
We need to make sure that the clocksources are resumed, when timekeeping is
resumed.  The current resume logic does not guarantee this.

Add a resume function pointer to the clocksource struct, so clocksource
drivers which need to reinitialize the clocksource can provide a resume
function.

Add a resume function, which calls the maybe available clocksource resume
functions and resets the watchdog function, so a stable TSC can be used
accross suspend/resume.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: john stultz <johnstul@us.ibm.com>
Cc: Andi Kleen <ak@suse.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:56 -07:00
Christoph Lameter
77461ab332 Make vm statistics update interval configurable
Make it configurable.  Code in mm makes the vm statistics intervals
independent from the cache reaper use that opportunity to make it
configurable.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:56 -07:00
Rafael J. Wysocki
455c017ae3 microcode: use suspend-related CPU hotplug notifications
Make the microcode driver use the suspend-related CPU hotplug notifications
to handle the CPU hotplug events occuring during system-wide suspend and
resume transitions.  Remove the global variable suspend_cpu_hotplug
previously used for this purpose.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Gautham R Shenoy <ego@in.ibm.com>
Cc: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:56 -07:00
Rafael J. Wysocki
8bb7844286 Add suspend-related notifications for CPU hotplug
Since nonboot CPUs are now disabled after tasks and devices have been
frozen and the CPU hotplug infrastructure is used for this purpose, we need
special CPU hotplug notifications that will help the CPU-hotplug-aware
subsystems distinguish normal CPU hotplug events from CPU hotplug events
related to a system-wide suspend or resume operation in progress.  This
patch introduces such notifications and causes them to be used during
suspend and resume transitions.  It also changes all of the
CPU-hotplug-aware subsystems to take these notifications into consideration
(for now they are handled in the same way as the corresponding "normal"
ones).

[oleg@tv-sign.ru: cleanups]
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Gautham R Shenoy <ego@in.ibm.com>
Cc: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:56 -07:00
Jarek Poplawski
38a23e311b timer: parenthesis fix in tbase_get_deferrable() etc
Signed-off-by: Jarek Poplawski <jarkao2@o2.pl>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:55 -07:00
Eric Dumazet
34f01cc1f5 FUTEX: new PRIVATE futexes
Analysis of current linux futex code :
  --------------------------------------

A central hash table futex_queues[] holds all contexts (futex_q) of waiting
threads.

Each futex_wait()/futex_wait() has to obtain a spinlock on a hash slot to
perform lookups or insert/deletion of a futex_q.

When a futex_wait() is done, calling thread has to :

1) - Obtain a read lock on mmap_sem to be able to validate the user pointer
     (calling find_vma()). This validation tells us if the futex uses
     an inode based store (mapped file), or mm based store (anonymous mem)

2) - compute a hash key

3) - Atomic increment of reference counter on an inode or a mm_struct

4) - lock part of futex_queues[] hash table

5) - perform the test on value of futex.
	(rollback is value != expected_value, returns EWOULDBLOCK)
	(various loops if test triggers mm faults)

6) queue the context into hash table, release the lock got in 4)

7) - release the read_lock on mmap_sem

   <block>

8) Eventually unqueue the context (but rarely, as this part  may be done
   by the futex_wake())

Futexes were designed to improve scalability but current implementation has
various problems :

- Central hashtable :

  This means scalability problems if many processes/threads want to use
  futexes at the same time.
  This means NUMA unbalance because this hashtable is located on one node.

- Using mmap_sem on every futex() syscall :

  Even if mmap_sem is a rw_semaphore, up_read()/down_read() are doing atomic
  ops on mmap_sem, dirtying cache line :
    - lot of cache line ping pongs on SMP configurations.

  mmap_sem is also extensively used by mm code (page faults, mmap()/munmap())
  Highly threaded processes might suffer from mmap_sem contention.

  mmap_sem is also used by oprofile code. Enabling oprofile hurts threaded
  programs because of contention on the mmap_sem cache line.

- Using an atomic_inc()/atomic_dec() on inode ref counter or mm ref counter:
  It's also a cache line ping pong on SMP. It also increases mmap_sem hold time
  because of cache misses.

Most of these scalability problems come from the fact that futexes are in
one global namespace.  As we use a central hash table, we must make sure
they are all using the same reference (given by the mm subsystem).  We
chose to force all futexes be 'shared'.  This has a cost.

But fact is POSIX defined PRIVATE and SHARED, allowing clear separation,
and optimal performance if carefuly implemented.  Time has come for linux
to have better threading performance.

The goal is to permit new futex commands to avoid :
 - Taking the mmap_sem semaphore, conflicting with other subsystems.
 - Modifying a ref_count on mm or an inode, still conflicting with mm or fs.

This is possible because, for one process using PTHREAD_PROCESS_PRIVATE
futexes, we only need to distinguish futexes by their virtual address, no
matter the underlying mm storage is.

If glibc wants to exploit this new infrastructure, it should use new
_PRIVATE futex subcommands for PTHREAD_PROCESS_PRIVATE futexes.  And be
prepared to fallback on old subcommands for old kernels.  Using one global
variable with the FUTEX_PRIVATE_FLAG or 0 value should be OK.

PTHREAD_PROCESS_SHARED futexes should still use the old subcommands.

Compatibility with old applications is preserved, they still hit the
scalability problems, but new applications can fly :)

Note : the same SHARED futex (mapped on a file) can be used by old binaries
*and* new binaries, because both binaries will use the old subcommands.

Note : Vast majority of futexes should be using PROCESS_PRIVATE semantic,
as this is the default semantic. Almost all applications should benefit
of this changes (new kernel and updated libc)

Some bench results on a Pentium M 1.6 GHz (SMP kernel on a UP machine)

/* calling futex_wait(addr, value) with value != *addr */
433 cycles per futex(FUTEX_WAIT) call (mixing 2 futexes)
424 cycles per futex(FUTEX_WAIT) call (using one futex)
334 cycles per futex(FUTEX_WAIT_PRIVATE) call (mixing 2 futexes)
334 cycles per futex(FUTEX_WAIT_PRIVATE) call (using one futex)
For reference :
187 cycles per getppid() call
188 cycles per umask() call
181 cycles per ni_syscall() call

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Pierre Peiffer <pierre.peiffer@bull.net>
Cc: "Ulrich Drepper" <drepper@gmail.com>
Cc: "Nick Piggin" <nickpiggin@yahoo.com.au>
Cc: "Ingo Molnar" <mingo@elte.hu>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:55 -07:00
Pierre Peiffer
d0aa7a70bf futex_requeue_pi optimization
This patch provides the futex_requeue_pi functionality, which allows some
threads waiting on a normal futex to be requeued on the wait-queue of a
PI-futex.

This provides an optimization, already used for (normal) futexes, to be used
with the PI-futexes.

This optimization is currently used by the glibc in pthread_broadcast, when
using "normal" mutexes.  With futex_requeue_pi, it can be used with
PRIO_INHERIT mutexes too.

Signed-off-by: Pierre Peiffer <pierre.peiffer@bull.net>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Ulrich Drepper <drepper@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:55 -07:00
Pierre Peiffer
c19384b5b2 Make futex_wait() use an hrtimer for timeout
This patch modifies futex_wait() to use an hrtimer + schedule() in place of
schedule_timeout().

schedule_timeout() is tick based, therefore the timeout granularity is the
tick (1 ms, 4 ms or 10 ms depending on HZ).  By using a high resolution timer
for timeout wakeup, we can attain a much finer timeout granularity (in the
microsecond range).  This parallels what is already done for futex_lock_pi().

The timeout passed to the syscall is no longer converted to jiffies and is
therefore passed to do_futex() and futex_wait() as an absolute ktime_t
therefore keeping nanosecond resolution.

Also this removes the need to pass the nanoseconds timeout part to
futex_lock_pi() in val2.

In futex_wait(), if there is no timeout then a regular schedule() is
performed.  Otherwise, an hrtimer is fired before schedule() is called.

[akpm@linux-foundation.org: fix `make headers_check']
Signed-off-by: Sebastien Dugue <sebastien.dugue@bull.net>
Signed-off-by: Pierre Peiffer <pierre.peiffer@bull.net>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Ulrich Drepper <drepper@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:55 -07:00
Pierre Peiffer
ec92d08292 futex priority based wakeup
Today, all threads waiting for a given futex are woken in FIFO order (first
waiter woken first) instead of priority order.

This patch makes use of plist (pirotity ordered lists) instead of simple list
in futex_hash_bucket.

All non-RT threads are stored with priority MAX_RT_PRIO, causing them to be
woken last, in FIFO order (RT-threads are woken first, in priority order).

Signed-off-by: Sebastien Dugue <sebastien.dugue@bull.net>
Signed-off-by: Pierre Peiffer <pierre.peiffer@bull.net>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Ulrich Drepper <drepper@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:55 -07:00
Oleg Nesterov
6e84d644b5 make cancel_rearming_delayed_work() reliable
Thanks to Jarek Poplawski for the ideas and for spotting the bug in the
initial draft patch.

cancel_rearming_delayed_work() currently has many limitations, because it
requires that dwork always re-arms itself via queue_delayed_work().  So it
hangs forever if dwork doesn't do this, or cancel_rearming_delayed_work/
cancel_delayed_work was already called.  It uses flush_workqueue() in a
loop, so it can't be used if workqueue was freezed, and it is potentially
live- lockable on busy system if delay is small.

With this patch cancel_rearming_delayed_work() doesn't make any assumptions
about dwork, it can re-arm itself via queue_delayed_work(), or
queue_work(), or do nothing.

As a "side effect", cancel_work_sync() was changed to handle re-arming works
as well.

Disadvantages:

	- this patch adds wmb() to insert_work().

	- slowdowns the fast path (when del_timer() succeeds on entry) of
	  cancel_rearming_delayed_work(), because wait_on_work() is called
	  unconditionally. In that case, compared to the old version, we are
	  doing "unneeded" lock/unlock for each online CPU.

	  On the other hand, this means we don't need to use cancel_work_sync()
	  after cancel_rearming_delayed_work().

	- complicates the code (.text grows by 130 bytes).

[akpm@linux-foundation.org: fix speling]
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: David Chinner <dgc@sgi.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Gautham Shenoy <ego@in.ibm.com>
Acked-by: Jarek Poplawski <jarkao2@o2.pl>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:53 -07:00
Gautham R Shenoy
7b0834c26f Remove kthread_bind() call from _cpu_down()
We are anyway kthread_stop()ping other per-cpu kernel threads after
move_task_off_dead_cpu(), so we can do it with the stop_machine_run thread
as well.

I just checked with Vatsa if there was any subtle reason why they
had put in the kthread_bind() in cpu.c. Vatsa cannot seem to recollect
any and I can't see any. So let us just remove the kthread_bind.

Signed-off-by: Gautham R Shenoy <ego@in.ibm.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:53 -07:00
Oleg Nesterov
10ab825bde change kernel threads to ignore signals instead of blocking them
Currently kernel threads use sigprocmask(SIG_BLOCK) to protect against
signals.  This doesn't prevent the signal delivery, this only blocks
signal_wake_up().  Every "killall -33 kthreadd" means a "struct siginfo"
leak.

Change kthreadd_setup() to set all handlers to SIG_IGN instead of blocking
them (make a new helper ignore_signals() for that).  If the kernel thread
needs some signal, it should use allow_signal() anyway, and in that case it
should not use CLONE_SIGHAND.

Note that we can't change daemonize() (should die!) in the same way,
because it can be used along with CLONE_SIGHAND.  This means that
allow_signal() still should unblock the signal to work correctly with
daemonize()ed threads.

However, disallow_signal() doesn't block the signal any longer but ignores
it.

NOTE: with or without this patch the kernel threads are not protected from
handle_stop_signal(), this seems harmless, but not good.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:53 -07:00
Oleg Nesterov
5de18d1697 worker_thread: don't play with SIGCHLD and numa policy
worker_thread() inherits ignored SIGCHLD and numa_default_policy() from its
parent, kthreadd.  No need to setup this again.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:53 -07:00
Oleg Nesterov
90cce03d9b wait_for_helper: remove unneeded do_sigaction()
allow_signal(SIGCHLD) does all necessary job, no need to call do_sigaction()
prior to.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:53 -07:00
Eric W. Biederman
49d769d52e Change reparent_to_init to reparent_to_kthreadd
When a kernel thread calls daemonize, instead of reparenting the thread to
init reparent the thread to kthreadd next to the threads created by
kthread_create.

This is really just a stop gap until daemonize goes away, but it does
ensure no kernel threads are under init and they are all in one place that
is easy to find.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:53 -07:00
Eric W. Biederman
73c279927f kthread: don't depend on work queues
Currently there is a circular reference between work queue initialization
and kthread initialization.  This prevents the kthread infrastructure from
initializing until after work queues have been initialized.

We want the properties of tasks created with kthread_create to be as close
as possible to the init_task and to not be contaminated by user processes.
The later we start our kthreadd that creates these tasks the harder it is
to avoid contamination from user processes and the more of a mess we have
to clean up because the defaults have changed on us.

So this patch modifies the kthread support to not use work queues but to
instead use a simple list of structures, and to have kthreadd start from
init_task immediately after our kernel thread that execs /sbin/init.

By being a true child of init_task we only have to change those process
settings that we want to have different from init_task, such as our process
name, the cpus that are allowed, blocking all signals and setting SIGCHLD
to SIG_IGN so that all of our children are reaped automatically.

By being a true child of init_task we also naturally get our ppid set to 0
and do not wind up as a child of PID == 1.  Ensuring that tasks generated
by kthread_create will not slow down the functioning of the wait family of
functions.

[akpm@linux-foundation.org: use interruptible sleeps]
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:53 -07:00
Oleg Nesterov
c93465181f ____call_usermodehelper: don't flush_signals()
____call_usermodehelper() has no reason for flush_signals().  It is a fresh
forked process which is going to exec a user-space application or exit on
failure.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:53 -07:00
Oleg Nesterov
28e53bddf8 unify flush_work/flush_work_keventd and rename it to cancel_work_sync
flush_work(wq, work) doesn't need the first parameter, we can use cwq->wq
(this was possible from the very beginnig, I missed this).  So we can unify
flush_work_keventd and flush_work.

Also, rename flush_work() to cancel_work_sync() and fix all callers.
Perhaps this is not the best name, but "flush_work" is really bad.

(akpm: this is why the earlier patches bypassed maintainers)

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Jeff Garzik <jeff@garzik.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jens Axboe <jens.axboe@oracle.com>
Cc: Tejun Heo <htejun@gmail.com>
Cc: Auke Kok <auke-jan.h.kok@intel.com>,
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:53 -07:00
Oleg Nesterov
a4798833d2 zap_other_threads: remove unneeded ->exit_signal change
We already depend on fact that all sub-threads have ->exit_signal == -1, no
need to set it in zap_other_threads().

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:53 -07:00
Oleg Nesterov
85f4186af9 worker_thread: fix racy try_to_freeze() usage
worker_thread() can miss freeze_process()->signal_wake_up() if it happens
between try_to_freeze() and prepare_to_wait().  We should check freezing()
before entering schedule().

This race was introduced by me in

	[PATCH 1/1] workqueue: don't migrate pending works from the dead CPU

Looks like mm/vmscan.c:kswapd() has the same race.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:53 -07:00
Oleg Nesterov
b9aac8e0d3 worker_thread: don't play with signals
worker_thread() doesn't need to "Block and flush all signals", this was
already done by its caller, kthread().

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:53 -07:00
Oleg Nesterov
23b2e5991a workqueue: kill NOAUTOREL works
We don't have any users, and it is not so trivial to use NOAUTOREL works
correctly.  It is better to simplify API.

Delete NOAUTOREL support and rename work_release to work_clear_pending to
avoid a confusion.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:52 -07:00
Oleg Nesterov
1634c48f8b make cancel_rearming_delayed_work() work on any workqueue, not just keventd_wq
cancel_rearming_delayed_workqueue(wq, dwork) doesn't need the first
parameter.  We don't hang on un-queued dwork any longer, and work->data
doesn't change its type.  This means we can always figure out "wq" from
dwork when it is needed.

Remove this parameter, and rename the function to
cancel_rearming_delayed_work().  Re-create an inline "obsolete"
cancel_rearming_delayed_workqueue(wq) which just calls
cancel_rearming_delayed_work().

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:52 -07:00
Oleg Nesterov
a848e3b67c workqueue: introduce wq_per_cpu() helper
Cleanup.  A number of per_cpu_ptr(wq->cpu_wq, cpu) users have to check that
cpu is valid for this wq.  Make a simple helper.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:52 -07:00
Oleg Nesterov
63bc036252 unify queue_delayed_work() and queue_delayed_work_on()
Change queue_delayed_work() to use queue_delayed_work_on() to avoid the code
duplication (saves 133 bytes).

Q: queue_delayed_work() enqueues &dwork->work directly when delay == 0, why?

[jirislaby@gmail.com: oops fix]
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Jiri Slaby <jirislaby@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:52 -07:00
Oleg Nesterov
ed7c0feede make queue_delayed_work() friendly to flush_fork()
Currently typeof(delayed_work->work.data) is

	"struct workqueue_struct" when the timer is pending

	"struct cpu_workqueue_struct" whe the work is queued

This makes impossible to use flush_fork(delayed_work->work) in addition
to cancel_delayed_work/cancel_rearming_delayed_work, not good.

Change queue_delayed_work/delayed_work_timer_fn to use cwq, not wq. This
complicates (and uglifies) these functions a little bit, but alows us to
use flush_fork(dwork) and imho makes the whole code more consistent.

Also, document the fact that cancel_rearming_delayed_work() doesn't garantee
the completion of work->func() upon return.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:52 -07:00
Oleg Nesterov
06ba38a9a0 workqueues: shift kthread_bind() from CPU_UP_PREPARE to CPU_ONLINE
CPU_UP_PREPARE binds cwq->thread to the new CPU.  So CPU_UP_CANCELED tries to
wake up the task which is bound to the failed CPU.

With this patch we don't bind cwq->thread until CPU becomes online.  The first
wake_up() after kthread_create() is a bit special, make a simple helper for
that.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Gautham R Shenoy <ego@in.ibm.com>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:52 -07:00
Oleg Nesterov
c12920d190 workqueue: make init_workqueues() __init
The only caller of init_workqueues() is do_basic_setup().

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:52 -07:00
Oleg Nesterov
cce1a1656c workqueue: introduce workqueue_struct->singlethread
Add explicit workqueue_struct->singlethread flag.  This lessens .text a
little, but most importantly this allows us to manipulate wq->list without
changine the meaning of is_single_threaded().

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:52 -07:00
Oleg Nesterov
b1f4ec172f workqueue: introduce cpu_singlethread_map
The code like

	if (is_single_threaded(wq))
		do_something(singlethread_cpu);
	else {
		for_each_cpu_mask(cpu, cpu_populated_map)
			do_something(cpu);
	}

looks very annoying. We can add "static cpumask_t cpu_singlethread_map" and
simplify the code. Lessens .text a bit, and imho makes the code more readable.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:52 -07:00
Oleg Nesterov
dfb4b82e1c workqueue: make cancel_rearming_delayed_workqueue() work on idle dwork
cancel_rearming_delayed_workqueue(dwork) will hang forever if dwork was not
scheduled, because in that case cancel_delayed_work()->del_timer_sync() never
returns true.

I don't know if there are any callers which may have problems, but this is not
so convenient, and the fix is very simple.

Q: looks like we don't need "struct workqueue_struct *wq" parameter.  If the
timer was aborted successfully, get_wq_data() == wq.  Is it worth to add the
new function?

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:52 -07:00
Oleg Nesterov
f293ea9200 workqueue: don't save interrupts in run_workqueue()
work->func() may sleep, it's a bug to call run_workqueue() with irqs disabled.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:52 -07:00
Oleg Nesterov
7097a87afe workqueue: kill run_scheduled_work()
Because it has no callers.

Actually, I think the whole idea of run_scheduled_work() was not right, not
good to mix "unqueue this work and execute its ->func()" in one function.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:52 -07:00
Oleg Nesterov
3af24433ef workqueue: don't migrate pending works from the dead CPU
Currently CPU_DEAD uses kthread_stop() to stop cwq->thread and then
transfers cwq->worklist to another CPU.  However, it is very unlikely that
worker_thread() will notice kthread_should_stop() before flushing
cwq->worklist.  It is only possible if worker_thread() was preempted after
run_workqueue(cwq), a new work_struct was added, and CPU_DEAD happened
before cwq->thread has a chance to run.

This means that take_over_work() mostly adds unneeded complications.  Note
also that kthread_stop() is not good per se, wake_up_process() may confuse
work->func() if it sleeps waiting for some event.

Remove take_over_work() and migrate_sequence complications.  CPU_DEAD sets
the cwq->should_stop flag (introduced by this patch) and waits for
cwq->thread to flush cwq->worklist and exit.  Because the dead CPU is not
on cpu_online_map, no more works can be added to that cwq.

cpu_populated_map was introduced to optimize for_each_possible_cpu(), it is
not strictly needed, and it is more a documentation in fact.

Saves 418 bytes.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: "Pallipadi, Venkatesh" <venkatesh.pallipadi@intel.com>
Cc: Gautham shenoy <ego@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:52 -07:00
Oleg Nesterov
36aa9dfc39 workqueue: don't clear cwq->thread until it exits
Pointed out by Srivatsa Vaddagiri.

cleanup_workqueue_thread() sets cwq->thread = NULL and does kthread_stop().
This breaks the "if (cwq->thread == current)" logic in flush_cpu_workqueue()
and leads to deadlock.

Kill the thead first, then clear cwq->thread. workqueue_mutex protects us
from create_workqueue_thread() so we don't need cwq->lock.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: "Pallipadi, Venkatesh" <venkatesh.pallipadi@intel.com>
Cc: Gautham shenoy <ego@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:52 -07:00
Oleg Nesterov
d721304dce workqueue: fix flush_workqueue() vs CPU_DEAD race
Many thanks to Srivatsa Vaddagiri for the helpful discussion and for spotting
the bug in my previous attempt.

work->func() (and thus flush_workqueue()) must not use workqueue_mutex,
this leads to deadlock when CPU_DEAD does kthread_stop(). However without
this mutex held we can't detect CPU_DEAD in progress, which can move pending
works to another CPU while the dead one is not on cpu_online_map.

Change flush_workqueue() to use for_each_possible_cpu(). This means that
flush_cpu_workqueue() may hit CPU which is already dead. However in that
case

	!list_empty(&cwq->worklist) || cwq->current_work != NULL

means that CPU_DEAD in progress, it will do kthread_stop() + take_over_work()
so we can proceed and insert a barrier. We hold cwq->lock, so we are safe.

Also, add migrate_sequence incremented by take_over_work() under cwq->lock.
If take_over_work() happened before we checked this CPU, we should see the
new value after spin_unlock().

Further possible changes:

	remove CPU_DEAD handling (along with take_over_work, migrate_sequence)
	from workqueue.c. CPU_DEAD just sets cwq->please_exit_after_flush flag.

	CPU_UP_PREPARE->create_workqueue_thread() clears this flag, and creates
	the new thread if cwq->thread == NULL.

This way the workqueue/cpu-hotplug interaction is almost zero, workqueue_mutex
just protects "workqueues" list, CPU_LOCK_ACQUIRE/CPU_LOCK_RELEASE go away.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: "Pallipadi, Venkatesh" <venkatesh.pallipadi@intel.com>
Cc: Gautham shenoy <ego@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:52 -07:00
Oleg Nesterov
319c2a986e workqueue: fix freezeable workqueues implementation
Currently ->freezeable is per-cpu, this is wrong. CPU_UP_PREPARE creates
cwq->thread which is not freezeable. Move ->freezeable to workqueue_struct.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: "Pallipadi, Venkatesh" <venkatesh.pallipadi@intel.com>
Cc: Gautham shenoy <ego@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:51 -07:00
Heiko Carstens
e7407dcc69 call cpu_chain with CPU_DOWN_FAILED if CPU_DOWN_PREPARE failed
This makes cpu hotplug symmetrical: if CPU_UP_PREPARE fails we get
CPU_UP_CANCELED, so we can undo what ever happened on PREPARE.  The same
should happen for CPU_DOWN_PREPARE.

[akpm@linux-foundation.org: fix for reduce-size-of-task_struct-on-64-bit-machines]
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Gautham Shenoy <ego@in.ibm.com>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:51 -07:00
Gautham R Shenoy
5be9361cdf Eliminate lock_cpu_hotplug in kernel/schedc
Eliminate lock_cpu_hotplug from kernel/sched.c and use sched_hotcpu_mutex
instead to postpone a hotplug event.

In the migration_call hotcpu callback function, take sched_hotcpu_mutex
while handling the event CPU_LOCK_ACQUIRE and release it while handling
CPU_LOCK_RELEASE event.

[akpm@linux-foundation.org: fix deadlock]
Signed-off-by: Gautham R Shenoy <ego@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:51 -07:00
Gautham R Shenoy
baaca49f41 Define and use new events,CPU_LOCK_ACQUIRE and CPU_LOCK_RELEASE
This is an attempt to provide an alternate mechanism for postponing
a hotplug event instead of using a global mechanism like lock_cpu_hotplug.

The proposal is to add two new events namely CPU_LOCK_ACQUIRE and
CPU_LOCK_RELEASE. The notification for these two events would be sent
out before and after a cpu_hotplug event respectively.

During the CPU_LOCK_ACQUIRE event, a cpu-hotplug-aware subsystem is
supposed to acquire any per-subsystem hotcpu mutex ( Eg. workqueue_mutex
in kernel/workqueue.c ).

During the CPU_LOCK_RELEASE release event the cpu-hotplug-aware subsystem
is supposed to release the per-subsystem hotcpu mutex.

The reasons for defining new events as opposed to reusing the existing events
like CPU_UP_PREPARE/CPU_UP_FAILED/CPU_ONLINE for locking/unlocking of
per-subsystem hotcpu mutexes are as follow:

	- CPU_LOCK_ACQUIRE: All hotcpu mutexes are taken before subsystems
	start handling pre-hotplug events like CPU_UP_PREPARE/CPU_DOWN_PREPARE
	etc, thus ensuring a clean handling of these events.

	- CPU_LOCK_RELEASE: The hotcpu mutexes will be released only after
	all subsystems have handled post-hotplug events like CPU_DOWN_FAILED,
	CPU_DEAD,CPU_ONLINE etc thereby ensuring that there are no subsequent
	clashes amongst the interdependent subsystems after a cpu hotplugs.

This patch also uses __raw_notifier_call chain in _cpu_up to take care
of the dependency between the two consequetive calls to
raw_notifier_call_chain.

[akpm@linux-foundation.org: fix a bug]
Signed-off-by: Gautham R Shenoy <ego@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:51 -07:00
Gautham R Shenoy
6f7cc11aa6 Extend notifier_call_chain to count nr_calls made
Since 2.6.18-something, the community has been bugged by the problem to
provide a clean and a stable mechanism to postpone a cpu-hotplug event as
lock_cpu_hotplug was badly broken.

This is another proposal towards solving that problem.  This one is along the
lines of the solution provided in kernel/workqueue.c

Instead of having a global mechanism like lock_cpu_hotplug, we allow the
subsytems to define their own per-subsystem hot cpu mutexes.  These would be
taken(released) where ever we are currently calling
lock_cpu_hotplug(unlock_cpu_hotplug).

Also, in the per-subsystem hotcpu callback function,we take this mutex before
we handle any pre-cpu-hotplug events and release it once we finish handling
the post-cpu-hotplug events.  A standard means for doing this has been
provided in [PATCH 2/4] and demonstrated in [PATCH 3/4].

The ordering of these per-subsystem mutexes might still prove to be a
problem, but hopefully lockdep should help us get out of that muddle.

The patch set to be applied against linux-2.6.19-rc5 is as follows:

[PATCH 1/4] :	Extend notifier_call_chain with an option to specify the
		number of notifications to be sent and also count the
		number of notifications actually sent.

[PATCH 2/4] :	Define events CPU_LOCK_ACQUIRE and CPU_LOCK_RELEASE
		and send out notifications for these in _cpu_up and
		_cpu_down. This would help us standardise the acquire and
		release of the subsystem locks in the hotcpu
		callback functions of these subsystems.

[PATCH 3/4] :	Eliminate lock_cpu_hotplug from kernel/sched.c.

[PATCH 4/4] :	In workqueue_cpu_callback function, acquire(release) the
		workqueue_mutex while handling
		CPU_LOCK_ACQUIRE(CPU_LOCK_RELEASE).

If the per-subsystem-locking approach survives the test of time, we can expect
a slow phasing out of lock_cpu_hotplug, which has not yet been eliminated in
these patches :)

This patch:

Provide notifier_call_chain with an option to call only a specified number of
notifiers and also record the number of call to notifiers made.

The need for this enhancement was identified in the post entitled
"Slab - Eliminate lock_cpu_hotplug from slab"
(http://lkml.org/lkml/2006/10/28/92) by Ravikiran G Thirumalai and
Andrew Morton.

This patch adds two additional parameters to notifier_call_chain API namely
 - int nr_to_calls : Number of notifier_functions to be called.
 		     The don't care value is -1.

 - unsigned int *nr_calls : Records the total number of notifier_funtions
			    called by notifier_call_chain. The don't care
			    value is NULL.

[michal.k.k.piotrowski@gmail.com: build fix]
Credit: Andrew Morton <akpm@osdl.org>
Signed-off-by: Gautham R Shenoy <ego@in.ibm.com>
Signed-off-by: Michal Piotrowski <michal.k.k.piotrowski@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:51 -07:00
Tom Zanussi
7c9cb38302 relay: use plain timer instead of delayed work
relay doesn't need to use schedule_delayed_work() for waking readers
when a simple timer will do.

Signed-off-by: Tom Zanussi <zanussi@comcast.net>
Cc: Satyam Sharma <satyam.sharma@gmail.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:51 -07:00
Oleg Nesterov
83c22520c5 flush_cpu_workqueue: don't flush an empty ->worklist
Now when we have ->current_work we can avoid adding a barrier and waiting
for its completition when cwq's queue is empty.

Note: this change is also useful if we change flush_workqueue() to also
check the dead CPUs.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Gautham Shenoy <ego@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:51 -07:00
Andrew Morton
edab2516a6 flush_workqueue(): use preempt_disable to hold off cpu hotplug
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Gautham Shenoy <ego@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:51 -07:00
Oleg Nesterov
b89deed32c implement flush_work()
A basic problem with flush_scheduled_work() is that it blocks behind _all_
presently-queued works, rather than just the work whcih the caller wants to
flush.  If the caller holds some lock, and if one of the queued work happens
to want that lock as well then accidental deadlocks can occur.

One example of this is the phy layer: it wants to flush work while holding
rtnl_lock().  But if a linkwatch event happens to be queued, the phy code will
deadlock because the linkwatch callback function takes rtnl_lock.

So we implement a new function which will flush a *single* work - just the one
which the caller wants to free up.  Thus we avoid the accidental deadlocks
which can arise from unrelated subsystems' callbacks taking shared locks.

flush_work() non-blockingly dequeues the work_struct which we want to kill,
then it waits for its handler to complete on all CPUs.

Add ->current_work to the "struct cpu_workqueue_struct", it points to
currently running "struct work_struct". When flush_work(work) detects
->current_work == work, it inserts a barrier at the _head_ of ->worklist
(and thus right _after_ that work) and waits for completition. This means
that the next work fired on that CPU will be this barrier, or another
barrier queued by concurrent flush_work(), so the caller of flush_work()
will be woken before any "regular" work has a chance to run.

When wait_on_work() unlocks workqueue_mutex (or whatever we choose to protect
against CPU hotplug), CPU may go away. But in that case take_over_work() will
move a barrier we queued to another CPU, it will be fired sometime, and
wait_on_work() will be woken.

Actually, we are doing cleanup_workqueue_thread()->kthread_stop() before
take_over_work(), so cwq->thread should complete its ->worklist (and thus
the barrier), because currently we don't check kthread_should_stop() in
run_workqueue(). But even if we did, everything should be ok.

[akpm@osdl.org: cleanup]
[akpm@osdl.org: add flush_work_keventd() wrapper]
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:50 -07:00
Oleg Nesterov
fc2e4d7041 reimplement flush_workqueue()
Remove ->remove_sequence, ->insert_sequence, and ->work_done from struct
cpu_workqueue_struct.  To implement flush_workqueue() we can queue a
barrier work on each CPU and wait for its completition.

The barrier is queued under workqueue_mutex to ensure that per cpu
wq->cpu_wq is alive, we drop this mutex before going to sleep.  If CPU goes
down while we are waiting for completition, take_over_work() will move the
barrier on another CPU, and the handler will wake up us eventually.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:50 -07:00
Andrew Morton
e18f3ffb9c schedule_on_each_cpu(): use preempt_disable()
We take workqueue_mutex in there to keep CPU hotplug away.  But
preempt_disable() will suffice for that.

Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:50 -07:00
David Miller
9b04bd2756 Fix printk format warnings in timer_list.c
u64 and s64 are not necessarily 'long long' on some 64-bit
platforms, so explicit the type to kill the compiler warnings.

Also consistently use '%Lu' which is unsigned.

Signed-off-by: David S. Miller <davem@davemloft.net>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:50 -07:00
Daniel Walker
35c35d1afa clocksource: spelling error in watchdog code
There's more that need fixing, and fix my own subject spelling error too.

Signed-off-by: Daniel Walker <dwalker@mvista.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:49 -07:00
Roland McGrath
55c0d1f83e Move sig_kernel_* et al macros to linux/signal.h
This patch moves the sig_kernel_* and related macros from kernel/signal.c
to linux/signal.h, and cleans them up slightly.  I need the sig_kernel_*
macros for default signal behavior in the utrace code, and want to avoid
duplication or overhead to share the knowledge.

Signed-off-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:49 -07:00
Akinobu Mita
85badbdf51 use simple_read_from_buffer in kernel/
Cleanup using simple_read_from_buffer() for /dev/cpuset/tasks and
/proc/config.gz.

Cc: Paul Jackson <pj@sgi.com>
Cc: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:49 -07:00
Jeff Dike
cb0c78cc94 Fix Linuxdoc comment
A linuxdoc comment had fallen out of date - it refers to an argument which no
longer exists.

Signed-off-by: Jeff Dike <jdike@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:48 -07:00
Rafael J. Wysocki
a3d25c275d PM: Separate hibernation code from suspend code
[ With Johannes Berg <johannes@sipsolutions.net> ]

Separate the hibernation (aka suspend to disk code) from the other suspend
code.  In particular:

 * Remove the definitions related to hibernation from include/linux/pm.h
 * Introduce struct hibernation_ops and a new hibernate() function to hibernate
   the system, defined in include/linux/suspend.h
 * Separate suspend code in kernel/power/main.c from hibernation-related code
   in kernel/power/disk.c and kernel/power/user.c (with the help of
   hibernation_ops)
 * Switch ACPI (the only user of pm_ops.pm_disk_mode) to hibernation_ops

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Greg KH <greg@kroah.com>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Nigel Cunningham <nigel@nigel.suspend2.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:48 -07:00
Andrew Morton
d60846c4d1 swsusp: clean up printk
Remove an inexplicable /

Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:48 -07:00
John Anthony Kazos Jr
f42df9e658 general: convert "kernel" subdirectory to UTF-8
Convert the "kernel" subdirectory of the tree to UTF-8. The only file
modified is <kernel/sys.c>.

Signed-off-by: John Anthony Kazos Jr. <jakj@j-a-k-j.com>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
2007-05-09 08:58:20 +02:00
Michael Opdenacker
59c51591a0 Fix occurrences of "the the "
Signed-off-by: Michael Opdenacker <michael@free-electrons.com>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
2007-05-09 08:57:56 +02:00
Johannes Berg
940d67f6b9 [POWERPC] swsusp: Introduce register_nosave_region_late
This patch introduces a new register_nosave_region_late function that
can be called from initcalls when register_nosave_region can no longer
be used because it uses bootmem.

Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Acked-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Paul Mackerras <paulus@samba.org>
2007-05-09 16:34:56 +10:00
Robert P. J. Day
02a3e59a08 Fix minor typoes in kernel/module.c
Fix minor (comment) typoes in kernel/module.c.

Signed-off-by: Robert P. J. Day <rpjday@mindspring.com>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
2007-05-09 07:26:28 +02:00
David Sterba
3dde6ad8fc Fix trivial typos in Kconfig* files
Fix several typos in help text in Kconfig* files.

Signed-off-by: David Sterba <dave@jikos.cz>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
2007-05-09 07:12:20 +02:00
Andrew Morton
d5f9f942c6 revert 'sched: redundant reschedule when set_user_nice() boosts a prio of a task from the "expired" array'
Revert commit bd53f96ca5.

Con says:

This is no good, sorry. The one I saw originally was with the staircase
deadline cpu scheduler in situ and was different.

  #define TASK_PREEMPTS_CURR(p, rq) \
     ((p)->prio < (rq)->curr->prio)
     (((p)->prio < (rq)->curr->prio) && ((p)->array == (rq)->active))

This will fail to wake up a runqueue for a task that has been migrated to the
expired array of a runqueue which is otherwise idle which can happen with smp
balancing,

Cc: Dmitry Adamushko <dmitry.adamushko@gmail.com>
Cc: Con Kolivas <kernel@kolivas.org>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-08 20:41:15 -07:00
Linus Torvalds
df6d3916f3 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc
* 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc: (77 commits)
  [POWERPC] Abolish powerpc_flash_init()
  [POWERPC] Early serial debug support for PPC44x
  [POWERPC] Support for the Ebony 440GP reference board in arch/powerpc
  [POWERPC] Add device tree for Ebony
  [POWERPC] Add powerpc/platforms/44x, disable platforms/4xx for now
  [POWERPC] MPIC U3/U4 MSI backend
  [POWERPC] MPIC MSI allocator
  [POWERPC] Enable MSI mappings for MPIC
  [POWERPC] Tell Phyp we support MSI
  [POWERPC] RTAS MSI implementation
  [POWERPC] PowerPC MSI infrastructure
  [POWERPC] Rip out the existing powerpc msi stubs
  [POWERPC] Remove use of 4level-fixup.h for ppc32
  [POWERPC] Add powerpc PCI-E reset API implementation
  [POWERPC] Holly bootwrapper
  [POWERPC] Holly DTS
  [POWERPC] Holly defconfig
  [POWERPC] Add support for 750CL Holly board
  [POWERPC] Generalize tsi108 PCI setup
  [POWERPC] Generalize tsi108 PHY types
  ...

Fixed conflict in include/asm-powerpc/kdebug.h manually

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-08 11:50:19 -07:00
Bernhard Walle
d85a60d85e Add IRQF_IRQPOLL flag (common code)
irqpoll is broken on some architectures that don't use the IRQ 0 for the timer
interrupt like IA64.  This patch adds a IRQF_IRQPOLL flag.

Each architecture is handled in a separate pach.  As I left the irq == 0 as
condition, this should not break existing architectures that use timer_irq ==
0 and that I did't address with that patch (because I don't know).

This patch:

This patch adds a IRQF_IRQPOLL flag that the interrupt registration code could
use for the interrupt it wants to use for IRQ polling.

Because this must not be the timer interrupt, an additional flag was added
instead of re-using the IRQF_TIMER constant.  Until all architectures will
have an IRQF_IRQPOLL interrupt, irq == 0 will stay as alternative as it should
not break anything.

Also, note_interrupt() is called on CPU-specific interrupts to be used as
interrupt source for IRQ polling.

Signed-off-by: Bernhard Walle <bwalle@suse.de>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Cc: Matthew Wilcox <willy@debian.org>
Cc: Grant Grundler <grundler@google.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-08 11:15:22 -07:00
Ananth N Mavinakayanahalli
bf8f6e5b3e Kprobes: The ON/OFF knob thru debugfs
This patch provides a debugfs knob to turn kprobes on/off

o A new file /debug/kprobes/enabled indicates if kprobes is enabled or
  not (default enabled)
o Echoing 0 to this file will disarm all installed probes
o Any new probe registration when disabled will register the probe but
  not arm it. A message will be printed out in such a case.
o When a value 1 is echoed to the file, all probes (including ones
  registered in the intervening period) will be enabled
o Unregistration will happen irrespective of whether probes are globally
  enabled or not.
o Update Documentation/kprobes.txt to reflect these changes. While there
  also update the doc to make it current.

We are also looking at providing sysrq key support to tie to the disabling
feature provided by this patch.

[akpm@linux-foundation.org: Use bool like a bool!]
[akpm@linux-foundation.org: add printk facility levels]
[cornelia.huck@de.ibm.com: Add the missing arch_trampoline_kprobe() for s390]
Signed-off-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Signed-off-by: Srinivasa DS <srinivasa@in.ibm.com>
Signed-off-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-08 11:15:19 -07:00
Christoph Hellwig
4c4308cb93 kprobes: kretprobes simplifications
- consolidate duplicate code in all arch_prepare_kretprobe instances
   into common code
 - replace various odd helpers that use hlist_for_each_entry to get
   the first elemenet of a list with either a hlist_for_each_entry_save
   or an opencoded access to the first element in the caller
 - inline add_rp_inst into it's only remaining caller
 - use kretprobe_inst_table_head instead of opencoding it

Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: Prasanna S Panchamukhi <prasanna@in.ibm.com>
Acked-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-08 11:15:19 -07:00
Christoph Hellwig
6f716acd5f kprobes: codingstyle cleanups
Remove superflous braces and fix indentation aswell as comments.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: Prasanna S Panchamukhi <prasanna@in.ibm.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-08 11:15:19 -07:00
Christoph Hellwig
b0bb501651 kprobes: use hlist_for_each_entry
Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: Prasanna S Panchamukhi <prasanna@in.ibm.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-08 11:15:19 -07:00
Josh Triplett
ade5fb818f rcutorture: Remove redundant assignment to cur_ops in for loop
The for loop in rcutorture_init uses the condition
cur_ops = torture_ops[i], cur_ops
but then makes the same assignment to cur_ops inside the loop.  Remove the
redundant assignment inside the loop, and remove now-unnecessary braces.

Signed-off-by: Josh Triplett <josh@kernel.org>
Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-08 11:15:17 -07:00
Josh Triplett
c8e5b16310 rcutorture: style cleanup: avoid != NULL in boolean tests
Signed-off-by: Josh Triplett <josh@kernel.org>
Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-08 11:15:17 -07:00
Ahmed S. Darwish
788e770eb2 rcutorture: Use ARRAY_SIZE macro when appropriate
Use ARRAY_SIZE macro already defined in kernel.h

Signed-off-by: Ahmed S. Darwish <darwish.07@gmail.com>
Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-08 11:15:17 -07:00
Siddha, Suresh B
c3396620ca sched: align rq to cacheline boundary
Align the per cpu runqueue to the cacheline boundary.  This will minimize
the number of cachelines touched during remote wakeup.

Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: Ravikiran G Thirumalai <kiran@scalex86.org>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-08 11:15:17 -07:00
Dmitry Adamushko
bd53f96ca5 sched: redundant reschedule when set_user_nice() boosts a prio of a task from the "expired" array
- Make TASK_PREEMPTS_CURR(task, rq) return "true" only if the task's prio
  is higher than the current's one and the task is in the "active" array.
  This ensures we don't make redundant resched_task() calls when the task
  is in the "expired" array (as may happen now in set_user_prio(),
  rt_mutex_setprio() and pull_task() ) ;

- generalise conditions for a call to resched_task() in set_user_nice(),
  rt_mutex_setprio() and sched_setscheduler()

Signed-off-by: Dmitry Adamushko <dmitry.adamushko@gmail.com>
Cc: Con Kolivas <kernel@kolivas.org>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-08 11:15:17 -07:00
Siddha, Suresh B
4953198b6c sched: optimize siblings status check logic in wake_idle()
When a logical cpu 'x' already has more than one process running, then most
likely the siblings of that cpu 'x' must be busy.  Otherwise the idle
siblings would have likely(in most of the scenarios) picked up the extra
load making the load on 'x' atmost one.

Use this logic to eliminate the siblings status check and minimize the cache
misses encountered on a heavily loaded system.

Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-08 11:15:17 -07:00
Eric Dumazet
5517d86bea Speed up divides by cpu_power in scheduler
I noticed expensive divides done in try_to_wakeup() and
find_busiest_group() on a bi dual core Opteron machine (total of 4 cores),
moderatly loaded (15.000 context switch per second)

oprofile numbers :

CPU: AMD64 processors, speed 2600.05 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit
mask of 0x00 (No unit mask) count 50000
samples  %        symbol name
...
613914    1.0498  try_to_wake_up
    834  0.0013 :ffffffff80227ae1:   div    %rcx
77513  0.1191 :ffffffff80227ae4:   mov    %rax,%r11

608893    1.0413  find_busiest_group
   1841  0.0031 :ffffffff802260bf:       div    %rdi
140109  0.2394 :ffffffff802260c2:       test   %sil,%sil

Some of these divides can use the reciprocal divides we introduced some
time ago (currently used in slab AFAIK)

We can assume a load will fit in a 32bits number, because with a
SCHED_LOAD_SCALE=128 value, its still a theorical limit of 33554432

When/if we reach this limit one day, probably cpus will have a fast
hardware divide and we can zap the reciprocal divide trick.

Ingo suggested to rename cpu_power to __cpu_power to make clear it should
not be modified without changing its reciprocal value too.

I did not convert the divide in cpu_avg_load_per_task(), because tracking
nr_running changes may be not worth it ?  We could use a static table of 32
reciprocal values but it would add a conditional branch and table lookup.

[akpm@linux-foundation.org: !SMP build fix]
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-08 11:15:17 -07:00
Siddha, Suresh B
46cb4b7c88 sched: dynticks idle load balancing
Fix the process idle load balancing in the presence of dynticks.  cpus for
which ticks are stopped will sleep till the next event wakes it up.
Potentially these sleeps can be for large durations and during which today,
there is no periodic idle load balancing being done.

This patch nominates an owner among the idle cpus, which does the idle load
balancing on behalf of the other idle cpus.  And once all the cpus are
completely idle, then we can stop this idle load balancing too.  Checks added
in fast path are minimized.  Whenever there are busy cpus in the system, there
will be an owner(idle cpu) doing the system wide idle load balancing.

Open items:
1. Intelligent owner selection (like an idle core in a busy package).
2. Merge with rcu's nohz_cpu_mask?

Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-08 11:15:17 -07:00
Siddha, Suresh B
bdecea3a92 sched: fix idle load balancing in softirqd context
Periodic load balancing in recent kernels happen in the softirq.  In
certain -rt configurations, these softirqs are handled in softirqd context.
 And hence the check for idle processor was always returning busy (as
nr_running > 1).

This patch captures the idle information at the tick and passes this info
to softirq context through an element 'idle_at_tick' in rq.

[kernel@kolivas.org: Fix reverse idle at tick logic]
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-08 11:15:17 -07:00
Stas Sergeev
6bdb6b620e export hrtimer_forward
Other symbols of the hrtimers API are already exported.

Signed-off-by: Stas Sergeev <stsp@aknet.ru>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-08 11:15:15 -07:00
David Rientjes
6f7f02e78a cpusets: allow empty {cpus,mems}_allowed to be set for unpopulated cpuset
You currently cannot remove all cpus or mems from cpus_allowed or
mems_allowed of a cpuset.  We now allow both if there are no attached
tasks.

Acked-by: Paul Jackson <pj@sgi.com>
Cc: Christoph Lameter <clameter@engr.sgi.com>
Signed-off-by: Paul Menage <menage@google.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-08 11:15:14 -07:00
Jarek Poplawski
4ff773bbde lockdep: removed unused ip argument in mark_lock & mark_held_locks
It looks like a remainder from designing...

Signed-off-by: Jarek Poplawski <jarkao@o2.pl>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-08 11:15:13 -07:00
Adrian Bunk
35bab756b4 The scheduled -EINVAL for invalid timevals in setitimer
As scheduled, do_setitimer() now returns -EINVAL for invalid timeval.

Signed-off-by: Adrian Bunk <bunk@stusta.de>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-08 11:15:13 -07:00
Tom Alsberg
9926e4c743 CPU time limit patch / setrlimit(RLIMIT_CPU, 0) cheat fix
As discovered here today, the change in Kernel 2.6.17 intended to inhibit
users from setting RLIMIT_CPU to 0 (as that is equivalent to unlimited) by
"cheating" and setting it to 1 in such a case, does not make a difference,
as the check is done in the wrong place (too late), and only applies to the
profiling code.

On all systems I checked running kernels above 2.6.17, no matter what the
hard and soft CPU time limits were before, a user could escape them by
issuing in the shell (sh/bash/zsh) "ulimit -t 0", and then the user's
process was not ever killed.

Attached is a trivial patch to fix that.  Simply moving the check to a
slightly earlier location (specifically, before the line that actually
assigns the limit - *old_rlim = new_rlim), does the trick.

Do note that at least the zsh (but not ash, dash, or bash) shell has the
problem of "caching" the limits set by the ulimit command, so when running
zsh the fix will not immediately be evident - after entering "ulimit -t 0",
"ulimit -a" will show "-t: cpu time (seconds) 0", even though the actual
limit as returned by getrlimit(...) will be 1.  It can be verified by
opening a subshell (which will not have the values of the parent shell in
cache) and checking in it, or just by running a CPU intensive command like
"echo '65536^1048576' | bc" and verifying that it dumps core after one
second.

Regardless of whether that is a misfeature in the shell, perhaps it would
be better to return -EINVAL from setrlimit in such a case instead of
cheating and setting to 1, as that does not really reflect the actual state
of the process anymore.  I do not however know what the ground for that
decision was in the original 2.6.17 change, and whether there would be any
"backward" compatibility issues, so I preferred not to touch that right
now.

Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-08 11:15:12 -07:00
Pavel Emelianov
b5e618181a Introduce a handy list_first_entry macro
There are many places in the kernel where the construction like

   foo = list_entry(head->next, struct foo_struct, list);

are used.
The code might look more descriptive and neat if using the macro

   list_first_entry(head, type, member) \
             list_entry((head)->next, type, member)

Here is the macro itself and the examples of its usage in the generic code.
 If it will turn out to be useful, I can prepare the set of patches to
inject in into arch-specific code, drivers, networking, etc.

Signed-off-by: Pavel Emelianov <xemul@openvz.org>
Signed-off-by: Kirill Korotaev <dev@openvz.org>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Zach Brown <zach.brown@oracle.com>
Cc: Davide Libenzi <davidel@xmailserver.org>
Cc: John McCutchan <ttb@tentacle.dhs.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: john stultz <johnstul@us.ibm.com>
Cc: Ram Pai <linuxram@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-08 11:15:11 -07:00
Jarek Poplawski
9e860d000a lockdep: lookup_chain_cache comment errata
Signed-off-by: Jarek Poplawski <jarkao2@o2.pl>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-08 11:15:11 -07:00
Thomas Gleixner
d3ed782458 highres/dyntick: prevent xtime lock contention
While the !highres/!dyntick code assigns the duty of the do_timer() call to
one specific CPU, this was dropped in the highres/dyntick part during
development.

Steven Rostedt discovered the xtime lock contention on highres/dyntick due
to several CPUs trying to update jiffies.

Add the single CPU assignement back.  In the dyntick case this needs to be
handled carefully, as the CPU which has the do_timer() duty must drop the
assignement and let it be grabbed by another CPU, which is active.
Otherwise the do_timer() calls would not happen during the long sleep.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: Steven Rostedt <rostedt@goodmis.org>
Acked-by: Mark Lord <mlord@pobox.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-08 11:15:10 -07:00
Martin Peschke
5a0c6a0d1a kallsyms: cleanup: use seq_release_private() where appropriate
We can save some lines of code by using seq_release_private().

Signed-off-by: Martin Peschke <mp3@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-08 11:15:09 -07:00
Robert P. J. Day
039b6b3ed8 audit: add spaces on either side of case "..." operator.
Following the programming advice laid down in the gcc manual, make
sure the case "..." operator has spaces on either side.

According to:

http://gcc.gnu.org/onlinedocs/gcc-4.1.2/gcc/Case-Ranges.html#Case-Ranges:

  "Be careful: Write spaces around the ..., for otherwise it may be
parsed wrong when you use it with integer values."

Signed-off-by: Robert P. J. Day <rpjday@mindspring.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-08 11:15:09 -07:00
Ravikiran G Thirumalai
e729aa16b1 Pad irq_desc to internode cacheline size
We noticed a drop in n/w performance due to the irq_desc being cacheline
aligned rather than internode aligned.  We see 50% of expected performance
when two e1000 nics local to two different nodes have consecutive irq
descriptors allocated, due to false sharing.

Note that this patch does away with cacheline padding for the UP case, as
it does not seem useful for UP configurations.

Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org>
Signed-off-by: Shai Fultheim <shai@scalex86.org>
Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-08 11:15:09 -07:00
Pavel Emelianov
428e6ce023 Lockdep treats down_write_trylock like regular down_write
This causes constructions like

down_write(&mm1->mmap_sem);
if (down_write_trylock(&mm2->mmap_sem)) {
       ...
       up_write(&mm2->mmap_sem);
}
up_write(&mm1->mmap_sem);

generate a lockdep warning about circular locking dependence.

Call rwsem_acquire() with trylock set to 1.

Cc: Ingo Molnar <mingo@elte.hu>
Cc: Arjan van de Ven <arjan@infradead.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-08 11:15:09 -07:00
Bert Wesarg
9730b5b06f kernel/params.c: fix lying comment for param_array()
This fixes the comment for the function param_array. Which lies that it
only *temporarily* mangle the input string @val.

Signed-off-by: Bert Wesarg <wesarg@informatik.uni-halle.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-08 11:15:08 -07:00
Alexey Dobriyan
a5c43dae7a Fix race between cat /proc/slab_allocators and rmmod
Same story as with cat /proc/*/wchan race vs rmmod race, only
/proc/slab_allocators want more info than just symbol name.

Signed-off-by: Alexey Dobriyan <adobriyan@sw.ru>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-08 11:15:08 -07:00
Alexey Dobriyan
9d65cb4a17 Fix race between cat /proc/*/wchan and rmmod et al
kallsyms_lookup() can go iterating over modules list unprotected which is OK
for emergency situations (oops), but not OK for regular stuff like
/proc/*/wchan.

Introduce lookup_symbol_name()/lookup_module_symbol_name() which copy symbol
name into caller-supplied buffer or return -ERANGE.  All copying is done with
module_mutex held, so...

Signed-off-by: Alexey Dobriyan <adobriyan@sw.ru>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-08 11:15:08 -07:00
Alexey Dobriyan
ffb4512276 Simplify kallsyms_lookup()
Several kallsyms_lookup() pass dummy arguments but only need, say, module's
name.  Make kallsyms_lookup() accept NULLs where possible.

Also, makes picture clearer about what interfaces are needed for all symbol
resolving business.

Signed-off-by: Alexey Dobriyan <adobriyan@sw.ru>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-08 11:15:08 -07:00
Alexey Dobriyan
ea07890a68 Fix race between rmmod and cat /proc/kallsyms
module_get_kallsym() leaks "struct module *" outside of module_mutex which is
no-no, because module can dissapear right after mutex unlock.

Copy all needed information from inside module_mutex into caller-supplied
space.

[bunk@stusta.de: is_exported() can now become static]
Signed-off-by: Alexey Dobriyan <adobriyan@sw.ru>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-08 11:15:08 -07:00
Alexey Dobriyan
ae84e32470 Simplify module_get_kallsym() by dropping length arg
module_get_kallsym() could in theory truncate module symbol name to fit in
buffer, but nobody does this.  Always use KSYM_NAME_LEN + 1 bytes for name.

Suggested by lg^WRusty.

Signed-off-by: Alexey Dobriyan <adobriyan@sw.ru>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-08 11:15:08 -07:00
Jan Engelhardt
b73a7e76c1 Fix kevent's childs priority greediness
Fix kevent's childs priority greediness.  Such tasks were always scheduled
at nice level -5 and, at that time, udev stole us the CPU time with -5.

Already posted at http://lkml.org/lkml/2005/1/10/85

[akpm@linux-foundation.org: add comment]
Signed-off-by: Jan Engelhardt <jengelh@gmx.de>
Cc: Chris Wright <chrisw@sous-sol.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-08 11:15:07 -07:00
Simon Horman
6672f76a5a kdump/kexec: calculate note size at compile time
Currently the size of the per-cpu region reserved to save crash notes is
set by the per-architecture value MAX_NOTE_BYTES.  Which in turn is
currently set to 1024 on all supported architectures.

While testing ia64 I recently discovered that this value is in fact too
small.  The particular setup I was using actually needs 1172 bytes.  This
lead to very tedious failure mode where the tail of one elf note would
overwrite the head of another if they ended up being alocated sequentially
by kmalloc, which was often the case.

It seems to me that a far better approach is to caclculate the size that
the area needs to be.  This patch does just that.

If a simpler stop-gap patch for ia64 to be squeezed into 2.6.21(.X) is
needed then this should be as easy as making MAX_NOTE_BYTES larger in
arch/asm-ia64/kexec.h.  Perhaps 2048 would be a good choice.  However, I
think that the approach in this patch is a much more robust idea.

Acked-by:  Vivek Goyal <vgoyal@in.ibm.com>
Signed-off-by: Simon Horman <horms@verge.net.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-08 11:15:07 -07:00
Randy Dunlap
e63340ae6b header cleaning: don't include smp_lock.h when not used
Remove includes of <linux/smp_lock.h> where it is not used/needed.
Suggested by Al Viro.

Builds cleanly on x86_64, i386, alpha, ia64, powerpc, sparc,
sparc64, and arm (all 59 defconfigs).

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-08 11:15:07 -07:00
Jeremy Fitzhardinge
04c9167f91 add touch_all_softlockup_watchdogs()
Add touch_all_softlockup_watchdogs() to allow the softlockup watchdog
timers on all cpus to be updated.  This is used to prevent sysrq-t from
generating a spurious watchdog message when generating lots of output.

Softlockup watchdogs use sched_clock() as its timebase, which is inherently
per-cpu (at least, when it is measuring unstolen time).  Because of this,
it isn't possible for one CPU to directly update the other CPU's timers,
but it is possible to tell the other CPUs to do update themselves
appropriately.

Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Acked-by: Chris Lalancette <clalance@redhat.com>
Signed-off-by: Prarit Bhargava <prarit@redhat.com>
Cc: Rick Lindsley <ricklind@us.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-08 11:15:06 -07:00
Jeremy Fitzhardinge
966812dc98 Ignore stolen time in the softlockup watchdog
The softlockup watchdog is currently a nuisance in a virtual machine, since
the whole system could have the CPU stolen from it for a long period of
time.  While it would be unlikely for a guest domain to be denied timer
interrupts for over 10s, it could happen and any softlockup message would
be completely spurious.

Earlier I proposed that sched_clock() return time in unstolen nanoseconds,
which is how Xen and VMI currently implement it.  If the softlockup
watchdog uses sched_clock() to measure time, it would automatically ignore
stolen time, and therefore only report when the guest itself locked up.
When running native, sched_clock() returns real-time nanoseconds, so the
behaviour would be unchanged.

Note that sched_clock() used this way is inherently per-cpu, so this patch
makes sure that the per-processor watchdog thread initialized its own
timestamp.

Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: john stultz <johnstul@us.ibm.com>
Cc: Zachary Amsden <zach@vmware.com>
Cc: James Morris <jmorris@namei.org>
Cc: Dan Hecht <dhecht@vmware.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Chris Lalancette <clalance@redhat.com>
Cc: Rick Lindsley <ricklind@us.ibm.com>
Cc: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-08 11:15:06 -07:00
john stultz
8524070b79 Move timekeeping code to timekeeping.c
Move the timekeeping code out of kernel/timer.c and into
kernel/time/timekeeping.c.  I made no cleanups or other changes in transit.

[akpm@linux-foundation.org: build fix]
Signed-off-by: John Stultz <johnstul@us.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-08 11:15:06 -07:00
Ahmed S. Darwish
f75d222b83 IRQ: check for PERCPU flag only when adding first irqaction
An irqaction structure won't be added to an IRQ descriptor irqaction list if
it doesn't agree with other irqactions on the IRQF_PERCPU flag.  Don't check
for this flag to change IRQ descriptor `status' for every irqaction added to
the list, Doing the check only for the first irqaction added is enough.

Signed-off-by: Ahmed S. Darwish <darwish.07@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-08 11:15:06 -07:00
Venki Pallipadi
6e453a6751 Add support for deferrable timers
Introduce a new flag for timers - deferrable: Timers that work normally
when system is busy.  But, will not cause CPU to come out of idle (just to
service this timer), when CPU is idle.  Instead, this timer will be
serviced when CPU eventually wakes up with a subsequent non-deferrable
timer.

The main advantage of this is to avoid unnecessary timer interrupts when
CPU is idle.  If the routine currently called by a timer can wait until
next event without any issues, this new timer can be used to setup timer
event for that routine.  This, with dynticks, allows CPUs to be lazy,
allowing them to stay in idle for extended period of time by reducing
unnecesary wakeup and thereby reducing the power consumption.

This patch:

Builds this new timer on top of existing timer infrastructure.  It uses
last bit in 'base' pointer of timer_list structure to store this deferrable
timer flag.  __next_timer_interrupt() function skips over these deferrable
timers when CPU looks for next timer event for which it has to wake up.

This is exported by a new interface init_timer_deferrable() that can be
called in place of regular init_timer().

[akpm@linux-foundation.org: Privatise a #define]
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Dave Jones <davej@codemonkey.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-08 11:15:05 -07:00
Dmitry Adamushko
d2d9433a4c kernel/irq/proc.c: unprotected iteration over the IRQ action list in name_unique()
setup_irq() releases a desc->lock before calling register_handler_proc(), so
the iteration over the IRQ action list is not protected.

(akpm: the check itself is still racy, but at least it probably won't oops
now).

Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-08 11:15:05 -07:00
Srivatsa Vaddagiri
dd9037a26a Fix race between attach_task and cpuset_exit
Currently cpuset_exit() changes the exiting task's ->cpuset pointer w/o
taking task_lock().  This can lead to ugly races between attach_task and
cpuset_exit.  Details of the races are described at
http://lkml.org/lkml/2007/3/24/132.

Patch below closes those races.

Signed-off-by: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-08 11:15:05 -07:00
Christoph Hellwig
1eeb66a1bb move die notifier handling to common code
This patch moves the die notifier handling to common code.  Previous
various architectures had exactly the same code for it.  Note that the new
code is compiled unconditionally, this should be understood as an appel to
the other architecture maintainer to implement support for it aswell (aka
sprinkling a notify_die or two in the proper place)

arm had a notifiy_die that did something totally different, I renamed it to
arm_notify_die as part of the patch and made it static to the file it's
declared and used at.  avr32 used to pass slightly less information through
this interface and I brought it into line with the other architectures.

[akpm@linux-foundation.org: build fix]
[akpm@linux-foundation.org: fix vmalloc_sync_all bustage]
[bryan.wu@analog.com: fix vmalloc_sync_all in nommu]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: <linux-arch@vger.kernel.org>
Cc: Russell King <rmk@arm.linux.org.uk>
Signed-off-by: Bryan Wu <bryan.wu@analog.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-08 11:15:04 -07:00
Randy Dunlap
e386979299 kprobes: fix sparse NULL warning
Fix sparse NULL warnings:
kernel/kprobes.c:915:49: warning: Using plain integer as NULL pointer

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Acked-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-08 11:15:04 -07:00
Gerd Hoffmann
69331af79c Fixes and cleanups for earlyprintk aka boot console
The console subsystem already has an idea of a boot console, using the
CON_BOOT flag.  The implementation has some flaws though.  The major
problem is that presence of a boot console makes register_console() ignore
any other console devices (unless explicitly specified on the kernel
command line).

This patch fixes the console selection code to *not* consider a boot
console a full-featured one, so the first non-boot console registering will
become the default console instead.  This way the unregister call for the
boot console in the register_console() function actually triggers and the
handover from the boot console to the real console device works smoothly.
Added a printk for the handover, so you know which console device the
output goes to when the boot console stops printing messages.

The disable_early_printk() call is obsolete with that patch, explicitly
disabling the early console isn't needed any more as it works automagically
with that patch.

I've walked through the tree, dropped all disable_early_printk() instances
found below arch/ and tagged the consoles with CON_BOOT if needed.  The
code is tested on x86, sh (thanks to Paul) and mips (thanks to Ralf).

Changes to last version: Rediffed against -rc3, adapted to mips cleanups by
Ralf, fixed "udbg-immortal" cmd line arg on powerpc.

Signed-off-by: Gerd Hoffmann <kraxel@exsuse.de>
Acked-by: Paul Mundt <lethal@linux-sh.org>
Acked-by: Ralf Baechle <ralf@linux-mips.org>
Cc: Andi Kleen <ak@suse.de>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-08 11:15:04 -07:00
Nick Piggin
72c1bbf308 futex: restartable futex_wait
LTP test sigaction_16_24 fails, because it expects sem_wait to be restarted
if SA_RESTART is set.  sem_wait is implemented with futex_wait, that
currently doesn't support being restarted.  Ulrich confirms that the call
should be restartable.

Implement a restart_block method to handle the relative timeout, and allow
restarts.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Ulrich Drepper <drepper@redhat.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Roland McGrath <roland@redhat.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-08 11:15:03 -07:00
Rusty Russell
9adef58b1d futex: get_futex_key, get_key_refs and drop_key_refs
lguest uses the convenient futex infrastructure for inter-domain I/O, so
expose get_futex_key, get_key_refs (renamed get_futex_key_refs) and
drop_key_refs (renamed drop_futex_key_refs).  Also means we need to expose the
union that these use.

No code changes.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-08 11:15:03 -07:00
Kees Cook
5096add84b proc: maps protection
The /proc/pid/ "maps", "smaps", and "numa_maps" files contain sensitive
information about the memory location and usage of processes.  Issues:

- maps should not be world-readable, especially if programs expect any
  kind of ASLR protection from local attackers.
- maps cannot just be 0400 because "-D_FORTIFY_SOURCE=2 -O2" makes glibc
  check the maps when %n is in a *printf call, and a setuid(getuid())
  process wouldn't be able to read its own maps file.  (For reference
  see http://lkml.org/lkml/2006/1/22/150)
- a system-wide toggle is needed to allow prior behavior in the case of
  non-root applications that depend on access to the maps contents.

This change implements a check using "ptrace_may_attach" before allowing
access to read the maps contents.  To control this protection, the new knob
/proc/sys/kernel/maps_protect has been added, with corresponding updates to
the procfs documentation.

[akpm@linux-foundation.org: build fixes]
[akpm@linux-foundation.org: New sysctl numbers are old hat]
Signed-off-by: Kees Cook <kees@outflux.net>
Cc: Arjan van de Ven <arjan@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-08 11:15:02 -07:00
Eric Dumazet
753e9c5cd9 Optimize timespec_trunc()
The first thing done by timespec_trunc() is :

  if (gran <= jiffies_to_usecs(1) * 1000)

This should really be a test against a constant known at compile time.

Alas, it isnt. jiffies_to_usec() was unilined so C compiler emits a function
call and a multiply to compute : a CONSTANT.

mov    $0x1,%edi
mov    %rbx,0xffffffffffffffe8(%rbp)
mov    %r12,0xfffffffffffffff0(%rbp)
mov    %edx,%ebx
mov    %rsi,0xffffffffffffffc8(%rbp)
mov    %rsi,%r12
callq  ffffffff80232010 <jiffies_to_usecs>
imul   $0x3e8,%eax,%eax
cmp    %ebx,%eax

This patch reorders kernel/time.c a bit so that jiffies_to_usecs() is defined
before timespec_trunc() so that compiler now generates :

cmp    $0x3d0900,%edx  (HZ=250 on my machine)

This gives a better code (timespec_trunc() becoming a leaf function), and
shorter kernel size as well.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: john stultz <johnstul@us.ibm.com>
Cc: Roman Zippel <zippel@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-08 11:15:01 -07:00
Josh Triplett
6f8bc500a1 rcutorture: Mark rcu_torture_init as __init
The corresponding rcu_torture_cleanup cannot get marked as __exit, because
rcu_torture_init uses it to clean up if init fails.

Signed-off-by: Josh Triplett <josh@freedesktop.org>
Acked-by: "Paul E. McKenney" <paulmck@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-08 11:15:00 -07:00
Badari Pulavarty
e3222c4ecc Merge sys_clone()/sys_unshare() nsproxy and namespace handling
sys_clone() and sys_unshare() both makes copies of nsproxy and its associated
namespaces.  But they have different code paths.

This patch merges all the nsproxy and its associated namespace copy/clone
handling (as much as possible).  Posted on container list earlier for
feedback.

- Create a new nsproxy and its associated namespaces and pass it back to
  caller to attach it to right process.

- Changed all copy_*_ns() routines to return a new copy of namespace
  instead of attaching it to task->nsproxy.

- Moved the CAP_SYS_ADMIN checks out of copy_*_ns() routines.

- Removed unnessary !ns checks from copy_*_ns() and added BUG_ON()
  just incase.

- Get rid of all individual unshare_*_ns() routines and make use of
  copy_*_ns() instead.

[akpm@osdl.org: cleanups, warning fix]
[clg@fr.ibm.com: remove dup_namespaces() declaration]
[serue@us.ibm.com: fix CONFIG_IPC_NS=n, clone(CLONE_NEWIPC) retval]
[akpm@linux-foundation.org: fix build with CONFIG_SYSVIPC=n]
Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com>
Signed-off-by: Serge Hallyn <serue@us.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: <containers@lists.osdl.org>
Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-08 11:15:00 -07:00
Prarit Bhargava
ee527cd3a2 Use stop_machine_run in the Intel RNG driver
Replace call_smp_function with stop_machine_run in the Intel RNG driver.

CPU A has done read_lock(&lock)
CPU B has done write_lock_irq(&lock) and is waiting for A to release the lock.

A third CPU calls call_smp_function and issues the IPI.  CPU A takes CPU
C's IPI.  CPU B is waiting with interrupts disabled and does not see the
IPI.  CPU C is stuck waiting for CPU B to respond to the IPI.

Deadlock.

The solution is to use stop_machine_run instead of call_smp_function
(call_smp_function should not be called in situations where the CPUs may be
suspended).

[haruo.tomita@toshiba.co.jp: fix a typo in mod_init()]
[haruo.tomita@toshiba.co.jp: fix memory leak]
Signed-off-by: Prarit Bhargava <prarit@redhat.com>
Cc: Jan Beulich <jbeulich@novell.com>
Cc: "Tomita, Haruo" <haruo.tomita@toshiba.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-08 11:15:00 -07:00
Pekka Enberg
6d4f9c5500 module: use krealloc
This converts an open-coded krealloc() to use the shiny new API.

Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-08 11:15:00 -07:00
Oleg Nesterov
02fb6149f7 softlockup: s/99/MAX_RT_PRIO/
Don't use hardcoded 99 value, use MAX_RT_PRIO.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-08 11:14:59 -07:00
Oleg Nesterov
1065d130dd freezer: task->exit_state should be treated as bolean
Except for BUG_ON() checks, we should not use EXIT_XXXX defines outside of
exit/wait paths.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-08 11:14:58 -07:00
Christoph Hellwig
ab1b6f03a1 simplify the stacktrace code
Simplify the stacktrace code:

 - remove the unused task argument to save_stack_trace, it's always
   current
 - remove the all_contexts flag, it's alwasy 0

Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Andi Kleen <ak@suse.de>
Cc: Akinobu Mita <akinobu.mita@gmail.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-08 11:14:58 -07:00
Paul Mackerras
02bbc0f09c Merge branch 'linux-2.6' 2007-05-08 13:37:51 +10:00
Rafael J. Wysocki
56f99bcb52 swsusp: free more memory
Move the definition of PAGES_FOR_IO to kernel/power/power.h and introduce
SPARE_PAGES representing the number of pages that should be freed by the
swsusp's memory shrinker in addition to PAGES_FOR_IO so that device drivers
can allocate some memory (up to 1 MB total) in their .suspend() routines
without causing the suspend to fail.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Cc: Nigel Cunningham <nigel@nigel.suspend2.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-07 12:12:59 -07:00
Rafael J. Wysocki
9b95e43763 swsusp: fix snapshot_release
Remove the leftover enable_nonboot_cpus() from snapshot_release().

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Nigel Cunningham <nigel@nigel.suspend2.net>
Cc: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-07 12:12:59 -07:00
David Brownell
a7ee2e5f5b kconfig: mention 'hibernation' not just swsusp
Clarify that "software suspend" is what's called "hibernation" in most user
interfaces, shrinking a terminology gap.  (Examples include Gnome and
MS-Windows.)

Also provide a more succinct description of what it does, so you won't have
to read the whole novel in Kconfig; and highlights just why the lack of
BIOS requirements for swsusp are a big deal.

Signed-off-by: David Brownell <dbrownell@users.sourceforge.net>
Acked-by: "Rafael J. Wysocki" <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-07 12:12:59 -07:00
Johannes Berg
f0ced9b229 power management: change /sys/power/disk display
Change /sys/power/disk to display all valid modes as well as the currently
selected one in a fashion known from the LED subsystem.

This changes userspace API, but it is apparently not used much (we asked
some userspace developers)

Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Acked-by: "Rafael J. Wysocki" <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-07 12:12:59 -07:00
Johannes Berg
ab3bfca7ab remove software_suspend()
Remove software_suspend() and all its users since
pm_suspend(PM_SUSPEND_DISK) should be equivalent and there's no point in
having two interfaces for the same thing.

The patch also changes the valid_state function to return 0 (false) for
PM_SUSPEND_DISK when SOFTWARE_SUSPEND is not configured instead of
accepting it and having the whole thing fail later.

Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Acked-by: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-07 12:12:59 -07:00
Rafael J. Wysocki
d1d241cc2c swsusp: use rbtree for tracking allocated swap
Make swsusp use extents instead of a bitmap to trace swap pages allocated
for saving the image (the tracking is only needed in case there's an error,
so that the allocated swap pages can be released).

This should allow us to reduce the memory usage, practically always, and
improve performance.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Nigel Cunningham <nigel@nigel.suspend2.net>
Cc: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-07 12:12:59 -07:00
Rafael J. Wysocki
0709db6072 swsusp: use GFP_KERNEL for creating basic data structures
Make swsusp call create_basic_memory_bitmaps() before processes are frozen, so
that GFP_KERNEL allocations can be made in it.  Additionally, ensure that the
swsusp's userland interface won't be used while either pm_suspend_disk() or
software_resume() is being executed.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-07 12:12:59 -07:00
Rafael J. Wysocki
1525a2ad76 swsusp: fix error paths in snapshot_open
We forget to increase device_available if there's an error in snapshot_open(),
so the snapshot device cannot be open at all after snapshot_open() has
returned an error.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-07 12:12:59 -07:00
Rafael J. Wysocki
74dfd666de swsusp: do not use page flags
Make swsusp use memory bitmaps instead of page flags for marking 'nosave' and
free pages.  This allows us to 'recycle' two page flags that can be used for
other purposes.  Also, the memory needed to store the bitmaps is allocated
when necessary (ie.  before the suspend) and freed after the resume which is
more reasonable.

The patch is designed to minimize the amount of changes and there are some
nice simplifications and optimizations possible on top of it.  I am going to
implement them separately in the future.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-07 12:12:59 -07:00
Rafael J. Wysocki
7be9823491 swsusp: use inline functions for changing page flags
Replace direct invocations of SetPageNosave(), SetPageNosaveFree() etc.  with
calls to inline functions that can be changed in subsequent patches without
modifying the code calling them.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-07 12:12:58 -07:00
Oleg Nesterov
433ecb4ab3 fix refrigerator() vs thaw_process() race
refrigerator() can miss a wakeup, "wait event" loop needs a proper memory
ordering.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Acked-by: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-07 12:12:58 -07:00
Roland McGrath
7324328446 Return EPERM not ECHILD on security_task_wait failure
wait* syscalls return -ECHILD even when an individual PID of a live child
was requested explicitly, when security_task_wait denies the operation.
This means that something like a broken SELinux policy can produce an
unexpected failure that looks just like a bug with wait or ptrace or
something.

This patch makes do_wait return -EACCES (or other appropriate error returned
from security_task_wait() instead of -ECHILD if some children were ruled out
solely because security_task_wait failed.

[jmorris@namei.org: switch error code to EACCES]
Signed-off-by: Roland McGrath <roland@redhat.com>
Acked-by: Stephen Smalley <sds@tycho.nsa.gov>
Cc: Chris Wright <chrisw@sous-sol.org>
Cc: James Morris <jmorris@namei.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-07 12:12:57 -07:00
Christoph Lameter
50953fe9e0 slab allocators: Remove SLAB_DEBUG_INITIAL flag
I have never seen a use of SLAB_DEBUG_INITIAL.  It is only supported by
SLAB.

I think its purpose was to have a callback after an object has been freed
to verify that the state is the constructor state again?  The callback is
performed before each freeing of an object.

I would think that it is much easier to check the object state manually
before the free.  That also places the check near the code object
manipulation of the object.

Also the SLAB_DEBUG_INITIAL callback is only performed if the kernel was
compiled with SLAB debugging on.  If there would be code in a constructor
handling SLAB_DEBUG_INITIAL then it would have to be conditional on
SLAB_DEBUG otherwise it would just be dead code.  But there is no such code
in the kernel.  I think SLUB_DEBUG_INITIAL is too problematic to make real
use of, difficult to understand and there are easier ways to accomplish the
same effect (i.e.  add debug code before kfree).

There is a related flag SLAB_CTOR_VERIFY that is frequently checked to be
clear in fs inode caches.  Remove the pointless checks (they would even be
pointless without removeal of SLAB_DEBUG_INITIAL) from the fs constructors.

This is the last slab flag that SLUB did not support.  Remove the check for
unimplemented flags from SLUB.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-07 12:12:57 -07:00
Christoph Lameter
0a31bd5f2b KMEM_CACHE(): simplify slab cache creation
This patch provides a new macro

KMEM_CACHE(<struct>, <flags>)

to simplify slab creation. KMEM_CACHE creates a slab with the name of the
struct, with the size of the struct and with the alignment of the struct.
Additional slab flags may be specified if necessary.

Example

struct test_slab {
	int a,b,c;
	struct list_head;
} __cacheline_aligned_in_smp;

test_slab_cache = KMEM_CACHE(test_slab, SLAB_PANIC)

will create a new slab named "test_slab" of the size sizeof(struct
test_slab) and aligned to the alignment of test slab.  If it fails then we
panic.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-07 12:12:55 -07:00
David Rientjes
c596d9f320 cpusets: allow TIF_MEMDIE threads to allocate anywhere
OOM killed tasks have access to memory reserves as specified by the
TIF_MEMDIE flag in the hopes that it will quickly exit.  If such a task has
memory allocations constrained by cpusets, we may encounter a deadlock if a
blocking task cannot exit because it cannot allocate the necessary memory.

We allow tasks that have the TIF_MEMDIE flag to allocate memory anywhere,
including outside its cpuset restriction, so that it can quickly die
regardless of whether it is __GFP_HARDWALL.

Cc: Andi Kleen <ak@suse.de>
Cc: Paul Jackson <pj@sgi.com>
Cc: Christoph Lameter <clameter@engr.sgi.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-07 12:12:53 -07:00
Christoph Lameter
476f35348e Safer nr_node_ids and nr_node_ids determination and initial values
The nr_cpu_ids value is currently only calculated in smp_init.  However, it
may be needed before (SLUB needs it on kmem_cache_init!) and other kernel
components may also want to allocate dynamically sized per cpu array before
smp_init.  So move the determination of possible cpus into sched_init()
where we already loop over all possible cpus early in boot.

Also initialize both nr_node_ids and nr_cpu_ids with the highest value they
could take.  If we have accidental users before these values are determined
then the current valud of 0 may cause too small per cpu and per node arrays
to be allocated.  If it is set to the maximum possible then we only waste
some memory for early boot users.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-07 12:12:51 -07:00
Johannes Berg
543b9fd352 [POWERPC] powermac: Suspend to disk on G5
Powermac G5 suspend to disk implementation.  The code is platform
agnostic but only tested on powermac, no other 64-bit powerpc
machines.

Because nvidiafb still breaks suspend I have marked it EXPERIMENTAL on
powermac and because I can't test it and some lowlevel code will need
changes it is BROKEN on all other 64-bit platforms.

Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
2007-05-07 20:31:14 +10:00
Linus Torvalds
ea62ccd00f Merge branch 'for-linus' of git://one.firstfloor.org/home/andi/git/linux-2.6
* 'for-linus' of git://one.firstfloor.org/home/andi/git/linux-2.6: (231 commits)
  [PATCH] i386: Don't delete cpu_devs data to identify different x86 types in late_initcall
  [PATCH] i386: type may be unused
  [PATCH] i386: Some additional chipset register values validation.
  [PATCH] i386: Add missing !X86_PAE dependincy to the 2G/2G split.
  [PATCH] x86-64: Don't exclude asm-offsets.c in Documentation/dontdiff
  [PATCH] i386: avoid redundant preempt_disable in __unlazy_fpu
  [PATCH] i386: white space fixes in i387.h
  [PATCH] i386: Drop noisy e820 debugging printks
  [PATCH] x86-64: Fix allnoconfig error in genapic_flat.c
  [PATCH] x86-64: Shut up warnings for vfat compat ioctls on other file systems
  [PATCH] x86-64: Share identical video.S between i386 and x86-64
  [PATCH] x86-64: Remove CONFIG_REORDER
  [PATCH] x86-64: Print type and size correctly for unknown compat ioctls
  [PATCH] i386: Remove copy_*_user BUG_ONs for (size < 0)
  [PATCH] i386: Little cleanups in smpboot.c
  [PATCH] x86-64: Don't enable NUMA for a single node in K8 NUMA scanning
  [PATCH] x86: Use RDTSCP for synchronous get_cycles if possible
  [PATCH] i386: Add X86_FEATURE_RDTSCP
  [PATCH] i386: Implement X86_FEATURE_SYNC_RDTSC on i386
  [PATCH] i386: Implement alternative_io for i386
  ...

Fix up trivial conflict in include/linux/highmem.h manually.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-05 14:55:20 -07:00
Linus Torvalds
5b33991576 Merge master.kernel.org:/pub/scm/linux/kernel/git/gregkh/driver-2.6
* master.kernel.org:/pub/scm/linux/kernel/git/gregkh/driver-2.6:
  remove "struct subsystem" as it is no longer needed
  sysfs: printk format warning
  DOC: Fix wrong identifier name in Documentation/driver-model/devres.txt
  platform: reorder platform_device_del
  Driver core: fix show_uevent from taking up way too much stack
2007-05-04 18:04:48 -07:00
Michael Ellerman
7fe3730de7 MSI: arch must connect the irq and the msi_desc
set_irq_msi() currently connects an irq_desc to an msi_desc. The archs call
it at some point in their setup routine, and then the generic code sets up the
reverse mapping from the msi_desc back to the irq.

set_irq_msi() should do both connections, making it the one and only call
required to connect an irq with it's MSI desc and vice versa.

The arch code MUST call set_irq_msi(), and it must do so only once it's sure
it's not going to fail the irq allocation.

Given that there's no need for the arch to return the irq anymore, the return
value from the arch setup routine just becomes 0 for success and anything else
for failure.

Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2007-05-02 19:02:38 -07:00
Greg Kroah-Hartman
823bccfc40 remove "struct subsystem" as it is no longer needed
We need to work on cleaning up the relationship between kobjects, ksets and
ktypes.  The removal of 'struct subsystem' is the first step of this,
especially as it is not really needed at all.

Thanks to Kay for fixing the bugs in this patch.

Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2007-05-02 18:57:59 -07:00
Jeremy Fitzhardinge
d6dd61c831 [PATCH] x86: PARAVIRT: add hooks to intercept mm creation and destruction
Add hooks to allow a paravirt implementation to track the lifetime of
an mm.  Paravirtualization requires three hooks, but only two are
needed in common code.  They are:

arch_dup_mmap, which is called when a new mmap is created at fork

arch_exit_mmap, which is called when the last process reference to an
  mm is dropped, which typically happens on exit and exec.

The third hook is activate_mm, which is called from the arch-specific
activate_mm() macro/function, and so doesn't need stub versions for
other architectures.  It's called when an mm is first used.

Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: linux-arch@vger.kernel.org
Cc: James Bottomley <James.Bottomley@SteelEye.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
2007-05-02 19:27:14 +02:00
Jeremy Fitzhardinge
b6e3590f81 [PATCH] x86: Allow percpu variables to be page-aligned
Let's allow page-alignment in general for per-cpu data (wanted by Xen, and
Ingo suggested KVM as well).

Because larger alignments can use more room, we increase the max per-cpu
memory to 64k rather than 32k: it's getting a little tight.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2007-05-02 19:27:12 +02:00
Jeremy Fitzhardinge
b00742d399 [PATCH] x86-64: Account for module percpu space separately from kernel percpu
Rather than using a single constant PERCPU_ENOUGH_ROOM, compute it as
the sum of kernel_percpu + PERCPU_MODULE_RESERVE.  This is now common
to all architectures; if an architecture wants to set
PERCPU_ENOUGH_ROOM to something special, then it may do so (ia64 is
the only one which does).

Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Andi Kleen <ak@suse.de>
2007-05-02 19:27:11 +02:00
Vivek Goyal
1b29c1643c [PATCH] x86-64: do not use virt_to_page on kernel data address
o virt_to_page() call should be used on kernel linear addresses and not
  on kernel text and data addresses. Swsusp code uses it on kernel data
  (statically allocated swsusp_header).

o Allocate swsusp_header dynamically so that virt_to_page() can be used
  safely.

o I am changing this because in next few patches, __pa() on x86_64 will
  no longer support kernel text and data addresses and hibernation breaks.

Signed-off-by: Vivek Goyal <vgoyal@in.ibm.com>
Signed-off-by: Andi Kleen <ak@suse.de>
2007-05-02 19:27:07 +02:00
Vivek Goyal
49c3df6aaa [PATCH] x86: Move swsusp __pa() dependent code to arch portion
o __pa() should be used only on kernel linearly mapped virtual addresses
  and not on kernel text and data addresses.

o Hibernation code needs to determine the physical address associated
  with kernel symbol to mark a section boundary which contains pages which
  don't have to be saved and restored during hibernate/resume operation.

o Move this piece of code in arch dependent section. So that architectures
  which don't have kernel text/data mapped into kernel linearly mapped
  region can come up with their own ways of determining physical addresses
  associated with a kernel text.

Signed-off-by: Vivek Goyal <vgoyal@in.ibm.com>
Signed-off-by: Andi Kleen <ak@suse.de>
2007-05-02 19:27:07 +02:00
Johannes Berg
9684e51cd1 power management: force pm_ops.valid callback to be assigned
This patch changes the docs and behaviour from "all states valid" to "no
states valid" if no .valid callback is assigned.  Users of pm_ops that only
need mem sleep can assign pm_valid_only_mem without any overhead, others
will require more elaborate callbacks.

Now that all users of pm_ops have a .valid callback this is a safe thing to
do and prevents things from getting messy again as they were before.

Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Acked-by: Pavel Machek <pavel@ucw.cz>
Looks-okay-to: Rafael J. Wysocki <rjw@sisk.pl>
Cc: <linux-pm@lists.linux-foundation.org>
Cc: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-04-30 16:40:40 -07:00
Johannes Berg
e8c9c50269 power management: implement pm_ops.valid for everybody
Almost all users of pm_ops only support mem sleep, don't check in .valid and
don't reject any others in .prepare so users can be confused if they check
/sys/power/state, especially when new states are added (these would then
result in s-t-r although they're supposed to be something different).

This patch implements a generic pm_valid_only_mem function that is then
exported for users and puts it to use in almost all existing pm_ops.

Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Cc: David Brownell <david-b@pacbell.net>
Acked-by: Pavel Machek <pavel@ucw.cz>
Cc: linux-pm@lists.linux-foundation.org
Cc: Len Brown <lenb@kernel.org>
Acked-by: Russell King <rmk@arm.linux.org.uk>
Cc: Greg KH <greg@kroah.com>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Paul Mundt <lethal@linux-sh.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-04-30 16:40:40 -07:00
Johannes Berg
11d77d0c01 power management: remove firmware disk mode
This patch removes the firmware disk suspend mode which is the wrong approach,
it is supposed to be used for implementing firmware-based disk suspend but
cannot actually be used for that.

Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Acked-by: Pavel Machek <pavel@ucw.cz>
Cc: <linux-pm@lists.linux-foundation.org>
Cc: David Brownell <david-b@pacbell.net>
Cc: Len Brown <lenb@kernel.org>
Acked-by: Russell King <rmk@arm.linux.org.uk>
Cc: Greg KH <greg@kroah.com>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Paul Mundt <lethal@linux-sh.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-04-30 16:40:40 -07:00
Johannes Berg
fe0c935a6c rework pm_ops pm_disk_mode, kill misuse
This patch series cleans up some misconceptions about pm_ops.  Some users of
the pm_ops structure attempt to use it to stop the user from entering suspend
to disk, this, however, is not possible since the user can always use
"shutdown" in /sys/power/disk and then the pm_ops are never invoked.  Also,
platforms that don't support suspend to disk simply should not allow
configuring SOFTWARE_SUSPEND (read the help text on it, it only selects
suspend to disk and nothing else, all the other stuff depends on PM).

The pm_ops structure is actually intended to provide a way to enter
platform-defined sleep states (currently supported states are "standby" and
"mem" (suspend to ram)) and additionally (if SOFTWARE_SUSPEND is configured)
allows a platform to support a platform specific way to enter low-power mode
once everything has been saved to disk.  This is currently only used by ACPI
(S4).

This patch:

The pm_ops.pm_disk_mode is used in totally bogus ways since nobody really
seems to understand what it actually does.

This patch clarifies the pm_disk_mode description.

It also removes all the arm and sh users that think they can veto suspend to
disk via pm_ops; not so since the user can always do echo shutdown >
/sys/power/disk, they need to find a better way involving Kconfig or such.

ACPI is the only user left with a non-zero pm_disk_mode.

The patch also sets the default mode to shutdown again, but when a new pm_ops
is registered its pm_disk_mode is selected as default, that way the default
stays for ACPI where it is apparently required.

Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Cc: David Brownell <david-b@pacbell.net>
Acked-by: Pavel Machek <pavel@ucw.cz>
Cc: <linux-pm@lists.linux-foundation.org>
Cc: Len Brown <lenb@kernel.org>
Acked-by: Russell King <rmk@arm.linux.org.uk>
Cc: Greg KH <greg@kroah.com>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Acked-by: Paul Mundt <lethal@linux-sh.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-04-30 16:40:40 -07:00
Robert Peterson
42e380832a Extend print_symbol capability
Today's print_symbol function dumps a kernel symbol with printk.  This
patch extends the functionality of kallsyms.c so that the symbol lookup
function may be used without the printk.  This is useful for modules that
want to dump symbols elsewhere, for example, to debugfs.  I intend to use
the new function call in the GFS2 file system (which will be a separate
patch).

[akpm@linux-foundation.org: build fix]
[clameter@sgi.com: sprint_symbol should return length of string like sprintf]
Signed-off-by: Robert Peterson <rpeterso@redhat.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Roman Zippel <zippel@linux-m68k.org>
Cc: "Randy.Dunlap" <rdunlap@xenotime.net>
Cc: Sam Ravnborg <sam@ravnborg.org>
Acked-by: Paulo Marques <pmarques@grupopie.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-04-30 16:40:39 -07:00
Jeff Garzik
8cdfb29c0c libata/IDE: remove combined mode quirk
Both old-IDE and libata should be able handle all controllers and
devices found using normal resource reservation methods.

This eliminates the awful, low-performing split-driver configuration
where old-IDE drove the PATA portion of a PCI device, in PIO-only mode,
and libata drove the SATA portion of the /same/ PCI device, in DMA mode.
Typically vendors would ship SATA hard drive / PATA optical
configuration, which would lend itself to slow (PIO-only) CD-ROM
performance.

For Intel users running in combined mode, it is now wholly dependent on
your driver choice (potentially link order, if you compile both drivers
in) whether old-IDE or libata will drive your hardware.

In either case, you will get full performance from both SATA and PATA
ports now, without having to pass a kernel command line parameter.

Signed-off-by: Jeff Garzik <jeff@garzik.org>
2007-04-28 14:15:59 -04:00
David Howells
b8b8fd2dc2 [NET]: Fix networking compilation errors
Fix miscellaneous networking compilation errors.

 (*) Export ktime_add_ns() for modules.

 (*) wext_proc_init() should have an ANSI declaration.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-27 15:31:24 -07:00
Linus Torvalds
d868772fff Merge master.kernel.org:/pub/scm/linux/kernel/git/gregkh/driver-2.6
* master.kernel.org:/pub/scm/linux/kernel/git/gregkh/driver-2.6: (46 commits)
  dev_dbg: check dev_dbg() arguments
  drivers/base/attribute_container.c: use mutex instead of binary semaphore
  mod_sysfs_setup() doesn't return errno when kobject_add_dir() failure occurs
  s2ram: add arch irq disable/enable hooks
  define platform wakeup hook, use in pci_enable_wake()
  security: prevent permission checking of file removal via sysfs_remove_group()
  device_schedule_callback() needs a module reference
  s390: cio: Delay uevents for subchannels
  sysfs: bin.c printk fix
  Driver core: use mutex instead of semaphore in DMA pool handler
  driver core: bus_add_driver should return an error if no bus
  debugfs: Add debugfs_create_u64()
  the overdue removal of the mount/umount uevents
  kobject: Comment and warning fixes to kobject.c
  Driver core: warn when userspace writes to the uevent file in a non-supported way
  Driver core: make uevent-environment available in uevent-file
  kobject core: remove rwsem from struct subsystem
  qeth: Remove usage of subsys.rwsem
  PHY: remove rwsem use from phy core
  IEEE1394: remove rwsem use from ieee1394 core
  ...
2007-04-27 12:58:54 -07:00
Akinobu Mita
240936e18b mod_sysfs_setup() doesn't return errno when kobject_add_dir() failure occurs
mod_sysfs_setup() doesn't return an errno when kobject_add_dir() for module
"holders" directory fails.  So caller of mod_sysfs_setup() will keep going
and get oops.

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2007-04-27 10:57:34 -07:00
Johannes Berg
a53c46dc82 s2ram: add arch irq disable/enable hooks
After some more discussion this patch replaces it:

From: Johannes Berg <johannes@sipsolutions.net>
Subject: suspend: add arch irq disable/enable hooks

For powermac, we need to do some things between suspending devices and
device_power_off, for example setting the decrementer. This patch
allows architectures to define arch_s2ram_{en,dis}able_irqs in their
asm/suspend.h to have control over this step.

Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Acked-by: Pavel Machek <pavel@ucw.cz>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2007-04-27 10:57:33 -07:00
Ingo Molnar
39bc89fd40 make SysRq-T show all tasks again
show_state() (SysRq-T) developed the buggy habbit of not showing
TASK_RUNNING tasks.  This was due to the mistaken belief that state_filter
== -1 would be a pass-through filter - while in reality it did not let
TASK_RUNNING == 0 p->state values through.

Fix this by restoring the original '!state_filter means all tasks'
special-case i had in the original version.  Test-built and test-booted on
i686, SysRq-T now works as intended.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-04-27 10:46:51 -07:00
David Howells
e19dff1fdd [AF_RXRPC]: Make it possible to merely try to cancel timers from a module
Export try_to_del_timer_sync() for use by the AF_RXRPC module.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-26 15:46:56 -07:00
Patrick McHardy
af65bdfce9 [NETLINK]: Switch cb_lock spinlock to mutex and allow to override it
Switch cb_lock to mutex and allow netlink kernel users to override it
with a subsystem specific mutex for consistent locking in dump callbacks.
All netlink_dump_start users have been audited not to rely on any
side-effects of the previously used spinlock.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:29:03 -07:00
Stephen Hemminger
85795d64ed [TCP] tcp_probe: improvements for net-2.6.22
Change tcp_probe to use ktime (needed to add one export).
Add option to only get events when cwnd changes - from Doug Leith

Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:28:10 -07:00
Arnaldo Carvalho de Melo
b529ccf279 [NETLINK]: Introduce nlmsg_hdr() helper
For the common "(struct nlmsghdr *)skb->data" sequence, so that we reduce the
number of direct accesses to skb->data and for consistency with all the other
cast skb member helpers.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:26:34 -07:00
Arnaldo Carvalho de Melo
27a884dc3c [SK_BUFF]: Convert skb->tail to sk_buff_data_t
So that it is also an offset from skb->head, reduces its size from 8 to 4 bytes
on 64bit architectures, allowing us to combine the 4 bytes hole left by the
layer headers conversion, reducing struct sk_buff size to 256 bytes, i.e. 4
64byte cachelines, and since the sk_buff slab cache is SLAB_HWCACHE_ALIGN...
:-)

Many calculations that previously required that skb->{transport,network,
mac}_header be first converted to a pointer now can be done directly, being
meaningful as offsets or pointers.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:26:28 -07:00
Patrick McHardy
641b9e0e8b [NET_SCHED]: Use ktime as clocksource
Get rid of the manual clock source selection mess and use ktime. Also
use a scalar representation, which allows to clean up pkt_sched.h a bit
more and results in less ktime_to_ns() calls in most cases.

The PSCHED_US2JIFFIE/PSCHED_JIFFIE2US macros are implemented quite
inefficient by this patch, following patches will convert all qdiscs
to hrtimers and get rid of them entirely.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:26:04 -07:00
Eric Dumazet
b7aa0bf70c [NET]: convert network timestamps to ktime_t
We currently use a special structure (struct skb_timeval) and plain
'struct timeval' to store packet timestamps in sk_buffs and struct
sock.

This has some drawbacks :
- Fixed resolution of micro second.
- Waste of space on 64bit platforms where sizeof(struct timeval)=16

I suggest using ktime_t that is a nice abstraction of high resolution
time services, currently capable of nanosecond resolution.

As sizeof(ktime_t) is 8 bytes, using ktime_t in 'struct sock' permits
a 8 byte shrink of this structure on 64bit architectures. Some other
structures also benefit from this size reduction (struct ipq in
ipv4/ip_fragment.c, struct frag_queue in ipv6/reassembly.c, ...)

Once this ktime infrastructure adopted, we can more easily provide
nanosecond resolution on top of it. (ioctl SIOCGSTAMPNS and/or
SO_TIMESTAMPNS/SCM_TIMESTAMPNS)

Note : this patch includes a bug correction in
compat_sock_get_timestamp() where a "err = 0;" was missing (so this
syscall returned -ENOENT instead of 0)

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
CC: Stephen Hemminger <shemminger@linux-foundation.org>
CC: John find <linux.kernel@free.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:23:34 -07:00
Bastian Blank
91fcd412e9 Allow reading tainted flag as user
The commit 34f5a39899 restricted reading
of the tainted value. The attached patch changes this back to a
write-only check and restores the read behaviour of older versions.

Signed-off-by: Bastian Blank <bastian@waldi.eu.org>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-04-24 08:23:08 -07:00
Randy Dunlap
fe20e581a7 [PATCH] fix kernel oops with badly formatted module option
Catch malformed kernel parameter usage of "param = value".  Spaces are not
supported, but don't cause a kernel fault on such usage, just report an
error.

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Acked-by: Larry Finger <Larry.Finger@lwfinger.net>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-04-12 15:31:41 -07:00
Linus Torvalds
d354d2f4a6 sched.c: Remove unused variable 'relative'
Getting rid of the p->children printout in show_task() left behind an
unused variable.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-04-07 10:18:33 -07:00
Ingo Molnar
35f6f753b7 [PATCH] sched: get rid of p->children use in show_task()
the p->parent PID printout gives us all the information about the
task tree that we need - the eldest_child()/older_sibling()/
younger_sibling() printouts are mostly historic and i do not
remember ever having used those fields. (IMO in fact they confuse
the SysRq-T output.) So remove them.

This code has sentimental value though, those fields and
printouts are one of the oldest ones still surviving from
Linux v0.95's kernel/sched.c:

        if (p->p_ysptr || p->p_osptr)
                printk("   Younger sib=%d, older sib=%d\n\r",
                        p->p_ysptr ? p->p_ysptr->pid : -1,
                        p->p_osptr ? p->p_osptr->pid : -1);
        else
                printk("\n\r");

written 15 years ago, in early 1992.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus 'snif' Torvalds <torvalds@linux-foundation.org>
2007-04-07 10:06:51 -07:00
Tejun Heo
7f30e49ee1 [PATCH] irq-devres: fix failure path of devm_request_irq()
devres should be deallocated with devres_free() not kfree().  This bug
corrupts slab on IRQ request failure.  Fix it.

Signed-off-by: Tejun Heo <htejun@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Greg KH <gregkh@suse.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-04-07 10:05:21 -07:00
Ingo Molnar
995f054f2a [PATCH] high-res timers: resume fix
Soeren Sonnenburg reported that upon resume he is getting
this backtrace:

 [<c0119637>] smp_apic_timer_interrupt+0x57/0x90
 [<c0142d30>] retrigger_next_event+0x0/0xb0
 [<c0104d30>] apic_timer_interrupt+0x28/0x30
 [<c0142d30>] retrigger_next_event+0x0/0xb0
 [<c0140068>] __kfifo_put+0x8/0x90
 [<c0130fe5>] on_each_cpu+0x35/0x60
 [<c0143538>] clock_was_set+0x18/0x20
 [<c0135cdc>] timekeeping_resume+0x7c/0xa0
 [<c02aabe1>] __sysdev_resume+0x11/0x80
 [<c02ab0c7>] sysdev_resume+0x47/0x80
 [<c02b0b05>] device_power_up+0x5/0x10

it turns out that on resume we mistakenly re-enable interrupts too
early.  Do the timer retrigger only on the current CPU.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Soeren Sonnenburg <kernel@nn7.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-04-07 10:03:43 -07:00
john stultz
98de9e3ba2 [PATCH] fix jiffies clocksource inittime
In debugging a problem w/ the -rt tree, I noticed that on systems that mark
the tsc as unstable before it is registered, the TSC would still be
selected and used for a short period of time.  Digging in it looks to be a
result of the mix of the clocksource list changes and my clocksource
initialization changes.

With the -rt tree, using a bad TSC, even for a short period of time can
results in a hang at boot.  I was not able to reproduce this hang w/
mainline, but I'm not completely certain that someone won't trip on it.

This patch resolves the issue by initializing the jiffies clocksource
earlier so a bad TSC won't get selected just because nothing else is yet
registered.

Signed-off-by: John Stultz <johnstul@us.ibm.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-04-04 21:12:47 -07:00
Rafael J. Wysocki
c75fd0ee6e [PATCH] swsusp: fix memory shrinker
Fix a bug in the swsusp's memory shrinker that causes some systems using
highmem to refuse to suspend to disk if image_size is set above 1/2 of
available RAM.

Special thanks to Jiri Slaby for reporting the problem and assistance in
debugging it.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Jiri Slaby <jirislaby@gmail.com>
Cc: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-04-04 21:12:47 -07:00
Thomas Bittermann
456a09dce9 [PATCH] kernel/time.c: add missing symbol exports
This patch adds 2 missing symbol exports: jiffies_to_timeval() and
timeval_to_jiffies().  The (not yet merged) dm-raid4-5 module will need
them, and they used to be indirectly exported by virtue of being inline
functions.

Commit 8b9365d753 ("[PATCH] Uninline
jiffies.h functions") uninlined them, and thus modules now need them
explicitly exported to use them.

Signed-off-by: Thomas Bittermann <t.bittermann@online.de>
Acked-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: john stultz <johnstul@us.ibm.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-04-04 17:35:53 -07:00
Rafael J. Wysocki
1d64b9cb1d [PATCH] Fix microcode-related suspend problem
Fix the regression resulting from the recent change of suspend code
ordering that causes systems based on Intel x86 CPUs using the microcode
driver to hang during the resume.

The problem occurs since the microcode driver uses request_firmware() in
its CPU hotplug notifier, which is called after tasks has been frozen and
hangs.  It can be fixed by telling the microcode driver to use the
microcode stored in memory during the resume instead of trying to load it
from disk.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Adrian Bunk <bunk@stusta.de>
Cc: Tigran Aivazian <tigran@aivazian.fsnet.co.uk>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Maxim <maximlevitsky@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-04-02 10:06:09 -07:00
Kay Sievers
0c84ce268b [PATCH] driver core: fix built-in drivers sysfs links
built-in drivers had broken sysfs links that caused bootup hangs for
certain driver unregistry sequences.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-04-02 10:06:09 -07:00
Eric W. Biederman
14e9d5730a [PATCH] pid: Properly detect orphaned process groups in exit_notify
In commit 0475ac0845 when converting the
orphaned process group handling to use struct pid I made a small
mistake.  I accidentally replaced an == with a !=.

Besides just being a dumb thing to do apparently this has a bad side
effect.  The improper orphaned process group detection causes kwin to
die after a suspend/resume cycle.

I'm amazed this patch has been around as long as it has without anyone
else noticing something funny going on.

And the following people deserve credit for spotting and helping
to reproduce this.

Thanks to: Sid Boyce <g3vbv@blueyonder.co.uk>
Thanks to: "Michael Wu"

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-03-29 08:16:23 -07:00
Ingo Molnar
935c631db8 [PATCH] hrtimers: fix reprogramming SMP race
hrtimer_start() incorrectly set the 'reprogram' flag to enqueue_hrtimer(),
which should only be 1 if the hrtimer is queued to the current CPU.

Doing otherwise could result in a reprogramming of the current CPU's
clockevents device, with a timer that is not queued to it - resulting in a
bogus next expiry value.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: Michal Piotrowski <michal.k.k.piotrowski@gmail.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-03-28 13:44:31 -07:00
Rafael J. Wysocki
436ce71638 [PATCH] Revert "swsusp: disable nonboot CPUs before entering platform suspend"
This reverts commit 94985134b7 and
insteads removes the WARN_ON() that caused that commit in the first
place.

The problem is that we call disable_nonboot_cpus() in swsusp before
powering down the system in order to avoid triggering the WARN_ON()
in arch/x86_64/kernel/acpi/sleep.c:init_low_mapping() and this doesn't
work well on Thomas' system.

So instead, remove the WARN_ON() in arch/x86_64/kernel/acpi/sleep.c:
init_low_mapping(), which triggers every time during the suspend to disk
in the platform mode, as the potential problem it is related to doesn't
seem to occur in practice.

[ I think we might want to disallow the case of multiple users of that
  mm, or something.  Normally, playing with the current process page
  tables on the current CPU should be fine as long as we don't have
  other threads using those tables at the same time..

  Anyway, not pretty, but better than the warning or the lockup - Linus ]

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-03-27 09:20:03 -07:00
john stultz
d62ac21aa0 [PATCH] ntp: avoid time_offset overflows
I've been seeing some odd NTP behavior recently on a few boxes and
finally narrowed it down to time_offset overflowing when converted to
SHIFT_UPDATE units (which was a side effect from my HZfreeNTP patch).

This patch converts time_offset from a long to a s64 which resolves the
issue.

[tglx@linutronix.de: signedness fixes]
Signed-off-by: John Stultz <johnstul@us.ibm.com>
Cc: Roman Zippel <zippel@linux-m68k.org>
Cc: john stultz <johnstul@us.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-03-27 09:05:15 -07:00
Thomas Gleixner
291bc047e1 [PATCH] clockevents: remove bad designed sysfs support for now
The current sysfs support of clockevents does not obey the "only one
value per file" rule.

The real fix is not 2.6.21 material. Therefor remove the sysfs support
for now.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Greg Kroah-Hartman <gregkh@suse.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-03-26 14:07:23 -07:00
Thomas Gleixner
948ac6d71c [PATCH] clocksource: Fix thinko in watchdog selection
The watchdog implementation excludes low res / non continuous
clocksources from being selected as a watchdog reference
unintentionally.

Allow using jiffies/PIT as a watchdog reference as long as no better
clocksource is available. This is necessary to detect TSC breakage on
systems, which have no pmtimer/hpet.

The main goal of the initial patch (preventing to switch to highres/nohz
when no reliable fallback clocksource is available) is still guaranteed
by the checks in clocksource_watchdog().

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-03-25 14:57:34 -07:00
Thomas Gleixner
9501b6cf55 [PATCH] dynticks: fix hrtimer rounding error in next_timer_interrupt
The rework of next_timer_interrupt() fixed the timer wheel bugs, but
invented a rounding error versus the next hrtimer event. This is caused
by the conversion of the hrtimer internal representation to relative
jiffies.

This causes bug #8100:
http://bugzilla.kernel.org/show_bug.cgi?id=8100

next_timer_interrupt() returns "now" in such a case and causes the code
in tick_nohz_stop_sched_tick() to trigger the timer softirq, which is
bogus as no timer is due for expiry. This results in an endless context
switching between idle and ksoftirqd until a timer is due for expiry.

Modify the hrtimer evaluation so that, it returns now + 1, when the
conversion results in a delta < 1 jiffie.

It's confirmed to resolve bug #8100

Reported-by: Emil Karlson <jkarlson@cc.hut.fi>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-03-25 14:57:34 -07:00
James Morris
0444b3035e [PATCH] time: fix formatting in /proc/timer_list
Fix the print formatting of three unsigned long fields in /proc/timer_list,
which are currently being formatted as signed long.

Signed-off-by: James Morris <jmorris@namei.org>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-03-23 11:01:21 -07:00
Jarek Poplawski
9c35dd7f8b [PATCH] lockdep: debug_show_all_locks & debug_show_held_locks vs. debug_locks
lockdep's data shouldn't be used when debug_locks == 0 because it's not
updated after this, so it's more misleading than helpful.

PS: probably lockdep's current-> fields should be reset after it turns
debug_locks off: so, after printing a bug report, but before return from
exported functions, but there are really a lot of these possibilities (e.g.
 after DEBUG_LOCKS_WARN_ON), so, something could be missed.  (Of course
direct use of this fields isn't recommended either.)

Reported-by: Folkert van Heusden <folkert@vanheusden.com>
Inspired-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Jarek Poplawski <jarkao2@o2.pl>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-03-22 19:39:06 -07:00
Pavel Machek
058560fbd7 [PATCH] fix extra BIOS invocation during resume
It causes extra moon icons blinking on x60, and breaks at least two other
systems.

During resume, we do not know that "reboot"/"shutdown" method was used, so
we assume "plaform" and call BIOS, anyway...

This is 2.6.21 material, and should fix 2 or 3 regressions from 2.6.20.

Signed-off-by: Pavel Machek <pavel@suse.cz>
Acked-by: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-03-22 19:39:06 -07:00
Rafael J. Wysocki
93c9a7ff50 [PATCH] swsusp: Fix SNAPSHOT_S2RAM ioctl
The SNAPSHOT_S2RAM ioctl does not disable the nonboot CPUs before entering
the suspend, although it should do this.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-03-22 19:39:05 -07:00
Thomas Gleixner
cd05a1f818 [PATCH] clockevents: Fix suspend/resume to disk hangs
I finally found a dual core box, which survives suspend/resume without
crashing in the middle of nowhere. Sigh, I never figured out from the
code and the bug reports what's going on.

The observed hangs are caused by a stale state transition of the clock
event devices, which keeps the RCU synchronization away from completion,
when the non boot CPU is brought back up.

The suspend/resume in oneshot mode needs the similar care as the
periodic mode during suspend to RAM. My assumption that the state
transitions during the different shutdown/bringups of s2disk would go
through the periodic boot phase and then switch over to highres resp.
nohz mode were simply wrong.

Add the appropriate suspend / resume handling for the non periodic
modes.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-03-16 19:35:25 -07:00
Zilvinas Valinskas
e29e175b0f [PATCH] initialise pi_lock if CONFIG_RT_MUTEXES=N
Fixes a bogus lockdep warning which causes lockdep to disable itself.

Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-03-16 19:25:06 -07:00
Ingo Molnar
21778867b1 [PATCH] futex: PI state locking fix
Testing of -rt by IBM uncovered a locking bug in wake_futex_pi(): the PI
state needs to be locked before we access it.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Chuck Ebbert <cebbert@redhat.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-03-16 19:25:06 -07:00
Andrew Johnson
b257bc051f [PATCH] swsusp: fix suspend when console is in VT_AUTO+KD_GRAPHICS mode
When the console is in VT_AUTO+KD_GRAPHICS mode, switching to the
SUSPEND_CONSOLE fails, resulting in vt_waitactive() waiting indefinitely or
until the task is interrupted.  This patch tests if a console switch can
occur in set_console() and returns early if a console switch is not
possible.

[akpm@linux-foundation.org: cleanup]
Signed-off-by: Andrew Johnson <ajohnson@intrinsyc.com>
Acked-by: Pavel Machek <pavel@ucw.cz>
Cc: "Antonino A. Daplas" <adaplas@pol.net>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-03-16 19:25:05 -07:00
Thomas Gleixner
ad28d94abb [PATCH] hrtimer: fix up unlocked access to wall_to_monotonic
commit f4304ab215 (HZ free NTP) moved the
access to wall_to_monotonic in hrtimer_get_softirq_time() out of the
xtime_lock protection.

Move it back into the seq_lock section.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: John Stultz <johnstul@us.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-03-16 19:25:05 -07:00
Thomas Gleixner
13788ccc41 [PATCH] hrtimer: prevent overrun DoS in hrtimer_forward()
hrtimer_forward() does not check for the possible overflow of
timer->expires.  This can happen on 64 bit machines with large interval
values and results currently in an endless loop in the softirq because the
expiry value becomes negative and therefor the timer is expired all the
time.

Check for this condition and set the expiry value to the max.  expiry time
in the future.  The fix should be applied to stable kernel series as well.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-03-16 19:25:05 -07:00
Rafael J. Wysocki
94985134b7 [PATCH] swsusp: disable nonboot CPUs before entering platform suspend
Prevent the WARN_ON() in arch/x86_64/kernel/acpi/sleep.c:init_low_mapping()
from triggering by disabling nonboot CPUs before we finally enter the
platform suspend.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-03-16 19:25:03 -07:00
Rafael J. Wysocki
886c595295 [PATCH] swsusp: Fix resume error path in platform mode
If swsusp is using the platform mode during the resume and the image cannot
be read, the platform mode should be switched off before software_resume()
returns.  Make it happen.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-03-16 19:25:03 -07:00
Al Viro
c4823bce03 [PATCH] fix deadlock in audit_log_task_context()
GFP_KERNEL allocations in non-blocking context; fixed by killing
an idiotic use of security_getprocattr().

Acked-by: Stephen Smalley <sds@tycho.nsa.gov>
Acked-by: James Morris <jmorris@namei.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-03-14 15:27:48 -07:00
Greg Kroah-Hartman
161e232b88 Revert "driver core: refcounting fix"
This reverts commit 63ce18cfe6.

It was the incorrect fix and causes a reference counting bug whenever
any driver module is removed from the system. Mike Galbraith
<efault@gmx.de> is looking for the real fix for his problem.

Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2007-03-09 15:25:04 -08:00
Linus Torvalds
d694c16bc3 Merge master.kernel.org:/pub/scm/linux/kernel/git/lethal/sh-2.6
* master.kernel.org:/pub/scm/linux/kernel/git/lethal/sh-2.6:
  sh: Kill off I/O cruft for R7780RP.
  sh: Revert lazy dcache writeback changes.
  sh: Enable SM501 support for RTS7751R2D.
  sh: Use L1_CACHE_BYTES for .data.cacheline_aligned.
  sysctl: Support vdso_enabled sysctl on SH.
  sh: Fix kernel thread stack corruption with preempt.
  doc: Add SH to vdso and earlyprintk in kernel-parameters.txt
  sh: Fix sigmask trampling in signal delivery.
  sh: Clear UBC when not in use.
2007-03-07 10:08:33 -08:00
Rafael J. Wysocki
c7276fde27 [PATCH] kconfig: Update swsusp description
Update the outdated and inaccurate description of the software suspend in
Kconfig.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-03-06 09:30:26 -08:00
Josh Triplett
d6ad67112a [PATCH] Publish rcutorture module parameters via sysfs, read-only
rcutorture's module parameters currently use permissions of 0, so they
don't show up in /sys/module/rcutorture/parameters.  Change the permissions
on all module parameters to world-readable (0444).

rcutorture does all of its initialization and thread startup when loaded
and relies on the parameters not changing during execution, so they should
not permit writing.  However, reading seems fine.

Signed-off-by: Josh Triplett <josh@freedesktop.org>
Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-03-06 09:30:25 -08:00
Daniel Walker
90675a27fa [PATCH] fix vsyscall settimeofday
I've only seen this on x86_64.

The vsyscall state only gets updated when a timer interrupts comes in.  So
if the time is set long before the next timer, there will be a period when
a gettimeofday() won't reflect the correct time.

I added an explicit update_vsyscall() during the settimeofday(), that way
the vsyscall state doesn't get stale.

Signed-off-by: Daniel Walker <dwalker@mvista.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Ingo Molnar <mingo@elte.hu>
Acked-by: John Stultz <johnstul@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-03-06 09:30:25 -08:00
Thomas Gleixner
f8953856eb [PATCH] highres: do not run the TIMER_SOFTIRQ after switching to highres mode
The TIMER_SOFTIRQ runs the hrtimers during bootup until a usable
clocksource and clock event sources are registered.  The switch to high
resolution mode happens inside of the TIMER_SOFTIRQ, but runs the softirq
afterwards.  That way the tick emulation timer, which was set up in the
switch to highres might be executed in the softirq context, which is a BUG.
 The rbtree has not to be touched by the softirq after the highres switch.

This BUG was observed by Andres Salomon, who provided the information to
debug it.

Return early from the softirq, when the switch was sucessful.

[dilinger@debian.org: add debug warning]
[akpm@linux-foundation.org: make debug warning compile]
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andres Salomon <dilinger@debian.org>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andres Salomon <dilinger@debian.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-03-06 09:30:25 -08:00
Thomas Gleixner
6321dd60c7 [PATCH] Save/restore periodic tick information over suspend/resume
The programming of periodic tick devices needs to be saved/restored
across suspend/resume - otherwise we might end up with a system coming
up that relies on getting a PIT (or HPET) interrupt, while those devices
default to 'no interrupts' after powerup. (To confuse things it worked
to a certain degree on some systems because the lapic gets initialized
as a side-effect of SMP bootup.)

This suspend / resume thing was dropped unintentionally during the
last-minute -mm code reshuffling.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-03-06 09:30:24 -08:00
Heiko Carstens
e81ce1f7ec [PATCH] timer/hrtimer: take per cpu locks in sane order
Doing something like this on a two cpu system

  # echo 0 > /sys/devices/system/cpu/cpu0/online
  # echo 1 > /sys/devices/system/cpu/cpu0/online
  # echo 0 > /sys/devices/system/cpu/cpu1/online

will give me this:

  =======================================================
  [ INFO: possible circular locking dependency detected ]
  2.6.21-rc2-g562aa1d4-dirty #7
  -------------------------------------------------------
  bash/1282 is trying to acquire lock:
   (&cpu_base->lock_key){.+..}, at: [<000000000005f17e>] hrtimer_cpu_notify+0xc6/0x240

  but task is already holding lock:
   (&cpu_base->lock_key#2){.+..}, at: [<000000000005f174>] hrtimer_cpu_notify+0xbc/0x240

  which lock already depends on the new lock.

This happens because we have the following code in kernel/hrtimer.c:

  migrate_hrtimers(int cpu)
  [...]
  old_base = &per_cpu(hrtimer_bases, cpu);
  new_base = &get_cpu_var(hrtimer_bases);
  [...]
  spin_lock(&new_base->lock);
  spin_lock(&old_base->lock);

Which means the spinlocks are taken in an order which depends on which cpu
gets shut down from which other cpu. Therefore lockdep complains that there
might be an ABBA deadlock. Since migrate_hrtimers() gets only called on
cpu hotplug it's safe to assume that it isn't executed concurrently on a

The same problem exists in kernel/timer.c: migrate_timers().

As pointed out by Christian Borntraeger one possible solution to avoid
the locking order complaints would be to make sure that the locks are
always taken in the same order. E.g. by taking the lock of the cpu with
the lower number first.

To achieve this we introduce two new spinlock functions double_spin_lock
and double_spin_unlock which lock or unlock two locks in a given order.

Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Roman Zippel <zippel@linux-m68k.org>
Cc: John Stultz <johnstul@us.ibm.com>
Cc: Christian Borntraeger <cborntra@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-03-05 07:57:53 -08:00
john stultz
6bb74df481 [PATCH] clocksource init adjustments (fix bug #7426)
This patch resolves the issue found here:
http://bugme.osdl.org/show_bug.cgi?id=7426

The basic summary is:
Currently we register most of i386/x86_64 clocksources at module_init
time. Then we enable clocksource selection at late_initcall time. This
causes some problems for drivers that use gettimeofday for init
calibration routines (specifically the es1968 driver in this case),
where durring module_init, the only clocksource available is the low-res
jiffies clocksource. This may cause slight calibration errors, due to
the small sampling time used.

It should be noted that drivers that require fine grained time may not
function on architectures that do not have better then jiffies
resolution timekeeping (there are a few). However, this does not
discount the reasonable need for such fine-grained timekeeping at init
time.

Thus the solution here is to register clocksources earlier (ideally when
the hardware is being initialized), and then we enable clocksource
selection at fs_initcall (before device_initcall).

This patch should probably get some testing time in -mm, since
clocksource selection is one of the most important issues for correct
timekeeping, and I've only been able to test this on a few of my own
boxes.

Signed-off-by: John Stultz <johnstul@us.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-03-05 07:57:53 -08:00
Con Kolivas
69f7c0a1be [PATCH] sched: remove SMT nice
Remove the SMT-nice feature which idles sibling cpus on SMT cpus to
facilitiate nice working properly where cpu power is shared.  The idling of
cpus in the presence of runnable tasks is considered too fragile, easy to
break with outside code, and the complexity of managing this system if an
architecture comes along with many logical cores sharing cpu power will be
unworkable.

Remove the associated per_cpu_gain variable in sched_domains used only by
this code.

Also:

  The reason is that with dynticks enabled, this code breaks without yet
  further tweaks so dynticks brought on the rapid demise of this code.  So
  either we tweak this code or kill it off entirely.  It was Ingo's preference
  to kill it off.  Either way this needs to happen for 2.6.21 since dynticks
  has gone in.

Signed-off-by: Con Kolivas <kernel@kolivas.org>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-03-05 07:57:51 -08:00
Paul Mundt
5c36e6578d sysctl: Support vdso_enabled sysctl on SH.
All of the logic for this was already in place, we just hadn't wired it
up in the sysctl table.

Signed-off-by: Paul Mundt <lethal@linux-sh.org>
2007-03-05 14:13:26 +09:00
Ingo Molnar
7355690ead [PATCH] sched: fix SMT scheduler bug
The SMT scheduler incorrectly skips kernel threads even if they are
runnable (but they are preempted by a higher-prio user-space task which got
SMT-delayed by an even higher-priority task running on a sibling CPU).

Fix this for now by only doing the SMT-nice optimization if the
to-be-delayed task is the only runnable task.  (This should cover most of
the real-life cases anyway.)

This bug has been in the SMT scheduler since 2.6.17 or so, but has only
been noticed now by the active check in the dynticks code.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: Michal Piotrowski <michal.k.k.piotrowski@gmail.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-03-01 14:53:38 -08:00
Sam Ravnborg
1499993cc7 [PATCH] fix section mismatch warning in lockdep
lockdep_init() is marked __init but used in several places
outside __init code. This causes following warnings:
$ scripts/mod/modpost kernel/lockdep.o
WARNING: kernel/built-in.o - Section mismatch: reference to .init.text:lockdep_init from .text.lockdep_init_map after 'lockdep_init_map' (at offset 0x105)
WARNING: kernel/built-in.o - Section mismatch: reference to .init.text:lockdep_init from .text.lockdep_reset_lock after 'lockdep_reset_lock' (at offset 0x35)
WARNING: kernel/built-in.o - Section mismatch: reference to .init.text:lockdep_init from .text.__lock_acquire after '__lock_acquire' (at offset 0xb2)

The warnings are less obviously due to heavy inlining by gcc - this is not
altered.

Fix the section mismatch warnings by removing the __init marking, which
seems obviously wrong.

Signed-off-by: Sam Ravnborg <sam@ravnborg.org>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-03-01 14:53:37 -08:00
Adrian Bunk
93a6fefe2f [PATCH] fix the SYSCTL=n compilation
/home/bunk/linux/kernel-2.6/linux-2.6.20-mm2/kernel/sysctl.c:1411: error: conflicting types for 'register_sysctl_table'
/home/bunk/linux/kernel-2.6/linux-2.6.20-mm2/include/linux/sysctl.h:1042: error: previous declaration of 'register_sysctl_table' was here
make[2]: *** [kernel/sysctl.o] Error 1

Caused by commit 0b4d414714.

Signed-off-by: Adrian Bunk <bunk@stusta.de>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-03-01 14:53:37 -08:00
Thomas Gleixner
c1e16aa279 [PATCH] Fix posix-cpu-timer breakage caused by stale p->last_ran value
Problem description at:
http://bugzilla.kernel.org/show_bug.cgi?id=8048

Commit b18ec80396
    [PATCH] sched: improve migration accuracy
optimized the scheduler time calculations, but broke posix-cpu-timers.

The problem is that the p->last_ran value is not updated after a context
switch.  So a subsequent call to current_sched_time() calculates with a
stale p->last_ran value, i.e.  accounts the full time, which the task was
scheduled away.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-03-01 14:53:37 -08:00
Randy Dunlap
05fb6bf0b2 [PATCH] kernel-doc fixes for 2.6.20-git15 (non-drivers)
Fix kernel-doc warnings in 2.6.20-git15 (lib/, mm/, kernel/, include/).

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-03-01 14:53:37 -08:00
Daniel Walker
9d63463114 [PATCH] update timekeeping_is_continuous comment
Signed-off-by: Daniel Walker <dwalker@mvista.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-03-01 14:53:36 -08:00
Linus Torvalds
b0138a6cb7 Merge master.kernel.org:/pub/scm/linux/kernel/git/kyle/parisc-2.6
* master.kernel.org:/pub/scm/linux/kernel/git/kyle/parisc-2.6: (78 commits)
  [PARISC] Use symbolic last syscall in __NR_Linux_syscalls
  [PARISC] Add missing statfs64 and fstatfs64 syscalls
  Revert "[PARISC] Optimize TLB flush on SMP systems"
  [PARISC] Compat signal fixes for 64-bit parisc
  [PARISC] Reorder syscalls to match unistd.h
  Revert "[PATCH] make kernel/signal.c:kill_proc_info() static"
  [PARISC] fix sys_rt_sigqueueinfo
  [PARISC] fix section mismatch warnings in harmony sound driver
  [PARISC] do not export get_register/set_register
  [PARISC] add ENTRY()/ENDPROC() and simplify assembly of HP/UX emulation code
  [PARISC] convert to use CONFIG_64BIT instead of __LP64__
  [PARISC] use CONFIG_64BIT instead of __LP64__
  [PARISC] add ASM_EXCEPTIONTABLE_ENTRY() macro
  [PARISC] more ENTRY(), ENDPROC(), END() conversions
  [PARISC] fix ENTRY() and ENDPROC() for 64bit-parisc
  [PARISC] Fixes /proc/cpuinfo cache output on B160L
  [PARISC] implement standard ENTRY(), END() and ENDPROC()
  [PARISC] kill ENTRY_SYS_CPUS
  [PARISC] clean up debugging printks in smp.c
  [PARISC] factor syscall_restart code out of do_signal
  ...

Fix conflict in include/linux/sched.h due to kill_proc_info() being made
publicly available to PARISC again.
2007-02-26 12:48:06 -08:00
Linus Torvalds
5313a20bfc Merge master.kernel.org:/pub/scm/linux/kernel/git/davem/tick-2.6
* master.kernel.org:/pub/scm/linux/kernel/git/davem/tick-2.6:
  [TICK] tick-common: Fix one-shot handling in tick_handle_periodic().
  [TIME] tick-sched: Add missing asm/irq_regs.h include.
2007-02-26 11:42:10 -08:00
Linus Torvalds
a7538a7f87 Merge master.kernel.org:/pub/scm/linux/kernel/git/gregkh/driver-2.6
* master.kernel.org:/pub/scm/linux/kernel/git/gregkh/driver-2.6:
  Revert "Driver core: let request_module() send a /sys/modules/kmod/-uevent"
  Driver core: fix error by cleanup up symlinks properly
  make kernel/kmod.c:kmod_mk static
  power management: fix struct layout and docs
  power management: no valid states w/o pm_ops
  Driver core: more fallout from class_device changes for pcmcia
  sysfs: move struct sysfs_dirent to private header
  driver core: refcounting fix
  Driver core: remove class_device_rename
2007-02-26 11:41:30 -08:00
David S. Miller
3494c16676 [TICK] tick-common: Fix one-shot handling in tick_handle_periodic().
When clockevents_program_event() is given an expire time in the
past, it does not update dev->next_event, so this looping code
would loop forever once the first in-the-past expiration time
was used.

Keep advancing "next" locally to fix this bug.

Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-02-26 11:14:15 -08:00
David S. Miller
9e203bcc10 [TIME] tick-sched: Add missing asm/irq_regs.h include.
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-02-26 11:13:49 -08:00
Eric W. Biederman
2a786b452e [PATCH] genirq: Mask irqs when migrating them.
move_native_irqs tries to do the right thing when migrating irqs
by disabling them.  However disabling them is a software logical
thing, not a hardware thing.  This has always been a little flaky
and after Ingo's latest round of changes it is guaranteed to not
mask the apic.

So this patch fixes move_native_irq to directly call the mask and
unmask chip methods to guarantee that we mask the irq when we
are migrating it.  We must do this as it is required by
all code that call into the path.

Since we don't know the masked status when IRQ_DISABLED is
set so we will not be able to restore it.   The patch makes the code
just give up and trying again the next time this routing is called.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Acked-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-26 10:34:08 -08:00
Greg Kroah-Hartman
dfff0a0671 Revert "Driver core: let request_module() send a /sys/modules/kmod/-uevent"
This reverts commit c353c3fb07.

It turns out that we end up with a loop trying to load the unix
module and calling netfilter to do that.  Will redo the patch
later to not have this loop.

Acked-by: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2007-02-23 14:54:57 -08:00
Adrian Bunk
4541ac94d0 make kernel/kmod.c:kmod_mk static
This patch makes the needlessly global struct kmod_mk static.

Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2007-02-23 14:52:09 -08:00
Johannes Berg
9c372d06ce power management: no valid states w/o pm_ops
Change /sys/power/state to not advertise any valid states (except for disk
if SOFTWARE_SUSPEND is enabled) when no pm_ops have been set so userspace
can easily discover what states should be available.

Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Cc: "Randy.Dunlap" <rdunlap@xenotime.net>
Cc: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Pavel Macheck <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2007-02-23 14:52:09 -08:00
Mike Galbraith
63ce18cfe6 driver core: refcounting fix
Fix a reference counting bug exposed by commit
725522b545.  If driver.mod_name exists, we
take a reference in module_add_driver(), and never release it.  Undo that
reference in module_remove_driver().

Signed-off-by: Mike Galbraith <efault@gmx.de>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2007-02-23 14:52:09 -08:00
Jarek Poplawski
60e114d113 [PATCH] lockdep: debug_locks check after check_chain_key
In __lock_acquire check_chain_key can turn off debug_locks, so check is
needed to assure proper return code.

Signed-off-by: Jarek Poplawski <jarkao2@o2.pl>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-20 17:10:14 -08:00
Srinivasa Ds
346fd59bab [PATCH] kprobes: list all active probes in the system
This patch lists all active probes in the system by scanning through
kprobe_table[].  It takes care of aggregate handlers and prints the type of
the probe.  Letter "k" for kprobes, "j" for jprobes, "r" for kretprobes.
It also lists address of the instruction,its symbolic name(function name +
offset) and the module name.  One can access this file through
/sys/kernel/debug/kprobes/list.

Output looks like this
=====================
llm40:~/a # cat /sys/kernel/debug/kprobes/list
c0169ae3  r  sys_read+0x0
c0169ae3  k  sys_read+0x0
c01694c8  k  vfs_write+0x0
c0167d20  r  sys_open+0x0
f8e658a6  k  reiserfs_delete_inode+0x0  reiserfs
c0120f4a  k  do_fork+0x0
c0120f4a  j  do_fork+0x0
c0169b4a  r  sys_write+0x0
c0169b4a  k  sys_write+0x0
c0169622  r  vfs_read+0x0
=================================

[akpm@linux-foundation.org: cleanup]
[ananth@in.ibm.com: sparc build fix]
Signed-off-by: Srinivasa DS <srinivasa@in.ibm.com>
Cc: Prasanna S Panchamukhi <prasanna@in.ibm.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Signed-off-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-20 17:10:14 -08:00
Thomas Gleixner
bc5393a6c9 [PATCH] NOHZ: Produce debug output instead of a BUG()
The BUG_ON() in tick_nohz_stop_sched_tick() triggers on some boxen.
Remove the BUG_ON and print information about the pending softirq
to allow better debugging of the problem.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-19 14:18:43 -08:00
Ingo Molnar
6ba9b346e1 [PATCH] NOHZ: Fix RCU handling
When a CPU is needed for RCU the tick has to continue even when it was
stopped before.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-19 14:18:43 -08:00
Linus Torvalds
cb4aaf46c0 Merge branch 'audit.b37' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/audit-current
* 'audit.b37' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/audit-current:
  [PATCH] AUDIT_FD_PAIR
  [PATCH] audit config lockdown
  [PATCH] minor update to rule add/delete messages (ver 2)
2007-02-19 13:29:54 -08:00
Linus Torvalds
874ff01bd9 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bunk/trivial
* git://git.kernel.org/pub/scm/linux/kernel/git/bunk/trivial: (25 commits)
  Documentation/kernel-docs.txt update.
  arch/cris: typo in KERN_INFO
  Storage class should be before const qualifier
  kernel/printk.c: comment fix
  update I/O sched Kconfig help texts - CFQ is now default, not AS.
  Remove duplicate listing of Cris arch from README
  kbuild: more doc. cleanups
  doc: make doc. for maxcpus= more visible
  drivers/net/eexpress.c: remove duplicate comment
  add a help text for BLK_DEV_GENERIC
  correct a dead URL in the IP_MULTICAST help text
  fix the BAYCOM_SER_HDX help text
  fix SCSI_SCAN_ASYNC help text
  trivial documentation patch for platform.txt
  Fix typos concerning hierarchy
  Fix comment typo "spin_lock_irqrestore".
  Fix misspellings of "agressive".
  drivers/scsi/a100u2w.c: trivial typo patch
  Correct trivial typo in log2.h.
  Remove useless FIND_FIRST_BIT() macro from cardbus.c.
  ...
2007-02-19 13:29:02 -08:00
Al Viro
db3495099d [PATCH] AUDIT_FD_PAIR
Provide an audit record of the descriptor pair returned by pipe() and
socketpair().  Rewritten from the original posted to linux-audit by
John D. Ramsdell <ramsdell@mitre.org>

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2007-02-17 21:30:15 -05:00
Steve Grubb
6a01b07fae [PATCH] audit config lockdown
The following patch adds a new mode to the audit system. It uses the
audit_enabled config option to introduce the idea of audit enabled, but
configuration is immutable. Any attempt to change the configuration
while in this mode is audited. To change the audit rules, you'd need to
reboot the machine.

To use this option, you'd need a modified version of auditctl and use "-e 2".
This is intended to go at the end of the audit.rules file for people that
want an immutable configuration.

This patch also adds "res=" to a number of configuration commands that did not
have it before.

Signed-off-by: Steve Grubb <sgrubb@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2007-02-17 21:30:12 -05:00
Steve Grubb
a17b4ad778 [PATCH] minor update to rule add/delete messages (ver 2)
I was looking at parsing some of these messages and found that I wanted what
it was doing next to an op= for the parser to key on. Also missing was the list
number and results.

Signed-off-by: Steve Grubb <sgrubb@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2007-02-17 21:30:09 -05:00
Patrick Pletscher
0bbfb7c2e4 kernel/printk.c: comment fix
Signed-off-by: Patrick Pletscher <pat@pletscher.org>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
2007-02-17 20:10:16 +01:00
Matthew Wilcox
c3de4b3815 Revert "[PATCH] make kernel/signal.c:kill_proc_info() static"
This reverts commit d3228a887c.
DeBunk this code.  We need it for compat_sys_rt_sigqueueinfo.

Signed-off-by: Kyle McMartin <kyle@parisc-linux.org>
2007-02-17 01:20:07 -05:00
Randy Dunlap
ef665c1a06 sysfs: fix build errors: uevent with CONFIG_SYSFS=n
Fix source files to build with CONFIG_SYSFS=n.
module_subsys is not available.

SYSFS=n, MODULES=y:	T:y
SYSFS=n, MODULES=n:	T:y

SYSFS=y, MODULES=y:	T:y
SYSFS=y, MODULES=n:	T:y

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2007-02-16 15:19:18 -08:00
Mariusz Kozlowski
b92be9f1ec Driver: remove redundant kobject_unregister checks
Here is a patch that removes all redundant kobject_unregister argument checks.

Signed-off-by: Mariusz Kozlowski <m.kozlowski@tuxland.pl>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2007-02-16 15:19:17 -08:00
Kay Sievers
c353c3fb07 Driver core: let request_module() send a /sys/modules/kmod/-uevent
On recent systems, calls to /sbin/modprobe are handled by udev depending
on the kind of device the kernel has discovered. This patch creates an
uevent for the kernels internal request_module(), to let udev take control
over the request, instead of forking the binary directly by the kernel.
The direct execution of /sbin/modprobe can be disabled by setting:
  /sys/module/kmod/mod_request_helper (/proc/sys/kernel/modprobe)
to an empty string, the same way /proc/sys/kernel/hotplug is disabled on an
udev system.

Signed-off-by: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2007-02-16 15:19:15 -08:00
Jan Beulich
5575ddf75c [PATCH] small irq management simplification
Use mask_ack_irq() where possible.

Signed-off-by: Jan Beulich <jbeulich@novell.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-16 08:14:00 -08:00
Randy Dunlap
472900b8b0 [PATCH] IRQ kernel-doc fixes
Fix kernel-doc warnings in IRQ management.

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-16 08:14:00 -08:00
Ingo Molnar
76d2160147 [PATCH] genirq: do not mask interrupts by default
Never mask interrupts immediately upon request.  Disabling interrupts in
high-performance codepaths is rare, and on the other hand this change could
recover lost edges (or even other types of lost interrupts) by conservatively
only masking interrupts after they happen.  (NOTE: with this change the
highlevel irq-disable code still soft-disables this IRQ line - and if such an
interrupt happens then the IRQ flow handler keeps the IRQ masked.)

Mark i8529A controllers as 'never loses an edge'.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-16 08:14:00 -08:00
Paul E. McKenney
1f2ea0837d [PATCH] posix timers: RCU optimization for clock_gettime()
Use RCU to avoid the need to acquire tasklist_lock in the single-threaded
case of clock_gettime().  It still acquires tasklist_lock when for a
(potentially multithreaded) process.  This change allows realtime
applications to frequently monitor CPU consumption of individual tasks, as
requested (and now deployed) by some off-list users.

This has been in Ingo Molnar's -rt patchset since late 2005 with no
problems reported, and tests successfully on 2.6.20-rc6, so I believe that
it is long-since ready for mainline adoption.

[paulmck@linux.vnet.ibm.com: fix exit()/posix_cpu_clock_get() race spotted by Oleg]
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: john stultz <johnstul@us.ibm.com>
Cc: Roman Zippel <zippel@linux-m68k.org>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-16 08:14:00 -08:00
john stultz
c37e7bb5d2 [PATCH] time: x86_64: split x86_64/kernel/time.c up
In preparation for the x86_64 generic time conversion, this patch splits out
TSC and HPET related code from arch/x86_64/kernel/time.c into respective
hpet.c and tsc.c files.

[akpm@osdl.org: fix printk timestamps]
[akpm@osdl.org: cleanup]
Signed-off-by: John Stultz <johnstul@us.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Andi Kleen <ak@muc.de>
Cc: Roman Zippel <zippel@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-16 08:14:00 -08:00
john stultz
acc9a9dcdd [PATCH] generic: vsyscall-gtod support for GENERIC_TIME
Provides generic infrastructure for vsyscall-gtod.

[akpm@osdl.org: cleanup]
Signed-off-by: John Stultz <johnstul@us.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Andi Kleen <ak@muc.de>
Cc: Roman Zippel <zippel@linux-m68k.org>

Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-16 08:14:00 -08:00
Ingo Molnar
289f480af8 [PATCH] Add debugging feature /proc/timer_list
add /proc/timer_list, which prints all currently pending (high-res) timers,
all clock-event sources and their parameters in a human-readable form.

Sample output:

Timer List Version: v0.1
HRTIMER_MAX_CLOCK_BASES: 2
now at 4246046273872 nsecs

cpu: 0
 clock 0:
  .index:      0
  .resolution: 1 nsecs
  .get_time:   ktime_get_real
  .offset:     1273998312645738432 nsecs
active timers:
 clock 1:
  .index:      1
  .resolution: 1 nsecs
  .get_time:   ktime_get
  .offset:     0 nsecs
active timers:
 #0: <f5a90ec8>, hrtimer_sched_tick, hrtimer_stop_sched_tick, swapper/0
 # expires at 4246432689566 nsecs [in 386415694 nsecs]
 #1: <f5a90ec8>, hrtimer_wakeup, do_nanosleep, pcscd/2050
 # expires at 4247018194689 nsecs [in 971920817 nsecs]
 #2: <f5a90ec8>, hrtimer_wakeup, do_nanosleep, irqbalance/1909
 # expires at 4247351358392 nsecs [in 1305084520 nsecs]
 #3: <f5a90ec8>, hrtimer_wakeup, do_nanosleep, crond/2157
 # expires at 4249097614968 nsecs [in 3051341096 nsecs]
 #4: <f5a90ec8>, it_real_fn, do_setitimer, syslogd/1888
 # expires at 4251329900926 nsecs [in 5283627054 nsecs]
  .expires_next   : 4246432689566 nsecs
  .hres_active    : 1
  .check_clocks   : 0
  .nr_events      : 31306
  .idle_tick      : 4246020791890 nsecs
  .tick_stopped   : 1
  .idle_jiffies   : 986504
  .idle_calls     : 40700
  .idle_sleeps    : 36014
  .idle_entrytime : 4246019418883 nsecs
  .idle_sleeptime : 4178181972709 nsecs

cpu: 1
 clock 0:
  .index:      0
  .resolution: 1 nsecs
  .get_time:   ktime_get_real
  .offset:     1273998312645738432 nsecs
active timers:
 clock 1:
  .index:      1
  .resolution: 1 nsecs
  .get_time:   ktime_get
  .offset:     0 nsecs
active timers:
 #0: <f5a90ec8>, hrtimer_sched_tick, hrtimer_restart_sched_tick, swapper/0
 # expires at 4246050084568 nsecs [in 3810696 nsecs]
 #1: <f5a90ec8>, hrtimer_wakeup, do_nanosleep, atd/2227
 # expires at 4261010635003 nsecs [in 14964361131 nsecs]
 #2: <f5a90ec8>, hrtimer_wakeup, do_nanosleep, smartd/2332
 # expires at 5469485798970 nsecs [in 1223439525098 nsecs]
  .expires_next   : 4246050084568 nsecs
  .hres_active    : 1
  .check_clocks   : 0
  .nr_events      : 24043
  .idle_tick      : 4246046084568 nsecs
  .tick_stopped   : 0
  .idle_jiffies   : 986510
  .idle_calls     : 26360
  .idle_sleeps    : 22551
  .idle_entrytime : 4246043874339 nsecs
  .idle_sleeptime : 4170763761184 nsecs

tick_broadcast_mask: 00000003
event_broadcast_mask: 00000001

CPU#0's local event device:

Clock Event Device: lapic
 capabilities:   0000000e
 max_delta_ns:   807385544
 min_delta_ns:   1443
 mult:           44624025
 shift:          32
 set_next_event: lapic_next_event
 set_mode:       lapic_timer_setup
 event_handler:  hrtimer_interrupt
  .installed:  1
  .expires:    4246432689566 nsecs

CPU#1's local event device:

Clock Event Device: lapic
 capabilities:   0000000e
 max_delta_ns:   807385544
 min_delta_ns:   1443
 mult:           44624025
 shift:          32
 set_next_event: lapic_next_event
 set_mode:       lapic_timer_setup
 event_handler:  hrtimer_interrupt
  .installed:  1
  .expires:    4246050084568 nsecs

Clock Event Device: hpet
 capabilities:   00000007
 max_delta_ns:   2147483647
 min_delta_ns:   3352
 mult:           61496110
 shift:          32
 set_next_event: hpet_next_event
 set_mode:       hpet_set_mode
 event_handler:  handle_nextevt_broadcast

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: john stultz <johnstul@us.ibm.com>
Cc: Roman Zippel <zippel@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-16 08:13:59 -08:00
Ingo Molnar
82f67cd9fc [PATCH] Add debugging feature /proc/timer_stat
Add /proc/timer_stats support: debugging feature to profile timer expiration.
Both the starting site, process/PID and the expiration function is captured.
This allows the quick identification of timer event sources in a system.

Sample output:

# echo 1 > /proc/timer_stats
# cat /proc/timer_stats
Timer Stats Version: v0.1
Sample period: 4.010 s
  24,     0 swapper          hrtimer_stop_sched_tick (hrtimer_sched_tick)
  11,     0 swapper          sk_reset_timer (tcp_delack_timer)
   6,     0 swapper          hrtimer_stop_sched_tick (hrtimer_sched_tick)
   2,     1 swapper          queue_delayed_work_on (delayed_work_timer_fn)
  17,     0 swapper          hrtimer_restart_sched_tick (hrtimer_sched_tick)
   2,     1 swapper          queue_delayed_work_on (delayed_work_timer_fn)
   4,  2050 pcscd            do_nanosleep (hrtimer_wakeup)
   5,  4179 sshd             sk_reset_timer (tcp_write_timer)
   4,  2248 yum-updatesd     schedule_timeout (process_timeout)
  18,     0 swapper          hrtimer_restart_sched_tick (hrtimer_sched_tick)
   3,     0 swapper          sk_reset_timer (tcp_delack_timer)
   1,     1 swapper          neigh_table_init_no_netlink (neigh_periodic_timer)
   2,     1 swapper          e1000_up (e1000_watchdog)
   1,     1 init             schedule_timeout (process_timeout)
100 total events, 25.24 events/sec

[ cleanups and hrtimers support from Thomas Gleixner <tglx@linutronix.de> ]
[bunk@stusta.de: nr_entries can become static]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: john stultz <johnstul@us.ibm.com>
Cc: Roman Zippel <zippel@linux-m68k.org>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-16 08:13:59 -08:00
Thomas Gleixner
8bfd9a7a22 [PATCH] hrtimers: prevent possible itimer DoS
Fix potential setitimer DoS with high-res timers by pushing itimer rearm
processing to process context.

[Fixes from: Ingo Molnar <mingo@elte.hu>]
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: john stultz <johnstul@us.ibm.com>
Cc: Roman Zippel <zippel@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-16 08:13:59 -08:00
Thomas Gleixner
54cdfdb47f [PATCH] hrtimers: add high resolution timer support
Implement high resolution timers on top of the hrtimers infrastructure and the
clockevents / tick-management framework.  This provides accurate timers for
all hrtimer subsystem users.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: john stultz <johnstul@us.ibm.com>
Cc: Roman Zippel <zippel@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-16 08:13:59 -08:00
Thomas Gleixner
79bf2bb335 [PATCH] tick-management: dyntick / highres functionality
With Ingo Molnar <mingo@elte.hu>

Add functions to provide dynamic ticks and high resolution timers.  The code
which keeps track of jiffies and handles the long idle periods is shared
between tick based and high resolution timer based dynticks.  The dyntick
functionality can be disabled on the kernel commandline.  Provide also the
infrastructure to support high resolution timers.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: john stultz <johnstul@us.ibm.com>
Cc: Roman Zippel <zippel@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-16 08:13:59 -08:00
Thomas Gleixner
f8381cba04 [PATCH] tick-management: broadcast functionality
With Ingo Molnar <mingo@elte.hu>

Add broadcast functionality, so per cpu clock event devices can be registered
as dummy devices or switched from/to broadcast on demand.  The broadcast
function distributes the events via the broadcast function of the clock event
device.  This is primarily designed to replace the switch apic timer to / from
IPI in power states, where the apic stops.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: john stultz <johnstul@us.ibm.com>
Cc: Roman Zippel <zippel@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-16 08:13:59 -08:00
Thomas Gleixner
906568c9c6 [PATCH] tick-management: core functionality
With Ingo Molnar <mingo@elte.hu>

The tick-management code is the first user of the clockevents layer.  It takes
clock event devices from the clock events core and uses them to provide the
periodic tick.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: john stultz <johnstul@us.ibm.com>
Cc: Roman Zippel <zippel@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-16 08:13:59 -08:00
Thomas Gleixner
d316c57ff6 [PATCH] clockevents: add core functionality
Architectures register their clock event devices, in the clock events core.
Users of the clockevents core can get clock event devices for their use.  The
clockevents core code provides notification mechanisms for various clock
related management events.

This allows to control the clock event devices without the architectures
having to worry about the details of function assignment.  This is also a
preliminary for high resolution timers and dynamic ticks to allow the core
code to control the clock functionality without intrusive changes to the
architecture code.

[Fixes-by: Ingo Molnar <mingo@elte.hu>]
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: Roman Zippel <zippel@linux-m68k.org>
Cc: john stultz <johnstul@us.ibm.com>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-16 08:13:59 -08:00
Thomas Gleixner
5cfb6de7cd [PATCH] hrtimers: clean up callback tracking
Reintroduce ktimers feature "optimized away" by the ktimers review process:
remove the curr_timer pointer from the cpu-base and use the hrtimer state.

No functional changes.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: Roman Zippel <zippel@linux-m68k.org>
Cc: john stultz <johnstul@us.ibm.com>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-16 08:13:58 -08:00
Thomas Gleixner
303e967ff9 [PATCH] hrtimers; add state tracking
Reintroduce ktimers feature "optimized away" by the ktimers review process:
multiple hrtimer states to enable the running of hrtimers without holding the
cpu-base-lock.

(The "optimized" rbtree hack carried only 2 states worth of information and we
need 4 for high resolution timers and dynamic ticks.)

No functional changes.

Build-fixes-from: Andrew Morton <akpm@osdl.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: Roman Zippel <zippel@linux-m68k.org>
Cc: john stultz <johnstul@us.ibm.com>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-16 08:13:58 -08:00
Thomas Gleixner
3c8aa39d7c [PATCH] hrtimers: cleanup locking
Improve kernel/hrtimers.c locking: use a per-CPU base with a lock to control
locking of all clocks belonging to a CPU.  This simplifies code that needs to
lock all clocks at once.  This makes life easier for high-res timers and
dyntick.

No functional changes.

[ optimization change from Andrew Morton <akpm@osdl.org> ]

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: john stultz <johnstul@us.ibm.com>
Cc: Roman Zippel <zippel@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-16 08:13:58 -08:00
Thomas Gleixner
c9cb2e3d7c [PATCH] hrtimers: namespace and enum cleanup
- hrtimers did not use the hrtimer_restart enum and relied on the implict
  int representation. Fix the prototypes and the functions using the enums.
- Use seperate name spaces for the enumerations
- Convert hrtimer_restart macro to inline function
- Add comments

No functional changes.

[akpm@osdl.org: fix input driver]
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: john stultz <johnstul@us.ibm.com>
Cc: Roman Zippel <zippel@linux-m68k.org>
Cc: Dmitry Torokhov <dtor@mail.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-16 08:13:58 -08:00
Thomas Gleixner
fd064b9b77 [PATCH] Extend next_timer_interrupt() to use a reference jiffie
For CONFIG_NO_HZ we need to calculate the next timer wheel event based on a
given jiffie value.  Extend the existing code to allow the extra 'now'
argument.  Provide a compability function for the existing implementations to
call the function with now == jiffies.  (This also solves the racyness of the
original code vs.  jiffies changing during the iteration.)

No functional changes to existing users of this infrastructure.

[ remove WARN_ON() that triggered on s390, by Carsten Otte <cotte@de.ibm.com> ]
[ made new helper static, Adrian Bunk <bunk@stusta.de> ]
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: john stultz <johnstul@us.ibm.com>
Cc: Roman Zippel <zippel@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-16 08:13:58 -08:00
Thomas Gleixner
1cfd68496e [PATCH] Fix cascade lookup of next_timer_interrupt
When searching for the next pending timer in the timer wheel we need to take
the cascade into account.  The current code has several problems:

 1. it looks into the previous cascade
 2. it ignores a pending cascade
 3. it ignores multiple cascades

Change the cascade lookup, so it calculates the array index from the point of
the next cascade and always look at the cascade buckets, when the cascade is
pending, i.e.  gets executed in the next timer softirq.  When multiple
cascades are pending, then lookup the next buckets too.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: john stultz <johnstul@us.ibm.com>
Cc: Roman Zippel <zippel@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-16 08:13:58 -08:00
Ingo Molnar
dde4b2b5f4 [PATCH] uninline irq_enter()
Uninline irq_enter().  [dynticks adds more stuff to it]

No functional changes.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: john stultz <johnstul@us.ibm.com>
Cc: Roman Zippel <zippel@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-16 08:13:58 -08:00
Thomas Gleixner
5d8b34fdcb [PATCH] clocksource: Add verification (watchdog) helper
The TSC needs to be verified against another clocksource.  Instead of using
hardwired assumptions of available hardware, provide a generic verification
mechanism.  The verification uses the best available clocksource and handles
the usability for high resolution timers / dynticks of the clocksource which
needs to be verified.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: john stultz <johnstul@us.ibm.com>
Cc: Roman Zippel <zippel@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-16 08:13:57 -08:00
Thomas Gleixner
7e69f2b1ea [PATCH] clocksource: Remove the update callback
The clocksource code allows direct updates of the rating of a given
clocksource now.  Change TSC unstable tracking to use this interface and
remove the update callback.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: john stultz <johnstul@us.ibm.com>
Cc: Roman Zippel <zippel@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-16 08:13:57 -08:00
Thomas Gleixner
73b08d2aa4 [PATCH] clocksource: replace is_continuous by a flag field
Using a flag filed allows to encode more than one information into a variable.
Preparatory patch for the generic clocksource verification.

[mingo@elte.hu: convert vmitime.c to the new clocksource flag]
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: john stultz <johnstul@us.ibm.com>
Cc: Roman Zippel <zippel@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-16 08:13:57 -08:00
Thomas Gleixner
92c7e00254 [PATCH] Simplify the registration of clocksources
Enqueue clocksources in rating order to make selection of the clocksource
easier.  Also check the match with an user override at enqueue time.

Preparatory patch for the generic clocksource verification.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: john stultz <johnstul@us.ibm.com>
Cc: Roman Zippel <zippel@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-16 08:13:57 -08:00
John Stultz
411187fb05 [PATCH] GTOD: persistent clock support
Persistent clock support: do proper timekeeping across suspend/resume.

[bunk@stusta.de: cleanup]
Signed-off-by: John Stultz <johnstul@us.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: Roman Zippel <zippel@linux-m68k.org>
Cc: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-16 08:13:57 -08:00
Ingo Molnar
41cf54455d [PATCH] Fix multiple conversion bugs in msecs_to_jiffies
Fix multiple conversion bugs in msecs_to_jiffies().

The main problem is that this condition:

	if (m > jiffies_to_msecs(MAX_JIFFY_OFFSET))

overflows if HZ is smaller than 1000!

This change is user-visible: for HZ=250 SUS-compliant poll()-timeout
value of -20 is mistakenly converted to 'immediate timeout'.

(The new dyntick code also triggered this, that's how we noticed.)

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: john stultz <johnstul@us.ibm.com>
Cc: Roman Zippel <zippel@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-16 08:13:56 -08:00
Ingo Molnar
8b9365d753 [PATCH] Uninline jiffies.h functions
There are loads of fat functions hidden in jiffies.h.  Uninline them.  No code
changes.

[jeremy@goop.org: export fix]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: john stultz <johnstul@us.ibm.com>
Cc: Roman Zippel <zippel@linux-m68k.org>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-16 08:13:56 -08:00
john stultz
f4304ab215 [PATCH] HZ free ntp
Distangle the NTP update from HZ.  This is necessary for dynamic tick enabled
kernels.

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: john stultz <johnstul@us.ibm.com>
Cc: Roman Zippel <zippel@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-16 08:13:56 -08:00
Thomas Gleixner
771ee3b04e [PATCH] Add a function to handle interrupt affinity setting
Provide funtions to:
 - check, whether an interrupt can set the affinity
 - pin the interrupt to a given cpu

Necessary for the ability to setup clocksources more flexible (e.g.  use the
different HPET channels per CPU)

[akpm@osdl.org: alpha build fix]
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: john stultz <johnstul@us.ibm.com>
Cc: Roman Zippel <zippel@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-16 08:13:56 -08:00
Thomas Gleixner
950f4427c2 [PATCH] Add irq flag to disable balancing for an interrupt
Add a flag so we can prevent the irq balancing of an interrupt.  Move the
bits, so we have room for more :)

Necessary for the ability to setup clocksources more flexible (e.g.  use the
different HPET channels per CPU)

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: john stultz <johnstul@us.ibm.com>
Cc: Roman Zippel <zippel@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-16 08:13:56 -08:00
Linus Torvalds
414f827c46 Merge branch 'for-linus' of git://one.firstfloor.org/home/andi/git/linux-2.6
* 'for-linus' of git://one.firstfloor.org/home/andi/git/linux-2.6: (94 commits)
  [PATCH] x86-64: Remove mk_pte_phys()
  [PATCH] i386: Fix broken CONFIG_COMPAT_VDSO on i386
  [PATCH] i386: fix 32-bit ioctls on x64_32
  [PATCH] x86: Unify pcspeaker platform device code between i386/x86-64
  [PATCH] i386: Remove extern declaration from mm/discontig.c, put in header.
  [PATCH] i386: Rename cpu_gdt_descr and remove extern declaration from smpboot.c
  [PATCH] i386: Move mce_disabled to asm/mce.h
  [PATCH] i386: paravirt unhandled fallthrough
  [PATCH] x86_64: Wire up compat epoll_pwait
  [PATCH] x86: Don't require the vDSO for handling a.out signals
  [PATCH] i386: Fix Cyrix MediaGX detection
  [PATCH] i386: Fix warning in cpu initialization
  [PATCH] i386: Fix warning in microcode.c
  [PATCH] x86: Enable NMI watchdog for AMD Family 0x10 CPUs
  [PATCH] x86: Add new CPUID bits for AMD Family 10 CPUs in /proc/cpuinfo
  [PATCH] i386: Remove fastcall in paravirt.[ch]
  [PATCH] x86-64: Fix wrong gcc check in bitops.h
  [PATCH] x86-64: survive having no irq mapping for a vector
  [PATCH] i386: geode configuration fixes
  [PATCH] i386: add option to show more code in oops reports
  ...
2007-02-14 09:46:06 -08:00
Eric W. Biederman
d912b0cc1a [PATCH] sysctl: add a parent entry to ctl_table and set the parent entry
Add a parent entry into the ctl_table so you can walk the list of parents and
find the entire path to a ctl_table entry.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Cc: James Morris <jmorris@namei.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-14 08:10:00 -08:00
Eric W. Biederman
77b14db502 [PATCH] sysctl: reimplement the sysctl proc support
With this change the sysctl inodes can be cached and nothing needs to be done
when removing a sysctl table.

For a cost of 2K code we will save about 4K of static tables (when we remove
de from ctl_table) and 70K in proc_dir_entries that we will not allocate, or
about half that on a 32bit arch.

The speed feels about the same, even though we can now cache the sysctl
dentries :(

We get the core advantage that we don't need to have a 1 to 1 mapping between
ctl table entries and proc files.  Making it possible to have /proc/sys vary
depending on the namespace you are in.  The currently merged namespaces don't
have an issue here but the network namespace under /proc/sys/net needs to have
different directories depending on which network adapters are visible.  By
simply being a cache different directories being visible depending on who you
are is trivial to implement.

[akpm@osdl.org: fix uninitialised var]
[akpm@osdl.org: fix ARM build]
[bunk@stusta.de: make things static]
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-14 08:10:00 -08:00
Eric W. Biederman
1ff007eb8e [PATCH] sysctl: allow sysctl_perm to be called from outside of sysctl.c
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-14 08:09:59 -08:00
Eric W. Biederman
805b5d5e06 [PATCH] sysctl: factor out sysctl_head_next from do_sysctl
The current logic to walk through the list of sysctl table headers is slightly
painful and implement in a way it cannot be used by code outside sysctl.c

I am in the process of implementing a version of the sysctl proc support that
instead of using the proc generic non-caching monster, just uses the existing
sysctl data structure as backing store for building the dcache entries and for
doing directory reads.  To use the existing data structures however I need a
way to get at them.

[akpm@osdl.org: warning fix]
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-14 08:09:59 -08:00
Eric W. Biederman
0b4d414714 [PATCH] sysctl: remove insert_at_head from register_sysctl
The semantic effect of insert_at_head is that it would allow new registered
sysctl entries to override existing sysctl entries of the same name.  Which is
pain for caching and the proc interface never implemented.

I have done an audit and discovered that none of the current users of
register_sysctl care as (excpet for directories) they do not register
duplicate sysctl entries.

So this patch simply removes the support for overriding existing entries in
the sys_sysctl interface since no one uses it or cares and it makes future
enhancments harder.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Acked-by: Ralf Baechle <ralf@linux-mips.org>
Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: David Howells <dhowells@redhat.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Andi Kleen <ak@muc.de>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Corey Minyard <minyard@acm.org>
Cc: Neil Brown <neilb@suse.de>
Cc: "John W. Linville" <linville@tuxdriver.com>
Cc: James Bottomley <James.Bottomley@steeleye.com>
Cc: Jan Kara <jack@ucw.cz>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: Mark Fasheh <mark.fasheh@oracle.com>
Cc: David Chinner <dgc@sgi.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Patrick McHardy <kaber@trash.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-14 08:09:59 -08:00
Eric W. Biederman
ae83681026 [PATCH] sysctl: remove support for directory strategy routines
parse_table has support for calling a strategy routine when descending into a
directory.  To date no one has used this functionality and the /proc/sys
interface has no analog to it.

So no one is using this functionality kill it and make the binary sysctl code
easier to follow.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-14 08:09:59 -08:00
Eric W. Biederman
6703ddfcce [PATCH] sysctl: remove support for CTL_ANY
There are currently no users in the kernel for CTL_ANY and it only has effect
on the binary interface which is practically unused.

So this complicates sysctl lookups for no good reason so just remove it.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-14 08:09:59 -08:00
Eric W. Biederman
2abc26fc6b [PATCH] sysctl: create sys/fs/binfmt_misc as an ordinary sysctl entry
binfmt_misc has a mount point in the middle of the sysctl and that mount point
is created as a proc_generic directory.

Doing it that way gets in the way of cleaning up the sysctl proc support as it
continues the existence of a horrible hack.  So instead simply create the
directory as an ordinary sysctl directory.  At least that removes the magic
special case.

[akpm@osdl.org: warning fix]
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-14 08:09:59 -08:00
Eric W. Biederman
a5494dcd8b [PATCH] sysctl: move SYSV IPC sysctls to their own file
This is just a simple cleanup to keep kernel/sysctl.c from getting to crowded
with special cases, and by keeping all of the ipc logic to together it makes
the code a little more readable.

[gcoady.lk@gmail.com: build fix]
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Kirill Korotaev <dev@sw.ru>
Signed-off-by: Grant Coady <gcoady.lk@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-14 08:09:59 -08:00
Eric W. Biederman
39732acd96 [PATCH] sysctl: move utsname sysctls to their own file
This is just a simple cleanup to keep kernel/sysctl.c from getting to crowded
with special cases, and by keeping all of the utsname logic to together it
makes the code a little more readable.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Kirill Korotaev <dev@sw.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-14 08:09:58 -08:00
Eric W. Biederman
b04c3afb2b [PATCH] sysctl: move init_irq_proc into init/main where it belongs
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-14 08:09:58 -08:00
Thomas Gleixner
38515e908b [PATCH] Scheduled removal of SA_xxx interrupt flags fixups
The obsolete SA_xxx interrupt flags have been used despite the scheduled
removal.  Fixup the remaining users.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Roman Zippel <zippel@linux-m68k.org>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Jeff Garzik <jeff@garzik.org>
Cc: Wim Van Sebroeck <wim@iguana.be>
Cc: Roland Dreier <rolandd@cisco.com>
Cc: Alessandro Zummo <a.zummo@towertech.it>
Cc: James Bottomley <James.Bottomley@steeleye.com>
Cc: Greg KH <greg@kroah.com>
Cc: Dave Airlie <airlied@linux.ie>
Cc: James Simmons <jsimmons@infradead.org>
Cc: "Antonino A. Daplas" <adaplas@pol.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-14 08:09:54 -08:00
Tim Schmielau
cd354f1ae7 [PATCH] remove many unneeded #includes of sched.h
After Al Viro (finally) succeeded in removing the sched.h #include in module.h
recently, it makes sense again to remove other superfluous sched.h includes.
There are quite a lot of files which include it but don't actually need
anything defined in there.  Presumably these includes were once needed for
macros that used to live in sched.h, but moved to other header files in the
course of cleaning it up.

To ease the pain, this time I did not fiddle with any header files and only
removed #includes from .c-files, which tend to cause less trouble.

Compile tested against 2.6.20-rc2 and 2.6.20-rc2-mm2 (with offsets) on alpha,
arm, i386, ia64, mips, powerpc, and x86_64 with allnoconfig, defconfig,
allmodconfig, and allyesconfig as well as a few randconfigs on x86_64 and all
configs in arch/arm/configs on arm.  I also checked that no new warnings were
introduced by the patch (actually, some warnings are removed that were emitted
by unnecessarily included header files).

Signed-off-by: Tim Schmielau <tim@physik3.uni-rostock.de>
Acked-by: Russell King <rmk+kernel@arm.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-14 08:09:54 -08:00
Andi Kleen
a98f0dd34d [PATCH] x86-64: Allow to run a program when a machine check event is detected
When a machine check event is detected (including a AMD RevF threshold
overflow event) allow to run a "trigger" program. This allows user space
to react to such events sooner.

The trigger is configured using a new trigger entry in the
machinecheck sysfs interface. It is currently shared between
all CPUs.

I also fixed the AMD threshold handler to run the machine
check polling code immediately to actually log any events
that might have caused the threshold interrupt.

Also added some documentation for the mce sysfs interface.

Signed-off-by: Andi Kleen <ak@suse.de>
2007-02-13 13:26:23 +01:00
Zachary Amsden
9226d125d9 [PATCH] i386: paravirt CPU hypercall batching mode
The VMI ROM has a mode where hypercalls can be queued and batched.  This turns
out to be a significant win during context switch, but must be done at a
specific point before side effects to CPU state are visible to subsequent
instructions.  This is similar to the MMU batching hooks already provided.
The same hooks could be used by the Xen backend to implement a context switch
multicall.

To explain a bit more about lazy modes in the paravirt patches, basically, the
idea is that only one of lazy CPU or MMU mode can be active at any given time.
 Lazy MMU mode is similar to this lazy CPU mode, and allows for batching of
multiple PTE updates (say, inside a remap loop), but to avoid keeping some
kind of state machine about when to flush cpu or mmu updates, we just allow
one or the other to be active.  Although there is no real reason a more
comprehensive scheme could not be implemented, there is also no demonstrated
need for this extra complexity.

Signed-off-by: Zachary Amsden <zach@vmware.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Andi Kleen <ak@suse.de>
Cc: Jeremy Fitzhardinge <jeremy@xensource.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Chris Wright <chrisw@sous-sol.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
2007-02-13 13:26:21 +01:00
Eric Dumazet
5809f9d442 [PATCH] x86-64: get rid of ARCH_HAVE_XTIME_LOCK
ARCH_HAVE_XTIME_LOCK is used by x86_64 arch .  This arch needs to place a
read only copy of xtime_lock into vsyscall page.  This read only copy is
named __xtime_lock, and xtime_lock is defined in
arch/x86_64/kernel/vmlinux.lds.S as an alias.  So the declaration of
xtime_lock in kernel/timer.c was guarded by ARCH_HAVE_XTIME_LOCK define,
defined to true on x86_64.

We can get same result with _attribute__((weak)) in the declaration. linker
should do the job.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
2007-02-13 13:26:21 +01:00
Arjan van de Ven
92e1d5be91 [PATCH] mark struct inode_operations const 2
Many struct inode_operations in the kernel can be "const".  Marking them const
moves these to the .rodata section, which avoids false sharing with potential
dirty data.  In addition it'll catch accidental writes at compile time to
these shared resources.

Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-12 09:48:46 -08:00
Arjan van de Ven
9a32144e9d [PATCH] mark struct file_operations const 7
Many struct file_operations in the kernel can be "const".  Marking them const
moves these to the .rodata section, which avoids false sharing with potential
dirty data.  In addition it'll catch accidental writes at compile time to
these shared resources.

Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-12 09:48:46 -08:00
Nick Piggin
ff91691bcc [PATCH] sched: avoid div in rebalance_tick
Avoid expensive integer divide 3 times per CPU per tick.

A userspace test of this loop went from 26ns, down to 19ns on a G5; and
from 123ns down to 28ns on a P3.

(Also avoid a variable bit shift, as suggested by Alan. The effect
of this wasn't noticable on the CPUs I tested with).

Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-12 09:48:37 -08:00
Eric W. Biederman
27b0b2f44a [PATCH] pid: remove the now unused kill_pg kill_pg_info and __kill_pg_info
Now that I have changed all of the in-tree users remove the old version of
these functions.  This should make it clear to any out of tree users that they
should be using kill_pgrp kill_pgrp_info or __kill_pgrp_info instead.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-12 09:48:32 -08:00
Eric W. Biederman
41487c65bf [PATCH] pid: replace do/while_each_task_pid with do/while_each_pid_task
There isn't any real advantage to this change except that it allows the old
functions to be removed.  Which is easier on maintenance and puts the code in
a more uniform style.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-12 09:48:32 -08:00
Eric W. Biederman
ab521dc0f8 [PATCH] tty: update the tty layer to work with struct pid
Of kernel subsystems that work with pids the tty layer is probably the largest
consumer.  But it has the nice virtue that the assiation with a session only
lasts until the session leader exits.  Which means that no reference counting
is required.  So using struct pid winds up being a simple optimization to
avoid hash table lookups.

In the long term the use of pid_nr also ensures that when we have multiple pid
spaces mixed everything will work correctly.

Signed-off-by: Eric W. Biederman <eric@maxwell.lnxi.com>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-12 09:48:32 -08:00
Eric W. Biederman
3e7cd6c413 [PATCH] pid: replace is_orphaned_pgrp with is_current_pgrp_orphaned
Every call to is_orphaned_pgrp passed in process_group(current) which is racy
with respect to another thread changing our process group.  It didn't bite us
because we were dealing with integers and the worse we would get would be a
stale answer.

In switching the checks to use struct pid to be a little more efficient and
prepare the way for pid namespaces this race became apparent.

So I simplified the calls to the more specialized is_current_pgrp_orphaned so
I didn't have to worry about making logic changes to avoid the race.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-12 09:48:32 -08:00
Eric W. Biederman
0475ac0845 [PATCH] pid: use struct pid for talking about process groups in exitc
Modify has_stopped_jobs and will_become_orphan_pgrp to use struct pid based
process groups.  This reduces the number of hash tables looks ups and paves
the way for multiple pid spaces.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-12 09:48:32 -08:00
Eric W. Biederman
04a2e6a5cb [PATCH] pid: make session_of_pgrp use struct pid instead of pid_t
To properly implement a pid namespace I need to deal exclusively in terms of
struct pid, because pid_t values become ambiguous.

To this end session_of_pgrp is transformed to take and return a struct pid
pointer.  To avoid the need to worry about reference counting I now require my
caller to hold the appropriate locks.  Leaving callers repsonsible for
increasing the reference count if they need access to the result outside of
the locks.

Since session_of_pgrp currently only has one caller and that caller simply
uses only test the result for equality with another process group, the locking
change means I don't actually have to acquire the tasklist_lock at all.

tiocspgrp is also modified to take and release the lock.  The logic there is a
little more complicated but nothing I won't need when I convert pgrp of a tty
to a struct pid pointer.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-12 09:48:31 -08:00
Eric W. Biederman
8d42db189c [PATCH] signal: rewrite kill_something_info so it uses newer helpers
The goal is to remove users of the old signal helper functions so they can be
removed.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-12 09:48:31 -08:00
Ingo Molnar
944be0b224 [PATCH] close_files(): add scheduling point
close_files() can sometimes take long enough to trigger the soft lockup
detector.

Cc: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-12 09:48:30 -08:00
Alan Cox
3f05044715 [PATCH] kernel: shut up the IRQ mismatch messages
The problem is various drivers legally validly and sensibly try to claim
IRQs but the kernel insists on vomiting forth a giant irrelevant debugging
spew when the types clash.

Edit kernel/irq/manage.c go down to mismatch: in setup_irq() and ifdef out
the if clause that checks for mismatches.  It'll then just do the right
thing and work sanely.

For the current -mm kernel this will do the trick (and moves it into shared
irq debugging as in debug mode the info spew is useful).  I've had a
variant of this in my private tree for some time as I got fed up on the
mess on boxes where old legacy IRQs get reused.

Signed-off-by: Alan Cox <alan@redhat.com>
Cc: Arjan van de Ven <arjan@infradead.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: David Woodhouse <dwmw2@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-12 09:48:28 -08:00
David Woodhouse
a304e1b828 [PATCH] Debug shared irqs
Drivers registering IRQ handlers with SA_SHIRQ really ought to be able to
handle an interrupt happening before request_irq() returns.  They also
ought to be able to handle an interrupt happening during the start of their
call to free_irq().  Let's test that hypothesis....

[bunk@stusta.de: Kconfig fixes]
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
Cc: Arjan van de Ven <arjan@infradead.org>
Signed-off-by: Jesper Juhl <jesper.juhl@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-12 09:48:28 -08:00
Al Viro
5ea8176994 [PATCH] sort the devres mess out
* Split the implementation-agnostic stuff in separate files.
* Make sure that targets using non-default request_irq() pull
  kernel/irq/devres.o
* Introduce new symbols (HAS_IOPORT and HAS_IOMEM) defaulting to positive;
  allow architectures to turn them off (we needed these symbols anyway for
  dependencies of quite a few drivers).
* protect the ioport-related parts of lib/devres.o with CONFIG_HAS_IOPORT.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11 11:18:07 -08:00
Alexey Dobriyan
4b98d11b40 [PATCH] ifdef ->rchar, ->wchar, ->syscr, ->syscw from task_struct
They are fat: 4x8 bytes in task_struct.
They are uncoditionally updated in every fork, read, write and sendfile.
They are used only if you have some "extended acct fields feature".

And please, please, please, read(2) knows about bytes, not characters,
why it is called "rchar"?

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Jay Lan <jlan@engr.sgi.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11 11:18:07 -08:00
Oleg Nesterov
8d06087714 [PATCH] _proc_do_string(): fix short reads
If you try to read things like /proc/sys/kernel/osrelease with single-byte
reads, you get just one byte and then EOF.  This is because _proc_do_string()
assumes that the caller is read()ing into a buffer which is large enough to
fit the whole string in a single hit.

Fix.

Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Michael Tokarev <mjt@tls.msk.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11 11:18:07 -08:00
Robert P. J. Day
501b9ebf43 [PATCH] Fix apparent typo CONFIG_LOCKDEP_DEBUG
Replace the apparent typo CONFIG_LOCKDEP_DEBUG with the correct
CONFIG_DEBUG_LOCKDEP.

Signed-off-by: Robert P. J. Day <rpjday@mindspring.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11 11:18:06 -08:00
Mathieu Desnoyers
1efc5da3cf [PATCH] order of lockdep off/on in vprintk() should be changed
The order of locking between lockdep_off/on() and local_irq_save/restore() in
vprintk() should be changed.

* In kernel/printk.c :

vprintk() does :

preempt_disable()
local_irq_save()
lockdep_off()
spin_lock(&logbuf_lock)
spin_unlock(&logbuf_lock)
if(!down_trylock(&console_sem))
   up(&console_sem)
lockdep_on()
local_irq_restore()
preempt_enable()

The goals here is to make sure we do not call printk() recursively from
kernel/lockdep.c:__lock_acquire() (called from spin_* and down/up) nor from
kernel/lockdep.c:trace_hardirqs_on/off() (called from local_irq_restore/save).
It can then potentially call printk() through mark_held_locks/mark_lock.

It correctly protects against the spin_lock call and the up/down call, but it
does not protect against local_irq_restore. It could cause infinite recursive
printk/trace_hardirqs_on() calls when printk() is called from the
mark_lock() error handing path.

We should change the locking so it becomes correct :

preempt_disable()
lockdep_off()
local_irq_save()
spin_lock(&logbuf_lock)
spin_unlock(&logbuf_lock)
if(!down_trylock(&console_sem))
   up(&console_sem)
local_irq_restore()
lockdep_on()
preempt_enable()

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11 11:18:06 -08:00
Kirill Korotaev
e3e8a75d2a [PATCH] Extract and use wake_up_klogd()
Remove hack with printing space to wake up klogd.  Use explicit
wake_up_klogd().

See earlier discussion
http://groups.google.com/group/fa.linux.kernel/browse_frm/thread/75f496668409f58d/1a8f28983a51e1ff?lnk=st&q=wake_up_klogd+group%3Afa.linux.kernel&rnum=2#1a8f28983a51e1ff

Signed-off-by: Alexey Dobriyan <adobriyan@openvz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11 10:51:34 -08:00
Ingo Molnar
11f57cedcf [PATCH] audit: fix audit_filter_user_rules() initialization bug
gcc emits this warning:

 kernel/auditfilter.c: In function 'audit_filter_user':
 kernel/auditfilter.c:1611: warning: 'state' is used uninitialized in this function

I tend to agree with gcc - there are a couple of plausible exit paths from
audit_filter_user_rules() where it does not set 'state', keeping the
variable uninitialized.  For example if a filter rule has an AUDIT_POSSIBLE
action.  Initialize to 'wont audit'.  Fix whitespace damage too.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11 10:51:34 -08:00
Kyle McMartin
d4d23add3a [PATCH] Common compat_sys_sysinfo
I noticed that almost all architectures implemented exactly the same
sys32_sysinfo...  except parisc, where a bug was to be found in handling of
the uptime.  So let's remove a whole whack of code for fun and profit.
Cribbed compat_sys_sysinfo from x86_64's implementation, since I figured it
would be the best tested.

This patch incorporates Arnd's suggestion of not using set_fs/get_fs, but
instead extracting out the common code from sys_sysinfo.

Cc: Christoph Hellwig <hch@infradead.org>
Cc: <linux-arch@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11 10:51:32 -08:00
Robert P. J. Day
72fd4a35a8 [PATCH] Numerous fixes to kernel-doc info in source files.
A variety of (mostly) innocuous fixes to the embedded kernel-doc content in
source files, including:

  * make multi-line initial descriptions single line
  * denote some function names, constants and structs as such
  * change erroneous opening '/*' to '/**' in a few places
  * reword some text for clarity

Signed-off-by: Robert P. J. Day <rpjday@mindspring.com>
Cc: "Randy.Dunlap" <rdunlap@xenotime.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11 10:51:32 -08:00
Alexey Dobriyan
b653d081c1 [PATCH] proc: remove useless (and buggy) ->nlink settings
Bug: pnx8550 code creates directory but resets ->nlink to 1.

create_proc_entry() et al will correctly set ->nlink for you.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Corey Minyard <minyard@acm.org>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Greg KH <greg@kroah.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11 10:51:32 -08:00
Andrew Morton
cb799b8988 [PATCH] sysctl warning fix
kernel/sysctl.c:2816: warning: 'sysctl_ipc_data' defined but not used

Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11 10:51:31 -08:00
Helge Deller
3db5db4fcd [PATCH] use cycle_t instead of u64 in struct time_interpolator
The 32bit and 64bit PARISC Linux kernels suffers from the problem, that the
gettimeofday() call sometimes returns non-monotonic times.

The easiest way to fix this, is to drop the PARISC-specific implementation
and switch over to the generic TIME_INTERPOLATION framework.

But in order to make it even compile on 32bit PARISC, the patch below which
touches the generic Linux code, is mandatory.

More information and the full patch with the parisc-specific changes is included in this thread: http://lists.parisc-linux.org/pipermail/parisc-linux/2006-December/031003.html

As far as I could see, this patch does not change anything for the existing
architectures which use this framework (IA64 and SPARC64), since "cycles_t"
is defined there as unsigned 64bit-integer anyway (which then makes this
patch a no-change for them).

Signed-off-by: Helge Deller <deller@gmx.de>
Cc: <linux-arch@vger.kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11 10:51:31 -08:00
Theodore Ts'o
34f5a39899 [PATCH] Add TAINT_USER and ability to set taint flags from userspace
Allow taint flags to be set from userspace by writing to
/proc/sys/kernel/tainted, and add a new taint flag, TAINT_USER, to be used
when userspace has potentially done something dangerous that might
compromise the kernel.  This will allow support personnel to ask further
questions about what may have caused the user taint flag to have been set.

For example, they might examine the logs of the realtime JVM to see if the
Java program has used the really silly, stupid, dangerous, and
completely-non-portable direct access to physical memory feature which MUST
be implemented according to the Real-Time Specification for Java (RTSJ).
Sigh.  What were those silly people at Sun thinking?

[akpm@osdl.org: build fix]
[bunk@stusta.de: cleanup]
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11 10:51:29 -08:00
Alexey Dobriyan
b035b6de24 [PATCH] Consolidate default sched_clock()
Use attribute(weak).

Signed-off-by: Alexey Dobriyan <adobriyan@openvz.org>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11 10:51:28 -08:00
Mathieu Desnoyers
23c887522e [PATCH] Relay: add CPU hotplug support
Mathieu originally needed to add this for tracing Xen, but it's something
that's needed for any application that can be tracing while cpus are added.

unplug isn't supported by this patch.  The thought was that at minumum a new
buffer needs to be added when a cpu comes up, but it wasn't worth the effort
to remove buffers on cpu down since they'd be freed soon anyway when the
channel was closed.

[zanussi@us.ibm.com: avoid lock_cpu_hotplug deadlock]
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Cc: Tom Zanussi <zanussi@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11 10:51:28 -08:00
Robert P. J. Day
c376222960 [PATCH] Transform kmem_cache_alloc()+memset(0) -> kmem_cache_zalloc().
Replace appropriate pairs of "kmem_cache_alloc()" + "memset(0)" with the
corresponding "kmem_cache_zalloc()" call.

Signed-off-by: Robert P. J. Day <rpjday@mindspring.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Andi Kleen <ak@muc.de>
Cc: Roland McGrath <roland@redhat.com>
Cc: James Bottomley <James.Bottomley@steeleye.com>
Cc: Greg KH <greg@kroah.com>
Acked-by: Joel Becker <Joel.Becker@oracle.com>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Jan Kara <jack@ucw.cz>
Cc: Michael Halcrow <mhalcrow@us.ibm.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Cc: James Morris <jmorris@namei.org>
Cc: Chris Wright <chrisw@sous-sol.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11 10:51:27 -08:00
Jason Baron
068135e635 [PATCH] lockdep: add graph depth information to /proc/lockdep
Generate locking graph information into /proc/lockdep, for lock hierarchy
documentation and visualization purposes.

sample output:

 c089fd5c OPS:     138 FD:   14 BD:    1 --..: &tty->termios_mutex
  -> [c07a3430] tty_ldisc_lock
  -> [c07a37f0] &port_lock_key
  -> [c07afdc0] &rq->rq_lock_key#2

The lock classes listed are all the first-hop lock dependencies that
lockdep has seen so far.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11 10:51:26 -08:00
Jarek Poplawski
381a229209 [PATCH] lockdep: more unlock-on-error fixes
- returns after DEBUG_LOCKS_WARN_ON added in 3 places

- debug_locks checking after lookup_chain_cache() added in
  __lock_acquire()

- locking for testing and changing global variable max_lockdep_depth
  added in __lock_acquire()

From: Ingo Molnar <mingo@elte.hu>

My __acquire_lock() cleanup introduced a locking bug: on SMP systems we'd
release a non-owned graph lock.  Fix this by moving the graph unlock back,
and by leaving the max_lockdep_depth variable update possibly racy.  (we
dont care, it's just statistics)

Also add some minimal debugging code to graph_unlock()/graph_lock(),
which caught this locking bug.

Signed-off-by: Jarek Poplawski <jarkao2@o2.pl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11 10:51:26 -08:00
Oleg Nesterov
0c12b51712 [PATCH] kill_pid_info: kill acquired_tasklist_lock
Kill acquired_tasklist_lock, sig_needs_tasklist() is very cheap nowadays.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11 10:51:26 -08:00
Alexey Dobriyan
3ee75ac3c0 [PATCH] sysctl_{,ms_}jiffies: fix oldlen semantics
currently it's
1) if *oldlenp == 0,
	don't writeback anything

2) if *oldlenp >= table->maxlen,
	don't writeback more than table->maxlen bytes and rewrite *oldlenp
	don't look at underlying type granularity

3) if 0 < *oldlenp < table->maxlen,
		*cough*
	string sysctls don't writeback more than *oldlenp bytes.
	OK, that's because sizeof(char) == 1

	int sysctls writeback anything in (0, table->maxlen] range
	Though accept integers divisible by sizeof(int) for writing.

sysctl_jiffies and sysctl_ms_jiffies don't writeback anything but
sizeof(int), which violates 1) and 2).

So, make sysctl_jiffies and sysctl_ms_jiffies accept
a) *oldlenp == 0, not doing writeback
b) *oldlenp >= sizeof(int), writing one integer.

-EINVAL still returned for *oldlenp == 1, 2, 3.

Signed-off-by: Alexey Dobriyan <adobriyan@openvz.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11 10:51:24 -08:00
Mathieu Desnoyers
dc29a3657b [PATCH] kernel/time/clocksource.c needs struct task_struct on m68k
kernel/time/clocksource.c needs struct task_struct on m68k.

Because it uses spin_unlock_irq(), which, on m68k, uses hardirq_count(), which
uses preempt_count(), which needs to dereference struct task_struct, we
have to include sched.h. Because it would cause a loop inclusion, we
cannot include sched.h in any other of asm-m68k/system.h,
linux/thread_info.h, linux/hardirq.h, which leaves this ugly include in
a C file as the only simple solution.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Roman Zippel <zippel@linux-m68k.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11 10:51:20 -08:00
Rafael J. Wysocki
2b5b09b3b5 [PATCH] swsusp: Change pm_ops handling by userland interface
Make the userland interface of swsusp call pm_ops->finish() after
enable_nonboot_cpus() and before resume_device(), as indicated by the recent
discussion on Linux-PM (cf.
http://lists.osdl.org/pipermail/linux-pm/2006-November/004164.html).

This patch changes the SNAPSHOT_PMOPS ioctl so that its first function,
PMOPS_PREPARE, only sets a switch turning the platform suspend mode on, and
its last function, PMOPS_FINISH, only checks if the platform mode is enabled.
This should allow the older userland tools to work with new kernels without
any modifications.

The changes here only affect the userland interface of swsusp.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Cc: Greg KH <greg@kroah.com>
Cc: Nigel Cunningham <nigel@suspend2.net>
Cc: Patrick Mochel <mochel@digitalimplant.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11 10:51:20 -08:00
Andrew Morton
d12c610e08 [PATCH] swsusp-change-code-ordering-in-userc-sanity
The compiler will do that.  And if it doesn't, we don't want to either ;)

Cc: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Greg KH <greg@kroah.com>
Cc: Nigel Cunningham <nigel@suspend2.net>
Cc: Patrick Mochel <mochel@digitalimplant.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11 10:51:19 -08:00
Rafael J. Wysocki
259130526c [PATCH] swsusp: Change code ordering in user.c
Change the ordering of code in kernel/power/user.c so that device_suspend() is
called before disable_nonboot_cpus() and device_resume() is called after
enable_nonboot_cpus().  This is needed to make the userland suspend call
pm_ops->finish() after enable_nonboot_cpus() and before device_resume(), as
indicated by the recent discussion on Linux-PM (cf.
http://lists.osdl.org/pipermail/linux-pm/2006-November/004164.html).

The changes here only affect the userland interface of swsusp.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Cc: Greg KH <greg@kroah.com>
Cc: Nigel Cunningham <nigel@suspend2.net>
Cc: Patrick Mochel <mochel@digitalimplant.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11 10:51:19 -08:00
Rafael J. Wysocki
ed746e3b18 [PATCH] swsusp: Change code ordering in disk.c
Change the ordering of code in kernel/power/disk.c so that device_suspend() is
called before disable_nonboot_cpus() and platform_finish() is called after
enable_nonboot_cpus() and before device_resume(), as indicated by the recent
discussion on Linux-PM (cf.
http://lists.osdl.org/pipermail/linux-pm/2006-November/004164.html).

The changes here only affect the built-in swsusp.

[alexey.y.starikovskiy@linux.intel.com: fix LED blinking during image load]
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Cc: Greg KH <greg@kroah.com>
Cc: Nigel Cunningham <nigel@suspend2.net>
Cc: Patrick Mochel <mochel@digitalimplant.org>
Cc: Alexey Starikovskiy <alexey.y.starikovskiy@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11 10:51:19 -08:00
Rafael J. Wysocki
e3c7db621b [PATCH] PM: Change code ordering in main.c
As indicated in a recent thread on Linux-PM, it's necessary to call
pm_ops->finish() before devce_resume(), but enable_nonboot_cpus() has to be
called before pm_ops->finish() (cf.
http://lists.osdl.org/pipermail/linux-pm/2006-November/004164.html).  For
consistency, it seems reasonable to call disable_nonboot_cpus() after
device_suspend().

This way the suspend code will remain symmetrical with respect to the resume
code and it may allow us to speed up things in the future by suspending and
resuming devices and/or saving the suspend image in many threads.

The following series of patches reorders the suspend and resume code so that
nonboot CPUs are disabled after devices have been suspended and enabled before
the devices are resumed.  It also causes pm_ops->finish() to be called after
enable_nonboot_cpus() wherever necessary.

This patch:

Change the ordering of code in kernel/power/main.c so that device_suspend()
is called before disable_nonboot_cpus() and pm_ops->finish() is called after
enable_nonboot_cpus() and before device_resume(), as indicated by recent
discussion on Linux-PM
(cf. http://lists.osdl.org/pipermail/linux-pm/2006-November/004164.html).

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Cc: Greg KH <greg@kroah.com>
Cc: Nigel Cunningham <nigel@suspend2.net>
Cc: Patrick Mochel <mochel@digitalimplant.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11 10:51:19 -08:00
Eric Paris
6ff1b4426e [PATCH] make reading /proc/sys/kernel/cap-bould not require CAP_SYS_MODULE
Reading /proc/sys/kernel/cap-bound requires CAP_SYS_MODULE.  (see
proc_dointvec_bset in kernel/sysctl.c)

sysctl appears to drive all over proc reading everything it can get it's
hands on and is complaining when it is being denied access to read
cap-bound.  Clearly writing to cap-bound should be a sensitive operation
but requiring CAP_SYS_MODULE to read cap-bound seems a bit to strong.  I
believe the information could with reasonable certainty be obtained by
looking at a bunch of the output of /proc/pid/status which has very low
security protection, so at best we are just getting a little obfuscation of
information.

Currently SELinux policy has to 'dontaudit' capability checks for
CAP_SYS_MODULE for things like sysctl which just want to read cap-bound.
In doing so we also as a byproduct have to hide warnings of potential
exploits such as if at some time that sysctl actually tried to load a
module.  I wondered if anyone would have a problem opening cap-bound up to
read from anyone?

Acked-by: Chris Wright <chrisw@sous-sol.org>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Cc: James Morris <jmorris@namei.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11 10:51:19 -08:00
Christoph Lameter
9617729941 [PATCH] Drop free_pages()
nr_free_pages is now a simple access to a global variable.  Make it a macro
instead of a function.

The nr_free_pages now requires vmstat.h to be included.  There is one
occurrence in power management where we need to add the include.  Directly
refrer to global_page_state() there to clarify why the #include was added.

[akpm@osdl.org: arm build fix]
[akpm@osdl.org: sparc64 build fix]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11 10:51:18 -08:00
Christoph Lameter
d23ad42324 [PATCH] Use ZVC for free_pages
This is again simplifies some of the VM counter calculations through the use
of the ZVC consolidated counters.

[michal.k.k.piotrowski@gmail.com: build fix]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Michal Piotrowski <michal.k.k.piotrowski@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11 10:51:17 -08:00
Tejun Heo
9ac7849e35 devres: device resource management
Implement device resource management, in short, devres.  A device
driver can allocate arbirary size of devres data which is associated
with a release function.  On driver detach, release function is
invoked on the devres data, then, devres data is freed.

devreses are typed by associated release functions.  Some devreses are
better represented by single instance of the type while others need
multiple instances sharing the same release function.  Both usages are
supported.

devreses can be grouped using devres group such that a device driver
can easily release acquired resources halfway through initialization
or selectively release resources (e.g. resources for port 1 out of 4
ports).

This patch adds devres core including documentation and the following
managed interfaces.

* alloc/free	: devm_kzalloc(), devm_kzfree()
* IO region	: devm_request_region(), devm_release_region()
* IRQ		: devm_request_irq(), devm_free_irq()
* DMA		: dmam_alloc_coherent(), dmam_free_coherent(),
		  dmam_declare_coherent_memory(), dmam_pool_create(),
		  dmam_pool_destroy()
* PCI		: pcim_enable_device(), pcim_pin_device(), pci_is_managed()
* iomap		: devm_ioport_map(), devm_ioport_unmap(), devm_ioremap(),
		  devm_ioremap_nocache(), devm_iounmap(), pcim_iomap_table(),
		  pcim_iomap(), pcim_iounmap()

Signed-off-by: Tejun Heo <htejun@gmail.com>
Signed-off-by: Jeff Garzik <jeff@garzik.org>
2007-02-09 17:39:36 -05:00
Ralf Baechle
7726942fb1 [APM] Add shared version of APM emulation
Currently ARM and MIPS both have nearly identical copies of the APM
emulation code in their arch code.  Add yet another copy of it to
drivers char and make it selectable through SYS_SUPPORTS_APM_EMULATION.

Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
2007-02-09 17:08:57 +00:00
Linus Torvalds
78149df6d5 Merge master.kernel.org:/pub/scm/linux/kernel/git/gregkh/pci-2.6
* master.kernel.org:/pub/scm/linux/kernel/git/gregkh/pci-2.6: (41 commits)
  Revert "PCI: remove duplicate device id from ata_piix"
  msi: Make MSI useable more architectures
  msi: Kill the msi_desc array.
  msi: Remove attach_msi_entry.
  msi: Fix msi_remove_pci_irq_vectors.
  msi: Remove msi_lock.
  msi: Kill msi_lookup_irq
  MSI: Combine pci_(save|restore)_msi/msix_state
  MSI: Remove pci_scan_msi_device()
  MSI: Replace pci_msi_quirk with calls to pci_no_msi()
  PCI: remove duplicate device id from ipr
  PCI: remove duplicate device id from ata_piix
  PCI: power management: remove noise on non-manageable hw
  PCI: cleanup MSI code
  PCI: make isa_bridge Alpha-only
  PCI: remove quirk_sis_96x_compatible()
  PCI: Speed up the Intel SMBus unhiding quirk
  PCI Quirk: 1k I/O space IOBL_ADR fix on P64H2
  shpchp: delete trailing whitespace
  shpchp: remove DBG_XXX_ROUTINE
  ...
2007-02-07 19:23:44 -08:00
Eric W. Biederman
5b912c108c msi: Kill the msi_desc array.
We need to be able to get from an irq number to a struct msi_desc.
The msi_desc array in msi.c had several short comings the big one was
that it could not be used outside of msi.c.  Using irq_data in struct
irq_desc almost worked except on some architectures irq_data needs to
be used for something else.

So this patch adds a msi_desc pointer to irq_desc, adds the appropriate
wrappers and changes all of the msi code to use them.

The dynamic_irq_init/cleanup code was tweaked to ensure the new
field is left in a well defined state.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2007-02-07 15:50:08 -08:00
Kay Sievers
270a6c4cad /sys/modules/*/holders
/sys/module/usbcore/
  |-- drivers
  |   |-- usb:hub -> ../../../subsystem/usb/drivers/hub
  |   |-- usb:usb -> ../../../subsystem/usb/drivers/usb
  |   `-- usb:usbfs -> ../../../subsystem/usb/drivers/usbfs
  |-- holders
  |   |-- ehci_hcd -> ../../../module/ehci_hcd
  |   |-- uhci_hcd -> ../../../module/uhci_hcd
  |   |-- usb_storage -> ../../../module/usb_storage
  |   `-- usbhid -> ../../../module/usbhid
  |-- initstate

Signed-off-by: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2007-02-07 10:37:12 -08:00
Greg Kroah-Hartman
fe480a2675 Modules: only add drivers/ direcory if needed
This changes the module core to only create the drivers/ directory if we
are going to put something in it.

Cc: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2007-02-07 10:37:12 -08:00
Kay Sievers
f30c53a873 MODULES: add the module name for built in kernel drivers
Signed-off-by: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2007-02-07 10:37:12 -08:00
Al Viro
9abcf40b1d [PATCH] fork_idle() should be __cpuinit, not __devinit
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-01 16:17:06 -08:00
Masami Hiramatsu
ab40c5c6b6 [PATCH] kprobes: replace magic numbers with enum
Replace the magic numbers with an enum, and gets rid of a warning on the
specific architectures (ex.  powerpc) on which the compiler considers
'char' as 'unsigned char'.

Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Prasanna S Panchamukhi <prasanna@in.ibm.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-01-30 16:01:35 -08:00
Serge E. Hallyn
0f2452855d [PATCH] namespaces: fix task exit disaster
This is based on a patch by Eric W.  Biederman, who pointed out that pid
namespaces are still fake, and we only have one ever active.

So for the time being, we can modify any code which could access
tsk->nsproxy->pid_ns during task exit to just use &init_pid_ns instead,
and move the exit_task_namespaces call in do_exit() back above
exit_notify(), so that an exiting nfs server has a valid tsk->sighand to
work with.

Long term, pulling pid_ns out of nsproxy might be the cleanest solution.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>

[ Eric's patch fixed to take care of free_pid() too ]

Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-01-30 13:40:36 -08:00
Linus Torvalds
444f378b23 Revert "[PATCH] namespaces: fix exit race by splitting exit"
This reverts commit 7a238fcba0 in
preparation for a better and simpler fix proposed by Eric Biederman
(and fixed up by Serge Hallyn)

Acked-by: Serge E. Hallyn <serue@us.ibm.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-01-30 13:35:18 -08:00
Serge E. Hallyn
7a238fcba0 [PATCH] namespaces: fix exit race by splitting exit
Fix exit race by splitting the nsproxy putting into two pieces.  First
piece reduces the nsproxy refcount.  If we dropped the last reference, then
it puts the mnt_ns, and returns the nsproxy as a hint to the caller.  Else
it returns NULL.  The second piece of exiting task namespaces sets
tsk->nsproxy to NULL, and drops the references to other namespaces and
frees the nsproxy only if an nsproxy was passed in.

A little awkward and should probably be reworked, but hopefully it fixes
the NFS oops.

Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Cc: Daniel Hokka Zakrisson <daniel@hozac.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-01-30 08:26:44 -08:00
Linus Torvalds
8528b0f1de Clear spurious irq stat information when adding irq handler
Any newly added irq handler may obviously make any old spurious irq
status invalid, since the new handler may well be the thing that is
supposed to handle any interrupts that came in.

So just clear the statistics when adding handlers.

Pointed-out-by: Alan Cox <alan@lxorguk.ukuu.org.uk>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-01-23 14:16:31 -08:00
Ingo Molnar
1b5180b651 [PATCH] notifiers: fix blocking_notifier_call_chain() scalability
while lock-profiling the -rt kernel i noticed weird contention during
mmap-intense workloads, and the tracer showed the following gem, in one
of our MM hotpaths:

 threaded-2771  1....   65us : sys_munmap (sysenter_do_call)
 threaded-2771  1....   66us : profile_munmap (sys_munmap)
 threaded-2771  1....   66us : blocking_notifier_call_chain (profile_munmap)
 threaded-2771  1....   66us : rt_down_read (blocking_notifier_call_chain)

ouch! a global rw-semaphore taken in one of the most performance-
sensitive codepaths of the kernel.  And i dont even have oprofile
enabled! All distro kernels have CONFIG_PROFILING enabled, so this
scalability problem affects the majority of Linux users.

The fix is to enhance blocking_notifier_call_chain() to only take the
lock if there appears to be work on the call-chain.

With this patch applied i get nicely saturated system, and much higher
munmap performance, on SMP systems.

And as a bonus this also fixes a similar scalability bottleneck in the
thread-exit codepath: profile_task_exit() ...

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-01-23 11:08:03 -08:00
Andrew Morton
bbe1a59b3a [PATCH] fix "kvm: add vm exit profiling"
export profile_hits() on !SMP too.

Cc: Ingo Molnar <mingo@elte.hu>
Cc: Avi Kivity <avi@qumranet.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-01-23 07:52:05 -08:00
Ingo Molnar
07031e14c1 [PATCH] KVM: add VM-exit profiling
This adds the profile=kvm boot option, which enables KVM to profile VM
exits.

Use: "readprofile -m ./System.map | sort -n" to see the resulting
output:

   [...]
   18246 serial_out                               148.3415
   18945 native_flush_tlb                         378.9000
   23618 serial_in                                212.7748
   29279 __spin_unlock_irq                        622.9574
   43447 native_apic_write                        2068.9048
   52702 enable_8259A_irq                         742.2817
   54250 vgacon_scroll                             89.3740
   67394 ide_inb                                  6126.7273
   79514 copy_page_range                           98.1654
   84868 do_wp_page                                86.6000
  140266 pit_read                                 783.6089
  151436 ide_outb                                 25239.3333
  152668 native_io_delay                          21809.7143
  174783 mask_and_ack_8259A                       783.7803
  362404 native_set_pte_at                        36240.4000
 1688747 total                                      0.5009

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Avi Kivity <avi@qumranet.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2007-01-11 18:18:21 -08:00
Gautham R Shenoy
b282b6f8a8 [PATCH] Change cpu_up and co from __devinit to __cpuinit
Compiling the kernel with CONFIG_HOTPLUG = y and CONFIG_HOTPLUG_CPU = n
with CONFIG_RELOCATABLE = y generates the following modpost warnings

WARNING: vmlinux - Section mismatch: reference to .init.data: from
.text between '_cpu_up' (at offset 0xc0141b7d) and 'cpu_up'
WARNING: vmlinux - Section mismatch: reference to .init.data: from
.text between '_cpu_up' (at offset 0xc0141b9c) and 'cpu_up'
WARNING: vmlinux - Section mismatch: reference to .init.text:__cpu_up
from .text between '_cpu_up' (at offset 0xc0141bd8) and 'cpu_up'
WARNING: vmlinux - Section mismatch: reference to .init.data: from
.text between '_cpu_up' (at offset 0xc0141c05) and 'cpu_up'
WARNING: vmlinux - Section mismatch: reference to .init.data: from
.text between '_cpu_up' (at offset 0xc0141c26) and 'cpu_up'
WARNING: vmlinux - Section mismatch: reference to .init.data: from
.text between '_cpu_up' (at offset 0xc0141c37) and 'cpu_up'

This is because cpu_up, _cpu_up and __cpu_up (in some architectures) are
defined as __devinit
AND
__cpu_up calls some __cpuinit functions.

Since __cpuinit would map to __init with this kind of a configuration,
we get a .text refering .init.data warning.

This patch solves the problem by converting all of __cpu_up, _cpu_up
and cpu_up from __devinit to __cpuinit. The approach is justified since
the callers of cpu_up are either dependent on CONFIG_HOTPLUG_CPU or
are of __init type.

Thus when CONFIG_HOTPLUG_CPU=y, all these cpu up functions would land up
in .text section, and when CONFIG_HOTPLUG_CPU=n, all these functions would
land up in .init section.

Tested on a i386 SMP machine running linux-2.6.20-rc3-mm1.

Signed-off-by: Gautham R Shenoy <ego@in.ibm.com>
Cc: Vivek Goyal <vgoyal@in.ibm.com>
Cc: Mikael Starvik <starvik@axis.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2007-01-11 18:18:20 -08:00
Nathan Lynch
e5e5673f82 [PATCH] sched: tasks cannot run on cpus onlined after boot
Commit 5c1e176781 ("sched: force /sbin/init
off isolated cpus") sets init's cpus_allowed to a subset of cpu_online_map
at boot time, which means that tasks won't be scheduled on cpus that are
added to the system later.

Make init's cpus_allowed a subset of cpu_possible_map instead.  This should
still preserve the behavior that Nick's change intended.

Thanks to Giuliano Pochini for reporting this and testing the fix:

http://ozlabs.org/pipermail/linuxppc-dev/2006-December/029397.html

Signed-off-by: Nathan Lynch <ntl@pobox.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2007-01-11 18:18:20 -08:00
Vivek Goyal
343cde51b3 [PATCH] x86-64: Make noirqdebug_setup function non init to fix modpost warning
o noirqdebug_setup() is __init but it is being called by
  quirk_intel_irqbalance() which if of type __devinit. If CONFIG_HOTPLUG=y,
  quirk_intel_irqbalance() is put into text section and it is wrong to
  call a function in __init section.

o MODPOST flags this on i386 if CONFIG_RELOCATABLE=y

WARNING: vmlinux - Section mismatch: reference to .init.text:noirqdebug_setup from .text between 'quirk_intel_irqbalance' (at offset 0xc010969e) and 'i8237A_suspend'

o Make noirqdebug_setup() non-init.

Signed-off-by: Vivek Goyal <vgoyal@in.ibm.com>
Signed-off-by: Andi Kleen <ak@suse.de>
2007-01-11 01:52:44 +01:00
Linus Torvalds
d0abc451a6 Merge master.kernel.org:/pub/scm/linux/kernel/git/gregkh/driver-2.6
* master.kernel.org:/pub/scm/linux/kernel/git/gregkh/driver-2.6:
  [PATCH] Driver core: Fix prefix driver links in /sys/module by bus-name
2007-01-06 00:10:55 -08:00
Ingo Molnar
a75acf850c [PATCH] profiling: fix sched profiling typo
Fix sched profiling typo, introduced by the sleep profiling patch.  This
bug caused profile=sched to not work.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2007-01-05 23:55:22 -08:00
Rafael J. Wysocki
7bf2368742 [PATCH] swsusp: Do not fail if resume device is not set
In the kernels later than 2.6.19 there is a regression that makes swsusp
fail if the resume device is not explicitly specified.

It can be fixed by adding an additional parameter to
mm/swapfile.c:swap_type_of() allowing us to pass the (struct block_device
*) corresponding to the first available swap back to the caller.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2007-01-05 23:55:22 -08:00
Ard van Breemen
a416aba637 [PATCH] kernelparams: detect if and which parameter parsing enabled irq's
The parsing of some kernel parameters seem to enable irq's at a stage that
irq's are not supposed to be enabled (Particularly the ide kernel parameters).

Having irq's enabled before the irq controller is initialized might lead to a
kernel panic.  This patch only detects this behaviour and warns about wich
parameter caused it.

[akpm@osdl.org: cleanups]
Signed-off-by: Ard van Breemen <ard@telegraafnet.nl>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2007-01-05 23:55:21 -08:00
Kay Sievers
7f422e2e84 [PATCH] Driver core: Fix prefix driver links in /sys/module by bus-name
Modules may have drivers with the same name on different buses.
This patch fixes this problem.

Signed-off-by: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2007-01-05 12:33:38 -08:00
Oleg Nesterov
241ceee0b4 [PATCH] restore ->pdeath_signal behaviour
Commit b2b2cbc4b2 introduced a user-
visible change: ->pdeath_signal is sent only when the entire thread
group exits.

While this change is imho good, it may break things.  So restore the
old behaviour for now.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
To: Albert Cahalan <acahalan@gmail.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Andrew Morton <akpm@osdl.org>
Cc: Linus Torvalds <torvalds@osdl.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Qi Yong <qiyong@fc-cn.com>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-31 14:41:18 -08:00
Andrew Morton
755cd90029 [PATCH] lockdep: printk warning fix
kernel/lockdep.c: In function `lookup_chain_cache':
kernel/lockdep.c:1339: warning: long long unsigned int format, u64 arg (arg 2)
kernel/lockdep.c:1344: warning: long long unsigned int format, u64 arg (arg 2)

Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-30 10:56:43 -08:00
Andrew Morton
089e34b600 [PATCH] cpuset procfs warning fix
fs/proc/base.c:1869: warning: initialization discards qualifiers from pointer target type
fs/proc/base.c:2150: warning: initialization discards qualifiers from pointer target type

Cc: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-30 10:56:43 -08:00
Akinobu Mita
a3f99f8ba8 [PATCH] module: fix mod_sysfs_setup() return value
mod_sysfs_setup() doesn't return error when kobject_add_dir() failed.

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-30 10:56:41 -08:00
Ingo Molnar
9414232fa0 [PATCH] sched: fix cond_resched_softirq() offset
Remove the __resched_legal() check: it is conceptually broken.  The biggest
problem it had is that it can mask buggy cond_resched() calls.  A
cond_resched() call is only legal if we are not in an atomic context, with
two narrow exceptions:

 - if the system is booting
 - a reacquire_kernel_lock() down() done while PREEMPT_ACTIVE is set

But __resched_legal() hid this and just silently returned whenever
these primitives were called from invalid contexts. (Same goes for
cond_resched_locked() and cond_resched_softirq()).

Furthermore, the __legal_resched(0) call was buggy in that it caused
unnecessarily long softirq latencies via cond_resched_softirq().  (which is
only called from softirq-off sections, hence the code did nothing.)

The fix is to resurrect the efficiency of the might_sleep checks and to
only allow the narrow exceptions.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-30 10:56:41 -08:00
Ingo Molnar
e4e6bdbb42 [PATCH] rcu: rcutorture suspend fix
Fix suspend hang: rcutorture threads need to be nofreeze.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-30 10:55:55 -08:00
Ingo Molnar
e1d9fd2e3d [PATCH] suspend: fix suspend on single-CPU systems
Clark Williams reported that suspend doesnt work on his laptop on
2.6.20-rc1-rt kernels. The bug was introduced by the following cleanup
commit:

 commit 112cecb2cc
 Author: Siddha, Suresh B <suresh.b.siddha@intel.com>
 Date:   Wed Dec 6 20:34:31 2006 -0800

    [PATCH] suspend: don't change cpus_allowed for task initiating the suspend

because with this change 'error' is not initialized to 0 anymore, if
there are no other online CPUs. (i.e. if the system is single-CPU).

the fix is the initialize it to 0. The really weird thing is that my
version of gcc does not warn about this non-initialized variable
situation ...

(also fix the kernel printk in the error branch, it was missing a
 newline)

Reported-by: Clark Williams <williams@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-23 13:59:33 -08:00
Linus Torvalds
18ed1c0513 Merge branch 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-acpi-2.6
* 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-acpi-2.6: (68 commits)
  ACPI: replace kmalloc+memset with kzalloc
  ACPI: Add support for acpi_load_table/acpi_unload_table_id
  fbdev: update after backlight argument change
  ACPI: video: Add dev argument for backlight_device_register
  ACPI: Implement acpi_video_get_next_level()
  ACPI: Kconfig - depend on PM rather than selecting it
  ACPI: fix NULL check in drivers/acpi/osl.c
  ACPI: make drivers/acpi/ec.c:ec_ecdt static
  ACPI: prevent processor module from loading on failures
  ACPI: fix single linked list manipulation
  ACPI: ibm_acpi: allow clean removal
  ACPI: fix git automerge failure
  ACPI: ibm_acpi: respond to workqueue update
  ACPI: dock: add uevent to indicate change in device status
  ACPI: ec: Lindent once again
  ACPI: ec: Change #define to enums there possible.
  ACPI: ec: Style changes.
  ACPI: ec: Acquire Global Lock under EC mutex.
  ACPI: ec: Drop udelay() from poll mode. Loop by reading status field instead.
  ACPI: ec: Rename gpe_bit to gpe
  ...
2006-12-22 18:46:56 -08:00
Eric W. Biederman
b2b2cbc4b2 [PATCH] Fix reparenting to the same thread group. (take 2)
This patch fixes the case when we reparent to a different thread in the
same thread group.  This modifies the code so that we do not send
signals and do not change the signal to send to SIGCHLD unless we have
change the thread group of our parents.  It also suppresses sending
pdeath_sig in this cas as well since the result of geppid doesn't
change.

Thanks to Oleg for spotting my bug of only fixing this for non-ptraced
tasks.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Albert Cahalan <acahalan@gmail.com>
Cc: Andrew Morton <akpm@osdl.org>
Cc: Roland McGrath <roland@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Coywolf Qi Hunt <qiyong@fc-cn.com>
Acked-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-22 09:03:41 -08:00
Andrew Morton
192636ad90 [PATCH] relay: remove inlining
text    data     bss     dec     hex filename
before:   4036      44       0    4080     ff0 kernel/relay.o
after:    3727      44       0    3771     ebb kernel/relay.o

Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Cc: Tom Zanussi <zanussi@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-22 08:55:51 -08:00
Vadim Lobanov
01b2d93ca4 [PATCH] fdtable: Provide free_fdtable() wrapper
Christoph Hellwig has expressed concerns that the recent fdtable changes
expose the details of the RCU methodology used to release no-longer-used
fdtable structures to the rest of the kernel.  The trivial patch below
addresses these concerns by introducing the appropriate free_fdtable()
calls, which simply wrap the release RCU usage.  Since free_fdtable() is a
one-liner, it makes sense to promote it to an inline helper.

Signed-off-by: Vadim Lobanov <vlobanov@speakeasy.net>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-22 08:55:50 -08:00
Andrew Morton
5b149bcc23 [PATCH] schedule_timeout(): improve warning message
Kyle is hitting this warning, and we don't have a clue what it's caused by.
Add the obligatory dump_stack().

Cc: kyle <kylewong@southa.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-22 08:55:49 -08:00
Akinobu Mita
3e1fbd12c9 [PATCH] audit: fix kstrdup() error check
kstrdup() returns NULL on error.

Cc: David Woodhouse <dwmw2@infradead.org>
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-22 08:55:49 -08:00
Thomas Gleixner
9d7ac8be4b [PATCH] genirq: fix irq flow handler uninstall
The sanity check for no_irq_chip in __set_irq_hander() is unconditional on
both install and uninstall of an handler.  This triggers false warnings and
replaces no_irq_chip by dummy_irq_chip in the uninstall case.

Check only, when a real handler is installed.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Sylvain Munaut <tnt@246tNt.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-22 08:55:48 -08:00
Tim Chen
67af63a6ab [PATCH] sched: remove __cpuinitdata anotation to cpu_isolated_map
The structure cpu_isolated_map is used not only during initialization.
Multi-core scheduler configuration changes and exclusive cpusets
use this during run time.  During setting of sched_mc_power_savings
 policy, this structure is accessed to update sched_domains.

Signed-off-by: Tim Chen <tim.c.chen@intel.com>
Acked-by: Suresh Siddha <suresh.b.siddha@intel.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-22 08:55:47 -08:00
Adrian Bunk
99eea6a105 [PATCH] make kernel/printk.c:ignore_loglevel_setup() static
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-22 08:55:47 -08:00
Randy Dunlap
af9997e426 [PATCH] fix kernel-doc warnings in 2.6.20-rc1
Fix kernel-doc warnings in 2.6.20-rc1.

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-22 08:55:47 -08:00
Mark Fasheh
ba0084048a [PATCH] Conditionally check expected_preempt_count in __resched_legal()
Commit 2d7d253548 ("fix cond_resched() fix")
introduced an 'expected_preempt_count' parameter to __resched_legal() to
fix a bug where it was returning a false negative when called from
cond_resched_lock() and preemption was enabled.

Unfortunately this broke things for when preemption is disabled.
preempt_count() will always return zero, thus failing the check against any
value of expected_preempt_count not equal to zero.  cond_resched_lock() for
example, passes an expected_preempt_count value of 1.

So fix the fix for the cond_resched() fix by skipping the check of
preempt_count() against expected_preempt_count when preemption is disabled.

Credit should go to Sunil Mushran for spotting the bug during testing.

Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-22 08:55:46 -08:00
Ingo Molnar
9bfb18392e [PATCH] workqueue: fix schedule_on_each_cpu()
fix the schedule_on_each_cpu() implementation: __queue_work() is now
stricter, hence set the work-pending bit before passing in the new work.

(found in the -rt tree, using Peter Zijlstra's files-lock scalability
patchset)

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-21 00:20:01 -08:00
Peter Williams
bc947631d1 [PATCH] sched: improve efficiency of sched_fork()
Problem:
  sched_fork() has always called scheduler_tick() in some (unlikely)
  circumstances in order to update the current task in light of those
  circumstances.  It has always been the case that the work done by
  scheduler_tick() was more than was required to handle the problem in
  hand but no harm was done except for the waste of a few CPU cycles.

  However, the splitting of scheduler_tick() into two procedures in
  2.6.20-rc1 enables the wasted cycles to be saved as the new procedure
  task_running_tick() does all the work that is required to rectify the
  problem being handled.

Solution:
  Replace the call to scheduler_tick() in sched_fork() with a call to
  task_running_tick().

Signed-off-by: Peter Williams <pwil3058@bigpond.com.au>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-21 00:11:51 -08:00
Geert Uytterhoeven
b039db8eea [PATCH] __set_irq_handler bogus space
__set_irq_handler: Kill a bogus space

Signed-off-by: Geert Uytterhoeven <Geert.Uytterhoeven@sonycom.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-21 00:08:27 -08:00
Len Brown
9774f33841 merge linus into test branch 2006-12-20 02:53:13 -05:00
Linus Torvalds
a08727bae7 Make workqueue bit operations work on "atomic_long_t"
On architectures where the atomicity of the bit operations is handled by
external means (ie a separate spinlock to protect concurrent accesses),
just doing a direct assignment on the workqueue data field (as done by
commit 4594bf159f) can cause the
assignment to be lost due to lack of serialization with the bitops on
the same word.

So we need to serialize the assignment with the locks on those
architectures (notably older ARM chips, PA-RISC and sparc32).

So rather than using an "unsigned long", let's use "atomic_long_t",
which already has a safe assignment operation (atomic_long_set()) on
such architectures.

This requires that the atomic operations use the same atomicity locks as
the bit operations do, but that is largely the case anyway.  Sparc32
will probably need fixing.

Architectures (including modern ARM with LL/SC) that implement sane
atomic operations for SMP won't see any of this matter.

Cc: Russell King <rmk+lkml@arm.linux.org.uk>
Cc: David Howells <dhowells@redhat.com>
Cc: David Miller <davem@davemloft.com>
Cc: Matthew Wilcox <matthew@wil.cx>
Cc: Linux Arch Maintainers <linux-arch@vger.kernel.org>
Cc: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-16 09:53:50 -08:00
Len Brown
cfee47f99b Pull bugfix into test branch
Conflicts:

	kernel/power/disk.c
2006-12-16 01:01:18 -05:00
Linus Torvalds
d1526e2cda Remove stack unwinder for now
It has caused more problems than it ever really solved, and is
apparently not getting cleaned up and fixed.  We can put it back when
it's stable and isn't likely to make warning or bug events worse.

In the meantime, enable frame pointers for more readable stack traces.

Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-15 08:47:51 -08:00
David Brownell
f89bce3d9a Driver core: deprecate PM_LEGACY, default it to N
Deprecate the old "legacy" PM API, and more importantly default it to "n".
Virtually nothing in-tree uses it any more.

Signed-off-by: David Brownell <dbrownell@users.sourceforge.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2006-12-13 15:38:46 -08:00
Kay Sievers
1f71740ab9 Driver core: show "initstate" of module
Show the initialization state(live, coming, going) of the module:
  $ cat /sys/module/usbcore/initstate
  live

Signed-off-by: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2006-12-13 15:38:45 -08:00
Ralf Baechle
ec8c0446b6 [PATCH] Optimize D-cache alias handling on fork
Virtually index, physically tagged cache architectures can get away
without cache flushing when forking.  This patch adds a new cache
flushing function flush_cache_dup_mm(struct mm_struct *) which for the
moment I've implemented to do the same thing on all architectures
except on MIPS where it's a no-op.

Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-13 09:27:08 -08:00
Eric Dumazet
cd7175edf9 [PATCH] Optimize calc_load()
calc_load() is called by timer interrupt to update avenrun[].  It currently
calls nr_active() at each timer tick (HZ per second), while the update of
avenrun[] is done only once every 5 seconds.  (LOAD_FREQ=5 Hz)

nr_active() is quite expensive on SMP machines, since it has to sum up
nr_running and nr_uninterruptible of all online CPUS, bringing foreign
dirty cache lines.

This patch is an optimization of calc_load() so that nr_active() is called
only if we need it.

The use of unlikely() is welcome since the condition is true only once every
5*HZ time.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Cc: Ingo Molnar <mingo@elte.hu>
Acked-by: "Siddha, Suresh B" <suresh.b.siddha@intel.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-13 09:05:55 -08:00
Robert P. J. Day
cd86128088 [PATCH] Fix numerous kcalloc() calls, convert to kzalloc()
All kcalloc() calls of the form "kcalloc(1,...)" are converted to the
equivalent kzalloc() calls, and a few kcalloc() calls with the incorrect
ordering of the first two arguments are fixed.

Signed-off-by: Robert P. J. Day <rpjday@mindspring.com>
Cc: Jeff Garzik <jeff@garzik.org>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: Dominik Brodowski <linux@dominikbrodowski.net>
Cc: Adam Belay <ambx1@neo.rr.com>
Cc: James Bottomley <James.Bottomley@steeleye.com>
Cc: Greg KH <greg@kroah.com>
Cc: Mark Fasheh <mark.fasheh@oracle.com>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-13 09:05:52 -08:00
Ingo Molnar
74c383f140 [PATCH] lockdep: fix possible races while disabling lock-debugging
Jarek Poplawski noticed that lockdep global state could be accessed in a
racy way if one CPU did a lockdep assert (shutting lockdep down), while the
other CPU would try to do something that changes its global state.

This patch fixes those races and cleans up lockdep's internal locking by
adding a graph_lock()/graph_unlock()/debug_locks_off_graph_unlock helpers.

(Also note that as we all know the Linux kernel is, by definition, bug-free
and perfect, so this code never triggers, so these fixes are highly
theoretical.  I wrote this patch for aesthetic reasons alone.)

[akpm@osdl.org: build fix]
[jarkao2@o2.pl: build fix's refix]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Jarek Poplawski <jarkao2@o2.pl>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-13 09:05:50 -08:00
Ingo Molnar
3117df0453 [PATCH] lockdep: print irq-trace info on asserts
When we print an assert due to scheduling-in-atomic bugs, and if lockdep
is enabled, then the IRQ tracing information of lockdep can be printed
to pinpoint the code location that disabled interrupts. This saved me
quite a bit of debugging time in cases where the backtrace did not
identify the irq-disabling site well enough.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-13 09:05:50 -08:00
Ingo Molnar
27c3b23226 [PATCH] lockdep: use chain hash on CONFIG_DEBUG_LOCKDEP too
CONFIG_DEBUG_LOCKDEP is unacceptably slow because it does not utilize
the chain-hash. Turn the chain-hash back on in this case too.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-13 09:05:50 -08:00
Ingo Molnar
33e94e960b [PATCH] lockdep: clean up VERY_VERBOSE define
Cleanup: the VERY_VERBOSE define was unnecessarily dependent on #ifdef VERBOSE
- while the VERBOSE switch is 0 or 1 (always defined).

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-13 09:05:50 -08:00
Ingo Molnar
23d95a03d6 [PATCH] lockdep: improve lockdep_reset()
Clear all the chains during lockdep_reset().  This fixes some locking-selftest
false positives i saw on -rt.  (never saw those on mainline though, but it
could happen.)

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-13 09:05:50 -08:00
Ingo Molnar
81fc685a89 [PATCH] lockdep: improve verbose messages
Make verbose lockdep messages (off by default) more informative by printing
out the hash chain key.  (this patch was what helped me catch the earlier
lockdep hash-collision bug)

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-13 09:05:50 -08:00
Ingo Molnar
a664089741 [PATCH] lockdep: filter off by default
Fix typo in the class_filter() function.  (filtering is not used by default so
this only affects lockdep-internal debugging cases)

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-13 09:05:50 -08:00
Ingo Molnar
5d6f647fc6 [PATCH] debug: add sysrq_always_enabled boot option
Most distributions enable sysrq support but set it to 0 by default.  Add a
sysrq_always_enabled boot option to always-enable sysrq keys.  Useful for
debugging - without having to modify the disribution's config files (which
might not be possible if the kernel is on a live CD, etc.).

Also, while at it, clean up the sysrq interfaces.

[bunk@stusta.de: make sysrq_always_enabled_setup() static]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-13 09:05:50 -08:00
Rafael J. Wysocki
8a102eed9c [PATCH] PM: Fix SMP races in the freezer
Currently, to tell a task that it should go to the refrigerator, we set the
PF_FREEZE flag for it and send a fake signal to it.  Unfortunately there
are two SMP-related problems with this approach.  First, a task running on
another CPU may be updating its flags while the freezer attempts to set
PF_FREEZE for it and this may leave the task's flags in an inconsistent
state.  Second, there is a potential race between freeze_process() and
refrigerator() in which freeze_process() running on one CPU is reading a
task's PF_FREEZE flag while refrigerator() running on another CPU has just
set PF_FROZEN for the same task and attempts to reset PF_FREEZE for it.  If
the refrigerator wins the race, freeze_process() will state that PF_FREEZE
hasn't been set for the task and will set it unnecessarily, so the task
will go to the refrigerator once again after it's been thawed.

To solve first of these problems we need to stop using PF_FREEZE to tell
tasks that they should go to the refrigerator.  Instead, we can introduce a
special TIF_*** flag and use it for this purpose, since it is allowed to
change the other tasks' TIF_*** flags and there are special calls for it.

To avoid the freeze_process()-refrigerator() race we can make
freeze_process() to always check the task's PF_FROZEN flag after it's read
its "freeze" flag.  We should also make sure that refrigerator() will
always reset the task's "freeze" flag after it's set PF_FROZEN for it.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: David Howells <dhowells@redhat.com>
Cc: Andi Kleen <ak@muc.de>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Paul Mundt <lethal@linux-sh.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-13 09:05:49 -08:00
Rafael J. Wysocki
3df494a32b [PATCH] PM: Fix freezing of stopped tasks
Currently, if a task is stopped (ie.  it's in the TASK_STOPPED state), it
is considered by the freezer as unfreezeable.  However, there may be a race
between the freezer and the delivery of the continuation signal to the task
resulting in the task running after we have finished freezing the other
tasks.  This, in turn, may lead to undesirable effects up to and including
data corruption.

To prevent this from happening we first need to make the freezer consider
stopped tasks as freezeable.  For this purpose we need to make freezeable()
stop returning 0 for these tasks and we need to force them to enter the
refrigerator.  However, if there's no continuation signal in the meantime,
the stopped tasks should remain stopped after all processes have been
thawed, so we need to send an additional SIGSTOP to each of them before
waking it up.

Also, a stopped task that has just been woken up should first check if
there's a freezing request for it and go to the refrigerator if that's the
case.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-13 09:05:49 -08:00
Paul Jackson
02a0e53d82 [PATCH] cpuset: rework cpuset_zone_allowed api
Elaborate the API for calling cpuset_zone_allowed(), so that users have to
explicitly choose between the two variants:

  cpuset_zone_allowed_hardwall()
  cpuset_zone_allowed_softwall()

Until now, whether or not you got the hardwall flavor depended solely on
whether or not you or'd in the __GFP_HARDWALL gfp flag to the gfp_mask
argument.

If you didn't specify __GFP_HARDWALL, you implicitly got the softwall
version.

Unfortunately, this meant that users would end up with the softwall version
without thinking about it.  Since only the softwall version might sleep,
this led to bugs with possible sleeping in interrupt context on more than
one occassion.

The hardwall version requires that the current tasks mems_allowed allows
the node of the specified zone (or that you're in interrupt or that
__GFP_THISNODE is set or that you're on a one cpuset system.)

The softwall version, depending on the gfp_mask, might allow a node if it
was allowed in the nearest enclusing cpuset marked mem_exclusive (which
requires taking the cpuset lock 'callback_mutex' to evaluate.)

This patch removes the cpuset_zone_allowed() call, and forces the caller to
explicitly choose between the hardwall and the softwall case.

If the caller wants the gfp_mask to determine this choice, they should (1)
be sure they can sleep or that __GFP_HARDWALL is set, and (2) invoke the
cpuset_zone_allowed_softwall() routine.

This adds another 100 or 200 bytes to the kernel text space, due to the few
lines of nearly duplicate code at the top of both cpuset_zone_allowed_*
routines.  It should save a few instructions executed for the calls that
turned into calls of cpuset_zone_allowed_hardwall, thanks to not having to
set (before the call) then check (within the call) the __GFP_HARDWALL flag.

For the most critical call, from get_page_from_freelist(), the same
instructions are executed as before -- the old cpuset_zone_allowed()
routine it used to call is the same code as the
cpuset_zone_allowed_softwall() routine that it calls now.

Not a perfect win, but seems worth it, to reduce this chance of hitting a
sleeping with irq off complaint again.

Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-13 09:05:49 -08:00
Eric W. Biederman
5f8442edfb [PATCH] Revert "[PATCH] identifier to nsproxy"
This reverts commit 373beb35cd.

No one is using this identifier yet.  The purpose of this identifier is to
export nsproxy to user space which is wrong.  nsproxy is an internal
implementation optimization, which should keep our fork times from getting
slower as we increase the number of global namespaces you don't have to
share.

Adding a global identifier like this is inappropriate because it makes
namespaces inherently non-recursive, greatly limiting what we can do with
them in the future.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-13 09:05:47 -08:00
Daniel Walker
f5f1a24a2c [PATCH] clocksource: small cleanup
Mostly changing alignment.  Just some general cleanup.

[akpm@osdl.org: build fix]
Signed-off-by: Daniel Walker <dwalker@mvista.com>
Acked-by: John Stultz <johnstul@us.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 09:57:22 -08:00
Daniel Walker
2b0137001d [PATCH] clocksource: add usage of CONFIG_SYSFS
Simply adds some ifdefs to remove clocksoure sysfs code when CONFIG_SYSFS
isn't turn on.

Signed-off-by: Daniel Walker <dwalker@mvista.com>
Acked-by: John Stultz <johnstul@us.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 09:57:22 -08:00
Arjan van de Ven
4c36a5dec2 [PATCH] round_jiffies infrastructure
Introduce a round_jiffies() function as well as a round_jiffies_relative()
function.  These functions round a jiffies value to the next whole second.
The primary purpose of this rounding is to cause all "we don't care exactly
when" timers to happen at the same jiffy.

This avoids multiple timers firing within the second for no real reason;
with dynamic ticks these extra timers cause wakeups from deep sleep CPU
sleep states and thus waste power.

The exact wakeup moment is skewed by the cpu number, to avoid all cpus from
waking up at the exact same time (and hitting the same lock/cachelines
there)

[akpm@osdl.org: fix variable type]
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 09:57:22 -08:00
Vadim Lobanov
4fd45812cb [PATCH] fdtable: Remove the free_files field
An fdtable can either be embedded inside a files_struct or standalone (after
being expanded).  When an fdtable is being discarded after all RCU references
to it have expired, we must either free it directly, in the standalone case,
or free the files_struct it is contained within, in the embedded case.

Currently the free_files field controls this behavior, but we can get rid of
it entirely, as all the necessary information is already recorded.  We can
distinguish embedded and standalone fdtables using max_fds, and if it is
embedded we can divine the relevant files_struct using container_of().

Signed-off-by: Vadim Lobanov <vlobanov@speakeasy.net>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Dipankar Sarma <dipankar@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 09:57:22 -08:00
Vadim Lobanov
bbea9f6966 [PATCH] fdtable: Make fdarray and fdsets equal in size
Currently, each fdtable supports three dynamically-sized arrays of data: the
fdarray and two fdsets.  The code allows the number of fds supported by the
fdarray (fdtable->max_fds) to differ from the number of fds supported by each
of the fdsets (fdtable->max_fdset).

In practice, it is wasteful for these two sizes to differ: whenever we hit a
limit on the smaller-capacity structure, we will reallocate the entire fdtable
and all the dynamic arrays within it, so any delta in the memory used by the
larger-capacity structure will never be touched at all.

Rather than hogging this excess, we shouldn't even allocate it in the first
place, and keep the capacities of the fdarray and the fdsets equal.  This
patch removes fdtable->max_fdset.  As an added bonus, most of the supporting
code becomes simpler.

Signed-off-by: Vadim Lobanov <vlobanov@speakeasy.net>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Dipankar Sarma <dipankar@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 09:57:22 -08:00
Vadim Lobanov
f3d19c90fb [PATCH] fdtable: Delete pointless code in dup_fd()
The dup_fd() function creates a new files_struct and fdtable embedded inside
that files_struct, and then possibly expands the fdtable using expand_files().

The out_release error path is invoked when expand_files() returns an error
code.  However, when this attempt to expand fails, the fdtable is left in its
original embedded form, so it is pointless to try to free the associated
fdarray and fdsets.

Signed-off-by: Vadim Lobanov <vlobanov@speakeasy.net>
Cc: Dipankar Sarma <dipankar@in.ibm.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 09:57:21 -08:00
Miguel Ojeda Sandonis
33859f7f97 [PATCH] kernel/sched.c: whitespace cleanups
[akpm@osdl.org: additional cleanups]
Signed-off-by: Miguel Ojeda Sandonis <maxextreme@gmail.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 09:57:20 -08:00
Chen, Kenneth W
62ab616d54 [PATCH] sched: optimize activate_task for RT task
RT task does not participate in interactiveness priority and thus shouldn't
be bothered with timestamp and p->sleep_type manipulation when task is
being put on run queue.  Bypass all of the them with a single if (rt_task)
test.

Signed-off-by: Ken Chen <kenneth.w.chen@intel.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 09:55:43 -08:00
Chen, Kenneth W
06066714f6 [PATCH] sched: remove lb_stopbalance counter
Remove scheduler stats lb_stopbalance counter.  This counter can be
calculated by: lb_balanced - lb_nobusyg - lb_nobusyq.  There is no need to
create gazillion counters while we can derive the value.

Signed-off-by: Ken Chen <kenneth.w.chen@intel.com>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 09:55:43 -08:00
Siddha, Suresh B
783609c6cb [PATCH] sched: decrease number of load balances
Currently at a particular domain, each cpu in the sched group will do a
load balance at the frequency of balance_interval.  More the cores and
threads, more the cpus will be in each sched group at SMP and NUMA domain.
And we endup spending quite a bit of time doing load balancing in those
domains.

Fix this by making only one cpu(first idle cpu or first cpu in the group if
all the cpus are busy) in the sched group do the load balance at that
particular sched domain and this load will slowly percolate down to the
other cpus with in that group(when they do load balancing at lower
domains).

Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Christoph Lameter <clameter@engr.sgi.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 09:55:43 -08:00
Mike Galbraith
b18ec80396 [PATCH] sched: improve migration accuracy
Co-opt rq->timestamp_last_tick to maintain a cache_hot_time evaluation
reference timestamp at both tick and sched times to prevent said reference,
formerly rq->timestamp_last_tick, from being behind task->last_ran at
evaluation time, and to move said reference closer to current time on the
remote processor, intent being to improve cache hot evaluation and
timestamp adjustment accuracy for task migration.

Fix minor sched_time double accounting error which occurs when a task
passing through schedule() does not schedule off, and takes the next timer
tick.

[kenneth.w.chen@intel.com: cleanup]
Signed-off-by: Mike Galbraith <efault@gmx.de>
Acked-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Ken Chen <kenneth.w.chen@intel.com>
Cc: Don Mullis <dwm@meer.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 09:55:43 -08:00
Christoph Lameter
08c183f31b [PATCH] sched: add option to serialize load balancing
Large sched domains can be very expensive to scan.  Add an option SD_SERIALIZE
to the sched domain flags.  If that flag is set then we make sure that no
other such domain is being balanced.

[akpm@osdl.org: build fix]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Peter Williams <pwil3058@bigpond.net.au>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com>
Cc: "Chen, Kenneth W" <kenneth.w.chen@intel.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 09:55:43 -08:00
Christoph Lameter
1bd77f2da5 [PATCH] sched: call tasklet less frequently
Trigger softirq less frequently

We trigger the softirq before this patch using offset of sd->interval.
However, if the queue is busy then it is sufficient to schedule the softirq
with sd->interval * busy_factor.

So we modify the calculation of the next time to balance by taking
the interval added to last_balance again. This is only the
right value if the idle/busy situation continues as is.

There are two potential trouble spots:
- If the queue was idle and now gets busy then we call rebalance
  early. However, that is not a problem because we will then use
  the longer interval for the next period.

- If the queue was busy and becomes idle then we potentially
  wait too long before rebalancing. However, when the task
  goes idle then idle_balance is called. We add another calculation
  of the next balance time based on sd->interval in idle_balance
  so that we will rebalance soon.

V2->V3:
- Calculate rebalance time based on current jiffies and not
  based on the jiffies at the last time we load balanced.
  We no longer rely on staggering and therefore we can
  affort to do this now.

V3->V4:
- Use functions to do jiffy comparisons.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Peter Williams <pwil3058@bigpond.net.au>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com>
Cc: "Chen, Kenneth W" <kenneth.w.chen@intel.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 09:55:43 -08:00
Christoph Lameter
c9819f4593 [PATCH] sched: use softirq for load balancing
Call rebalance_tick (renamed to run_rebalance_domains) from a newly introduced
softirq.

We calculate the earliest time for each layer of sched domains to be rescanned
(this is the rescan time for idle) and use the earliest of those to schedule
the softirq via a new field "next_balance" added to struct rq.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Peter Williams <pwil3058@bigpond.net.au>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com>
Cc: "Chen, Kenneth W" <kenneth.w.chen@intel.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 09:55:42 -08:00
Christoph Lameter
e418e1c2bf [PATCH] sched: move idle status calculation into rebalance_tick()
Perform the idle state determination in rebalance_tick.

If we separate balancing from sched_tick then we also need to determine the
idle state in rebalance_tick.

V2->V3
	Remove useless idlle != 0 check. Checking nr_running seems
	to be sufficient. Thanks Suresh.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Peter Williams <pwil3058@bigpond.net.au>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com>
Cc: "Chen, Kenneth W" <kenneth.w.chen@intel.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 09:55:42 -08:00
Christoph Lameter
7835b98bc6 [PATCH] sched: extract load calculation from rebalance_tick
A load calculation is always done in rebalance_tick() in addition to the real
load balancing activities that only take place when certain jiffie counts have
been reached.  Move that processing into a separate function and call it
directly from scheduler_tick().

Also extract the time slice handling from scheduler_tick and put it into a
separate function.  Then we can clean up scheduler_tick significantly.  It
will no longer have any gotos.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Peter Williams <pwil3058@bigpond.net.au>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com>
Cc: "Chen, Kenneth W" <kenneth.w.chen@intel.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 09:55:42 -08:00
Christoph Lameter
fe2eea3faf [PATCH] sched: disable interrupts for locking in load_balance()
Interrupts must be disabled for request queue locks if we want to run
load_balance() with interrupts enabled.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Peter Williams <pwil3058@bigpond.net.au>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com>
Cc: "Chen, Kenneth W" <kenneth.w.chen@intel.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 09:55:42 -08:00
Christoph Lameter
4211a9a2e9 [PATCH] sched: remove staggering of load balancing
Timer interrupts already are staggered.  We do not need an additional layer of
time staggering for short load balancing actions that take a reasonably small
portion of the time slice.

For load balancing on large sched_domains we will add a serialization later
that avoids concurrent load balance operations and thus has the same effect as
load staggering.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Peter Williams <pwil3058@bigpond.net.au>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com>
Cc: "Chen, Kenneth W" <kenneth.w.chen@intel.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 09:55:42 -08:00
Christoph Lameter
571f6d2fb0 [PATCH] sched: avoid taking rq lock in wake_priority_sleeper
Avoid taking the request queue lock in wake_priority_sleeper if there are no
running processes.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Peter Williams <pwil3058@bigpond.net.au>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com>
Cc: "Chen, Kenneth W" <kenneth.w.chen@intel.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 09:55:42 -08:00
Kirill Korotaev
054b9108e0 [PATCH] move_task_off_dead_cpu() should be called with disabled ints
move_task_off_dead_cpu() requires interrupts to be disabled, while
migrate_dead() calls it with enabled interrupts.  Added appropriate
comments to functions and added BUG_ON(!irqs_disabled()) into
double_rq_lock() and double_lock_balance() which are the origin sources of
such bugs.

Signed-off-by: Kirill Korotaev <dev@openvz.org>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 09:55:42 -08:00
Siddha, Suresh B
6711cab43e [PATCH] ched domain: move sched group allocations to percpu area
Move the sched group allocations to percpu area.  This will minimize cross
node memory references and also cleans up the sched groups allocation for
allnodes sched domain.

Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 09:55:42 -08:00
Robert P. J. Day
cc2a73b5ca [PATCH] sched.c: correct comment for this_rq_lock()
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Robert P. J. Day <rpjday@mindspring.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 09:55:42 -08:00
Andrew Morton
4a7864ca63 [PATCH] io-accounting: via taskstats
Deliver IO accounting via taskstats.

Cc: Jay Lan <jlan@sgi.com>
Cc: Shailabh Nagar <nagar@watson.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Chris Sturtivant <csturtiv@sgi.com>
Cc: Tony Ernst <tee@sgi.com>
Cc: Guillaume Thouvenin <guillaume.thouvenin@bull.net>
Cc: David Wright <daw@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 09:55:41 -08:00
Andrew Morton
7c3ab7381e [PATCH] io-accounting: core statistics
The present per-task IO accounting isn't very useful.  It simply counts the
number of bytes passed into read() and write().  So if a process reads 1MB
from an already-cached file, it is accused of having performed 1MB of I/O,
which is wrong.

(David Wright had some comments on the applicability of the present logical IO accounting:

  For billing purposes it is useless but for workload analysis it is very
  useful

  read_bytes/read_calls  average read request size
  write_bytes/write_calls average write request size

  read_bytes/read_blocks ie logical/physical can indicate hit rate or thrashing
  write_bytes/write_blocks  ie logical/physical  guess since pdflush writes can
                                                be missed

  I often look for logical larger than physical to see filesystem cache
  problems.  And the bytes/cpusec can help find applications that are
  dominating the cache and causing slow interactive response from page cache
  contention.

  I want to find the IO intensive applications and make sure they are doing
  efficient IO.  Thus the acctcms(sysV) or csacms command would give the high
  IO commands).

This patchset adds new accounting which tries to be more accurate.  We account
for three things:

reads:

  attempt to count the number of bytes which this process really did cause
  to be fetched from the storage layer.  Done at the submit_bio() level, so it
  is accurate for block-backed filesystems.  I also attempt to wire up NFS and
  CIFS.

writes:

  attempt to count the number of bytes which this process caused to be sent
  to the storage layer.  This is done at page-dirtying time.

  The big inaccuracy here is truncate.  If a process writes 1MB to a file
  and then deletes the file, it will in fact perform no writeout.  But it will
  have been accounted as having caused 1MB of write.

  So...

cancelled_writes:

  account the number of bytes which this process caused to not happen, by
  truncating pagecache.

  We _could_ just subtract this from the process's `write' accounting.  But
  that means that some processes would be reported to have done negative
  amounts of write IO, which is silly.

  So we just report the raw number and punt this decision up to userspace.

Now, we _could_ account for writes at the physical I/O level.  But

- This would require that we track memory-dirtying tasks at the per-page
  level (would require a new pointer in struct page).

- It would mean that IO statistics for a process are usually only available
  long after that process has exitted.  Which means that we probably cannot
  communicate this info via taskstats.

This patch:

Wire up the kernel-private data structures and the accessor functions to
manipulate them.

Cc: Jay Lan <jlan@sgi.com>
Cc: Shailabh Nagar <nagar@watson.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Chris Sturtivant <csturtiv@sgi.com>
Cc: Tony Ernst <tee@sgi.com>
Cc: Guillaume Thouvenin <guillaume.thouvenin@bull.net>
Cc: David Wright <daw@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 09:55:41 -08:00
Alexey Dobriyan
1f29bcd739 [PATCH] sysctl: remove unused "context" param
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Andi Kleen <ak@suse.de>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: David Howells <dhowells@redhat.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 09:55:41 -08:00
Alexey Dobriyan
98d7340c36 [PATCH] sysctl: remove some OPs
kernel.cap-bound uses only OP_SET and OP_AND

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Chris Wright <chrisw@sous-sol.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 09:55:40 -08:00
Randy Dunlap
d53ef07ab4 [PATCH] ipc-procfs-sysctl mixups
When CONFIG_PROC_FS=n and CONFIG_PROC_SYSCTL=n but CONFIG_SYSVIPC=y, we get
this build error:

kernel/built-in.o:(.data+0xc38): undefined reference to `proc_ipc_doulongvec_minmax'
kernel/built-in.o:(.data+0xc88): undefined reference to `proc_ipc_doulongvec_minmax'
kernel/built-in.o:(.data+0xcd8): undefined reference to `proc_ipc_dointvec'
kernel/built-in.o:(.data+0xd28): undefined reference to `proc_ipc_dointvec'
kernel/built-in.o:(.data+0xd78): undefined reference to `proc_ipc_dointvec'
kernel/built-in.o:(.data+0xdc8): undefined reference to `proc_ipc_dointvec'
kernel/built-in.o:(.data+0xe18): undefined reference to `proc_ipc_dointvec'
make: *** [vmlinux] Error 1

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Acked-by: Eric Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10 09:55:39 -08:00
David Howells
4594bf159f [PATCH] WorkStruct: Use direct assignment rather than cmpxchg()
Use direct assignment rather than cmpxchg() as the latter is unavailable
and unimplementable on some platforms and is actually unnecessary.

The use of cmpxchg() was to guard against two possibilities, neither of
which can actually occur:

 (1) The pending flag may have been unset or may be cleared.  However, given
     where it's called, the pending flag is _always_ set.  I don't think it
     can be unset whilst we're in set_wq_data().

     Once the work is enqueued to be actually run, the only way off the queue
     is for it to be actually run.

     If it's a delayed work item, then the bit can't be cleared by the timer
     because we haven't started the timer yet.  Also, the pending bit can't be
     cleared by cancelling the delayed work _until_ the work item has had its
     timer started.

 (2) The workqueue pointer might change.  This can only happen in two cases:

     (a) The work item has just been queued to actually run, and so we're
         protected by the appropriate workqueue spinlock.

     (b) A delayed work item is being queued, and so the timer hasn't been
     	 started yet, and so no one else knows about the work item or can
     	 access it (the pending bit protects us).

     Besides, set_wq_data() _sets_ the workqueue pointer unconditionally, so
     it can be assigned instead.

So, replacing the set_wq_data() with a straight assignment would be okay
in most cases.

The problem is where we end up tangling with test_and_set_bit() emulated
using spinlocks, and even then it's not a problem _provided_
test_and_set_bit() doesn't attempt to modify the word if the bit was
set.

If that's a problem, then a bitops-proofed assignment will be required -
equivalent to atomic_set() vs other atomic_xxx() ops.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-09 12:25:08 -08:00
Eric W. Biederman
6b49a25785 [PATCH] sysctl: fix sys_sysctl interface of ipc sysctls
Currently there is a regression and the ipc sysctls don't show up in the
binary sysctl namespace.

This patch adds sysctl_ipc_data to read data/write from the appropriate
namespace and deliver it in the expected manner.

[akpm@osdl.org: warning fix]
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-08 08:29:03 -08:00
Eric W. Biederman
9bc9a6bd3c [PATCH] sysctl: simplify ipc ns specific sysctls
Refactor the ipc sysctl support so that it is simpler, more readable, and
prepares for fixing the bug with the wrong values being returned in the
sys_sysctl interface.

The function proc_do_ipc_string() was misnamed as it never handled strings.
It's magic of when to work with strings and when to work with longs belonged
in the sysctl table.  I couldn't tell if the code would work if you disabled
the ipc namespace but it certainly looked like it would have problems.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-08 08:29:03 -08:00
Eric W. Biederman
c4b8b769fa [PATCH] sysctl: implement sysctl_uts_string()
The problem: When using sys_sysctl we don't read the proper values for the
variables exported from the uts namespace, nor do we do the proper locking.

This patch introduces sysctl_uts_string which properly fetches the values and
does the proper locking.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-08 08:29:03 -08:00
Eric W. Biederman
cf9f151c72 [PATCH] sysctl: simplify sysctl_uts_string
The binary interface to the namespace sysctls was never implemented resulting
in some really weird things if you attempted to use sys_sysctl to read your
hostname for example.

This patch series simples the code a little and implements the binary sysctl
interface.

In testing this patch series I discovered that our 32bit compatibility for the
binary sysctl interface is imperfect.  In particular KERN_SHMMAX and
KERN_SMMALL are size_t sized quantities and are returned as 8 bytes on to
32bit binaries using a x86_64 kernel.  However this has existing for a long
time so it is not a new regression with the namespace work.

Gads the whole sysctl thing needs work before it stops being easy to shoot
yourself in the foot.

Looking forward a little bit we need a better way to handle sysctls and
namespaces as our current technique will not work for the network namespace.
I think something based on the current overlapping sysctl trees will work but
the proc side needs to be redone before we can use it.

This patch:

Introduce get_uts() and put_uts() (used later) and remove most of the special
cases for when UTS namespace is compiled in.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-08 08:29:03 -08:00
Oleg Nesterov
62dfb5541a [PATCH] session_of_pgrp: kill unnecessary do_each_task_pid(PIDTYPE_PGID)
All members of the process group have the same sid and it can't be == 0.

NOTE: this code (and a similar one in sys_setpgid) was needed because it
was possibe to have ->session == 0. It's not possible any longer since

	[PATCH] pidhash: don't use zero pids
	Commit: c7c6464117

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-08 08:28:52 -08:00
Oleg Nesterov
f020bc468f [PATCH] sys_setpgid: eliminate unnecessary do_each_task_pid(PIDTYPE_PGID)
All tasks in the process group have the same sid, we don't need to iterate
them all to check that the caller of sys_setpgid() doesn't change its
session.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-08 08:28:52 -08:00
Sukadev Bhattiprolu
84d737866e [PATCH] add child reaper to pid_namespace
Add a per pid_namespace child-reaper.  This is needed so processes are reaped
within the same pid space and do not spill over to the parent pid space.  Its
also needed so containers preserve existing semantic that pid == 1 would reap
orphaned children.

This is based on Eric Biederman's patch: http://lkml.org/lkml/2006/2/6/285

Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-08 08:28:52 -08:00
Cedric Le Goater
6cc1b22a4a [PATCH] use current->nsproxy->pid_ns
Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-08 08:28:52 -08:00
Cedric Le Goater
9a575a92db [PATCH] to nsproxy
Add the pid namespace framework to the nsproxy object.  The copy of the pid
namespace only increases the refcount on the global pid namespace,
init_pid_ns, and unshare is not implemented.

There is no configuration option to activate or deactivate this feature
because this not relevant for the moment.

Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-08 08:28:52 -08:00
Sukadev Bhattiprolu
61a58c6c23 [PATCH] rename struct pspace to struct pid_namespace
Rename struct pspace to struct pid_namespace for consistency with other
namespaces (uts_namespace and ipc_namespace).  Also rename
include/linux/pspace.h to include/linux/pid_namespace.h and variables from
pspace to pid_ns.

Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-08 08:28:52 -08:00
Cedric Le Goater
373beb35cd [PATCH] identifier to nsproxy
Add an identifier to nsproxy.  The default init_ns_proxy has identifier 0 and
allocated nsproxies are given -1.

This identifier will be used by a new syscall sys_bind_ns.

Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-08 08:28:52 -08:00
Kirill Korotaev
6b3286ed11 [PATCH] rename struct namespace to struct mnt_namespace
Rename 'struct namespace' to 'struct mnt_namespace' to avoid confusion with
other namespaces being developped for the containers : pid, uts, ipc, etc.
'namespace' variables and attributes are also renamed to 'mnt_ns'

Signed-off-by: Kirill Korotaev <dev@sw.ru>
Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-08 08:28:51 -08:00
Cedric Le Goater
1ec320afdc [PATCH] add process_session() helper routine: deprecate old field
Add an anonymous union and ((deprecated)) to catch direct usage of the
session field.

[akpm@osdl.org: fix various missed conversions]
[jdike@addtoit.com: fix UML bug]
Signed-off-by: Jeff Dike <jdike@addtoit.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-08 08:28:51 -08:00
Cedric Le Goater
937949d9ed [PATCH] add process_session() helper routine
Replace occurences of task->signal->session by a new process_session() helper
routine.

It will be useful for pid namespaces to abstract the session pid number.

Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-08 08:28:51 -08:00
Josef Sipek
a7a005fd12 [PATCH] struct path: convert kernel
Signed-off-by: Josef Sipek <jsipek@fsl.cs.sunysb.edu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-08 08:28:46 -08:00
Josef "Jeff" Sipek
f3a43f3f64 [PATCH] kernel: change uses of f_{dentry, vfsmnt} to use f_path
Change all the uses of f_{dentry,vfsmnt} to f_path.{dentry,mnt} in
linux/kernel/.

Signed-off-by: Josef "Jeff" Sipek <jsipek@cs.sunysb.edu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-08 08:28:42 -08:00
NeilBrown
d63a5a74de [PATCH] lockdep: avoid lockdep warning in md
md_open takes ->reconfig_mutex which causes lockdep to complain.  This
(normally) doesn't have deadlock potential as the possible conflict is with a
reconfig_mutex in a different device.

I say "normally" because if a loop were created in the array->member hierarchy
a deadlock could happen.  However that causes bigger problems than a deadlock
and should be fixed independently.

So we flag the lock in md_open as a nested lock.  This requires defining
mutex_lock_interruptible_nested.

Cc: Ingo Molnar <mingo@elte.hu>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-08 08:28:39 -08:00
Oleg Nesterov
dae3c5a0b7 [PATCH] sys_unshare: remove a broken CLONE_SIGHAND code
sys_unshare(CLONE_SIGHAND) is broken, the code under 'if (new_sigh)' is
never executed but very wrong. Just remove it to avoid a confusion,
task_lock() has nothing to do with ->sighand changing.

Also, change the comment in unshare_sighand(). Yes, CLONE_THREAD implies
CLONE_SIGHAND, but still it looks confusing. Also, we don't need to check
current->sighand != NULL.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-08 08:28:38 -08:00
Oleg Nesterov
ae424ae4b5 [PATCH] make set_special_pids() static
Make set_special_pids() static, the only caller is daemonize().

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-08 08:28:38 -08:00
Oleg Nesterov
7bcfa95e56 [PATCH] do_acct_process(): don't take tty_mutex
No need to take the global tty_mutex, signal->tty->driver can't go away while
we are holding ->siglock.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-08 08:28:38 -08:00
Peter Zijlstra
24ec839c43 [PATCH] tty: ->signal->tty locking
Fix the locking of signal->tty.

Use ->sighand->siglock to protect ->signal->tty; this lock is already used
by most other members of ->signal/->sighand.  And unless we are 'current'
or the tasklist_lock is held we need ->siglock to access ->signal anyway.

(NOTE: sys_unshare() is broken wrt ->sighand locking rules)

Note that tty_mutex is held over tty destruction, so while holding
tty_mutex any tty pointer remains valid.  Otherwise the lifetime of ttys
are governed by their open file handles.  This leaves some holes for tty
access from signal->tty (or any other non file related tty access).

It solves the tty SLAB scribbles we were seeing.

(NOTE: the change from group_send_sig_info to __group_send_sig_info needs to
       be examined by someone familiar with the security framework, I think
       it is safe given the SEND_SIG_PRIV from other __group_send_sig_info
       invocations)

[schwidefsky@de.ibm.com: 3270 fix]
[akpm@osdl.org: various post-viro fixes]
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Alan Cox <alan@redhat.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Chris Wright <chrisw@sous-sol.org>
Cc: Roland McGrath <roland@redhat.com>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Cc: James Morris <jmorris@namei.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Jan Kara <jack@ucw.cz>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-08 08:28:38 -08:00
Hidetoshi Seto
6e2ac66470 [PATCH] CPEI gets warning at kernel/irq/migration.c:27/move_masked_irq()
While running my MCA test (hardware error injection) on 2.6.19,
I got some warning like following:

> BUG: warning at kernel/irq/migration.c:27/move_masked_irq()
>
> Call Trace:
>  [<a000000100013d20>] show_stack+0x40/0xa0
>                                 sp=e00000006b2578d0 bsp=e00000006b2510b0
>  [<a000000100013db0>] dump_stack+0x30/0x60
>                                 sp=e00000006b257aa0 bsp=e00000006b251098
>  [<a0000001000de430>] move_masked_irq+0xb0/0x240
>                                 sp=e00000006b257aa0 bsp=e00000006b251070
>  [<a0000001000de6a0>] move_native_irq+0xe0/0x180
>                                 sp=e00000006b257aa0 bsp=e00000006b251040
>  [<a00000010004ff50>] iosapic_end_level_irq+0x30/0xe0
>                                 sp=e00000006b257aa0 bsp=e00000006b251020
>  [<a0000001000d94d0>] __do_IRQ+0x170/0x400
>                                 sp=e00000006b257aa0 bsp=e00000006b250fd8
>  [<a0000001000116f0>] ia64_handle_irq+0x1b0/0x260
>                                 sp=e00000006b257aa0 bsp=e00000006b250fa8
>  [<a00000010000c3a0>] ia64_leave_kernel+0x0/0x280
>                                 sp=e00000006b257aa0 bsp=e00000006b250fa8
>  [<a000000100690cf0>] _spin_unlock_irqrestore+0x30/0x60
>                                 sp=e00000006b257c70 bsp=e00000006b250f90

It comes from:

[kernel/irq/migration.c]
  26         if (CHECK_IRQ_PER_CPU(desc->status)) {
  27                 WARN_ON(1);
  28                 return;
  29         }

By putting some printk in kernel, I found that irqbalance is trying to
move CPEI which is handled as PER_CPU irq. That's why.

CPEI(Corrected Platform Error Interrupt) is ia64 specific irq, is
allowed to pin to particular processor which selected by the platform, and
even it is PER_CPU but it has set_affinity handler (=iosapic_set_affinity)
as same as other IO-SAPIC-level interrupts. (I don't know why, but
I guess that there would be typical situation where the handler for
migration is needed, such as hotplug - the processor going to be
offline/hot-removed.)

To shut up this warning, there are 2 way at least:
 a) fix CPEI stuff
 b) prohibit setting affinity to PER_CPU irq

I'm not sure what stuff of CPEI need to be fixed, but I think that
returning error to attempting move PER_CPU irq is useful for all
applications since it will never work.

Following small patch takes b) style.
It works, the warning disappeared and irqbalance still runs well.

Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Cc: Arjan van de Ven <arjan@infradead.org>
Acked-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-08 08:28:37 -08:00
Jan Beulich
aad094701c [PATCH] move kallsyms data to .rodata
Kallsyms data is never written to, so it can as well benefit from
CONFIG_DEBUG_RODATA.

Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-08 08:28:37 -08:00
Linus Torvalds
6ee7e78e7c Merge branch 'release' of master.kernel.org:/pub/scm/linux/kernel/git/aegl/linux-2.6
* 'release' of master.kernel.org:/pub/scm/linux/kernel/git/aegl/linux-2.6:
  [IA64] replace kmalloc+memset with kzalloc
  [IA64] resolve name clash by renaming is_available_memory()
  [IA64] Need export for csum_ipv6_magic
  [IA64] Fix DISCONTIGMEM without VIRTUAL_MEM_MAP
  [PATCH] Add support for type argument in PAL_GET_PSTATE
  [IA64] tidy up return value of ip_fast_csum
  [IA64] implement csum_ipv6_magic for ia64.
  [IA64] More Itanium PAL spec updates
  [IA64] Update processor_info features
  [IA64] Add se bit to Processor State Parameter structure
  [IA64] Add dp bit to cache and bus check structs
  [IA64] SN: Correctly update smp_affinty mask
  [IA64] sparse cleanups
  [IA64] IA64 Kexec/kdump
2006-12-07 15:39:22 -08:00
Zou Nan hai
a79561134f [IA64] IA64 Kexec/kdump
Changes and updates.

1. Remove fake rendz path and related code according to discuss with Khalid Aziz.
2. fc.i offset fix in relocate_kernel.S.
3. iospic shutdown code eoi and mask race fix from Fujitsu.
4. Warm boot hook in machine_kexec to SN SAL code from Jack Steiner.
5. Send slave to SAL slave loop patch from Jay Lan.
6. Kdump on non-recoverable MCA event patch from Jay Lan
7. Use CTL_UNNUMBERED in kdump_on_init sysctl.

Signed-off-by: Zou Nan hai <nanhai.zou@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
2006-12-07 09:51:35 -08:00
Linus Torvalds
68380b5813 Add "run_scheduled_work()" workqueue function
This allows workqueue users to run just their own pending work, rather
than wait for the whole workqueue to finish running.  This solves the
deadlock with networking libphy that was due to other workqueue entries
possibly needing a lock that was held by the routine that wanted to
flush its own work.

It's not wonderful: if you absolutely need to synchronize with the work
function having been executed, any user strictly speaking should have
its own completion tracking logic, since when we run things explicitly
by hand, the generic workqueue layer can no longer help us synchronize.

Also, this is strictly only usable for work that has been scheduled
without any delayed timers.  You can not mix the new interface with
schedule_delayed_work().

But it's better than what we had currently.

Acked-by: Maciej W. Rozycki <macro@linux-mips.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 09:28:19 -08:00
Linus Torvalds
2685b267bc Merge master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
* master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6: (48 commits)
  [NETFILTER]: Fix non-ANSI func. decl.
  [TG3]: Identify Serdes devices more clearly.
  [TG3]: Use msleep.
  [TG3]: Use netif_msg_*.
  [TG3]: Allow partial speed advertisement.
  [TG3]: Add TG3_FLG2_IS_NIC flag.
  [TG3]: Add 5787F device ID.
  [TG3]: Fix Phy loopback.
  [WANROUTER]: Kill kmalloc debugging code.
  [TCP] inet_twdr_hangman: Delete unnecessary memory barrier().
  [NET]: Memory barrier cleanups
  [IPSEC]: Fix inetpeer leak in ipv4 xfrm dst entries.
  audit: disable ipsec auditing when CONFIG_AUDITSYSCALL=n
  audit: Add auditing to ipsec
  [IRDA] irlan: Fix compile warning when CONFIG_PROC_FS=n
  [IrDA]: Incorrect TTP header reservation
  [IrDA]: PXA FIR code device model conversion
  [GENETLINK]: Fix misplaced command flags.
  [NETLIK]: Add a pointer to the Generic Netlink wiki page.
  [IPV6] RAW: Don't release unlocked sock.
  ...
2006-12-07 09:05:15 -08:00
Linus Torvalds
4522d58275 Merge branch 'for-linus' of git://one.firstfloor.org/home/andi/git/linux-2.6
* 'for-linus' of git://one.firstfloor.org/home/andi/git/linux-2.6: (156 commits)
  [PATCH] x86-64: Export smp_call_function_single
  [PATCH] i386: Clean up smp_tune_scheduling()
  [PATCH] unwinder: move .eh_frame to RODATA
  [PATCH] unwinder: fully support linker generated .eh_frame_hdr section
  [PATCH] x86-64: don't use set_irq_regs()
  [PATCH] x86-64: check vector in setup_ioapic_dest to verify if need setup_IO_APIC_irq
  [PATCH] x86-64: Make ix86 default to HIGHMEM4G instead of NOHIGHMEM
  [PATCH] i386: replace kmalloc+memset with kzalloc
  [PATCH] x86-64: remove remaining pc98 code
  [PATCH] x86-64: remove unused variable
  [PATCH] x86-64: Fix constraints in atomic_add_return()
  [PATCH] x86-64: fix asm constraints in i386 atomic_add_return
  [PATCH] x86-64: Correct documentation for bzImage protocol v2.05
  [PATCH] x86-64: replace kmalloc+memset with kzalloc in MTRR code
  [PATCH] x86-64: Fix numaq build error
  [PATCH] x86-64: include/asm-x86_64/cpufeature.h isn't a userspace header
  [PATCH] unwinder: Add debugging output to the Dwarf2 unwinder
  [PATCH] x86-64: Clarify error message in GART code
  [PATCH] x86-64: Fix interrupt race in idle callback (3rd try)
  [PATCH] x86-64: Remove unwind stack pointer alignment forcing again
  ...

Fixed conflict in include/linux/uaccess.h manually

Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:59:11 -08:00
Paul Menage
d3ed11c356 [PATCH] cpuset: allow a larger buffer for writes to cpuset files
When using fake NUMA setup, the number of memory nodes can greatly exceed
the number of CPUs.  So the current limit in cpuset_common_file_write() is
insufficient.

Signed-off-by: Paul Menage <menage@google.com>
Acked-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:48 -08:00
Ingo Molnar
7929082250 [PATCH] add ignore_loglevel boot option
Sometimes the kernel prints something interesting while userspace bootup
keeps messages turned off via loglevel.  Enable the printing of /all/
kernel messages via the "ignore_loglevel" boot option.  Off by default.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:47 -08:00
Ingo Molnar
70e4506765 [PATCH] lockdep: register_lock_class() fix
The hash_lock must only ever be taken with irqs disabled.  This happens in
all the important places, except one codepath: register_lock_class().  The
race should trigger rarely because register_lock_class() is quite rare and
single-threaded (happens during init most of the time).

The fix is to disable irqs.

( bug found live in -rt: there preemption is alot more agressive and
  preempting with the hash-lock held caused a lockup.)

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:46 -08:00
Magnus Damm
85916f8166 [PATCH] Kexec / Kdump: Unify elf note code
The elf note saving code is currently duplicated over several
architectures.  This cleanup patch simply adds code to a common file and
then replaces the arch-specific code with calls to the newly added code.

The only drawback with this approach is that s390 doesn't fully support
kexec-on-panic which for that arch leads to introduction of unused code.

Signed-off-by: Magnus Damm <magnus@valinux.co.jp>
Cc: Vivek Goyal <vgoyal@in.ibm.com>
Cc: Andi Kleen <ak@suse.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:46 -08:00
Helge Deller
15ad7cdcfd [PATCH] struct seq_operations and struct file_operations constification
- move some file_operations structs into the .rodata section

 - move static strings from policy_types[] array into the .rodata section

 - fix generic seq_operations usages, so that those structs may be defined
   as "const" as well

[akpm@osdl.org: couple of fixes]
Signed-off-by: Helge Deller <deller@gmx.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:46 -08:00
Ralf Baechle
ccdea2f88b [PATCH] futex: remove unneeded barrier
When disassembling a kernel I found around over 90 sync Instructions from
mb, rmb and wmb calls in the kernel and only few of those make any sense to
me.  So here's the first one - I think the wmb() in kernel/futex.c is not
needed on uniprocessors so should become an smb_wmb().

Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:45 -08:00
Josh Triplett
012d3ca8d8 [PATCH] Add Sparse annotations to SRCU wrapper functions in rcutorture
The SRCU wrapper functions srcu_torture_read_lock and
srcu_torture_read_unlock in rcutorture intentionally change the SRCU
context; annotate them accordingly, to avoid a warning.

Signed-off-by: Josh Triplett <josh@freedesktop.org>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:44 -08:00
David Rientjes
a09c17a6fd [PATCH] sys: remove unused variable
Remove unused 'new_ruid' variable.

Reported by David Binderman <dcb314@hotmail.com>.

Signed-off-by: David Rientjes <rientjes@cs.washington.edu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:44 -08:00
Adrian Bunk
045f147f32 [PATCH] remove EXPORT_UNUSED_SYMBOL'ed symbols
In time for 2.6.20, we can get rid of this junk.

Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:44 -08:00
Zachary Amsden
3908fd2ed9 [PATCH] softirq: remove BUG_ONs which can incorrectly trigger
It is possible to have tasklets get scheduled before softirqd has had a chance
to spawn on all CPUs.  This is totally harmless; after success during action
CPU_UP_PREPARE, action CPU_ONLINE will be called, which immediately wakes
softirqd on the appropriate CPU to process the already pending tasklets.  So
there is no danger of having a missed wakeup for any tasklets that were
already pending.

In particular, i386 is affected by this during startup, and is visible when
using a very large initrd; during the time it takes for the initrd to be
decompressed, a timer IRQ can come in and schedule RCU callbacks.  It is also
possible that resending of a hardware IRQ via a softirq triggers the same bug.

Because of different timing conditions, this shows up in all emulators and
virtual machines tested, including Xen, VMware, Virtual PC, and Qemu.  It is
also possible to trigger on native hardware with a large enough initrd,
although I don't have a reliable case demonstrating that.

Signed-off-by: Zachary Amsden <zach@vmware.com>
Cc: <caglar@pardus.org.tr>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:43 -08:00
Ingo Molnar
2ee91f197c [PATCH] lockdep: show more details about self-test failures
Make the locking self-test failures (of 'FAILURE' type) easier to debug by
printing more information.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:43 -08:00
Ingo Molnar
50cc670aeb [PATCH] lockdep: more chains
Some have reported a chain-table overflow - double its size.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:43 -08:00
Chris Caputo
301827acbe [PATCH] sched: correct output of show_state()
At present show_state prints a header the does not match the output of
show_task, as follows:

-
                                               sibling
  task             PC      pid father child younger older
init          S 00000000     0     1      0     2               (NOTLB)
-

This patch corrects the output of show_state so that the header is
aligned with the data, ala:

-
                         free                        sibling
  task             PC    stack   pid father child younger older
init          S 00000000     0     1      0     2               (NOTLB)
-

Signed-off-by: Chris Caputo <ccaputo@alt.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:42 -08:00
BP, Praveen
bd9b0bac6f [PATCH] sysctl: string length calculated is wrong if it contains negative numbers
In the functions do_proc_dointvec() and do_proc_doulongvec_minmax(),
there seems to be a bug in string length calculation if string contains
negative integer.

The console log given below explains the bug. Setting negative values
may not be a right thing to do for "console log level" but then the test
(given below) can be used to demonstrate the bug in the code.

# echo "-1 -1 -1 -123456" > /proc/sys/kernel/printk
# cat /proc/sys/kernel/printk
-1      -1      -1      -1234
#
# echo "-1 -1 -1 123456" > /proc/sys/kernel/printk
# cat /proc/sys/kernel/printk
-1      -1      -1      1234
#

(akpm: the bug is that 123456 gets truncated)

It works as expected if string contains all +ve integers

# echo "1 2 3 4" > /proc/sys/kernel/printk
# cat /proc/sys/kernel/printk
1       2       3       4
#

The patch given below fixes the issue.

Signed-off-by: Praveen BP <praveenbp@ti.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:42 -08:00
Akinobu Mita
95362fa903 [PATCH] futex: init error check
Check register_filesystem() and kern_mount() return values.

Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:41 -08:00
Burman Yan
4668edc334 [PATCH] kernel core: replace kmalloc+memset with kzalloc
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:41 -08:00
Eric Dumazet
1c69d921ed [PATCH] rcu: add a prefetch() in rcu_do_batch()
On some workloads, (for example when lot of close() syscalls are done), RCU
qlen can be quite large, and RCU heads are no longer in cpu cache when
rcu_do_batch() is called.

This patch adds a prefetch() in rcu_do_batch() to give CPU a hint to bring
back cache lines containing 'struct rcu_head's.

Most list manipulations macros include prefetch(), but not open coded ones
(at least with current C compilers :) )

I got a nice speedup on a trivial benchmark (3.48 us per iteration instead
of 3.95 us on a 1.6 GHz Pentium-M)

while (1) { pipe(p); close(fd[0]); close(fd[1]);}

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:40 -08:00
Adrian Bunk
d3228a887c [PATCH] make kernel/signal.c:kill_proc_info() static
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:39 -08:00
Adrian Bunk
ebe7e5fe4b [PATCH] remove kernel/lockdep.c:lockdep_internal
Remove the no longer used lockdep_internal().

Signed-off-by: Adrian Bunk <bunk@stusta.de>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:39 -08:00
Ingo Molnar
0231606785 [PATCH] hotplug CPU: clean up hotcpu_notifier() use
There was lots of #ifdef noise in the kernel due to hotcpu_notifier(fn,
prio) not correctly marking 'fn' as used in the !HOTPLUG_CPU case, and thus
generating compiler warnings of unused symbols, hence forcing people to add
#ifdefs.

the compiler can skip truly unused functions just fine:

    text    data     bss     dec     hex filename
 1624412  728710 3674856 6027978  5bfaca vmlinux.before
 1624412  728710 3674856 6027978  5bfaca vmlinux.after

[akpm@osdl.org: topology.c fix]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:39 -08:00
Masami Hiramatsu
b4c6c34a53 [PATCH] kprobes: enable booster on the preemptible kernel
When we are unregistering a kprobe-booster, we can't release its
instruction buffer immediately on the preemptive kernel, because some
processes might be preempted on the buffer.  The freeze_processes() and
thaw_processes() functions can clean most of processes up from the buffer.
There are still some non-frozen threads who have the PF_NOFREEZE flag.  If
those threads are sleeping (not preempted) at the known place outside the
buffer, we can ensure safety of freeing.

However, the processing of this check routine takes a long time.  So, this
patch introduces the garbage collection mechanism of insn_slot.  It also
introduces the "dirty" flag to free_insn_slot because of efficiency.

The "clean" instruction slots (dirty flag is cleared) are released
immediately.  But the "dirty" slots which are used by boosted kprobes, are
marked as garbages.  collect_garbage_slots() will be invoked to release
"dirty" slots if there are more than INSNS_PER_PAGE garbage slots or if
there are no unused slots.

Cc: "Keshavamurthy, Anil S" <anil.s.keshavamurthy@intel.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: "bibo,mao" <bibo.mao@intel.com>
Cc: Prasanna S Panchamukhi <prasanna@in.ibm.com>
Cc: Yumiko Sugita <yumiko.sugita.yf@hitachi.com>
Cc: Satoshi Oshima <soshima@redhat.com>
Cc: Hideo Aoki <haoki@redhat.com>
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:38 -08:00
Mike Galbraith
c36264dfb2 [PATCH] remove the syslog interface when printk is disabled
Attempts to read() from the non-existent dmesg buffer will return zero and
userspace tends to get stuck in a busyloop.

So just remove /dev/kmsg altogether if CONFIG_PRINTK=n.

Signed-off-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:38 -08:00
Alan Cox
40fcfc8722 [PATCH] HZ: 300Hz support
Fix two things.  Firstly the unit is "Hz" not "HZ".  Secondly it is useful
to have 300Hz support when doing multimedia work.  250 is fine for us in
Europe but the US frame rate is 30fps (29.99 blah for pedants).  300 gives
us a tick divisible by both 25 and 30, and for interlace work 50 and 60.
It's also giving similar performance to 250Hz.

I'd argue we should remove 250 and add 300, but that might be excess
disruption for now.

Signed-off-by: Alan Cox <alan@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:36 -08:00
Peter Zijlstra
d5abe66917 [PATCH] debug: workqueue locking sanity
Workqueue functions should not leak locks, assert so, printing the
last function ran.

Use macros in lockdep.h to avoid include dependency pains.

[akpm@osdl.org: build fix]
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:36 -08:00
Ingo Molnar
ece8a684c7 [PATCH] sleep profiling
Implement prof=sleep profiling.  TASK_UNINTERRUPTIBLE sleeps will be taken
as a profile hit, and every millisecond spent sleeping causes a profile-hit
for the call site that initiated the sleep.

Sample readprofile output on i386:

   306 ps2_sendbyte                               1.3973
   432 call_usermodehelper_keys                   1.9548
   484 ps2_command                                0.6453
   790 __driver_attach                            4.7879
  1593 msleep                                    44.2500
  3976 sync_buffer                               64.1290
  4076 do_lookup                                 12.4648
  8587 sync_page                                122.6714
 20820 total                                      0.0067

(NOTE: architectures need to check whether get_wchan() can be called from
deep within the wakeup path.)

akpm: we need to mark more functions __sched.  lock_sock(), msleep(), others..

akpm: the contention in do_lookup() is a surprise.  Presumably doing disk
reads for directory contents while holding i_mutex.

[akpm@osdl.org: various fixes]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:36 -08:00
Peter Zijlstra
6cfd76a26d [PATCH] lockdep: name some old style locks
Name some of the remaning 'old_style_spin_init' locks

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:36 -08:00
Peter Zijlstra
a4c410f00f [PATCH] lockdep: print current locks on in_atomic warnings
Add debug_show_held_locks(current) to __might_sleep() and schedule(); this
makes finding the offending lock leak easier.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:36 -08:00
Oleg Nesterov
3716748530 [PATCH] taskstats: cleanup reply assembling
Thomas Graf wrote:
>
> nla_nest_start() may return NULL, either rely on prepare_reply() to be
> correct and BUG() on failure or do proper error handling for all
> functions.

nla_put() in taskstat.c can fail only if the 'size' argument of alloc_skb()
was not right. This is a kernel bug, we should not hide it. So add 'BUG()'
on error path and check for 'na == NULL'.

> genlmsg_cancel() is only required in error paths for dumping
> procedures.

So we can remove 'genlmsg_cancel()' calls and 'void *reply' (saves 227 bytes).

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Thomas Graf <tgraf@suug.ch>
Cc: Shailabh Nagar <nagar@watson.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Jay Lan <jlan@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:34 -08:00
Oleg Nesterov
51de4d9085 [PATCH] taskstats: use nla_reserve() for reply assembling
Currently taskstats_user_cmd()/taskstats_exit() do:

	1) allocate stats
	2) fill stats
	3) make a temporary copy on stack (236 bytes)
	4) copy that copy to skb
	5) free stats

With the help of nla_reserve() we can operate on skb->data directly,
thus avoiding all these steps except 2).

So, before this patch:

	// copy *stats to skb->data
	int mk_reply(skb, ..., struct taskstats *stats);

	fill_pid(stats);
	mk_reply(skb, ..., stats);

After:
	// return a pointer to skb->data
	struct taskstats *mk_reply(skb, ...);

	stat = mk_reply(skb, ...);
	fill_pid(stats);

Shrinks taskatsks.o by 162 bytes.

A stupid benchmark (send one million TASKSTATS_CMD_ATTR_PID) shows the

		real user sys
	before:
		4.02 0.06 3.96
		4.02 0.04 3.98
		4.02 0.04 3.97
	after:
		3.86 0.08 3.78
		3.88 0.10 3.77
		3.89 0.09 3.80

but this looks suspiciously good.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Acked-by: Shailabh Nagar <nagar@watson.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Jay Lan <jlan@sgi.com>
Cc: Thomas Graf <tgraf@suug.ch>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:34 -08:00
Oleg Nesterov
68062b86fc [PATCH] taskstats: factor out reply assembling
Introduce mk_reply() helper which does all nla_put()s on reply.

Saves 453 bytes and a preparation for the next patch.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Acked-by: Shailabh Nagar <nagar@watson.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Jay Lan <jlan@sgi.com>
Cc: Thomas Graf <tgraf@suug.ch>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:34 -08:00
Oleg Nesterov
34ec12349c [PATCH] taskstats: cleanup ->signal->stats allocation
Allocate ->signal->stats on demand in taskstats_exit(), this allows us to
remove taskstats_tgid_alloc() (the last non-trivial inline) from taskstat's
public interface.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Shailabh Nagar <nagar@watson.ibm.com>
Cc: Jay Lan <jlan@engr.sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:34 -08:00
Oleg Nesterov
115085ea07 [PATCH] taskstats: cleanup do_exit() path
do_exit:
	taskstats_exit_alloc()
	...
	taskstats_exit_send()
	taskstats_exit_free()

I think this is not good, let it be a single function exported to the core
kernel, taskstats_exit(), which does alloc + send + free itself.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Shailabh Nagar <nagar@watson.ibm.com>
Cc: Jay Lan <jlan@engr.sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:34 -08:00
Oleg Nesterov
128fb95650 [PATCH] taskstats_exit_alloc: optimize/simplify
If there are no listeners, every task does unneeded kmem_cache alloc/free on
exit. We don't need listeners->sem for 'if (!list_empty())' check. Yes, we may
have a false positive, but this doesn't differ from the case when the listener
is unregistered after we drop the semaphore. So we don't need to do allocation
beforehand.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Balbir Singh <balbir@in.ibm.com>
Acked-by: Shailabh Nagar <nagar@watson.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:34 -08:00
Heiko Carstens
064b022c7a [PATCH] profile: fix uaccess handling
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:33 -08:00
Roland McGrath
fec1d01152 [PATCH] Disable CLONE_CHILD_CLEARTID for abnormal exit
The CLONE_CHILD_CLEARTID flag is used by NPTL to have its threads
communicate via memory/futex when they exit, so pthread_join can
synchronize using a simple futex wait.  The word of user memory where NPTL
stores a thread's own TID is what it passes; this gets reset to zero at
thread exit.

It is not desireable to touch this user memory when threads are dying due
to a fatal signal.  A core dump is more usefully representative of the
dying program state if the threads live at the time of the crash have their
NPTL data structures unperturbed.  The userland expectation of
CLONE_CHILD_CLEARTID has only ever been that it works for a thread making
an _exit system call.

This problem was identified by Ernie Petrides <petrides@redhat.com>.

Signed-off-by: Roland McGrath <roland@redhat.com>
Cc: Ernie Petrides <petrides@redhat.com>
Cc: Jakub Jelinek <jakub@redhat.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: Ulrich Drepper <drepper@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:33 -08:00
Jarek Poplawski
b23984d0a1 [PATCH] lockdep: misc fixes in lockdep.c
- numeric string size replaced with constant in print_lock_name and
   print_lockdep_cache,

 - return on null pointer in print_lock_dependencies,

 - one more lockdep return with 0 with unlocking fix in mark_lock.

Signed-off-by: Jarek Poplawski <jarkao2@o2.pl>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:33 -08:00
Jarek Poplawski
910b1b2e6d [PATCH] lockdep: internal locking fixes
Here are mainly some lockdep returns with 0 with unlocking fixes.

Signed-off-by: Jarek Poplawski <jarkao2@o2.pl>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:33 -08:00
Paul Jackson
696040670a [PATCH] cpuset: minor code refinements
A couple of minor code simplifications to the kernel/cpuset.c code.  No
functional change.  Just a little less code and a little more readable.

Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:32 -08:00
Ralf Baechle
4cf303487d [PATCH] Export pm_suspend for the shared APM emulation
The new shared APM emulation just like its ARM and MIPS predecessors uses
pm_suspend() which was only exported on SH.  Move export to close to it's
definition where it really should be anyway.

Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Paul Mundt <lethal@linux-sh.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:32 -08:00
Ingo Molnar
e59e2ae2c2 [PATCH] SysRq-X: show blocked tasks
Add SysRq-X support: show blocked (TASK_UNINTERRUPTIBLE) tasks only.

Useful for debugging IO stalls.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:32 -08:00
Adam B. Jerome
07354a0090 [PATCH] /proc/kallsyms reports lower-case types for some non-exported symbols
This patch addresses incorrect symbol type information reported through
/proc/kallsyms.  A lowercase character should designate the symbol as local
(or non-exported).  An uppercase character should designate the symbol as
global (or external).

Without this patch, some non-exported symbols are incorrectly assigned an
upper-case designation in /proc/kallsyms.  This patch corrects this
condition by converting non-exported symbols types to lower case when
appropriate and eliminates the superfluous upcase_if_global function

Signed-off-by: Adam B. Jerome <abj@novell.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:31 -08:00
Peter Zijlstra
ed07536ed6 [PATCH] lockdep: annotate nfs/nfsd in-kernel sockets
Stick NFS sockets in their own class to avoid some lockdep warnings.  NFS
sockets are never exposed to user-space, and will hence not trigger certain
code paths that would otherwise pose deadlock scenarios.

[akpm@osdl.org: cleanups]
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Steven Dickson <SteveD@redhat.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Acked-by: Neil Brown <neilb@suse.de>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
[ Fixed patch corruption by quilt, pointed out by Peter Zijlstra ]
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:30 -08:00
Rafael J. Wysocki
341a595850 [PATCH] Support for freezeable workqueues
Make it possible to create a workqueue the worker thread of which will be
frozen during suspend, along with other kernel threads.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Cc: Nigel Cunningham <nigel@suspend2.net>
Cc: David Chinner <dgc@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:29 -08:00
Pavel Machek
5045cfc103 [PATCH] swsusp: kill write-only variable
Cleanup write-only variable, suggested by D Binderman.

Signed-off-by: Pavel Machek <pavel@suse.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:29 -08:00
Rafael J. Wysocki
2d87595ea6 [PATCH] PM: Fix swsusp debug mode testproc
The 'testproc' swsusp debug mode thaws tasks twice in a row, which is _very_
confusing.  Fix that.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:28 -08:00
Rafael J. Wysocki
59a493350e [PATCH] swsusp: Fix labels
Move all labels in the swsusp code to the second column, so that they won't
fool diff -p.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Cc: Nigel Cunningham <nigel@suspend2.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:28 -08:00
Rafael J. Wysocki
5b6d15de2d [PATCH] swsusp: Fix coding style in suspend.c
Fix coding style in suspend.c.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Cc: Nigel Cunningham <nigel@suspend2.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:28 -08:00
Rafael J. Wysocki
11b2ce2ba9 [PATCH] swsusp: Untangle freeze_processes
Move the loop from freeze_processes() to a separate function and call it
independently for user space processes and kernel threads so that the order
of freezing tasks is clearly visible.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Nigel Cunningham <nigel@suspend2.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:28 -08:00
Rafael J. Wysocki
a9b6f562f1 [PATCH] swsusp: Untangle thaw_processes
Move the loop from thaw_processes() to a separate function and call it
independently for kernel threads and user space processes so that the order
of thawing tasks is clearly visible.

Drop thaw_kernel_threads() which is never used.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Nigel Cunningham <nigel@suspend2.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:28 -08:00
Stephen Hemminger
a6d7098060 [PATCH] convert pm_sem to a mutex
The power management semaphore is only used as mutex, so convert it.

[akpm@osdl.org: fix rotten bug]
Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Acked-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Pavel Machek <pavel@ucw.cz>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:28 -08:00
Rafael J. Wysocki
3eb1b3a407 [PATCH] suspend to disk fails if gdb is suspended with a traced child
Fix http://bugzilla.kernel.org/show_bug.cgi?id=7534

Fix the freezing of processes so that it won't fail if there is a traced
process the parent of which has been stopped.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Cc: maurice barnum <pixi+kbug@burble.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:28 -08:00
Rafael J. Wysocki
0d3a9abe8a [PATCH] swsusp: Measure memory shrinking time
Make swsusp measure and print the time needed to shrink memory during the
suspend.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Nigel Cunningham <nigel@suspend2.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:28 -08:00
Siddha, Suresh B
112cecb2cc [PATCH] suspend: don't change cpus_allowed for task initiating the suspend
Don't modify the cpus_allowed of the task initiating the suspend.
_cpu_down() already makes sure that the task doing the suspend doesn't run
on dying cpu.

Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Nigel Cunningham <nigel@suspend2.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:28 -08:00
Rafael J. Wysocki
2d4a34c936 [PATCH] swsusp: Support i386 systems with PAE or without PSE
Make swsusp support i386 systems with PAE or without PSE.

This is done by creating temporary page tables located in resume-safe page
frames before the suspend image is restored in the same way as x86_64 does
it.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Andi Kleen <ak@suse.de>
Cc: Dave Jones <davej@redhat.com>
Cc: Nigel Cunningham <ncunningham@linuxmail.org>
Cc: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:28 -08:00
Nigel Cunningham
ff39593ad0 [PATCH] swsusp: thaw userspace and kernel space separately
Modify process thawing so that we can thaw kernel space without thawing
userspace, and thaw kernelspace first.  This will be useful in later
patches, where I intend to get swsusp thawing kernel threads only before
seeking to free memory.

Signed-off-by: Nigel Cunningham <nigel@suspend2.net>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:28 -08:00
Nigel Cunningham
14b5b7cfaa [PATCH] swsusp: clean up whitespace in freezer output
Minor whitespace and formatting modifications for the freezer.

Signed-off-by: Nigel Cunningham <nigel@suspend2.net>
Acked-by: Pavel Machek <pavel@ucw.cz>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:28 -08:00
Nigel Cunningham
32d50f57da [PATCH] swsusp: quieten Freezer if !CONFIG_PM_DEBUG
The freezer currently prints an '=' for every process that is frozen.  This
is pretty pointless, as the equals sign says nothing about which process is
frozen, and makes logs look messier (especially if there were a large
number of processes running).  All we really need to know is that we
started trying to freeze processes and what processes (if any) failed to
freeze, or that we succeeded.

Signed-off-by: Nigel Cunningham <nigel@suspend2.net>
Acked-by: Pavel Machek <pavel@ucw.cz>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:27 -08:00
Nigel Cunningham
7dfb71030f [PATCH] Add include/linux/freezer.h and move definitions from sched.h
Move process freezing functions from include/linux/sched.h to freezer.h, so
that modifications to the freezer or the kernel configuration don't require
recompiling just about everything.

[akpm@osdl.org: fix ueagle driver]
Signed-off-by: Nigel Cunningham <nigel@suspend2.net>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:27 -08:00
Stefan Seyfried
8a05aac263 [PATCH] swsusp: fix platform mode
At some point after 2.6.13, in-kernel software suspend got "incomplete" for
the so-called "platform" mode.  pm_ops->prepare() is never called.  A
visible sign of this is the "moon" light on thinkpads not flashing during
suspend.  Fix by readding the pm_ops->prepare call during suspend.

Signed-off-by: Stefan Seyfried <seife@suse.de>
Acked-by: "Rafael J. Wysocki" <rjw@sisk.pl>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:27 -08:00
Rafael J. Wysocki
8594912187 [PATCH] swsusp: use __GFP_WAIT
swsusp uses GFP_ATOMIC, but it can afford to use __GFP_WAIT, which will
permit it to reclaim clean pagecache instead of emitting scary
page-allocation-failure messages.

Cc: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:27 -08:00
Rafael J. Wysocki
8357376d3d [PATCH] swsusp: Improve handling of highmem
Currently swsusp saves the contents of highmem pages by copying them to the
normal zone which is quite inefficient (eg.  it requires two normal pages
to be used for saving one highmem page).  This may be improved by using
highmem for saving the contents of saveable highmem pages.

Namely, during the suspend phase of the suspend-resume cycle we try to
allocate as many free highmem pages as there are saveable highmem pages.
If there are not enough highmem image pages to store the contents of all of
the saveable highmem pages, some of them will be stored in the "normal"
memory.  Next, we allocate as many free "normal" pages as needed to store
the (remaining) image data.  We use a memory bitmap to mark the allocated
free pages (ie.  highmem as well as "normal" image pages).

Now, we use another memory bitmap to mark all of the saveable pages
(highmem as well as "normal") and the contents of the saveable pages are
copied into the image pages.  Then, the second bitmap is used to save the
pfns corresponding to the saveable pages and the first one is used to save
their data.

During the resume phase the pfns of the pages that were saveable during the
suspend are loaded from the image and used to mark the "unsafe" page
frames.  Next, we try to allocate as many free highmem page frames as to
load all of the image data that had been in the highmem before the suspend
and we allocate so many free "normal" page frames that the total number of
allocated free pages (highmem and "normal") is equal to the size of the
image.  While doing this we have to make sure that there will be some extra
free "normal" and "safe" page frames for two lists of PBEs constructed
later.

Now, the image data are loaded, if possible, into their "original" page
frames.  The image data that cannot be written into their "original" page
frames are loaded into "safe" page frames and their "original" kernel
virtual addresses, as well as the addresses of the "safe" pages containing
their copies, are stored in one of two lists of PBEs.

One list of PBEs is for the copies of "normal" suspend pages (ie.  "normal"
pages that were saveable during the suspend) and it is used in the same way
as previously (ie.  by the architecture-dependent parts of swsusp).  The
other list of PBEs is for the copies of highmem suspend pages.  The pages
in this list are restored (in a reversible way) right before the
arch-dependent code is called.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:27 -08:00
Rafael J. Wysocki
37b2ba12df [PATCH] swsusp: add ioctl for swap files support
To be able to use swap files as suspend storage from the userland suspend
tools we need an additional ioctl() that will allow us to provide the kernel
with both the swap header's offset and the identification of the resume
partition.

The new ioctl() should be regarded as a replacement for the
SNAPSHOT_SET_SWAP_FILE ioctl() that from now on will be considered as
obsolete, but has to stay for backwards compatibility of the interface.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:27 -08:00
Rafael J. Wysocki
9a154d9d95 [PATCH] swsusp: add resume_offset command line parameter
Add the kernel command line parameter "resume_offset=" allowing us to specify
the offset, in <PAGE_SIZE> units, from the beginning of the partition pointed
to by the "resume=" parameter at which the swap header is located.

This offset can be determined, for example, by an application using the FIBMAP
ioctl to obtain the swap header's block number for given file.

[akpm@osdl.org: we don't know what type sector_t is]
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:27 -08:00
Rafael J. Wysocki
3aef83e0ef [PATCH] swsusp: use block device offsets to identify swap locations
Make swsusp use block device offsets instead of swap offsets to identify swap
locations and make it use the same code paths for writing as well as for
reading data.

This allows us to use the same code for handling swap files and swap
partitions and to simplify the code, eg.  by dropping rw_swap_page_sync().

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:27 -08:00
Rafael J. Wysocki
3fc6b34f48 [PATCH] swsusp: rearrange swap-handling code
Rearrange the code in kernel/power/swap.c so that the next patch is more
readable.

[This patch only moves the existing code.]

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:27 -08:00
Rafael J. Wysocki
915bae9ebe [PATCH] swsusp: use partition device and offset to identify swap areas
The Linux kernel handles swap files almost in the same way as it handles swap
partitions and there are only two differences between these two types of swap
areas:

(1) swap files need not be contiguous,

(2) the header of a swap file is not in the first block of the partition
    that holds it.  From the swsusp's point of view (1) is not a problem,
    because it is already taken care of by the swap-handling code, but (2) has
    to be taken into consideration.

In principle the location of a swap file's header may be determined with the
help of appropriate filesystem driver.  Unfortunately, however, it requires
the filesystem holding the swap file to be mounted, and if this filesystem is
journaled, it cannot be mounted during a resume from disk.  For this reason we
need some other means by which swap areas can be identified.

For example, to identify a swap area we can use the partition that holds the
area and the offset from the beginning of this partition at which the swap
header is located.

The following patch allows swsusp to identify swap areas this way.  It changes
swap_type_of() so that it takes an additional argument representing an offset
of the swap header within the partition represented by its first argument.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:27 -08:00
Stefan Seyfried
3592695c36 [PATCH] uswsusp: add pmops->{prepare,enter,finish} support (aka "platform mode")
Add an ioctl to the userspace swsusp code that enables the usage of the
pmops->prepare, pmops->enter and pmops->finish methods (the in-kernel
suspend knows these as "platform method").  These are needed on many
machines to (among others) speed up resuming by letting the BIOS skip some
steps or let my hp nx5000 recognise the correct ac_adapter state after
resume again.

It also ensures on many machines, that changed hardware (unplugged AC
adapters) gets correctly detected and that kacpid does not run wild after
resume.

Signed-off-by: Stefan Seyfried <seife@suse.de>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:26 -08:00
Christoph Lameter
e18b890bb0 [PATCH] slab: remove kmem_cache_t
Replace all uses of kmem_cache_t with struct kmem_cache.

The patch was generated using the following script:

	#!/bin/sh
	#
	# Replace one string by another in all the kernel sources.
	#

	set -e

	for file in `find * -name "*.c" -o -name "*.h"|xargs grep -l $1`; do
		quilt add $file
		sed -e "1,\$s/$1/$2/g" $file >/tmp/$$
		mv /tmp/$$ $file
		quilt refresh
	done

The script was run like this

	sh replace kmem_cache_t "struct kmem_cache"

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:25 -08:00
Christoph Lameter
e94b176609 [PATCH] slab: remove SLAB_KERNEL
SLAB_KERNEL is an alias of GFP_KERNEL.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:24 -08:00
Peter Zijlstra
a866374aec [PATCH] mm: pagefault_{disable,enable}()
Introduce pagefault_{disable,enable}() and use these where previously we did
manual preempt increments/decrements to make the pagefault handler do the
atomic thing.

Currently they still rely on the increased preempt count, but do not rely on
the disabled preemption, this might go away in the future.

(NOTE: the extra barrier() in pagefault_disable might fix some holes on
       machines which have too many registers for their own good)

[heiko.carstens@de.ibm.com: s390 fix]
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Nick Piggin <npiggin@suse.de>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:21 -08:00
Ashwin Chaugule
7602bdf2fd [PATCH] new scheme to preempt swap token
The new swap token patches replace the current token traversal algo.  The old
algo had a crude timeout parameter that was used to handover the token from
one task to another.  This algo, transfers the token to the tasks that are in
need of the token.  The urgency for the token is based on the number of times
a task is required to swap-in pages.  Accordingly, the priority of a task is
incremented if it has been badly affected due to swap-outs.  To ensure that
the token doesnt bounce around rapidly, the token holders are given a priority
boost.  The priority of tasks is also decremented, if their rate of swap-in's
keeps reducing.  This way, the condition to check whether to pre-empt the swap
token, is a matter of comparing two task's priority fields.

[akpm@osdl.org: cleanups]
Signed-off-by: Ashwin Chaugule <ashwin.chaugule@celunite.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:21 -08:00
Joy Latten
161a09e737 audit: Add auditing to ipsec
An audit message occurs when an ipsec SA
or ipsec policy is created/deleted.

Signed-off-by: Joy Latten <latten@austin.ibm.com>
Signed-off-by: James Morris <jmorris@namei.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-12-06 20:14:22 -08:00
Jan Beulich
b65780e123 [PATCH] unwinder: move .eh_frame to RODATA
The .eh_frame section contents is never written to, so it can as well
benefit from CONFIG_DEBUG_RODATA.

Diff-ed against firstfloor tree.

Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Andi Kleen <ak@suse.de>
2006-12-07 02:14:19 +01:00
Jan Beulich
c65f38d911 [PATCH] unwinder: fully support linker generated .eh_frame_hdr section
Now that binutils' ld is able to properly populate .eh_frame_hdr in the
Linux kernel case, here's a patch to add some functionality to the Dwarf2
unwinder to actually be able to make use of this (applies on firstfloor
tree with the previously sent patch to add debug output, but not on plain
2.6.19).

Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Andi Kleen <ak@suse.de>
2006-12-07 02:14:19 +01:00
Jan Beulich
6d0185ea61 [PATCH] unwinder: Add debugging output to the Dwarf2 unwinder
Add debugging printks to the unwinder to allow easier debugging
when something goes wrong with it.

This can be controlled with the new unwinder_debug=N option
Most output is given by N=1

AK: Added documentation of unwinder_debug=

Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Andi Kleen <ak@suse.de>
2006-12-07 02:14:13 +01:00
Jan Beulich
359ad0d401 [PATCH] unwinder: more sanity checks in Dwarf2 unwinder
Tighten the requirements on both input to and output from the Dwarf2
unwinder.

Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Andi Kleen <ak@suse.de>
2006-12-07 02:14:13 +01:00
Andi Kleen
eef5e0d185 [PATCH] unwinder: Remove lockdep disabling of nested locks for unwinder
Shouldn't be needed anymore since __kernel_text_address
is used unconditionally on x86-64

Signed-off-by: Andi Kleen <ak@suse.de>
2006-12-07 02:14:12 +01:00
Andi Kleen
e2124bb8d3 [PATCH] unwinder: Use probe_kernel_address instead of __get_user in kernel/unwind.c
This avoids trouble with the page fault handler if the fault
happens inside an interrupt context.

Suggested by Linus

Cc: jbeulich@novell.com
Signed-off-by: Andi Kleen <ak@suse.de>
2006-12-07 02:14:12 +01:00
Chuck Ebbert
0741f4d207 [PATCH] x86: add sysctl for kstack_depth_to_print
Add sysctl for kstack_depth_to_print. This lets users change
the amount of raw stack data printed in dump_stack() without
having to reboot.

Signed-off-by: Chuck Ebbert <76306.1226@compuserve.com>
Signed-off-by: Andi Kleen <ak@suse.de>
2006-12-07 02:14:11 +01:00
Jeremy Fitzhardinge
f95d47caae [PATCH] i386: Use %gs as the PDA base-segment in the kernel
This patch is the meat of the PDA change.  This patch makes several related
changes:

1: Most significantly, %gs is now used in the kernel.  This means that on
   entry, the old value of %gs is saved away, and it is reloaded with
   __KERNEL_PDA.

2: entry.S constructs the stack in the shape of struct pt_regs, and this
   is passed around the kernel so that the process's saved register
   state can be accessed.

   Unfortunately struct pt_regs doesn't currently have space for %gs
   (or %fs). This patch extends pt_regs to add space for gs (no space
   is allocated for %fs, since it won't be used, and it would just
   complicate the code in entry.S to work around the space).

3: Because %gs is now saved on the stack like %ds, %es and the integer
   registers, there are a number of places where it no longer needs to
   be handled specially; namely context switch, and saving/restoring the
   register state in a signal context.

4: And since kernel threads run in kernel space and call normal kernel
   code, they need to be created with their %gs == __KERNEL_PDA.

Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Chuck Ebbert <76306.1226@compuserve.com>
Cc: Zachary Amsden <zach@vmware.com>
Cc: Jan Beulich <jbeulich@novell.com>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
2006-12-07 02:14:02 +01:00
David Howells
9db7372445 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6
Conflicts:

	drivers/ata/libata-scsi.c
	include/linux/libata.h

Futher merge of Linus's head and compilation fixups.

Signed-Off-By: David Howells <dhowells@redhat.com>
2006-12-05 17:01:28 +00:00
David Howells
4c1ac1b491 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6
Conflicts:

	drivers/infiniband/core/iwcm.c
	drivers/net/chelsio/cxgb2.c
	drivers/net/wireless/bcm43xx/bcm43xx_main.c
	drivers/net/wireless/prism54/islpci_eth.c
	drivers/usb/core/hub.h
	drivers/usb/input/hid-core.c
	net/core/netpoll.c

Fix up merge failures with Linus's head and fix new compilation failures.

Signed-Off-By: David Howells <dhowells@redhat.com>
2006-12-05 14:37:56 +00:00
Al Viro
a1f8e7f7fb [PATCH] severing skbuff.h -> highmem.h
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-12-04 02:00:29 -05:00
Al Viro
f6a570333e [PATCH] severing module.h->sched.h
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-12-04 02:00:22 -05:00
Thomas Graf
17c157c889 [GENL]: Add genlmsg_put_reply() to simplify building reply headers
By modyfing genlmsg_put() to take a genl_family and by adding
genlmsg_put_reply() the process of constructing the netlink
and generic netlink headers is simplified.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Acked-by: Paul Moore <paul.moore@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-12-02 21:22:42 -08:00
Thomas Graf
3dabc71578 [GENL]: Add genlmsg_new() to allocate generic netlink messages
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Acked-by: Paul Moore <paul.moore@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-12-02 21:22:40 -08:00
Thomas Graf
339bf98ffc [NETLINK]: Do precise netlink message allocations where possible
Account for the netlink message header size directly in nlmsg_new()
instead of relying on the caller calculate it correctly.

Replaces error handling of message construction functions when
constructing notifications with bug traps since a failure implies
a bug in calculating the size of the skb.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Acked-by: Paul Moore <paul.moore@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-12-02 21:22:11 -08:00
Kay Sievers
e17e0f51ae Driver core: show drivers in /sys/module/
Show the drivers, which belong to the module:
  $ ls -l /sys/module/usbcore/drivers/
  hub -> ../../../bus/usb/drivers/hub
  usb -> ../../../bus/usb/drivers/usb
  usbfs -> ../../../bus/usb/drivers/usbfs

Signed-off-by: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2006-12-01 14:52:02 -08:00
Linus Torvalds
707badb80b Merge branch 'for-linus' of git://one.firstfloor.org/home/andi/git/linux-2.6
* 'for-linus' of git://one.firstfloor.org/home/andi/git/linux-2.6:
  [PATCH] x86-64: Use stricter in process stack check for unwinder
  [PATCH] i386: Fix compilation with UP genericarch
  [PATCH] x86-64: Fix warning in io_apic.c
  [PATCH] x86-64: work around gcc4 issue with -Os in Dwarf2 stack unwind
  [PATCH] x86_64: Align data segment to PAGE_SIZE boundary
2006-11-28 17:28:41 -08:00
Akinobu Mita
3cce4856ff [PATCH] fix create_write_pipe() error check
The return value of create_write_pipe()/create_read_pipe() should be
checked by IS_ERR().

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-11-28 17:26:50 -08:00
Jan Beulich
ff0a538d8b [PATCH] x86-64: work around gcc4 issue with -Os in Dwarf2 stack unwind
This fixes a problem with gcc4 mis-compiling the stack unwind code under
-Os, which resulted in 'stuck' messages whenever an assembly routine was
encountered.

(The second hunk is trivial cleanup.)

Signed-off-by: Jan Beulich <jbeulich@novell.com>
2006-11-28 20:12:59 +01:00
Arjan van de Ven
cfd3ef2346 [PATCH] lockdep: spin_lock_irqsave_nested()
Introduce spin_lock_irqsave_nested(); implementation from:
 http://lkml.org/lkml/2006/6/1/122
Patch from:
 http://lkml.org/lkml/2006/9/13/258

[akpm@osdl.org: two compile fixes]
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Jiri Kosina <jikos@jikos.cz>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-11-25 13:28:34 -08:00
Akinobu Mita
753ca4f312 [PATCH] fix copy_process() error check
The return value of copy_process() should be checked by IS_ERR().

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-11-25 13:28:34 -08:00
Linus Torvalds
b42172fc7b Don't call "note_interrupt()" with irq descriptor lock held
This reverts commit f72fa70760, and solves
the problem that it tried to fix by simply making "__do_IRQ()" call the
note_interrupt() function without the lock held, the way everybody else
does.

It should be noted that all interrupt handling code must never allow the
descriptor actors to be entered "recursively" (that's why we do all the
magic IRQ_PENDING stuff in the first place), so there actually is
exclusion at that much higher level, even in the absense of locking.

Acked-by: Vivek Goyal <vgoyal@in.ibm.com>
Acked-by:Pavel Emelianov <xemul@openvz.org>
Cc: Andrew Morton <akpm@osdl.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-11-22 09:32:06 -08:00
David Howells
c4028958b6 WorkStruct: make allyesconfig
Fix up for make allyesconfig.

Signed-Off-By: David Howells <dhowells@redhat.com>
2006-11-22 14:57:56 +00:00
David Howells
65f27f3844 WorkStruct: Pass the work_struct pointer instead of context data
Pass the work_struct pointer to the work function rather than context data.
The work function can use container_of() to work out the data.

For the cases where the container of the work_struct may go away the moment the
pending bit is cleared, it is made possible to defer the release of the
structure by deferring the clearing of the pending bit.

To make this work, an extra flag is introduced into the management side of the
work_struct.  This governs auto-release of the structure upon execution.

Ordinarily, the work queue executor would release the work_struct for further
scheduling or deallocation by clearing the pending bit prior to jumping to the
work function.  This means that, unless the driver makes some guarantee itself
that the work_struct won't go away, the work function may not access anything
else in the work_struct or its container lest they be deallocated..  This is a
problem if the auxiliary data is taken away (as done by the last patch).

However, if the pending bit is *not* cleared before jumping to the work
function, then the work function *may* access the work_struct and its container
with no problems.  But then the work function must itself release the
work_struct by calling work_release().

In most cases, automatic release is fine, so this is the default.  Special
initiators exist for the non-auto-release case (ending in _NAR).


Signed-Off-By: David Howells <dhowells@redhat.com>
2006-11-22 14:55:48 +00:00
David Howells
365970a1ea WorkStruct: Merge the pending bit into the wq_data pointer
Reclaim a word from the size of the work_struct by folding the pending bit and
the wq_data pointer together.  This shouldn't cause misalignment problems as
all pointers should be at least 4-byte aligned.

Signed-Off-By: David Howells <dhowells@redhat.com>
2006-11-22 14:54:49 +00:00
David Howells
6bb49e5965 WorkStruct: Typedef the work function prototype
Define a type for the work function prototype.  It's not only kept in the
work_struct struct, it's also passed as an argument to several functions.

This makes it easier to change it.

Signed-Off-By: David Howells <dhowells@redhat.com>
2006-11-22 14:54:45 +00:00
David Howells
52bad64d95 WorkStruct: Separate delayable and non-delayable events.
Separate delayable work items from non-delayable work items be splitting them
into a separate structure (delayed_work), which incorporates a work_struct and
the timer_list removed from work_struct.

The work_struct struct is huge, and this limits it's usefulness.  On a 64-bit
architecture it's nearly 100 bytes in size.  This reduces that by half for the
non-delayable type of event.

Signed-Off-By: David Howells <dhowells@redhat.com>
2006-11-22 14:54:01 +00:00
Ingo Molnar
1ff5683043 [PATCH] lockdep: fix static keys in module-allocated percpu areas
lockdep got confused by certain locks in modules:

 INFO: trying to register non-static key.
 the code is fine but needs lockdep annotation.
 turning off the locking correctness validator.

 Call Trace:
  [<ffffffff8026f40d>] dump_trace+0xaa/0x3f2
  [<ffffffff8026f78f>] show_trace+0x3a/0x60
  [<ffffffff8026f9d1>] dump_stack+0x15/0x17
  [<ffffffff802abfe8>] __lock_acquire+0x724/0x9bb
  [<ffffffff802ac52b>] lock_acquire+0x4d/0x67
  [<ffffffff80267139>] rt_spin_lock+0x3d/0x41
  [<ffffffff8839ed3f>] :ip_conntrack:__ip_ct_refresh_acct+0x131/0x174
  [<ffffffff883a1334>] :ip_conntrack:udp_packet+0xbf/0xcf
  [<ffffffff8839f9af>] :ip_conntrack:ip_conntrack_in+0x394/0x4a7
  [<ffffffff8023551f>] nf_iterate+0x41/0x7f
  [<ffffffff8025946a>] nf_hook_slow+0x64/0xd5
  [<ffffffff802369a2>] ip_rcv+0x24e/0x506
  [...]

Steven Rostedt found the bug: static_obj() check did not take
PERCPU_ENOUGH_ROOM into account, so in-module DEFINE_PER_CPU-area locks
were triggering this message.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-11-17 11:10:37 -08:00
Zhang, Yanmin
b86432b42e [PATCH] some irq_chip variables point to NULL
I got an oops when booting 2.6.19-rc5-mm1 on my ia64 machine.

Below is the log.

Oops 11012296146944 [1]
Modules linked in: binfmt_misc dm_mirror dm_multipath dm_mod thermal processor f
an container button sg eepro100 e100 mii

Pid: 0, CPU 0, comm:              swapper
psr : 0000121008022038 ifs : 800000000000040b ip  : [<a0000001000e1411>]    Not
tainted
ip is at __do_IRQ+0x371/0x3e0
unat: 0000000000000000 pfs : 000000000000040b rsc : 0000000000000003
rnat: 656960155aa56aa5 bsps: a00000010058b890 pr  : 656960155aa55a65
ldrs: 0000000000000000 ccv : 0000000000000000 fpsr: 0009804c0270033f
csd : 0000000000000000 ssd : 0000000000000000
b0  : a0000001000e1390 b6  : a0000001005beac0 b7  : e00000007f01aa00
f6  : 000000000000000000000 f7  : 0ffe69090000000000000
f8  : 1000a9090000000000000 f9  : 0ffff8000000000000000
f10 : 1000a908ffffff6f70000 f11 : 1003e0000000000000909
r1  : a000000100fbbff0 r2  : 0000000000010002 r3  : 0000000000010001
r8  : fffffffffffbffff r9  : a000000100bd8060 r10 : a000000100dd83b8
r11 : fffffffffffeffff r12 : a000000100bcbbb0 r13 : a000000100bc4000
r14 : 0000000000010000 r15 : 0000000000010000 r16 : a000000100c01aa8
r17 : a000000100d2c350 r18 : 0000000000000000 r19 : a000000100d2c300
r20 : a000000100c01a88 r21 : 0000000080010100 r22 : a000000100c01ac0
r23 : a0000001000108e0 r24 : e000000477980004 r25 : 0000000000000000
r26 : 0000000000000000 r27 : e00000000913400c r28 : e0000004799ee51c
r29 : e0000004778b87f0 r30 : a000000100d2c300 r31 : a00000010005c7e0

Call Trace:
 [<a000000100014600>] show_stack+0x40/0xa0
                                sp=a000000100bcb760 bsp=a000000100bc4f40
 [<a000000100014f00>] show_regs+0x840/0x880
                                sp=a000000100bcb930 bsp=a000000100bc4ee8
 [<a000000100037fb0>] die+0x250/0x320
                                sp=a000000100bcb930 bsp=a000000100bc4ea0
 [<a00000010005e5f0>] ia64_do_page_fault+0x8d0/0xa20
                                sp=a000000100bcb950 bsp=a000000100bc4e50
 [<a00000010000caa0>] ia64_leave_kernel+0x0/0x290
                                sp=a000000100bcb9e0 bsp=a000000100bc4e50
 [<a0000001000e1410>] __do_IRQ+0x370/0x3e0
                                sp=a000000100bcbbb0 bsp=a000000100bc4df0
 [<a000000100011f50>] ia64_handle_irq+0x170/0x220
                                sp=a000000100bcbbb0 bsp=a000000100bc4dc0
 [<a00000010000caa0>] ia64_leave_kernel+0x0/0x290
                                sp=a000000100bcbbb0 bsp=a000000100bc4dc0
 [<a000000100012390>] ia64_pal_call_static+0x90/0xc0
                                sp=a000000100bcbd80 bsp=a000000100bc4d78
 [<a000000100015630>] default_idle+0x90/0x160
                                sp=a000000100bcbd80 bsp=a000000100bc4d58
 [<a000000100014290>] cpu_idle+0x1f0/0x440
                                sp=a000000100bcbe20 bsp=a000000100bc4d18
 [<a000000100009980>] rest_init+0xc0/0xe0
                                sp=a000000100bcbe20 bsp=a000000100bc4d00
 [<a0000001009f8ea0>] start_kernel+0x6a0/0x6c0
                                sp=a000000100bcbe20 bsp=a000000100bc4ca0
 [<a0000001000089f0>] __end_ivt_text+0x6d0/0x6f0
                                sp=a000000100bcbe30 bsp=a000000100bc4c00
 <0>Kernel panic - not syncing: Aiee, killing interrupt handler!

The root cause is that some irq_chip variables, especially ia64_msi_chip,
initiate their memeber end to point to NULL. __do_IRQ doesn't check
if irq_chip->end is null and just calls it after processing the interrupt.

As irq_chip->end is called at many places, so I fix it by reinitiating
irq_chip->end to dummy_irq_chip.end, e.g., a noop function.

Signed-off-by: Zhang Yanmin <yanmin.zhang@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "Luck, Tony" <tony.luck@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-11-16 11:43:37 -08:00
Linus Torvalds
9a3a04ac38 Revert "[PATCH] fix Data Acess error in dup_fd"
This reverts commit 0130b0b32e.

Sergey Vlasov points out (and Vadim Lobanov concurs) that the bug it was
supposed to fix must be some unrelated memory corruption, and the "fix"
actually causes more problems:

  "However, the new code does not look safe in all cases.  If some other
   task has opened more files while dup_fd() released oldf->file_lock, the
   new code will update open_files to the new larger value.  But newf was
   allocated with the old smaller value of open_files, therefore subsequent
   accesses to newf may try to write into unallocated memory."

so revert it.

Cc: Sharyathi Nagesh <sharyath@in.ibm.com>
Cc: Sergey Vlasov <vsu@altlinux.ru>
Cc: Vadim Lobanov <vlobanov@speakeasy.net>
Cc: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-11-14 15:20:51 -08:00
Andrew Morton
8b126b7753 [PATCH] setup_irq(): better mismatch debugging
When we get a mismatch between handlers on the same IRQ, all we get is "IRQ
handler type mismatch for IRQ n".  Let's print the name of the
presently-registered handler with which we got the mismatch.

Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-11-14 09:09:26 -08:00
Pavel Emelianov
f72fa70760 [PATCH] Fix misrouted interrupts deadlocks
While testing kernel on machine with "irqpoll" option I've caught such a
lockup:

	__do_IRQ()
	   spin_lock(&desc->lock);
           desc->chip->ack(); /* IRQ is ACKed */
	note_interrupt()
	misrouted_irq()
	handle_IRQ_event()
           if (...)
	      local_irq_enable_in_hardirq();
	/* interrupts are enabled from now */
	...
	__do_IRQ() /* same IRQ we've started from */
	   spin_lock(&desc->lock); /* LOCKUP */

Looking at misrouted_irq() code I've found that a potential deadlock like
this can also take place:

1CPU:
__do_IRQ()
   spin_lock(&desc->lock); /* irq = A */
misrouted_irq()
   for (i = 1; i < NR_IRQS; i++) {
      spin_lock(&desc->lock); /* irq = B */
      if (desc->status & IRQ_INPROGRESS) {

2CPU:
__do_IRQ()
   spin_lock(&desc->lock); /* irq = B */
misrouted_irq()
   for (i = 1; i < NR_IRQS; i++) {
      spin_lock(&desc->lock); /* irq = A */
      if (desc->status & IRQ_INPROGRESS) {

As the second lock on both CPUs is taken before checking that this irq is
being handled in another processor this may cause a deadlock.  This issue
is only theoretical.

I propose the attached patch to fix booth problems: when trying to handle
misrouted IRQ active desc->lock may be unlocked.

Acked-by: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-11-13 07:40:43 -08:00
Sharyathi Nagesh
0130b0b32e [PATCH] fix Data Acess error in dup_fd
On running the Stress Test on machine for more than 72 hours following
error message was observed.

0:mon> e
cpu 0x0: Vector: 300 (Data Access) at [c00000007ce2f7f0]
    pc: c000000000060d90: .dup_fd+0x240/0x39c
    lr: c000000000060d6c: .dup_fd+0x21c/0x39c
    sp: c00000007ce2fa70
   msr: 800000000000b032
   dar: ffffffff00000028
 dsisr: 40000000
  current = 0xc000000074950980
  paca    = 0xc000000000454500
    pid   = 27330, comm = bash

0:mon> t
[c00000007ce2fa70] c000000000060d28 .dup_fd+0x1d8/0x39c (unreliable)
[c00000007ce2fb30] c000000000060f48 .copy_files+0x5c/0x88
[c00000007ce2fbd0] c000000000061f5c .copy_process+0x574/0x1520
[c00000007ce2fcd0] c000000000062f88 .do_fork+0x80/0x1c4
[c00000007ce2fdc0] c000000000011790 .sys_clone+0x5c/0x74
[c00000007ce2fe30] c000000000008950 .ppc_clone+0x8/0xc

The problem is because of race window.  When if(expand) block is executed in
dup_fd unlocking of oldf->file_lock give a window for fdtable in oldf to be
modified.  So actual open_files in oldf may not match with open_files
variable.

Cc: Vadim Lobanov <vlobanov@speakeasy.net>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-11-13 07:40:43 -08:00
Eric W. Biederman
d99f160ac5 [PATCH] sysctl: allow a zero ctl_name in the middle of a sysctl table
Since it is becoming clear that there are just enough users of the binary
sysctl interface that completely removing the binary interface from the kernel
will not be an option for foreseeable future, we need to find a way to address
the sysctl maintenance issues.

The basic problem is that sysctl requires one central authority to allocate
sysctl numbers, or else conflicts and ABI breakage occur.  The proc interface
to sysctl does not have that problem, as names are not densely allocated.

By not terminating a sysctl table until I have neither a ctl_name nor a
procname, it becomes simple to add sysctl entries that don't show up in the
binary sysctl interface.  Which allows people to avoid allocating a binary
sysctl value when not needed.

I have audited the kernel code and in my reading I have not found a single
sysctl table that wasn't terminated by a completely zero filled entry.  So
this change in behavior should not affect anything.

I think this mechanism eases the pain enough that combined with a little
disciple we can solve the reoccurring sysctl ABI breakage.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Acked-by: Alan Cox <alan@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-11-06 01:46:23 -08:00
Eric W. Biederman
0e009be8a0 [PATCH] Improve the removed sysctl warnings
Don't warn about libpthread's access to kernel.version.  When it receives
-ENOSYS it will read /proc/sys/kernel/version.

If anything else shows up print the sysctl number string.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: Cal Peake <cp@absolutedigital.net>
Cc: Alan Cox <alan@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-11-06 01:46:23 -08:00
Peter Zijlstra
64efade11c [PATCH] lockdep: fix delayacct locking bug
Make the delayacct lock irqsave; this avoids the possible deadlock where
an interrupt is taken while holding the delayacct lock which needs to
take the delayacct lock.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Shailabh Nagar <nagar@watson.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-11-06 01:46:23 -08:00
Gautham R Shenoy
4b96b1a10c [PATCH] Fix the spurious unlock_cpu_hotplug false warnings
Cpu-hotplug locking has a minor race case caused because of setting the
variable "recursive" to NULL *after* releasing the cpu_bitmask_lock in the
function unlock_cpu_hotplug,instead of doing so before releasing the
cpu_bitmask_lock.

This was the cause of most of the recent false spurious lock_cpu_unlock
warnings.

This should fix the problem reported by Martin Lorenz reported in
http://lkml.org/lkml/2006/10/29/127.

Thanks to Srinivasa DS for pointing it out.

Signed-off-by: Gautham R Shenoy <ego@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-11-06 01:46:22 -08:00
Linus Torvalds
10b1fbdb0a Make sure "user->sigpending" count is in sync
The previous commit (45c18b0bb5, aka "Fix
unlikely (but possible) race condition on task->user access") fixed a
potential oops due to __sigqueue_alloc() getting its "user" pointer out
of sync with switch_user(), and accessing a user pointer that had been
de-allocated on another CPU.

It still left another (much less serious) problem, where a concurrent
__sigqueue_alloc and swich_user could cause sigqueue_alloc to do signal
pending reference counting for a _different_ user than the one it then
actually ended up using.  No oops, but we'd end up with the wrong signal
accounting.

Another case of Oleg's eagle-eyes picking up the problem.

This is trivially fixed by just making sure we load whichever "user"
structure we decide to use (it doesn't matter _which_ one we pick, we
just need to pick one) just once.

Acked-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Andrew Morton <akpm@osdl.org>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-11-04 13:03:00 -08:00
Linus Torvalds
45c18b0bb5 Fix unlikely (but possible) race condition on task->user access
There's a possible race condition when doing a "switch_uid()" from one
user to another, which could race with another thread doing a signal
allocation and looking at the old thread ->user pointer as it is freed.

This explains an oops reported by Lukasz Trabinski:
	http://permalink.gmane.org/gmane.linux.kernel/462241

We fix this by delaying the (reference-counted) freeing of the user
structure until the thread signal handler lock has been released, so
that we know that the signal allocation has either seen the new value or
has properly incremented the reference count of the old one.

Race identified by Oleg Nesterov.

Cc: Lukasz Trabinski <lukasz@wsisiz.edu.pl>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Andrew Morton <akpm@osdl.org>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-11-04 10:06:02 -08:00
Stephen Rothwell
3fd5939798 [PATCH] Create compat_sys_migrate_pages
This is needed on bigendian 64bit architectures.

Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Acked-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-11-03 12:27:59 -08:00
Rafael J. Wysocki
b918f6e62c [PATCH] swsusp: debugging
Add a swsusp debugging mode.  This does everything that's needed for a suspend
except for actually suspending.  So we can look in the log messages and work
out a) what code is being slow and b) which drivers are misbehaving.

(1)
# echo testproc > /sys/power/disk
# echo disk > /sys/power/state

This should turn off the non-boot CPU, freeze all processes, wait for 5
seconds and then thaw the processes and the CPU.

(2)
# echo test > /sys/power/disk
# echo disk > /sys/power/state

This should turn off the non-boot CPU, freeze all processes, shrink
memory, suspend all devices, wait for 5 seconds, resume the devices etc.

Cc: Pavel Machek <pavel@ucw.cz>
Cc: Stefan Seyfried <seife@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-11-03 12:27:58 -08:00
Andrew Morton
19c6b6ed3f [PATCH] schedule removal of FUTEX_FD
Apparently FUTEX_FD is unfixably racy and nothing uses it (or if it does, it
shouldn't).

Add a warning printk, give any remaining users six months to migrate off it.

Cc: Ulrich Drepper <drepper@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-11-03 12:27:58 -08:00
Andrew Morton
f46c483357 [PATCH] Add printk_timed_ratelimit()
printk_ratelimit() has global state which makes it not useful for callers
which wish to perform ratelimiting at a particular frequency.

Add a printk_timed_ratelimit() which utilises caller-provided state storage to
permit more flexibility.

This function can in fact be used for things other than printk ratelimiting
and is perhaps poorly named.

Cc: Ulrich Drepper <drepper@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-11-03 12:27:58 -08:00
Rafael J. Wysocki
9185cfa925 ACPI: S4: Use "platform" rather than "shutdown" mode by default
It has been reported that on some systems the functionality after a resume
from disk is limited if the system is simply powered off during the suspend
instead of using the ACPI S4 suspend (aka platform mode).

Unfortunately the default is currently to power off the system during the
suspend so the users of these systems experience problems after the resume
if they don't switch to the platform mode explicitly.  This patch makes swsusp
use the platform mode by default to avoid such situations.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Stefan Seyfried <seife@suse.de>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Len Brown <len.brown@intel.com>
2006-11-01 22:40:23 -05:00
Oleg Nesterov
4a279ff1ea [PATCH] taskstats: fix sub-threads accounting
If there are no listeners, taskstats_exit_send() just returns because
taskstats_exit_alloc() didn't allocate *tidstats.  This is wrong, each
sub-thread should do fill_tgid_exit() on exit, otherwise its ->delays is
not recorded in ->signal->stats and lost.

Q: We don't send TASKSTATS_TYPE_AGGR_TGID when single-threaded process
exits.  Is it good?  How can the listener figure out that it was actually a
process exit, not sub-thread?

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Balbir Singh <balbir@in.ibm.com>
Acked-by: Shailabh Nagar <nagar@watson.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-31 08:07:00 -08:00
Oleg Nesterov
f0ec1aaf54 [PATCH] xacct_add_tsk: fix pure theoretical ->mm use-after-free
Paranoid fix. The task can free its ->mm after the 'if (p->mm)' check.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Shailabh Nagar <nagar@watson.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Jay Lan <jlan@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-30 12:08:41 -08:00
Randy Dunlap
0e2d57fc6e [PATCH] ndiswrapper: don't set the module->taints flags
For ndiswrapper, don't set the module->taints flags, just set the kernel
global tainted flag.  This should allow ndiswrapper to continue to use GPL
symbols.

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Florin Malita <fmalita@gmail.com>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-30 12:08:40 -08:00
Oleg Nesterov
3d8334def5 [PATCH] taskstats: fix sk_buff size calculation
prepare_reply() adds GENL_HDRLEN to the payload (genlmsg_total_size()),
but then it does genlmsg_put()->nlmsg_put().  This means we forget to
reserve a room for 'struct nlmsghdr'.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Thomas Graf <tgraf@suug.ch>
Cc: Andrew Morton <akpm@osdl.org>
Cc: Shailabh Nagar <nagar@watson.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Jay Lan <jlan@sgi.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-29 12:07:37 -08:00
Oleg Nesterov
d46a3d0d07 [PATCH] taskstats: fix sk_buff leak
'return genlmsg_cancel()' in taskstats_user_cmd/taskstats_exit_send
potentially leaks a skb.  Unless we pass 'rep_skb' to the netlink layer
we own sk_buff.  This means we should always do kfree_skb() on failure.

[ Thomas acked and pointed out missing return value in original version ]

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Acked-by: Thomas Graf <tgraf@suug.ch>
Cc: Andrew Morton <akpm@osdl.org>
Cc: Shailabh Nagar <nagar@watson.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Jay Lan <jlan@sgi.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-29 12:07:37 -08:00
Alan Stern
057647fc47 [PATCH] workqueue: update kerneldoc
This patch (as812) changes the kerneldoc comments explaining the return
values from queue_work(), queue_delayed_work(), and
queue_delayed_work_on().  The updated comments explain more accurately the
meaning of the return code and avoid suggesting that a 0 value means the
routine was unsuccessful.

Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-28 11:30:55 -07:00
Satoru Takeuchi
8fa1d7d3b2 [PATCH] cpu-hotplug: release `workqueue_mutex' properly on CPU hot-remove
_cpu_down() acquires `workqueue_mutex' on its process, but doen't release it
if __cpu_disable() fails.

Signed-off-by: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-28 11:30:55 -07:00
Jim Houston
bb1d860551 [PATCH] time_adjust cleared before use
I notice that the code which implements adjtime clears the time_adjust
value before using it.  The attached patch makes the obvious fix.

Acked-by: Roman Zippel <zippel@linux-m68k.org>
Signed-off-by: Jim Houston <jim.houston@ccur.com>
Cc: John Stultz <johnstul@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-28 11:30:55 -07:00
Oleg Nesterov
d7c3f5f231 [PATCH] fill_tgid: cleanup delays accounting
fill_tgid() should skip not only an already exited group leader.  If the
task has ->exit_state != 0 it already did exit_notify(), so it also did
fill_tgid_exit()->delayacct_add_tsk(->signal->stats) and we should skip it
to avoid a double accounting.

This patch doesn't close the race completely, but it cleanups the code.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Shailabh Nagar <nagar@watson.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Jay Lan <jlan@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-28 11:30:55 -07:00
Oleg Nesterov
a98b609426 [PATCH] taskstats: don't use tasklist_lock
Remove tasklist_lock from taskstats.c. find_task_by_pid() is rcu-safe.
->siglock allows us to traverse subthread without tasklist.

Q: delay accounting looks wrong to me.  If sub-thread has already called
taskstats_exit_send() but didn't call release_task(self) yet it will be
accounted twice.  The window is big.  No?

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Shailabh Nagar <nagar@watson.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Jay Lan <jlan@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-28 11:30:54 -07:00
Oleg Nesterov
b8534d7bd8 [PATCH] taskstats: kill ->taskstats_lock in favor of ->siglock
signal_struct is (mostly) protected by ->sighand->siglock, I think we don't
need ->taskstats_lock to protect ->stats.  This also allows us to simplify the
locking in fill_tgid().

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Shailabh Nagar <nagar@watson.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Jay Lan <jlan@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-28 11:30:54 -07:00
Oleg Nesterov
093a8e8aec [PATCH] taskstats_tgid_free: fix usage
taskstats_tgid_free() is called on copy_process's error path. This is wrong.

	IF (clone_flags & CLONE_THREAD)
		We should not clear ->signal->taskstats, current uses it,
		it probably has a valid accumulated info.
	ELSE
		taskstats_tgid_init() set ->signal->taskstats = NULL,
		there is nothing to free.

Move the callsite to __exit_signal(). We don't need any locking, entire
thread group is exiting, nobody should have a reference to soon to be
released ->signal.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Shailabh Nagar <nagar@watson.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Jay Lan <jlan@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-28 11:30:54 -07:00
Oleg Nesterov
05d5bcd60e [PATCH] bacct_add_tsk: fix unsafe and wrong parent/group_leader dereference
1. ts = timespec_sub(uptime, current->group_leader->start_time);

   It is possible that current != tsk. Probably it was supposed
   to be 'tsk->group_leader->start_time. But why we are reading
   group_leader's start_time ? This accounting is per thread,
   not per procees, I changed this to 'tsk->start_time.
   Please corect me.

2. stats->ac_ppid = (tsk->parent) ? tsk->parent->pid : 0;

   tsk->parent never == NULL, and it is unsafe to dereference it.
   Both the task and it's parent may exit after the caller unlocks
   tasklist_lock, the memory could be unmapped (DEBUG_SLAB).
   (And we should use ->real_parent->tgid in fact).

Q: I don't understand the 'if (thread_group_leader(tsk))' check.
Why it is needed ?

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Shailabh Nagar <nagar@watson.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Acked-by: Jay Lan <jlan@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-28 11:30:54 -07:00
Oleg Nesterov
fca178c0c6 [PATCH] fill_tgid: fix task_struct leak and possible oops
1. fill_tgid() forgets to do put_task_struct(first).

2. release_task(first) can happen after fill_tgid() drops tasklist_lock,
   it is unsafe to dereference first->signal.

This is a temporary fix, imho the locking should be reworked.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Shailabh Nagar <nagar@watson.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Jay Lan <jlan@sgi.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-28 11:30:54 -07:00
Stephen Rothwell
5fa3839a64 [PATCH] Constify compat_get_bitmap argument
This means we can call it when the bitmap we want to fetch is declared
const.

Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Christoph Lameter <clameter@engr.sgi.com>
Cc: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-28 11:30:54 -07:00
Jan Dittmer
1d4d262769 [PATCH] Add missing space in module.c for taintskernel
Obvious fix.

Signed-off-by: Jan Dittmer <jdi@l4x.org>
Acked-by: Florin Malita <fmalita@gmail.com>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-28 11:30:53 -07:00
Jan Beulich
690a973f48 [PATCH] x86-64: Speed up dwarf2 unwinder
This changes the dwarf2 unwinder to do a binary search for CIEs
instead of a linear work. The linker is unfortunately not
able to build a proper lookup table at link time, instead it creates
one at runtime as soon as the bootmem allocator is usable (so you'll continue
using the linear lookup for the first [hopefully] few calls).
The code should be ready to utilize a build-time created table once
a fixed linker becomes available.

Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Andi Kleen <ak@suse.de>
2006-10-21 18:37:01 +02:00
Alexey Dobriyan
e05d722e45 [PATCH] kernel/nsproxy.c: use kmemdup()
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-20 10:26:44 -07:00
Randy Dunlap
d6f8ff7381 [PATCH] cad_pid sysctl with PROC_FS=n
If CONFIG_PROC_FS=n:

kernel/sysctl.c:148: warning: 'proc_do_cad_pid' used but never defined
kernel/built-in.o:(.data+0x1228): undefined reference to `proc_do_cad_pid'
make: *** [.tmp_vmlinux1] Error 1

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-20 10:26:38 -07:00
Borislav Petkov
91fcdd4e03 [PATCH] readjust comments of task_timeslice for kernel doc
Signed-off-by: Borislav Petkov <petkov@math.uni-muenster.de>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-20 10:26:37 -07:00
Linus Torvalds
43f82216f0 Merge git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input
* git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input:
  Input: fm801-gp - handle errors from pci_enable_device()
  Input: gameport core - handle errors returned by device_bind_driver()
  Input: serio core - handle errors returned by device_bind_driver()
  Lockdep: fix compile error in drivers/input/serio/serio.c
  Input: serio - add lockdep annotations
  Lockdep: add lockdep_set_class_and_subclass() and lockdep_set_subclass()
  Input: atkbd - supress "too many keys" error message
  Input: i8042 - supress ACK/NAKs when blinking during panic
  Input: add missing exports to fix modular build
2006-10-17 08:56:43 -07:00
Neil Brown
bd5349cfd2 [PATCH] Convert cpu hotplug notifiers to use raw_notifier instead of blocking_notifier
The use of blocking notifier by _cpu_up and _cpu_down in cpu.c has two
problem.

1/ An interaction with the workqueue notifier causes lockdep to spit a
   warning.

2/ A notifier could conceivable be added or removed while _cpu_up or
   _cpu_down are in process.  As each notifier is called twice (prepare
   then commit/abort) this could be unhealthy.

To fix to we simply take cpu_add_remove_lock while adding or removing
notifiers to/from the list.

This makes the 'blocking' usage unnecessary as all accesses to cpu_chain
are now protected by cpu_add_remove_lock.  So change "blocking" to "raw" in
all relevant places.  This fixes 1.

Credit: Andrew Morton
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Michal Piotrowski <michal.k.k.piotrowski@gmail.com> (reporter)
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-17 08:18:48 -07:00
Peter Zijlstra
bea493a031 [PATCH] rt-mutex: fixup rt-mutex debug code
BUG: warning at kernel/rtmutex-debug.c:125/rt_mutex_debug_task_free() (Not tainted)
 [<c04051e3>] show_trace_log_lvl+0x58/0x16a
 [<c04057f0>] show_trace+0xd/0x10
 [<c0405900>] dump_stack+0x19/0x1b
 [<c043f03d>] rt_mutex_debug_task_free+0x35/0x6a
 [<c04224c0>] free_task+0x15/0x24
 [<c042378c>] copy_process+0x12bd/0x1324
 [<c0423835>] do_fork+0x42/0x113
 [<c04021dd>] sys_fork+0x19/0x1b
 [<c0403fb7>] syscall_call+0x7/0xb

In copy_process(), dup_task_struct() also duplicates the ->pi_lock,
->pi_waiters and ->pi_blocked_on members.  rt_mutex_debug_task_free()
called from free_task() validates these members.  However free_task() can
be invoked before these members are reset for the new task.

Move the initialization code before the first bail that can hit free_task().

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-17 08:18:48 -07:00
Ingo Molnar
a460e745e8 [PATCH] genirq: clean up irq-flow-type naming
Introduce desc->name and eliminate the handle_irq_name() hack.  Add
set_irq_chip_and_handler_name() to set the flow type and name at once.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Matthew Wilcox <willy@debian.org>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-17 08:18:45 -07:00
Andrew Morton
c60099bfe3 [PATCH] swsusp: fix memory leaks
My fancy new swsusp IO code had a big memory leak.  It's somewhat invisible
because the whole mem_map[] gets overwritten after resume, but it can cause us
to get low on memory during the actual suspend process.

Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-17 08:18:44 -07:00
Thomas Gleixner
ac08c26492 [PATCH] posix-cpu-timers: prevent signal delivery starvation
The integer divisions in the timer accounting code can round the result
down to 0.  Adding 0 is without effect and the signal delivery stops.

Clamp the division result to minimum 1 to avoid this.

Problem was reported by Seongbae Park <spark@google.com>, who provided
also an inital patch.

Roland sayeth:

  I have had some more time to think about the problem, and to reproduce it
  using Toyo's test case.  For the record, if my understanding of the problem
  is correct, this happens only in one very particular case.  First, the
  expiry time has to be so soon that in cputime_t units (usually 1s/HZ ticks)
  it's < nthreads so the division yields zero.  Second, it only affects each
  thread that is so new that its CPU time accumulation is zero so now+0 is
  still zero and ->it_*_expires winds up staying zero.  For the VIRT and PROF
  clocks when cputime_t is tick granularity (or the SCHED clock on
  configurations where sched_clock's value only advances on clock ticks), this
  is not hard to arrange with new threads starting up and blocking before they
  accumulate a whole tick of CPU time.  That's what happens in Toyo's test
  case.

  Note that in general it is fine for that division to round down to zero,
  and set each thread's expiry time to its "now" time.  The problem only
  arises with thread's whose "now" value is still zero, so that now+0 winds up
  0 and is interpreted as "not set" instead of ">= now".  So it would be a
  sufficient and more precise fix to just use max(ticks, 1) inside the loop
  when setting each it_*_expires value.

  But, it does no harm to round the division up to one and always advance
  every thread's expiry time.  If the thread didn't already fire timers for
  the expiry time of "now", there is no expectation that it will do so before
  the next tick anyway.  So I followed Thomas's patch in lifting the max out
  of the loops.

  This patch also covers the reload cases, which are harder to write a test
  for (and I didn't try).  I've tested it with Toyo's case and it fixes that.

[toyoa@mvista.com: fix: min_t -> max_t]
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Roland McGrath <roland@redhat.com>
Cc: Daniel Walker <dwalker@mvista.com>
Cc: Toyo Abe <toyoa@mvista.com>
Cc: john stultz <johnstul@us.ibm.com>
Cc: Roman Zippel <zippel@linux-m68k.org>
Cc: Seongbae Park <spark@google.com>
Cc: Peter Mattis <pmattis@google.com>
Cc: Rohit Seth <rohitseth@google.com>
Cc: Martin Bligh <mbligh@google.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-17 08:18:43 -07:00
john stultz
3f4a0b917c [PATCH] i386 Time: Avoid PIT SMP lockups
Avoid possible PIT livelock issues seen on SMP systems (and reported by
Andi), by not allowing it as a clocksource on SMP boxes.

However, since the PIT may no longer be present, we have to properly handle
the cases where SMP systems have TSC skew and fall back from the TSC.
Since the PIT isn't there, it would "fall back" to the TSC again.  So this
changes the jiffies rating to 1, and the TSC-bad rating value to 0.

Thus you will get the following behavior priority on i386 systems:

tsc		[if present & stable]
hpet		[if present]
cyclone		[if present]
acpi_pm		[if present]
pit		[if UP]
jiffies

Rather then the current more complicated:
tsc		[if present & stable]
hpet		[if present]
cyclone		[if present]
acpi_pm		[if present]
pit		[if cpus < 4]
tsc		[if present & unstable]
jiffies

Signed-off-by: John Stultz <johnstul@us.ibm.com>
Cc: Andi Kleen <ak@suse.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-17 08:18:42 -07:00
Ingo Molnar
ca268c691d [PATCH] lockdep: increase max allowed recursion depth
In general, lockdep warnings are intended to be non-fatal, so I have put in
various practical limits on internal data structure failure modes.  We haven't
had a /single/ lockdep-internal crash ever since lockdep went upstream [the
unwinder crashes are outside of lockdep], and that's largely due to the good
internal checks it does.

Recursion within the dependency graph is currently limited to 20, that's
probably not enough on some many-CPU boxes - this patch doubles it to 40.  I
have written the lockdep functions to have as small stackframes as possible,
so 40 should be OK too.  (The practical recursion limit should be somewhere
between 100 and 200 entries.  If we hit that then I'll change the algorithm to
be iteration-based.  Graph walking logic is so easy to program via recursion,
so i'd like to keep recursion as long as possible.)

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-17 08:18:42 -07:00
Randy Dunlap
39af114377 [PATCH] fix epoll_pwait when EPOLL=n
Fixes http://bugzilla.kernel.org/show_bug.cgi?id=7371

sys_epoll_pwait needs to be listed as a conditional (weak)
entry point for CONFIG_EPOLL=n.

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-16 09:14:05 -07:00
Ingo Molnar
256a6b4136 [PATCH] lockdep: fix printk recursion logic
Bug reported and fixed by Tilman Schmidt <tilman@imap.cc>: if lockdep is
enabled then log messages make it to /var/log/messages belatedly.  The
reason is a missed wakeup of klogd.

Initially there was only a lockdep_internal() protection against lockdep
recursion within vprintk() - it grew the 'outer' lockdep_off()/on()
protection only later on.  But that lockdep_off() made the
release_console_sem() within vprintk() always happen under the
lockdep_internal() condition, causing the bug.

The right solution to remove the inner protection against recursion here -
the outer one is enough.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: Tilman Schmidt <tilman@imap.cc>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-11 11:14:24 -07:00
Alexey Dobriyan
3dc3099a9b [PATCH] lockdep: use BUILD_BUG_ON
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-11 11:14:24 -07:00
Reinette Chatre
01a3ee2b20 [PATCH] bitmap: parse input from kernel and user buffers
lib/bitmap.c:bitmap_parse() is a library function that received as input a
user buffer.  This seemed to have originated from the way the write_proc
function of the /proc filesystem operates.

This has been reworked to not use kmalloc and eliminates a lot of
get_user() overhead by performing one access_ok before using __get_user().

We need to test if we are in kernel or user space (is_user) and access the
buffer differently.  We cannot use __get_user() to access kernel addresses
in all cases, for example in architectures with separate address space for
kernel and user.

This function will be useful for other uses as well; for example, taking
input for /sysfs instead of /proc, so it was changed to accept kernel
buffers.  We have this use for the Linux UWB project, as part as the
upcoming bandwidth allocator code.

Only a few routines used this function and they were changed too.

Signed-off-by: Reinette Chatre <reinette.chatre@linux.intel.com>
Signed-off-by: Inaky Perez-Gonzalez <inaky@linux.intel.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Joe Korty <joe.korty@ccur.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-11 11:14:22 -07:00
Nick Piggin
beed33a816 [PATCH] sched: likely profiling
This likely profiling is pretty fun. I found a few possible problems
in sched.c.

This patch may be not measurable, but when I did measure long ago,
nooping (un)likely cost a couple of % on scheduler heavy benchmarks, so
it all adds up.

Tweak some branch hints:

- the 2nd 64 bits in the bitmask is likely to be populated, because it
  contains the first 28 bits (nearly 3/4) of the normal priorities.
  (ratio of 669669:691 ~= 1000:1).

- it isn't unlikely that context switching switches to another process. it
  might be very rapidly switching to and from the idle process (ratio of
  475815:419004 and 471330:423544). Let the branch predictor decide.

- preempt_enable seems to be very often called in a nested preempt_disable
  or with interrupts disabled (ratio of 3567760:87965 ~= 40:1)

Signed-off-by: Nick Piggin <npiggin@suse.de>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: Daniel Walker <dwalker@mvista.com>
Cc: Hua Zhong <hzhong@gmail.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-11 11:14:22 -07:00
Florin Malita
fa3ba2e81e [PATCH] fix Module taint flags listing in Oops/panic
Module taint flags listing in Oops/panic has a couple of issues:

* taint_flags() doesn't null-terminate the buffer after printing the flags

* per-module taints are only set if the kernel is not already tainted
  (with that particular flag) => only the first offending module gets its
  taint info correctly updated

Some additional changes:

* 'license_gplok' is no longer needed - equivalent to !(taints &
  TAINT_PROPRIETARY_MODULE) - so we can drop it from struct module *
  exporting module taint info via /proc/module:

pwc 88576 0 - Live 0xf8c32000
evilmod 6784 1 pwc, Live 0xf8bbf000 (PF)

Signed-off-by: Florin Malita <fmalita@gmail.com>
Cc: "Randy.Dunlap" <rdunlap@xenotime.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-11 11:14:21 -07:00
Christoph Lameter
469340236a [PATCH] mm: kevent threads: use MPOL_DEFAULT
Switch the memory policy of the kevent threads to MPOL_DEFAULT while
leaving the kzalloc of the workqueue structure on interleave.  This means
that all code executed in the context of the kevent thread is allocating
node local.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: Alok Kataria <alok.kataria@calsoftinc.com>
Cc: Andi Kleen <ak@suse.de>
Cc: <pj@sgi.com>
Cc: <shai@scalex86.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-11 11:14:19 -07:00
Rafael J. Wysocki
97c7801cd5 [PATCH] swsusp: Use suspend_console
Add suspend_console() and resume_console() to the suspend-to-disk code paths
so that the users of netconsole can use swsusp with it.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-11 11:14:14 -07:00
Peter Zijlstra
4dfbb9d8c6 Lockdep: add lockdep_set_class_and_subclass() and lockdep_set_subclass()
This annotation makes it possible to assign a subclass on lock init. This
annotation is meant to reduce the _nested() annotations by assigning a
default subclass.

One could do without this annotation and rely on lockdep_set_class()
exclusively, but that would require a manual stack of struct lock_class_key
objects.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Dmitry Torokhov <dtor@mail.ru>
2006-10-11 01:45:14 -04:00
Al Viro
1af9892811 [PATCH] cpuset ANSI prototype
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Acked-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-10 15:37:23 -07:00
Al Viro
ba2397efe1 [PATCH] make kernel/relay.c __user-clean
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-10 15:37:22 -07:00
Al Viro
ba46df984b [PATCH] __user annotations: futex
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-10 15:37:22 -07:00
Rafael J. Wysocki
5c339d4541 [PATCH] swsusp: Make userland suspend work on SMP again
Unfortunately one of the recent changes in swsusp has broken the userland
suspend on SMP.  Fix it.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-07 10:51:14 -07:00
Frederik Deweerdt
e317c8ccaa [PATCH] ixp4xxdefconfig arm fixes
With the following patch, the ixp4xxdefconfig builds correctly.  I'll
test some more configs if I get some time.

Signed-off-by: Frederik Deweerdt <frederik.deweerdt@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-06 12:11:08 -07:00
Andrew Morton
4899b8b16b [PATCH] kauditd_thread warning fix
Squash this warning:

  kernel/audit.c: In function 'kauditd_thread':
  kernel/audit.c:367: warning: no return statement in function returning non-void

We might as test kthread_should_stop(), although it's not very pointful at
present.

The code which starts this thread looks racy - the kernel could start multiple
threads.

Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Jeff Garzik <jeff@garzik.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-06 08:53:39 -07:00
David Howells
7d12e780e0 IRQ: Maintain regs pointer globally rather than passing to IRQ handlers
Maintain a per-CPU global "struct pt_regs *" variable which can be used instead
of passing regs around manually through all ~1800 interrupt handlers in the
Linux kernel.

The regs pointer is used in few places, but it potentially costs both stack
space and code to pass it around.  On the FRV arch, removing the regs parameter
from all the genirq function results in a 20% speed up of the IRQ exit path
(ie: from leaving timer_interrupt() to leaving do_IRQ()).

Where appropriate, an arch may override the generic storage facility and do
something different with the variable.  On FRV, for instance, the address is
maintained in GR28 at all times inside the kernel as part of general exception
handling.

Having looked over the code, it appears that the parameter may be handed down
through up to twenty or so layers of functions.  Consider a USB character
device attached to a USB hub, attached to a USB controller that posts its
interrupts through a cascaded auxiliary interrupt controller.  A character
device driver may want to pass regs to the sysrq handler through the input
layer which adds another few layers of parameter passing.

I've build this code with allyesconfig for x86_64 and i386.  I've runtested the
main part of the code on FRV and i386, though I can't test most of the drivers.
I've also done partial conversion for powerpc and MIPS - these at least compile
with minimal configurations.

This will affect all archs.  Mostly the changes should be relatively easy.
Take do_IRQ(), store the regs pointer at the beginning, saving the old one:

	struct pt_regs *old_regs = set_irq_regs(regs);

And put the old one back at the end:

	set_irq_regs(old_regs);

Don't pass regs through to generic_handle_irq() or __do_IRQ().

In timer_interrupt(), this sort of change will be necessary:

	-	update_process_times(user_mode(regs));
	-	profile_tick(CPU_PROFILING, regs);
	+	update_process_times(user_mode(get_irq_regs()));
	+	profile_tick(CPU_PROFILING);

I'd like to move update_process_times()'s use of get_irq_regs() into itself,
except that i386, alone of the archs, uses something other than user_mode().

Some notes on the interrupt handling in the drivers:

 (*) input_dev() is now gone entirely.  The regs pointer is no longer stored in
     the input_dev struct.

 (*) finish_unlinks() in drivers/usb/host/ohci-q.c needs checking.  It does
     something different depending on whether it's been supplied with a regs
     pointer or not.

 (*) Various IRQ handler function pointers have been moved to type
     irq_handler_t.

Signed-Off-By: David Howells <dhowells@redhat.com>
(cherry picked from 1b16e7ac850969f38b375e511e3fa2f474a33867 commit)
2006-10-05 15:10:12 +01:00
David Howells
da482792a6 IRQ: Typedef the IRQ handler function type
Typedef the IRQ handler function type.

Signed-Off-By: David Howells <dhowells@redhat.com>
(cherry picked from 1356d1e5fd256997e3d3dce0777ab787d0515c7a commit)
2006-10-05 13:28:27 +01:00
David Howells
57a58a9435 IRQ: Typedef the IRQ flow handler function type
Typedef the IRQ flow handler function type.

Signed-Off-By: David Howells <dhowells@redhat.com>
(cherry picked from 8e973fbdf5716b93a0a8c0365be33a31ca0fa351 commit)
2006-10-05 13:28:06 +01:00
Linus Torvalds
fefd26b3b8 Merge master.kernel.org:/pub/scm/linux/kernel/git/davej/configh
* master.kernel.org:/pub/scm/linux/kernel/git/davej/configh:
  Remove all inclusions of <linux/config.h>

Manually resolved trivial path conflicts due to removed files in
the sound/oss/ subdirectory.
2006-10-04 09:59:57 -07:00
Linus Torvalds
18e6756a6b Merge branch 'audit.b32' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/audit-current
* 'audit.b32' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/audit-current:
  [PATCH] message types updated
  [PATCH] name_count array overrun
  [PATCH] PPID filtering fix
  [PATCH] arch filter lists with < or > should not be accepted
2006-10-04 08:15:55 -07:00
Oleg Nesterov
20e9751bd9 [PATCH] rcu: simplify/improve batch tuning
Kill a hard-to-calculate 'rsinterval' boot parameter and per-cpu
rcu_data.last_rs_qlen.  Instead, it adds adds a flag rcu_ctrlblk.signaled,
which records the fact that one of CPUs has sent a resched IPI since the
last rcu_start_batch().

Roughly speaking, we need two rcu_start_batch()s in order to move callbacks
from ->nxtlist to ->donelist.  This means that when ->qlen exceeds qhimark
and continues to grow, we should send a resched IPI, and then do it again
after we gone through a quiescent state.

On the other hand, if it was already sent, we don't need to do it again
when another CPU detects overflow of the queue.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Acked-by: Paul E. McKenney <paulmck@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-04 07:55:31 -07:00
Josh Triplett
4b6c2cca6e [PATCH] rcu: add sched torture type to rcutorture
Implement torture testing for the "sched" variant of RCU, which uses
preempt_disable, preempt_enable, and synchronize_sched.

Signed-off-by: Josh Triplett <josh@freedesktop.org>
Acked-by: Paul E. McKenney <paulmck@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-04 07:55:31 -07:00
Josh Triplett
11a147013e [PATCH] rcu: add rcu_bh_sync torture type to rcutorture
Use the newly-generic synchronous deferred free function to implement torture
testing for rcu_bh using synchronize_rcu_bh rather than the asynchronous
call_rcu_bh.

Signed-off-by: Josh Triplett <josh@freedesktop.org>
Acked-by: Paul E. McKenney <paulmck@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-04 07:55:31 -07:00
Josh Triplett
20d2e4283a [PATCH] rcu: add rcu_sync torture type to rcutorture
Use the newly-generic synchronous deferred free function to implement torture
testing for RCU using synchronize_rcu rather than the asynchronous call_rcu.

Signed-off-by: Josh Triplett <josh@freedesktop.org>
Acked-by: Paul E. McKenney <paulmck@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-04 07:55:31 -07:00
Josh Triplett
e303373658 [PATCH] rcu: refactor srcu_torture_deferred_free to work for any implementation
Make srcu_torture_deferred_free use cur_ops->sync() so it will work for any
implementation.  Move and rename it in preparation for use in the ops of other
implementations.

Signed-off-by: Josh Triplett <josh@freedesktop.org>
Acked-by: Paul E. McKenney <paulmck@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-04 07:55:31 -07:00
Josh Triplett
b772e1dd4b [PATCH] RCU: add fake writers to rcutorture
rcutorture currently has one writer and an arbitrary number of readers.  To
better exercise some of the code paths in RCU implementations, add fake
writer threads which call the synchronize function for the RCU variant in a
loop, with a delay between calls to arrange for different numbers of
writers running in parallel.

[bunk@stusta.de: cleanup]
Acked-by: Paul McKenney <paulmck@us.ibm.com>
Cc: Dipkanar Sarma <dipankar@in.ibm.com>
Signed-off-by: Josh Triplett <josh@freedesktop.org>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-04 07:55:31 -07:00
Josh Triplett
75cfef32f2 [PATCH] rcu: Fix sign bug making rcu_random always return the same sequence
rcu_random uses a counter rrs_count to occasionally mix data from
get_random_bytes into the state of its pseudorandom generator.  However,
the rrs_counter gets declared as an unsigned long, and rcu_random checks
for --rrs_count < 0, so this code will never mix any real random data into
the state, and will thus always return the same sequence of random numbers.

Also, change the return value of rcu_random from long to unsigned long, to
avoid potential issues caused by the use of the % operator, which can
return negative values for negative left operands.

Signed-off-by: Josh Triplett <josh@freedesktop.org>
Acked-by: Paul E. McKenney <paulmck@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-04 07:55:31 -07:00
Josh Triplett
2860aaba4d [PATCH] rcu: Avoid kthread_stop on invalid pointer if rcutorture reader startup fails
rcu_torture_init kmallocs the array of reader threads, then creates each
one with kthread_run, cleaning up with rcu_torture_cleanup if this fails.
rcu_torture_cleanup calls kthread_stop on any non-NULL pointer in the
array; however, any readers after the one that failed to start up will have
invalid pointers, not null pointers.  Avoid this by using kzalloc instead.

Signed-off-by: Josh Triplett <josh@freedesktop.org>
Acked-by: Paul E. McKenney <paulmck@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-04 07:55:30 -07:00
Josh Triplett
3c29e03d91 [PATCH] rcu: Mention rcu_bh in description of rcutorture's torture_type parameter
The comment for rcutorture's torture_type parameter only lists the RCU
variants rcu and srcu, but not rcu_bh; add rcu_bh to the list.

Signed-off-by: Josh Triplett <josh@freedesktop.org>
Acked-by: Paul E. McKenney <paulmck@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-04 07:55:30 -07:00
Josh Triplett
ff2c93a537 [PATCH] rcu: Add MODULE_AUTHOR to rcutorture module
Signed-off-by: Josh Triplett <josh@freedesktop.org>
Acked-by: "Paul E. McKenney" <paulmck@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-04 07:55:30 -07:00
Alan Stern
e6a92013ba [PATCH] SRCU: report out-of-memory errors
Currently the init_srcu_struct() routine has no way to report out-of-memory
errors.  This patch (as761) makes it return -ENOMEM when the per-cpu data
allocation fails.

The patch also makes srcu_init_notifier_head() report a BUG if a notifier
head can't be initialized.  Perhaps it should return -ENOMEM instead, but
in the most likely cases where this might occur I don't think any recovery
is possible.  Notifier chains generally are not created dynamically.

[akpm@osdl.org: avoid statement-with-side-effect in macro]
Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
Acked-by: Paul E. McKenney <paulmck@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-04 07:55:30 -07:00
Alan Stern
eabc069401 [PATCH] Add SRCU-based notifier chains
This patch (as751) adds a new type of notifier chain, based on the SRCU
(Sleepable Read-Copy Update) primitives recently added to the kernel.  An
SRCU notifier chain is much like a blocking notifier chain, in that it must
be called in process context and its callout routines are allowed to sleep.
 The difference is that the chain's links are protected by the SRCU
mechanism rather than by an rw-semaphore, so calling the chain has
extremely low overhead: no memory barriers and no cache-line bouncing.  On
the other hand, unregistering from the chain is expensive and the chain
head requires special runtime initialization (plus cleanup if it is to be
deallocated).

SRCU notifiers are appropriate for notifiers that will be called very
frequently and for which unregistration occurs very seldom.  The proposed
"task notifier" scheme qualifies, as may some of the network notifiers.

Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
Acked-by: Paul E. McKenney <paulmck@us.ibm.com>
Acked-by: Chandra Seetharaman <sekharan@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-04 07:55:30 -07:00
Paul E. McKenney
b2896d2e75 [PATCH] srcu-3: add SRCU operations to rcutorture
Adds SRCU operations to rcutorture and updates rcutorture documentation.
Also increases the stress imposed by the rcutorture test.

[bunk@stusta.de: make needlessly global code static]
Signed-off-by: Paul E. McKenney <paulmck@us.ibm.com>
Cc: Paul E. McKenney <paulmck@us.ibm.com>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-04 07:55:30 -07:00
Paul E. McKenney
621934ee7e [PATCH] srcu-3: RCU variant permitting read-side blocking
Updated patch adding a variant of RCU that permits sleeping in read-side
critical sections.  SRCU is as follows:

o	Each use of SRCU creates its own srcu_struct, and each
	srcu_struct has its own set of grace periods.  This is
	critical, as it prevents one subsystem with a blocking
	reader from holding up SRCU grace periods for other
	subsystems.

o	The SRCU primitives (srcu_read_lock(), srcu_read_unlock(),
	and synchronize_srcu()) all take a pointer to a srcu_struct.

o	The SRCU primitives must be called from process context.

o	srcu_read_lock() returns an int that must be passed to
	the matching srcu_read_unlock().  Realtime RCU avoids the
	need for this by storing the state in the task struct,
	but SRCU needs to allow a given code path to pass through
	multiple SRCU domains -- storing state in the task struct
	would therefore require either arbitrary space in the
	task struct or arbitrary limits on SRCU nesting.  So I
	kicked the state-storage problem up to the caller.

	Of course, it is not permitted to call synchronize_srcu()
	while in an SRCU read-side critical section.

o	There is no call_srcu().  It would not be hard to implement
	one, but it seems like too easy a way to OOM the system.
	(Hey, we have enough trouble with call_rcu(), which does
	-not- permit readers to sleep!!!)  So, if you want it,
	please tell me why...

[josht@us.ibm.com: sparse notation]
Signed-off-by: Paul E. McKenney <paulmck@us.ibm.com>
Signed-off-by: Josh Triplett <josh@freedesktop.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-04 07:55:30 -07:00
Eric W. Biederman
1f80025e62 [PATCH] msi: simplify msi sanity checks by adding with generic irq code
Currently msi.c is doing sanity checks that make certain before an irq is
destroyed it has no more users.

By adding irq_has_action I can perform the test is a generic way, instead of
relying on a msi specific data structure.

By performing the core check in dynamic_irq_cleanup I ensure every user of
dynamic irqs has a test present and we don't free resources that are in use.

In msi.c this allows me to kill the attrib.state member of msi_desc and all of
the assciated code to maintain it.

To keep from freeing data structures when irq cleanup code is called to soon
changing dyanamic_irq_cleanup is insufficient because there are msi specific
data structures that are also not safe to free.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Andi Kleen <ak@suse.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Greg KH <greg@kroah.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-04 07:55:29 -07:00
Eric W. Biederman
3a16d71362 [PATCH] genirq: irq: add a dynamic irq creation API
With the msi support comes a new concept in irq handling, irqs that are
created dynamically at run time.

Currently the msi code allocates irqs backwards.  First it allocates a
platform dependent routing value for an interrupt the ``vector'' and then it
figures out from the vector which irq you are on.

This msi backwards allocator suffers from two basic problems.  The allocator
suffers because it is trying to do something that is architecture specific in
a generic way making it brittle, inflexible, and tied to tightly to the
architecture implementation.  The alloctor also suffers from it's very
backwards nature as it has tied things together that should have no
dependencies.

To solve the basic dynamic irq allocation problem two new architecture
specific functions are added: create_irq and destroy_irq.

create_irq takes no input and returns an unused irq number, that won't be
reused until it is returned to the free poll with destroy_irq.  The irq then
can be used for any purpose although the only initial consumer is the msi
code.

destroy_irq takes an irq number allocated with create_irq and returns it to
the free pool.

Making this functionality per architecture increases the simplicity of the irq
allocation code and increases it's flexibility.

dynamic_irq_init() and dynamic_irq_cleanup() are added to automate the
irq_desc initializtion that should happen for dynamic irqs.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Rajesh Shah <rajesh.shah@intel.com>
Cc: Andi Kleen <ak@muc.de>
Cc: "Protasevich, Natalie" <Natalie.Protasevich@UNISYS.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-04 07:55:27 -07:00
Eric W. Biederman
e7b946e98a [PATCH] genirq: irq: add moved_masked_irq
Currently move_native_irq disables and renables the irq we are migrating to
ensure we don't take that irq when we are actually doing the migration
operation.  Disabling the irq needs to happen but sometimes doing the work is
move_native_irq is too late.

On x86 with ioapics the irq move sequences needs to be:
edge_triggered:
  mask irq.
  move irq.
  unmask irq.
  ack irq.
level_triggered:
  mask irq.
  ack irq.
  move irq.
  unmask irq.

We can easily perform the edge triggered sequence, with the current defintion
of move_native_irq.  However the level triggered case does not map well.  For
that I have added move_masked_irq, to allow me to disable the irqs around both
the ack and the move.

Q: Why have we not seen this problem earlier?

A: The only symptom I have been able to reproduce is that if we change
   the vector before acknowleding an irq the wrong irq is acknowledged.
   Since we currently are not reprogramming the irq vector during
   migration no problems show up.

   We have to mask the irq before we acknowledge the irq or else we could
   hit a window where an irq is asserted just before we acknowledge it.

   Edge triggered irqs do not have this problem because acknowledgements
   do not propogate in the same way.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Rajesh Shah <rajesh.shah@intel.com>
Cc: Andi Kleen <ak@muc.de>
Cc: "Protasevich, Natalie" <Natalie.Protasevich@UNISYS.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-04 07:55:26 -07:00
Eric W. Biederman
a24ceab4f4 [PATCH] genirq: irq: convert the move_irq flag from a 32bit word to a single bit
The primary aim of this patchset is to remove maintenances problems caused by
the irq infrastructure.  The two big issues I address are an artificially
small cap on the number of irqs, and that MSI assumes vector == irq.  My
primary focus is on x86_64 but I have touched other architectures where
necessary to keep them from breaking.

- To increase the number of irqs I modify the code to look at the (cpu,
  vector) pair instead of just looking at the vector.

  With a large number of irqs available systems with a large irq count no
  longer need to compress their irq numbers to fit.  Removing a lot of brittle
  special cases.

  For acpi guys the result is that irq == gsi.

- Addressing the fact that MSI assumes irq == vector takes a few more
  patches.  But suffice it to say when I am done none of the generic irq code
  even knows what a vector is.

In quick testing on a large Unisys x86_64 machine we stumbled over at least
one driver that assumed that NR_IRQS could always fit into an 8 bit number.
This driver is clearly buggy today.  But this has become a class of bugs that
it is now much easier to hit.

This patch:

This is a minor space optimization.  In practice I don't think this has any
affect because of our alignment constraints and the other fields but there is
not point in chewing up an uncessary word and since we already read the flag
field this should improve the cache hit ratio of the irq handler.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Rajesh Shah <rajesh.shah@intel.com>
Cc: Andi Kleen <ak@muc.de>
Cc: "Protasevich, Natalie" <Natalie.Protasevich@UNISYS.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-04 07:55:26 -07:00
Steve Grubb
ac9910ce01 [PATCH] name_count array overrun
Hi,

This patch removes the rdev logging from the previous patch

The below patch closes an unbounded use of name_count. This can lead to oopses
in some new file systems.

Signed-off-by: Steve Grubb <sgrubb@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-10-04 08:31:21 -04:00
Alexander Viro
419c58f11f [PATCH] PPID filtering fix
On Thu, Sep 28, 2006 at 04:03:06PM -0400, Eric Paris wrote:
> After some looking I did not see a way to get into audit_log_exit
> without having set the ppid.  So I am dropping the set from there and
> only doing it at the beginning.
>
> Please comment/ack/nak as soon as possible.

Ehh...  That's one hell of an overhead to be had ;-/  Let's be lazy.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-10-04 08:31:19 -04:00
Eric Paris
4b8a311bb1 [PATCH] arch filter lists with < or > should not be accepted
Currently the kernel audit system represents arch's as numbers and will
gladly accept comparisons between archs using >, <, >=, <= when the only
thing that makes sense is = or !=.  I'm told that the next revision of
auditctl will do this checking but this will provide enforcement in the
kernel even for old userspace.  A simple command to show the issue would
be to run

auditctl -d entry,always -F arch>i686 -S chmod

with this patch the kernel will reject this with -EINVAL

Please comment/ack/nak as soon as possible.

-Eric

 kernel/auditfilter.c |    9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-10-04 08:31:16 -04:00
Dave Jones
038b0a6d8d Remove all inclusions of <linux/config.h>
kbuild explicitly includes this at build time.

Signed-off-by: Dave Jones <davej@redhat.com>
2006-10-04 03:38:54 -04:00
Josh Triplett
4802211cfd rcutorture: Fix incorrect description of default for nreaders parameter
The comment for the nreaders parameter of rcutorture gives the default as
4*ncpus, but the value actually defaults to 2*ncpus; fix the comment.

Signed-off-by: Josh Triplett <josh@freedesktop.org>
Acked-by: Paul E. McKenney <paulmck@us.ibm.com>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
2006-10-03 23:26:16 +02:00
Rolf Eike Beer
9f5d785e93 remove duplicate "until" from kernel/workqueue.c
s/until until/until/

Signed-off-by: Rolf Eike Beer <eike-kernel@sf-tec.de>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
2006-10-03 23:07:31 +02:00
Uwe Zeisberger
f30c226954 fix file specification in comments
Many files include the filename at the beginning, serveral used a wrong one.

Signed-off-by: Uwe Zeisberger <Uwe_Zeisberger@digi.com>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
2006-10-03 23:01:26 +02:00
Christoph Lameter
ce164428c4 [PATCH] scheduler: NUMA aware placement of sched_group_allnodes
When the per cpu sched domains are build then they also need to be placed
on the node where the cpu resides otherwise we will have frequent off node
accesses which will slow down the system.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-03 08:04:07 -07:00
Satoru Takeuchi
0feaece977 [PATCH] sched: fixing wrong comment for find_idlest_cpu()
Fixing wrong comment for find_idlest_cpu().

Signed-off-by: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-03 08:04:07 -07:00
Siddha, Suresh B
89c4710ee9 [PATCH] sched: cleanup sched_group cpu_power setup
Up to now sched group's cpu_power for each sched domain is initialized
independently.  This made the setup code ugly as the new sched domains are
getting added.

Make the sched group cpu_power setup code generic, by using domain child
field and new domain flag in sched_domain.  For most of the sched
domains(except NUMA), sched group's cpu_power is now computed generically
using the domain properties of itself and of the child domain.

sched groups in NUMA domains are setup little differently and hence they
don't use this generic mechanism.

Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-03 08:04:06 -07:00
Siddha, Suresh B
1a84887080 [PATCH] sched: introduce child field in sched_domain
Introduce the child field in sched_domain struct and use it in
sched_balance_self().

We will also use this field in cleaning up the sched group cpu_power
setup(done in a different patch) code.

Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-03 08:04:06 -07:00
Dave Jones
7473264643 [PATCH] sched: don't print migration cost when only 1 CPU
If only a single CPU is present, printing this doesn't make much sense.

Signed-off-by: Dave Jones <davej@redhat.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-03 08:04:06 -07:00
Siddha, Suresh B
a616058b78 [PATCH] sched: remove unnecessary sched group allocations
Remove dynamic sched group allocations for MC and SMP domains.  These
allocations can easily fail on big systems(1024 or so CPUs) and we can live
with out these dynamic allocations.

[akpm@osdl.org: build fix]
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-03 08:04:06 -07:00
Nick Piggin
5c1e176781 [PATCH] sched: force /sbin/init off isolated cpus
Force /sbin/init off isolated cpus (unless every CPU is specified as an
isolcpu).

Users seem to think that the isolated CPUs shouldn't have much running on
them to begin with.  That's fair enough: intuitive, I guess.  It also means
that the cpu affinity masks of tasks will not include isolcpus by default,
which is also more intuitive, perhaps.

/sbin/init is spawned from the boot CPU's idle thread, and /sbin/init
starts the rest of userspace. So if the boot CPU is specified to be an
isolcpu, then prior to this patch, all of userspace will be run there.

(throw in a couple of plausible devinit -> cpuinit conversions I spotted
while we're here).

Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Dimitri Sivanich <sivanich@sgi.com>
Acked-by: Paul Jackson <pj@sgi.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-03 08:04:06 -07:00
Randy Dunlap
e1ca66d1b9 [PATCH] kernel-doc for kernel/resource.c
Add kernel-doc function headers in kernel/resource.c and use them in DocBook.

Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-03 08:03:41 -07:00
Randy Dunlap
eed34d0fc5 [PATCH] kernel-doc for kernel/dma.c
Add kernel-doc function headers in kernel/dma.c and use it in DocBook.

Clean up kernel-doc in mca_dma.h (the colon (':') represents a
section header).

Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-03 08:03:41 -07:00
Franck Bui-Huu
ffc5089196 [PATCH] Create kallsyms_lookup_size_offset()
Some uses of kallsyms_lookup() do not need to find out the name of a symbol
and its module's name it belongs.  This is specially true in arch specific
code, which needs to unwind the stack to show the back trace during oops
(mips is an example).  In this specific case, we just need to retreive the
function's size and the offset of the active intruction inside it.

Adds a new entry "kallsyms_lookup_size_offset()" This new entry does
exactly the same as kallsyms_lookup() but does not require any buffers to
store any names.

It returns 0 if it fails otherwise 1.

Signed-off-by: Franck Bui-Huu <vagabon.xyz@gmail.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-03 08:03:41 -07:00
David Howells
3f2e05e90e [PATCH] BLOCK: Revert patch to hack around undeclared sigset_t in linux/compat.h
Revert Andrew Morton's patch to temporarily hack around the lack of a
declaration of sigset_t in linux/compat.h to make the block-disablement
patches build on IA64.  This got accidentally pushed to Linus and should
be fixed in a different manner.

Also make linux/compat.h #include asm/signal.h to gain a definition of
sigset_t so that it can externally declare sigset_from_compat().

This has been compile-tested for i386, x86_64, ia64, mips, mips64, frv, ppc and
ppc64 and run-tested on frv.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-02 08:03:31 -07:00
Cedric Le Goater
9ec52099e4 [PATCH] replace cad_pid by a struct pid
There are a few places in the kernel where the init task is signaled.  The
ctrl+alt+del sequence is one them.  It kills a task, usually init, using a
cached pid (cad_pid).

This patch replaces the pid_t by a struct pid to avoid pid wrap around
problem.  The struct pid is initialized at boot time in init() and can be
modified through systctl with

	/proc/sys/kernel/cad_pid

[ I haven't found any distro using it ? ]

It also introduces a small helper routine kill_cad_pid() which is used
where it seemed ok to use cad_pid instead of pid 1.

[akpm@osdl.org: cleanups, build fix]
Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Paul Mackerras <paulus@samba.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-02 07:57:25 -07:00
Oleg Nesterov
1a657f78dc [PATCH] introduce get_task_pid() to fix unsafe get_pid()
proc_pid_make_inode:

	ei->pid = get_pid(task_pid(task));

I think this is not safe.  get_pid() can be preempted after checking "pid
!= NULL".  Then the task exits, does detach_pid(), and RCU frees the pid.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-02 07:57:25 -07:00
Arnd Bergmann
6760856791 [PATCH] introduce kernel_execve
The use of execve() in the kernel is dubious, since it relies on the
__KERNEL_SYSCALLS__ mechanism that stores the result in a global errno
variable.  As a first step of getting rid of this, change all users to a
global kernel_execve function that returns a proper error code.

This function is a terrible hack, and a later patch removes it again after the
kernel syscalls are gone.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Andi Kleen <ak@muc.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Ian Molton <spyro@f2s.com>
Cc: Mikael Starvik <starvik@axis.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Hirokazu Takata <takata.hirokazu@renesas.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Kazumoto Kojima <kkojima@rr.iij4u.or.jp>
Cc: Richard Curnow <rc@rc0.org.uk>
Cc: William Lee Irwin III <wli@holomorphy.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
Cc: Miles Bader <uclinux-v850@lsi.nec.co.jp>
Cc: Chris Zankel <chris@zankel.net>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Roman Zippel <zippel@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-02 07:57:23 -07:00
Pavel
5d124e99c2 [PATCH] nsproxy cloning error path fix
This patch fixes copy_namespaces()'s error path.

when new nsproxy (new_ns) is created pointers to namespaces (ipc, uts) are
copied from the old nsproxy.  Later in copy_utsname, copy_ipcs, etc.
according namespaces are get-ed.  On error path needed namespaces are
put-ed, so there's no need to put new nsproxy itelf as it woud cause
putting namespaces for the second time.

Found when incorporating namespaces into OpenVZ kernel.

Signed-off-by: Pavel Emelianov <xemul@openvz.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-02 07:57:22 -07:00
Kirill Korotaev
fcfbd547b1 [PATCH] IPC namespace - sysctls
Sysctl tweaks for IPC namespace

Signed-off-by: Pavel Emelianiov <xemul@openvz.org>
Signed-off-by: Kirill Korotaev <dev@openvz.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-02 07:57:22 -07:00
Kirill Korotaev
73ea41302b [PATCH] IPC namespace - utils
This patch adds basic IPC namespace functionality to
IPC utils:
- init_ipc_ns
- copy/clone/unshare/free IPC ns
- /proc preparations

Signed-off-by: Pavel Emelianov <xemul@openvz.org>
Signed-off-by: Kirill Korotaev <dev@openvz.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-02 07:57:22 -07:00
Kirill Korotaev
25b21cb2f6 [PATCH] IPC namespace core
This patch set allows to unshare IPCs and have a private set of IPC objects
(sem, shm, msg) inside namespace.  Basically, it is another building block of
containers functionality.

This patch implements core IPC namespace changes:
- ipc_namespace structure
- new config option CONFIG_IPC_NS
- adds CLONE_NEWIPC flag
- unshare support

[clg@fr.ibm.com: small fix for unshare of ipc namespace]
[akpm@osdl.org: build fix]
Signed-off-by: Pavel Emelianov <xemul@openvz.org>
Signed-off-by: Kirill Korotaev <dev@openvz.org>
Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-02 07:57:22 -07:00
Serge Hallyn
c0b2fc3165 [PATCH] uts: copy nsproxy only when needed
The nsproxy was being copied in unshare() when anything was being unshared,
even if it was something not referenced from nsproxy.  This should end up
in some cases with far more memory usage than necessary.

Signed-off-by: Serge Hallyn <serue@us.ibm.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Andrey Savochkin <saw@sw.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-02 07:57:22 -07:00
Serge E. Hallyn
071df104f8 [PATCH] namespaces: utsname: implement CLONE_NEWUTS flag
Implement a CLONE_NEWUTS flag, and use it at clone and sys_unshare.

[clg@fr.ibm.com: IPC unshare fix]
[bunk@stusta.de: cleanup]
Signed-off-by: Serge Hallyn <serue@us.ibm.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Andrey Savochkin <saw@sw.ru>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-02 07:57:22 -07:00
Serge E. Hallyn
8218c74c02 [PATCH] namespaces: utsname: sysctl
Sysctl uts patch.  This will need to be done another way, but since sysctl
itself needs to be container aware, 'the right thing' is a separate patchset.

[akpm@osdl.org: ia64 build fix]
[sam.vilain@catalyst.net.nz: cleanup]
[sam.vilain@catalyst.net.nz: add proc_do_utsns_string]
Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Andrey Savochkin <saw@sw.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-02 07:57:21 -07:00
Serge E. Hallyn
4865ecf131 [PATCH] namespaces: utsname: implement utsname namespaces
This patch defines the uts namespace and some manipulators.
Adds the uts namespace to task_struct, and initializes a
system-wide init namespace.

It leaves a #define for system_utsname so sysctl will compile.
This define will be removed in a separate patch.

[akpm@osdl.org: build fix, cleanup]
Signed-off-by: Serge Hallyn <serue@us.ibm.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Andrey Savochkin <saw@sw.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-02 07:57:21 -07:00
Serge E. Hallyn
96b644bdec [PATCH] namespaces: utsname: use init_utsname when appropriate
In some places, particularly drivers and __init code, the init utsns is the
appropriate one to use.  This patch replaces those with a the init_utsname
helper.

Changes: Removed several uses of init_utsname().  Hope I picked all the
	right ones in net/ipv4/ipconfig.c.  These are now changed to
	utsname() (the per-process namespace utsname) in the previous
	patch (2/7)

[akpm@osdl.org: CIFS fix]
Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Andrey Savochkin <saw@sw.ru>
Cc: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-02 07:57:21 -07:00
Serge E. Hallyn
e9ff3990f0 [PATCH] namespaces: utsname: switch to using uts namespaces
Replace references to system_utsname to the per-process uts namespace
where appropriate.  This includes things like uname.

Changes: Per Eric Biederman's comments, use the per-process uts namespace
	for ELF_PLATFORM, sunrpc, and parts of net/ipv4/ipconfig.c

[jdike@addtoit.com: UML fix]
[clg@fr.ibm.com: cleanup]
[akpm@osdl.org: build fix]
Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Andrey Savochkin <saw@sw.ru>
Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
Cc: Jeff Dike <jdike@addtoit.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-02 07:57:21 -07:00
Cedric Le Goater
fab413a334 [PATCH] namespaces: exit_task_namespaces() invalidates nsproxy
exit_task_namespaces() has replaced the former exit_namespace().  It
invalidates task->nsproxy and associated namespaces.  This is an issue for
the (futur) pid namespace which is required to be valid in exit_notify().

This patch moves exit_task_namespaces() after exit_notify() to keep nsproxy
valid.

Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Andrey Savochkin <saw@sw.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-02 07:57:21 -07:00
Serge E. Hallyn
1651e14e28 [PATCH] namespaces: incorporate fs namespace into nsproxy
This moves the mount namespace into the nsproxy.  The mount namespace count
now refers to the number of nsproxies point to it, rather than the number of
tasks.  As a result, the unshare_namespace() function in kernel/fork.c no
longer checks whether it is being shared.

Signed-off-by: Serge Hallyn <serue@us.ibm.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Andrey Savochkin <saw@sw.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-02 07:57:20 -07:00
Serge E. Hallyn
0437eb594e [PATCH] nsproxy: move init_nsproxy into kernel/nsproxy.c
Move the init_nsproxy definition out of arch/ into kernel/nsproxy.c.  This
avoids all arches having to be updated.  Compiles and boots on s390.

Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Andrey Savochkin <saw@sw.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-02 07:57:20 -07:00
Serge E. Hallyn
ab516013ad [PATCH] namespaces: add nsproxy
This patch adds a nsproxy structure to the task struct.  Later patches will
move the fs namespace pointer into this structure, and introduce a new utsname
namespace into the nsproxy.

The vserver and openvz functionality, then, would be implemented in large part
by virtualizing/isolating more and more resources into namespaces, each
contained in the nsproxy.

[akpm@osdl.org: build fix]
Signed-off-by: Serge Hallyn <serue@us.ibm.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Andrey Savochkin <saw@sw.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-02 07:57:20 -07:00
Adrian Bunk
b1ba4ddde0 [PATCH] make kernel/sysctl.c:_proc_do_string() static
This patch makes the needlessly global _proc_do_string() static.

Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-02 07:57:20 -07:00
Sam Vilain
f5dd3d6fad [PATCH] proc: sysctl: add _proc_do_string helper
The logic in proc_do_string is worth re-using without passing in a
ctl_table structure (say, we want to calculate a pointer and pass that in
instead); pass in the two fields it uses from that structure as explicit
arguments.

Signed-off-by: Sam Vilain <sam.vilain@catalyst.net.nz>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Andrey Savochkin <saw@sw.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-02 07:57:20 -07:00
Greg Banks
e16b38f713 [PATCH] cpumask: export cpu_online_map and cpu_possible_map consistently
cpumask: ensure that the cpu_online_map and cpu_possible_map bitmasks, and
hence all the macros in <linux/cpumask.h> that require them, are available to
modules for all supported combinations of architecture and CONFIG_SMP.

Signed-off-by: Greg Banks <gnb@melbourne.sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-02 07:57:17 -07:00
bibo,mao
99219a3fbc [PATCH] kretprobe spinlock deadlock patch
kprobe_flush_task() possibly calls kfree function during holding
kretprobe_lock spinlock, if kfree function is probed by kretprobe that will
incur spinlock deadlock.  This patch moves kfree function out scope of
kretprobe_lock.

Signed-off-by: bibo, mao <bibo.mao@intel.com>
Signed-off-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-02 07:57:16 -07:00
bibo,mao
f2aa85a0cc [PATCH] disallow kprobes on notifier_call_chain
When kprobe is re-entered, the re-entered kprobe kernel path will will call
atomic_notifier_call_chain function, if this function is kprobed that will
incur numerous kprobe recursive fault.  This patch disallows kprobes on
atomic_notifier_call_chain function.

Signed-off-by: bibo, mao <bibo.mao@intel.com>
Signed-off-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-02 07:57:16 -07:00
bibo,mao
62c27be0dd [PATCH] kprobe whitespace cleanup
Whitespace is used to indent, this patch cleans up these sentences by
kernel coding style.

Signed-off-by: bibo, mao <bibo.mao@intel.com>
Signed-off-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-02 07:57:16 -07:00
Ananth N Mavinakayanahalli
3a872d89ba [PATCH] Kprobes: Make kprobe modules more portable
In an effort to make kprobe modules more portable, here is a patch that:

o Introduces the "symbol_name" field to struct kprobe.
  The symbol->address resolution now happens in the kernel in an
  architecture agnostic manner. 64-bit powerpc users no longer have
  to specify the ".symbols"
o Introduces the "offset" field to struct kprobe to allow a user to
  specify an offset into a symbol.
o The legacy mechanism of specifying the kprobe.addr is still supported.
  However, if both the kprobe.addr and kprobe.symbol_name are specified,
  probe registration fails with an -EINVAL.
o The symbol resolution code uses kallsyms_lookup_name(). So
  CONFIG_KPROBES now depends on CONFIG_KALLSYMS
o Apparantly kprobe modules were the only legitimate out-of-tree user of
  the kallsyms_lookup_name() EXPORT. Now that the symbol resolution
  happens in-kernel, remove the EXPORT as suggested by Christoph Hellwig
o Modify tcp_probe.c that uses the kprobe interface so as to make it
  work on multiple platforms (in its earlier form, the code wouldn't
  work, say, on powerpc)

Signed-off-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Signed-off-by: Prasanna S Panchamukhi <prasanna@in.ibm.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-02 07:57:16 -07:00
Eric W. Biederman
2425c08b37 [PATCH] usb: fixup usb so it uses struct pid
The problem with remembering a user space process by its pid is that it is
possible that the process will exit, pid wrap around will occur.
Converting to a struct pid avoid that problem, and paves the way for
implementing a pid namespace.

Also since usb is the only user of kill_proc_info_as_uid rename
kill_proc_info_as_uid to kill_pid_info_as_uid and have the new version take
a struct pid.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Acked-by: Greg Kroah-Hartman <gregkh@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-02 07:57:15 -07:00
Eric W. Biederman
f40f50d3bb [PATCH] Use struct pspace in next_pidmap and find_ge_pid
This updates my proc: readdir race fix (take 3) patch
to account for the changes made by: Sukadev Bhattiprolu <sukadev@us.ibm.com>
to introduce struct pspace.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-02 07:57:15 -07:00
Sukadev Bhattiprolu
3fbc964864 [PATCH] Define struct pspace
Define a per-container pid space object.  And create one instance of this
object, init_pspace, to define the entire pid space.  Subsequent patches
will provide/use interfaces to create/destroy pid spaces.

Its a subset/rework of Eric Biederman's patch
http://lkml.org/lkml/2006/2/6/285 .

Signed-off-by: Eric Biederman <ebiederm@xmission.com>
Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Serge Hallyn <serue@us.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Cc: Kirill Korotaev <dev@sw.ru>
Cc: Andrey Savochkin <saw@sw.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-02 07:57:15 -07:00
Sukadev Bhattiprolu
aa5a6662f9 [PATCH] Move pidmap to pspace.h
Move struct pidmap and PIDMAP_ENTRIES to a new file, include/linux/pspace.h
where it will be used in subsequent patches to define pid spaces.

Its a subset of Eric Biederman's patch http://lkml.org/lkml/2006/2/6/285

[akpm@osdl.org: cleanups]
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Serge Hallyn <serue@us.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-02 07:57:15 -07:00
Eric W. Biederman
c88be3eb2e [PATCH] pids coding style use struct pidmap in next_pidmap
Use struct pidmap instead of pidmap_t.

This updates my proc: readdir race fix (take 3) patch
to account for the changes made by: Sukadev Bhattiprolu <sukadev@us.ibm.com>
to kill pidmap_t.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-02 07:57:15 -07:00
Sukadev Bhattiprolu
6a1f3b8455 [PATCH] pids: coding style: use struct pidmap
Use struct pidmap instead of pidmap_t.

Its a subset of Eric Biederman's patch http://lkml.org/lkml/2006/2/6/271.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Serge Hallyn <serue@us.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-02 07:57:14 -07:00
Eric W. Biederman
609d7fa956 [PATCH] file: modify struct fown_struct to use a struct pid
File handles can be requested to send sigio and sigurg to processes.  By
tracking the destination processes using struct pid instead of pid_t we make
the interface safe from all potential pid wrap around problems.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-02 07:57:14 -07:00
Eric W. Biederman
bbf73147e2 [PATCH] pid: export the symbols needed to use struct pid *
pids aren't something that drivers should care about.  However there are a lot
of helper layers in the kernel that do care, and are built as modules.  Before
I can convert them to using struct pid instead of pid_t I need to export the
appropriate symbols so they can continue to be built.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-02 07:57:13 -07:00
Eric W. Biederman
c4b92fc112 [PATCH] pid: implement signal functions that take a struct pid *
Currently the signal functions all either take a task or a pid_t argument.
This patch implements variants that take a struct pid *.  After all of the
users have been update it is my intention to remove the variants that take a
pid_t as using pid_t can be more work (an extra hash table lookup) and
difficult to get right in the presence of multiple pid namespaces.

There are two kinds of functions introduced in this patch.  The are the
general use functions kill_pgrp and kill_pid which take a priv argument that
is ultimately used to create the appropriate siginfo information, Then there
are _kill_pgrp_info, kill_pgrp_info, kill_pid_info the internal implementation
helpers that take an explicit siginfo.

The distinction is made because filling out an explcit siginfo is tricky, and
will be even more tricky when pid namespaces are introduced.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-02 07:57:13 -07:00
Eric W. Biederman
0804ef4b0d [PATCH] proc: readdir race fix (take 3)
The problem: An opendir, readdir, closedir sequence can fail to report
process ids that are continually in use throughout the sequence of system
calls.  For this race to trigger the process that proc_pid_readdir stops at
must exit before readdir is called again.

This can cause ps to fail to report processes, and it is in violation of
posix guarantees and normal application expectations with respect to
readdir.

Currently there is no way to work around this problem in user space short
of providing a gargantuan buffer to user space so the directory read all
happens in on system call.

This patch implements the normal directory semantics for proc, that
guarantee that a directory entry that is neither created nor destroyed
while reading the directory entry will be returned.  For directory that are
either created or destroyed during the readdir you may or may not see them.
 Furthermore you may seek to a directory offset you have previously seen.

These are the guarantee that ext[23] provides and that posix requires, and
more importantly that user space expects.  Plus it is a simple semantic to
implement reliable service.  It is just a matter of calling readdir a
second time if you are wondering if something new has show up.

These better semantics are implemented by scanning through the pids in
numerical order and by making the file offset a pid plus a fixed offset.

The pid scan happens on the pid bitmap, which when you look at it is
remarkably efficient for a brute force algorithm.  Given that a typical
cache line is 64 bytes and thus covers space for 64*8 == 200 pids.  There
are only 40 cache lines for the entire 32K pid space.  A typical system
will have 100 pids or more so this is actually fewer cache lines we have to
look at to scan a linked list, and the worst case of having to scan the
entire pid bitmap is pretty reasonable.

If we need something more efficient we can go to a more efficient data
structure for indexing the pids, but for now what we have should be
sufficient.

In addition this takes no additional locks and is actually less code than
what we are doing now.

Also another very subtle bug in this area has been fixed.  It is possible
to catch a task in the middle of de_thread where a thread is assuming the
thread of it's thread group leader.  This patch carefully handles that case
so if we hit it we don't fail to return the pid, that is undergoing the
de_thread dance.

Thanks to KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> for
providing the first fix, pointing this out and working on it.

[oleg@tv-sign.ru: fix it]
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Jean Delvare <jdelvare@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-02 07:57:12 -07:00
Randy Dunlap
2bc2d61a96 [PATCH] list module taint flags in Oops/panic
When listing loaded modules during an oops or panic, also list each
module's Tainted flags if non-zero (P: Proprietary or F: Forced load only).

If a module is did not taint the kernel, it is just listed like
	usbcore
but if it did taint the kernel, it is listed like
	wizmodem(PF)

Example:
[ 3260.121718] Unable to handle kernel NULL pointer dereference at 0000000000000000 RIP:
[ 3260.121729]  [<ffffffff8804c099>] :dump_test:proc_dump_test+0x99/0xc8
[ 3260.121742] PGD fe8d067 PUD 264a6067 PMD 0
[ 3260.121748] Oops: 0002 [1] SMP
[ 3260.121753] CPU 1
[ 3260.121756] Modules linked in: dump_test(P) snd_pcm_oss snd_mixer_oss snd_seq snd_seq_device ide_cd generic ohci1394 snd_hda_intel snd_hda_codec snd_pcm snd_timer snd ieee1394 snd_page_alloc piix ide_core arcmsr aic79xx scsi_transport_spi usblp
[ 3260.121785] Pid: 5556, comm: bash Tainted: P      2.6.18-git10 #1

[Alternatively, I can look into listing tainted flags with 'lsmod',
but that won't help in oopsen/panics so much.]

[akpm@osdl.org: cleanup]
Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-02 07:57:12 -07:00
Andi Kleen
d025c9db7f [PATCH] Support piping into commands in /proc/sys/kernel/core_pattern
Using the infrastructure created in previous patches implement support to
pipe core dumps into programs.

This is done by overloading the existing core_pattern sysctl
with a new syntax:

|program

When the first character of the pattern is a '|' the kernel will instead
threat the rest of the pattern as a command to run.  The core dump will be
written to the standard input of that program instead of to a file.

This is useful for having automatic core dump analysis without filling up
disks.  The program can do some simple analysis and save only a summary of
the core dump.

The core dump proces will run with the privileges and in the name space of
the process that caused the core dump.

I also increased the core pattern size to 128 bytes so that longer command
lines fit.

Most of the changes comes from allowing core dumps without seeks.  They are
fairly straight forward though.

One small incompatibility is that if someone had a core pattern previously
that started with '|' they will get suddenly new behaviour.  I think that's
unlikely to be a real problem though.

Additional background:

> Very nice, do you happen to have a program that can accept this kind of
> input for crash dumps?  I'm guessing that the embedded people will
> really want this functionality.

I had a cheesy demo/prototype.  Basically it wrote the dump to a file again,
ran gdb on it to get a backtrace and wrote the summary to a shared directory.
Then there was a simple CGI script to generate a "top 10" crashes HTML
listing.

Unfortunately this still had the disadvantage to needing full disk space for a
dump except for deleting it afterwards (in fact it was worse because over the
pipe holes didn't work so if you have a holey address map it would require
more space).

Fortunately gdb seems to be happy to handle /proc/pid/fd/xxx input pipes as
cores (at least it worked with zsh's =(cat core) syntax), so it would be
likely possible to do it without temporary space with a simple wrapper that
calls it in the right way.  I ran out of time before doing that though.

The demo prototype scripts weren't very good.  If there is really interest I
can dig them out (they are currently on a laptop disk on the desk with the
laptop itself being in service), but I would recommend to rewrite them for any
serious application of this and fix the disk space problem.

Also to be really useful it should probably find a way to automatically fetch
the debuginfos (I cheated and just installed them in advance).  If nobody else
does it I can probably do the rewrite myself again at some point.

My hope at some point was that desktops would support it in their builtin
crash reporters, but at least the KDE people I talked too seemed to be happy
with their user space only solution.

Alan sayeth:

  I don't believe that piping as such as neccessarily the right model, but
  the ability to intercept and processes core dumps from user space is asked
  for by many enterprise users as well.  They want to know about, capture,
  analyse and process core dumps, often centrally and in automated form.

[akpm@osdl.org: loff_t != unsigned long]
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-01 00:39:33 -07:00
Andi Kleen
e239ca5405 [PATCH] Create call_usermodehelper_pipe()
A new member in the ever growing family of call_usermode* functions is
born.  The new call_usermodehelper_pipe() function allows to pipe data to
the stdin of the called user mode progam and behaves otherwise like the
normal call_usermodehelp() (except that it always waits for the child to
finish)

Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-01 00:39:33 -07:00
Dave Hansen
d8c76e6f45 [PATCH] r/o bind mount prepwork: inc_nlink() helper
This is mostly included for parity with dec_nlink(), where we will have some
more hooks.  This one should stay pretty darn straightforward for now.

Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-01 00:39:30 -07:00
Jay Lan
db5fed26b2 [PATCH] csa accounting taskstats update
ChangeLog:
   Feedbacks from Andrew Morton:
   - define TS_COMM_LEN to 32
   - change acct_stimexpd field of task_struct to be of
     cputime_t, which is to be used to save the tsk->stime
     of last timer interrupt update.
   - a new Documentation/accounting/taskstats-struct.txt
     to describe fields of taskstats struct.

   Feedback from Balbir Singh:
   - keep the stime of a task to be zero when both stime
     and utime are zero as recoreded in task_struct.

   Misc:
   - convert accumulated RSS/VM from platform dependent
     pages-ticks to MBytes-usecs in the kernel

Cc: Shailabh Nagar <nagar@watson.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Jes Sorensen <jes@sgi.com>
Cc: Chris Sturtivant <csturtiv@sgi.com>
Cc: Tony Ernst <tee@sgi.com>
Cc: Guillaume Thouvenin <guillaume.thouvenin@bull.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-01 00:39:29 -07:00
Jay Lan
8f0ab51479 [PATCH] csa: convert CONFIG tag for extended accounting routines
There were a few accounting data/macros that are used in CSA but are #ifdef'ed
inside CONFIG_BSD_PROCESS_ACCT.  This patch is to change those ifdef's from
CONFIG_BSD_PROCESS_ACCT to CONFIG_TASK_XACCT.  A few defines are moved from
kernel/acct.c and include/linux/acct.h to kernel/tsacct.c and
include/linux/tsacct_kern.h.

Signed-off-by: Jay Lan <jlan@sgi.com>
Cc: Shailabh Nagar <nagar@watson.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Jes Sorensen <jes@sgi.com>
Cc: Chris Sturtivant <csturtiv@sgi.com>
Cc: Tony Ernst <tee@sgi.com>
Cc: Guillaume Thouvenin <guillaume.thouvenin@bull.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-01 00:39:29 -07:00
Jay Lan
9acc185351 [PATCH] csa: Extended system accounting over taskstats
Add extended system accounting handling over taskstats interface.  A
CONFIG_TASK_XACCT flag is created to enable the extended accounting code.

Signed-off-by: Jay Lan <jlan@sgi.com>
Cc: Shailabh Nagar <nagar@watson.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Jes Sorensen <jes@sgi.com>
Cc: Chris Sturtivant <csturtiv@sgi.com>
Cc: Tony Ernst <tee@sgi.com>
Cc: Guillaume Thouvenin <guillaume.thouvenin@bull.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-01 00:39:29 -07:00
Jay Lan
f3cef7a994 [PATCH] csa: basic accounting over taskstats
Add some basic accounting fields to the taskstats struct, add a new
kernel/tsacct.c to handle basic accounting data handling upon exit.  A handle
is added to taskstats.c to invoke the basic accounting data handling.

Signed-off-by: Jay Lan <jlan@sgi.com>
Cc: Shailabh Nagar <nagar@watson.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Jes Sorensen <jes@sgi.com>
Cc: Chris Sturtivant <csturtiv@sgi.com>
Cc: Tony Ernst <tee@sgi.com>
Cc: Guillaume Thouvenin <guillaume.thouvenin@bull.net>
Cc: "Michal Piotrowski" <michal.k.k.piotrowski@gmail.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-01 00:39:29 -07:00
Balbir Singh
0ae646845b [PATCH] Fix taskstats size calculation (use the new genetlink utility functions)
The addition of the CSA patch pushed the size of struct taskstats to 256
bytes.  This exposed a problem with prepare_reply(), we were not allocating
space for the netlink and genetlink header.  It worked earlier because
alloc_skb() would align the skb to SMP_CACHE_BYTES, which added some additonal
bytes.

Signed-off-by: Balbir Singh <balbir@in.ibm.com>
Cc: Jamal Hadi <hadi@cyberus.ca>
Cc: Shailabh Nagar <nagar@watson.ibm.com>
Cc: Thomas Graf <tgraf@suug.ch>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jay Lan <jlan@engr.sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-01 00:39:29 -07:00
Atsushi Nemoto
8ef386092d [PATCH] kill wall_jiffies
With 2.6.18-rc4-mm2, now wall_jiffies will always be the same as jiffies.
So we can kill wall_jiffies completely.

This is just a cleanup and logically should not change any real behavior
except for one thing: RTC updating code in (old) ppc and xtensa use a
condition "jiffies - wall_jiffies == 1".  This condition is never met so I
suppose it is just a bug.  I just remove that condition only instead of
kill the whole "if" block.

[heiko.carstens@de.ibm.com: s390 build fix and cleanup]
Signed-off-by: Atsushi Nemoto <anemo@mba.ocn.ne.jp>
Cc: Andi Kleen <ak@muc.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Ian Molton <spyro@f2s.com>
Cc: Mikael Starvik <starvik@axis.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Hirokazu Takata <takata.hirokazu@renesas.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Kazumoto Kojima <kkojima@rr.iij4u.or.jp>
Cc: Richard Curnow <rc@rc0.org.uk>
Cc: William Lee Irwin III <wli@holomorphy.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
Cc: Miles Bader <uclinux-v850@lsi.nec.co.jp>
Cc: Chris Zankel <chris@zankel.net>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Roman Zippel <zippel@linux-m68k.org>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-01 00:39:27 -07:00
Adrian Bunk
70bc42f90a [PATCH] kernel/time/ntp.c: possible cleanups
This patch contains the following possible cleanups:
- make the following needlessly global function static:
  - ntp_update_frequency()
- make the following needlessly global variables static:
  - time_state
  - time_offset
  - time_constant
  - time_reftime
- remove the following read-only global variable:
  - time_precision

Signed-off-by: Adrian Bunk <bunk@stusta.de>
Cc: Roman Zippel <zippel@linux-m68k.org>
Cc: john stultz <johnstul@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-01 00:39:27 -07:00
Roman Zippel
f199239373 [PATCH] ntp: convert to the NTP4 reference model
This converts the kernel ntp model into a model which matches the nanokernel
reference implementations.  The previous patches already increased the
resolution and precision of the computations, so that this conversion becomes
quite simple.

<linux@horizon.com> explains:

The original NTP kernel interface was defined in units of microseconds.
That's what Linux implements.  As computers have gotten faster and can now
split microseconds easily, a new kernel interface using nanosecond units was
defined ("the nanokernel", confusing as that name is to OS hackers), and
there's an STA_NANO bit in the adjtimex() status field to tell the application
which units it's using.

The current ntpd supports both, but Linux loses some possible timing
resolution because of quantization effects, and the ntpd hackers would really
like to be able to drop the backwards compatibility code.

Ulrich Windl has been maintaining a patch set to do the conversion for years,
but it's hard to keep in sync.

Signed-off-by: Roman Zippel <zippel@linux-m68k.org>
Cc: john stultz <johnstul@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-01 00:39:27 -07:00
Roman Zippel
04b617e71e [PATCH] ntp: convert time_freq to nsec value
This converts time_freq to a scaled nsec value and adds around 6bit of extra
resolution.  This pushes the time_freq to its 32bit limits so the calculatons
have to be done with 64bit.

Signed-off-by: Roman Zippel <zippel@linux-m68k.org>
Cc: john stultz <johnstul@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-01 00:39:27 -07:00
Roman Zippel
97eebe138c [PATCH] ntp: remove time_tolerance
time_tolerance isn't changed at all in the kernel, so simply remove it, this
simplifies the next patch, as it avoids a number of conversions.

Signed-off-by: Roman Zippel <zippel@linux-m68k.org>
Cc: john stultz <johnstul@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-01 00:39:26 -07:00
Roman Zippel
8f807f8d21 [PATCH] ntp: add time_adjust to tick length
This folds update_ntp_one_tick() into second_overflow() and adds time_adjust
to the tick length, this makes time_next_adjust unnecessary.  This slightly
changes the adjtime() behaviour, instead of applying it to the next tick, it's
applied to the next second.

Signed-off-by: Roman Zippel <zippel@linux-m68k.org>
Cc: john stultz <johnstul@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-01 00:39:26 -07:00
Roman Zippel
3d3675cc3d [PATCH] ntp: prescale time_offset
This converts time_offset into a scaled per tick value.  This avoids now
completely the crude compensation in second_overflow().

Signed-off-by: Roman Zippel <zippel@linux-m68k.org>
Cc: john stultz <johnstul@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-01 00:39:26 -07:00
Roman Zippel
dc6a43e46f [PATCH] ntp: add time_freq to tick length
This adds the frequency part to ntp_update_frequency().

Signed-off-by: Roman Zippel <zippel@linux-m68k.org>
Cc: john stultz <johnstul@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-01 00:39:26 -07:00
Roman Zippel
ab8783b688 [PATCH] ntp: add time_adj to tick length
This makes time_adj local to second_overflow() and integrates it into the tick
length instead of adding it everytime.

Signed-off-by: Roman Zippel <zippel@linux-m68k.org>
Cc: john stultz <johnstul@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-01 00:39:26 -07:00
Roman Zippel
b0ee75561b [PATCH] ntp: add ntp_update_frequency
This introduces ntp_update_frequency() and deinlines ntp_clear() (as it's not
performance critical).  ntp_update_frequency() calculates the base tick length
using tick_usec and adds a base adjustment, in case the frequency doesn't
divide evenly by HZ.

Signed-off-by: Roman Zippel <zippel@linux-m68k.org>
Cc: john stultz <johnstul@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-01 00:39:26 -07:00
john stultz
4c7ee8de95 [PATCH] NTP: Move all the NTP related code to ntp.c
Move all the NTP related code to ntp.c

[akpm@osdl.org: cleanups, build fix]
Signed-off-by: John Stultz <johnstul@us.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Roman Zippel <zippel@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-01 00:39:26 -07:00
Martin Schwidefsky
ef6edc9746 [PATCH] Directed yield: cpu_relax variants for spinlocks and rw-locks
On systems running with virtual cpus there is optimization potential in
regard to spinlocks and rw-locks.  If the virtual cpu that has taken a lock
is known to a cpu that wants to acquire the same lock it is beneficial to
yield the timeslice of the virtual cpu in favour of the cpu that has the
lock (directed yield).

With CONFIG_PREEMPT="n" this can be implemented by the architecture without
common code changes.  Powerpc already does this.

With CONFIG_PREEMPT="y" the lock loops are coded with _raw_spin_trylock,
_raw_read_trylock and _raw_write_trylock in kernel/spinlock.c.  If the lock
could not be taken cpu_relax is called.  A directed yield is not possible
because cpu_relax doesn't know anything about the lock.  To be able to
yield the lock in favour of the current lock holder variants of cpu_relax
for spinlocks and rw-locks are needed.  The new _raw_spin_relax,
_raw_read_relax and _raw_write_relax primitives differ from cpu_relax
insofar that they have an argument: a pointer to the lock structure.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Haavard Skinnemoen <hskinnemoen@atmel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-01 00:39:21 -07:00
Cal Peake
756184b7d7 [PATCH] CodingStyle cleanup for kernel/sys.c
Fix up kernel/sys.c to be consistent with CodingStyle and the rest of the
file.

Signed-off-by: Cal Peake <cp@absolutedigital.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-01 00:39:20 -07:00
Arjan van de Ven
5c87579e65 [PATCH] maximum latency tracking infrastructure
Add infrastructure to track "maximum allowable latency" for power saving
policies.

The reason for adding this infrastructure is that power management in the
idle loop needs to make a tradeoff between latency and power savings
(deeper power save modes have a longer latency to running code again).  The
code that today makes this tradeoff just does a rather simple algorithm;
however this is not good enough: There are devices and use cases where a
lower latency is required than that the higher power saving states provide.
 An example would be audio playback, but another example is the ipw2100
wireless driver that right now has a very direct and ugly acpi hook to
disable some higher power states randomly when it gets certain types of
error.

The proposed solution is to have an interface where drivers can

* announce the maximum latency (in microseconds) that they can deal with
* modify this latency
* give up their constraint

and a function where the code that decides on power saving strategy can
query the current global desired maximum.

This patch has a user of each side: on the consumer side, ACPI is patched
to use this, on the producer side the ipw2100 driver is patched.

A generic maximum latency is also registered of 2 timer ticks (more and you
lose accurate time tracking after all).

While the existing users of the patch are x86 specific, the infrastructure
is not.  I'd like to ask the arch maintainers of other architectures if the
infrastructure is generic enough for their use (assuming the architecture
has such a tradeoff as concept at all), and the sound/multimedia driver
owners to look at the driver facing API to see if this is something they
can use.

[akpm@osdl.org: cleanups]
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Jesse Barnes <jesse.barnes@intel.com>
Cc: "Brown, Len" <len.brown@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-01 00:39:19 -07:00
David Howells
9361401eb7 [PATCH] BLOCK: Make it possible to disable the block layer [try #6]
Make it possible to disable the block layer.  Not all embedded devices require
it, some can make do with just JFFS2, NFS, ramfs, etc - none of which require
the block layer to be present.

This patch does the following:

 (*) Introduces CONFIG_BLOCK to disable the block layer, buffering and blockdev
     support.

 (*) Adds dependencies on CONFIG_BLOCK to any configuration item that controls
     an item that uses the block layer.  This includes:

     (*) Block I/O tracing.

     (*) Disk partition code.

     (*) All filesystems that are block based, eg: Ext3, ReiserFS, ISOFS.

     (*) The SCSI layer.  As far as I can tell, even SCSI chardevs use the
     	 block layer to do scheduling.  Some drivers that use SCSI facilities -
     	 such as USB storage - end up disabled indirectly from this.

     (*) Various block-based device drivers, such as IDE and the old CDROM
     	 drivers.

     (*) MTD blockdev handling and FTL.

     (*) JFFS - which uses set_bdev_super(), something it could avoid doing by
     	 taking a leaf out of JFFS2's book.

 (*) Makes most of the contents of linux/blkdev.h, linux/buffer_head.h and
     linux/elevator.h contingent on CONFIG_BLOCK being set.  sector_div() is,
     however, still used in places, and so is still available.

 (*) Also made contingent are the contents of linux/mpage.h, linux/genhd.h and
     parts of linux/fs.h.

 (*) Makes a number of files in fs/ contingent on CONFIG_BLOCK.

 (*) Makes mm/bounce.c (bounce buffering) contingent on CONFIG_BLOCK.

 (*) set_page_dirty() doesn't call __set_page_dirty_buffers() if CONFIG_BLOCK
     is not enabled.

 (*) fs/no-block.c is created to hold out-of-line stubs and things that are
     required when CONFIG_BLOCK is not set:

     (*) Default blockdev file operations (to give error ENODEV on opening).

 (*) Makes some /proc changes:

     (*) /proc/devices does not list any blockdevs.

     (*) /proc/diskstats and /proc/partitions are contingent on CONFIG_BLOCK.

 (*) Makes some compat ioctl handling contingent on CONFIG_BLOCK.

 (*) If CONFIG_BLOCK is not defined, makes sys_quotactl() return -ENODEV if
     given command other than Q_SYNC or if a special device is specified.

 (*) In init/do_mounts.c, no reference is made to the blockdev routines if
     CONFIG_BLOCK is not defined.  This does not prohibit NFS roots or JFFS2.

 (*) The bdflush, ioprio_set and ioprio_get syscalls can now be absent (return
     error ENOSYS by way of cond_syscall if so).

 (*) The seclvl_bd_claim() and seclvl_bd_release() security calls do nothing if
     CONFIG_BLOCK is not set, since they can't then happen.

Signed-Off-By: David Howells <dhowells@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2006-09-30 20:52:31 +02:00
David Howells
07f3f05c1e [PATCH] BLOCK: Move extern declarations out of fs/*.c into header files [try #6]
Create a new header file, fs/internal.h, for common definitions local to the
sources in the fs/ directory.

Move extern definitions that should be in header files from fs/*.c to
fs/internal.h or other main header files where they span directories.

Signed-Off-By: David Howells <dhowells@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2006-09-30 20:52:18 +02:00
David Howells
0d67a46df0 [PATCH] BLOCK: Remove duplicate declaration of exit_io_context() [try #6]
Remove the duplicate declaration of exit_io_context() from linux/sched.h.

Signed-Off-By: David Howells <dhowells@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2006-09-30 20:31:20 +02:00
Andi Kleen
34596dc9e5 [PATCH] Define vsyscall cache as blob to make clearer that user space shouldn't use it
Signed-off-by: Andi Kleen <ak@suse.de>
2006-09-30 01:47:55 +02:00
Andi Kleen
29cbc78b90 [PATCH] x86: Clean up x86 NMI sysctls
Use prototypes in headers
Don't define panic_on_unrecovered_nmi for all architectures

Cc: dzickus@redhat.com

Signed-off-by: Andi Kleen <ak@suse.de>
2006-09-30 01:47:55 +02:00
Paul Jackson
181b648036 [PATCH] cpuset: fix obscure attach_task vs exiting race
Fix obscure race condition in kernel/cpuset.c attach_task() code.

There is basically zero chance of anyone accidentally being harmed by this
race.

It requires a special 'micro-stress' load and a special timing loop hacks
in the kernel to hit in less than an hour, and even then you'd have to hit
it hundreds or thousands of times, followed by some unusual and senseless
cpuset configuration requests, including removing the top cpuset, to cause
any visibly harm affects.

One could, with perhaps a few days or weeks of such effort, get the
reference count on the top cpuset below zero, and manage to crash the
kernel by asking to remove the top cpuset.

I found it by code inspection.

The race was introduced when 'the_top_cpuset_hack' was introduced, and one
piece of code was not updated.  An old check for a possibly null task
cpuset pointer needed to be changed to a check for a task marked
PF_EXITING.  The pointer can't be null anymore, thanks to
the_top_cpuset_hack (documented in kernel/cpuset.c).  But the task could
have gone into PF_EXITING state after it was found in the task_list scan.

If a task is PF_EXITING in this code, it is possible that its task->cpuset
pointer is pointing to the top cpuset due to the_top_cpuset_hack, rather
than because the top_cpuset was that tasks last valid cpuset.  In that
case, the wrong cpuset reference counter would be decremented.

The fix is trivial.  Instead of failing the system call if the tasks cpuset
pointer is null here, fail it if the task is in PF_EXITING state.

The code for 'the_top_cpuset_hack' that changes an exiting tasks cpuset to
the top_cpuset is done without locking, so could happen at anytime.  But it
is done during the exit handling, after the PF_EXITING flag is set.  So if
we verify that a task is still not PF_EXITING after we copy out its cpuset
pointer (into 'oldcs', below), we know that 'oldcs' is not one of these
hack references to the top_cpuset.

Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29 09:18:25 -07:00
Ingo Molnar
03cbc358aa [PATCH] lockdep core: improve the lock-chain-hash
With CONFIG_DEBUG_LOCK_ALLOC turned off i was getting sporadic failures in
the locking self-test:

  ------------>
  | Locking API testsuite:
  ----------------------------------------------------------------------------
                                   | spin |wlock |rlock |mutex | wsem | rsem |
    --------------------------------------------------------------------------
                       A-A deadlock:  ok  |  ok  |  ok  |  ok  |  ok  |  ok  |
                   A-B-B-A deadlock:  ok  |  ok  |  ok  |  ok  |  ok  |  ok  |
               A-B-B-C-C-A deadlock:  ok  |  ok  |  ok  |  ok  |  ok  |  ok  |
               A-B-C-A-B-C deadlock:  ok  |  ok  |  ok  |  ok  |  ok  |  ok  |
           A-B-B-C-C-D-D-A deadlock:  ok  |FAILED|  ok  |  ok  |  ok  |  ok  |
           A-B-C-D-B-D-D-A deadlock:  ok  |  ok  |  ok  |  ok  |  ok  |  ok  |
           A-B-C-D-B-C-D-A deadlock:  ok  |  ok  |  ok  |  ok  |  ok  |FAILED|

after much debugging it turned out to be caused by accidental chain-hash
key collisions.  The current hash is:

 #define iterate_chain_key(key1, key2) \
	(((key1) << MAX_LOCKDEP_KEYS_BITS/2) ^ \
	((key1) >> (64-MAX_LOCKDEP_KEYS_BITS/2)) ^ \
 	(key2))

where MAX_LOCKDEP_KEYS_BITS is 11.  This hash is pretty good as it will
shift by 5 bits in every iteration, where every new ID 'mixed' into the
hash would have up to 11 bits.  But because there was a 6 bits overlap
between subsequent IDs and their high bits tended to be similar, there was
a chance for accidental chain-hash collision for a low number of locks
held.

the solution is to shift by 11 bits:

 #define iterate_chain_key(key1, key2) \
	(((key1) << MAX_LOCKDEP_KEYS_BITS) ^ \
	((key1) >> (64-MAX_LOCKDEP_KEYS_BITS)) ^ \
 	(key2))

This keeps the hash perfect up to 5 locks held, but even above that the
hash is still good because 11 bits is a relative prime to the total 64
bits, so a complete match will only occur after 64 held locks (which doesnt
happen in Linux).  Even after 5 locks held, entropy of the 5 IDs mixed into
the hash is already good enough so that overlap doesnt generate a colliding
hash ID.

with this change the false positives went away.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29 09:18:25 -07:00
Alan Cox
eb84a20e9e [PATCH] audit/accounting: tty locking
Add tty locking around the audit and accounting code.

The whole current->signal-> locking is all deeply strange but it's for
someone else to sort out.  Add rather than replace the lock for acct.c

Signed-off-by: Alan Cox <alan@redhat.com>
Acked-by: Arjan van de Ven <arjan@linux.intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29 09:18:25 -07:00
Rusty Russell
e5582ca21a [PATCH] stop_machine.c copyright
I had to look back: this code was extracted from the module.c code in 2005.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29 09:18:24 -07:00
Ian S. Nelson
04b1db9fd7 [PATCH] /sys/modules: allow full length section names
I've been using systemtap for some debugging and I noticed that it can't
probe a lot of modules.  Turns out it's kind of silly, the sections section
of /sys/module is limited to 32byte filenames and many of the actual
sections are a a bit longer than that.

[akpm@osdl.org: rewrite to use dymanic allocation]
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29 09:18:23 -07:00
Paul Jackson
b1aac8bb82 [PATCH] cpuset: hotunplug cpus and mems in all cpusets
The cpuset code handling hot unplug of CPUs or Memory Nodes was incorrect -
it could remove a CPU or Node from the top cpuset, while leaving it still
in some child cpusets.

One basic rule of cpusets is that each cpusets cpus and mems are subsets of
its parents.  The cpuset hot unplug code violated this rule.

So the cpuset hotunplug handler must walk down the tree, removing any
removed CPU or Node from all cpusets.

However, it is not allowed to make a cpusets cpus or mems become empty.
They can only transition from empty to non-empty, not back.

So if the last CPU or Node would be removed from a cpuset by the above
walk, we scan back up the cpuset hierarchy, finding the nearest ancestor
that still has something online, and copy its CPU or Memory placement.

Signed-off-by: Paul Jackson <pj@sgi.com>
Cc: Nathan Lynch <ntl@pobox.com>
Cc: Anton Blanchard <anton@samba.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29 09:18:21 -07:00
Paul Jackson
38837fc75a [PATCH] cpuset: top_cpuset tracks hotplug changes to node_online_map
Change the list of memory nodes allowed to tasks in the top (root) nodeset
to dynamically track what cpus are online, using a call to a cpuset hook
from the memory hotplug code.  Make this top cpus file read-only.

On systems that have cpusets configured in their kernel, but that aren't
actively using cpusets (for some distros, this covers the majority of
systems) all tasks end up in the top cpuset.

If that system does support memory hotplug, then these tasks cannot make
use of memory nodes that are added after system boot, because the memory
nodes are not allowed in the top cpuset.  This is a surprising regression
over earlier kernels that didn't have cpusets enabled.

One key motivation for this change is to remain consistent with the
behaviour for the top_cpuset's 'cpus', which is also read-only, and which
automatically tracks the cpu_online_map.

This change also has the minor benefit that it fixes a long standing,
little noticed, minor bug in cpusets.  The cpuset performance tweak to
short circuit the cpuset_zone_allowed() check on systems with just a single
cpuset (see 'number_of_cpusets', in linux/cpuset.h) meant that simply
changing the 'mems' of the top_cpuset had no affect, even though the change
(the write system call) appeared to succeed.  With the following change,
that write to the 'mems' file fails -EACCES, and the 'mems' file stubbornly
refuses to be changed via user space writes.  Thus no one should be mislead
into thinking they've changed the top_cpusets's 'mems' when in affect they
haven't.

In order to keep the behaviour of cpusets consistent between systems
actively making use of them and systems not using them, this patch changes
the behaviour of the 'mems' file in the top (root) cpuset, making it read
only, and making it automatically track the value of node_online_map.  Thus
tasks in the top cpuset will have automatic use of hot plugged memory nodes
allowed by their cpuset.

[akpm@osdl.org: build fix]
[bunk@stusta.de: build fix]
Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29 09:18:21 -07:00
Oleg Nesterov
c394cc9fbb [PATCH] introduce TASK_DEAD state
I am not sure about this patch, I am asking Ingo to take a decision.

task_struct->state == EXIT_DEAD is a very special case, to avoid a confusion
it makes sense to introduce a new state, TASK_DEAD, while EXIT_DEAD should
live only in ->exit_state as documented in sched.h.

Note that this state is not visible to user-space, get_task_state() masks off
unsuitable states.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29 09:18:21 -07:00
Oleg Nesterov
55a101f8f7 [PATCH] kill PF_DEAD flag
After the previous change (->flags & PF_DEAD) <=> (->state == EXIT_DEAD), we
don't need PF_DEAD any longer.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29 09:18:20 -07:00
Oleg Nesterov
29b8849216 [PATCH] set EXIT_DEAD state in do_exit(), not in schedule()
schedule() checks PF_DEAD on every context switch and sets ->state = EXIT_DEAD
to ensure that the exiting task will be deactivated.  Note that this EXIT_DEAD
is in fact a "random" value, we can use any bit except normal TASK_XXX values.

It is better to set this state in do_exit() along with PF_DEAD flag and remove
that check in schedule().

We are safe wrt concurrent try_to_wake_up() (for example ptrace, tkill), it
can not change task's ->state: the 'state' argument of try_to_wake_up() can't
have EXIT_DEAD bit.  And in case when try_to_wake_up() sees a stale value of
->state == TASK_RUNNING it will do nothing.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29 09:18:20 -07:00
Oleg Nesterov
aaa2a97eb9 [PATCH] sys_get_robust_list(): don't take tasklist_lock
use rcu locks for find_task_by_pid().

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29 09:18:18 -07:00
Oleg Nesterov
d359b549bf [PATCH] futex_find_get_task(): don't take tasklist_lock
It is ok to do find_task_by_pid() + get_task_struct() under
rcu_read_lock(), we cand drop tasklist_lock.

Note that testing of ->exit_state is racy with or without tasklist anyway.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29 09:18:18 -07:00
Oleg Nesterov
5b160f5ecd [PATCH] copy_process: cosmetic ->ioprio tweak
copy_process:
// holds tasklist_lock + ->siglock
       /*
        * inherit ioprio
        */
       p->ioprio = current->ioprio;

Why?  ->ioprio was already copied in dup_task_struct().  I guess this is
needed to ensure that the child can't escape
sys_ioprio_set(IOPRIO_WHO_{PGRP,USER}), yes?

In that case we don't need ->siglock held, and the comment should be
updated.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Jens Axboe <axboe@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29 09:18:18 -07:00
Oleg Nesterov
1c573afebc [PATCH] reparent_to_init(): use has_rt_policy()
Remove open-coded has_rt_policy(), no changes in kernel/exit.o

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29 09:18:18 -07:00
Oleg Nesterov
8dc3e9099e [PATCH] sched_setscheduler: fix? policy checks
I am not sure this patch is correct: I can't understand what the current
code does, and I don't know what it was supposed to do.

The comment says:

		 * can't change policy, except between SCHED_NORMAL
		 * and SCHED_BATCH:

The code:

		if (((policy != SCHED_NORMAL && p->policy != SCHED_BATCH) &&
			(policy != SCHED_BATCH && p->policy != SCHED_NORMAL)) &&

But this is equivalent to:

		if ( (is_rt_policy(policy) && has_rt_policy(p)) &&

which means something different.  We can't _decrease_ the current
->rt_priority with such a check (if rlim[RLIMIT_RTPRIO] == 0).

Probably, it was supposed to be:

		if (	!(policy == SCHED_NORMAL && p->policy == SCHED_BATCH)  &&
			!(policy == SCHED_BATCH  && p->policy == SCHED_NORMAL)

this matches the comment, but strange: it doesn't allow to _drop_ the
realtime priority when rlim[RLIMIT_RTPRIO] == 0.

I think the right check would be:

		/* can't set/change rt policy */
		if (is_rt_policy(policy) &&
				policy != p->policy &&
				!rlim_rtprio)
			return -EPERM;

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29 09:18:17 -07:00
Oleg Nesterov
57a6f51c42 [PATCH] introduce is_rt_policy() helper
Imho, makes the code a bit easier to read.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29 09:18:17 -07:00
Oleg Nesterov
5fe1d75f34 [PATCH] do_sched_setscheduler(): don't take tasklist_lock
Use rcu locks instead. sched_setscheduler() now takes ->siglock
before reading ->signal->rlim[].

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29 09:18:17 -07:00
Cal Peake
c9472e0f28 [PATCH] kill extraneous printk in kernel_restart()
Get rid of an extraneous printk in kernel_restart().

Signed-off-by: Cal Peake <cp@absolutedigital.net>
Acked-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29 09:18:16 -07:00
Björn Steinbrink
111dbe0c8a [PATCH] Fix ____call_usermodehelper errors being silently ignored
If ____call_usermodehelper fails, we're not interested in the child
process' exit value, but the real error, so let's stop wait_for_helper from
overwriting it in that case.

Issue discovered by Benedikt Böhm while working on a Linux-VServer usermode
helper.

Signed-off-by: Björn Steinbrink <B.Steinbrink@gmx.de>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29 09:18:16 -07:00
Alan Cox
54306cf04c [PATCH] exit: fix crash case
If we are going to BUG() not panic() here then we should cover the case of
the BUG being compiled out

Signed-off-by: Alan Cox <alan@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29 09:18:16 -07:00
Roland McGrath
0b4a8a789a [PATCH] kexec warning fix
This fixes a couple of compiler warnings, and adds paranoia checks as well.

Signed-off-by: Roland McGrath <roland@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29 09:18:15 -07:00
Atsushi Nemoto
3171a0305d [PATCH] simplify update_times (avoid jiffies/jiffies_64 aliasing problem)
Pass ticks to do_timer() and update_times(), and adjust x86_64 and s390
timer interrupt handler with this change.

Currently update_times() calculates ticks by "jiffies - wall_jiffies", but
callers of do_timer() should know how many ticks to update.  Passing ticks
get rid of this redundant calculation.  Also there are another redundancy
pointed out by Martin Schwidefsky.

This cleanup make a barrier added by
5aee405c66 needless.  So this patch removes
it.

As a bonus, this cleanup make wall_jiffies can be removed easily, since now
wall_jiffies is always synced with jiffies.  (This patch does not really
remove wall_jiffies.  It would be another cleanup patch)

Signed-off-by: Atsushi Nemoto <anemo@mba.ocn.ne.jp>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: john stultz <johnstul@us.ibm.com>
Cc: Andi Kleen <ak@muc.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Acked-by: Russell King <rmk@arm.linux.org.uk>
Cc: Ian Molton <spyro@f2s.com>
Cc: Mikael Starvik <starvik@axis.com>
Acked-by: David Howells <dhowells@redhat.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Hirokazu Takata <takata.hirokazu@renesas.com>
Acked-by: Ralf Baechle <ralf@linux-mips.org>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Kazumoto Kojima <kkojima@rr.iij4u.or.jp>
Cc: Richard Curnow <rc@rc0.org.uk>
Cc: William Lee Irwin III <wli@holomorphy.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
Cc: Miles Bader <uclinux-v850@lsi.nec.co.jp>
Cc: Chris Zankel <chris@zankel.net>
Acked-by: "Luck, Tony" <tony.luck@intel.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Roman Zippel <zippel@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29 09:18:15 -07:00
Roland McGrath
27d91e07f9 [PATCH] __dequeue_signal() cleanup
This tightens up __dequeue_signal a little.  It also avoids doing
recalc_sigpending twice in a row, instead doing it once in dequeue_signal.

Signed-off-by: Roland McGrath <roland@redhat.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29 09:18:15 -07:00
Roland McGrath
b9ecb2bd5d [PATCH] has_stopped_jobs() cleanup
This check has been obsolete since the introduction of TASK_TRACED.  Now
TASK_STOPPED always means job control stop.

Signed-off-by: Roland McGrath <roland@redhat.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29 09:18:15 -07:00
Toyo Abe
e4b765551a [PATCH] posix-timers: Fix the flags handling in posix_cpu_nsleep()
When a posix_cpu_nsleep() sleep is interrupted by a signal more than twice, it
incorrectly reports the sleep time remaining to the user.  Because
posix_cpu_nsleep() doesn't report back to the user when it's called from
restart function due to the wrong flags handling.

This patch, which applies after previous one, moves the nanosleep() function
from posix_cpu_nsleep() to do_cpu_nanosleep() and cleans up the flags handling
appropriately.

Signed-off-by: Toyo Abe <toyoa@mvista.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29 09:18:15 -07:00
Toyo Abe
1711ef3866 [PATCH] posix-timers: Fix clock_nanosleep() doesn't return the remaining time in compatibility mode
The clock_nanosleep() function does not return the time remaining when the
sleep is interrupted by a signal.

This patch creates a new call out, compat_clock_nanosleep_restart(), which
handles returning the remaining time after a sleep is interrupted.  This
patch revives clock_nanosleep_restart().  It is now accessed via the new
call out.  The compat_clock_nanosleep_restart() is used for compatibility
access.

Since this is implemented in compatibility mode the normal path is
virtually unaffected - no real performance impact.

Signed-off-by: Toyo Abe <toyoa@mvista.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29 09:18:15 -07:00
Akinobu Mita
07dccf3344 [PATCH] check return value of cpu_callback
Spawing ksoftirqd, migration, or watchdog, and calling init_timers_cpu()
may fail with small memory.  If it happens in initcalls, kernel NULL
pointer dereference happens later.  This patch makes crash happen
immediately in such cases.  It seems a bit better than getting kernel NULL
pointer dereference later.

Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Akinobu Mita <mita@miraclelinux.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29 09:18:14 -07:00
Paul E. McKenney
a45bce4954 [PATCH] memory ordering in __kfifo primitives
Both __kfifo_put() and __kfifo_get() have header comments stating that if
there is but one concurrent reader and one concurrent writer, locking is not
necessary.  This is almost the case, but a couple of memory barriers are
needed.  Another option would be to change the header comments to remove the
bit about locking not being needed, and to change the those callers who
currently don't use locking to add the required locking.  The attachment
analyzes this approach, but the patch below seems simpler.

Signed-off-by: Paul E. McKenney <paulmck@us.ibm.com>
Cc: Stelian Pop <stelian@popies.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29 09:18:13 -07:00
Dave Jones
99de055ac0 [PATCH] lockdep: print kernel version
Lets do the same thing we do for oopses - print out the version in the
report.  It's an extra line of output though.  We could tack it on the end
of the INFO: lines, but that screws up Ingo's pretty output.

Signed-off-by: Dave Jones <davej@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29 09:18:13 -07:00
Sukadev Bhattiprolu
f400e198b2 [PATCH] pidspace: is_init()
This is an updated version of Eric Biederman's is_init() patch.
(http://lkml.org/lkml/2006/2/6/280).  It applies cleanly to 2.6.18-rc3 and
replaces a few more instances of ->pid == 1 with is_init().

Further, is_init() checks pid and thus removes dependency on Eric's other
patches for now.

Eric's original description:

	There are a lot of places in the kernel where we test for init
	because we give it special properties.  Most  significantly init
	must not die.  This results in code all over the kernel test
	->pid == 1.

	Introduce is_init to capture this case.

	With multiple pid spaces for all of the cases affected we are
	looking for only the first process on the system, not some other
	process that has pid == 1.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Serge Hallyn <serue@us.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Cc: <lxc-devel@lists.sourceforge.net>
Acked-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29 09:18:12 -07:00
Kirill Korotaev
3b9b8ab65d [PATCH] Fix unserialized task->files changing
Fixed race on put_files_struct on exec with proc.  Restoring files on
current on error path may lead to proc having a pointer to already kfree-d
files_struct.

->files changing at exit.c and khtread.c are safe as exit_files() makes all
things under lock.

Found during OpenVZ stress testing.

[akpm@osdl.org: add export]
Signed-off-by: Pavel Emelianov <xemul@openvz.org>
Signed-off-by: Kirill Korotaev <dev@openvz.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29 09:18:12 -07:00
Chuck Ebbert
e6cab99bb4 [PATCH] unwind: fix unused variable warning when !CONFIG_MODULES
Fix "variable defined but not used" compiler warning in unwind.c when
CONFIG_MODULES is not set.

Signed-off-by: Chuck Ebbert <76306.1226@compuserve.com>
Cc: Jan Beulich <jbeulich@novell.com>
Cc: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29 09:18:11 -07:00
Rolf Eike Beer
2aae4a108d [PATCH] Fix kerneldoc comments in kernel/timer.c
Some of the kerneldoc comments in this file are ignored since the lead-in
is malformed, using either "/*" or "/***" instead of "/**".

[rdunlap@xenotime.net: kerneldoc fixes]
Signed-off-by: Rolf Eike Beer <eike-kernel@sf-tec.de>
Acked-by: Alan Cox <alan@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29 09:18:10 -07:00
Steven Rostedt
db630637b2 [PATCH] clean up and remove some extra spinlocks from rtmutex
Oleg brought up some interesting points about grabbing the pi_lock for some
protections.  In this discussion, I realized that there are some places
that the pi_lock is being grabbed when it really wasn't necessary.  Also
this patch does a little bit of clean up.

This patch basically does three things:

1) renames the "boost" variable to "chain_walk".  Since it is used in
   the debugging case when it isn't going to be boosted.  It better
   describes what the test is going to do if it succeeds.

2) moves get_task_struct to just before the unlocking of the wait_lock.
   This removes duplicate code, and makes it a little easier to read.  The
   owner wont go away while either the pi_lock or the wait_lock are held.

3) removes the pi_locking and owner blocked checking completely from the
   debugging case.  This is because the grabbing the lock and doing the
   check, then releasing the lock is just so full of races.  It's just as
   good to go ahead and call the pi_chain_walk function, since after
   releasing the lock the owner can then block anyway, and we would have
   missed that.  For the debug case, we really do want to do the chain walk
   to test for deadlocks anyway.

[oleg@tv-sign.ru: more of the same]
Signed-of-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Esben Nielsen <nielsen.esben@googlemail.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29 09:18:09 -07:00
Alexey Dobriyan
6c5c934153 [PATCH] ifdef blktrace debugging fields
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: Jens Axboe <axboe@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29 09:18:09 -07:00
Josh Triplett
89e7e374dd [PATCH] timer: add lock annotation to lock_timer_base
lock_timer_base acquires a lock and returns with that lock held.  Add a
lock annotation to this function so that sparse can check callers for lock
pairing, and so that sparse will not complain about this function since it
intentionally uses the lock in this manner.

Signed-off-by: Josh Triplett <josh@freedesktop.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29 09:18:09 -07:00
Mark Huang
d10be6d1bd [PATCH] module_subsys: initialize earlier
Initialize module_subsys earlier (or at least earlier than devices) since
it could be used very early in the boot process if kmod loads a module
before the device initcalls.  Otherwise, kmod will crash in
kernel/module.c:mod_sysfs_setup() since the kset in module_subsys is not
initialized yet.

I only noticed this problem because occasionally, kmod loads the modules
for my SCSI and Ethernet adapters very early, during the boot process
itself.  I don't quite understand why it loads them sometimes and doesn't
load them other times.  Or who is telling kmod to do so.  Can someone
explain?

Signed-off-by: Mark Huang <mlhuang@cs.princeton.edu>
Cc: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29 09:18:08 -07:00
Josh Triplett
a49a4af759 [PATCH] rcu: add lock annotations to rcu{,_bh}_torture_read_{lock,unlock}
rcu_torture_read_lock and rcu_bh_torture_read_lock acquire locks without
releasing them, and the matching functions rcu_torture_read_unlock and
rcu_bh_torture_read_unlock get called with the corresponding locks held and
release them.  Add lock annotations to these four functions so that sparse
can check callers for lock pairing, and so that sparse will not complain
about these functions since they intentionally use locks in this manner.

Signed-off-by: Josh Triplett <josh@freedesktop.org>
Acked-by: Paul McKenney <paulmck@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29 09:18:08 -07:00
Yoichi Yuasa
538d9d532b [PATCH] irq: remove a extra line
Signed-off-by: Yoichi Yuasa <yoichi_yuasa@tripeaks.co.jp>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29 09:18:07 -07:00
Yoichi Yuasa
2ff6fd8f4a [PATCH] irq: fixed coding style
Signed-off-by: Yoichi Yuasa <yoichi_yuasa@tripeaks.co.jp>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29 09:18:07 -07:00
Randy Dunlap
4c78a66393 [PATCH] kernel-doc for relay interface
Add relay interface support to DocBook/kernel-api.tmpl.  Fix typos etc.  in
relay.c and relayfs.txt.

Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Acked-by: Tom Zanussi <zanussi@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29 09:18:06 -07:00
Randy Dunlap
d8c7649e99 [PATCH] kernel/params: driver layer error checking
Check driver layer return values in kernel/params.c

Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29 09:18:04 -07:00
Matthew Wilcox
910067d188 [PATCH] remove generic__raw_read_trylock()
If the cpu has the lock held for write, is interrupted, and the interrupt
handler calls read_trylock(), it's an instant deadlock.

Now, Dave Miller has subsequently pointed out that we don't have any
situations where this can occur.  Nevertheless, we should delete
generic__raw_read_lock (and its associated EXPORT to make Arjan happy) so that
nobody thinks they can use it.

Acked-by: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29 09:18:03 -07:00
Linus Torvalds
ebdea46fec Merge branch 'devel' of master.kernel.org:/home/rmk/linux-2.6-arm
* 'devel' of master.kernel.org:/home/rmk/linux-2.6-arm: (130 commits)
  [ARM] 3856/1: Add clocksource for Intel IXP4xx platforms
  [ARM] 3855/1: Add generic time support
  [ARM] 3873/1: S3C24XX: Add irq_chip names
  [ARM] 3872/1: S3C24XX: Apply consistant tabbing to irq_chips
  [ARM] 3871/1: S3C24XX: Fix ordering of EINT4..23
  [ARM] nommu: confirms the CR_V bit in nommu mode
  [ARM] nommu: abort handler fixup for !CPU_CP15_MMU cores.
  [ARM] 3870/1: AT91: Start removing static memory mappings
  [ARM] 3869/1: AT91: NAND support for DK and KB9202 boards
  [ARM] 3868/1: AT91 hardware header update
  [ARM] 3867/1: AT91 GPIO update
  [ARM] 3866/1: AT91 clock update
  [ARM] 3865/1: AT91RM9200 header updates
  [ARM] 3862/2: S3C2410 - add basic power management support for AML M5900 series
  [ARM] kthread: switch arch/arm/kernel/apm.c
  [ARM] Off-by-one in arch/arm/common/icst*
  [ARM] 3864/1: Refactore sharpsl_pm
  [ARM] 3863/1: Add Locomo SPI Device
  [ARM] 3847/2:  Convert LOMOMO to use struct device for GPIOs
  [ARM] Use CPU_CACHE_* where possible in asm/cacheflush.h
  ...
2006-09-28 14:40:39 -07:00
Russell King
2dc94310bd Merge master.kernel.org:/pub/scm/linux/kernel/git/tmlind/linux-omap-upstream into devel 2006-09-27 19:57:54 +01:00
Eric W. Biederman
65800ac77e [PATCH] pid: remove temporary debug code in attach_pid
With the patches flying between Oleg and myself somehow this temporary
debug code got left in pid.c.  It was never intended to make it to the
stable kernel.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-27 08:26:19 -07:00
Eric W. Biederman
c18258c6f0 [PATCH] pid: Implement transfer_pid and use it to simplify de_thread
In de_thread we move pids from one process to another, a rather ugly case.
The function transfer_pid makes it clear what we are doing, and makes the
action atomic.  This is useful we ever want to atomically traverse the
process group and session lists, in a rcu safe manner.

Even if the atomic properties this change should be a win as transfer_pid
should be less code to execute than executing both attach_pid and
detach_pid, and this should make de_thread slightly smaller as only a
single function call needs to be emitted.  The only downside is that the
code might be slower to execute as the odds are against transfer_pid being
in cache.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-27 08:26:19 -07:00
Eric W. Biederman
b89a81712f [PATCH] sysctl: Allow /proc/sys without sys_sysctl
Since sys_sysctl is deprecated start allow it to be compiled out.  This
should catch any remaining user space code that cares, and paves the way
for further sysctl cleanups.

[akpm@osdl.org: If sys_sysctl() is not compiled-in, emit a warning]
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-27 08:26:19 -07:00
Theodore Ts'o
ba52de123d [PATCH] inode-diet: Eliminate i_blksize from the inode structure
This eliminates the i_blksize field from struct inode.  Filesystems that want
to provide a per-inode st_blksize can do so by providing their own getattr
routine instead of using the generic_fillattr() function.

Note that some filesystems were providing pretty much random (and incorrect)
values for i_blksize.

[bunk@stusta.de: cleanup]
[akpm@osdl.org: generic_fillattr() fix]
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-27 08:26:18 -07:00
Theodore Ts'o
8e18e2941c [PATCH] inode_diet: Replace inode.u.generic_ip with inode.i_private
The following patches reduce the size of the VFS inode structure by 28 bytes
on a UP x86.  (It would be more on an x86_64 system).  This is a 10% reduction
in the inode size on a UP kernel that is configured in a production mode
(i.e., with no spinlock or other debugging functions enabled; if you want to
save memory taken up by in-core inodes, the first thing you should do is
disable the debugging options; they are responsible for a huge amount of bloat
in the VFS inode structure).

This patch:

The filesystem or device-specific pointer in the inode is inside a union,
which is pretty pointless given that all 30+ users of this field have been
using the void pointer.  Get rid of the union and rename it to i_private, with
a comment to explain who is allowed to use the void pointer.  This is just a
cleanup, but it allows us to reuse the union 'u' for something something where
the union will actually be used.

[judith@osdl.org: powerpc build fix]
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Judith Lebzelter <judith@osdl.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-27 08:26:17 -07:00
David Howells
f269fdd182 [PATCH] NOMMU: move the fallback arch_vma_name() to a sensible place
Move the fallback arch_vma_name() to a sensible place (kernel/signal.c).

Currently it's in fs/proc/task_mmu.c, a file that is dependent on both
CONFIG_PROC_FS and CONFIG_MMU being enabled, but it's used from
kernel/signal.c from where it is called unconditionally.

[akpm@osdl.org: build fix]
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-27 08:26:15 -07:00
David Howells
0ec76a110f [PATCH] NOMMU: Check that access_process_vm() has a valid target
Check that access_process_vm() is accessing a valid mapping in the target
process.

This limits ptrace() accesses and accesses through /proc/<pid>/maps to only
those regions actually mapped by a program.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-27 08:26:14 -07:00
Matthew Wilcox
d33b6fba2c Resources: insert identical resources above existing resources
If you have two resources which aree exactly the same size,
insert_resource() currently inserts the new one below the existing one. 
This is wrong because there's no way to insert a resource of the same size
above an existing one.

I took this opportunity to rewrite the initial loop to be a for-loop
instead of a goto-loop and fix the documentation.

Signed-off-by: Matthew Wilcox <matthew@wil.cx>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Dominik Brodowski <linux@dominikbrodowski.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2006-09-26 17:43:52 -07:00
Linus Torvalds
b278240839 Merge branch 'for-linus' of git://one.firstfloor.org/home/andi/git/linux-2.6
* 'for-linus' of git://one.firstfloor.org/home/andi/git/linux-2.6: (225 commits)
  [PATCH] Don't set calgary iommu as default y
  [PATCH] i386/x86-64: New Intel feature flags
  [PATCH] x86: Add a cumulative thermal throttle event counter.
  [PATCH] i386: Make the jiffies compares use the 64bit safe macros.
  [PATCH] x86: Refactor thermal throttle processing
  [PATCH] Add 64bit jiffies compares (for use with get_jiffies_64)
  [PATCH] Fix unwinder warning in traps.c
  [PATCH] x86: Allow disabling early pci scans with pci=noearly or disallowing conf1
  [PATCH] x86: Move direct PCI scanning functions out of line
  [PATCH] i386/x86-64: Make all early PCI scans dependent on CONFIG_PCI
  [PATCH] Don't leak NT bit into next task
  [PATCH] i386/x86-64: Work around gcc bug with noreturn functions in unwinder
  [PATCH] Fix some broken white space in ia32_signal.c
  [PATCH] Initialize argument registers for 32bit signal handlers.
  [PATCH] Remove all traces of signal number conversion
  [PATCH] Don't synchronize time reading on single core AMD systems
  [PATCH] Remove outdated comment in x86-64 mmconfig code
  [PATCH] Use string instructions for Core2 copy/clear
  [PATCH] x86: - restore i8259A eoi status on resume
  [PATCH] i386: Split multi-line printk in oops output.
  ...
2006-09-26 13:07:55 -07:00
Linus Torvalds
dd77a4ee0f Merge master.kernel.org:/pub/scm/linux/kernel/git/gregkh/driver-2.6
* master.kernel.org:/pub/scm/linux/kernel/git/gregkh/driver-2.6: (47 commits)
  Driver core: Don't call put methods while holding a spinlock
  Driver core: Remove unneeded routines from driver core
  Driver core: Fix potential deadlock in driver core
  PCI: enable driver multi-threaded probe
  Driver Core: add ability for drivers to do a threaded probe
  sysfs: add proper sysfs_init() prototype
  drivers/base: check errors
  drivers/base: Platform notify needs to occur before drivers attach to the device
  v4l-dev2: handle __must_check
  add CONFIG_ENABLE_MUST_CHECK
  add __must_check to device management code
  Driver core: fixed add_bind_files() definition
  Driver core: fix comments in drivers/base/power/resume.c
  sysfs_remove_bin_file: no return value, dump_stack on error
  kobject: must_check fixes
  Driver core: add ability for devices to create and remove bin files
  Class: add support for class interfaces for devices
  Driver core: create devices/virtual/ tree
  Driver core: add device_rename function
  Driver core: add ability for classes to handle devices properly
  ...
2006-09-26 11:49:46 -07:00
Rafael J. Wysocki
c5c6ba4e08 [PATCH] PM: Add pm_trace switch
Add the pm_trace attribute in /sys/power which has to be explicitly set to
one to really enable the "PM tracing" code compiled in when CONFIG_PM_TRACE
is set (which modifies the machine's CMOS clock in unpredictable ways).

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-26 08:49:04 -07:00
Rafael J. Wysocki
c8eb8b4025 [PATCH] PM: make it possible to disable console suspending
Change suspend_console() so that it waits for all consoles to flush the
remaining messages and make it possible to switch the console suspending off
with the help of a Kconfig option.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Cc: Stefan Seyfried <seife@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-26 08:49:03 -07:00
Rafael J. Wysocki
940864ddab [PATCH] swsusp: Use memory bitmaps during resume
Make swsusp use memory bitmaps to store its internal information during the
resume phase of the suspend-resume cycle.

If the pfns of saveable pages are saved during the suspend phase instead of
the kernel virtual addresses of these pages, we can use them during the resume
phase directly to set the corresponding bits in a memory bitmap.  Then, this
bitmap is used to mark the page frames corresponding to the pages that were
saveable before the suspend (aka "unsafe" page frames).

Next, we allocate as many page frames as needed to store the entire suspend
image and make sure that there will be some extra free "safe" page frames for
the list of PBEs constructed later.  Subsequently, the image is loaded and, if
possible, the data loaded from it are written into their "original" page
frames (ie.  the ones they had occupied before the suspend).

The image data that cannot be written into their "original" page frames are
loaded into "safe" page frames and their "original" kernel virtual addresses,
as well as the addresses of the "safe" pages containing their copies, are
stored in a list of PBEs.  Finally, the list of PBEs is used to copy the
remaining image data into their "original" page frames (this is done
atomically, by the architecture-dependent parts of swsusp).

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-26 08:49:02 -07:00
Rafael J. Wysocki
b788db7989 [PATCH] swsusp: Introduce memory bitmaps
Introduce the memory bitmap data structure and make swsusp use in the suspend
phase.

The current swsusp's internal data structure is not very efficient from the
memory usage point of view, so it seems reasonable to replace it with a data
structure that will require less memory, such as a pair of bitmaps.

The idea is to use bitmaps that may be allocated as sets of individual pages,
so that we can avoid making allocations of order greater than 0.  For this
reason the memory bitmap structure consists of several linked lists of objects
that contain pointers to memory pages with the actual bitmap data.  Still, for
a typical system all of these lists fit in a single page, so it's reasonable
to introduce an additional mechanism allowing us to allocate all of them
efficiently without sacrificing the generality of the design.  This is done
with the help of the chain_allocator structure and associated functions.

We need to use two memory bitmaps during the suspend phase of the
suspend-resume cycle.  One of them is necessary for marking the saveable
pages, and the second is used to mark the pages in which to store the copies
of them (aka image pages).

First, the bitmaps are created and we allocate as many image pages as needed
(the corresponding bits in the second bitmap are set as soon as the pages are
allocated).  Second, the bits corresponding to the saveable pages are set in
the first bitmap and the saveable pages are copied to the image pages.
Finally, the first bitmap is used to save the kernel virtual addresses of the
saveable pages and the second one is used to save the contents of the image
pages.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-26 08:49:02 -07:00
Rafael J. Wysocki
0bcd888d64 [PATCH] swsusp: Introduce some helpful constants
Introduce some constants that hopefully will help improve the readability of
code in kernel/power/snapshot.c.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-26 08:49:02 -07:00
Rafael J. Wysocki
75534b50cc [PATCH] Change the name of pagedir_nosave
The name of the pagedir_nosave variable does not make sense any more, so it
seems reasonable to change it to something more meaningful.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-26 08:49:01 -07:00
Rafael J. Wysocki
cd560bb2f9 [PATCH] swsusp: Fix alloc_pagedir
Get rid of the FIXME in kernel/power/snapshot.c#alloc_pagedir() and
simplify the functions called by it.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@suse.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-26 08:48:59 -07:00
Rafael J. Wysocki
f6143aa60e [PATCH] swsusp: Reorder memory-allocating functions
Move some functions in kernel/power/snapshot.c to a better place (in the
same file) and introduce free_image_page() (will be necessary in the
future).

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@suse.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-26 08:48:59 -07:00
Rafael J. Wysocki
f623f0db8e [PATCH] swsusp: Fix mark_free_pages
Clean up mm/page_alloc.c#mark_free_pages() and make it avoid clearing
PageNosaveFree for PageNosave pages.  This allows us to get rid of an ugly
hack in kernel/power/snapshot.c#copy_data_pages().

Additionally, the page-copying loop in copy_data_pages() is moved to an
inline function.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-26 08:48:59 -07:00
Rafael J. Wysocki
e3920fb42c [PATCH] Disable CPU hotplug during suspend
The current suspend code has to be run on one CPU, so we use the CPU
hotplug to take the non-boot CPUs offline on SMP machines.  However, we
should also make sure that these CPUs will not be enabled by someone else
after we have disabled them.

The functions disable_nonboot_cpus() and enable_nonboot_cpus() are moved to
kernel/cpu.c, because they now refer to some stuff in there that should
better be static.  Also it's better if disable_nonboot_cpus() returns an
error instead of panicking if something goes wrong, and
enable_nonboot_cpus() has no reason to panic(), because the CPUs may have
been enabled by the userland before it tries to take them online.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-26 08:48:59 -07:00
Rafael J. Wysocki
fb13a28b0f [PATCH] swsusp: struct snapshot_handle cleanup
Add comments describing struct snapshot_handle and its members, change the
confusing name of its member 'page' to 'cur'.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-26 08:48:58 -07:00
Rafael J. Wysocki
ae83c5eef5 [PATCH] swsusp: clean up browsing of pfns
Clean up some loops over pfns for each zone in snapshot.c: reduce the
number of additions to perform, rework detection of saveable pages and make
the code a bit less difficult to understand, hopefully.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-26 08:48:58 -07:00
Andrew Morton
546e0d2719 [PATCH] swsusp: read speedup
Implement async reads for swsusp resuming.

Crufty old PIII testbox:
	15.7 MB/s -> 20.3 MB/s

Sony Vaio:
	14.6 MB/s -> 33.3 MB/s

I didn't implement the post-resume bio_set_pages_dirty().  I don't really
understand why resume needs to run set_page_dirty() against these pages.

It might be a worry that this code modifies PG_Uptodate, PG_Error and
PG_Locked against the image pages.  Can this possibly affect the resumed-into
kernel?  Hopefully not, if we're atomically restoring its mem_map?

Cc: Pavel Machek <pavel@ucw.cz>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Jens Axboe <axboe@suse.de>
Cc: Laurent Riffard <laurent.riffard@free.fr>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-26 08:48:58 -07:00
Andrew Morton
8c002494b5 [PATCH] swsusp: add read-speed instrumentation
Add some instrumentation to the swsusp readin code to show what bandwidth
we're achieving.

Cc: Pavel Machek <pavel@ucw.cz>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-26 08:48:58 -07:00
Andrew Morton
ab95416035 [PATCH] swsusp: write speedup
Switch the swsusp writeout code from 4k-at-a-time to 4MB-at-a-time.

Crufty old PIII testbox:
	12.9 MB/s -> 20.9 MB/s

Sony Vaio:
	14.7 MB/s -> 26.5 MB/s

The implementation is crude.  A better one would use larger BIOs, but wouldn't
gain any performance.

The memcpys will be mostly pipelined with the IO and basically come for free.

The ENOMEM path has not been tested.  It should be.

Cc: Pavel Machek <pavel@ucw.cz>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-26 08:48:58 -07:00
Andrew Morton
3a4f7577c9 [PATCH] swsusp: add write-speed instrumentation
Add some instrumentation to the swsusp writeout code to show what bandwidth
we're achieving.

Cc: Pavel Machek <pavel@ucw.cz>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-26 08:48:58 -07:00
David Howells
af8c65b57a [PATCH] FRV: permit __do_IRQ() to be dispensed with
Permit __do_IRQ() to be dispensed with based on a configuration option.

Signed-off-by: David Howells <dhowells@redhat.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-26 08:48:53 -07:00
Stephen Smalley
1a70cd40cb [PATCH] selinux: rename selinux_ctxid_to_string
Rename selinux_ctxid_to_string to selinux_sid_to_string to be
consistent with other interfaces.

Signed-off-by: Stephen Smalley <sds@tycho.nsa.gov>
Acked-by: James Morris <jmorris@namei.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-26 08:48:52 -07:00
Stephen Smalley
62bac0185a [PATCH] selinux: eliminate selinux_task_ctxid
Eliminate selinux_task_ctxid since it duplicates selinux_task_get_sid.

Signed-off-by: Stephen Smalley <sds@tycho.nsa.gov>
Acked-by: James Morris <jmorris@namei.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-26 08:48:52 -07:00
Christoph Lameter
89fa30242f [PATCH] NUMA: Add zone_to_nid function
There are many places where we need to determine the node of a zone.
Currently we use a difficult to read sequence of pointer dereferencing.
Put that into an inline function and use throughout VM.  Maybe we can find
a way to optimize the lookup in the future.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-26 08:48:52 -07:00
Christoph Lameter
0ff38490c8 [PATCH] zone_reclaim: dynamic slab reclaim
Currently one can enable slab reclaim by setting an explicit option in
/proc/sys/vm/zone_reclaim_mode.  Slab reclaim is then used as a final
option if the freeing of unmapped file backed pages is not enough to free
enough pages to allow a local allocation.

However, that means that the slab can grow excessively and that most memory
of a node may be used by slabs.  We have had a case where a machine with
46GB of memory was using 40-42GB for slab.  Zone reclaim was effective in
dealing with pagecache pages.  However, slab reclaim was only done during
global reclaim (which is a bit rare on NUMA systems).

This patch implements slab reclaim during zone reclaim.  Zone reclaim
occurs if there is a danger of an off node allocation.  At that point we

1. Shrink the per node page cache if the number of pagecache
   pages is more than min_unmapped_ratio percent of pages in a zone.

2. Shrink the slab cache if the number of the nodes reclaimable slab pages
   (patch depends on earlier one that implements that counter)
   are more than min_slab_ratio (a new /proc/sys/vm tunable).

The shrinking of the slab cache is a bit problematic since it is not node
specific.  So we simply calculate what point in the slab we want to reach
(current per node slab use minus the number of pages that neeed to be
allocated) and then repeately run the global reclaim until that is
unsuccessful or we have reached the limit.  I hope we will have zone based
slab reclaim at some point which will make that easier.

The default for the min_slab_ratio is 5%

Also remove the slab option from /proc/sys/vm/zone_reclaim_mode.

[akpm@osdl.org: cleanups]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-26 08:48:51 -07:00
Christoph Lameter
fbd98167e6 [PATCH] Profiling: require buffer allocation on the correct node
Profiling really suffers with off node buffers.  Fail if no memory is
available on the nodes.  The profiling code can deal with these failures
should they occur.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-26 08:48:50 -07:00
Christoph Lameter
9b819d204c [PATCH] Add __GFP_THISNODE to avoid fallback to other nodes and ignore cpuset/memory policy restrictions
Add a new gfp flag __GFP_THISNODE to avoid fallback to other nodes.  This
flag is essential if a kernel component requires memory to be located on a
certain node.  It will be needed for alloc_pages_node() to force allocation
on the indicated node and for alloc_pages() to force allocation on the
current node.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Andy Whitcroft <apw@shadowen.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-26 08:48:50 -07:00
Christoph Lameter
0a2966b48f [PATCH] Fix longstanding load balancing bug in the scheduler
The scheduler will stop load balancing if the most busy processor contains
processes pinned via processor affinity.

The scheduler currently only does one search for busiest cpu.  If it cannot
pull any tasks away from the busiest cpu because they were pinned then the
scheduler goes into a corner and sulks leaving the idle processors idle.

F.e.  If you have processor 0 busy running four tasks pinned via taskset,
there are none on processor 1 and one just started two processes on
processor 2 then the scheduler will not move one of the two processes away
from processor 2.

This patch fixes that issue by forcing the scheduler to come out of its
corner and retrying the load balancing by considering other processors for
load balancing.

This patch was originally developed by John Hawkes and discussed at

    http://marc.theaimsgroup.com/?l=linux-kernel&m=113901368523205&w=2.

I have removed extraneous material and gone back to equipping struct rq
with the cpu the queue is associated with since this makes the patch much
easier and it is likely that others in the future will have the same
difficulty of figuring out which processor owns which runqueue.

The overhead added through these patches is a single word on the stack if
the kernel is configured to support 32 cpus or less (32 bit).  For 32 bit
environments the maximum number of cpus that can be configued is 255 which
would result in the use of 32 bytes additional on the stack.  On IA64 up to
1k cpus can be configured which will result in the use of 128 additional
bytes on the stack.  The maximum additional cache footprint is one
cacheline.  Typically memory use will be much less than a cacheline and the
additional cpumask will be placed on the stack in a cacheline that already
contains other local variable.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: John Hawkes <hawkes@sgi.com>
Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Peter Williams <pwil3058@bigpond.net.au>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-26 08:48:43 -07:00
Jan Beulich
adf1423698 [PATCH] i386/x86-64: Work around gcc bug with noreturn functions in unwinder
Current gcc generates calls not jumps to noreturn functions. When that happens the
return address can point to the next function, which confuses the unwinder.

This patch works around it by marking asynchronous exception
frames in contrast normal call frames in the unwind information.  Then teach
the unwinder to decode this.

For normal call frames the unwinder now subtracts one from the address which avoids
this problem.  The standard libgcc unwinder uses the same trick.

It doesn't include adjustment of the printed address (i.e. for the original
example, it'd still be kernel_math_error+0 that gets displayed, but the
unwinder wouldn't get confused anymore.

This only works with binutils 2.6.17+ and some versions of H.J.Lu's 2.6.16
unfortunately because earlier binutils don't support .cfi_signal_frame

[AK: added automatic detection of the new binutils and wrote description]

Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Andi Kleen <ak@suse.de>
2006-09-26 10:52:41 +02:00
Arjan van de Ven
3162f751d0 [PATCH] Add the __stack_chk_fail() function
GCC emits a call to a __stack_chk_fail() function when the stack canary is
not matching the expected value.

Since this is a bad security issue; lets panic the kernel rather than limping
along; the kernel really can't be trusted anymore when this happens.

Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andi Kleen <ak@suse.de>
CC: Andi Kleen <ak@suse.de>
2006-09-26 10:52:39 +02:00
Arjan van de Ven
0a42540580 [PATCH] Add the canary field to the PDA area and the task struct
This patch adds the per thread cookie field to the task struct and the PDA.
Also it makes sure that the PDA value gets the new cookie value at context
switch, and that a new task gets a new cookie at task creation time.

Signed-off-by: Arjan van Ven <arjan@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andi Kleen <ak@suse.de>
CC: Andi Kleen <ak@suse.de>
2006-09-26 10:52:38 +02:00
Andi Kleen
3fa7c794fe [PATCH] Avoid recursion in lockdep when stack tracer takes locks
The new dwarf2 unwinder needs to take locks to do backtraces
inside modules. This patch makes sure lockdep which calls
stacktrace is not reentered.

Thanks to Ingo for suggesting this simpler approach.

Cc: mingo@elte.hu
Signed-off-by: Andi Kleen <ak@suse.de>
2006-09-26 10:52:34 +02:00
Andi Kleen
5a1b3999d6 [PATCH] x86: Some preparationary cleanup for stack trace
- Remove unused all_contexts parameter
No caller used it
- Move skip argument into the structure (needed for
followon patches)

Cc: mingo@elte.hu

Signed-off-by: Andi Kleen <ak@suse.de>
2006-09-26 10:52:34 +02:00
Andi Kleen
0cb91a2293 [PATCH] i386: Account spinlocks to the caller during profiling for !FP kernels
This ports the algorithm from x86-64 (with improvements) to i386.
Previously this only worked for frame pointer enabled kernels.
But spinlocks have a very simple stack frame that can be manually
analyzed. Do this.

Signed-off-by: Andi Kleen <ak@suse.de>
2006-09-26 10:52:28 +02:00
Andi Kleen
3cfc348bf9 [PATCH] x86: Add portable getcpu call
For NUMA optimization and some other algorithms it is useful to have a fast
to get the current CPU and node numbers in user space.

x86-64 added a fast way to do this in a vsyscall. This adds a generic
syscall for other architectures to make it a generic portable facility.

I expect some of them will also implement it as a faster vsyscall.

The cache is an optimization for the x86-64 vsyscall optimization. Since
what the syscall returns is an approximation anyways and user space
often wants very fast results it can be cached for some time.  The norma
methods to get this information in user space are relatively slow

The vsyscall is in a better position to manage the cache because it has direct
access to a fast time stamp (jiffies). For the generic syscall optimization
it doesn't help much, but enforce a valid argument to keep programs
portable

I only added an i386 syscall entry for now. Other architectures can follow
as needed.

AK: Also added some cleanups from Andrew Morton

Signed-off-by: Andi Kleen <ak@suse.de>
2006-09-26 10:52:28 +02:00
Don Zickus
8da5adda91 [PATCH] x86: Allow users to force a panic on NMI
To quote Alan Cox:

The default Linux behaviour on an NMI of either memory or unknown is to
continue operation. For many environments such as scientific computing
it is preferable that the box is taken out and the error dealt with than
an uncorrected parity/ECC error get propogated.

A small number of systems do generate NMI's for bizarre random reasons
such as power management so the default is unchanged. In other respects
the new proc/sys entry works like the existing panic controls already in
that directory.

This is separate to the edac support - EDAC allows supported chipsets to
handle ECC errors well, this change allows unsupported cases to at least
panic rather than cause problems further down the line.

Signed-off-by: Don Zickus <dzickus@redhat.com>
Signed-off-by: Andi Kleen <ak@suse.de>
2006-09-26 10:52:27 +02:00
Don Zickus
407984f1af [PATCH] x86: Add abilty to enable/disable nmi watchdog with sysctl
Adds a new /proc/sys/kernel/nmi call that will enable/disable the nmi
watchdog.

Signed-off-by:  Don Zickus <dzickus@redhat.com>
Signed-off-by: Andi Kleen <ak@suse.de>
2006-09-26 10:52:27 +02:00
Don Zickus
2fbe7b25c8 [PATCH] i386/x86-64: Remove un/set_nmi_callback and reserve/release_lapic_nmi functions
Removes the un/set_nmi_callback and reserve/release_lapic_nmi functions as
they are no longer needed.  The various subsystems are modified to register
with the die_notifier instead.

Also includes compile fixes by Andrew Morton.

Signed-off-by:  Don Zickus <dzickus@redhat.com>
Signed-off-by: Andi Kleen <ak@suse.de>
2006-09-26 10:52:27 +02:00
David Brownell
1d3a82af45 PM: no suspend_prepare() phase
Remove the new suspend_prepare() phase.  It doesn't seem very usable,
has never been tested, doesn't address fault cleanup, and would need
a sibling resume_complete(); plus there are no real use cases.  It
could be restored later if those issues get resolved.

Signed-off-by: David Brownell <dbrownell@users.sourceforge.net>
Cc: Linus Torvalds <torvalds@osdl.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2006-09-25 21:08:38 -07:00
David Brownell
2bca293e56 PM: add kconfig option for deprecated .../power/state files
Add a new PM_SYSFS_DEPRECATED config option to control whether or
not the /sys/devices/.../power/state files are provided.  This will
make it easier to get rid of that mechanism when the time comes,
and to verify that userspace tools work right without it.

Signed-off-by: David Brownell <dbrownell@users.sourceforge.net>
Acked-by: Pavel Machek <pavel@suse.cz>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2006-09-25 21:08:37 -07:00
David Brownell
f1cc0a894c PM: issue PM_EVENT_PRETHAW
This patch is the first of this series that should actually change any
behavior ...  by issuing the new event, now tha the rest of the kernel is
prepared to receive it.

This converts the PM core to issue the new PRETHAW message, which the rest of
the kernel is now ready to receive.

Signed-off-by: David Brownell <dbrownell@users.sourceforge.net>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2006-09-25 21:08:37 -07:00
Linus Torvalds
7c8265f510 Suspend infrastructure cleanup and extension
Allow devices to participate in the suspend process more intimately,
in particular, allow the final phase (with interrupts disabled) to
also be open to normal devices, not just system devices.

Also, allow classes to participate in device suspend.

Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2006-09-25 21:08:36 -07:00
Ed Swierk
1cc5f7142e [PATCH] load_module: no BUG if module_subsys uninitialized
Invoking load_module() before param_sysfs_init() is called crashes in
mod_sysfs_setup(), since the kset in module_subsys is not initialized yet.

In my case, net-pf-1 is getting modprobed as a result of hotplug trying to
create a UNIX socket.  Calls to hotplug begin after the topology_init
initcall.

Another patch for the same symptom (module_subsys-initialize-earlier.patch)
moves param_sysfs_init() to the subsys initcalls, but this is still not
early enough in the boot process in some cases.  In particular,
topology_init() causes /sbin/hotplug to run, which requests net-pf-1 (the
UNIX socket protocol) which can be compiled as a module.  Moving
param_sysfs_init() to the postcore initcalls fixes this particular race,
but there might well be other cases where a usermodehelper causes a module
to load earlier still.

The patch makes load_module() return an error rather than crashing the
kernel if invoked before module_subsys is initialized.

Cc: Mark Huang <mlhuang@cs.princeton.edu>
Cc: Greg KH <greg@kroah.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-25 17:38:36 -07:00
Thomas Graf
fe4944e59c [NETLINK]: Extend netlink messaging interface
Adds:
 nlmsg_get_pos()                 return current position in message
 nlmsg_trim()                    trim part of message
 nla_reserve_nohdr(skb, len)     reserve room for an attribute w/o hdr
 nla_put_nohdr(skb, len, data)   add attribute w/o hdr
 nla_find_nested()               find attribute in nested attributes

Fixes nlmsg_new() to take allocation flags and consider size.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-09-22 14:53:43 -07:00
Russell King
b36e4758dc [ARM] Fix kernel/fork.c for lockdep on ARM
ARM has interrupts enabled over context switches (iow, has
__ARCH_WANT_INTERRUPTS_ON_CTXSW defined.)  The lockdep code in fork.c
 assumes that interrupts are always disabled.  Fix this wrong
assumption by making the initialisation of 'p->hardirqs_enabled'
depend on __ARCH_WANT_INTERRUPTS_ON_CTXSW.

Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
2006-09-20 14:58:35 +01:00
Ingo Molnar
86998aa653 [PATCH] genirq core: fix handle_level_irq()
while porting the -rt tree to 2.6.18-rc7 i noticed the following
screaming-IRQ scenario on an SMP system:

 2274  0Dn.:1 0.001ms: do_IRQ+0xc/0x103  <= (ret_from_intr+0x0/0xf)
 2274  0Dn.:1 0.010ms: do_IRQ+0xc/0x103  <= (ret_from_intr+0x0/0xf)
 2274  0Dn.:1 0.020ms: do_IRQ+0xc/0x103  <= (ret_from_intr+0x0/0xf)
 2274  0Dn.:1 0.029ms: do_IRQ+0xc/0x103  <= (ret_from_intr+0x0/0xf)
 2274  0Dn.:1 0.039ms: do_IRQ+0xc/0x103  <= (ret_from_intr+0x0/0xf)
 2274  0Dn.:1 0.048ms: do_IRQ+0xc/0x103  <= (ret_from_intr+0x0/0xf)
 2274  0Dn.:1 0.058ms: do_IRQ+0xc/0x103  <= (ret_from_intr+0x0/0xf)
 2274  0Dn.:1 0.068ms: do_IRQ+0xc/0x103  <= (ret_from_intr+0x0/0xf)
 2274  0Dn.:1 0.077ms: do_IRQ+0xc/0x103  <= (ret_from_intr+0x0/0xf)
 2274  0Dn.:1 0.087ms: do_IRQ+0xc/0x103  <= (ret_from_intr+0x0/0xf)
 2274  0Dn.:1 0.097ms: do_IRQ+0xc/0x103  <= (ret_from_intr+0x0/0xf)

as it turns out, the bug is caused by handle_level_irq(), which if it
races with another CPU already handling this IRQ, it _unmasks_ the IRQ
line on the way out. This is not how 2.6.17 works, and we introduced
this bug in one of the early genirq cleanups right before it went into
-mm. (the bug was not in the genirq patchset for a long time, and we
didnt notice the bug due to the lack of -rt rebase to the new genirq
code. -rt, and hardirq-preemption in particular opens up such races much
wider than anything else.)

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-19 07:57:20 -07:00
Kenneth Lee
e4b69aa2a1 [PATCH] bug fix in kernel/kmod.c
I think there is a bug in kmod.c: In __call_usermodehelper(), when
kernel_thread(wait_for_helper, ...) return success, since wait_for_helper()
might call complete() at any time, the sub_info should not be used any
more.

Normally wait_for_helper() take a long time to finish, you may not get
problem for most of the case.  But if you remove /sbin/modprobe, it may
become easier for you to get a oop in khelper.

Cc: Matt Helsley <matthltc@us.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-16 12:54:32 -07:00
Imre Deak
e1ed7ac77b [PATCH] genirq: fix typo in IRQ resend
Fix a bug where the IRQ_PENDING flag is never cleared and the ISR is called
endlessly without an actual interrupt.

Signed-off-by: Imre Deak <imre.deak@solidboot.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-16 12:54:30 -07:00
Oleg Nesterov
dd9daa221e [PATCH] rcu_do_batch: make ->qlen decrement irq safe
rcu_do_batch() decrements rdp->qlen with irqs enabled.  This is not good,
it can also be modified by call_rcu() from interrupt.

Decrement ->qlen once with irqs disabled, after a main loop.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Dipankar Sarma <dipankar@in.ibm.com>
Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-13 07:32:14 -07:00
Ingo Molnar
9bb25bf36f [PATCH] lockdep: double the number of stack-trace entries
Miles Lane reported the "BUG: MAX_STACK_TRACE_ENTRIES too low!" message,
which means that during normal use his system produced enough lockdep
events so that the 128-thousand entries stack-trace array got exhausted.
Double the size of the array.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: Miles Lane <miles.lane@gmail.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-13 07:32:14 -07:00
Al Viro
55669bfa14 [PATCH] audit: AUDIT_PERM support
add support for AUDIT_PERM predicate

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-09-11 13:32:30 -04:00
Amy Griffis
5974501e2d [PATCH] update audit rule change messages
Make the audit message for implicit rule removal more informative.
Make the rule update message consistent with other messages.

Signed-off-by: Amy Griffis <amy.griffis@hp.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-09-11 13:32:17 -04:00
Amy Griffis
8ef2d3040e [PATCH] sanity check audit_buffer
Add sanity checks for NULL audit_buffer consistent with other
audit_log* routines.

Signed-off-by: Amy Griffis <amy.griffis@hp.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-09-11 13:32:17 -04:00
Steve Grubb
3b33ac3182 [PATCH] fix ppid bug in 2.6.18 kernel
Hello,

During some troubleshooting, I found that ppid was accidentally omitted from
the legacy rule section. This resulted in EINVAL for any rule with ppid sent
with AUDIT_ADD.

Signed-off-by: Steve Grubb <sgrubb@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-09-11 13:32:04 -04:00
Thomas Gleixner
c5780e976e [PATCH] Use the correct restart option for futex_lock_pi
The current implementation of futex_lock_pi returns -ERESTART_RESTARTBLOCK
in case that the lock operation has been interrupted by a signal.  This
results in a return of -EINTR to userspace in case there is an handler for
the signal.  This is wrong, because userspace expects that the lock
function does not return in any case of signal delivery.

This was not caught by my insufficient test case, but triggered a nasty
userspace problem in an high load application scenario.  Unfortunately also
glibc does not check for this invalid return value.

Using -ERSTARTNOINTR makes sure, that the interrupted syscall is restarted.
 The restart block related code can be safely removed, as the possible
timeout argument is an absolute time value.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-08 10:22:50 -07:00
Ingo Molnar
068c4579fe [PATCH] lockdep: do not touch console state when tainting the kernel
Remove an unintended console_verbose() side-effect from add_taint().

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-06 11:00:02 -07:00
Pavel Machek
471b40d0df [PATCH] prevent swsusp with PAE
PAE + swsusp results in hard-to-debug crash about 50% of time during
resume.  Cause is known, fix needs to be ported from x86-64 (but we can't
make it to 2.6.18, and I'd like this to be worked around in 2.6.18).

Signed-off-by: Pavel Machek <pavel@suse.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-06 11:00:02 -07:00
Jarek Poplawski
fc47e7b592 [PATCH] lockdep ifdef fix
With

	CONFIG_SMP=y
	CONFIG_PREEMPT=y
	CONFIG_LOCKDEP=y
	CONFIG_DEBUG_LOCK_ALLOC=y
	# CONFIG_PROVE_LOCKING is not set

spin_unlock_irqrestore() goes through lockdep but spin_lock_irqsave() doesn't.
Apparently, bad things happen.

Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-06 11:00:01 -07:00
Oleg Nesterov
3b6362b833 [PATCH] eligible_child: remove an obsolete ->tgid check
It is not possible to find a sub-thread in ->children/->ptrace_children
lists, ptrace_attach() does not allow to attach to sub-threads.

Even if it was possible to ptrace the task from the same thread group,
we can't allow to release ->group_leader while there are others (ptracer)
threads in the same group.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-02 14:51:27 -07:00
Henrik Kretzschmar
43a1dd502f [PATCH] kerneldoc for handle_bad_irq()
Adds the description of the parameters from handle_bad_irq().

Signed-off-by: Henrik Kretzschmar <henne@nachtwindheim.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-01 11:39:09 -07:00
Shailabh Nagar
35df17c57c [PATCH] task delay accounting fixes
Cleanup allocation and freeing of tsk->delays used by delay accounting.
This solves two problems reported for delay accounting:

1. oops in __delayacct_blkio_ticks
http://www.uwsg.indiana.edu/hypermail/linux/kernel/0608.2/1844.html

Currently tsk->delays is getting freed too early in task exit which can
cause a NULL tsk->delays to get accessed via reading of /proc/<tgid>/stats.
 The patch fixes this problem by freeing tsk->delays closer to when
task_struct itself is freed up.  As a result, it also eliminates the use of
tsk->delays_lock which was only being used (inadequately) to safeguard
access to tsk->delays while a task was exiting.

2. Possible memory leak in kernel/delayacct.c
http://www.uwsg.indiana.edu/hypermail/linux/kernel/0608.2/1389.html

The patch cleans up tsk->delays allocations after a bad fork which was
missing earlier.

The patch has been tested to fix the problems listed above and stress
tested with rapid calls to delay accounting's taskstats command interface
(which is the other path that can access the same data, besides the /proc
interface causing the oops above).

Signed-off-by: Shailabh Nagar <nagar@watson.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-01 11:39:08 -07:00
Nick Piggin
0d673a5a47 [PATCH] cpuset: oom panic fix
cpuset_excl_nodes_overlap always returns 0 if current is exiting.  This caused
customer's systems to panic in the OOM killer when processes were having
trouble getting memory for the final put_user in mm_release.  Even though
there were lots of processes to kill.

Change to returning 1 in this case.  This achieves parity with !CONFIG_CPUSETS
case, and was observed to fix the problem.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Acked-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-08-27 11:01:32 -07:00
Paul Jackson
4c4d50f7b3 [PATCH] cpuset: top_cpuset tracks hotplug changes to cpu_online_map
Change the list of cpus allowed to tasks in the top (root) cpuset to
dynamically track what cpus are online, using a CPU hotplug notifier.  Make
this top cpus file read-only.

On systems that have cpusets configured in their kernel, but that aren't
actively using cpusets (for some distros, this covers the majority of
systems) all tasks end up in the top cpuset.

If that system does support CPU hotplug, then these tasks cannot make use
of CPUs that are added after system boot, because the CPUs are not allowed
in the top cpuset.  This is a surprising regression over earlier kernels
that didn't have cpusets enabled.

In order to keep the behaviour of cpusets consistent between systems
actively making use of them and systems not using them, this patch changes
the behaviour of the 'cpus' file in the top (root) cpuset, making it read
only, and making it automatically track the value of cpu_online_map.  Thus
tasks in the top cpuset will have automatic use of hot plugged CPUs allowed
by their cpuset.

Thanks to Anton Blanchard and Nathan Lynch for reporting this problem,
driving the fix, and earlier versions of this patch.

Signed-off-by: Paul Jackson <pj@sgi.com>
Cc: Nathan Lynch <ntl@pobox.com>
Cc: Anton Blanchard <anton@samba.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-08-27 11:01:32 -07:00
Yingchao Zhou
4edb9a143e [PATCH] Remove redundant up() in stop_machine()
An up() is called in kernel/stop_machine.c on failure, and also in the
caller (unconditionally).

Signed-off-by: Zhou Yingchao <yingchao.zhou@gmail.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-08-27 11:01:31 -07:00
Oleg Nesterov
d015baebba [PATCH] futex_find_get_task(): remove an obscure EXIT_ZOMBIE check
futex_find_get_task:

	if (p->state == EXIT_ZOMBIE || p->exit_state == EXIT_ZOMBIE)
		return NULL;

I can't understand this.  First, p->state can't be EXIT_ZOMBIE.  The
->exit_state check looks strange too.  Sub-threads or tasks whose ->parent
ignores SIGCHLD go directly to EXIT_DEAD state (I am ignoring a ptrace
case).  Why EXIT_DEAD tasks should be ok?  Yes, EXIT_ZOMBIE is more
important (a task may stay zombie for a long time), but this doesn't mean
we should explicitely ignore other EXIT_XXX states.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-08-27 11:01:30 -07:00
Oleg Nesterov
f8986c241d [PATCH] revert "Drop tasklist lock in do_sched_setscheduler"
sched_setscheduler() looks at ->signal->rlim[].  It is unsafe do
dereference ->signal unless tasklist_lock or ->siglock is held (or p ==
current).  We pin the task structure, but this can't prevent from
release_task()->__exit_signal() which sets ->signal = NULL.

Restore tasklist_lock across the setscheduler call.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-08-27 11:01:29 -07:00
Andrew Morton
9b41ea7289 [PATCH] workqueue: remove lock_cpu_hotplug()
Use a private lock instead.  It protects all per-cpu data structures in
workqueue.c, including the workqueues list.

Fix a bug in schedule_on_each_cpu(): it was forgetting to lock down the
per-cpu resources.

Unfixed long-standing bug: if someone unplugs the CPU identified by
`singlethread_cpu' the kernel will get very sick.

Cc: Dave Jones <davej@codemonkey.org.uk>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2006-08-14 12:54:29 -07:00
john stultz
e579dcbf23 [PATCH] futex_handle_fault always fails
We found this issue last week w/ the -RT kernel, but it seems the same
issue is in mainline as well.

Basically it is possible for futex_unlock_pi to return without actually
freeing the lock.  This is due to buggy logic in the use of
futex_handle_fault() and its attempt argument in a failure case.

Looking at futex.c the logic is as follows:

1) In futex_unlock_pi() we start w/ ret=0 and we go down to the first
   futex_atomic_cmpxchg_inatomic(), where we find uval==-EFAULT.  We then
   jump to the pi_faulted label.

2) From pi_faulted: We increment attempt, unlock the sem and hit the
   retry label.

3) From the retry label, with ret still zero, we again hit EFAULT on the
   first futex_atomic_cmpxchg_inatomic(), and again goto the pi_faulted
   label.

4) Again from pi_faulted: we increment attempt and enter the
   conditional, where we call futex_handle_fault.

5) futex_handle_fault fails, and we goto the out_unlock_release_sem
   label.

6) From out_unlock_release_sem we return, and since ret is still zero,
   we return without error, while never actually unlocking the lock.

Issue #1: at the first futex_atomic_cmpxchg_inatomic() we should probably
be setting ret=-EFAULT before jumping to pi_faulted: However in our case
this doesn't really affect anything, as the glibc we're using ignores the
error value from futex_unlock_pi().

Issue #2: Look at futex_handle_fault(), its first conditional will return
-EFAULT if attempt is >= 2.  However, from the "if(attempt++)
futex_handle_fault(attempt)" logic above, we'll *never* call
futex_handle_fault when attempt is less then two.  So we never get a chance
to even try to fault the page in.

The following patch addresses these two issues by 1) Always setting ret to
-EFAULT if futex_handle_fault fails, and 2) Removing the = in
futex_handle_fault's (attempt >= 2) check.

I'm really not sure this is the right fix, but wanted to bring it up so
folks knew the issue is alive and well in the current -git tree.  From
looking at the git logs the logic was first introduced (then later copied
to other places) in the following commit almost a year ago:

http://www.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=4732efbeb997189d9f9b04708dc26bf8613ed721;hp=5b039e681b8c5f30aac9cc04385cc94be45d0823

Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Ingo Molnar <mingo@elte.hu>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2006-08-14 12:54:29 -07:00
Kirill Korotaev
6997a6faaa [PATCH] sys_getppid oopses on debug kernel
sys_getppid() optimization can access a freed memory.  On kernels with
DEBUG_SLAB turned ON, this results in Oops.  As Dave Hansen noted, this
optimization is also unsafe for memory hotplug.

So this patch always takes the lock to be safe.

[oleg@tv-sign.ru: simplifications]
Signed-off-by: Kirill Korotaev <dev@openvz.org>
Cc: <stable@kernel.org>
Cc: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2006-08-14 12:54:29 -07:00
Andrew Morton
657b3010d8 [PATCH] panic.c build fix
kernel/panic.c: In function 'add_taint':
kernel/panic.c:176: warning: implicit declaration of function 'debug_locks_off'

Cc: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2006-08-14 12:54:28 -07:00
Jan Blunck
3773dc9205 [PATCH] fix hrtimer percpu usage typo
The percpu variable is used incorrectly in switch_hrtimer_base().

Signed-off-by: Jan Blunck <jblunck@suse.de>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2006-08-14 12:54:28 -07:00
Thomas Gleixner
ce2c6b5384 [PATCH] futex: Apply recent futex fixes to futex_compat
The recent fixups in futex.c need to be applied to futex_compat.c too.  Fixes
a hang reported by Olaf.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Olaf Hering <olh@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-08-06 08:57:49 -07:00
KAMEZAWA Hiroyuki
58c1b5b079 [PATCH] memory hotadd fixes: find_next_system_ram catch range fix
find_next_system_ram() is used to find available memory resource at onlining
newly added memory.  This patch fixes following problem.

find_next_system_ram() cannot catch this case.

Resource:      (start)-------------(end)
Section :                (start)-------------(end)

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Keith Mannthey <kmannth@gmail.com>
Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-08-06 08:57:48 -07:00
KAMEZAWA Hiroyuki
0f04ab5efb [PATCH] memory hotadd fixes: change find_next_system_ram's return value manner
find_next_system_ram() returns valid memory range which meets requested area,
only used by memory-hot-add.

This function always rewrite requested resource even if returned area is not
fully fit in requested one.  And sometimes the returnd resource is larger than
requested area.  This annoyes the caller.  This patch changes the returned
value to fit in requested area.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Keith Mannthey <kmannth@gmail.com>
Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-08-06 08:57:48 -07:00
Antonino A. Daplas
78944e549d [PATCH] vt: printk: Fix framebuffer console triggering might_sleep assertion
Reported by: Dave Jones

Whilst printk'ing to both console and serial console, I got this...
(2.6.18rc1)

BUG: sleeping function called from invalid context at kernel/sched.c:4438
in_atomic():0, irqs_disabled():1

Call Trace:
 [<ffffffff80271db8>] show_trace+0xaa/0x23d
 [<ffffffff80271f60>] dump_stack+0x15/0x17
 [<ffffffff8020b9f8>] __might_sleep+0xb2/0xb4
 [<ffffffff8029232e>] __cond_resched+0x15/0x55
 [<ffffffff80267eb8>] cond_resched+0x3b/0x42
 [<ffffffff80268c64>] console_conditional_schedule+0x12/0x14
 [<ffffffff80368159>] fbcon_redraw+0xf6/0x160
 [<ffffffff80369c58>] fbcon_scroll+0x5d9/0xb52
 [<ffffffff803a43c4>] scrup+0x6b/0xd6
 [<ffffffff803a4453>] lf+0x24/0x44
 [<ffffffff803a7ff8>] vt_console_print+0x166/0x23d
 [<ffffffff80295528>] __call_console_drivers+0x65/0x76
 [<ffffffff80295597>] _call_console_drivers+0x5e/0x62
 [<ffffffff80217e3f>] release_console_sem+0x14b/0x232
 [<ffffffff8036acd6>] fb_flashcursor+0x279/0x2a6
 [<ffffffff80251e3f>] run_workqueue+0xa8/0xfb
 [<ffffffff8024e5e0>] worker_thread+0xef/0x122
 [<ffffffff8023660f>] kthread+0x100/0x136
 [<ffffffff8026419e>] child_rip+0x8/0x12

This can occur when release_console_sem() is called but the log
buffer still has contents that need to be flushed. The console drivers
are called while the console_may_schedule flag is still true. The
might_sleep() is triggered when fbcon calls console_conditional_schedule().

Fix by setting console_may_schedule to zero earlier, before the call to the
console drivers.

Signed-off-by: Antonino Daplas <adaplas@pol.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-08-06 08:57:47 -07:00
Chuck Ebbert
9f59ce5d0e [PATCH] ptrace: make pid of child process available for PTRACE_EVENT_VFORK_DONE
When delivering PTRACE_EVENT_VFORK_DONE, provide pid of the child process
when tracer calls ptrace(PTRACE_GETEVENTMSG).  This is already
(accidentally) available when the tracer is tracing VFORK in addition to
VFORK_DONE.

Signed-off-by: Chuck Ebbert <76306.1226@compuserve.com>
Cc: Daniel Jacobowitz <dan@debian.org>
Cc: Albert Cahalan <acahalan@gmail.com>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-08-06 08:57:46 -07:00
Christian Borntraeger
e91467ecd1 [PATCH] bug in futex unqueue_me
This patch adds a barrier() in futex unqueue_me to avoid aliasing of two
pointers.

On my s390x system I saw the following oops:

Unable to handle kernel pointer dereference at virtual kernel address
0000000000000000
Oops: 0004 [#1]
CPU:    0    Not tainted
Process mytool (pid: 13613, task: 000000003ecb6ac0, ksp: 00000000366bdbd8)
Krnl PSW : 0704d00180000000 00000000003c9ac2 (_spin_lock+0xe/0x30)
Krnl GPRS: 00000000ffffffff 000000003ecb6ac0 0000000000000000 0700000000000000
           0000000000000000 0000000000000000 000001fe00002028 00000000000c091f
           000001fe00002054 000001fe00002054 0000000000000000 00000000366bddc0
           00000000005ef8c0 00000000003d00e8 0000000000144f91 00000000366bdcb8
Krnl Code: ba 4e 20 00 12 44 b9 16 00 3e a7 84 00 08 e3 e0 f0 88 00 04
Call Trace:
([<0000000000144f90>] unqueue_me+0x40/0xe4)
 [<0000000000145a0c>] do_futex+0x33c/0xc40
 [<000000000014643e>] sys_futex+0x12e/0x144
 [<000000000010bb00>] sysc_noemu+0x10/0x16
 [<000002000003741c>] 0x2000003741c

The code in question is:

static int unqueue_me(struct futex_q *q)
{
        int ret = 0;
        spinlock_t *lock_ptr;

        /* In the common case we don't take the spinlock, which is nice. */
 retry:
        lock_ptr = q->lock_ptr;
        if (lock_ptr != 0) {
                spin_lock(lock_ptr);
		/*
                 * q->lock_ptr can change between reading it and
                 * spin_lock(), causing us to take the wrong lock.  This
                 * corrects the race condition.
[...]

and my compiler (gcc 4.1.0) makes the following out of it:

00000000000003c8 <unqueue_me>:
     3c8:       eb bf f0 70 00 24       stmg    %r11,%r15,112(%r15)
     3ce:       c0 d0 00 00 00 00       larl    %r13,3ce <unqueue_me+0x6>
                        3d0: R_390_PC32DBL      .rodata+0x2a
     3d4:       a7 f1 1e 00             tml     %r15,7680
     3d8:       a7 84 00 01             je      3da <unqueue_me+0x12>
     3dc:       b9 04 00 ef             lgr     %r14,%r15
     3e0:       a7 fb ff d0             aghi    %r15,-48
     3e4:       b9 04 00 b2             lgr     %r11,%r2
     3e8:       e3 e0 f0 98 00 24       stg     %r14,152(%r15)
     3ee:       e3 c0 b0 28 00 04       lg      %r12,40(%r11)
		/* write q->lock_ptr in r12 */
     3f4:       b9 02 00 cc             ltgr    %r12,%r12
     3f8:       a7 84 00 4b             je      48e <unqueue_me+0xc6>
		/* if r12 is zero then jump over the code.... */
     3fc:       e3 20 b0 28 00 04       lg      %r2,40(%r11)
		/* write q->lock_ptr in r2 */
     402:       c0 e5 00 00 00 00       brasl   %r14,402 <unqueue_me+0x3a>
                        404: R_390_PC32DBL      _spin_lock+0x2
		/* use r2 as parameter for spin_lock */

So the code becomes more or less:
if (q->lock_ptr != 0) spin_lock(q->lock_ptr)
instead of
if (lock_ptr != 0) spin_lock(lock_ptr)

Which caused the oops from above.
After adding a barrier gcc creates code without this problem:
[...] (the same)
     3ee:       e3 c0 b0 28 00 04       lg      %r12,40(%r11)
     3f4:       b9 02 00 cc             ltgr    %r12,%r12
     3f8:       b9 04 00 2c             lgr     %r2,%r12
     3fc:       a7 84 00 48             je      48c <unqueue_me+0xc4>
     400:       c0 e5 00 00 00 00       brasl   %r14,400 <unqueue_me+0x38>
                        402: R_390_PC32DBL      _spin_lock+0x2

As a general note, this code of unqueue_me seems a bit fishy. The retry logic
of unqueue_me only works if we can guarantee, that the original value of
q->lock_ptr is always a spinlock (Otherwise we overwrite kernel memory). We
know that q->lock_ptr can change. I dont know what happens with the original
spinlock, as I am not an expert with the futex code.

Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Acked-by: Ingo Molnar <mingo@redhat.com>
Cc: Thomas Gleixner <tglx@timesys.com>
Signed-off-by: Christian Borntraeger <borntrae@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-08-06 08:57:46 -07:00
Rafael J. Wysocki
a7ef7878ea [PATCH] Make suspend possible with a traced process at a breakpoint
It should be possible to suspend, either to RAM or to disk, if there's a
traced process that has just reached a breakpoint.  However, this is a
special case, because its parent process might have been frozen already and
then we are unable to deliver the "freeze" signal to the traced process.
If this happens, it's better to cancel the freezing of the traced process.

Ref. http://bugzilla.kernel.org/show_bug.cgi?id=6787

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-08-06 08:57:45 -07:00
Al Viro
3f2792ffbd [PATCH] take filling ->pid, etc. out of audit_get_context()
move that stuff downstream and into the only branch where it'll be
used.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-08-03 10:59:51 -04:00
Al Viro
5ac3a9c26c [PATCH] don't bother with aux entires for dummy context
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-08-03 10:59:42 -04:00
Al Viro
d51374adf5 [PATCH] mark context of syscall entered with no rules as dummy
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-08-03 10:59:26 -04:00
Al Viro
471a5c7c83 [PATCH] introduce audit rules counter
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-08-03 10:55:18 -04:00
Amy Griffis
5422e01ac1 [PATCH] fix audit oops with invalid operator
Michael C Thompson wrote:  [Tue Aug 01 2006, 02:36:36PM EDT]
> The trigger for this oops is:
> # auditctl -a exit,always -S pread64 -F 'inode<1'

Setting the err value will fix it.

Signed-off-by: Amy Griffis <amy.griffis@hp.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-08-03 10:54:43 -04:00
Amy Griffis
6988434ee5 [PATCH] fix oops with CONFIG_AUDIT and !CONFIG_AUDITSYSCALL
Always initialize the audit_inode_hash[] so we don't oops on list rules.

Signed-off-by: Amy Griffis <amy.griffis@hp.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-08-03 10:50:39 -04:00
Amy Griffis
73d3ec5aba [PATCH] fix missed create event for directory audit
When an object is created via a symlink into an audited directory, audit misses
the event due to not having collected the inode data for the directory.  Modify
__audit_inode_child() to copy the parent inode data if a parent wasn't found in
audit_names[].

Signed-off-by: Amy Griffis <amy.griffis@hp.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-08-03 10:50:30 -04:00
Amy Griffis
3e2efce067 [PATCH] fix faulty inode data collection for open() with O_CREAT
When the specified path is an existing file or when it is a symlink, audit
collects the wrong inode number, which causes it to miss the open() event.
Adding a second hook to the open() path fixes this.

Also add audit_copy_inode() to consolidate some code.

Signed-off-by: Amy Griffis <amy.griffis@hp.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-08-03 10:50:21 -04:00
Linus Torvalds
ae74c3b69a Fix force_sig_info() semantics after cleanups
Suresh points out that commit b0423a0d9c
broke the semantics of a synchronous signal like SIGSEGV occurring
recursively inside its own handler handler (or, indeed, any other
context when the signal was blocked).

That was unintentional, and this fixes things up by reinstating the old
semantics, but without reverting the cleanups.

Cc: Paul E. McKenney <paulmck@us.ibm.com>
Acked-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-08-02 20:17:49 -07:00
Josh Triplett
51d8c5edd3 [PATCH] timer: Fix tvec_bases initializer
kernel/timer.c defines a (per-cpu) pointer to tvec_base_t, but initializes
it using { &a_tvec_base_t }, which sparse warns about; change this to just
&a_tvec_base_t.

Signed-off-by: Josh Triplett <josh@freedesktop.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-31 13:28:44 -07:00
Steven Rostedt
d07fe82c24 [PATCH] reference rt-mutex-design in rtmutex.c
In order to prevent Doc Rot, this patch adds a reference to the design
document for rtmutex.c in rtmutex.c.  So when someone needs to update or
change the design of that file they will know that a document actually
exists that explains the design (helping them change it), and hopefully
that they will update the document if they too change the design.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-31 13:28:43 -07:00
Tim Chen
3c829c367a [PATCH] Reducing local_bh_enable/disable overhead in irqtrace
The recent changes from irqtrace feature has added overheads to
local_bh_disable and local_bh_enable that reduces UDP performance across
x86_64 and IA64, even though IA64 does not support the irqtrace feature.
Patch in question is

[PATCH]lockdep: irqtrace subsystem, core
http://www.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=c
ommit;h=de30a2b355ea85350ca2f58f3b9bf4e5bc007986

Prior to this patch, local_bh_disable was a short macro.  Now it is a
function which calls __local_bh_disable with added irq flags save and
restore.  The irq flags save and restore were also added to
local_bh_enable, probably for injecting the trace irqs code.

This overhead is on the generic code path across all architectures.  On a
IA_64 test machine (Itanium-2 1.6 GHz) running a benchmark like netperf's
UDP streaming test, the added overhead results in a drop of 3% in
throughput, as udp_sendmsg calls the local_bh_enable/disable several times.

Other workloads that have heavy usages of local_bh_enable/disable could
also be affected.  The patch ideally should not have affected IA-64
performance as it does not have IRQ tracing support.  A significant portion
of the overhead is in the added irq flags save and restore, which I think
is not needed if IRQ tracing is unused.  A suggested patch is attached
below that recovers the lost performance.  However, the "ifdef"s in the
patch are a bit ugly.

Signed-off-by: Tim Chen <tim.c.chen@intel.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-31 13:28:42 -07:00
Heiko Carstens
b50f60ceee [PATCH] pi-futex: missing pi_waiters plist initialization
Initialize init task's pi_waiters plist.  Otherwise cpu hotplug of cpu 0
might crash, since rt_mutex_getprio() accesses an uninitialized list head.

call chain which led to crash:

take_cpu_down
sched_idle_next
__setscheduler
rt_mutex_getprio

Using PLIST_HEAD_INIT in the INIT_TASK macro doesn't work unfortunately,
since the pi_waiters member is only conditionally present.

Cc: Arjan van de Ven <arjan@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-31 13:28:41 -07:00
Rolf Eike Beer
0fcb78c22f [PATCH] Add DocBook documentation for workqueue functions
kernel/workqueue.c was omitted from generating kernel documentation.  This
adds a new section "Workqueues and Kevents" and adds documentation for some
of the functions.

Some functions in this file already had DocBook-style comments, now they
finally become visible.

Signed-off-by: Rolf Eike Beer <eike-kernel@sf-tec.de>
Cc: "Randy.Dunlap" <rdunlap@xenotime.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-31 13:28:40 -07:00
Jim Houston
2d7d253548 [PATCH] fix cond_resched() fix
In cond_resched_lock() it calls __resched_legal() before dropping the spin
lock.  __resched_legal() will always finds the preempt_count non-zero and
will prevent the call to __cond_resched().

The attached patch adds a parameter to __resched_legal() with the expected
preempt_count value.

Cc: Ingo Molnar <mingo@elte.hu>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-31 13:28:40 -07:00
Steven Rostedt
6ea24f9ad1 [PATCH] fix bad macro param in timer.c
We have

#define INDEX(N) (base->timer_jiffies >> (TVR_BITS + N * TVN_BITS)) & TVN_MASK

and it's used via

	list = varray[i + 1]->vec + (INDEX(i + 1));

So, due to underparenthesisation, this INDEX(i+1) is now a ...  (TVR_BITS + i
+ 1 * TVN_BITS)) ...

So this bugfix changes behaviour.  It worked before by sheer luck:

  "If i was anything but 0, it was broken.  But this was only used by
   s390 and arm.  Since it was for the next interrupt, could that next
   interrupt be a problem (going into the second cascade)? But it was
   probably seldom wrong.  That is, this would fail if the next
   interrupt was in the second cascade, and was wrapped.  Which may
   never of happened.  Also if it did happen, it would have just missed
   the interrupt.

   If an interrupt was missed, and no one was there to miss it, was it
   really missed :-)"

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-31 13:28:40 -07:00
Chandra Seetharaman
8c78f3075d [PATCH] cpu hotplug: replace __devinit* with __cpuinit* for cpu notifications
Few of the callback functions and notifier blocks that are associated with cpu
notifications incorrectly have __devinit and __devinitdata.  They should be
__cpuinit and __cpuinitdata instead.

It makes no functional difference but wastes text area when CONFIG_HOTPLUG is
enabled and CONFIG_HOTPLUG_CPU is not.

This patch fixes all those instances.

Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Cc: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-31 13:28:39 -07:00
bibo, mao
a9ad965ea9 [PATCH] IA64: kprobe invalidate icache of jump buffer
Kprobe inserts breakpoint instruction in probepoint and then jumps to
instruction slot when breakpoint is hit, the instruction slot icache must
be consistent with dcache.  Here is the patch which invalidates instruction
slot icache area.

Without this patch, in some machines there will be fault when executing
instruction slot where icache content is inconsistent with dcache.

Signed-off-by: bibo,mao <bibo.mao@intel.com>
Acked-by: "Luck, Tony" <tony.luck@intel.com>
Acked-by: Keshavamurthy Anil S <anil.s.keshavamurthy@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-31 13:28:38 -07:00
Shailabh Nagar
163ecdff06 [PATCH] delay accounting: temporarily enable by default
Enable delay accounting by default so that feature gets coverage testing
without requiring special measures.

Earlier, it was off by default and had to be enabled via a boot time param.
 This patch reverses the default behaviour to improve coverage testing.  It
can be removed late in the kernel development cycle if its believed users
shouldn't have to incur any cost if they don't want delay accounting.  Or
it can be retained forever if the utility of the stats is deemed common
enough to warrant keeping the feature on.

Signed-off-by: Shailabh Nagar <nagar@watson.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-31 13:28:37 -07:00
Shailabh Nagar
d94a041519 [PATCH] taskstats: free skb, avoid returns in send_cpu_listeners
Add a missing freeing of skb in the case there are no listeners at all.
Also remove the returning of error values by the function as it is unused
by the sole caller.

Signed-off-by: Shailabh Nagar <nagar@watson.ibm.com>
Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-31 13:28:37 -07:00
Shailabh Nagar
7d94dddd43 [PATCH] make taskstats sending completely independent of delay accounting on/off status
Complete the separation of delay accounting and taskstats by ignoring the
return value of delay accounting functions that fill in parts of taskstats
before it is sent out (either in response to a command or as part of a task
exit).

Also make delayacct_add_tsk return silently when delay accounting is turned
off rather than treat it as an error.

Signed-off-by: Shailabh Nagar <nagar@watson.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-31 13:28:37 -07:00
David Brownell
15a647eba9 [PATCH] genirq: {en,dis}able_irq_wake() need refcounting too
IRQs need refcounting and a state flag to track whether the the IRQ should
be enabled or disabled as a "normal IRQ" source after a series of calls to
{en,dis}able_irq().  For shared IRQs, the IRQ must be enabled so long as at
least one driver needs it active.

Likewise, IRQs need the same support to track whether the IRQ should be
enabled or disabled as a "wakeup event" source after a series of calls to
{en,dis}able_irq_wake().  For shared IRQs, the IRQ must be enabled as a
wakeup source during sleep so long as at least one driver needs it.  But
right now they _don't have_ that refcounting ...  which means sharing a
wakeup-capable IRQ can't work correctly in some configurations.

This patch adds the refcount and flag mechanisms to set_irq_wake() -- which
is what {en,dis}able_irq_wake() call -- and minimal documentation of what
the irq wake mechanism does.

Drivers relying on the older (broken) "toggle" semantics will trigger a
warning; that'll be a handful of drivers on ARM systems.

Signed-off-by: David Brownell <dbrownell@users.sourceforge.net>
Acked-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Russell King <rmk@arm.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-31 13:28:36 -07:00
Siddha, Suresh B
f712c0c7e1 [PATCH] sched: build_sched_domains() fix
Use the correct groups while initializing sched groups power for
allnodes_domain.  This fixes the crash observed while creating exclusive
cpusets.

Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Reported-and-tested-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-31 13:28:36 -07:00
Ingo Molnar
e3f2ddeac7 [PATCH] pi-futex: robust-futex exit
Fix robust PI-futexes to be properly unlocked on unexpected exit.

For this to work the kernel has to know whether a futex is a PI or a
non-PI one, because the semantics are different.  Since the space in
relevant glibc data structures is extremely scarce, the best solution is
to encode the 'PI' information in bit 0 of the robust list pointer.
Existing (non-PI) glibc robust futexes have this bit always zero, so the
ABI is kept.  New glibc with PI-robust-futexes will set this bit.

Further fixes from Thomas Gleixner <tglx@linutronix.de>

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Ulrich Drepper <drepper@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-28 21:02:00 -07:00
Ingo Molnar
627371d73c [PATCH] pi-futex: robust-futex exit crash fix
Fix pi_state->list handling bugs: list handling mishap, locking error.
Plus add more debug checks and fix a few style issues i noticed while
debugging this.

(reported by Ulrich Drepper and Jakub Jelinek.)

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-28 21:02:00 -07:00
Paul Jackson
abb5a5cc6b [PATCH] Cpuset: fix ABBA deadlock with cpu hotplug lock
Fix ABBA deadlock between lock_cpu_hotplug() and the cpuset
callback_mutex lock.

It only happens on cpu_exclusive cpusets, due to the dynamic
sched domain code trying to take the cpu hotplug lock inside
the cpuset callback_mutex lock.

This bug has apparently been here for several months, but didn't
get hit until the right customer load on a large system.

This fix appears right from inspection, but it will take a few
more days running it on that customers workload to be confident
we nailed it.  We don't have any other reproducible test case.

The cpu_hotplug_lock() tends to cover large runs of code.
The other places that hold both that lock and the cpuset callback
mutex lock always nest the cpuset lock inside the hotplug lock.
This place tries to do the reverse, risking an ABBA deadlock.

This is in the cpuset_rmdir() code, where we:
  * take the callback_mutex lock
  * mark the cpuset CS_REMOVED
  * call update_cpu_domains for cpu_exclusive cpusets
  * in that call, take the cpu_hotplug lock if the
    cpuset is marked for removal.

Thanks to Jack Steiner for identifying this deadlock.

The fix is to tear down the dynamic sched domain before we grab
the cpuset callback_mutex lock.  This way, the two locks are
serialized, with the hotplug lock taken and released before
trying for the cpuset lock.

I suspect that this bug was introduced when I changed the
cpuset locking from one lock to two.  The dynamic sched domain
dependency on cpu_exclusive cpusets and its hotplug hooks were
added to this code earlier, when cpusets had only a single lock.
It may well have been fine then.

Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-23 13:03:05 -07:00
Linus Torvalds
aa95387774 cpu hotplug: simplify and hopefully fix locking
The CPU hotplug locking was quite messy, with a recursive lock to
handle the fact that both the actual up/down sequence wanted to
protect itself from being re-entered, but the callbacks that it
called also tended to want to protect themselves from CPU events.

This splits the lock into two (one to serialize the whole hotplug
sequence, the other to protect against the CPU present bitmaps
changing). The latter still allows recursive usage because some
subsystems (ondemand policy for cpufreq at least) had already gotten
too used to the lax locking, but the locking mistakes are hopefully
now less fundamental, and we now warn about recursive lock usage
when we see it, in the hope that it can be fixed.

Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-23 12:12:16 -07:00
Shailabh Nagar
bb129994c3 [PATCH] Remove down_write() from taskstats code invoked on the exit() path
In send_cpu_listeners(), which is called on the exit path, a down_write()
was protecting operations like skb_clone() and genlmsg_unicast() that do
GFP_KERNEL allocations.  If the oom-killer decides to kill tasks to satisfy
the allocations,the exit of those tasks could block on the same semphore.

The down_write() was only needed to allow removal of invalid listeners from
the listener list.  The patch converts the down_write to a down_read and
defers the removal to a separate critical region.  This ensures that even
if the oom-killer is called, no other task's exit is blocked as it can
still acquire another down_read.

Thanks to Andrew Morton & Herbert Xu for pointing out the oom related
pitfalls, and to Chandra Seetharaman for suggesting this fix instead of
using something more complex like RCU.

Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Signed-off-by: Shailabh Nagar <nagar@watson.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-14 21:53:57 -07:00
Shailabh Nagar
f9fd8914c1 [PATCH] per-task delay accounting taskstats interface: control exit data through cpumasks
On systems with a large number of cpus, with even a modest rate of tasks
exiting per cpu, the volume of taskstats data sent on thread exit can
overflow a userspace listener's buffers.

One approach to avoiding overflow is to allow listeners to get data for a
limited and specific set of cpus.  By scaling the number of listeners
and/or the cpus they monitor, userspace can handle the statistical data
overload more gracefully.

In this patch, each listener registers to listen to a specific set of cpus
by specifying a cpumask.  The interest is recorded per-cpu.  When a task
exits on a cpu, its taskstats data is unicast to each listener interested
in that cpu.

Thanks to Andrew Morton for pointing out the various scalability and
general concerns of previous attempts and for suggesting this design.

[akpm@osdl.org: build fix]
Signed-off-by: Shailabh Nagar <nagar@watson.ibm.com>
Signed-off-by: Balbir Singh <balbir@in.ibm.com>
Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-14 21:53:57 -07:00
Shailabh Nagar
ad4ecbcba7 [PATCH] delay accounting taskstats interface send tgid once
Send per-tgid data only once during exit of a thread group instead of once
with each member thread exit.

Currently, when a thread exits, besides its per-tid data, the per-tgid data
of its thread group is also sent out, if its thread group is non-empty.
The per-tgid data sent consists of the sum of per-tid stats for all
*remaining* threads of the thread group.

This patch modifies this sending in two ways:

- the per-tgid data is sent only when the last thread of a thread group
  exits.  This cuts down heavily on the overhead of sending/receiving
  per-tgid data, especially when other exploiters of the taskstats
  interface aren't interested in per-tgid stats

- the semantics of the per-tgid data sent are changed.  Instead of being
  the sum of per-tid data for remaining threads, the value now sent is the
  true total accumalated statistics for all threads that are/were part of
  the thread group.

The patch also addresses a minor issue where failure of one accounting
subsystem to fill in the taskstats structure was causing the send of
taskstats to not be sent at all.

The patch has been tested for stability and run cerberus for over 4 hours
on an SMP.

[akpm@osdl.org: bugfixes]
Signed-off-by: Shailabh Nagar <nagar@watson.ibm.com>
Signed-off-by: Balbir Singh <balbir@in.ibm.com>
Cc: Jay Lan <jlan@engr.sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-14 21:53:57 -07:00
Shailabh Nagar
2589045466 [PATCH] per-task-delay-accounting: /proc export of aggregated block I/O delays
Export I/O delays seen by a task through /proc/<tgid>/stats for use in top
etc.

Note that delays for I/O done for swapping in pages (swapin I/O) is clubbed
together with all other I/O here (this is not the case in the netlink
interface where the swapin I/O is kept distinct)

[akpm@osdl.org: printk warning fix]
Signed-off-by: Shailabh Nagar <nagar@watson.ibm.com>
Signed-off-by: Balbir Singh <balbir@in.ibm.com>
Cc: Jes Sorensen <jes@sgi.com>
Cc: Peter Chubb <peterc@gelato.unsw.edu.au>
Cc: Erich Focht <efocht@ess.nec.de>
Cc: Levent Serinol <lserinol@gmail.com>
Cc: Jay Lan <jlan@engr.sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-14 21:53:57 -07:00
Shailabh Nagar
6f44993fe1 [PATCH] per-task-delay-accounting: delay accounting usage of taskstats interface
Usage of taskstats interface by delay accounting.

Signed-off-by: Shailabh Nagar <nagar@us.ibm.com>
Signed-off-by: Balbir Singh <balbir@in.ibm.com>
Cc: Jes Sorensen <jes@sgi.com>
Cc: Peter Chubb <peterc@gelato.unsw.edu.au>
Cc: Erich Focht <efocht@ess.nec.de>
Cc: Levent Serinol <lserinol@gmail.com>
Cc: Jay Lan <jlan@engr.sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-14 21:53:56 -07:00
Shailabh Nagar
c757249af1 [PATCH] per-task-delay-accounting: taskstats interface
Create a "taskstats" interface based on generic netlink (NETLINK_GENERIC
family), for getting statistics of tasks and thread groups during their
lifetime and when they exit.  The interface is intended for use by multiple
accounting packages though it is being created in the context of delay
accounting.

This patch creates the interface without populating the fields of the data
that is sent to the user in response to a command or upon the exit of a task.
Each accounting package interested in using taskstats has to provide an
additional patch to add its stats to the common structure.

[akpm@osdl.org: cleanups, Kconfig fix]
Signed-off-by: Shailabh Nagar <nagar@us.ibm.com>
Signed-off-by: Balbir Singh <balbir@in.ibm.com>
Cc: Jes Sorensen <jes@sgi.com>
Cc: Peter Chubb <peterc@gelato.unsw.edu.au>
Cc: Erich Focht <efocht@ess.nec.de>
Cc: Levent Serinol <lserinol@gmail.com>
Cc: Jay Lan <jlan@engr.sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-14 21:53:56 -07:00
Chandra Seetharaman
52f17b6c2b [PATCH] per-task-delay-accounting: cpu delay collection via schedstats
Make the task-related schedstats functions callable by delay accounting even
if schedstats collection isn't turned on.  This removes the dependency of
delay accounting on schedstats.

Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Signed-off-by: Shailabh Nagar <nagar@watson.ibm.com>
Signed-off-by: Balbir Singh <balbir@in.ibm.com>
Cc: Jes Sorensen <jes@sgi.com>
Cc: Peter Chubb <peterc@gelato.unsw.edu.au>
Cc: Erich Focht <efocht@ess.nec.de>
Cc: Levent Serinol <lserinol@gmail.com>
Cc: Jay Lan <jlan@engr.sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-14 21:53:56 -07:00
Shailabh Nagar
0ff922452d [PATCH] per-task-delay-accounting: sync block I/O and swapin delay collection
Unlike earlier iterations of the delay accounting patches, now delays are only
collected for the actual I/O waits rather than try and cover the delays seen
in I/O submission paths.

Account separately for block I/O delays incurred as a result of swapin page
faults whose frequency can be affected by the task/process' rss limit.  Hence
swapin delays can act as feedback for rss limit changes independent of I/O
priority changes.

Signed-off-by: Shailabh Nagar <nagar@watson.ibm.com>
Signed-off-by: Balbir Singh <balbir@in.ibm.com>
Cc: Jes Sorensen <jes@sgi.com>
Cc: Peter Chubb <peterc@gelato.unsw.edu.au>
Cc: Erich Focht <efocht@ess.nec.de>
Cc: Levent Serinol <lserinol@gmail.com>
Cc: Jay Lan <jlan@engr.sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-14 21:53:56 -07:00
Shailabh Nagar
ca74e92b46 [PATCH] per-task-delay-accounting: setup
Initialization code related to collection of per-task "delay" statistics which
measure how long it had to wait for cpu, sync block io, swapping etc.  The
collection of statistics and the interface are in other patches.  This patch
sets up the data structures and allows the statistics collection to be
disabled through a kernel boot parameter.

Signed-off-by: Shailabh Nagar <nagar@watson.ibm.com>
Signed-off-by: Balbir Singh <balbir@in.ibm.com>
Cc: Jes Sorensen <jes@sgi.com>
Cc: Peter Chubb <peterc@gelato.unsw.edu.au>
Cc: Erich Focht <efocht@ess.nec.de>
Cc: Levent Serinol <lserinol@gmail.com>
Cc: Jay Lan <jlan@engr.sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-14 21:53:56 -07:00
Ingo Molnar
3a5f5e488c [PATCH] lockdep: core, fix rq-lock handling on __ARCH_WANT_UNLOCKED_CTXSW
On platforms that have __ARCH_WANT_UNLOCKED_CTXSW set and want to implement
lock validator support there's a bug in rq->lock handling: in this case we
dont 'carry over' the runqueue lock into another task - but still we did a
spinlock_release() of it.  Fix this by making the spinlock_release() in
context_switch() dependent on !__ARCH_WANT_UNLOCKED_CTXSW.

(Reported by Ralf Baechle on MIPS, which has __ARCH_WANT_UNLOCKED_CTXSW.
This fixes a lockdep-internal BUG message on such platforms.)

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-14 21:53:55 -07:00
OGAWA Hirofumi
a4afee02a5 [PATCH] Fix sighand->siglock usage in kernel/acct.c
IRQs must be disabled before taking ->siglock.

Noticed by lockdep.

Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Cc: Arjan van de Ven <arjan@infradead.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-14 21:53:54 -07:00
john stultz
3e143475c2 [PATCH] improve timekeeping resume robustness
Resolve problems seen w/ APM suspend.

Due to resume initialization ordering, its possible we could get a timer
interrupt before the timekeeping resume() function is called.  This patch
ensures we don't do any timekeeping accounting before we're fully resumed.

(akpm: fixes the machine-freezes-on-APM-resume bug)

Signed-off-by: John Stultz <johnstul@us.ibm.com>
Cc: Roman Zippel <zippel@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-14 21:53:54 -07:00
Adrian Bunk
6bc02d8412 [PATCH] unexport open_softirq
Christoph Hellwig:
open_softirq just enables a softirq.  The softirq array is statically
allocated so to add a new one you would have to patch the kernel.  So
there's no point to keep this export at all as any user would have to
patch the enum in include/linux/interrupt.h anyway.

Signed-off-by: Adrian Bunk <bunk@stusta.de>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-14 21:53:53 -07:00
Luca Tettamanti
60198f9992 [PATCH] Add try_to_freeze() to rt-test kthreads
When CONFIG_RT_MUTEX_TESTER is enabled kernel refuses to suspend the
machine because it's unable to freeze the rt-test-* threads.

Add try_to_freeze() after schedule() so that the threads will be freezed
correctly; I've tested the patch and it lets the notebook suspends and
resumes nicely.

Signed-off-by: Luca Tettamanti <kronos.it@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-14 21:53:53 -07:00
Andrew Morton
a0009652af [PATCH] del_timer_sync(): add cpu_relax()
Relax the CPU in the del_timer_sync() busywait loop.

Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-14 21:53:52 -07:00
Adrian Bunk
52e92e5788 [PATCH] remove kernel/kthread.c:kthread_stop_sem()
Remove the now-unneeded kthread_stop_sem().

Signed-off-by: Adrian Bunk <bunk@stusta.de>
Acked-by: Alan Stern <stern@rowland.harvard.edu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-14 21:53:52 -07:00
Andreas Gruenbacher
098c5eea03 [PATCH] null-terminate over-long /proc/kallsyms symbols
Got a customer bug report (https://bugzilla.novell.com/190296) about kernel
symbols longer than 127 characters which end up in a string buffer that is
not NULL terminated, leading to garbage in /proc/kallsyms.  Using strlcpy
prevents this from happening, even though such symbols still won't come out
right.

A better fix would be to not use a fixed-size buffer, but it's probably not
worth the trouble.  (Modversion'ed symbols even have a length limit of 60.)

[bunk@stusta.de: build fix]
Signed-off-by: Andreas Gruenbacher <agruen@suse.de>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-14 21:53:52 -07:00
Adrian Bunk
cd6ef2ada5 [PATCH] The scheduled unexport of insert_resource
Implement the scheduled unexport of insert_resource.

Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2006-07-12 16:09:08 -07:00
Adrian Bunk
26865e9c26 [PATCH] remove kernel/power/pm.c:pm_unregister_all()
Remove the deprecated and no longer used pm_unregister_all().

Signed-off-by: Adrian Bunk <bunk@stusta.de>
Acked-by: Pavel Machek <pavel@suse.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2006-07-12 16:09:08 -07:00
Marcel Holtmann
abf75a5033 [PATCH] Fix prctl privilege escalation and suid_dumpable (CVE-2006-2451)
Based on a patch from Ernie Petrides

During security research, Red Hat discovered a behavioral flaw in core
dump handling. A local user could create a program that would cause a
core file to be dumped into a directory they would not normally have
permissions to write to. This could lead to a denial of service (disk
consumption), or allow the local user to gain root privileges.

The prctl() system call should never allow to set "dumpable" to the
value 2. Especially not for non-privileged users.

This can be split into three cases:

  1) running as root -- then core dumps will already be done as root,
     and so prctl(PR_SET_DUMPABLE, 2) is not useful

  2) running as non-root w/setuid-to-root -- this is the debatable case

  3) running as non-root w/setuid-to-non-root -- then you definitely
     do NOT want "dumpable" to get set to 2 because you have the
     privilege escalation vulnerability

With case #2, the only potential usefulness is for a program that has
designed to run with higher privilege (than the user invoking it) that
wants to be able to create root-owned root-validated core dumps. This
might be useful as a debugging aid, but would only be safe if the program
had done a chdir() to a safe directory.

There is no benefit to a production setuid-to-root utility, because it
shouldn't be dumping core in the first place. If this is true, then the
same debugging aid could also be accomplished with the "suid_dumpable"
sysctl.

Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-12 12:50:25 -07:00
Arjan van de Ven
2c16e9c888 [PATCH] lockdep: disable lock debugging when kernel state becomes untrusted
Disable lockdep debugging in two situations where the integrity of the
kernel no longer is guaranteed: when oopsing and when hitting a
tainting-condition.  The goal is to not get weird lockdep traces that don't
make sense or are otherwise undebuggable, to not waste time.

Lockdep assumes that the previous state it knows about is valid to operate,
which is why lockdep turns itself off after the first violation it reports,
after that point it can no longer make that assumption.

A kernel oops means that the integrity of the kernel compromised; in
addition anything lockdep would report is of lesser importance than the
oops.

All the tainting conditions are of similar integrity-violating nature and
also make debugging/diagnosing more difficult.

Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-10 13:24:27 -07:00
Christoph Hellwig
c59923a15c [PATCH] remove the tasklist_lock export
As announced half a year ago this patch will remove the tasklist_lock
export.  The previous two patches got rid of the remaining modular users.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-10 13:24:26 -07:00
Ingo Molnar
21d71f513b [PATCH] uninline init_waitqueue_head()
allyesconfig vmlinux size delta:

  text            data    bss     dec          filename
  20736884        6073834 3075176 29885894     vmlinux.before
  20721009        6073966 3075176 29870151     vmlinux.after

~18 bytes per callsite, 15K of text size (~0.1%) saved.

(as an added bonus this also removes a lockdep annotation.)

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-10 13:24:25 -07:00
Linus Torvalds
aeceb15738 [PATCH] swsusp: fix panic when signature can't be read
Do not panic a machine when swsusp signature can't be read.

Signed-off-by: Pavel Machek <pavel@suse.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-10 13:24:22 -07:00
Andrew Morton
712f403af6 [PATCH] swsusp warning fix
kernel/power/swap.c: In function 'swsusp_write':
kernel/power/swap.c:275: warning: 'start' may be used uninitialized in this function

gcc isn't smart enough, so help it.

Cc: Pavel Machek <pavel@ucw.cz>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-10 13:24:22 -07:00
Rafael J. Wysocki
95018f7c94 [PATCH] swsusp: do not use memcpy for snapshotting memory
swsusp should not use memcpy for snapshotting memory, because on some
architectures memcpy may increase preempt_count (i386 does this when
CONFIG_X86_USE_3DNOW is set).  Then, as a result, wrong value of preempt_count
is stored in the image.

Replace memcpy in copy_data_pages with an open-coded loop.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-10 13:24:22 -07:00
Roman Zippel
e154ff3d2c [PATCH] adjust clock for lost ticks
A large number of lost ticks can cause an overadjustment of the clock.  To
compensate for this we look at the current error and the larger the error
already is the more careful we are at adjusting the error.  As small extra
fix reset the error when the clock is set.

Signed-off-by: Roman Zippel <zippel@linux-m68k.org>
Acked-by: john stultz <johnstul@us.ibm.com>
Cc: Uwe Bugla <uwe.bugla@gmx.de>
Cc: James Bottomley <James.Bottomley@SteelEye.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-10 13:24:18 -07:00
Thomas Gleixner
06a9ec291b [PATCH] pi-futex: Validate futex type instead of oopsing
Calling futex_lock_pi is called with a reference to a non PI futex and
waiters exist already, lookup_pi_state() oopses due to pi_state == NULL.
Check this condition and return -EINVAL to userspace.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Jakub Jelinek <jakub@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-10 13:24:18 -07:00
Adrian Bunk
80d6679a62 [PATCH] kernel/softirq.c: EXPORT_UNUSED_SYMBOL
This patch marks an unused export as EXPORT_UNUSED_SYMBOL.

Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-10 13:24:18 -07:00
Adrian Bunk
c0fc84d2e5 [PATCH] kernel/printk.c: EXPORT_SYMBOL_UNUSED
This patch marks unused exports as EXPORT_SYMBOL_UNUSED.

Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-10 13:24:17 -07:00
Ingo Molnar
d6d897cec2 [PATCH] lockdep: core, reduce per-lock class-cache size
lockdep_map is embedded into every lock, which blows up data structure
sizes all around the kernel.  Reduce the class-cache to be for the default
class only - that is used in 99.9% of the cases and even if we dont have a
class cached, the lookup in the class-hash is lockless.

This change reduces the per-lock dep_map overhead by 56 bytes on 64-bit
platforms and by 28 bytes on 32-bit platforms.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-10 13:24:14 -07:00
Arjan van de Ven
55794a412f [PATCH] lockdep: improve debug output
Make lockdep print which lock is held, in the "kfree() of a live lock"
scenario.

Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-10 13:24:14 -07:00
Andi Kleen
f9829cceb6 [PATCH] Minor cleanup to lockdep.c
- Use printk formatting for indentation
- Don't leave NTFS in the default event filter

Signed-off-by: Andi Kleen <ak@suse.de>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-10 13:24:14 -07:00
Andreas Mohr
2ed6e34f88 [PATCH] small kernel/sched.c cleanup
- constify and optimize stat_nam (thanks to Michael Tokarev!)
- spelling and comment fixes

Signed-off-by: Andreas Mohr <andi@lisas.de>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-10 13:24:13 -07:00
Peter Williams
0a565f7919 [PATCH] sched: fix bug in __migrate_task()
Problem:

In the function __migrate_task(), deactivate_task() followed by
activate_task() is used to move the task from one run queue to
another.  This has two undesirable effects:

1. The task's priority is recalculated. (Nowhere else in the
scheduler code is the priority recalculated for a change of CPU.)

2. The task's time stamp is set to the current time.  At the very least,
this makes the adjustment of the time stamp before the call to
deactivate_task() redundant but I believe the problem is more serious
as the time stamp now holds the time of the queue change instead of
the time at which the task was woken.  In addition, unless dest_rq is
the same queue as "current" is on the time stamp could be inaccurate
due to inter CPU drift.

Solution:

Replace the call to activate_task() with one to __activate_task().

Signed-off-by: Peter Williams <pwil3058@bigpond.net.au>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-10 13:24:13 -07:00
Linus Torvalds
ca78f6baca Merge master.kernel.org:/pub/scm/linux/kernel/git/davej/cpufreq
* master.kernel.org:/pub/scm/linux/kernel/git/davej/cpufreq:
  Move workqueue exports to where the functions are defined.
  [CPUFREQ] Misc cleanups in ondemand.
  [CPUFREQ] Make ondemand sampling per CPU and remove the mutex usage in sampling path.
  [CPUFREQ] Add queue_delayed_work_on() interface for workqueues.
  [CPUFREQ] Remove slowdown from ondemand sampling path.
2006-07-04 14:00:26 -07:00
Andrew Morton
d8cb7c1ded [PATCH] revert "kthread: convert stop_machine into a kthread"
Jiri reports that the stop_machin kthread conversion caused his machine to
hang when suspending.  Hyperthreading is apparently involved.

I don't see why that would be and I can't reproduce it.  Revert to the 2.6.17
code.

Cc: "Serge E. Hallyn" <serue@us.ibm.com>
Cc: Jiri Slaby <jirislaby@gmail.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03 21:25:20 -07:00
Linus Torvalds
912b2539e1 Merge git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc
* git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc:
  powerpc: add defconfig for Freescale MPC8349E-mITX board
  powerpc: Add base support for the Freescale MPC8349E-mITX eval board
  Documentation: correct values in MPC8548E SEC example node
  [POWERPC] Actually copy over i8259.c to arch/ppc/syslib this time
  [POWERPC] Add new interrupt mapping core and change platforms to use it
  [POWERPC] Copy i8259 code back to arch/ppc
  [POWERPC] New device-tree interrupt parsing code
  [POWERPC] Use the genirq framework
  [PATCH] genirq: Allow fasteoi handler to retrigger disabled interrupts
  [POWERPC] Update the SWIM3 (powermac) floppy driver
  [POWERPC] Fix error handling in detecting legacy serial ports
  [POWERPC] Fix booting on Momentum "Apache" board (a Maple derivative)
  [POWERPC] Fix various offb and BootX-related issues
  [POWERPC] Add a default config for 32-bit CHRP machines
  [POWERPC] fix implicit declaration on cell.
  [POWERPC] change get_property to return void *
2006-07-03 15:28:34 -07:00
Ingo Molnar
70b97a7f0b [PATCH] sched: cleanup, convert sched.c-internal typedefs to struct
convert:

 - runqueue_t to 'struct rq'
 - prio_array_t to 'struct prio_array'
 - migration_req_t to 'struct migration_req'

I was the one who added these but they are both against the kernel coding
style and also were used inconsistently at places.  So just get rid of them at
once, now that we are flushing the scheduler patch-queue anyway.

Conversion was mostly scripted, the result was reviewed and all secondary
whitespace and style impact (if any) was fixed up by hand.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03 15:27:11 -07:00
Ingo Molnar
36c8b58689 [PATCH] sched: cleanup, remove task_t, convert to struct task_struct
cleanup: remove task_t and convert all the uses to struct task_struct. I
introduced it for the scheduler anno and it was a mistake.

Conversion was mostly scripted, the result was reviewed and all
secondary whitespace and style impact (if any) was fixed up by hand.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03 15:27:11 -07:00
Ingo Molnar
48f24c4da1 [PATCH] sched: clean up fallout of recent changes
Clean up some of the impact of recent (and not so recent) scheduler
changes:

 - turning macros into nice inline functions
 - sanitizing and unifying variable definitions
 - whitespace, style consistency, 80-lines, comment correctness, spelling
   and curly braces police

Due to the macro hell and variable placement simplifications there's even 26
bytes of .text saved:

   text    data     bss     dec     hex filename
  25510    4153     192   29855    749f sched.o.before
  25484    4153     192   29829    7485 sched.o.after

[akpm@osdl.org: build fix]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03 15:27:10 -07:00
Paul Mackerras
829035fd70 [PATCH] lockdep: irqtrace subsystem, move account_system_vtime() calls into kernel/softirq.c
At the moment, powerpc and s390 have their own versions of do_softirq which
include local_bh_disable() and __local_bh_enable() calls.  They end up
calling __do_softirq (in kernel/softirq.c) which also does
local_bh_disable/enable.

Apparently the two levels of disable/enable trigger a warning from some
validation code that Ingo is working on, and he would like to see the outer
level removed.  But to do that, we have to move the account_system_vtime
calls that are currently in the arch do_softirq() implementations for
powerpc and s390 into the generic __do_softirq() (this is a no-op for other
archs because account_system_vtime is defined to be an empty inline
function on all other archs).  This patch does that.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03 15:27:10 -07:00
Ingo Molnar
60be6b9a41 [PATCH] lockdep: annotate on-stack completions
lockdep needs to have the waitqueue lock initialized for on-stack waitqueues
implicitly initialized by DECLARE_COMPLETION().  Annotate on-stack completions
accordingly.

Has no effect on non-lockdep kernels.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03 15:27:09 -07:00
Ingo Molnar
366c7f554e [PATCH] lockdep: annotate enable_in_hardirq()
Make use of local_irq_enable_in_hardirq() API to annotate places that enable
hardirqs in hardirq context.

Has no effect on non-lockdep kernels.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03 15:27:09 -07:00
Ingo Molnar
ad33945175 [PATCH] lockdep: annotate ->mmap_sem
Teach special (recursive) locking code to the lock validator.  Has no effect
on non-lockdep kernels.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03 15:27:08 -07:00
Ingo Molnar
5436552448 [PATCH] lockdep: annotate hrtimer base locks
Teach special (recursive) locking code to the lock validator.  Has no effect
on non-lockdep kernels.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03 15:27:07 -07:00
Ingo Molnar
fcb993712f [PATCH] lockdep: annotate scheduler runqueue locks
Teach per-CPU runqueue locks and recursive locking code to the lock validator.
 Has no effect on non-lockdep kernels.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03 15:27:07 -07:00
Ingo Molnar
d730e882a1 [PATCH] lockdep: annotate timer base locks
Split the per-CPU timer base locks up into separate lock classes, because they
are used recursively.

Has no effect on non-lockdep kernels.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03 15:27:07 -07:00
Ingo Molnar
eb4542b98c [PATCH] lockdep: annotate waitqueues
Create one lock class for all waitqueue locks in the kernel.  Has no effect on
non-lockdep kernels.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03 15:27:07 -07:00
Ingo Molnar
243c7621aa [PATCH] lockdep: annotate genirq
Teach special (recursive) locking code to the lock validator.  Has no effect
on non-lockdep kernels.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03 15:27:06 -07:00
Ingo Molnar
8b8f319fc7 [PATCH] lockdep: annotate futex
Teach special (recursive) locking code to the lock validator.  Introduces
double_lock_hb() to unify double- hash-bucket-lock taking.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03 15:27:06 -07:00
Ingo Molnar
a0f1ccfd8d [PATCH] lockdep: do not recurse in printk
Make printk()-ing from within the lock validation code safer by using the
lockdep-recursion counter.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03 15:27:05 -07:00
Ingo Molnar
ef5d4707b9 [PATCH] lockdep: prove mutex locking correctness
Use the lock validator framework to prove mutex locking correctness.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03 15:27:04 -07:00
Ingo Molnar
8a25d5debf [PATCH] lockdep: prove spinlock rwlock locking correctness
Use the lock validator framework to prove spinlock and rwlock locking
correctness.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03 15:27:04 -07:00
Ingo Molnar
4ea2176dfa [PATCH] lockdep: prove rwsem locking correctness
Use the lock validator framework to prove rwsem locking correctness.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03 15:27:04 -07:00
Ingo Molnar
a8f24a3978 [PATCH] lockdep: procfs
Lock validator /proc/lockdep and /proc/lockdep_stats support.
(FIXME: should go into debugfs)

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03 15:27:04 -07:00
Ingo Molnar
6c9076ec9c [PATCH] lockdep: allow read_lock() recursion of same class
From: Ingo Molnar <mingo@elte.hu>

lockdep so far only allowed read-recursion for the same lock instance.
This is enough in the overwhelming majority of cases, but a hostap case
triggered and reported by Miles Lane relies on same-class
different-instance recursion.  So we relax the restriction on read-lock
recursion.

(This change does not allow rwsem read-recursion, which is still
forbidden.)

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03 15:27:04 -07:00
Ingo Molnar
fbb9ce9530 [PATCH] lockdep: core
Do 'make oldconfig' and accept all the defaults for new config options -
reboot into the kernel and if everything goes well it should boot up fine and
you should have /proc/lockdep and /proc/lockdep_stats files.

Typically if the lock validator finds some problem it will print out
voluminous debug output that begins with "BUG: ..." and which syslog output
can be used by kernel developers to figure out the precise locking scenario.

What does the lock validator do?  It "observes" and maps all locking rules as
they occur dynamically (as triggered by the kernel's natural use of spinlocks,
rwlocks, mutexes and rwsems).  Whenever the lock validator subsystem detects a
new locking scenario, it validates this new rule against the existing set of
rules.  If this new rule is consistent with the existing set of rules then the
new rule is added transparently and the kernel continues as normal.  If the
new rule could create a deadlock scenario then this condition is printed out.

When determining validity of locking, all possible "deadlock scenarios" are
considered: assuming arbitrary number of CPUs, arbitrary irq context and task
context constellations, running arbitrary combinations of all the existing
locking scenarios.  In a typical system this means millions of separate
scenarios.  This is why we call it a "locking correctness" validator - for all
rules that are observed the lock validator proves it with mathematical
certainty that a deadlock could not occur (assuming that the lock validator
implementation itself is correct and its internal data structures are not
corrupted by some other kernel subsystem).  [see more details and conditionals
of this statement in include/linux/lockdep.h and
Documentation/lockdep-design.txt]

Furthermore, this "all possible scenarios" property of the validator also
enables the finding of complex, highly unlikely multi-CPU multi-context races
via single single-context rules, increasing the likelyhood of finding bugs
drastically.  In practical terms: the lock validator already found a bug in
the upstream kernel that could only occur on systems with 3 or more CPUs, and
which needed 3 very unlikely code sequences to occur at once on the 3 CPUs.
That bug was found and reported on a single-CPU system (!).  So in essence a
race will be found "piecemail-wise", triggering all the necessary components
for the race, without having to reproduce the race scenario itself!  In its
short existence the lock validator found and reported many bugs before they
actually caused a real deadlock.

To further increase the efficiency of the validator, the mapping is not per
"lock instance", but per "lock-class".  For example, all struct inode objects
in the kernel have inode->inotify_mutex.  If there are 10,000 inodes cached,
then there are 10,000 lock objects.  But ->inotify_mutex is a single "lock
type", and all locking activities that occur against ->inotify_mutex are
"unified" into this single lock-class.  The advantage of the lock-class
approach is that all historical ->inotify_mutex uses are mapped into a single
(and as narrow as possible) set of locking rules - regardless of how many
different tasks or inode structures it took to build this set of rules.  The
set of rules persist during the lifetime of the kernel.

To see the rough magnitude of checking that the lock validator does, here's a
portion of /proc/lockdep_stats, fresh after bootup:

 lock-classes:                            694 [max: 2048]
 direct dependencies:                  1598 [max: 8192]
 indirect dependencies:               17896
 all direct dependencies:             16206
 dependency chains:                    1910 [max: 8192]
 in-hardirq chains:                      17
 in-softirq chains:                     105
 in-process chains:                    1065
 stack-trace entries:                 38761 [max: 131072]
 combined max dependencies:         2033928
 hardirq-safe locks:                     24
 hardirq-unsafe locks:                  176
 softirq-safe locks:                     53
 softirq-unsafe locks:                  137
 irq-safe locks:                         59
 irq-unsafe locks:                      176

The lock validator has observed 1598 actual single-thread locking patterns,
and has validated all possible 2033928 distinct locking scenarios.

More details about the design of the lock validator can be found in
Documentation/lockdep-design.txt, which can also found at:

   http://redhat.com/~mingo/lockdep-patches/lockdep-design.txt

[bunk@stusta.de: cleanups]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03 15:27:03 -07:00
Ingo Molnar
de30a2b355 [PATCH] lockdep: irqtrace subsystem, core
Accurate hard-IRQ-flags and softirq-flags state tracing.

This allows us to attach extra functionality to IRQ flags on/off
events (such as trace-on/off).

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03 15:27:03 -07:00
Ingo Molnar
8637c09901 [PATCH] lockdep: stacktrace subsystem, core
Framework to generate and save stacktraces quickly, without printing anything
to the console.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03 15:27:02 -07:00
Ingo Molnar
e4d9191885 [PATCH] lockdep: locking init debugging improvement
Locking init improvement:

 - introduce and use __SPIN_LOCK_UNLOCKED for array initializations,
   to pass in the name string of locks, used by debugging

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03 15:27:02 -07:00
Ingo Molnar
9cebb55268 [PATCH] lockdep: mutex section binutils workaround
Work around weird section nesting build bug causing smp-alternatives failures
under certain circumstances.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03 15:27:01 -07:00
Ingo Molnar
9a11b49a80 [PATCH] lockdep: better lock debugging
Generic lock debugging:

 - generalized lock debugging framework. For example, a bug in one lock
   subsystem turns off debugging in all lock subsystems.

 - got rid of the caller address passing (__IP__/__IP_DECL__/etc.) from
   the mutex/rtmutex debugging code: it caused way too much prototype
   hackery, and lockdep will give the same information anyway.

 - ability to do silent tests

 - check lock freeing in vfree too.

 - more finegrained debugging options, to allow distributions to
   turn off more expensive debugging features.

There's no separate 'held mutexes' list anymore - but there's a 'held locks'
stack within lockdep, which unifies deadlock detection across all lock
classes.  (this is independent of the lockdep validation stuff - lockdep first
checks whether we are holding a lock already)

Here are the current debugging options:

CONFIG_DEBUG_MUTEXES=y
CONFIG_DEBUG_LOCK_ALLOC=y

which do:

 config DEBUG_MUTEXES
          bool "Mutex debugging, basic checks"

 config DEBUG_LOCK_ALLOC
         bool "Detect incorrect freeing of live mutexes"

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03 15:27:01 -07:00
Ingo Molnar
fb7e42413a [PATCH] lockdep: remove mutex deadlock checking code
With the lock validator we detect mutex deadlocks (and more), the mutex
deadlock checking code is both redundant and slower.  So remove it.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03 15:27:01 -07:00
Ingo Molnar
36596243da [PATCH] lockdep: remove DEBUG_BUG_ON()
cleanup: remove unused DEBUG_BUG_ON() defines.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03 15:27:01 -07:00
Ingo Molnar
9e7f4d451e [PATCH] lockdep: rename DEBUG_WARN_ON()
Rename DEBUG_WARN_ON() to the less generic DEBUG_LOCKS_WARN_ON() name, so that
it's clear that this is a lock-debugging internal mechanism.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03 15:27:01 -07:00
Ingo Molnar
c4e05116a2 [PATCH] lockdep: clean up rwsems
Clean up rwsems.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03 15:27:01 -07:00
Ingo Molnar
4d435f9d8f [PATCH] lockdep: add is_module_address()
Add is_module_address() method - to be used by lockdep.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03 15:27:00 -07:00
Christoph Lameter
9614634fe6 [PATCH] ZVC/zone_reclaim: Leave 1% of unmapped pagecache pages for file I/O
It turns out that it is advantageous to leave a small portion of unmapped file
backed pages if all of a zone's pages (or almost all pages) are allocated and
so the page allocator has to go off-node.

This allows recently used file I/O buffers to stay on the node and
reduces the times that zone reclaim is invoked if file I/O occurs
when we run out of memory in a zone.

The problem is that zone reclaim runs too frequently when the page cache is
used for file I/O (read write and therefore unmapped pages!) alone and we have
almost all pages of the zone allocated.  Zone reclaim may remove 32 unmapped
pages.  File I/O will use these pages for the next read/write requests and the
unmapped pages increase.  After the zone has filled up again zone reclaim will
remove it again after only 32 pages.  This cycle is too inefficient and there
are potentially too many zone reclaim cycles.

With the 1% boundary we may still remove all unmapped pages for file I/O in
zone reclaim pass.  However.  it will take a large number of read and writes
to get back to 1% again where we trigger zone reclaim again.

The zone reclaim 2.6.16/17 does not show this behavior because we have a 30
second timeout.

[akpm@osdl.org: rename the /proc file and the variable]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03 15:26:59 -07:00
Benjamin Herrenschmidt
5a43a066b1 [PATCH] genirq: Allow fasteoi handler to retrigger disabled interrupts
Make the fasteoi handler mark disabled interrupts as pending if they
happen anyway. This allow implementation of a delayed disable scheme
with the fasteoi handler.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Paul Mackerras <paulus@samba.org>
2006-07-03 19:54:59 +10:00
Thomas Gleixner
284c66806e [PATCH] genirq:fixup missing SA_PERCPU replacement
The irqflags consolidation converted SA_PERCPU_IRQ to IRQF_PERCPU but
did not define the new constant.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-02 17:29:22 -07:00
Thomas Gleixner
d061daa0e3 [PATCH] genirq: ARM dyntick cleanup
Linus: "The hacks in kernel/irq/handle.c are really horrid. REALLY
horrid."

They are indeed. Move the dyntick quirks to ARM where they belong.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-02 17:29:21 -07:00
Linus Torvalds
b4b9034132 Merge branch 'genirq' of master.kernel.org:/home/rmk/linux-2.6-arm
* 'genirq' of master.kernel.org:/home/rmk/linux-2.6-arm: (24 commits)
  [ARM] 3683/2:  ARM: Convert at91rm9200 to generic irq handling
  [ARM] 3682/2:  ARM: Convert ixp4xx to generic irq handling
  [ARM] 3702/1: ARM: Convert ixp23xx to generic irq handling
  [ARM] 3701/1: ARM: Convert plat-omap to generic irq handling
  [ARM] 3700/1: ARM: Convert lh7a40x to generic irq handling
  [ARM] 3699/1: ARM: Convert s3c2410 to generic irq handling
  [ARM] 3698/1: ARM: Convert sa1100 to generic irq handling
  [ARM] 3697/1: ARM: Convert shark to generic irq handling
  [ARM] 3696/1: ARM: Convert clps711x to generic irq handling
  [ARM] 3694/1: ARM: Convert ecard driver to generic irq handling
  [ARM] 3693/1: ARM: Convert omap1 to generic irq handling
  [ARM] 3691/1: ARM: Convert imx to generic irq handling
  [ARM] 3688/1: ARM: Convert clps7500 to generic irq handling
  [ARM] 3687/1: ARM: Convert integrator to generic irq handling
  [ARM] 3685/1: ARM: Convert pxa to generic irq handling
  [ARM] 3684/1: ARM: Convert l7200 to generic irq handling
  [ARM] 3681/1: ARM: Convert ixp2000 to generic irq handling
  [ARM] 3680/1: ARM: Convert footbridge to generic irq handling
  [ARM] 3695/1: ARM drivers/pcmcia: Fixup includes
  [ARM] 3689/1: ARM drivers/input/touchscreen: Fixup includes
  ...

Manual conflict resolved in kernel/irq/handle.c (butt-ugly ARM tickless
code).
2006-07-02 15:07:45 -07:00
Thomas Gleixner
3cca53b02a [PATCH] irq-flags: generic irq: Use the new IRQF_ constants
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-02 13:58:49 -07:00
Thomas Gleixner
f8b5473fcb [ARM] 3690/1: genirq: Introduce and make use of dummy irq chip
Patch from Thomas Gleixner

From: Thomas Gleixner <tglx@linutronix.de>

ARM has a couple of really dumb interrupt controllers.
Implement a generic one and fixup the ARM migration. ARM reused
the no_irq_chip for this purpose, but this does not work out
for platforms which are not converted to the new interrupt
type handling model.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
2006-07-01 22:30:08 +01:00
Thomas Gleixner
a2166abd06 [ARM] 3679/1: ARM: Make ARM dyntick implementation work with genirq
Patch from Thomas Gleixner

From: Thomas Gleixner <tglx@linutronix.de>

Make the ARM dyntick implementation work with the generic
irq code. This hopefully goes away once we consolidated the
dyntick implementations.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
2006-07-01 22:30:07 +01:00
Linus Torvalds
fc25465f09 Merge branch 'audit.b22' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/audit-current
* 'audit.b22' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/audit-current:
  [PATCH] audit syscall classes
  [PATCH] audit: support for object context filters
  [PATCH] audit: rename AUDIT_SE_* constants
  [PATCH] add rule filterkey
2006-07-01 09:59:08 -07:00
Bjorn Helgaas
e8c4b9d003 [PATCH] IRQ: warning message cleanup
Make warnings more consistent.

Signed-off-by: Bjorn Helgaas <bjorn.helgaas@hp.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-01 09:55:57 -07:00
Bjorn Helgaas
17311c03c3 [PATCH] IRQ: Use SA_PERCPU_IRQ, not IRQ_PER_CPU, for irqaction.flags
IRQ_PER_CPU is a bit in the struct irq_desc "status" field, not in the
struct irqaction "flags", so the previous code checked the wrong bit.

SA_PERCPU_IRQ is only used by drivers/char/mmtimer.c for SGI ia64 boxes.

Signed-off-by: Bjorn Helgaas <bjorn.helgaas@hp.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-01 09:55:57 -07:00
Ingo Molnar
ed6f7b10e6 [PATCH] pi-futex: futex_wake() lockup fix
Fix futex_wake() exit condition bug when handling the robust-list with PI
futexes on them.

(reported by Ulrich Drepper, debugged by the lock validator.)

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: Ulrich Drepper <drepper@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-01 09:55:57 -07:00
Vernon Mauery
a99e4e413e [PATCH] pi-futex: fix mm_struct memory leak
lock_queue was getting called essentially twice in a row and was
continually incrementing the mm_count ref count, thus causing a memory
leak.

Dinakar Guniguntala provided a proper fix for the problem that simply grabs
the spinlock for the hash bucket queue rather than calling lock_queue.

The second time we do a queue_lock in futex_lock_pi, we really only need to
take the hash bucket lock.

Signed-off-by: Dinakar Guniguntala <dino@in.ibm.com>
Signed-off-by: Vernon Mauery <vernux@us.ibm.com>
Acked-by: Paul E. McKenney <paulmck@us.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-01 09:55:57 -07:00
Al Viro
b915543b46 [PATCH] audit syscall classes
Allow to tie upper bits of syscall bitmap in audit rules to kernel-defined
sets of syscalls.  Infrastructure, a couple of classes (with 32bit counterparts
for biarch targets) and actual tie-in on i386, amd64 and ia64.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-07-01 07:44:10 -04:00
Darrel Goeddel
6e5a2d1d32 [PATCH] audit: support for object context filters
This patch introduces object audit filters based on the elements
of the SELinux context.

Signed-off-by: Darrel Goeddel <dgoeddel@trustedcs.com>
Acked-by:  Stephen Smalley <sds@tycho.nsa.gov>

 kernel/auditfilter.c           |   25 +++++++++++++++++++++++++
 kernel/auditsc.c               |   40 ++++++++++++++++++++++++++++++++++++++++
 security/selinux/ss/services.c |   18 +++++++++++++++++-
 3 files changed, 82 insertions(+), 1 deletion(-)
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-07-01 05:44:19 -04:00
Darrel Goeddel
3a6b9f85c6 [PATCH] audit: rename AUDIT_SE_* constants
This patch renames some audit constant definitions and adds
additional definitions used by the following patch.  The renaming
avoids ambiguity with respect to the new definitions.

Signed-off-by: Darrel Goeddel <dgoeddel@trustedcs.com>

 include/linux/audit.h          |   15 ++++++++----
 kernel/auditfilter.c           |   50 ++++++++++++++++++++---------------------
 kernel/auditsc.c               |   10 ++++----
 security/selinux/ss/services.c |   32 +++++++++++++-------------
 4 files changed, 56 insertions(+), 51 deletions(-)
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-07-01 05:44:08 -04:00
Amy Griffis
5adc8a6adc [PATCH] add rule filterkey
Add support for a rule key, which can be used to tie audit records to audit
rules.  This is useful when a watched file is accessed through a link or
symlink, as well as for general audit log analysis.

Because this patch uses a string key instead of an integer key, there is a bit
of extra overhead to do the kstrdup() when a rule fires.  However, we're also
allocating memory for the audit record buffer, so it's probably not that
significant.  I went ahead with a string key because it seems more
user-friendly.

Note that the user must ensure that filterkeys are unique.  The kernel only
checks for duplicate rules.

Signed-off-by: Amy Griffis <amy.griffis@hpd.com>
2006-07-01 05:43:06 -04:00
Linus Torvalds
22a3e233ca Merge git://git.kernel.org/pub/scm/linux/kernel/git/bunk/trivial
* git://git.kernel.org/pub/scm/linux/kernel/git/bunk/trivial:
  Remove obsolete #include <linux/config.h>
  remove obsolete swsusp_encrypt
  arch/arm26/Kconfig typos
  Documentation/IPMI typos
  Kconfig: Typos in net/sched/Kconfig
  v9fs: do not include linux/version.h
  Documentation/DocBook/mtdnand.tmpl: typo fixes
  typo fixes: specfic -> specific
  typo fixes in Documentation/networking/pktgen.txt
  typo fixes: occuring -> occurring
  typo fixes: infomation -> information
  typo fixes: disadvantadge -> disadvantage
  typo fixes: aquire -> acquire
  typo fixes: mecanism -> mechanism
  typo fixes: bandwith -> bandwidth
  fix a typo in the RTC_CLASS help text
  smb is no longer maintained

Manually merged trivial conflict in arch/um/kernel/vmlinux.lds.S
2006-06-30 15:39:30 -07:00
Andrew Morton
e7b384043e [PATCH] cond_resched() fix
Fix a bug identified by Zou Nan hai <nanhai.zou@intel.com>:

If the system is in state SYSTEM_BOOTING, and need_resched() is true,
cond_resched() returns true even though it didn't reschedule.  Consequently
need_resched() remains true and JBD locks up.

Fix that by teaching cond_resched() to only return true if it really did call
schedule().

cond_resched_lock() and cond_resched_softirq() have a problem too.  If we're
in SYSTEM_BOOTING state and need_resched() is true, these functions will drop
the lock and will then try to call schedule(), but the SYSTEM_BOOTING state
will prevent schedule() from being called.  So on return, need_resched() will
still be true, but cond_resched_lock() has to return 1 to tell the caller that
the lock was dropped.  The caller will probably lock up.

Bottom line: if these functions dropped the lock, they _must_ call schedule()
to clear need_resched().   Make it so.

Also, uninline __cond_resched().  It's largeish, and slowpath.

Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-30 11:25:38 -07:00
David Quigley
8f95dc58d0 [PATCH] SELinux: add security hook call to kill_proc_info_as_uid
This patch adds a call to the extended security_task_kill hook introduced by
the prior patch to the kill_proc_info_as_uid function so that these signals
can be properly mediated by security modules.  It also updates the existing
hook call in check_kill_permission.

Signed-off-by: David Quigley <dpquigl@tycho.nsa.gov>
Signed-off-by: James Morris <jmorris@namei.org>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Cc: Chris Wright <chrisw@sous-sol.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-30 11:25:37 -07:00
Christoph Lameter
34aa1330f9 [PATCH] zoned vm counters: zone_reclaim: remove /proc/sys/vm/zone_reclaim_interval
The zone_reclaim_interval was necessary because we were not able to determine
how many unmapped pages exist in a zone.  Therefore we had to scan in
intervals to figure out if any pages were unmapped.

With the zoned counters and NR_ANON_PAGES we now know the number of pagecache
pages and the number of mapped pages in a zone.  So we can simply skip the
reclaim if there is an insufficient number of unmapped pages.  We use
SWAP_CLUSTER_MAX as the boundary.

Drop all support for /proc/sys/vm/zone_reclaim_interval.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-30 11:25:35 -07:00
Jörn Engel
6ab3d5624e Remove obsolete #include <linux/config.h>
Signed-off-by: Jörn Engel <joern@wohnheim.fh-wedel.de>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
2006-06-30 19:25:36 +02:00
Pavel Machek
e02169b682 remove obsolete swsusp_encrypt
Remove SWSUSP_ENCRYPT config option; it is no longer implemented.

Signed-off-by: Pavel Machek <pavel@suse.cz>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
2006-06-30 18:59:59 +02:00
Adrian Bunk
80f7228b59 typo fixes: occuring -> occurring
Signed-off-by: Adrian Bunk <bunk@stusta.de>
2006-06-30 18:27:16 +02:00
Dave Jones
ae90dd5dbe Move workqueue exports to where the functions are defined.
Signed-off-by: Dave Jones <davej@redhat.com>
2006-06-30 01:40:45 -04:00
Venkatesh Pallipadi
7a6bc1cdd5 [CPUFREQ] Add queue_delayed_work_on() interface for workqueues.
Add queue_delayed_work_on() interface for workqueues.

Signed-off-by: Alexey Starikovskiy <alexey.y.starikovskiy@intel.com>
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Dave Jones <davej@redhat.com>
2006-06-30 01:33:31 -04:00
Darrel Goeddel
c7bdb545d2 [NETLINK]: Encapsulate eff_cap usage within security framework.
This patch encapsulates the usage of eff_cap (in netlink_skb_params) within
the security framework by extending security_netlink_recv to include a required
capability parameter and converting all direct usage of eff_caps outside
of the lsm modules to use the interface.  It also updates the SELinux
implementation of the security_netlink_send and security_netlink_recv
hooks to take advantage of the sid in the netlink_skb_params struct.
This also enables SELinux to perform auditing of netlink capability checks.
Please apply, for 2.6.18 if possible.

Signed-off-by: Darrel Goeddel <dgoeddel@trustedcs.com>
Signed-off-by: Stephen Smalley <sds@tycho.nsa.gov>
Acked-by:  James Morris <jmorris@namei.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2006-06-29 16:57:55 -07:00
Linus Torvalds
3aa590c6b7 Merge git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc
* git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc: (43 commits)
  [POWERPC] Use little-endian bit from firmware ibm,pa-features property
  [POWERPC] Make sure smp_processor_id works very early in boot
  [POWERPC] U4 DART improvements
  [POWERPC] todc: add support for Time-Of-Day-Clock
  [POWERPC] Make lparcfg.c work when both iseries and pseries are selected
  [POWERPC] Fix idr locking in init_new_context
  [POWERPC] mpc7448hpc2 (taiga) board config file
  [POWERPC] Add tsi108 pci and platform device data register function
  [POWERPC] Add general support for mpc7448hpc2 (Taiga) platform
  [POWERPC] Correct the MAX_CONTEXT definition
  powerpc: minor cleanups for mpc86xx
  [POWERPC] Make sure we select CONFIG_NEW_LEDS if ADB_PMU_LED is set
  [POWERPC] Simplify the code defining the 64-bit CPU features
  [POWERPC] powerpc: kconfig warning fix
  [POWERPC] Consolidate some of kernel/misc*.S
  [POWERPC] Remove unused function call_with_mmu_off
  [POWERPC] update asm-powerpc/time.h
  [POWERPC] Clean up it_lp_queue.h
  [POWERPC] Skip the "copy down" of the kernel if it is already at zero.
  [POWERPC] Add the use of the firmware soft-reset-nmi to kdump.
  ...
2006-06-29 11:32:34 -07:00
Linus Torvalds
1903ac54f8 Merge master.kernel.org:/pub/scm/linux/kernel/git/gregkh/pci-2.6
* master.kernel.org:/pub/scm/linux/kernel/git/gregkh/pci-2.6:
  [PATCH] i386: export memory more than 4G through /proc/iomem
  [PATCH] 64bit Resource: finally enable 64bit resource sizes
  [PATCH] 64bit Resource: convert a few remaining drivers to use resource_size_t where needed
  [PATCH] 64bit resource: change pnp core to use resource_size_t
  [PATCH] 64bit resource: change pci core and arch code to use resource_size_t
  [PATCH] 64bit resource: change resource core to use resource_size_t
  [PATCH] 64bit resource: introduce resource_size_t for the start and end of struct resource
  [PATCH] 64bit resource: fix up printks for resources in misc drivers
  [PATCH] 64bit resource: fix up printks for resources in arch and core code
  [PATCH] 64bit resource: fix up printks for resources in pcmcia drivers
  [PATCH] 64bit resource: fix up printks for resources in video drivers
  [PATCH] 64bit resource: fix up printks for resources in ide drivers
  [PATCH] 64bit resource: fix up printks for resources in mtd drivers
  [PATCH] 64bit resource: fix up printks for resources in pci core and hotplug drivers
  [PATCH] 64bit resource: fix up printks for resources in networks drivers
  [PATCH] 64bit resource: fix up printks for resources in sound drivers
  [PATCH] 64bit resource: C99 changes for struct resource declarations

Fixed up trivial conflict in drivers/ide/pci/cmd64x.c (the printk that
was changed by the 64-bit resources had been deleted in the meantime ;)
2006-06-29 10:49:17 -07:00
Ingo Molnar
47c2a3aa44 [PATCH] genirq: add chip->eoi(), fastack -> fasteoi
Clean up the fastack concept by turning it into fasteoi and introducing the
->eoi() method for chips.

This also allows the cleanup of an i386 EOI quirk - now the quirk is
cleanly separated from the pure ACK implementation.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Roland Dreier <rolandd@cisco.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-29 10:26:26 -07:00
Benjamin Herrenschmidt
98bb244b68 [PATCH] genirq: fasteoi handler: handle interrupt disabling
Note when a disable interrupt happened with the fasteoi handler as well so
that delayed disable can be implemented with fasteoi-type controllers.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-29 10:26:25 -07:00
Ingo Molnar
43f7775944 [PATCH] genirq: more verbose debugging on unexpected IRQ vectors
One frequent sign of IRQ handling bugs is the appearance of unexpected
vectors.  Print out all the IRQ state in that case.  We dont want this patch
upstream, but it is useful during initial testing.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-29 10:26:25 -07:00
Ingo Molnar
f1c2662cbc [PATCH] genirq: cleanup: no_irq_type -> no_irq_chip rename
Rename no_irq_type to no_irq_chip.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-29 10:26:25 -07:00
Thomas Gleixner
e76de9f8eb [PATCH] genirq: add SA_TRIGGER support
Enable drivers to request an IRQ with a given irq-flow (trigger/polarity)
setting.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-29 10:26:24 -07:00
Thomas Gleixner
ba9a2331ba [PATCH] genirq: add irq-wake (power-management) support
Enable platforms to set the irq-wake (power-management) properties of an IRQ.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-29 10:26:24 -07:00
Ingo Molnar
7a55713ab4 [PATCH] genirq: add handle_bad_irq()
Handle bad IRQ vectors via the irqchip mechanism.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-29 10:26:24 -07:00
Thomas Gleixner
dd87eb3a24 [PATCH] genirq: add irq-chip support
Enable platforms to use the irq-chip and irq-flow abstractions: allow setting
of the chip, the type and provide highlevel handlers for common irq-flows.

[rostedt@goodmis.org: misroute-irq: Don't call desc->chip->end because of edge interrupts]
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-29 10:26:24 -07:00
Thomas Gleixner
6a6de9ef58 [PATCH] genirq: core
Core genirq support: add the irq-chip and irq-flow abstractions.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-29 10:26:24 -07:00
Ingo Molnar
a34db9b28a [PATCH] genirq: update copyrights
Update/add copyrights in the generic IRQ code.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-29 10:26:24 -07:00
Thomas Gleixner
94d39e1f6e [PATCH] genirq: add IRQ_NOAUTOEN support
Enable platforms to disable the automatic enabling of freshly set up irqs.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-29 10:26:24 -07:00
Thomas Gleixner
6550c775cb [PATCH] genirq: add IRQ_NOREQUEST support
Enable platforms to disable request_irq() for certain interrupts.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-29 10:26:24 -07:00
Thomas Gleixner
3418d72404 [PATCH] genirq: add IRQ_NOPROBE support
Introduce IRQ_NOPROBE: enables platforms to control chip-probing.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-29 10:26:24 -07:00
Thomas Gleixner
a4633adcdb [PATCH] genirq: add genirq sw IRQ-retrigger
Enable platforms that do not have a hardware-assisted hardirq-resend mechanism
to resend them via a softirq-driven IRQ emulation mechanism.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-29 10:26:23 -07:00
Ingo Molnar
77a5afecdb [PATCH] genirq: cleanup: no_irq_type cleanups
Clean up no_irq_type: share the NOP functions where possible, and properly
name the ack_bad() function.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-29 10:26:23 -07:00
Ingo Molnar
8d28bc751b [PATCH] genirq: doc: handle_IRQ_event() and __do_IRQ() comments
Document handle_IRQ_event() and __do_IRQ().

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-29 10:26:23 -07:00
Ingo Molnar
c0ad90a32f [PATCH] genirq: add ->retrigger() irq op to consolidate hw_irq_resend()
Add ->retrigger() irq op to consolidate hw_irq_resend() implementations.
(Most architectures had it defined to NOP anyway.)

NOTE: ia64 needs testing. i386 and x86_64 tested.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-29 10:26:23 -07:00
Thomas Gleixner
096c8131c5 [PATCH] genirq: debug: better debug printout in enable_irq()
Make enable_irq() debug printouts user-readable.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-29 10:26:23 -07:00
Ingo Molnar
0d7012a968 [PATCH] genirq: cleanup: turn ARCH_HAS_IRQ_PER_CPU into CONFIG_IRQ_PER_CPU
Cleanup: change ARCH_HAS_IRQ_PER_CPU into a Kconfig method.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-29 10:26:23 -07:00
Ingo Molnar
cd916d31cc [PATCH] genirq: cleanup: merge pending_irq_cpumask[] into irq_desc[]
Consolidation: remove the pending_irq_cpumask[NR_IRQS] array and move it into
the irq_desc[NR_IRQS].pending_mask field.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-29 10:26:22 -07:00
Ingo Molnar
4a733ee126 [PATCH] genirq: cleanup: merge irq_dir[], smp_affinity_entry[] into irq_desc[]
Consolidation: remove the irq_dir[NR_IRQS] and the smp_affinity_entry[NR_IRQS]
arrays and move them into the irq_desc[] array.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-29 10:26:22 -07:00
Ingo Molnar
34ffdb7233 [PATCH] genirq: cleanup: reduce irq_desc_t use, mark it obsolete
Cleanup: remove irq_desc_t use from the generic IRQ code, and mark it
obsolete.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-29 10:26:22 -07:00
Ingo Molnar
06fcb0c6fb [PATCH] genirq: cleanup: misc code cleanups
Assorted code cleanups to the generic IRQ code.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-29 10:26:22 -07:00
Ingo Molnar
2e60bbb6d5 [PATCH] genirq: cleanup: remove fastcall
Now that i386 defaults to regparm, explicit uses of fastcall are not needed
anymore.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-29 10:26:22 -07:00
Ingo Molnar
a8553acd6c [PATCH] genirq: cleanup: remove irq_descp()
Cleanup: remove irq_descp() - explicit use of irq_desc[] is shorter and more
readable.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-29 10:26:22 -07:00
Ingo Molnar
a53da52fd7 [PATCH] genirq: cleanup: merge irq_affinity[] into irq_desc[]
Consolidation: remove the irq_affinity[NR_IRQS] array and move it into the
irq_desc[NR_IRQS].affinity field.

[akpm@osdl.org: sparc64 build fix]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-29 10:26:22 -07:00
Ingo Molnar
74ffd553a3 [PATCH] genirq: sem2mutex probe_sem -> probing_active
Convert the irq auto-probing semaphore to a mutex.  (This allows us to find
probing API usage bugs sooner, via the mutex debugging code.)

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-29 10:26:21 -07:00
Ingo Molnar
d1bef4ed5f [PATCH] genirq: rename desc->handler to desc->chip
This patch-queue improves the generic IRQ layer to be truly generic, by adding
various abstractions and features to it, without impacting existing
functionality.

While the queue can be best described as "fix and improve everything in the
generic IRQ layer that we could think of", and thus it consists of many
smaller features and lots of cleanups, the one feature that stands out most is
the new 'irq chip' abstraction.

The irq-chip abstraction is about describing and coding and IRQ controller
driver by mapping its raw hardware capabilities [and quirks, if needed] in a
straightforward way, without having to think about "IRQ flow"
(level/edge/etc.) type of details.

This stands in contrast with the current 'irq-type' model of genirq
architectures, which 'mixes' raw hardware capabilities with 'flow' details.
The patchset supports both types of irq controller designs at once, and
converts i386 and x86_64 to the new irq-chip design.

As a bonus side-effect of the irq-chip approach, chained interrupt controllers
(master/slave PIC constructs, etc.) are now supported by design as well.

The end result of this patchset intends to be simpler architecture-level code
and more consolidation between architectures.

We reused many bits of code and many concepts from Russell King's ARM IRQ
layer, the merging of which was one of the motivations for this patchset.

This patch:

rename desc->handler to desc->chip.

Originally i did not want to do this, because it's a big patch.  But having
both "desc->handler", "desc->handle_irq" and "action->handler" caused a
large degree of confusion and made the code appear alot less clean than it
truly is.

I have also attempted a dual approach as well by introducing a
desc->chip alias - but that just wasnt robust enough and broke
frequently.

So lets get over with this quickly.  The conversion was done automatically
via scripts and converts all the code in the kernel.

This renaming patch is the first one amongst the patches, so that the
remaining patches can stay flexible and can be merged and split up
without having some big monolithic patch act as a merge barrier.

[akpm@osdl.org: build fix]
[akpm@osdl.org: another build fix]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-29 10:26:21 -07:00
Andrew Morton
84860f9979 [PATCH] load_module() cleanup
Undo bizarre declaration in load_module().

Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-28 14:59:04 -07:00
Arjan van de Ven
f71d20e961 [PATCH] Add EXPORT_UNUSED_SYMBOL and EXPORT_UNUSED_SYMBOL_GPL
Temporarily add EXPORT_UNUSED_SYMBOL and EXPORT_UNUSED_SYMBOL_GPL.  These
will be used as a transition measure for symbols that aren't used in the
kernel and are on the way out.  When a module uses such a symbol, a warning
is printk'd at modprobe time.

The main reason for removing unused exports is size: eacho export takes
roughly between 100 and 150 bytes of kernel space in the binary.  This
patch gives users the option to immediately get this size gain via a config
option.

Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-28 14:59:04 -07:00
David Wilder
c0ce7d0886 [POWERPC] Add the use of the firmware soft-reset-nmi to kdump.
With this patch, kdump uses the firmware soft-reset NMI for two purposes:
1) Initiate the kdump (take a crash dump) by issuing a soft-reset.
2) Break a CPU out of a deadlock condition that is detected during kdump
processing.

When a soft-reset is initiated each CPU will enter
system_reset_exception() and set its corresponding bit in the global
bit-array cpus_in_sr then call die(). When die() finds the CPU's bit set
in cpu_in_sr crash_kexec() is called to initiate a crash dump. The first
CPU to enter crash_kexec() is called the "crashing CPU". All other CPUs
are "secondary CPUs". The secondary CPU's pass through to
crash_kexec_secondary() and sleep. The crashing CPU waits for all CPUs
to enter via soft-reset then boots the kdump kernel (see
crash_soft_reset_check())

When the system crashes due to a panic or exception, crash_kexec() is
called by panic() or die(). The crashing CPU sends an IPI to all other
CPUs to notify them of the pending shutdown. If a CPU is in a deadlock
or hung state with interrupts disabled, the IPI will not be delivered.
The result being, that the kdump kernel is not booted. This problem is
solved with the use of a firmware generated soft-reset. After the
crashing_cpu has issued the IPI, it waits for 10 sec for all CPUs to
enter crash_ipi_callback(). A CPU signifies its entry to
crash_ipi_callback() by setting its corresponding bit in the
cpus_in_crash bit array. After 10 sec, if one or more CPUs have not set
their bit in cpus_in_crash we assume that the CPU(s) is deadlocked. The
operator is then prompted to generate a soft-reset to break the
deadlock. Each CPU enters the soft reset handler as described above.

Two conditions must be handled at this point:
1) The system crashed because the operator generated a soft-reset. See
2) The system had crashed before the soft-reset was generated ( in the
case of a Panic or oops).

The first CPU to enter crash_kexec() uses the state of the kexec_lock to
determine this state. If kexec_lock is already held then condition 2 is
true and crash_kexec_secondary() is called, else; this CPU is flagged as
the crashing CPU, the kexec_lock is acquired and crash_kexec() proceeds
as described above.

Each additional CPUs responding to the soft-reset will pass through
crash_kexec() to kexec_secondary(). All secondary CPUs call
crash_ipi_callback() readying them self's for the shutdown. When ready
they clear their bit in cpus_in_sr. The crashing CPU waits in
kexec_secondary() until all other CPUs have cleared their bits in
cpus_in_sr. The kexec kernel boot is then started.

Signed-off-by: Haren Myneni <haren@us.ibm.com>
Signed-off-by: David Wilder <dwilder@us.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
2006-06-28 15:18:52 +10:00
Jesper Juhl
9a66a53f55 [PATCH] Remove redundant NULL checks before [kv]free - in kernel/
Remove redundant kfree NULL checks from kernel/

Signed-off-by: Jesper Juhl <jesper.juhl@gmail.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:48 -07:00
Sebastien Dugue
59e0e0ace7 [PATCH] futex_requeue() optimization
In futex_requeue(), when the 2 futexes keys hash to the same bucket, there
is no need to move the futex_q to the end of the bucket list.

Signed-off-by: Sebastien Dugue <sebastien.dugue@bull.net>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:48 -07:00
Thomas Gleixner
95e02ca9bb [PATCH] rtmutex: Propagate priority settings into PI lock chains
When the priority of a task, which is blocked on a lock, changes we must
propagate this change into the PI lock chain.  Therefor the chain walk code
is changed to get rid of the references to current to avoid false positives
in the deadlock detector, as setscheduler might be called by a task which
holds the lock on which the task whose priority is changed is blocked.

Also add some comments about the get/put_task_struct usage to avoid
confusion.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:48 -07:00
Thomas Gleixner
0bafd214e4 [PATCH] rtmutex: Modify rtmutex-tester to test the setscheduler propagation
Make test suite setscheduler calls asynchronously.  Remove the waits in the
test cases and add a new testcase to verify the correctness of the
setscheduler priority propagation.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:47 -07:00
Thomas Gleixner
e74c69f46d [PATCH] Drop tasklist lock in do_sched_setscheduler
There is no need to hold tasklist_lock across the setscheduler call, when
we pin the task structure with get_task_struct().  Interrupts are disabled
in setscheduler anyway and the permission checks do not need interrupts
disabled.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:47 -07:00
Ingo Molnar
c87e2837be [PATCH] pi-futex: futex_lock_pi/futex_unlock_pi support
This adds the actual pi-futex implementation, based on rt-mutexes.

[dino@in.ibm.com: fix an oops-causing race]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Dinakar Guniguntala <dino@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:47 -07:00
Ingo Molnar
0cdbee9920 [PATCH] pi-futex: rt mutex futex api
Add proxy-locking rt-mutex functionality needed by pi-futexes.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:47 -07:00
Thomas Gleixner
61a8712286 [PATCH] pi-futex: rt mutex tester
RT-mutex tester: scriptable tester for rt mutexes, which allows userspace
scripting of mutex unit-tests (and dynamic tests as well), using the actual
rt-mutex implementation of the kernel.

[akpm@osdl.org: fixlet]
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:47 -07:00
Ingo Molnar
e7eebaf6a8 [PATCH] pi-futex: rt mutex debug
Runtime debugging functionality for rt-mutexes.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:47 -07:00
Ingo Molnar
23f78d4a03 [PATCH] pi-futex: rt mutex core
Core functions for the rt-mutex subsystem.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:47 -07:00
Ingo Molnar
b29739f902 [PATCH] pi-futex: scheduler support for pi
Add framework to boost/unboost the priority of RT tasks.

This consists of:

 - caching the 'normal' priority in ->normal_prio
 - providing a functions to set/get the priority of the task
 - make sched_setscheduler() aware of boosting

The effective_prio() cleanups also fix a priority-calculation bug pointed out
by Andrey Gelman, in set_user_nice().

has_rt_policy() fix: Peter Williams <pwil3058@bigpond.net.au>

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Cc: Andrey Gelman <agelman@012.net.il>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:46 -07:00
Ingo Molnar
e2970f2fb6 [PATCH] pi-futex: futex code cleanups
We are pleased to announce "lightweight userspace priority inheritance" (PI)
support for futexes.  The following patchset and glibc patch implements it,
ontop of the robust-futexes patchset which is included in 2.6.16-mm1.

We are calling it lightweight for 3 reasons:

 - in the user-space fastpath a PI-enabled futex involves no kernel work
   (or any other PI complexity) at all.  No registration, no extra kernel
   calls - just pure fast atomic ops in userspace.

 - in the slowpath (in the lock-contention case), the system call and
   scheduling pattern is in fact better than that of normal futexes, due to
   the 'integrated' nature of FUTEX_LOCK_PI.  [more about that further down]

 - the in-kernel PI implementation is streamlined around the mutex
   abstraction, with strict rules that keep the implementation relatively
   simple: only a single owner may own a lock (i.e.  no read-write lock
   support), only the owner may unlock a lock, no recursive locking, etc.

  Priority Inheritance - why, oh why???
  -------------------------------------

Many of you heard the horror stories about the evil PI code circling Linux for
years, which makes no real sense at all and is only used by buggy applications
and which has horrible overhead.  Some of you have dreaded this very moment,
when someone actually submits working PI code ;-)

So why would we like to see PI support for futexes?

We'd like to see it done purely for technological reasons.  We dont think it's
a buggy concept, we think it's useful functionality to offer to applications,
which functionality cannot be achieved in other ways.  We also think it's the
right thing to do, and we think we've got the right arguments and the right
numbers to prove that.  We also believe that we can address all the
counter-arguments as well.  For these reasons (and the reasons outlined below)
we are submitting this patch-set for upstream kernel inclusion.

What are the benefits of PI?

  The short reply:
  ----------------

User-space PI helps achieving/improving determinism for user-space
applications.  In the best-case, it can help achieve determinism and
well-bound latencies.  Even in the worst-case, PI will improve the statistical
distribution of locking related application delays.

  The longer reply:
  -----------------

Firstly, sharing locks between multiple tasks is a common programming
technique that often cannot be replaced with lockless algorithms.  As we can
see it in the kernel [which is a quite complex program in itself], lockless
structures are rather the exception than the norm - the current ratio of
lockless vs.  locky code for shared data structures is somewhere between 1:10
and 1:100.  Lockless is hard, and the complexity of lockless algorithms often
endangers to ability to do robust reviews of said code.  I.e.  critical RT
apps often choose lock structures to protect critical data structures, instead
of lockless algorithms.  Furthermore, there are cases (like shared hardware,
or other resource limits) where lockless access is mathematically impossible.

Media players (such as Jack) are an example of reasonable application design
with multiple tasks (with multiple priority levels) sharing short-held locks:
for example, a highprio audio playback thread is combined with medium-prio
construct-audio-data threads and low-prio display-colory-stuff threads.  Add
video and decoding to the mix and we've got even more priority levels.

So once we accept that synchronization objects (locks) are an unavoidable fact
of life, and once we accept that multi-task userspace apps have a very fair
expectation of being able to use locks, we've got to think about how to offer
the option of a deterministic locking implementation to user-space.

Most of the technical counter-arguments against doing priority inheritance
only apply to kernel-space locks.  But user-space locks are different, there
we cannot disable interrupts or make the task non-preemptible in a critical
section, so the 'use spinlocks' argument does not apply (user-space spinlocks
have the same priority inversion problems as other user-space locking
constructs).  Fact is, pretty much the only technique that currently enables
good determinism for userspace locks (such as futex-based pthread mutexes) is
priority inheritance:

Currently (without PI), if a high-prio and a low-prio task shares a lock [this
is a quite common scenario for most non-trivial RT applications], even if all
critical sections are coded carefully to be deterministic (i.e.  all critical
sections are short in duration and only execute a limited number of
instructions), the kernel cannot guarantee any deterministic execution of the
high-prio task: any medium-priority task could preempt the low-prio task while
it holds the shared lock and executes the critical section, and could delay it
indefinitely.

  Implementation:
  ---------------

As mentioned before, the userspace fastpath of PI-enabled pthread mutexes
involves no kernel work at all - they behave quite similarly to normal
futex-based locks: a 0 value means unlocked, and a value==TID means locked.
(This is the same method as used by list-based robust futexes.) Userspace uses
atomic ops to lock/unlock these mutexes without entering the kernel.

To handle the slowpath, we have added two new futex ops:

  FUTEX_LOCK_PI
  FUTEX_UNLOCK_PI

If the lock-acquire fastpath fails, [i.e.  an atomic transition from 0 to TID
fails], then FUTEX_LOCK_PI is called.  The kernel does all the remaining work:
if there is no futex-queue attached to the futex address yet then the code
looks up the task that owns the futex [it has put its own TID into the futex
value], and attaches a 'PI state' structure to the futex-queue.  The pi_state
includes an rt-mutex, which is a PI-aware, kernel-based synchronization
object.  The 'other' task is made the owner of the rt-mutex, and the
FUTEX_WAITERS bit is atomically set in the futex value.  Then this task tries
to lock the rt-mutex, on which it blocks.  Once it returns, it has the mutex
acquired, and it sets the futex value to its own TID and returns.  Userspace
has no other work to perform - it now owns the lock, and futex value contains
FUTEX_WAITERS|TID.

If the unlock side fastpath succeeds, [i.e.  userspace manages to do a TID ->
0 atomic transition of the futex value], then no kernel work is triggered.

If the unlock fastpath fails (because the FUTEX_WAITERS bit is set), then
FUTEX_UNLOCK_PI is called, and the kernel unlocks the futex on the behalf of
userspace - and it also unlocks the attached pi_state->rt_mutex and thus wakes
up any potential waiters.

Note that under this approach, contrary to other PI-futex approaches, there is
no prior 'registration' of a PI-futex.  [which is not quite possible anyway,
due to existing ABI properties of pthread mutexes.]

Also, under this scheme, 'robustness' and 'PI' are two orthogonal properties
of futexes, and all four combinations are possible: futex, robust-futex,
PI-futex, robust+PI-futex.

  glibc support:
  --------------

Ulrich Drepper and Jakub Jelinek have written glibc support for PI-futexes
(and robust futexes), enabling robust and PI (PTHREAD_PRIO_INHERIT) POSIX
mutexes.  (PTHREAD_PRIO_PROTECT support will be added later on too, no
additional kernel changes are needed for that).  [NOTE: The glibc patch is
obviously inofficial and unsupported without matching upstream kernel
functionality.]

the patch-queue and the glibc patch can also be downloaded from:

  http://redhat.com/~mingo/PI-futex-patches/

Many thanks go to the people who helped us create this kernel feature: Steven
Rostedt, Esben Nielsen, Benedikt Spranger, Daniel Walker, John Cooper, Arjan
van de Ven, Oleg Nesterov and others.  Credits for related prior projects goes
to Dirk Grambow, Inaky Perez-Gonzalez, Bill Huey and many others.

Clean up the futex code, before adding more features to it:

 - use u32 as the futex field type - that's the ABI
 - use __user and pointers to u32 instead of unsigned long
 - code style / comment style cleanups
 - rename hash-bucket name from 'bh' to 'hb'.

I checked the pre and post futex.o object files to make sure this
patch has no code effects.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Cc: Ulrich Drepper <drepper@redhat.com>
Cc: Jakub Jelinek <jakub@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:46 -07:00
Steven Rostedt
66e5393a78 [PATCH] BUG() if setscheduler is called from interrupt context
Thomas Gleixner is adding the call to a rtmutex function in setscheduler.
This call grabs a spin_lock that is not always protected by interrupts
disabled.  So this means that setscheduler cant be called from interrupt
context.

To prevent this from happening in the future, this patch adds a
BUG_ON(in_interrupt()) in that function.  (Thanks to akpm <aka.  Andrew
Morton> for this suggestion).

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:46 -07:00
Oleg Nesterov
9fea80e4d9 [PATCH] sched: uninline task_rq_lock()
Saves 543 bytes from sched.o (gcc 3.3.3).

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Con Kolivas <kernel@kolivas.org>
Cc: Peter Williams <pwil3058@bigpond.net.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:45 -07:00
Siddha, Suresh B
5c45bf279d [PATCH] sched: mc/smt power savings sched policy
sysfs entries 'sched_mc_power_savings' and 'sched_smt_power_savings' in
/sys/devices/system/cpu/ control the MC/SMT power savings policy for the
scheduler.

Based on the values (1-enable, 0-disable) for these controls, sched groups
cpu power will be determined for different domains.  When power savings
policy is enabled and under light load conditions, scheduler will minimize
the physical packages/cpu cores carrying the load and thus conserving
power(with a perf impact based on the workload characteristics...  see OLS
2005 CMP kernel scheduler paper for more details..)

Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Con Kolivas <kernel@kolivas.org>
Cc: "Chen, Kenneth W" <kenneth.w.chen@intel.com>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:45 -07:00
Srivatsa Vaddagiri
369381694d [PATCH] sched_domai: Allocate sched_group structures dynamically
As explained here:
	http://marc.theaimsgroup.com/?l=linux-kernel&m=114327539012323&w=2

there is a problem with sharing sched_group structures between two
separate sched_group structures for different sched_domains.

The patch has been tested and found to avoid the kernel lockup problem
described in above URL.

Signed-off-by: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:45 -07:00
Srivatsa Vaddagiri
15f0b676a4 [PATCH] sched_domai: Use kmalloc_node
The sched group structures used to represent various nodes need to be
allocated from respective nodes (as suggested here also:

	http://uwsg.ucs.indiana.edu/hypermail/linux/kernel/0603.3/0051.html)

Signed-off-by: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:45 -07:00
Srivatsa Vaddagiri
d3a5aa9858 [PATCH] sched_domai: Don't use GFP_ATOMIC
Replace GFP_ATOMIC allocation for sched_group_nodes with GFP_KERNEL based
allocation.

Signed-off-by: Srivatsa Vaddagiri <vatsa@in.ibm.com
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:45 -07:00
Srivatsa Vaddagiri
51888ca25a [PATCH] sched_domain: handle kmalloc failure
Try to handle mem allocation failures in build_sched_domains by bailing out
and cleaning up thus-far allocated memory.  The patch has a direct consequence
that we disable load balancing completely (even at sibling level) upon *any*
memory allocation failure.

[Lee.Schermerhorn@hp.com: bugfix]
Signed-off-by: Srivatsa Vaddagir <vatsa@in.ibm.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:45 -07:00
Peter Williams
615052dc3b [PATCH] sched: Avoid unnecessarily moving highest priority task move_tasks()
Problem:

To help distribute high priority tasks evenly across the available CPUs
move_tasks() does not, under some circumstances, skip tasks whose load
weight is bigger than the designated amount.  Because the highest priority
task on the busiest queue may be on the expired array it may be moved as a
result of this mechanism.  Apart from not being the most desirable way to
redistribute the high priority tasks (we'd rather move the second highest
priority task), there is a risk that this could set up a loop with this
task bouncing backwards and forwards between the two queues.  (This latter
possibility can be demonstrated by running a nice==-20 CPU bound task on an
otherwise quiet 2 CPU system.)

Solution:

Modify the mechanism so that it does not override skip for the highest
priority task on the CPU.  Of course, if there are more than one tasks at
the highest priority then it will allow the override for one of them as
this is a desirable redistribution of high priority tasks.

Signed-off-by: Peter Williams <pwil3058@bigpond.com.au>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:44 -07:00
Peter Williams
50ddd96917 [PATCH] sched: modify move_tasks() to improve load balancing outcomes
Problem:

The move_tasks() function is designed to move UP TO the amount of load it
is asked to move and in doing this it skips over tasks looking for ones
whose load weights are less than or equal to the remaining load to be
moved.  This is (in general) a good thing but it has the unfortunate result
of breaking one of the original load balancer's good points: namely, that
(within the limits imposed by the active/expired array model and the fact
the expired is processed first) it moves high priority tasks before low
priority ones and this means there's a good chance (see active/expired
problem for why it's only a chance) that the highest priority task on the
queue but not actually on the CPU will be moved to the other CPU where (as
a high priority task) it may preempt the current task.

Solution:

Modify move_tasks() so that high priority tasks are not skipped when moving
them will make them the highest priority task on their new run queue.

Signed-off-by: Peter Williams <pwil3058@bigpond.com.au>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com>
Cc: "Chen, Kenneth W" <kenneth.w.chen@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:44 -07:00
Peter Williams
2dd73a4f09 [PATCH] sched: implement smpnice
Problem:

The introduction of separate run queues per CPU has brought with it "nice"
enforcement problems that are best described by a simple example.

For the sake of argument suppose that on a single CPU machine with a
nice==19 hard spinner and a nice==0 hard spinner running that the nice==0
task gets 95% of the CPU and the nice==19 task gets 5% of the CPU.  Now
suppose that there is a system with 2 CPUs and 2 nice==19 hard spinners and
2 nice==0 hard spinners running.  The user of this system would be entitled
to expect that the nice==0 tasks each get 95% of a CPU and the nice==19
tasks only get 5% each.  However, whether this expectation is met is pretty
much down to luck as there are four equally likely distributions of the
tasks to the CPUs that the load balancing code will consider to be balanced
with loads of 2.0 for each CPU.  Two of these distributions involve one
nice==0 and one nice==19 task per CPU and in these circumstances the users
expectations will be met.  The other two distributions both involve both
nice==0 tasks being on one CPU and both nice==19 being on the other CPU and
each task will get 50% of a CPU and the user's expectations will not be
met.

Solution:

The solution to this problem that is implemented in the attached patch is
to use weighted loads when determining if the system is balanced and, when
an imbalance is detected, to move an amount of weighted load between run
queues (as opposed to a number of tasks) to restore the balance.  Once
again, the easiest way to explain why both of these measures are necessary
is to use a simple example.  Suppose that (in a slight variation of the
above example) that we have a two CPU system with 4 nice==0 and 4 nice=19
hard spinning tasks running and that the 4 nice==0 tasks are on one CPU and
the 4 nice==19 tasks are on the other CPU.  The weighted loads for the two
CPUs would be 4.0 and 0.2 respectively and the load balancing code would
move 2 tasks resulting in one CPU with a load of 2.0 and the other with
load of 2.2.  If this was considered to be a big enough imbalance to
justify moving a task and that task was moved using the current
move_tasks() then it would move the highest priority task that it found and
this would result in one CPU with a load of 3.0 and the other with a load
of 1.2 which would result in the movement of a task in the opposite
direction and so on -- infinite loop.  If, on the other hand, an amount of
load to be moved is calculated from the imbalance (in this case 0.1) and
move_tasks() skips tasks until it find ones whose contributions to the
weighted load are less than this amount it would move two of the nice==19
tasks resulting in a system with 2 nice==0 and 2 nice=19 on each CPU with
loads of 2.1 for each CPU.

One of the advantages of this mechanism is that on a system where all tasks
have nice==0 the load balancing calculations would be mathematically
identical to the current load balancing code.

Notes:

struct task_struct:

has a new field load_weight which (in a trade off of space for speed)
stores the contribution that this task makes to a CPU's weighted load when
it is runnable.

struct runqueue:

has a new field raw_weighted_load which is the sum of the load_weight
values for the currently runnable tasks on this run queue.  This field
always needs to be updated when nr_running is updated so two new inline
functions inc_nr_running() and dec_nr_running() have been created to make
sure that this happens.  This also offers a convenient way to optimize away
this part of the smpnice mechanism when CONFIG_SMP is not defined.

int try_to_wake_up():

in this function the value SCHED_LOAD_BALANCE is used to represent the load
contribution of a single task in various calculations in the code that
decides which CPU to put the waking task on.  While this would be a valid
on a system where the nice values for the runnable tasks were distributed
evenly around zero it will lead to anomalous load balancing if the
distribution is skewed in either direction.  To overcome this problem
SCHED_LOAD_SCALE has been replaced by the load_weight for the relevant task
or by the average load_weight per task for the queue in question (as
appropriate).

int move_tasks():

The modifications to this function were complicated by the fact that
active_load_balance() uses it to move exactly one task without checking
whether an imbalance actually exists.  This precluded the simple
overloading of max_nr_move with max_load_move and necessitated the addition
of the latter as an extra argument to the function.  The internal
implementation is then modified to move up to max_nr_move tasks and
max_load_move of weighted load.  This slightly complicates the code where
move_tasks() is called and if ever active_load_balance() is changed to not
use move_tasks() the implementation of move_tasks() should be simplified
accordingly.

struct sched_group *find_busiest_group():

Similar to try_to_wake_up(), there are places in this function where
SCHED_LOAD_SCALE is used to represent the load contribution of a single
task and the same issues are created.  A similar solution is adopted except
that it is now the average per task contribution to a group's load (as
opposed to a run queue) that is required.  As this value is not directly
available from the group it is calculated on the fly as the queues in the
groups are visited when determining the busiest group.

A key change to this function is that it is no longer to scale down
*imbalance on exit as move_tasks() uses the load in its scaled form.

void set_user_nice():

has been modified to update the task's load_weight field when it's nice
value and also to ensure that its run queue's raw_weighted_load field is
updated if it was runnable.

From: "Siddha, Suresh B" <suresh.b.siddha@intel.com>

With smpnice, sched groups with highest priority tasks can mask the imbalance
between the other sched groups with in the same domain.  This patch fixes some
of the listed down scenarios by not considering the sched groups which are
lightly loaded.

a) on a simple 4-way MP system, if we have one high priority and 4 normal
   priority tasks, with smpnice we would like to see the high priority task
   scheduled on one cpu, two other cpus getting one normal task each and the
   fourth cpu getting the remaining two normal tasks.  but with current
   smpnice extra normal priority task keeps jumping from one cpu to another
   cpu having the normal priority task.  This is because of the
   busiest_has_loaded_cpus, nr_loaded_cpus logic..  We are not including the
   cpu with high priority task in max_load calculations but including that in
   total and avg_load calcuations..  leading to max_load < avg_load and load
   balance between cpus running normal priority tasks(2 Vs 1) will always show
   imbalanace as one normal priority and the extra normal priority task will
   keep moving from one cpu to another cpu having normal priority task..

b) 4-way system with HT (8 logical processors).  Package-P0 T0 has a
   highest priority task, T1 is idle.  Package-P1 Both T0 and T1 have 1 normal
   priority task each..  P2 and P3 are idle.  With this patch, one of the
   normal priority tasks on P1 will be moved to P2 or P3..

c) With the current weighted smp nice calculations, it doesn't always make
   sense to look at the highest weighted runqueue in the busy group..
   Consider a load balance scenario on a DP with HT system, with Package-0
   containing one high priority and one low priority, Package-1 containing one
   low priority(with other thread being idle)..  Package-1 thinks that it need
   to take the low priority thread from Package-0.  And find_busiest_queue()
   returns the cpu thread with highest priority task..  And ultimately(with
   help of active load balance) we move high priority task to Package-1.  And
   same continues with Package-0 now, moving high priority task from package-1
   to package-0..  Even without the presence of active load balance, load
   balance will fail to balance the above scenario..  Fix find_busiest_queue
   to use "imbalance" when it is lightly loaded.

[kernel@kolivas.org: sched: store weighted load on up]
[kernel@kolivas.org: sched: add discrete weighted cpu load function]
[suresh.b.siddha@intel.com: sched: remove dead code]
Signed-off-by: Peter Williams <pwil3058@bigpond.com.au>
Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com>
Cc: "Chen, Kenneth W" <kenneth.w.chen@intel.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Con Kolivas <kernel@kolivas.org>
Cc: John Hawkes <hawkes@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:44 -07:00
Kirill Korotaev
efc30814a8 [PATCH] sched: CPU hotplug race vs. set_cpus_allowed()
There is a race between set_cpus_allowed() and move_task_off_dead_cpu().
__migrate_task() doesn't report any err code, so task can be left on its
runqueue if its cpus_allowed mask changed so that dest_cpu is not longer a
possible target.  Also, chaning cpus_allowed mask requires rq->lock being
held.

Signed-off-by: Kirill Korotaev <dev@openvz.org>
Acked-By: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:44 -07:00
Steven Rostedt
cc94abfcbc [PATCH] unnecessary long index i in sched
Unless we expect to have more than 2G CPUs, there's no reason to have 'i'
as a long long here.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:44 -07:00
Con Kolivas
72d2854d4e [PATCH] sched: fix interactive ceiling code
The relationship between INTERACTIVE_SLEEP and the ceiling is not perfect
and not explicit enough.  The sleep boost is not supposed to be any larger
than without this code and the comment is not clear enough about what
exactly it does, just the reason it does it.  Fix it.

There is a ceiling to the priority beyond which tasks that only ever sleep
for very long periods cannot surpass.  Fix it.

Prevent the on-runqueue bonus logic from defeating the idle sleep logic.

Opportunity to micro-optimise.

Signed-off-by: Con Kolivas <kernel@kolivas.org>
Signed-off-by: Mike Galbraith <efault@gmx.de>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Ken Chen <kenneth.w.chen@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:44 -07:00
Steven Rostedt
d444886e14 [PATCH] sched: simplify bitmap definition
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:44 -07:00
Chen, Kenneth W
c96d145e71 [PATCH] sched: fix smt nice lock contention and optimization
Initial report and lock contention fix from Chris Mason:

Recent benchmarks showed some performance regressions between 2.6.16 and
2.6.5.  We tracked down one of the regressions to lock contention in
schedule heavy workloads (~70,000 context switches per second)

kernel/sched.c:dependent_sleeper() was responsible for most of the lock
contention, hammering on the run queue locks.  The patch below is more of a
discussion point than a suggested fix (although it does reduce lock
contention significantly).  The dependent_sleeper code looks very expensive
to me, especially for using a spinlock to bounce control between two
different siblings in the same cpu.

It is further optimized:

* perform dependent_sleeper check after next task is determined
* convert wake_sleeping_dependent to use trylock
* skip smt runqueue check if trylock fails
* optimize double_rq_lock now that smt nice is converted to trylock
* early exit in searching first SD_SHARE_CPUPOWER domain
* speedup fast path of dependent_sleeper

[akpm@osdl.org: cleanup]
Signed-off-by: Ken Chen <kenneth.w.chen@intel.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Con Kolivas <kernel@kolivas.org>
Signed-off-by: Nick Piggin <npiggin@suse.de>
Acked-by: Chris Mason <mason@suse.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:44 -07:00
Chandra Seetharaman
26c2143b63 [PATCH] cpu hotplug: make cpu_notifier related notifier calls __cpuinit only
Make notifier_calls associated with cpu_notifier as __cpuinit.

__cpuinit makes sure that the function is init time only unless
CONFIG_HOTPLUG_CPU is defined.

[akpm@osdl.org: section fix]
Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Cc: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:41 -07:00
Chandra Seetharaman
65edc68c34 [PATCH] cpu hotplug: make [un]register_cpu_notifier init time only
CPUs come online only at init time (unless CONFIG_HOTPLUG_CPU is defined).
So, cpu_notifier functionality need to be available only at init time.

This patch makes register_cpu_notifier() available only at init time, unless
CONFIG_HOTPLUG_CPU is defined.

This patch exports register_cpu_notifier() and unregister_cpu_notifier() only
if CONFIG_HOTPLUG_CPU is defined.

Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Cc: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:41 -07:00
Chandra Seetharaman
054cc8a2d8 [PATCH] cpu hotplug: revert initdata patch submitted for 2.6.17
This patch reverts notifier_block changes made in 2.6.17

Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Cc: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:41 -07:00
Chandra Seetharaman
9c7b216d23 [PATCH] cpu hotplug: revert init patch submitted for 2.6.17
In 2.6.17, there was a problem with cpu_notifiers and XFS.  I provided a
band-aid solution to solve that problem.  In the process, i undid all the
changes you both were making to ensure that these notifiers were available
only at init time (unless CONFIG_HOTPLUG_CPU is defined).

We deferred the real fix to 2.6.18.  Here is a set of patches that fixes the
XFS problem cleanly and makes the cpu notifiers available only at init time
(unless CONFIG_HOTPLUG_CPU is defined).

If CONFIG_HOTPLUG_CPU is defined then cpu notifiers are available at run
time.

This patch reverts the notifier_call changes made in 2.6.17

Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Cc: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:40 -07:00
Paul E. McKenney
c32e066057 [PATCH] rcutorture: add call_rcu_bh() operations
Add operations for the call_rcu_bh() variant of RCU.  Also add an
rcu_batches_completed_bh() function, which is needed by rcutorture.

Signed-off-by: Paul E. McKenney <paulmck@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:40 -07:00
Paul E. McKenney
72e9bb5492 [PATCH] rcutorture: add ops vector and Classic RCU ops
Add an ops vector to rcutorture, and add the ops for Classic RCU.  Update
the rcutorture documentation to reflect slight change to the dmesg formats.

Signed-off-by: Paul E. McKenney <paulmck@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:40 -07:00
Paul E. McKenney
29766f1eb3 [PATCH] rcutorture: catchup doc fixes for idle-hz tests
This just catches the RCU torture documentation up with the recent fixes
that test RCU for architectures that turn of the scheduling-clock interrupt
for idle CPUs and the addition of a SUCCESS/FAILURE indication, fixing up
an obsolete comment as well.

Signed-off-by: Paul E. McKenney <paulmck@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:40 -07:00
Randy Dunlap
1dbe83c344 [PATCH] fix kernel-doc in kernel/ dir
Fix kernel-doc parameters in kernel/

Warning(/var/linsrc/linux-2617-g9//kernel/auditsc.c:1376): No description found for parameter 'u_abs_timeout'
Warning(/var/linsrc/linux-2617-g9//kernel/auditsc.c:1420): No description found for parameter 'u_msg_prio'
Warning(/var/linsrc/linux-2617-g9//kernel/auditsc.c:1420): No description found for parameter 'u_abs_timeout'
Warning(/var/linsrc/linux-2617-g9//kernel/acct.c:526): No description found for parameter 'pacct'

Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:39 -07:00
Ingo Molnar
34af946a22 [PATCH] spin/rwlock init cleanups
locking init cleanups:

 - convert " = SPIN_LOCK_UNLOCKED" to spin_lock_init() or DEFINE_SPINLOCK()
 - convert rwlocks in a similar manner

this patch was generated automatically.

Motivation:

 - cleanliness
 - lockdep needs control of lock initialization, which the open-coded
   variants do not give
 - it's also useful for -rt and for lock debugging in general

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:39 -07:00
Randy Dunlap
a7807a32bb [PATCH] poison: add & use more constants
Add more poison values to include/linux/poison.h.  It's not clear to me
whether some others should be added or not, so I haven't added any of
these:

./include/linux/libata.h:#define ATA_TAG_POISON		0xfafbfcfdU
./arch/ppc/8260_io/fcc_enet.c:1918:	memset((char *)(&(immap->im_dprambase[(mem_addr+64)])), 0x88, 32);
./drivers/usb/mon/mon_text.c:429:	memset(mem, 0xe5, sizeof(struct mon_event_text));
./drivers/char/ftape/lowlevel/ftape-ctl.c:738:		memset(ft_buffer[i]->address, 0xAA, FT_BUFF_SIZE);
./drivers/block/sx8.c:/* 0xf is just arbitrary, non-zero noise; this is sorta like poisoning */

Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:38 -07:00
Ingo Molnar
e6e5494cb2 [PATCH] vdso: randomize the i386 vDSO by moving it into a vma
Move the i386 VDSO down into a vma and thus randomize it.

Besides the security implications, this feature also helps debuggers, which
can COW a vma-backed VDSO just like a normal DSO and can thus do
single-stepping and other debugging features.

It's good for hypervisors (Xen, VMWare) too, which typically live in the same
high-mapped address space as the VDSO, hence whenever the VDSO is used, they
get lots of guest pagefaults and have to fix such guest accesses up - which
slows things down instead of speeding things up (the primary purpose of the
VDSO).

There's a new CONFIG_COMPAT_VDSO (default=y) option, which provides support
for older glibcs that still rely on a prelinked high-mapped VDSO.  Newer
distributions (using glibc 2.3.3 or later) can turn this option off.  Turning
it off is also recommended for security reasons: attackers cannot use the
predictable high-mapped VDSO page as syscall trampoline anymore.

There is a new vdso=[0|1] boot option as well, and a runtime
/proc/sys/vm/vdso_enabled sysctl switch, that allows the VDSO to be turned
on/off.

(This version of the VDSO-randomization patch also has working ELF
coredumping, the previous patch crashed in the coredumping code.)

This code is a combined work of the exec-shield VDSO randomization
code and Gerd Hoffmann's hypervisor-centric VDSO patch. Rusty Russell
started this patch and i completed it.

[akpm@osdl.org: cleanups]
[akpm@osdl.org: compile fix]
[akpm@osdl.org: compile fix 2]
[akpm@osdl.org: compile fix 3]
[akpm@osdl.org: revernt MAXMEM change]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@infradead.org>
Cc: Gerd Hoffmann <kraxel@suse.de>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Zachary Amsden <zach@vmware.com>
Cc: Andi Kleen <ak@muc.de>
Cc: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:38 -07:00
KAMEZAWA Hiroyuki
2842f11419 [PATCH] catch valid mem range at onlining memory
This patch allows hot-add memory which is not aligned to section.

Now, hot-added memory has to be aligned to section size.  Considering big
section sized archs, this is not useful.

When hot-added memory is registerd as iomem resoruce by iomem resource
patch, we can make use of that information to detect valid memory range.

Note: With this, not-aligned memory can be registerd. To allow hot-add
      memory with holes, we have to do more work around add_memory().
      (It doesn't allows add memory to already existing mem section.)

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:36 -07:00
Andrew Morton
5c31f2738a [PATCH] pm_trace is dangerous
CONFIG_PM_TRACES scrogs your RTC.  Mark it as experimental, and defaulting to
`off'.

Also beef up the help message a bit.

Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:35 -07:00
Randy Dunlap
7f32a25f63 [PATCH] kernel/acct: fix function definition
kernel/acct.c:579:19: warning: non-ANSI function declaration of function 'acct_process'

Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:32:35 -07:00
Greg Kroah-Hartman
6550e07f41 [PATCH] 64bit Resource: finally enable 64bit resource sizes
Introduce the Kconfig entry and actually switch to a 64bit value, if
wanted, for resource_size_t.

Based on a patch series originally from Vivek Goyal <vgoyal@in.ibm.com>

Cc: Vivek Goyal <vgoyal@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2006-06-27 09:24:00 -07:00
Greg Kroah-Hartman
d75fc8bbcc [PATCH] 64bit resource: change resource core to use resource_size_t
Based on a patch series originally from Vivek Goyal <vgoyal@in.ibm.com>

Cc: Vivek Goyal <vgoyal@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2006-06-27 09:23:59 -07:00
Greg Kroah-Hartman
685143ac1f [PATCH] 64bit resource: fix up printks for resources in arch and core code
This is needed if we wish to change the size of the resource structures.

Based on an original patch from Vivek Goyal <vgoyal@in.ibm.com> and
Andrew Morton.

(tweaked by Andy Isaacson <adi@hexapodia.org>)

Cc: Vivek Goyal <vgoyal@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Andy Isaacson <adi@hexapodia.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2006-06-27 09:23:59 -07:00
Linus Torvalds
2a2ed2db35 Merge git://git.kernel.org/pub/scm/linux/kernel/git/sam/kbuild
* git://git.kernel.org/pub/scm/linux/kernel/git/sam/kbuild: (40 commits)
  kbuild: trivial fixes in Makefile
  kbuild: adding symbols in Kconfig and defconfig to TAGS
  kbuild: replace abort() with exit(1)
  kbuild: support for %.symtypes files
  kbuild: fix silentoldconfig recursion
  kbuild: add option for stripping modules while installing them
  kbuild: kill some false positives from modpost
  kbuild: export-symbol usage report generator
  kbuild: fix make -rR breakage
  kbuild: append -dirty for updated but uncommited changes
  kbuild: append git revision for all untagged commits
  kbuild: fix module.symvers parsing in modpost
  kbuild: ignore make's built-in rules & variables
  kbuild: bugfix with initramfs
  kbuild: modpost build fix
  kbuild: check license compatibility when building modules
  kbuild: export-type enhancement to modpost.c
  kbuild: add dependency on kernel.release to the package targets
  kbuild: `make kernelrelease' speedup
  kconfig: KCONFIG_OVERWRITECONFIG
  ...
2006-06-26 11:05:15 -07:00
Linus Torvalds
81a07d7588 Merge branch 'x86-64'
* x86-64: (83 commits)
  [PATCH] x86_64: x86_64 stack usage debugging
  [PATCH] x86_64: (resend) x86_64 stack overflow debugging
  [PATCH] x86_64: msi_apic.c build fix
  [PATCH] x86_64: i386/x86-64 Add nmi watchdog support for new Intel CPUs
  [PATCH] x86_64: Avoid broadcasting NMI IPIs
  [PATCH] x86_64: fix apic error on bootup
  [PATCH] x86_64: enlarge window for stack growth
  [PATCH] x86_64: Minor string functions optimizations
  [PATCH] x86_64: Move export symbols to their C functions
  [PATCH] x86_64: Standardize i386/x86_64 handling of NMI_VECTOR
  [PATCH] x86_64: Fix modular pc speaker
  [PATCH] x86_64: remove sys32_ni_syscall()
  [PATCH] x86_64: Do not use -ffunction-sections for modules
  [PATCH] x86_64: Add cpu_relax to apic_wait_icr_idle
  [PATCH] x86_64: adjust kstack_depth_to_print default
  [PATCH] i386/x86-64: adjust /proc/interrupts column headings
  [PATCH] x86_64: Fix race in cpu_local_* on preemptible kernels
  [PATCH] x86_64: Fix fast check in safe_smp_processor_id
  [PATCH] x86_64: x86_64 setup.c - printing cmp related boottime information
  [PATCH] i386/x86-64/ia64: Move polling flag into thread_info_status
  ...

Manual resolve of trivial conflict in arch/i386/kernel/Makefile
2006-06-26 10:51:09 -07:00
Andi Kleen
495ab9c045 [PATCH] i386/x86-64/ia64: Move polling flag into thread_info_status
During some profiling I noticed that default_idle causes a lot of
memory traffic. I think that is caused by the atomic operations
to clear/set the polling flag in thread_info. There is actually
no reason to make this atomic - only the idle thread does it
to itself, other CPUs only read it. So I moved it into ti->status.

Converted i386/x86-64/ia64 for now because that was the easiest
way to fix ACPI which also manipulates these flags in its idle
function.

Cc: Nick Piggin <npiggin@novell.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Len Brown <len.brown@intel.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26 10:48:21 -07:00
Jan Beulich
83f4fcce7f [PATCH] x86_64: allow unwinder to build without module support
Add proper conditionals to be able to build with CONFIG_MODULES=n.

Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26 10:48:18 -07:00
Jan Beulich
c33bd9aac0 [PATCH] i386/x86-64: fall back to old-style call trace if no unwinding
If no unwinding is possible at all for a certain exception instance,
fall back to the old style call trace instead of not showing any trace
at all.

Also, allow setting the stack trace mode at the command line.

Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26 10:48:18 -07:00
Jan Beulich
4552d5dc08 [PATCH] x86_64: reliable stack trace support
These are the generic bits needed to enable reliable stack traces based
on Dwarf2-like (.eh_frame) unwind information. Subsequent patches will
enable x86-64 and i386 to make use of this.

Thanks to Andi Kleen and Ingo Molnar, who pointed out several possibilities
for improvement.

Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26 10:48:17 -07:00
Andi Kleen
bebfa1013e [PATCH] x86_64: Add compat_printk and sysctl to turn off compat layer warnings
Sometimes e.g. with crashme the compat layer warnings can be noisy.
Add a way to turn them off by gating all output through compat_printk
that checks a global sysctl. The default is not changed.

Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26 10:48:16 -07:00
Peter Williams
b78709cfd4 [PATCH] sched: fix SCHED_FIFO bug in sys_sched_rr_get_interval()
The introduction of SCHED_BATCH scheduling class with a value of 3 means
that the expression (p->policy & SCHED_FIFO) will return true if policy
is SCHED_BATCH or SCHED_FIFO.

Unfortunately, this expression is used in sys_sched_rr_get_interval()
and in the absence of a comment to say that this is intentional I
presume that it is unintentional and erroneous.

The fix is to change the expression to (p->policy == SCHED_FIFO).

Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26 10:02:41 -07:00
Oleg Nesterov
cf2dfbfbf4 [PATCH] coredump: copy_process: don't check SIGNAL_GROUP_EXIT
After the previous patch SIGNAL_GROUP_EXIT implies a pending SIGKILL, we
can remove this check from copy_process() because we already checked
!signal_pending().

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26 09:58:27 -07:00
Oleg Nesterov
d5f70c00ad [PATCH] coredump: kill ptrace related stuff
With this patch zap_process() sets SIGNAL_GROUP_EXIT while sending SIGKILL to
the thread group.  This means that a TASK_TRACED task

	1. Will be awakened by signal_wake_up(1)

	2. Can't sleep again via ptrace_notify()

	3. Can't go to do_signal_stop() after return
	   from ptrace_stop() in get_signal_to_deliver()

So we can remove all ptrace related stuff from coredump path.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26 09:58:27 -07:00
Eric W. Biederman
df26c40e56 [PATCH] proc: Cleanup proc_fd_access_allowed
In process of getting proc_fd_access_allowed to work it has developed a few
warts.  In particular the special case that always allows introspection and
the special case to allow inspection of kernel threads.

The special case for introspection is needed for /proc/self/mem.

The special case for kernel threads really should be overridable
by security modules.

So consolidate these checks into ptrace.c:may_attach().

The check to always allow introspection is trivial.

The check to allow access to kernel threads, and zombies is a little
trickier.  mem_read and mem_write already verify an mm exists so it isn't
needed twice.  proc_fd_access_allowed only doesn't want a check to verify
task->mm exits, s it prevents all access to kernel threads.  So just move
the task->mm check into ptrace_attach where it is needed for practical
reasons.

I did a quick audit and none of the security modules in the kernel seem to
care if they are passed a task without an mm into security_ptrace.  So the
above move should be safe and it allows security modules to come up with
more restrictive policy.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Cc: Chris Wright <chrisw@sous-sol.org>
Cc: James Morris <jmorris@namei.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26 09:58:26 -07:00
Eric W. Biederman
13b41b0949 [PATCH] proc: Use struct pid not struct task_ref
Incrementally update my proc-dont-lock-task_structs-indefinitely patches so
that they work with struct pid instead of struct task_ref.

Mostly this is a straight 1-1 substitution.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26 09:58:26 -07:00
Eric W. Biederman
99f8955183 [PATCH] proc: don't lock task_structs indefinitely
Every inode in /proc holds a reference to a struct task_struct.  If a
directory or file is opened and remains open after the the task exits this
pinning continues.  With 8K stacks on a 32bit machine the amount pinned per
file descriptor is about 10K.

Normally I would figure a reasonable per user process limit is about 100
processes.  With 80 processes, with a 1000 file descriptors each I can trigger
the 00M killer on a 32bit kernel, because I have pinned about 800MB of useless
data.

This patch replaces the struct task_struct pointer with a pointer to a struct
task_ref which has a struct task_struct pointer.  The so the pinning of dead
tasks does not happen.

The code now has to contend with the fact that the task may now exit at any
time.  Which is a little but not muh more complicated.

With this change it takes about 1000 processes each opening up 1000 file
descriptors before I can trigger the OOM killer.  Much better.

[mlp@google.com: task_mmu small fixes]
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: Paul Jackson <pj@sgi.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Albert Cahalan <acahalan@gmail.com>
Signed-off-by: Prasanna Meda <mlp@google.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26 09:58:25 -07:00
Eric W. Biederman
48e6484d49 [PATCH] proc: Rewrite the proc dentry flush on exit optimization
To keep the dcache from filling up with dead /proc entries we flush them on
process exit.  However over the years that code has gotten hairy with a
dentry_pointer and a lock in task_struct and misdocumented as a correctness
feature.

I have rewritten this code to look and see if we have a corresponding entry in
the dcache and if so flush it on process exit.  This removes the extra fields
in the task_struct and allows me to trivially handle the case of a
/proc/<tgid>/task/<pid> entry as well as the current /proc/<pid> entries.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26 09:58:24 -07:00
Anil S Keshavamurthy
e6f47f978b [PATCH] Notify page fault call chain
With this patch Kprobes now registers for page fault notifications only when
their is an active probe registered.  Once all the active probes are
unregistered their is no need to be notified of page faults and kprobes
unregisters itself from the page fault notifications.  Hence we will have ZERO
side effects when no probes are active.

Signed-off-by: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26 09:58:22 -07:00
Anil S Keshavamurthy
3d5631e063 [PATCH] Kprobes registers for notify page fault
Kprobes now registers for page fault notifications.

Signed-off-by: Anil S Keshavamurthy <anil.s.keshavmurthy@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26 09:58:22 -07:00
mao, bibo
3672165677 [PATCH] Kprobe: multi kprobe posthandler for booster
If there are multi kprobes on the same probepoint, there will be one extra
aggr_kprobe on the head of kprobe list.  The aggr_kprobe has
aggr_post_handler/aggr_break_handler whether the other kprobe
post_hander/break_handler is NULL or not.  This patch modifies this, only
when there is one or more kprobe in the list whose post_handler is not
NULL, post_handler of aggr_kprobe will be set as aggr_post_handler.

[soshima@redhat.com: !CONFIG_PREEMPT fix]
Signed-off-by: bibo, mao <bibo.mao@intel.com>
Cc: Masami Hiramatsu <hiramatu@sdl.hitachi.co.jp>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: "Keshavamurthy, Anil S" <anil.s.keshavamurthy@intel.com>
Cc: Prasanna S Panchamukhi <prasanna@in.ibm.com>
Cc: Jim Keniston <jkenisto@us.ibm.com>
Cc: Yumiko Sugita <sugita@sdl.hitachi.co.jp>
Cc: Hideo Aoki <haoki@redhat.com>
Signed-off-by: Satoshi Oshima <soshima@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26 09:58:22 -07:00
Roman Zippel
19923c190e [PATCH] fix and optimize clock source update
This fixes the clock source updates in update_wall_time() to correctly
track the time coming in via current_tick_length().  Optimize the fast
paths to be as short as possible to keep the overhead low.

Signed-off-by: Roman Zippel <zippel@linux-m68k.org>
Acked-by: John Stultz <johnstul@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26 09:58:21 -07:00
john stultz
a275254975 [PATCH] time: rename clocksource functions
As suggested by Roman Zippel, change clocksource functions to use
clocksource_xyz rather then xyz_clocksource to avoid polluting the
namespace.

Signed-off-by: John Stultz <johnstul@us.ibm.com>
Cc: Roman Zippel <zippel@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26 09:58:21 -07:00
john stultz
5d0cf410e9 [PATCH] Time: i386 Clocksource Drivers
Implement the time sources for i386 (acpi_pm, cyclone, hpet, pit, and tsc).
With this patch, the conversion of the i386 arch to the generic timekeeping
code should be complete.

The patch should be fairly straight forward, only adding the new clocksources.

[hirofumi@mail.parknet.co.jp: acpi_pm cleanup]
Signed-off-by: John Stultz <johnstul@us.ibm.com>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Paul Mundt <lethal@linux-sh.org>
Signed-off-by: John Stultz <johnstul@us.ibm.com>
Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26 09:58:21 -07:00
john stultz
cf3c769b4b [PATCH] Time: Introduce arch generic time accessors
Introduces clocksource switching code and the arch generic time accessor
functions that use the clocksource infrastructure.

Signed-off-by: John Stultz <johnstul@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26 09:58:20 -07:00
john stultz
5eb6d20533 [PATCH] Time: Use clocksource abstraction for NTP adjustments
Instead of incrementing xtime by tick_nsec + ntp adjustments, use the
clocksource abstraction to increment and scale time.  Using the clocksource
abstraction allows other clocksources to be used consistently in the face of
late or lost ticks, while preserving the existing behavior via the jiffies
clocksource.

This removes the need to keep time_phase adjustments as we just use the
current_tick_length() function as the NTP interface and accumulate time using
shifted nanoseconds.

The basics of this design was by Roman Zippel, however it is my own
interpretation and implementation, so the credit should go to him and the
blame to me.

Signed-off-by: John Stultz <johnstul@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26 09:58:20 -07:00
john stultz
260a42309b [PATCH] Time: Let user request precision from current_tick_length()
Change the current_tick_length() function so it takes an argument which
specifies how much precision to return in shifted nanoseconds.  This provides
a simple way to convert between NTPs internal nanoseconds shifted by
(SHIFT_SCALE - 10) to other shifted nanosecond units that are used by the
clocksource abstraction.

Signed-off-by: John Stultz <johnstul@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26 09:58:20 -07:00
john stultz
ad596171ed [PATCH] Time: Use clocksource infrastructure for update_wall_time
Modify the update_wall_time function so it increments time using the
clocksource abstraction instead of jiffies.  Since the only clocksource driver
currently provided is the jiffies clocksource, this should result in no
functional change.  Additionally, a timekeeping_init and timekeeping_resume
function has been added to initialize and maintain some of the new timekeping
state.

[hirofumi@mail.parknet.co.jp: fixlet]
Signed-off-by: John Stultz <johnstul@us.ibm.com>
Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26 09:58:20 -07:00
john stultz
734efb467b [PATCH] Time: Clocksource Infrastructure
This introduces the clocksource management infrastructure.  A clocksource is a
driver-like architecture generic abstraction of a free-running counter.  This
code defines the clocksource structure, and provides management code for
registering, selecting, accessing and scaling clocksources.

Additionally, this includes the trivial jiffies clocksource, a lowest common
denominator clocksource, provided mainly for use as an example.

[hirofumi@mail.parknet.co.jp: Don't enable IRQ too early]
Signed-off-by: John Stultz <johnstul@us.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Paul Mundt <lethal@linux-sh.org>
Signed-off-by: John Stultz <johnstul@us.ibm.com>
Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26 09:58:20 -07:00
Ingo Molnar
81615b624a [PATCH] Convert kernel/cpu.c to mutexes
Convert kernel/cpu.c from semaphore to mutex.

I've reviewed all lock_cpu_hotplug() critical sections, and they all seem to
fit mutex semantics.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26 09:58:16 -07:00
Ingo Molnar
1fb00c6cbd [PATCH] work around ppc64 bootup bug by making mutex-debugging save/restore irqs
It seems ppc64 wants to lock mutexes in early bootup code, with interrupts
disabled, and they expect interrupts to stay disabled, else they crash.

Work around this bug by making mutex debugging variants save/restore irq
flags.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26 09:58:16 -07:00
Linus Torvalds
3448097fcc Revert "swsusp special saveable pages support" commits
This reverts commits

  3e3318dee0 [PATCH] swsusp: x86_64 mark special saveable/unsaveable pages
  b6370d96e0 [PATCH] swsusp: i386 mark special saveable/unsaveable pages
  ce4ab0012b [PATCH] swsusp: add architecture special saveable pages support

because not only do they apparently cause page faults on x86, the
infrastructure doesn't compile on powerpc.

Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-25 18:41:00 -07:00
Linus Torvalds
72cf2709bf Fix PM_TRACE dependency: works only on 32-bit x86 for now
Not that x86-64 and other architecture support should be difficult to
add (trivial fixups to the data format and add the proper linker script
entry).

Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-25 10:04:15 -07:00
KaiGai Kohei
77787bfb44 [PATCH] pacct: none-delayed process accounting accumulation
In current 2.6.17 implementation, signal_struct refered from task_struct is
used for per-process data structure.  The pacct facility also uses it as a
per-process data structure to store stime, utime, minflt, majflt.  But those
members are saved in __exit_signal().  It's too late.

For example, if some threads exits at same time, pacct facility has a
possibility to drop accountings for a part of those threads.  (see, the
following 'The results of original 2.6.17 kernel') I think accounting
information should be completely collected into the per-process data structure
before writing out an accounting record.

This patch fixes this matter.  Accumulation of stime, utime, minflt and majflt
are done before generating accounting record.

[mingo@elte.hu: fix acct_collect() siglock bug found by lockdep]
Signed-off-by: KaiGai Kohei <kaigai@ak.jp.nec.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-25 10:01:25 -07:00
KaiGai Kohei
f6ec29a42d [PATCH] pacct: avoidance to refer the last thread as a representation of the process
When pacct facility generate an 'ac_flag' field in accounting record, it
refers a task_struct of the thread which died last in the process.  But any
other task_structs are ignored.

Therefore, pacct facility drops ASU flag even if root-privilege operations are
used by any other threads except the last one.  In addition, AFORK flag is
always set when the thread of group-leader didn't die last, although this
process has called execve() after fork().

We have a same matter in ac_exitcode.  The recorded ac_exitcode is an exit
code of the last thread in the process.  There is a possibility this exitcode
is not the group leader's one.
2006-06-25 10:01:25 -07:00
KaiGai Kohei
0e4648141a [PATCH] pacct: add pacct_struct to fix some pacct bugs.
The pacct facility need an i/o operation when an accounting record is
generated.  There is a possibility to wake OOM killer up.  If OOM killer is
activated, it kills some processes to make them release process memory
regions.

But acct_process() is called in the killed processes context before calling
exit_mm(), so those processes cannot release own memory.  In the results, any
processes stop in this point and it finally cause a system stall.
2006-06-25 10:01:25 -07:00
Randy Dunlap
9e37bd301e [PATCH] kthread: move kernel-doc and put it into DocBook
Move kthread API kernel-doc from kthread.h to kthread.c & fix it.
Add kthread API to kernel-api DocBook.

Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-25 10:01:24 -07:00
Randy Dunlap
fa9799e33d [PATCH] ktime/hrtimer: fix kernel-doc comments
Fix kernel-doc formatting in ktime.h and hrtimer.[ch] files.

Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-25 10:01:23 -07:00
Heiko Carstens
fc75cdfa5b [PATCH] cpu hotplug: fix CPU_UP_CANCEL handling
If a cpu hotplug callback fails on CPU_UP_PREPARE, all callbacks will be
called with CPU_UP_CANCELED.  A few of these callbacks assume that on
CPU_UP_PREPARE a pointer to task has been stored in a percpu array.  This
assumption is not true if CPU_UP_PREPARE fails and the following calls to
kthread_bind() in CPU_UP_CANCELED will cause an addressing exception
because of passing a NULL pointer.

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-25 10:01:22 -07:00
Serge E. Hallyn
8bdd1d1250 [PATCH] kthread: convert stop_machine into a kthread
- Update stop_machine.c to spawn stop_machine as kthreads rather than the
  deprecated kernel_threads.

- Update stop_machine to use the more efficient kthread_bind() before
  running task in place of set_cpus_allowed() after.

[akpm@osdl.org: remove now-wrong set_cpus_allowed()]
Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-25 10:01:22 -07:00
Anton Blanchard
2aa92581fb [PATCH] Link error when futexes are disabled on 64bit architectures
If futexes are disabled we fail to link on ppc64.

Signed-off-by: Anton Blanchard <anton@samba.org>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-25 10:01:17 -07:00
akpm@osdl.org
838cd153a5 [PATCH] N32 sigset and __COMPAT_ENDIAN_SWAP__
I'm testing glibc on MIPS64, little-endian, N32, O32 and N64 multilibs.

Among the NPTL test failures seen are some arising from sigsuspend problems
for N32: it blocks the wrong signals, so SIGCANCEL (SIGRTMIN) is blocked
despite glibc's carefully excluding it from sets of signals to block.
Specifically, testing suggests it blocks signal N^32 instead of signal N,
so (in the example tested) blocking SIGUSR1 (17) blocks signal 49 instead.

glibc's sigset_t uses an array of unsigned long, as does the kernel.
In both cases, signal N+1 is represented as
(1UL << (N % (8 * sizeof (unsigned long)))) in word number
(N / (8 * sizeof (unsigned long))).

Thus the N32 glibc uses an array of 32-bit words and the N64 kernel uses an
array of 64-bit words.  For little-endian, the layout is the same, with
signals 1-32 in the first 4 bytes, signals 33-64 in the second, etc.; for
big-endian, userspace has that layout while in the kernel each 8 bytes have
the two halves swapped from the userspace layout.

The N32 sigsuspend syscall uses sigset_from_compat to convert the userspace
sigset to kernel format.  If __COMPAT_ENDIAN_SWAP__ is *not* set, this uses
logic of the form

  set->sig[0] = compat->sig[0] | (((long)compat->sig[1]) << 32 )

to convert the userspace sigset to a kernel one.  This looks correct to me
for both big and little endian, given that in userspace compat->sig[1] will
represent signals 33-64, and so will the high 32 bits of set->sig[0] in the
kernel.  If however __COMPAT_ENDIAN_SWAP__ *is* set, as it is for
__MIPSEL__, it uses

  set->sig[0] = compat->sig[1] | (((long)compat->sig[0]) << 32 );

which seems incorrect for both big and little endian, and would
explain the observed symptoms.

This code is the only use of __COMPAT_ENDIAN_SWAP__, so if incorrect
then that macro serves no purpose, in which case something like the
following patch would seem appropriate to remove it.

Signed-off-by: Joseph Myers <joseph@codesourcery.com>
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-25 10:01:15 -07:00
Stephen Hemminger
eab03ac7bd [PATCH] Get rid of /proc/sys/proc
The table is empty, why does it still exist?

Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-25 10:01:15 -07:00
Jan Engelhardt
3b9c04106b [PATCH] printk time parameter
Currently, enabling/disabling printk timestamps is only possible through
reboot (bootparam) or recompile.  I normally do not run with timestamps
(since syslog handles that in a good manner), but for measuring small
kernel delays (e.g.  irq probing - see parport thread) I needed subsecond
precision, but then again, just for some minutes rather than all kernel
messages to come.  The following patch adds a module_param() with which the
timestamps can be en-/disabled in a live system through
/sys/modules/printk/parameters/printk_time.

Signed-off-by: Jan Engelhardt <jengelh@gmx.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-25 10:01:13 -07:00
Matt Helsley
11e64757f9 [PATCH] Remove unecessary NULL check in kernel/acct.c
copy_process() appears to be the only caller of acct_clear_integrals() and
does not pass in NULL task pointers.  Remove the unecessary check.

Signed-off-by: Matt Helsley <matthltc@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-25 10:01:09 -07:00
Andreas Mohr
3b364b8d58 [PATCH] constify parts of kernel/power/
Signed-off-by: Andreas Mohr <andi@lisas.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-25 10:01:08 -07:00
Andrew Morton
b61367732f [PATCH] schedule_on_each_cpu(): reduce kmalloc() size
schedule_on_each_cpu() presently does a large kmalloc - 96 kbytes on 1024 CPU
64-bit.

Rework it so that we do one 8192-byte allocation and then a pile of tiny ones,
via alloc_percpu().  This has a much higher chance of success (100% in the
current VM).

This also has the effect of reducing the memory requirements from NR_CPUS*n to
num_possible_cpus()*n.

Cc: Christoph Lameter <clameter@engr.sgi.com>
Cc: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-25 10:01:07 -07:00
Adrian Bunk
83cc5ed3c4 [PATCH] kernel/sys.c: cleanups
- proper prototypes for the following functions:
  - ctrl_alt_del()  (in include/linux/reboot.h)
  - getrusage()     (in include/linux/resource.h)
- make the following needlessly global functions static:
  - kernel_restart_prepare()
  - kernel_kexec()

[akpm@osdl.org: compile fix]
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-25 10:01:06 -07:00
Michael Ellerman
76a8ad2939 [PATCH] Make printk work for really early debugging
Currently printk is no use for early debugging because it refuses to
actually print anything to the console unless
cpu_online(smp_processor_id()) is true.

The stated explanation is that console drivers may require per-cpu
resources, or otherwise barf, because the system is not yet setup
correctly.  Fair enough.

However some console drivers might be quite happy running early during
boot, in fact we have one, and so it'd be nice if printk understood that.

So I added a flag (which I would have called CON_BOOT, but that's taken)
called CON_ANYTIME, which indicates that a console is happy to be called
anytime, even if the cpu is not yet online.

Tested on a Power 5 machine, with both a CON_ANYTIME driver and a bogus
console driver that BUG()s if called while offline.  No problems AFAICT.
Built for i386 UP & SMP.

Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-25 10:01:05 -07:00
Alan Stern
bbb1747d4e [PATCH] Allow raw_notifier callouts to unregister themselves
Since raw_notifier chains don't benefit from any centralized locking
protections, they shouldn't suffer from the associated limitations.  Under
some circumstances it might make sense for a raw_notifier callout routine
to unregister itself from the notifier chain.  This patch (as678) changes
the notifier core to allow for such things.

Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-25 10:01:01 -07:00
Paul Mackerras
bfe5d83419 [PATCH] Define __raw_get_cpu_var and use it
There are several instances of per_cpu(foo, raw_smp_processor_id()), which
is semantically equivalent to __get_cpu_var(foo) but without the warning
that smp_processor_id() can give if CONFIG_DEBUG_PREEMPT is enabled.  For
those architectures with optimized per-cpu implementations, namely ia64,
powerpc, s390, sparc64 and x86_64, per_cpu() turns into more and slower
code than __get_cpu_var(), so it would be preferable to use __get_cpu_var
on those platforms.

This defines a __raw_get_cpu_var(x) macro which turns into per_cpu(x,
raw_smp_processor_id()) on architectures that use the generic per-cpu
implementation, and turns into __get_cpu_var(x) on the architectures that
have an optimized per-cpu implementation.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-25 10:01:01 -07:00
Jesper Juhl
f867d2a2e5 [PATCH] ensure NULL deref can't possibly happen in is_exported()
If CONFIG_KALLSYMS is defined and if it should happen that is_exported() is
given a NULL 'mod' and lookup_symbol(name, __start___ksymtab,
__stop___ksymtab) returns 0, then we'll end up dereferencing a NULL
pointer.

Signed-off-by: Jesper Juhl <jesper.juhl@gmail.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-25 10:00:59 -07:00
Linus Torvalds
eb71c87a49 Add some basic resume trace facilities
Considering that there isn't a lot of hw we can depend on during resume,
this is about as good as it gets.

This is x86-only for now, although the basic concept (and most of the
code) will certainly work on almost any platform.

Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-24 14:44:01 -07:00
Eric Sesterhenn
125e18745f [PATCH] More BUG_ON conversion
Signed-off-by: Eric Sesterhenn <snakebyte@gmx.de>
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Bartlomiej Zolnierkiewicz <B.Zolnierkiewicz@elka.pw.edu.pl>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: James Bottomley <James.Bottomley@steeleye.com>
Acked-by: "Salyzyn, Mark" <mark_salyzyn@adaptec.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23 07:43:08 -07:00
Jan Beulich
908dcecda1 [PATCH] adjust handle_IRR_event() return type
Correct the return type of handle_IRQ_event() (inconsistency noticed during
Xen development), and remove redundant declarations.  The return type
adjustment required breaking out the definition of irqreturn_t into a
separate header, in order to satisfy current include order dependencies.

Signed-off-by: Jan Beulich <jbeulich@novell.com>

Cc: Richard Henderson <rth@twiddle.net>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Ian Molton <spyro@f2s.com>
Cc: Mikael Starvik <starvik@axis.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Hirokazu Takata <takata.hirokazu@renesas.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: William Lee Irwin III <wli@holomorphy.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Miles Bader <uclinux-v850@lsi.nec.co.jp>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Roman Zippel <zippel@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23 07:43:08 -07:00
Porpoise
3439dd86e3 [PATCH] When CONFIG_BASE_SMALL=1, cascade() may enter an infinite loop
When CONFIG_BASE_SAMLL=1, cascade() in may enter the infinite loop.
Because of CONFIG_BASE_SMALL=1(TVR_BITS=6 and TVN_BITS=4), the list
base->tv5 may cascade into base->tv5.  So, the kernel enters the infinite
loop in the function cascade().

I created a test module to verify this bug, and a patch to fix it.

#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/init.h>
#include <linux/timer.h>
#if 0
#include <linux/kdb.h>
#else
#define kdb_printf printk
#endif

#define TVN_BITS (CONFIG_BASE_SMALL ? 4 : 6)
#define TVR_BITS (CONFIG_BASE_SMALL ? 6 : 8)
#define TVN_SIZE (1 << TVN_BITS)
#define TVR_SIZE (1 << TVR_BITS)
#define TVN_MASK (TVN_SIZE - 1)
#define TVR_MASK (TVR_SIZE - 1)

#define TV_SIZE(N)  (N*TVN_BITS  + TVR_BITS)

struct timer_list timer0;
struct timer_list dummy_timer1;
struct timer_list dummy_timer2;

void dummy_timer_fun(unsigned long data) {
}
unsigned long j=0;
void check_timer_base(unsigned long data)
{
        kdb_printf("check_timer_base %08x\n",jiffies);
        mod_timer(&timer0,(jiffies & (~0xFFF)) + 0x1FFF);
}

int init_module(void)
{
        init_timer(&timer0);
        timer0.data = (unsigned long)0;
        timer0.function = check_timer_base;
        mod_timer(&timer0,jiffies+1);

        init_timer(&dummy_timer1);
        dummy_timer1.data = (unsigned long)0;
        dummy_timer1.function = dummy_timer_fun;

        init_timer(&dummy_timer2);
        dummy_timer2.data = (unsigned long)0;
        dummy_timer2.function = dummy_timer_fun;

        j=jiffies;
        j&=(~((1<<TV_SIZE(3))-1));
        j+=(1<<TV_SIZE(3));
        j+=(1<<TV_SIZE(4));

        kdb_printf("mod_timer %08x\n",j);

        mod_timer(&dummy_timer1, j );
        mod_timer(&dummy_timer2, j );

        return 0;
}

void cleanup_module()
{
        del_timer_sync(&timer0);
        del_timer_sync(&dummy_timer1);
        del_timer_sync(&dummy_timer2);
}

(Cleanups from Oleg)

[oleg@tv-sign.ru: use list_replace_init()]
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Matt Mackall <mpm@selenic.com>
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23 07:43:08 -07:00
Oleg Nesterov
626ab0e69d [PATCH] list: use list_replace_init() instead of list_splice_init()
list_splice_init(list, head) does unneeded job if it is known that
list_empty(head) == 1.  We can use list_replace_init() instead.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23 07:43:07 -07:00
Randy Dunlap
862f5f0133 [PATCH] Doc: add audit & acct to DocBook
Fix one audit kernel-doc description (one parameter was missing).
Add audit*.c interfaces to DocBook.
Add BSD accounting interfaces to DocBook.

Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23 07:43:07 -07:00
Paul E. McKenney
d83015b8f6 [PATCH] Make RCU API inaccessible to non-GPL Linux kernel modules
Remove synchronize_kernel() (deprecated 2-APR-2005 in
http://lkml.org/lkml/2005/4/3/11) and makes the RCU API inaccessible to
non-GPL Linux kernel modules (as was announced more than one year ago in
http://lkml.org/lkml/2005/4/3/8).  Tested on x86 and ppc64.

Signed-off-by: "Paul E. McKenney" <paulmck@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23 07:43:07 -07:00
Jes Sorensen
55f4e8d156 [PATCH] kernel/sys.c doesn't need init.h
kernel/sys.c doesn't have anything in it relying on linux/init.h -
remove the include.

Signed-off-by: Jes Sorensen <jes@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23 07:43:07 -07:00
Andrew Morton
57ae250861 [PATCH] CONFIG_NET=n build fix
Cc: Greg KH <greg@kroah.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23 07:43:06 -07:00
Andreas Mohr
83d4e6e7fb [PATCH] make noirqdebug/irqfixup __read_mostly, add (un)likely()
Signed-off-by: Andreas Mohr <andi@lisas.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23 07:43:05 -07:00
Daniel Walker
89d0cf01c0 [PATCH] invert irq/migration.c brach prediction
If you get to that point in the code it means that desc->move_irq is set,
pending_irq_cpumask[irq] and cpu_online_map should have a value.  Still
pretty good chance anding those two you'll still have a value.  So these
two branch predictors should be inverted.

Signed-off-by: Daniel Walker <dwalker@mvista.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23 07:43:04 -07:00
Ingo Molnar
8e0a43d8fa [PATCH] cond_resched() might_sleep() fix
add the __might_sleep() check back to cond_resched().

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23 07:43:04 -07:00
Prasanna Meda
6e66726047 [PATCH] dup fd error fix
Set errorp in dup_fd, it will be used in sys_unshare also.

Signed-off-by: Prasanna Meda <mlp@google.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23 07:43:04 -07:00
Andrew Morton
0ae26f1b31 [PATCH] mmput() might sleep
exit_aio() and exit_mmap() can sleep.  But it's easy to accidentally call
mmput() from inside locks.

Cc: Dave Peterson <dsp@llnl.gov>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23 07:43:03 -07:00
Jeff Moyer
c330dda908 [PATCH] Add a sysfs file to determine if a kexec kernel is loaded
Create two files in /sys/kernel, kexec_loaded and kexec_crash_loaded.  Each
file contains a simple boolean value indicating whether the relevant kernel
has been loaded into memory.  The motivation for this is geared around
support.

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23 07:43:02 -07:00
Rafael J. Wysocki
968808b895 [PATCH] swsusp: use less memory during resume
Make swsusp allocate only as much memory as needed to store the image data
and metadata during resume.

Without this patch swsusp additionally allocates many page frames that will
conflict with the "original" locations of the image data and are considered
as "unsafe", treating them as "eaten" pages (ie.  allocated but unusable).

The patch makes swsusp allocate as many pages as it'll need to store the
data read from the image in one shot, creating a list of allocated "safe"
pages, and use the observation that all pages allocated by it are marked
with the PG_nosave and PG_nosave_free flags set.   Namely, when it's about
to load an image page, swsusp can check whether the page frame
corresponding to the "original" location of this page has been allocated
(ie.  if the page frame has the PG_nosave and PG_nosave_free flags set) and
if so, it can load the page directly into this page frame.   Otherwise it
uses an allocated "safe" page from the list to store the data that will be
copied to their "original" location later on.

This allows us to save many page copyings and page allocations during
resume and in the future it may allow us to load images greater than 50% of
the normal zone.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: "Pavel Machek" <pavel@suse.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23 07:43:00 -07:00
Adrian Bunk
7bff24e255 [PATCH] kernel/power/snapshot.c: cleanups
- make needlessly global functions static
- make dummy functions static inline

Signed-off-by: Adrian Bunk <bunk@stusta.de>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23 07:42:59 -07:00
Rafael J. Wysocki
a938c356d5 [PATCH] swsusp: take lowmem reserves into account
swsusp allocates memory from the normal zone, so it cannot use lowmem
reserve pages from the lower zones.  Therefore it should not count these
pages as available to it.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23 07:42:59 -07:00
Shaohua Li
ce4ab0012b [PATCH] swsusp: add architecture special saveable pages support
1. Add architecture specific pages save/restore support.  Next two patches
   will use this to save/restore 'ACPI NVS' pages.

2. Allow reserved pages 'nosave'.  This could avoid save/restore BIOS
   reserved pages.

Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Nigel Cunningham <nigel@suspend2.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23 07:42:59 -07:00
Zhang Yanmin
1b61b910e9 [PATCH] x86: kernel irq balance doesn't work
On i386, kernel irq balance doesn't work.

1) In function do_irq_balance, after kernel finds the min_loaded cpu but
   before calling set_pending_irq to really pin the selected_irq to the
   target cpu, kernel does a cpus_and with irq_affinity[selected_irq].
   Later on, when the irq is acked, kernel would calls
   move_native_irq=>desc->handler->set_affinity to change the irq affinity.
    However, every function pointed by
   hw_interrupt_type->set_affinity(unsigned int irq, cpumask_t cpumask)
   always changes irq_affinity[irq] to cpumask.  Next time when recalling
   do_irq_balance, it has to do cpu_ands again with
   irq_affinity[selected_irq], but irq_affinity[selected_irq] already
   becomes one cpu selected by the first irq balance.

2) Function balance_irq in file arch/i386/kernel/io_apic.c has the same
   issue.

[akpm@osdl.org: cleanups]
Signed-off-by: Zhang Yanmin <yanmin.zhang@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23 07:42:57 -07:00
David Quigley
22fb52dd73 [PATCH] SELinux: add security hook call to mediate attach_task (kernel/cpuset.c)
Add a security hook call to enable security modules to control the ability
to attach a task to a cpuset.  While limited control over this operation is
possible via permission checks on the pseudo fs interface, those checks are
not sufficient to control access to the target task, which is looked up in
this function.  The existing task_setscheduler hook is re-used for this
operation since this falls under the same class of operations.

Signed-off-by: David Quigley <dpquigl@tycho.nsa.gov>
Acked-by:  Stephen Smalley <sds@tycho.nsa.gov>
Signed-off-by: James Morris <jmorris@namei.org>
Acked-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23 07:42:54 -07:00
David Quigley
e7834f8fcc [PATCH] SELinux: add security hooks to {get,set}affinity
This patch adds LSM hooks into the setaffinity and getaffinity functions to
enable security modules to control these operations between tasks with
task_setscheduler and task_getscheduler LSM hooks.

Signed-off-by: David Quigley <dpquigl@tycho.nsa.gov>
Acked-by:  Stephen Smalley <sds@tycho.nsa.gov>
Signed-off-by: James Morris <jmorris@namei.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23 07:42:53 -07:00
Christoph Lameter
9216dfad4f [PATCH] move_pages: fix 32 -> 64 bit compat function
The definition of the third parameter is a pointer to an array of virtual
addresses which give us some trouble.  The existing code calculated the
wrong address in the array since I used void to avoid having to specify a
type.

I now use the correct type "compat_uptr_t __user *" in the definition of
the function in kernel/compat.c.

However, I used __u32 in syscalls.h.  Would have to include compat.h there
in order to provide the same definition which would generate an ugly
include situation.

On both ia64 and x86_64 compat_uptr_t is u32. So this works although
parameter declarations differ.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23 07:42:53 -07:00
Christoph Lameter
1b2db9fb7a [PATCH] sys_move_pages: 32bit support (i386, x86_64)
sys_move_pages() support for 32bit (i386 plus x86_64 compat layer)

Add support for move_pages() on i386 and also add the compat functions
necessary to run 32 bit binaries on x86_64.

Add compat_sys_move_pages to the x86_64 32bit binary layer.  Note that it is
not up to date so I added the missing pieces.  Not sure if this is done the
right way.

[akpm@osdl.org: compile fix]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23 07:42:53 -07:00
Christoph Lameter
742755a1d8 [PATCH] page migration: sys_move_pages(): support moving of individual pages
move_pages() is used to move individual pages of a process. The function can
be used to determine the location of pages and to move them onto the desired
node. move_pages() returns status information for each page.

long move_pages(pid, number_of_pages_to_move,
		addresses_of_pages[],
		nodes[] or NULL,
		status[],
		flags);

The addresses of pages is an array of void * pointing to the
pages to be moved.

The nodes array contains the node numbers that the pages should be moved
to. If a NULL is passed instead of an array then no pages are moved but
the status array is updated. The status request may be used to determine
the page state before issuing another move_pages() to move pages.

The status array will contain the state of all individual page migration
attempts when the function terminates. The status array is only valid if
move_pages() completed successfullly.

Possible page states in status[]:

0..MAX_NUMNODES	The page is now on the indicated node.

-ENOENT		Page is not present

-EACCES		Page is mapped by multiple processes and can only
		be moved if MPOL_MF_MOVE_ALL is specified.

-EPERM		The page has been mlocked by a process/driver and
		cannot be moved.

-EBUSY		Page is busy and cannot be moved. Try again later.

-EFAULT		Invalid address (no VMA or zero page).

-ENOMEM		Unable to allocate memory on target node.

-EIO		Unable to write back page. The page must be written
		back in order to move it since the page is dirty and the
		filesystem does not provide a migration function that
		would allow the moving of dirty pages.

-EINVAL		A dirty page cannot be moved. The filesystem does not provide
		a migration function and has no ability to write back pages.

The flags parameter indicates what types of pages to move:

MPOL_MF_MOVE	Move pages that are only mapped by the process.

MPOL_MF_MOVE_ALL Also move pages that are mapped by multiple processes.
		Requires sufficient capabilities.

Possible return codes from move_pages()

-ENOENT		No pages found that would require moving. All pages
		are either already on the target node, not present, had an
		invalid address or could not be moved because they were
		mapped by multiple processes.

-EINVAL		Flags other than MPOL_MF_MOVE(_ALL) specified or an attempt
		to migrate pages in a kernel thread.

-EPERM		MPOL_MF_MOVE_ALL specified without sufficient priviledges.
		or an attempt to move a process belonging to another user.

-EACCES		One of the target nodes is not allowed by the current cpuset.

-ENODEV		One of the target nodes is not online.

-ESRCH		Process does not exist.

-E2BIG		Too many pages to move.

-ENOMEM		Not enough memory to allocate control array.

-EFAULT		Parameters could not be accessed.

A test program for move_pages() may be found with the patches
on ftp.kernel.org:/pub/linux/kernel/people/christoph/pmig/patches-2.6.17-rc4-mm3

From: Christoph Lameter <clameter@sgi.com>

  Detailed results for sys_move_pages()

  Pass a pointer to an integer to get_new_page() that may be used to
  indicate where the completion status of a migration operation should be
  placed.  This allows sys_move_pags() to report back exactly what happened to
  each page.

  Wish there would be a better way to do this. Looks a bit hacky.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Jes Sorensen <jes@trained-monkey.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Andi Kleen <ak@muc.de>
Cc: Michael Kerrisk <mtk-manpages@gmx.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23 07:42:53 -07:00
Rafael J. Wysocki
d6277db4ab [PATCH] swsusp: rework memory shrinker
Rework the swsusp's memory shrinker in the following way:

- Simplify balance_pgdat() by removing all of the swsusp-related code
  from it.

- Make shrink_all_memory() use shrink_slab() and a new function
  shrink_all_zones() which calls shrink_active_list() and
  shrink_inactive_list() directly for each zone in a way that's optimized
  for suspend.

In shrink_all_memory() we try to free exactly as many pages as the caller
asks for, preferably in one shot, starting from easier targets.   If slab
caches are huge, they are most likely to have enough pages to reclaim.
 The inactive lists are next (the zones with more inactive pages go first)
etc.

Each time shrink_all_memory() attempts to shrink the active and inactive
lists for each zone in 5 passes.   In the first pass, only the inactive
lists are taken into consideration.   In the next two passes the active
lists are also shrunk, but mapped pages are not reclaimed.   In the last
two passes the active and inactive lists are shrunk and mapped pages are
reclaimed as well.  The aim of this is to alter the reclaim logic to choose
the best pages to keep on resume and improve the responsiveness of the
resumed system.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Con Kolivas <kernel@kolivas.org>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23 07:42:48 -07:00
KAMEZAWA Hiroyuki
fadd8fbd15 [PATCH] support for panic at OOM
This patch adds panic_on_oom sysctl under sys.vm.

When sysctl vm.panic_on_oom = 1, the kernel panics intead of killing rogue
processes.  And if vm.panic_on_oom is 0 the kernel will do oom_kill() in
the same way as it does today.  Of course, the default value is 0 and only
root can modifies it.

In general, oom_killer works well and kill rogue processes.  So the whole
system can survive.  But there are environments where panic is preferable
rather than kill some processes.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23 07:42:47 -07:00
David Howells
726c334223 [PATCH] VFS: Permit filesystem to perform statfs with a known root dentry
Give the statfs superblock operation a dentry pointer rather than a superblock
pointer.

This complements the get_sb() patch.  That reduced the significance of
sb->s_root, allowing NFS to place a fake root there.  However, NFS does
require a dentry to use as a target for the statfs operation.  This permits
the root in the vfsmount to be used instead.

linux/mount.h has been added where necessary to make allyesconfig build
successfully.

Interest has also been expressed for use with the FUSE and XFS filesystems.

Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Cc: Nathan Scott <nathans@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23 07:42:45 -07:00
David Howells
454e2398be [PATCH] VFS: Permit filesystem to override root dentry on mount
Extend the get_sb() filesystem operation to take an extra argument that
permits the VFS to pass in the target vfsmount that defines the mountpoint.

The filesystem is then required to manually set the superblock and root dentry
pointers.  For most filesystems, this should be done with simple_set_mnt()
which will set the superblock pointer and then set the root dentry to the
superblock's s_root (as per the old default behaviour).

The get_sb() op now returns an integer as there's now no need to return the
superblock pointer.

This patch permits a superblock to be implicitly shared amongst several mount
points, such as can be done with NFS to avoid potential inode aliasing.  In
such a case, simple_set_mnt() would not be called, and instead the mnt_root
and mnt_sb would be set directly.

The patch also makes the following changes:

 (*) the get_sb_*() convenience functions in the core kernel now take a vfsmount
     pointer argument and return an integer, so most filesystems have to change
     very little.

 (*) If one of the convenience function is not used, then get_sb() should
     normally call simple_set_mnt() to instantiate the vfsmount. This will
     always return 0, and so can be tail-called from get_sb().

 (*) generic_shutdown_super() now calls shrink_dcache_sb() to clean up the
     dcache upon superblock destruction rather than shrink_dcache_anon().

     This is required because the superblock may now have multiple trees that
     aren't actually bound to s_root, but that still need to be cleaned up. The
     currently called functions assume that the whole tree is rooted at s_root,
     and that anonymous dentries are not the roots of trees which results in
     dentries being left unculled.

     However, with the way NFS superblock sharing are currently set to be
     implemented, these assumptions are violated: the root of the filesystem is
     simply a dummy dentry and inode (the real inode for '/' may well be
     inaccessible), and all the vfsmounts are rooted on anonymous[*] dentries
     with child trees.

     [*] Anonymous until discovered from another tree.

 (*) The documentation has been adjusted, including the additional bit of
     changing ext2_* into foo_* in the documentation.

[akpm@osdl.org: convert ipath_fs, do other stuff]
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Cc: Nathan Scott <nathans@sgi.com>
Cc: Roland Dreier <rolandd@cisco.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23 07:42:45 -07:00
Linus Torvalds
45c091bb2d Merge git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc
* git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc: (139 commits)
  [POWERPC] re-enable OProfile for iSeries, using timer interrupt
  [POWERPC] support ibm,extended-*-frequency properties
  [POWERPC] Extra sanity check in EEH code
  [POWERPC] Dont look for class-code in pci children
  [POWERPC] Fix mdelay badness on shared processor partitions
  [POWERPC] disable floating point exceptions for init
  [POWERPC] Unify ppc syscall tables
  [POWERPC] mpic: add support for serial mode interrupts
  [POWERPC] pseries: Print PCI slot location code on failure
  [POWERPC] spufs: one more fix for 64k pages
  [POWERPC] spufs: fail spu_create with invalid flags
  [POWERPC] spufs: clear class2 interrupt status before wakeup
  [POWERPC] spufs: fix Makefile for "make clean"
  [POWERPC] spufs: remove stop_code from struct spu
  [POWERPC] spufs: fix spu irq affinity setting
  [POWERPC] spufs: further abstract priv1 register access
  [POWERPC] spufs: split the Cell BE support into generic and platform dependant parts
  [POWERPC] spufs: dont try to access SPE channel 1 count
  [POWERPC] spufs: use kzalloc in create_spu
  [POWERPC] spufs: fix initial state of wbox file
  ...

Manually resolved conflicts in:
	drivers/net/phy/Makefile
	include/asm-powerpc/spu.h
2006-06-22 22:11:30 -07:00
Ravikiran G Thirumalai
de047c1bcd [PATCH] avoid tasklist_lock at getrusage for multithreaded case too
Avoid taking tasklist_lock for at getrusage for the multithreaded case too.
We don't need to take the tasklist lock for thread traversal of a process
since Oleg's do-__unhash_process-under-siglock.patch and related work.

Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-22 15:05:57 -07:00
Andrew Morton
6cc0719181 [PATCH] suspend_console() warning fix
kernel/power/main.c: In function 'suspend_prepare':
kernel/power/main.c:89: warning: implicit declaration of function 'suspend_console'
kernel/power/main.c: In function 'suspend_finish':
kernel/power/main.c:137: warning: implicit declaration of function 'resume_console'

Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-22 15:05:56 -07:00
Michael LeMay
d720024e94 [PATCH] selinux: add hooks for key subsystem
Introduce SELinux hooks to support the access key retention subsystem
within the kernel.  Incorporate new flask headers from a modified version
of the SELinux reference policy, with support for the new security class
representing retained keys.  Extend the "key_alloc" security hook with a
task parameter representing the intended ownership context for the key
being allocated.  Attach security information to root's default keyrings
within the SELinux initialization routine.

Has passed David's testsuite.

Signed-off-by: Michael LeMay <mdlemay@epoch.ncsc.mil>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: James Morris <jmorris@namei.org>
Acked-by: Chris Wright <chrisw@sous-sol.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-22 15:05:55 -07:00
Linus Torvalds
d9eaec9e29 Merge branch 'audit.b21' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/audit-current
* 'audit.b21' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/audit-current: (25 commits)
  [PATCH] make set_loginuid obey audit_enabled
  [PATCH] log more info for directory entry change events
  [PATCH] fix AUDIT_FILTER_PREPEND handling
  [PATCH] validate rule fields' types
  [PATCH] audit: path-based rules
  [PATCH] Audit of POSIX Message Queue Syscalls v.2
  [PATCH] fix se_sen audit filter
  [PATCH] deprecate AUDIT_POSSBILE
  [PATCH] inline more audit helpers
  [PATCH] proc_loginuid_write() uses simple_strtoul() on non-terminated array
  [PATCH] update of IPC audit record cleanup
  [PATCH] minor audit updates
  [PATCH] fix audit_krule_to_{rule,data} return values
  [PATCH] add filtering by ppid
  [PATCH] log ppid
  [PATCH] collect sid of those who send signals to auditd
  [PATCH] execve argument logging
  [PATCH] fix deadlocks in AUDIT_LIST/AUDIT_LIST_RULES
  [PATCH] audit_panic() is audit-internal
  [PATCH] inotify (5/5): update kernel documentation
  ...

Manual fixup of conflict in unclude/linux/inotify.h
2006-06-20 15:37:56 -07:00
Linus Torvalds
2edc322d42 Merge git://git.infradead.org/~dwmw2/rbtree-2.6
* git://git.infradead.org/~dwmw2/rbtree-2.6:
  [RBTREE] Switch rb_colour() et al to en_US spelling of 'color' for consistency
  Update UML kernel/physmem.c to use rb_parent() accessor macro
  [RBTREE] Update hrtimers to use rb_parent() accessor macro.
  [RBTREE] Add explicit alignment to sizeof(long) for struct rb_node.
  [RBTREE] Merge colour and parent fields of struct rb_node.
  [RBTREE] Remove dead code in rb_erase()
  [RBTREE] Update JFFS2 to use rb_parent() accessor macro.
  [RBTREE] Update eventpoll.c to use rb_parent() accessor macro.
  [RBTREE] Update key.c to use rb_parent() accessor macro.
  [RBTREE] Update ext3 to use rb_parent() accessor macro.
  [RBTREE] Change rbtree off-tree marking in I/O schedulers.
  [RBTREE] Add accessor macros for colour and parent fields of rb_node
2006-06-20 14:51:22 -07:00
Linus Torvalds
be967b7e2f Merge git://git.infradead.org/mtd-2.6
* git://git.infradead.org/mtd-2.6: (199 commits)
  [MTD] NAND: Fix breakage all over the place
  [PATCH] NAND: fix remaining OOB length calculation
  [MTD] NAND Fixup NDFC merge brokeness
  [MTD NAND] S3C2410 driver cleanup
  [MTD NAND] s3c24x0 board: Fix clock handling, ensure proper initialisation.
  [JFFS2] Check CRC32 on dirent and data nodes each time they're read
  [JFFS2] When retiring nextblock, allocate a node_ref for the wasted space
  [JFFS2] Mark XATTR support as experimental, for now
  [JFFS2] Don't trust node headers before the CRC is checked.
  [MTD] Restore MTD_ROM and MTD_RAM types
  [MTD] assume mtd->writesize is 1 for NOR flashes
  [MTD NAND] Fix s3c2410 NAND driver so it at least _looks_ like it compiles
  [MTD] Prepare physmap for 64-bit-resources
  [JFFS2] Fix more breakage caused by janitorial meddling.
  [JFFS2] Remove stray __exit from jffs2_compressors_exit()
  [MTD] Allow alternate JFFS2 mount variant for root filesystem.
  [MTD] Disconnect struct mtd_info from ABI
  [MTD] replace MTD_RAM with MTD_GENERIC_TYPE
  [MTD] replace MTD_ROM with MTD_GENERIC_TYPE
  [MTD] remove a forgotten MTD_XIP
  ...
2006-06-20 14:50:31 -07:00
Steve Grubb
41757106b9 [PATCH] make set_loginuid obey audit_enabled
Hi,

I was doing some testing and noticed that when the audit system was disabled,
I was still getting messages about the loginuid being set. The following patch
makes audit_set_loginuid look at in_syscall to determine if it should create
an audit event. The loginuid will continue to be set as long as there is a context.

Signed-off-by: Steve Grubb <sgrubb@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-06-20 05:25:29 -04:00
Amy Griffis
9c937dcc71 [PATCH] log more info for directory entry change events
When an audit event involves changes to a directory entry, include
a PATH record for the directory itself.  A few other notable changes:

    - fixed audit_inode_child() hooks in fsnotify_move()
    - removed unused flags arg from audit_inode()
    - added audit log routines for logging a portion of a string

Here's some sample output.

before patch:
type=SYSCALL msg=audit(1149821605.320:26): arch=40000003 syscall=39 success=yes exit=0 a0=bf8d3c7c a1=1ff a2=804e1b8 a3=bf8d3c7c items=1 ppid=739 pid=800 auid=0 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=ttyS0 comm="mkdir" exe="/bin/mkdir" subj=root:system_r:unconfined_t:s0-s0:c0.c255
type=CWD msg=audit(1149821605.320:26):  cwd="/root"
type=PATH msg=audit(1149821605.320:26): item=0 name="foo" parent=164068 inode=164010 dev=03:00 mode=040755 ouid=0 ogid=0 rdev=00:00 obj=root:object_r:user_home_t:s0

after patch:
type=SYSCALL msg=audit(1149822032.332:24): arch=40000003 syscall=39 success=yes exit=0 a0=bfdd9c7c a1=1ff a2=804e1b8 a3=bfdd9c7c items=2 ppid=714 pid=777 auid=0 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=ttyS0 comm="mkdir" exe="/bin/mkdir" subj=root:system_r:unconfined_t:s0-s0:c0.c255
type=CWD msg=audit(1149822032.332:24):  cwd="/root"
type=PATH msg=audit(1149822032.332:24): item=0 name="/root" inode=164068 dev=03:00 mode=040750 ouid=0 ogid=0 rdev=00:00 obj=root:object_r:user_home_dir_t:s0
type=PATH msg=audit(1149822032.332:24): item=1 name="foo" inode=164010 dev=03:00 mode=040755 ouid=0 ogid=0 rdev=00:00 obj=root:object_r:user_home_t:s0

Signed-off-by: Amy Griffis <amy.griffis@hp.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-06-20 05:25:28 -04:00
Amy Griffis
6a2bceec0e [PATCH] fix AUDIT_FILTER_PREPEND handling
Clear AUDIT_FILTER_PREPEND flag after adding rule to list.  This
fixes three problems when a rule is added with the -A syntax:

    - auditctl displays filter list as "(null)"
    - the rule cannot be removed using -d
    - a duplicate rule can be added with -a

Signed-off-by: Amy Griffis <amy.griffis@hp.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-06-20 05:25:28 -04:00
Al Viro
0a73dccc4f [PATCH] validate rule fields' types
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-06-20 05:25:27 -04:00
Amy Griffis
f368c07d72 [PATCH] audit: path-based rules
In this implementation, audit registers inotify watches on the parent
directories of paths specified in audit rules.  When audit's inotify
event handler is called, it updates any affected rules based on the
filesystem event.  If the parent directory is renamed, removed, or its
filesystem is unmounted, audit removes all rules referencing that
inotify watch.

To keep things simple, this implementation limits location-based
auditing to the directory entries in an existing directory.  Given
a path-based rule for /foo/bar/passwd, the following table applies:

    passwd modified -- audit event logged
    passwd replaced -- audit event logged, rules list updated
    bar renamed     -- rule removed
    foo renamed     -- untracked, meaning that the rule now applies to
		       the new location

Audit users typically want to have many rules referencing filesystem
objects, which can significantly impact filtering performance.  This
patch also adds an inode-number-based rule hash to mitigate this
situation.

The patch is relative to the audit git tree:
http://kernel.org/git/?p=linux/kernel/git/viro/audit-current.git;a=summary
and uses the inotify kernel API:
http://lkml.org/lkml/2006/6/1/145

Signed-off-by: Amy Griffis <amy.griffis@hp.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-06-20 05:25:27 -04:00
George C. Wilson
20ca73bc79 [PATCH] Audit of POSIX Message Queue Syscalls v.2
This patch adds audit support to POSIX message queues.  It applies cleanly to
the lspp.b15 branch of Al Viro's git tree.  There are new auxiliary data
structures, and collection and emission routines in kernel/auditsc.c.  New hooks
in ipc/mqueue.c collect arguments from the syscalls.

I tested the patch by building the examples from the POSIX MQ library tarball.
Build them -lrt, not against the old MQ library in the tarball.  Here's the URL:
http://www.geocities.com/wronski12/posix_ipc/libmqueue-4.41.tar.gz
Do auditctl -a exit,always -S for mq_open, mq_timedsend, mq_timedreceive,
mq_notify, mq_getsetattr.  mq_unlink has no new hooks.  Please see the
corresponding userspace patch to get correct output from auditd for the new
record types.

[fixes folded]

Signed-off-by: George Wilson <ltcgcw@us.ibm.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-06-20 05:25:26 -04:00
Al Viro
014149cce1 [PATCH] deprecate AUDIT_POSSBILE
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-06-20 05:25:25 -04:00
Al Viro
d8945bb51a [PATCH] inline more audit helpers
pull checks for ->audit_context into inlined wrappers

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-06-20 05:25:25 -04:00
Linda Knippers
ac03221a4f [PATCH] update of IPC audit record cleanup
The following patch addresses most of the issues with the IPC_SET_PERM
records as described in:
https://www.redhat.com/archives/linux-audit/2006-May/msg00010.html
and addresses the comments I received on the record field names.

To summarize, I made the following changes:

1. Changed sys_msgctl() and semctl_down() so that an IPC_SET_PERM
   record is emitted in the failure case as well as the success case.
   This matches the behavior in sys_shmctl().  I could simplify the
   code in sys_msgctl() and semctl_down() slightly but it would mean
   that in some error cases we could get an IPC_SET_PERM record
   without an IPC record and that seemed odd.

2. No change to the IPC record type, given no feedback on the backward
   compatibility question.

3. Removed the qbytes field from the IPC record.  It wasn't being
   set and when audit_ipc_obj() is called from ipcperms(), the
   information isn't available.  If we want the information in the IPC
   record, more extensive changes will be necessary.  Since it only
   applies to message queues and it isn't really permission related, it
   doesn't seem worth it.

4. Removed the obj field from the IPC_SET_PERM record.  This means that
   the kern_ipc_perm argument is no longer needed.

5. Removed the spaces and renamed the IPC_SET_PERM field names.  Replaced iuid and
   igid fields with ouid and ogid in the IPC record.

I tested this with the lspp.22 kernel on an x86_64 box.  I believe it
applies cleanly on the latest kernel.

-- ljk

Signed-off-by: Linda Knippers <linda.knippers@hp.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-06-20 05:25:24 -04:00
Serge E. Hallyn
5d136a010d [PATCH] minor audit updates
Just a few minor proposed updates.  Only the last one will
actually affect behavior.  The rest are just misleading
code.

Several AUDIT_SET functions return 'old' value, but only
return value <0 is checked for.  So just return 0.

propagate audit_set_rate_limit and audit_set_backlog_limit
error values

In audit_buffer_free, the audit_freelist_count was being
incremented even when we discard the return buffer, so
audit_freelist_count can end up wrong.  This could cause
the actual freelist to shrink over time, eventually
threatening to degrate audit performance.

Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-06-20 05:25:23 -04:00
Amy Griffis
0a3b483e83 [PATCH] fix audit_krule_to_{rule,data} return values
Don't return -ENOMEM when callers of these functions are checking for
a NULL return.  Bug noticed by Serge Hallyn.

Signed-off-by: Amy Griffis <amy.griffis@hp.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-06-20 05:25:23 -04:00
Al Viro
3c66251e57 [PATCH] add filtering by ppid
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-06-20 05:25:22 -04:00
Al Viro
f46038ff7d [PATCH] log ppid
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-06-20 05:25:22 -04:00
Al Viro
e1396065e0 [PATCH] collect sid of those who send signals to auditd
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-06-20 05:25:21 -04:00
Al Viro
473ae30bc7 [PATCH] execve argument logging
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-06-20 05:25:21 -04:00
Al Viro
9044e6bca5 [PATCH] fix deadlocks in AUDIT_LIST/AUDIT_LIST_RULES
We should not send a pile of replies while holding audit_netlink_mutex
since we hold the same mutex when we receive commands.  As the result,
we can get blocked while sending and sit there holding the mutex while
auditctl is unable to send the next command and get around to receiving
what we'd sent.

Solution: create skb and put them into a queue instead of sending;
once we are done, send what we've got on the list.  The former can
be done synchronously while we are handling AUDIT_LIST or AUDIT_LIST_RULES;
we are holding audit_netlink_mutex at that point.  The latter is done
asynchronously and without messing with audit_netlink_mutex.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-06-20 05:25:20 -04:00
Amy Griffis
2d9048e201 [PATCH] inotify (1/5): split kernel API from userspace support
The following series of patches introduces a kernel API for inotify,
making it possible for kernel modules to benefit from inotify's
mechanism for watching inodes.  With these patches, inotify will
maintain for each caller a list of watches (via an embedded struct
inotify_watch), where each inotify_watch is associated with a
corresponding struct inode.  The caller registers an event handler and
specifies for which filesystem events their event handler should be
called per inotify_watch.

Signed-off-by: Amy Griffis <amy.griffis@hp.com>
Acked-by: Robert Love <rml@novell.com>
Acked-by: John McCutchan <john@johnmccutchan.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-06-20 05:25:17 -04:00
Linus Torvalds
557240b48e Add support for suspending and resuming the whole console subsystem
Trying to suspend/resume with console messages flying all around is
doomed to failure, when the devices that the messages are trying to
go to are being shut down.

Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-19 18:16:01 -07:00
Oleg Nesterov
f53ae1dc34 [PATCH] arm_timer: remove a racy and obsolete PF_EXITING check
arm_timer() checks PF_EXITING to prevent BUG_ON(->exit_state)
in run_posix_cpu_timers().

However, for some reason it does so only for CPUCLOCK_PERTHREAD
case (which is imho wrong).

Also, this check is not reliable, PF_EXITING could be set on
another cpu without any locks/barriers just after the check,
so it can't prevent from attaching the timer to the exiting
task.

The previous patch makes this check unneeded.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-17 10:52:13 -07:00
Oleg Nesterov
30f1e3dd8c [PATCH] run_posix_cpu_timers: remove a bogus BUG_ON()
do_exit() clears ->it_##clock##_expires, but nothing prevents
another cpu to attach the timer to exiting process after that.
arm_timer() tries to protect against this race, but the check
is racy.

After exit_notify() does 'write_unlock_irq(&tasklist_lock)' and
before do_exit() calls 'schedule() local timer interrupt can find
tsk->exit_state != 0. If that state was EXIT_DEAD (or another cpu
does sys_wait4) interrupted task has ->signal == NULL.

At this moment exiting task has no pending cpu timers, they were
cleanuped in __exit_signal()->posix_cpu_timers_exit{,_group}(),
so we can just return from irq.

John Stultz recently confirmed this bug, see

	http://marc.theaimsgroup.com/?l=linux-kernel&m=115015841413687

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-17 10:52:13 -07:00
Oleg Nesterov
8f17fc20bf [PATCH] check_process_timers: fix possible lockup
If the local timer interrupt happens just after do_exit() sets PF_EXITING
(and before it clears ->it_xxx_expires) run_posix_cpu_timers() will call
check_process_timers() with tasklist_lock + ->siglock held and

	check_process_timers:

		t = tsk;
		do {
			....

			do {
				t = next_thread(t);
			} while (unlikely(t->flags & PF_EXITING));
		} while (t != tsk);

the outer loop will never stop.

Actually, the window is bigger.  Another process can attach the timer
after ->it_xxx_expires was cleared (see the next commit) and the 'if
(PF_EXITING)' check in arm_timer() is racy (see the one after that).

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-17 10:52:13 -07:00
Sam Ravnborg
b817f6feff kbuild: check license compatibility when building modules
Modules that uses GPL symbols can no longer be build with kbuild,
the build will fail during the modpost step.
When a GPL-incompatible module uses a EXPORT_SYMBOL_GPL_FUTURE symbol
then warn during modpost so author are actually notified.

The actual license compatibility check is shared with the kernel
to make sure it is in sync.

Patch originally from: Andreas Gruenbacher <agruen@suse.de> and
Ram Pai <linuxram@us.ibm.com>

Signed-off-by: Sam Ravnborg <sam@ravnborg.org>
2006-06-09 21:53:55 +02:00
Anton Blanchard
651d765d0b [PATCH] Add a prctl to change the endianness of a process.
This new prctl is intended for changing the execution mode of the
processor, on processors that support both a little-endian mode and a
big-endian mode.  It is intended for use by programs such as
instruction set emulators (for example an x86 emulator on PowerPC),
which may find it convenient to use the processor in an alternate
endianness mode when executing translated instructions.

Note that this does not imply the existence of a fully-fledged ABI for
both endiannesses, or of compatibility code for converting system
calls done in the non-native endianness mode.  The program is expected
to arrange for all of its system call arguments to be presented in the
native endianness.

Switching between big and little-endian mode will require some care in
constructing the instruction sequence for the switch.  Generally the
instructions up to the instruction that invokes the prctl system call
will have to be in the old endianness, and subsequent instructions
will have to be in the new endianness.

Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
2006-06-09 21:24:13 +10:00
Stephen Hemminger
8d16b76421 [PATCH] hrtimer: export symbols
From: Stephen Hemminger <shemminger@osdl.org>

I want to use the hrtimer's in the netem (Network Emulator) qdisc.  But the
necessary symbols aren't exported for module use.

Also needed by SystemTap.

Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "Stone, Joshua I" <joshua.i.stone@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-05-31 16:27:11 -07:00
Linus Torvalds
f1adad78dd Revert "[PATCH] sched: fix interactive task starvation"
This reverts commit 5ce74abe78 (and its
dependent commit 8a5bc075b8), because of
audio underruns.

Reported by Rene Herman <rene.herman@keyaccess.nl>, who also pinpointed
the exact cause of the underruns:

  "Audio underruns galore, with only ogg123 and firefox (browsing the
   GIT tree online is also a nice trigger by the way).

   If I back it out, everything is fine for me again."

Cc: Rene Herman <rene.herman@keyaccess.nl>
Cc: Mike Galbraith <efault@gmx.de>
Acked-by: Con Kolivas <kernel@kolivas.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-05-21 18:54:09 -07:00
Zachary Amsden
0662b71322 [PATCH] Fix a NO_IDLE_HZ timer bug
Under certain timing conditions, a race during boot occurs where timer
ticks are being processed on remote CPUs.  The remote timer ticks can
increment jiffies, and if this happens during a window when a timeout is
very close to expiring but a local tick has not yet been delivered, you can
end up with

1) No softirq pending
2) A local timer wheel which is not synced to jiffies
3) No high resolution timer active
4) A local timer which is supposed to fire before the current jiffies value.

In this circumstance, the comparison in next_timer_interrupt overflows,
because the base of the comparison for high resolution timers is jiffies,
but for the softirq timer wheel, it is relative the the current base of the
wheel (jiffies_base).

Signed-off-by: Zachary Amsden <zach@vmware.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-05-21 12:59:21 -07:00
Paul Jackson
92d1dbd274 [PATCH] cpuset: might_sleep_if check in cpuset_zones_allowed
It's too easy to incorrectly call cpuset_zone_allowed() in an atomic
context without __GFP_HARDWALL set, and when done, it is not noticed until
a tight memory situation forces allocations to be tried outside the current
cpuset.

Add a 'might_sleep_if()' check, to catch this earlier on, instead of
waiting for a similar check in the mutex_lock() code, which is only rarely
invoked.

Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-05-21 12:59:18 -07:00
Paul Jackson
36be57ffe3 [PATCH] cpuset: update cpuset_zones_allowed comment
Update the kernel/cpuset.c:cpuset_zone_allowed() comment.

The rule for when mm/page_alloc.c should call cpuset_zone_allowed()
was intended to be:

  Don't call cpuset_zone_allowed() if you can't sleep, unless you
  pass in the __GFP_HARDWALL flag set in gfp_flag, which disables
  the code that might scan up ancestor cpusets and sleep.

The explanation of this rule in the comment above cpuset_zone_allowed() was
stale, as a result of a restructuring of some __alloc_pages() code in
November 2005.

Rewrite that comment ...

Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-05-21 12:59:18 -07:00
David Woodhouse
18594822fc Merge git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
2006-05-16 01:19:52 +01:00
Trent Piepho
5e37661389 [PATCH] symbol_put_addr() locks kernel
Even since a previous patch:

Fix race between CONFIG_DEBUG_SLABALLOC and modules
Sun, 27 Jun 2004 17:55:19 +0000 (17:55 +0000)
http://www.kernel.org/git/?p=linux/kernel/git/torvalds/old-2.6-bkcvs.git;a=commit;h=92b3db26d31cf21b70e3c1eadc56c179506d8fbe

The function symbol_put_addr() will deadlock the kernel.

symbol_put_addr() would acquire modlist_lock, then while holding the lock call
two functions kernel_text_address() and module_text_address() which also try
to acquire the same lock.  This deadlocks the kernel of course.

This patch changes symbol_put_addr() to not acquire the modlist_lock, it
doesn't need it since it never looks at the module list directly.  Also, it
now uses core_kernel_text() instead of kernel_text_address().  The latter has
an additional check for addr inside a module, but we don't need to do that
since we call module_text_address() (the same function kernel_text_address
uses) ourselves.

Signed-off-by: Trent Piepho <xyzzy@speakeasy.org>
Cc: Zwane Mwaikambo <zwane@fsmlabs.com>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: Johannes Stezenbach <js@linuxtv.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-05-15 11:20:55 -07:00
Heiko Carstens
986733e01d [PATCH] RCU: introduce rcu_needs_cpu() interface
With "Paul E. McKenney" <paulmck@us.ibm.com>

Introduce rcu_needs_cpu() interface.  This can be used to tell if there
will be a new rcu batch on a cpu soon by looking at the curlist pointer.
This can be used to avoid to enter a tickless idle state where the cpu
would miss that a new batch is ready when rcu_start_batch would be called
on a different cpu.

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-05-15 11:20:55 -07:00
Linus Torvalds
f358166a94 ptrace_attach: fix possible deadlock schenario with irqs
Eric Biederman points out that we can't take the task_lock while holding
tasklist_lock for writing, because another CPU that holds the task lock
might take an interrupt that then tries to take tasklist_lock for writing.

Which would be a nasty deadlock, with one CPU spinning forever in an
interrupt handler (although admittedly you need to really work at
triggering it ;)

Since the ptrace_attach() code is special and very unusual, just make it
be extra careful, and use trylock+repeat to avoid the possible deadlock.

Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-05-11 11:08:49 -07:00
David Woodhouse
6f18a022fb Finally remove the obnoxious inter_module_xxx()
This was already a bad plan when I argued against adding it in the first
place. Good riddance.

Signed-off-by: David Woodhouse <dwmw2@infradead.org>
2006-05-08 22:40:05 +01:00
Linus Torvalds
f5b40e363a Fix ptrace_attach()/ptrace_traceme()/de_thread() race
This holds the task lock (and, for ptrace_attach, the tasklist_lock)
over the actual attach event, which closes a race between attacking to a
thread that is either doing a PTRACE_TRACEME or getting de-threaded.

Thanks to Oleg Nesterov for reminding me about this, and Chris Wright
for noticing a lost return value in my first version.

Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-05-07 10:49:33 -07:00
Steve Grubb
2ad312d209 [PATCH] Audit Filter Performance
While testing the watch performance, I noticed that selinux_task_ctxid()
was creeping into the results more than it should. Investigation showed
that the function call was being called whether it was needed or not. The
below patch fixes this.

Signed-off-by: Steve Grubb <sgrubb@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-05-01 06:10:07 -04:00
Steve Grubb
073115d6b2 [PATCH] Rework of IPC auditing
1) The audit_ipc_perms() function has been split into two different
functions:
        - audit_ipc_obj()
        - audit_ipc_set_perm()

There's a key shift here...  The audit_ipc_obj() collects the uid, gid,
mode, and SElinux context label of the current ipc object.  This
audit_ipc_obj() hook is now found in several places.  Most notably, it
is hooked in ipcperms(), which is called in various places around the
ipc code permforming a MAC check.  Additionally there are several places
where *checkid() is used to validate that an operation is being
performed on a valid object while not necessarily having a nearby
ipcperms() call.  In these locations, audit_ipc_obj() is called to
ensure that the information is captured by the audit system.

The audit_set_new_perm() function is called any time the permissions on
the ipc object changes.  In this case, the NEW permissions are recorded
(and note that an audit_ipc_obj() call exists just a few lines before
each instance).

2) Support for an AUDIT_IPC_SET_PERM audit message type.  This allows
for separate auxiliary audit records for normal operations on an IPC
object and permissions changes.  Note that the same struct
audit_aux_data_ipcctl is used and populated, however there are separate
audit_log_format statements based on the type of the message.  Finally,
the AUDIT_IPC block of code in audit_free_aux() was extended to handle
aux messages of this new type.  No more mem leaks I hope ;-)

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-05-01 06:10:04 -04:00
Steve Grubb
ce29b682e2 [PATCH] More user space subject labels
Hi,

The patch below builds upon the patch sent earlier and adds subject label to
all audit events generated via the netlink interface. It also cleans up a few
other minor things.

Signed-off-by: Steve Grubb <sgrubb@redhat.com>

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-05-01 06:10:01 -04:00
Steve Grubb
e7c3497013 [PATCH] Reworked patch for labels on user space messages
The below patch should be applied after the inode and ipc sid patches.
This patch is a reworking of Tim's patch that has been updated to match
the inode and ipc patches since its similar.

[updated:
>  Stephen Smalley also wanted to change a variable from isec to tsec in the
>  user sid patch.                                                              ]

Signed-off-by: Steve Grubb <sgrubb@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-05-01 06:09:58 -04:00
Steve Grubb
9c7aa6aa74 [PATCH] change lspp ipc auditing
Hi,

The patch below converts IPC auditing to collect sid's and convert to context
string only if it needs to output an audit record. This patch depends on the
inode audit change patch already being applied.

Signed-off-by: Steve Grubb <sgrubb@redhat.com>

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-05-01 06:09:56 -04:00
Steve Grubb
1b50eed9ca [PATCH] audit inode patch
Previously, we were gathering the context instead of the sid. Now in this patch,
we gather just the sid and convert to context only if an audit event is being
output.

This patch brings the performance hit from 146% down to 23%

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-05-01 06:09:53 -04:00
Darrel Goeddel
3dc7e3153e [PATCH] support for context based audit filtering, part 2
This patch provides the ability to filter audit messages based on the
elements of the process' SELinux context (user, role, type, mls sensitivity,
and mls clearance).  It uses the new interfaces from selinux to opaquely
store information related to the selinux context and to filter based on that
information.  It also uses the callback mechanism provided by selinux to
refresh the information when a new policy is loaded.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-05-01 06:09:36 -04:00
Al Viro
97e94c4530 [PATCH] no need to wank with task_lock() and pinning task down in audit_syscall_exit()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-05-01 06:06:21 -04:00
Al Viro
5411be59db [PATCH] drop task argument of audit_syscall_{entry,exit}
... it's always current, and that's a good thing - allows simpler locking.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-05-01 06:06:18 -04:00
Al Viro
e495149b17 [PATCH] drop gfp_mask in audit_log_exit()
now we can do that - all callers are process-synchronous and do not hold
any locks.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-05-01 06:06:16 -04:00
Al Viro
fa84cb935d [PATCH] move call of audit_free() into do_exit()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-05-01 06:06:13 -04:00
Al Viro
45d9bb0e37 [PATCH] deal with deadlocks in audit_free()
Don't assume that audit_log_exit() et.al. are called for the context of
current; pass task explictly.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-05-01 06:06:07 -04:00
Andrew Morton
13e87ec686 [PATCH] request_irq(): remove warnings from irq probing
- Add new SA_PROBEIRQ which suppresses the new sharing-mismatch warning.
  Some drivers like to use request_irq() to find an unused interrupt slot.

- Use it in i82365.c

- Kill unused SA_PROBE.

Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-04-28 08:33:46 -07:00
dean gaudet
47bb789973 [PATCH] off-by-1 in kernel/power/main.c
There's an off-by-1 in kernel/power/main.c:state_store() ...  if your
kernel just happens to have some non-zero data at pm_states[PM_SUSPEND_MAX]
(i.e.  one past the end of the array) then it'll let you write anything you
want to /sys/power/state and in response the box will enter S5.

Signed-off-by: dean gaudet <dean@arctic.org>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-04-28 08:33:46 -07:00
Chandra Seetharaman
83d722f7e1 [PATCH] Remove __devinit and __cpuinit from notifier_call definitions
Few of the notifier_chain_register() callers use __init in the definition
of notifier_call.  It is incorrect as the function definition should be
available after the initializations (they do not unregister them during
initializations).

This patch fixes all such usages to _not_ have the notifier_call __init
section.

Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-04-26 08:30:03 -07:00
Chandra Seetharaman
649bbaa484 [PATCH] Remove __devinitdata from notifier block definitions
Few of the notifier_chain_register() callers use __devinitdata in the
definition of notifier_block data structure.  It is incorrect as the
data structure should be available after the initializations (they do
not unregister them during initializations).

This was leading to an oops when notifier_chain_register() call is
invoked for those callback chains after initialization.

This patch fixes all such usages to _not_ have the notifier_block data
structure in the init data section.

Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-04-26 08:27:50 -07:00
David Woodhouse
ed198cb497 [RBTREE] Update hrtimers to use rb_parent() accessor macro.
Also switch it to use the same method of using off-tree nodes as
everyone else now does -- set them to point to themselves.

Signed-off-by: David Woodhouse <dwmw2@infradead.org>
2006-04-22 02:38:50 +01:00
Linus Torvalds
402a26f0c0 Merge branch 'for-linus' of git://brick.kernel.dk/data/git/linux-2.6-block
* 'for-linus' of git://brick.kernel.dk/data/git/linux-2.6-block:
  [PATCH] block/elevator.c: remove unused exports
  [PATCH] splice: fix smaller sized splice reads
  [PATCH] Don't inherit ->splice_pipe across forks
  [patch] cleanup: use blk_queue_stopped
  [PATCH] Document online io scheduler switching
2006-04-20 08:17:04 -07:00
Ananth N Mavinakayanahalli
7522a8423b [PATCH] kprobes: NULL out non-relevant fields in struct kretprobe
In cases where a struct kretprobe's *_handler fields are non-NULL, it is
possible to cause a system crash, due to the possibility of calls ending up
in zombie functions.  Documentation clearly states that unused *_handlers
should be set to NULL, but kprobe users sometimes fail to do so.

Fix it by setting the non-relevant fields of the struct kretprobe to NULL.

Signed-off-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Acked-by: Jim Keniston <jkenisto@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-04-20 07:54:03 -07:00
Jens Axboe
a0aa7f68af [PATCH] Don't inherit ->splice_pipe across forks
It's really task private, so clear that field on fork after copying
task structure.

Signed-off-by: Jens Axboe <axboe@suse.de>
2006-04-20 13:05:33 +02:00
OGAWA Hirofumi
5a7b46b369 [PATCH] Add more prevent_tail_call()
Those also break userland regs like following.

   00000000 <sys_chown16>:
      0:	0f b7 44 24 0c       	movzwl 0xc(%esp),%eax
      5:	83 ca ff             	or     $0xffffffff,%edx
      8:	0f b7 4c 24 08       	movzwl 0x8(%esp),%ecx
      d:	66 83 f8 ff          	cmp    $0xffffffff,%ax
     11:	0f 44 c2             	cmove  %edx,%eax
     14:	66 83 f9 ff          	cmp    $0xffffffff,%cx
     18:	0f 45 d1             	cmovne %ecx,%edx
     1b:	89 44 24 0c          	mov    %eax,0xc(%esp)
     1f:	89 54 24 08          	mov    %edx,0x8(%esp)
     23:	e9 fc ff ff ff       	jmp    24 <sys_chown16+0x24>

where the tailcall at the end overwrites the incoming stack-frame.

Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
[ I would _really_ like to have a way to tell gcc about calling
  conventions. The "prevent_tail_call()" macro is pretty ugly ]
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-04-19 16:27:18 -07:00
Rafael J. Wysocki
4a3b98a422 [PATCH] swsusp: prevent possible image corruption on resume
The function free_pagedir() used by swsusp for freeing its internal data
structures clears the PG_nosave and PG_nosave_free flags for each page
being freed.

However, during resume PG_nosave_free set means that the page in
question is "unsafe" (ie.  it will be overwritten in the process of
restoring the saved system state from the image), so it should not be
used for the image data.

Therefore free_pagedir() should not clear PG_nosave_free if it's called
during resume (otherwise "unsafe" pages freed by it may be used for
storing the image data and the data may get corrupted later on).

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-04-19 09:13:49 -07:00
Eric W. Biederman
5e85d4abe3 [PATCH] task: Make task list manipulations RCU safe
While we can currently walk through thread groups, process groups, and
sessions with just the rcu_read_lock, this opens the door to walking the
entire task list.

We already have all of the other RCU guarantees so there is no cost in
doing this, this should be enough so that proc can stop taking the
tasklist lock during readdir.

prev_task was killed because it has no users, and using it will miss new
tasks when doing an rcu traversal.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-04-19 09:13:49 -07:00
Eric W. Biederman
64541d1970 [PATCH] kill unushed __put_task_struct_cb
Somehow in the midst of dotting i's and crossing t's during
the merge up to rc1 we wound up keeping __put_task_struct_cb
when it should have been killed as it no longer has any users.
Sorry I probably should have caught this while it was
still in the -mm tree.

Having the old code there gets confusing when reading
through the code and trying to understand what is
happening.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-04-14 17:43:57 -07:00
Adrian Bunk
78a596b449 [PATCH] remove kernel/power/pm.c:pm_unregister()
Since the last user is removed in -mm, we can now remove this long deprecated
function.

Signed-off-by: Adrian Bunk <bunk@stusta.de>
Cc: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2006-04-14 12:25:26 -07:00
Roland McGrath
e57a505984 [PATCH] fix non-leader exec under ptrace
This reverts most of commit 30e0fca6c1.
It broke the case of non-leader MT exec when ptraced.
I think the bug it was intended to fix was already addressed by commit
788e05a67c.

Signed-off-by: Roland McGrath <roland@redhat.com>
Acked-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-04-14 08:59:13 -07:00
Oleg Nesterov
a145410dcc [PATCH] __group_complete_signal: remove bogus BUG_ON
Commit e56d090310

   [PATCH] RCU signal handling

made this BUG_ON() unsafe. This code runs under ->siglock,
while switch_exec_pids() takes tasklist_lock.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-04-11 07:34:01 -07:00
Linus Torvalds
88dd9c16ce Merge branch 'splice' of git://brick.kernel.dk/data/git/linux-2.6-block
* 'splice' of git://brick.kernel.dk/data/git/linux-2.6-block:
  [PATCH] vfs: add splice_write and splice_read to documentation
  [PATCH] Remove sys_ prefix of new syscalls from __NR_sys_*
  [PATCH] splice: warning fix
  [PATCH] another round of fs/pipe.c cleanups
  [PATCH] splice: comment styles
  [PATCH] splice: add Ingo as addition copyright holder
  [PATCH] splice: unlikely() optimizations
  [PATCH] splice: speedups and optimizations
  [PATCH] pipe.c/fifo.c code cleanups
  [PATCH] get rid of the PIPE_*() macros
  [PATCH] splice: speedup __generic_file_splice_read
  [PATCH] splice: add direct fd <-> fd splicing support
  [PATCH] splice: add optional input and output offsets
  [PATCH] introduce a "kernel-internal pipe object" abstraction
  [PATCH] splice: be smarter about calling do_page_cache_readahead()
  [PATCH] splice: optimize the splice buffer mapping
  [PATCH] splice: cleanup __generic_file_splice_read()
  [PATCH] splice: only call wake_up_interruptible() when we really have to
  [PATCH] splice: potential !page dereference
  [PATCH] splice: mark the io page as accessed
2006-04-11 06:34:02 -07:00
Joe Korty
5ef37b1964 [PATCH] add cpu_relax to hrtimer_cancel
Add a cpu_relax() to the hand-coded spinwait in hrtimer_cancel().

Signed-off-by: Joe Korty <joe.korty@ccur.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-04-11 06:18:42 -07:00
Christoph Hellwig
d824e66a9a [PATCH] build kernel/irq/migration.c only if CONFIG_GENERIC_PENDING_IRQ is set
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-04-11 06:18:41 -07:00
Adrian Bunk
aa7271076a [PATCH] the scheduled unexport of panic_timeout
Implement the scheduled unexport of panic_timeout.

Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-04-11 06:18:40 -07:00
Andrew Morton
ba6edfcd17 [PATCH] timer initialisation fix
We need the boot CPU's tvec_bases[] entry to be initialised super-early in
boot, for early_serial_setup().  That runs within setup_arch(), before even
per-cpu areas are initialised.

The patch changes tvec_bases to use compile-time initialisation, and adds a
separate array `tvec_base_done' to keep track of which CPU has had its
tvec_bases[] entry initialised (because we can no longer use the zeroness of
that tvec_bases[] entry to determine whether it has been initialised).

Thanks to Eugene Surovegin <ebs@ebshome.net> for diagnosing this.

Cc: Eugene Surovegin <ebs@ebshome.net>
Cc: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-04-11 06:18:40 -07:00
Hyok S. Choi
3016b42153 [PATCH] frv: define MMU mode specific syscalls as 'cond_syscall' and clean up unneeded macros
For some architectures, a few syscalls are not linked in noMMU mode.  In
that case, the MMU depending syscalls are needed to be defined as
'cond_syscall'.  For example, ARM architecture selectively links sys_mlock
by the mode configuration.

In case of FRV, it has been managed by #ifdef CONFIG_MMU macro in
arch/frv/kernel/entry.S.  However these conditional macros are just
duplicates if they were defined as cond_syscall.  Compilation test is done
with FRV toolchains for both of MMU and noMMU mode.

Signed-off-by: Hyok S. Choi <hyok.choi@samsung.com>
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-04-11 06:18:33 -07:00
Mike Galbraith
8a5bc075b8 [PATCH] sched: don't awaken RT tasks on expired array
RT tasks are being awakened on the expired array when expired_starving() is
true, whereas they really should be excluded.  Fix.

Signed-off-by: Mike Galbraith <efault@gmx.de>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: Con Kolivas <kernel@kolivas.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-04-11 06:18:30 -07:00
Mike Galbraith
5ce74abe78 [PATCH] sched: fix interactive task starvation
Fix a starvation problem that occurs when a stream of highly interactive tasks
delay an array switch for extended periods despite EXPIRED_STARVING(rq) being
true.  AFAIKT, the only choice is to enqueue awakening tasks on the expired
array in this case.

Without this patch, it can be nearly impossible to remotely login to a busy
server, and interactive shell commands can starve for minutes.

Also, convert the EXPIRED_STARVING macro into an inline function which humans
can understand.

Signed-off-by: Mike Galbraith <efault@gmx.de>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Acked-by: Con Kolivas <kernel@kolivas.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-04-11 06:18:30 -07:00
Jens Axboe
b92ce55893 [PATCH] splice: add direct fd <-> fd splicing support
It's more efficient for sendfile() emulation. Basically we cache an
internal private pipe and just use that as the intermediate area for
pages. Direct splicing is not available from sys_splice(), it is only
meant to be used for sendfile() emulation.

Additional patch from Ingo Molnar to avoid the PIPE_BUFFERS loop at
exit for the normal fast path.

Signed-off-by: Jens Axboe <axboe@suse.de>
2006-04-11 13:52:07 +02:00
Jordan Hargrave
b20367a6c2 [PATCH] x86_64: Fix drift with HPET timer enabled
If the HPET timer is enabled, the clock can drift by ~3 seconds a day.
This is due to the HPET timer not being initialized with the correct
setting (still using PIT count).

If HZ changes, this drift can become even more pronounced.

HPET patch initializes tick_nsec with correct tick_nsec settings for
HPET timer.

Vojtech comments:

  "It's not entirely correct (it assumes the HPET ticks totally
   exactly), but it's significantly better than assuming the PIT error
   there."

Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-04-09 11:53:53 -07:00
Eric Sesterhenn
9f31252cb6 BUG_ON() Conversion in kernel/signal.c
this changes if() BUG(); constructs to BUG_ON() which is
cleaner, contains unlikely() and can better optimized away.

Signed-off-by: Eric Sesterhenn <snakebyte@gmx.de>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
2006-04-02 13:45:55 +02:00
Eric Sesterhenn
fda8bd78a1 BUG_ON() Conversion in kernel/signal.c
this changes if() BUG(); constructs to BUG_ON() which is
cleaner, contains unlikely() and can better optimized away.

Signed-off-by: Eric Sesterhenn <snakebyte@gmx.de>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
2006-04-02 13:44:47 +02:00
Eric Sesterhenn
524223ca81 BUG_ON() Conversion in kernel/ptrace.c
this changes if() BUG(); constructs to BUG_ON() which is
cleaner, contains unlikely() and can better optimized away.

Signed-off-by: Eric Sesterhenn <snakebyte@gmx.de>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
2006-04-02 13:43:40 +02:00
Kalin KOZHUHAROV
8ba8e95ed1 Fix comments: s/granuality/granularity/
I was grepping through the code and some `grep ganularity -R .` didn't
catch what I thought. Then looking closer I saw the term "granuality"
used in only four places (in comments) and granularity in many more
places describing the same idea. Some other facts:

dictionary.com does not know such a word
define:granuality on google is not found (and pages for granuality are
mostly related to patches to the kernel)
it has not been discussed as a term on LKML, AFAICS (=Can Search)

To be consistent, I think granularity should be used everywhere.

Signed-off-by: Kalin KOZHUHAROV <kalin@thinrope.net>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
2006-04-01 01:41:22 +02:00
Eric Sesterhenn
8abd8e298e BUG_ON() Conversion in kernel/printk.c
this changes if() BUG(); constructs to BUG_ON() which is
cleaner, contains unlikely() and can better optimized away.

Signed-off-by: Eric Sesterhenn <snakebyte@gmx.de>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
2006-04-01 01:21:17 +02:00
Adrian Bunk
3e6e952d1d help text: SOFTWARE_SUSPEND doesn't need ACPI
The note that SOFTWARE_SUSPEND doesn't need APM is helpful, but nowadays
the information that it doesn't need ACPI, too, is even more helpful.

Signed-off-by: Adrian Bunk <bunk@stusta.de>
2006-04-01 01:03:08 +02:00
Kirill Korotaev
4286229868 [PATCH] wrong error path in dup_fd() leading to oopses in RCU
Wrong error path in dup_fd() - it should return NULL on error,
not an address of already freed memory :/

Triggered by OpenVZ stress test suite.

What is interesting is that it was causing different oopses in RCU like
below:
Call Trace:
   [<c013492c>] rcu_do_batch+0x2c/0x80
   [<c0134bdd>] rcu_process_callbacks+0x3d/0x70
   [<c0126cf3>] tasklet_action+0x73/0xe0
   [<c01269aa>] __do_softirq+0x10a/0x130
   [<c01058ff>] do_softirq+0x4f/0x60
   =======================
   [<c0113817>] smp_apic_timer_interrupt+0x77/0x110
   [<c0103b54>] apic_timer_interrupt+0x1c/0x24
  Code:  Bad EIP value.
   <0>Kernel panic - not syncing: Fatal exception in interrupt

Signed-Off-By: Pavel Emelianov <xemul@sw.ru>
Signed-Off-By: Dmitry Mishin <dim@openvz.org>
Signed-Off-By: Kirill Korotaev <dev@openvz.org>
Signed-Off-By: Linus Torvalds <torvalds@osdl.org>
2006-03-31 12:25:46 -08:00
Eric W. Biederman
92476d7fc0 [PATCH] pidhash: Refactor the pid hash table
Simplifies the code, reduces the need for 4 pid hash tables, and makes the
code more capable.

In the discussions I had with Oleg it was felt that to a large extent the
cleanup itself justified the work.  With struct pid being dynamically
allocated meant we could create the hash table entry when the pid was
allocated and free the hash table entry when the pid was freed.  Instead of
playing with the hash lists when ever a process would attach or detach to a
process.

For myself the fact that it gave what my previous task_ref patch gave for free
with simpler code was a big win.  The problem is that if you hold a reference
to struct task_struct you lock in 10K of low memory.  If you do that in a user
controllable way like /proc does, with an unprivileged but hostile user space
application with typical resource limits of 1000 fds and 100 processes I can
trigger the OOM killer by consuming all of low memory with task structs, on a
machine wight 1GB of low memory.

If I instead hold a reference to struct pid which holds a pointer to my
task_struct, I don't suffer from that problem because struct pid is 2 orders
of magnitude smaller.  In fact struct pid is small enough that most other
kernel data structures dwarf it, so simply limiting the number of referring
data structures is enough to prevent exhaustion of low memory.

This splits the current struct pid into two structures, struct pid and struct
pid_link, and reduces our number of hash tables from PIDTYPE_MAX to just one.
struct pid_link is the per process linkage into the hash tables and lives in
struct task_struct.  struct pid is given an indepedent lifetime, and holds
pointers to each of the pid types.

The independent life of struct pid simplifies attach_pid, and detach_pid,
because we are always manipulating the list of pids and not the hash table.
In addition in giving struct pid an indpendent life it makes the concept much
more powerful.

Kernel data structures can now embed a struct pid * instead of a pid_t and
not suffer from pid wrap around problems or from keeping unnecessarily
large amounts of memory allocated.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-31 12:19:00 -08:00
Eric W. Biederman
8c7904a00b [PATCH] task: RCU protect task->usage
A big problem with rcu protected data structures that are also reference
counted is that you must jump through several hoops to increase the reference
count.  I think someone finally implemented atomic_inc_not_zero(&count) to
automate the common case.  Unfortunately this means you must special case the
rcu access case.

When data structures are only visible via rcu in a manner that is not
determined by the reference count on the object (i.e.  tasks are visible until
their zombies are reaped) there is a much simpler technique we can employ.
Simply delaying the decrement of the reference count until the rcu interval is
over.

What that means is that the proc code that looks up a task and later
wants to sleep can now do:

rcu_read_lock();
task = find_task_by_pid(some_pid);
if (task) {
	get_task_struct(task);
}
rcu_read_unlock();

The effect on the rest of the kernel is that put_task_struct becomes cheaper
and immediate, and in the case where the task has been reaped it frees the
task immediate instead of unnecessarily waiting an until the rcu interval is
over.

Cleanup of task_struct does not happen when its reference count drops to
zero, instead cleanup happens when release_task is called.  Tasks can only
be looked up via rcu before release_task is called.  All rcu protected
members of task_struct are freed by release_task.

Therefore we can move call_rcu from put_task_struct into release_task.  And
we can modify release_task to not immediately release the reference count
but instead have it call put_task_struct from the function it gives to
call_rcu.

The end result:

- get_task_struct is safe in an rcu context where we have just looked
  up the task.

- put_task_struct() simplifies into its old pre rcu self.

This reorganization also makes put_task_struct uncallable from modules as
it is not exported but it does not appear to be called from any modules so
this should not be an issue, and is trivially fixed.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-31 12:18:59 -08:00
Andrew Morton
158d9ebd19 [PATCH] resurrect __put_task_struct
This just got nuked in mainline.  Bring it back because Eric's patches use it.

Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-31 12:18:59 -08:00
Eric W. Biederman
390e2ff077 [PATCH] Make setsid() more robust
The core problem: setsid fails if it is called by init.  The effect in 2.6.16
and the earlier kernels that have this problem is that if you do a "ps -j 1 or
ps -ej 1" you will see that init and several of it's children have process
group and session == 0.  Instead of process group == session == 1.  Despite
init calling setsid.

The reason it fails is that daemonize calls set_special_pids(1,1) on kernel
threads that are launched before /sbin/init is called.

The only remaining effect in that current->signal->leader == 0 for init
instead of 1.  And the setsid call fails.  No one has noticed because
/sbin/init does not check the return value of setsid.

In 2.4 where we don't have the pidhash table, and daemonize doesn't exist
setsid actually works for init.

I care a lot about pid == 1 not being a special case that we leave broken,
because of the container/jail work that I am doing.

- Carefully allow init (pid == 1) to call setsid despite the kernel using
  its session.

- Use find_task_by_pid instead of find_pid because find_pid taking a
  pidtype is going away.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-31 12:18:59 -08:00
Thomas Gleixner
9741ef964d [PATCH] futex: check and validate timevals
The futex timeval is not checked for correctness.  The change does not
break existing applications as the timeval is supplied by glibc (and glibc
always passes a correct value), but the glibc-internal tests for this
functionality fail.

Signed-off-by: Thomas Gleixner <tglx@tglx.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-31 12:18:59 -08:00
Con Kolivas
d425b274ba [PATCH] sched: activate SCHED BATCH expired
To increase the strength of SCHED_BATCH as a scheduling hint we can
activate batch tasks on the expired array since by definition they are
latency insensitive tasks.

Signed-off-by: Con Kolivas <kernel@kolivas.org>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-31 12:18:59 -08:00
Con Kolivas
7c4bb1f9b3 [PATCH] sched: remove on runqueue requeueing
On runqueue time is used to elevate priority in schedule().

In the code it currently requeues tasks even if their priority is not
elevated, which would end up placing them at the end of their runqueue
array effectively delaying them instead of improving their priority.

Bug spotted by Mike Galbraith <efault@gmx.de>

This patch removes this requeueing.

Signed-off-by: Con Kolivas <kernel@kolivas.org>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-31 12:18:59 -08:00
Con Kolivas
5138930e6a [PATCH] sched: include noninteractive sleep in idle detect
Tasks waiting in SLEEP_NONINTERACTIVE state can now get to best priority so
they need to be included in the idle detection code.

Signed-off-by: Con Kolivas <kernel@kolivas.org>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-31 12:18:59 -08:00
Con Kolivas
e72ff0bb2c [PATCH] sched: dont decrease idle sleep avg
We watch for tasks that sleep extended periods and don't allow one single
prolonged sleep period from elevating priority to maximum bonus to prevent cpu
bound tasks from getting high priority with single long sleeps.  There is a
bug in the current code that also penalises tasks that already have high
priority.  Correct that bug.

Signed-off-by: Con Kolivas <kernel@kolivas.org>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-31 12:18:58 -08:00
Con Kolivas
e7c38cb49c [PATCH] sched: make task_noninteractive use sleep_type
Alterations to the pipe code in the kernel made it possible for relative
starvation to occur with tasks that slept waiting on a pipe getting unfair
priority bonuses even if they were otherwise fully cpu bound so the
TASK_NONINTERACTIVE flag was introduced which prevented any change to
sleep_avg while sleeping waiting on a pipe.  This change also leads to the
converse though, preventing any priority boost from occurring in truly
interactive tasks that wait on pipes.

Convert the TASK_NONINTERACTIVE flag to set sleep_type to SLEEP_NONINTERACTIVE
which will allow a linear bonus to priority based on sleep time thus allowing
interactive tasks to get high priority if they sleep enough.

Signed-off-by: Con Kolivas <kernel@kolivas.org>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-31 12:18:58 -08:00
Con Kolivas
3dee386e14 [PATCH] sched: cleanup task_activated()
The activated flag in task_struct is used to track different sleep types and
its usage is somewhat obfuscated.  Convert the variable to an enum with more
descriptive names without altering the function.

Signed-off-by: Con Kolivas <kernel@kolivas.org>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-31 12:18:58 -08:00
Jack Steiner
db1b1fefc2 [PATCH] sched: reduce overhead of calc_load
Currently, count_active_tasks() calls both nr_running() &
nr_interruptible().  Each of these functions does a "for_each_cpu" & reads
values from the runqueue of each cpu.  Although this is not a lot of
instructions, each runqueue may be located on different node.  Depending on
the architecture, a unique TLB entry may be required to access each
runqueue.

Since there may be more runqueues than cpu TLB entries, a scan of all
runqueues can trash the TLB.  Each memory reference incurs a TLB miss &
refill.

In addition, the runqueue cacheline that contains nr_running &
nr_uninterruptible may be evicted from the cache between the two passes.
This causes unnecessary cache misses.

Combining nr_running() & nr_interruptible() into a single function
substantially reduces the TLB & cache misses on large systems.  This should
have no measureable effect on smaller systems.

On a 128p IA64 system running a memory stress workload, the new function
reduced the overhead of calc_load() from 605 usec/call to 324 usec/call.

Signed-off-by: Jack Steiner <steiner@sgi.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-31 12:18:58 -08:00
Dimitri Sivanich
3055addadb [PATCH] hrtimer: call get_softirq_time() only when necessary in run_hrtimer_queue()
It seems that run_hrtimer_queue() is calling get_softirq_time() more
often than it needs to.

With this patch, it only calls get_softirq_time() if there's a
pending timer.

Signed-off-by: Dimitri Sivanich <sivanich@sgi.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-31 12:18:58 -08:00
Thomas Gleixner
669d7868ae [PATCH] hrtimer: use generic sleeper for nanosleep
Replace the nanosleep private sleeper functionality by the generic hrtimer
sleeper.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-31 12:18:58 -08:00
Thomas Gleixner
00362e33f6 [PATCH] hrtimer: create generic sleeper
The removal of the data field in the hrtimer structure enforces the
embedding of the timer into another data structure.  nanosleep now uses a
private implementation of the most common used timer callback function
(simple task wakeup).

In order to avoid the reimplentation of such functionality all over the
place a generic hrtimer_sleeper functionality is created.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-31 12:18:58 -08:00
Andrew Morton
7529c30116 [PATCH] modules: permit Dual-MIT/GPL licenses
One of the LEDs driver files wants to use this.

Probably drivers/mtd/maps/ipaq-flash.c wants to convert as well - right now
it'll be tainting the kernel.

Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: John Bowler <jbowler@acm.org>
Cc: "'Richard Purdie'" <rpurdie@rpsys.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-31 12:18:56 -08:00
Paul Jackson
e4e364e865 [PATCH] cpuset: memory migration interaction fix
Fix memory migration so that it works regardless of what cpuset the invoking
task is in.

If a task invoked a memory migration, by doing one of:

       1) writing a different nodemask to a cpuset 'mems' file, or

       2) writing a tasks pid to a different cpuset's 'tasks' file,
          where the cpuset had its 'memory_migrate' option turned on, then the
          allocation of the new pages for the migrated task(s) was constrained
          by the invoking tasks cpuset.

If this task wasn't in a cpuset that allowed the requested memory nodes, the
memory migration would happen to some other nodes that were in that invoking
tasks cpuset.  This was usually surprising and puzzling behaviour: Why didn't
the pages move?  Why did the pages move -there-?

To fix this, temporarilly change the invoking tasks 'mems_allowed' task_struct
field to the nodes the migrating tasks is moving to, so that new pages can be
allocated there.

Signed-off-by: Paul Jackson <pj@sgi.com>
Acked-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-31 12:18:55 -08:00
Paul Jackson
2741a559a0 [PATCH] cpuset: unsafe mm reference fix
Fix unsafe reference to a tasks mm struct, by moving the reference inside of a
convenient nearby properly guarded code block.

Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-31 12:18:55 -08:00
Paul Jackson
4a01c8d5be [PATCH] cpuset: task_lock comment fix
Fix cpuset comment involving case of a tasks cpuset pointer being NULL.
Thanks to "the_top_cpuset_hack", this code no longer sees NULL task->cpuset
pointers.

Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-31 12:18:55 -08:00
KaiGai Kohei
bb231fe3a5 [PATCH] Fix pacct bug in multithreading case.
I noticed a bug on the process accounting facility.  In multi-threading
process, some data would be recorded incorrectly when the group_leader dies
earlier than one or more threads.  The attached patch fixes this problem.

See below.  'bugacct' is a test program that create a worker thread after 4
seconds sleeping, then the group_leader dies soon.  The worker thread
consume CPU/Memory for 6 seconds, then exit.  We can estimate 10 seconds as
etime and 6 seconds as stime + utime.  This is a sample program which the
group_leader dies earlier than other threads.

The results of same binary execution on different kernel are below.
-- accounted records --------------------
         |   btime  | utime | stime | etime | minflt | majflt |   comm  |
original | 13:16:40 |  0.00 |  0.00 |  6.10 |    171 |      0 | bugacct |
 patched | 13:20:21 |  5.83 |  0.18 | 10.03 |  32776 |      0 | bugacct |
(*) bugacct allocates 128MB memory, thus 128MB / 4KB = 32768 of minflt is
    appropriate.

-- Test results in original kernel ------
$ date; time -p ./bugacct
Tue Mar 28 13:16:36 JST 2006  <- But pacct said btime is 13:16:40
real 10.11                    <- But pacct said etime is 6.10
user 5.96                     <- But pacct said utime is 0.00
sys 0.14                      <- But pacct said stime is 0.00
$
-- Test results in patched kernel -------
$ date; time -p ./bugacct
Tue Mar 28 13:20:21 JST 2006
real 10.04
user 5.83
sys 0.19
$

In the original 2.6.16 kernel, pacct records btime, utime, stime, etime and
minflt incorrectly.  In my opinion, this problem is caused by an assumption
that group_leader dies last.

The following section calculates process running time for etime and btime.
But it means running time of the thread that dies last, not process.  The
start_time of the first thread in the process (group_leader) should be
reduced from uptime to calculate etime and btime correctly.

   ---- do_acct_process() in kernel/acct.c:
   /* calculate run_time in nsec*/
   do_posix_clock_monotonic_gettime(&uptime);
   run_time = (u64)uptime.tv_sec*NSEC_PER_SEC + uptime.tv_nsec;
   run_time -= (u64)current->start_time.tv_sec*NSEC_PER_SEC
                                   + current->start_time.tv_nsec;
   ----

The following section calculates stime and utime of the process.
But it might count the utime and stime of the group_leader duplicatly
and ignore the utime and stime of the thread dies last, when one or
more threads remain after group_leader dead.
The ac_utime should be calculated as the sum of the signal->utime
and utime of the thread dies last. The ac_stime should be done also.

   ---- do_acct_process() in kernel/acct.c:
   jiffies = cputime_to_jiffies(cputime_add(current->group_leader->utime,
                                            current->signal->utime));
   ac.ac_utime = encode_comp_t(jiffies_to_AHZ(jiffies));
   jiffies = cputime_to_jiffies(cputime_add(current->group_leader->stime,
                                            current->signal->stime));
   ac.ac_stime = encode_comp_t(jiffies_to_AHZ(jiffies));
   ----

The part of the minflt/majflt calculation has same problem.
This patch solves those problems, I think.

Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-31 12:18:54 -08:00
OGAWA Hirofumi
9b41046cd0 [PATCH] Don't pass boot parameters to argv_init[]
The boot cmdline is parsed in parse_early_param() and
parse_args(,unknown_bootoption).

And __setup() is used in obsolete_checksetup().

	start_kernel()
		-> parse_args()
			-> unknown_bootoption()
				-> obsolete_checksetup()

If __setup()'s callback (->setup_func()) returns 1 in
obsolete_checksetup(), obsolete_checksetup() thinks a parameter was
handled.

If ->setup_func() returns 0, obsolete_checksetup() tries other
->setup_func().  If all ->setup_func() that matched a parameter returns 0,
a parameter is seted to argv_init[].

Then, when runing /sbin/init or init=app, argv_init[] is passed to the app.
If the app doesn't ignore those arguments, it will warning and exit.

This patch fixes a wrong usage of it, however fixes obvious one only.

Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-31 12:18:53 -08:00
Oleg Nesterov
a2c348fe01 [PATCH] __mod_timer: simplify ->base changing
Since base and new_base are of the same type now, we can save one 'if'
branch and simplify the code a bit.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-31 12:18:53 -08:00
Oleg Nesterov
3691c5199e [PATCH] kill __init_timer_base in favor of boot_tvec_bases
Commit a4a6198b80:
	[PATCH] tvec_bases too large for per-cpu data

introduced "struct tvec_t_base_s boot_tvec_bases" which is visible at
compile time.  This means we can kill __init_timer_base and move
timer_base_s's content into tvec_t_base_s.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-31 12:18:52 -08:00
Pavel Machek
85b6bce365 [PATCH] Fix suspend with traced tasks
strace /bin/bash misbehaves after resume; this fixes it.

(akpm: it's scary calling refrigerator() in state TASK_TRACED, but it seems to
do the right thing).

Signed-off-by: Pavel Machek <pavel@suse.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-31 12:18:50 -08:00
Oleg Nesterov
547679087b [PATCH] send_sigqueue: simplify and fix the race
send_sigqueue() checks PF_EXITING, then locks p->sighand->siglock.  This is
unsafe: 'p' can exit in between and set ->sighand = NULL.  The race is
theoretical, the window is tiny and irqs are disabled by the caller, so I
don't think we need the fix for -stable tree.

Convert send_sigqueue() to use lock_task_sighand() helper.

Also, delete 'p->flags & PF_EXITING' re-check, it is unneeded and the
comment is wrong.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-28 18:36:44 -08:00
Oleg Nesterov
a1d5e21e3e [PATCH] do_notify_parent_cldstop: remove 'to_self' param
The previous patch has changed callsites of do_notify_parent_cldstop() so that
to_self == (->ptrace & PT_PTRACED) always (as it should be).  We can remove
this parameter now.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: john stultz <johnstul@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-28 18:36:44 -08:00
Oleg Nesterov
883606a7c9 [PATCH] finish_stop: don't check stop_count < 0
Remove an obscure 'stop_count < 0' check in finish_stop().  The previous patch
made this case impossible.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: john stultz <johnstul@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-28 18:36:44 -08:00
Oleg Nesterov
dac27f4a09 [PATCH] simplify do_signal_stop()
do_signal_stop() considers 'thread_group_empty()' as a special case.
This was needed to avoid taking tasklist_lock. Since this lock is
unneeded any longer, we can remove this special case and simplify
the code even more.

Also, before this patch, finish_stop() was called with stop_count == -1
for 'thread_group_empty()' case. This is not strictly wrong, but confusing
and unneeded.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: john stultz <johnstul@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-28 18:36:44 -08:00
Oleg Nesterov
a7e5328a06 [PATCH] cleanup __exit_signal->cleanup_sighand path
Move 'tsk->sighand = NULL' from cleanup_sighand() to __exit_signal().  This
makes the exit path more understandable and allows us to do
cleanup_sighand() outside of ->siglock protected section.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-28 18:36:44 -08:00
Oleg Nesterov
4a2c7a7837 [PATCH] make fork() atomic wrt pgrp/session signals
Eric W. Biederman wrote:
>
> Ok. SUSV3/Posix is clear, fork is atomic with respect
> to signals.  Either a signal comes before or after a
> fork but not during. (See the rationale section).
> http://www.opengroup.org/onlinepubs/000095399/functions/fork.html
>
> The tasklist_lock does not stop forks from adding to a process
> group. The forks stall while the tasklist_lock is held, but a fork
> that began before we grabbed the tasklist_lock simply completes
> afterwards, and the child does not receive the signal.

This also means that SIGSTOP or sig_kernel_coredump() signal can't
be delivered to pgrp/session reliably.

With this patch copy_process() returns -ERESTARTNOINTR when it
detects a pending signal, fork() will be restarted transparently
after handling the signals.

This patch also deletes now unneeded "group_stop_count > 0" check,
copy_process() can no longer succeed while group stop in progress.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Acked-By: Eric Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-28 18:36:44 -08:00
Oleg Nesterov
47e65328a7 [PATCH] pids: kill PIDTYPE_TGID
This patch kills PIDTYPE_TGID pid_type thus saving one hash table in
kernel/pid.c and speeding up subthreads create/destroy a bit.  It is also a
preparation for the further tref/pids rework.

This patch adds 'struct list_head thread_group' to 'struct task_struct'
instead.

We don't detach group leader from PIDTYPE_PID namespace until another
thread inherits it's ->pid == ->tgid, so we are safe wrt premature
free_pidmap(->tgid) call.

Currently there are no users of find_task_by_pid_type(PIDTYPE_TGID).
Should the need arise, we can use find_task_by_pid()->group_leader.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Acked-By: Eric Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-28 18:36:44 -08:00
Oleg Nesterov
88531f725b [PATCH] do_sigaction: don't take tasklist_lock
do_sigaction() does not need tasklist_lock anymore, we can simplify the code.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-28 18:36:44 -08:00
Oleg Nesterov
aacc90944d [PATCH] do_group_exit: don't take tasklist_lock
do_group_exit() takes tasklist_lock for zap_other_threads(), this is unneeded
now.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-28 18:36:43 -08:00
Oleg Nesterov
a122b341b7 [PATCH] do_signal_stop: don't take tasklist_lock
do_signal_stop() does not need tasklist_lock anymore.  So it does not need to
do misc re-checks, and we can simplify the code.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-28 18:36:43 -08:00
Oleg Nesterov
6108ccd3e2 [PATCH] relax sig_needs_tasklist()
handle_stop_signal() does not need tasklist_lock for SIG_KERNEL_STOP_MASK
signals anymore.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-28 18:36:43 -08:00
Oleg Nesterov
7d7185c818 [PATCH] sys_times: don't take tasklist_lock
sys_times: don't take tasklist_lock

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-28 18:36:43 -08:00
Oleg Nesterov
5876700cd3 [PATCH] do __unhash_process() under ->siglock
This patch moves __unhash_process() call from realease_task() to
__exit_signal(), so __detach_pid() is called with ->siglock held.

This means we don't need tasklist_lock to iterate over thread group anymore:

	copy_process() was already changed to do attach_pid()
	under ->siglock.

	Eric's "pidhash-kill-switch_exec_pids.patch" from -mm
	changed de_thread() so it doesn't touch PIDTYPE_TGID.

NOTE: de_thread() still needs some attention.  It still changes task->pid
lockless.  Taking ->sighand.siglock here allows to do more tasklist_lock
removals.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-28 18:36:43 -08:00
Oleg Nesterov
35f5cad8c4 [PATCH] revert "Optimize sys_times for a single thread process"
This patch reverts 'CONFIG_SMP && thread_group_empty()' optimization in
sys_times().  The reason is that the next patch breaks memory ordering which
is needed for that optimization.

tasklist_lock in sys_times() will be eliminated completely by further patch.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-28 18:36:43 -08:00
Oleg Nesterov
6a14c5c9da [PATCH] move __exit_signal() to kernel/exit.c
__exit_signal() is private to release_task() now.  I think it is better to
make it static in kernel/exit.c and export flush_sigqueue() instead - this
function is much more simple and straightforward.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-28 18:36:43 -08:00
Oleg Nesterov
c81addc9d3 [PATCH] rename __exit_sighand to cleanup_sighand
Cosmetic, rename __exit_sighand to cleanup_sighand and move it close to
copy_sighand().

This matches copy_signal/cleanup_signal naming, and I think it is easier to
follow.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: "Paul E. McKenney" <paulmck@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-28 18:36:43 -08:00
Oleg Nesterov
29ff471234 [PATCH] cleanup __exit_signal()
This patch factors out duplicated code under 'if' branches.  Also, BUG_ON()
conversions and whitespace cleanups.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: "Paul E. McKenney" <paulmck@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-28 18:36:43 -08:00
Oleg Nesterov
6b3934ef52 [PATCH] copy_process: cleanup bad_fork_cleanup_signal
__exit_signal() does important cleanups atomically under ->siglock.  It is
also called from copy_process's error path.  This is not good, for example we
can't move __unhash_process() under ->siglock for that reason.

We should not mix these 2 paths, just look at ugly 'if (p->sighand)' under
'bad_fork_cleanup_sighand:' label.  For copy_process() case it is sufficient
to just backout copy_signal(), nothing more.

Again, nobody can see this task yet.  For CLONE_THREAD case we just decrement
signal->count, otherwise nobody can see this ->signal and we can free it
lockless.

This patch assumes it is safe to do exit_thread_group_keys() without
tasklist_lock.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-28 18:36:42 -08:00
Oleg Nesterov
7001510d0c [PATCH] copy_process: cleanup bad_fork_cleanup_sighand
The only caller of exit_sighand(tsk) is copy_process's error path.  We can
call __exit_sighand() directly and kill exit_sighand().

This 'tsk' was not yet registered in pid_hash[] or init_task.tasks, it has no
external references, nobody can see it, and

	IF (clone_flags & CLONE_SIGHAND)
		At least 'current' has a reference to ->sighand, this
		means atomic_dec_and_test(sighand->count) can't be true.

	ELSE
		Nobody can see this ->sighand, this means we can free it
		without any locking.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: "Paul E. McKenney" <paulmck@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-28 18:36:42 -08:00
Oleg Nesterov
a9e88e84b5 [PATCH] introduce sig_needs_tasklist() helper
In my opinion this patch cleans up the code.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-28 18:36:42 -08:00
Oleg Nesterov
f63ee72e0f [PATCH] introduce lock_task_sighand() helper
Add lock_task_sighand() helper and converts group_send_sig_info() to use
it.  Hopefully we will have more users soon.

This patch also removes '!sighand->count' and '!p->usage' checks, I think
they both are bogus, racy and unneeded (but probably it makes sense to
restore them as BUG_ON()s).

->sighand is cleared and it's ->count is decremented in release_task() with
sighand->siglock held, so it is a bug to have '!p->usage || !->count' after
we already locked and verified it is the same.  On the other hand, an
already dead task without ->sighand can have a non-zero ->usage due to
ptrace, for example.

If we read the stale value of ->sighand we must see the change after
spin_lock(), because that change was done while holding that same old
->sighand.siglock.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-28 18:36:42 -08:00
Oleg Nesterov
aa1757f90b [PATCH] convert sighand_cache to use SLAB_DESTROY_BY_RCU
This patch borrows a clever Hugh's 'struct anon_vma' trick.

Without tasklist_lock held we can't trust task->sighand until we locked it
and re-checked that it is still the same.

But this means we don't need to defer 'kmem_cache_free(sighand)'.  We can
return the memory to slab immediately, all we need is to be sure that
sighand->siglock can't dissapear inside rcu protected section.

To do so we need to initialize ->siglock inside ctor function,
SLAB_DESTROY_BY_RCU does the rest.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-28 18:36:42 -08:00
Oleg Nesterov
1f09f9749c [PATCH] release_task: replace open-coded ptrace_unlink()
Use ptrace_unlink() instead of open-coding.  No changes in kernel/exit.o

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-28 18:36:41 -08:00
Oleg Nesterov
8292d633ad [PATCH] wait_for_helper: trivial style cleanup
Use NULL instead of (... *)0

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-28 18:36:41 -08:00
Oleg Nesterov
6ac781b11a [PATCH] reparent_thread: use remove_parent/add_parent
Use remove_parent/add_parent instead of open coding.

No changes in kernel/exit.o

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-28 18:36:41 -08:00
Oleg Nesterov
73b9ebfe12 [PATCH] pidhash: don't count idle threads
fork_idle() does unhash_process() just after copy_process().  Contrary,
boot_cpu's idle thread explicitely registers itself for each pid_type with nr
= 0.

copy_process() already checks p->pid != 0 before process_counts++, I think we
can just skip attach_pid() calls and job control inits for idle threads and
kill unhash_process().  We don't need to cleanup ->proc_dentry in fork_idle()
because with this patch idle threads are never hashed in
kernel/pid.c:pid_hash[].

We don't need to hash pid == 0 in pidmap_init().  free_pidmap() is never
called with pid == 0 arg, so it will never be reused.  So it is still possible
to use pid == 0 in any PIDTYPE_xxx namespace from kernel/pid.c's POV.

However with this patch we don't hash pid == 0 for PIDTYPE_PID case.  We still
have have PIDTYPE_PGID/PIDTYPE_SID entries with pid == 0: /sbin/init and
kernel threads which don't call daemonize().

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-28 18:36:41 -08:00
Oleg Nesterov
c97d98931a [PATCH] kill SET_LINKS/REMOVE_LINKS
Both SET_LINKS() and SET_LINKS/REMOVE_LINKS() have exactly one caller, and
these callers already check thread_group_leader().

This patch kills theese macros, they mix two different things: setting
process's parent and registering it in init_task.tasks list.  Callers are
updated to do these actions by hand.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-28 18:36:41 -08:00
Oleg Nesterov
9b678ece42 [PATCH] don't use REMOVE_LINKS/SET_LINKS for reparenting
There are places where kernel uses REMOVE_LINKS/SET_LINKS while changing
process's ->parent.  Use add_parent/remove_parent instead, they don't abuse
of global process list.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-28 18:36:41 -08:00
Oleg Nesterov
8fafabd86f [PATCH] remove add_parent()'s parent argument
add_parent(p, parent) is always called with parent == p->parent, and it makes
no sense to do it differently.  This patch removes this argument.

No changes in affected .o files.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-28 18:36:41 -08:00
Oleg Nesterov
d799f03597 [PATCH] choose_new_parent: remove unused arg, sanitize exit_state check
'child_reaper' arg is not used in choose_new_parent().

"->exit_state >= EXIT_ZOMBIE" check is a leftover, was
valid when EXIT_ZOMBIE lived in ->state var.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Acked-by: Eric Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-28 18:36:40 -08:00
Eric W. Biederman
d73d65293e [PATCH] pidhash: kill switch_exec_pids
switch_exec_pids is only called from de_thread by way of exec, and it is
only called when we are exec'ing from a non thread group leader.

Currently switch_exec_pids gives the leader the pid of the thread and
unhashes and rehashes all of the process groups.  The leader is already in
the EXIT_DEAD state so no one cares about it's pids.  The only concern for
the leader is that __unhash_process called from release_task will function
correctly.  If we don't touch the leader at all we know that
__unhash_process will work fine so there is no need to touch the leader.

For the task becomming the thread group leader, we just need to give it the
pid of the old thread group leader, add it to the task list, and attach it
to the session and the process group of the thread group.

Currently de_thread is also adding the task to the task list which is just
silly.

Currently the only leader of __detach_pid besides detach_pid is
switch_exec_pids because of the ugly extra work that was being
performed.

So this patch removes switch_exec_pids because it is doing too much, it is
creating an unnecessary special case in pid.c, duing work duplicated in
de_thread, and generally obscuring what it is going on.

The necessary work is added to de_thread, and it seems to be a little
clearer there what is going on.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Kirill Korotaev <dev@sw.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-28 18:36:40 -08:00
Eric W. Biederman
fef23e7fbb [PATCH] exec: allow init to exec from any thread.
After looking at the problem of init calling exec some more I figured out
an easy way to make the code work.

The actual symptom without out this patch is that all threads will die
except pid == 1, and the thread calling exec.  The thread calling exec will
wait forever for pid == 1 to die.

Since pid == 1 does not install a handler for SIGKILL it will never die.

This modifies the tests for init from current->pid == 1 to the equivalent
current == child_reaper.  And then it causes exec in the ugly case to
modify child_reaper.

The only weird symptom is that you wind up with an init process that
doesn't have the oldest start time on the box.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-28 18:36:40 -08:00
Andrew Morton
ec7e15d648 [PATCH] compat_sys_futex() warning fix
kernel/futex_compat.c: In function `compat_sys_futex':
kernel/futex_compat.c:140: warning: passing arg 1 of `do_futex' makes integer from pointer without a cast
kernel/futex_compat.c:140: warning: passing arg 5 of `do_futex' makes integer from pointer without a cast

Not sure what Ingo was thinking of here.  Put the casts back in.

Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-28 09:16:09 -08:00
KAMEZAWA Hiroyuki
0a94502277 [PATCH] for_each_possible_cpu: fixes for generic part
replaces for_each_cpu with for_each_possible_cpu().

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-28 09:16:05 -08:00
Eric Sesterhenn
b9e20a9200 [PATCH] Change dash2underscore() return value to char
Since dash2underscore() just operates and returns chars, I guess its safe
to change the return value to a char.  With my .config, this reduces its
size by 5 bytes.

   text    data     bss     dec     hex filename
   4155     152       0    4307    10d3 params.o.orig
   4150     152       0    4302    10ce params.o

Signed-off-by: Eric Sesterhenn <snakebyte@gmx.de>
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-28 09:16:03 -08:00
Andrew Morton
f83ca9fe3e [PATCH] symversion warning fix
gcc-4.2:

kernel/module.c: In function '__find_symbol':
kernel/module.c:158: warning: the address of '__start___kcrctab', will always evaluate as 'true'
kernel/module.c:165: warning: the address of '__start___kcrctab_gpl', will always evaluate as 'true'
kernel/module.c:182: warning: the address of '__start___kcrctab_gpl_future', will always evaluate as 'true'

Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-28 09:16:02 -08:00
Alan Stern
e041c68341 [PATCH] Notifier chain update: API changes
The kernel's implementation of notifier chains is unsafe.  There is no
protection against entries being added to or removed from a chain while the
chain is in use.  The issues were discussed in this thread:

    http://marc.theaimsgroup.com/?l=linux-kernel&m=113018709002036&w=2

We noticed that notifier chains in the kernel fall into two basic usage
classes:

	"Blocking" chains are always called from a process context
	and the callout routines are allowed to sleep;

	"Atomic" chains can be called from an atomic context and
	the callout routines are not allowed to sleep.

We decided to codify this distinction and make it part of the API.  Therefore
this set of patches introduces three new, parallel APIs: one for blocking
notifiers, one for atomic notifiers, and one for "raw" notifiers (which is
really just the old API under a new name).  New kinds of data structures are
used for the heads of the chains, and new routines are defined for
registration, unregistration, and calling a chain.  The three APIs are
explained in include/linux/notifier.h and their implementation is in
kernel/sys.c.

With atomic and blocking chains, the implementation guarantees that the chain
links will not be corrupted and that chain callers will not get messed up by
entries being added or removed.  For raw chains the implementation provides no
guarantees at all; users of this API must provide their own protections.  (The
idea was that situations may come up where the assumptions of the atomic and
blocking APIs are not appropriate, so it should be possible for users to
handle these things in their own way.)

There are some limitations, which should not be too hard to live with.  For
atomic/blocking chains, registration and unregistration must always be done in
a process context since the chain is protected by a mutex/rwsem.  Also, a
callout routine for a non-raw chain must not try to register or unregister
entries on its own chain.  (This did happen in a couple of places and the code
had to be changed to avoid it.)

Since atomic chains may be called from within an NMI handler, they cannot use
spinlocks for synchronization.  Instead we use RCU.  The overhead falls almost
entirely in the unregister routine, which is okay since unregistration is much
less frequent that calling a chain.

Here is the list of chains that we adjusted and their classifications.  None
of them use the raw API, so for the moment it is only a placeholder.

  ATOMIC CHAINS
  -------------
arch/i386/kernel/traps.c:		i386die_chain
arch/ia64/kernel/traps.c:		ia64die_chain
arch/powerpc/kernel/traps.c:		powerpc_die_chain
arch/sparc64/kernel/traps.c:		sparc64die_chain
arch/x86_64/kernel/traps.c:		die_chain
drivers/char/ipmi/ipmi_si_intf.c:	xaction_notifier_list
kernel/panic.c:				panic_notifier_list
kernel/profile.c:			task_free_notifier
net/bluetooth/hci_core.c:		hci_notifier
net/ipv4/netfilter/ip_conntrack_core.c:	ip_conntrack_chain
net/ipv4/netfilter/ip_conntrack_core.c:	ip_conntrack_expect_chain
net/ipv6/addrconf.c:			inet6addr_chain
net/netfilter/nf_conntrack_core.c:	nf_conntrack_chain
net/netfilter/nf_conntrack_core.c:	nf_conntrack_expect_chain
net/netlink/af_netlink.c:		netlink_chain

  BLOCKING CHAINS
  ---------------
arch/powerpc/platforms/pseries/reconfig.c:	pSeries_reconfig_chain
arch/s390/kernel/process.c:		idle_chain
arch/x86_64/kernel/process.c		idle_notifier
drivers/base/memory.c:			memory_chain
drivers/cpufreq/cpufreq.c		cpufreq_policy_notifier_list
drivers/cpufreq/cpufreq.c		cpufreq_transition_notifier_list
drivers/macintosh/adb.c:		adb_client_list
drivers/macintosh/via-pmu.c		sleep_notifier_list
drivers/macintosh/via-pmu68k.c		sleep_notifier_list
drivers/macintosh/windfarm_core.c	wf_client_list
drivers/usb/core/notify.c		usb_notifier_list
drivers/video/fbmem.c			fb_notifier_list
kernel/cpu.c				cpu_chain
kernel/module.c				module_notify_list
kernel/profile.c			munmap_notifier
kernel/profile.c			task_exit_notifier
kernel/sys.c				reboot_notifier_list
net/core/dev.c				netdev_chain
net/decnet/dn_dev.c:			dnaddr_chain
net/ipv4/devinet.c:			inetaddr_chain

It's possible that some of these classifications are wrong.  If they are,
please let us know or submit a patch to fix them.  Note that any chain that
gets called very frequently should be atomic, because the rwsem read-locking
used for blocking chains is very likely to incur cache misses on SMP systems.
(However, if the chain's callout routines may sleep then the chain cannot be
atomic.)

The patch set was written by Alan Stern and Chandra Seetharaman, incorporating
material written by Keith Owens and suggestions from Paul McKenney and Andrew
Morton.

[jes@sgi.com: restructure the notifier chain initialization macros]
Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Signed-off-by: Jes Sorensen <jes@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-27 08:44:50 -08:00
Ingo Molnar
8f17d3a504 [PATCH] lightweight robust futexes updates
- fix: initialize the robust list(s) to NULL in copy_process.

- doc update

- cleanup: rename _inuser to _inatomic

- __user cleanups and other small cleanups

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Arjan van de Ven <arjan@infradead.org>
Cc: Ulrich Drepper <drepper@redhat.com>
Cc: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-27 08:44:49 -08:00
Ingo Molnar
34f192c652 [PATCH] lightweight robust futexes: compat
32-bit syscall compatibility support.  (This patch also moves all futex
related compat functionality into kernel/futex_compat.c.)

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Arjan van de Ven <arjan@infradead.org>
Acked-by: Ulrich Drepper <drepper@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-27 08:44:49 -08:00
Ingo Molnar
0771dfefc9 [PATCH] lightweight robust futexes: core
Add the core infrastructure for robust futexes: structure definitions, the new
syscalls and the do_exit() based cleanup mechanism.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Arjan van de Ven <arjan@infradead.org>
Acked-by: Ulrich Drepper <drepper@redhat.com>
Cc: Michael Kerrisk <mtk-manpages@gmx.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-27 08:44:49 -08:00
Siddha, Suresh B
0806903316 [PATCH] sched: fix group power for allnodes_domains
Current sched groups power calculation for allnodes_domains is wrong.  We
should really be using cumulative power of the physical packages in that
group (similar to the calculation in node_domains)

Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-27 08:44:44 -08:00
Siddha, Suresh B
1e9f28fa1e [PATCH] sched: new sched domain for representing multi-core
Add a new sched domain for representing multi-core with shared caches
between cores.  Consider a dual package system, each package containing two
cores and with last level cache shared between cores with in a package.  If
there are two runnable processes, with this appended patch those two
processes will be scheduled on different packages.

On such systems, with this patch we have observed 8% perf improvement with
specJBB(2 warehouse) benchmark and 35% improvement with CFP2000 rate(with 2
users).

This new domain will come into play only on multi-core systems with shared
caches.  On other systems, this sched domain will be removed by domain
degeneration code.  This new domain can be also used for implementing power
savings policy (see OLS 2005 CMP kernel scheduler paper for more details..
I will post another patch for power savings policy soon)

Most of the arch/* file changes are for cpu_coregroup_map() implementation.

Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-27 08:44:43 -08:00
Andreas Mohr
77e4bfbcf0 [PATCH] Small schedule() optimization
small schedule() microoptimization.

Signed-off-by: Andreas Mohr <andi@lisas.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-27 08:44:43 -08:00
Martin Andersson
013d386814 [PATCH] sched: fix task interactivity calculation
Is a truncation error in kernel/sched.c triggered when the nice value is
negative.  The affected code is used in the TASK_INTERACTIVE macro.

The code is:
#define SCALE(v1,v1_max,v2_max) \
	(v1) * (v2_max) / (v1_max)

which is used in this way:
SCALE(TASK_NICE(p), 40, MAX_BONUS)

Comments in the code says:
  * This part scales the interactivity limit depending on niceness.
  *
  * We scale it linearly, offset by the INTERACTIVE_DELTA delta.
  * Here are a few examples of different nice levels:
  *
  *  TASK_INTERACTIVE(-20): [1,1,1,1,1,1,1,1,1,0,0]
  *  TASK_INTERACTIVE(-10): [1,1,1,1,1,1,1,0,0,0,0]
  *  TASK_INTERACTIVE(  0): [1,1,1,1,0,0,0,0,0,0,0]
  *  TASK_INTERACTIVE( 10): [1,1,0,0,0,0,0,0,0,0,0]
  *  TASK_INTERACTIVE( 19): [0,0,0,0,0,0,0,0,0,0,0]
  *
  * (the X axis represents the possible -5 ... 0 ... +5 dynamic
  *  priority range a task can explore, a value of '1' means the
  *  task is rated interactive.)

However, the current code does not scale it linearly and the result differs
from the given examples.  If the mathematical function "floor" is used when
the nice value is negative instead of the truncation one gets when using
integer division, the result conforms to the documentation.

Output of TASK_INTERACTIVE when using the kernel code:
nice    dynamic priorities
-20     1     1     1     1     1     1     1     1     1     0     0
-19     1     1     1     1     1     1     1     1     0     0     0
-18     1     1     1     1     1     1     1     1     0     0     0
-17     1     1     1     1     1     1     1     1     0     0     0
-16     1     1     1     1     1     1     1     1     0     0     0
-15     1     1     1     1     1     1     1     0     0     0     0
-14     1     1     1     1     1     1     1     0     0     0     0
-13     1     1     1     1     1     1     1     0     0     0     0
-12     1     1     1     1     1     1     1     0     0     0     0
-11     1     1     1     1     1     1     0     0     0     0     0
-10     1     1     1     1     1     1     0     0     0     0     0
  -9     1     1     1     1     1     1     0     0     0     0     0
  -8     1     1     1     1     1     1     0     0     0     0     0
  -7     1     1     1     1     1     0     0     0     0     0     0
  -6     1     1     1     1     1     0     0     0     0     0     0
  -5     1     1     1     1     1     0     0     0     0     0     0
  -4     1     1     1     1     1     0     0     0     0     0     0
  -3     1     1     1     1     0     0     0     0     0     0     0
  -2     1     1     1     1     0     0     0     0     0     0     0
  -1     1     1     1     1     0     0     0     0     0     0     0
  0      1     1     1     1     0     0     0     0     0     0     0
  1      1     1     1     1     0     0     0     0     0     0     0
  2      1     1     1     1     0     0     0     0     0     0     0
  3      1     1     1     1     0     0     0     0     0     0     0
  4      1     1     1     0     0     0     0     0     0     0     0
  5      1     1     1     0     0     0     0     0     0     0     0
  6      1     1     1     0     0     0     0     0     0     0     0
  7      1     1     1     0     0     0     0     0     0     0     0
  8      1     1     0     0     0     0     0     0     0     0     0
  9      1     1     0     0     0     0     0     0     0     0     0
10      1     1     0     0     0     0     0     0     0     0     0
11      1     1     0     0     0     0     0     0     0     0     0
12      1     0     0     0     0     0     0     0     0     0     0
13      1     0     0     0     0     0     0     0     0     0     0
14      1     0     0     0     0     0     0     0     0     0     0
15      1     0     0     0     0     0     0     0     0     0     0
16      0     0     0     0     0     0     0     0     0     0     0
17      0     0     0     0     0     0     0     0     0     0     0
18      0     0     0     0     0     0     0     0     0     0     0
19      0     0     0     0     0     0     0     0     0     0     0

Output of TASK_INTERACTIVE when using "floor"
nice    dynamic priorities
-20     1     1     1     1     1     1     1     1     1     0     0
-19     1     1     1     1     1     1     1     1     1     0     0
-18     1     1     1     1     1     1     1     1     1     0     0
-17     1     1     1     1     1     1     1     1     1     0     0
-16     1     1     1     1     1     1     1     1     0     0     0
-15     1     1     1     1     1     1     1     1     0     0     0
-14     1     1     1     1     1     1     1     1     0     0     0
-13     1     1     1     1     1     1     1     1     0     0     0
-12     1     1     1     1     1     1     1     0     0     0     0
-11     1     1     1     1     1     1     1     0     0     0     0
-10     1     1     1     1     1     1     1     0     0     0     0
  -9     1     1     1     1     1     1     1     0     0     0     0
  -8     1     1     1     1     1     1     0     0     0     0     0
  -7     1     1     1     1     1     1     0     0     0     0     0
  -6     1     1     1     1     1     1     0     0     0     0     0
  -5     1     1     1     1     1     1     0     0     0     0     0
  -4     1     1     1     1     1     0     0     0     0     0     0
  -3     1     1     1     1     1     0     0     0     0     0     0
  -2     1     1     1     1     1     0     0     0     0     0     0
  -1     1     1     1     1     1     0     0     0     0     0     0
   0     1     1     1     1     0     0     0     0     0     0     0
   1     1     1     1     1     0     0     0     0     0     0     0
   2     1     1     1     1     0     0     0     0     0     0     0
   3     1     1     1     1     0     0     0     0     0     0     0
   4     1     1     1     0     0     0     0     0     0     0     0
   5     1     1     1     0     0     0     0     0     0     0     0
   6     1     1     1     0     0     0     0     0     0     0     0
   7     1     1     1     0     0     0     0     0     0     0     0
   8     1     1     0     0     0     0     0     0     0     0     0
   9     1     1     0     0     0     0     0     0     0     0     0
  10     1     1     0     0     0     0     0     0     0     0     0
  11     1     1     0     0     0     0     0     0     0     0     0
  12     1     0     0     0     0     0     0     0     0     0     0
  13     1     0     0     0     0     0     0     0     0     0     0
  14     1     0     0     0     0     0     0     0     0     0     0
  15     1     0     0     0     0     0     0     0     0     0     0
  16     0     0     0     0     0     0     0     0     0     0     0
  17     0     0     0     0     0     0     0     0     0     0     0
  18     0     0     0     0     0     0     0     0     0     0     0
  19     0     0     0     0     0     0     0     0     0     0     0

Signed-off-by: Martin Andersson <martin.andersson@control.lth.se>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Williams <pwil3058@bigpond.net.au>
Cc: Con Kolivas <kernel@kolivas.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-27 08:44:43 -08:00
Linus Torvalds
9ae21d1bb3 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bunk/trivial
* git://git.kernel.org/pub/scm/linux/kernel/git/bunk/trivial:
  drivers/char/ftape/lowlevel/fdc-io.c: Correct a comment
  Kconfig help: MTD_JEDECPROBE already supports Intel
  Remove ugly debugging stuff
  do_mounts.c: Minor ROOT_DEV comment cleanup
  BUG_ON() Conversion in drivers/s390/block/dasd_devmap.c
  BUG_ON() Conversion in mm/mempool.c
  BUG_ON() Conversion in mm/memory.c
  BUG_ON() Conversion in kernel/fork.c
  BUG_ON() Conversion in ipc/sem.c
  BUG_ON() Conversion in fs/ext2/
  BUG_ON() Conversion in fs/hfs/
  BUG_ON() Conversion in fs/dcache.c
  BUG_ON() Conversion in fs/buffer.c
  BUG_ON() Conversion in input/serio/hp_sdc_mlc.c
  BUG_ON() Conversion in md/dm-table.c
  BUG_ON() Conversion in md/dm-path-selector.c
  BUG_ON() Conversion in drivers/isdn
  BUG_ON() Conversion in drivers/char
  BUG_ON() Conversion in drivers/mtd/
2006-03-26 09:41:18 -08:00
bibo mao
c6fd91f0bd [PATCH] kretprobe instance recycled by parent process
When kretprobe probes the schedule() function, if the probed process exits
then schedule() will never return, so some kretprobe instances will never
be recycled.

In this patch the parent process will recycle retprobe instances of the
probed function and there will be no memory leak of kretprobe instances.

Signed-off-by: bibo mao <bibo.mao@intel.com>
Cc: Masami Hiramatsu <hiramatu@sdl.hitachi.co.jp>
Cc: Prasanna S Panchamukhi <prasanna@in.ibm.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-26 08:57:04 -08:00
Roman Zippel
05cfb614dd [PATCH] hrtimers: remove data field
The nanosleep cleanup allows to remove the data field of hrtimer.  The
callback function can use container_of() to get it's own data.  Since the
hrtimer structure is anyway embedded in other structures, this adds no
overhead.

Signed-off-by: Roman Zippel <zippel@linux-m68k.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-26 08:57:03 -08:00
Roman Zippel
df869b630d [PATCH] hrtimers: remove nsec_t typedef
nsec_t predates ktime_t and has mostly been superseded by it.  In the few
places that are left it's better to make it explicit that we're dealing with
64 bit values here.

Signed-off-by: Roman Zippel <zippel@linux-m68k.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: John Stultz <johnstul@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-26 08:57:03 -08:00
Roman Zippel
b75f7a51ca [PATCH] hrtimers: remove state field
Remove the state field and encode this information in the rb_node similiar to
normal timer.

Signed-off-by: Roman Zippel <zippel@linux-m68k.org>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-26 08:57:02 -08:00
Roman Zippel
432569bb9d [PATCH] hrtimers: simplify nanosleep
nanosleep is the only user of the expired state, so let it manage this itself,
which makes the hrtimer code a bit simpler.  The remaining time is also only
calculated if requested.

Signed-off-by: Roman Zippel <zippel@linux-m68k.org>
Acked-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-26 08:57:02 -08:00
Roman Zippel
3b98a53281 [PATCH] hrtimers: posix-timer: cleanup common_timer_get()
Cleanup common_timer_get() a little.

Signed-off-by: Roman Zippel <zippel@linux-m68k.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-26 08:57:02 -08:00
Roman Zippel
44f2147551 [PATCH] hrtimers: pass current time to hrtimer_forward()
Pass current time to hrtimer_forward().  This allows to use the softirq time
in the timer base when the forward function is called from the timer callback.
 Other places pass current time with a call to timer->base->get_time().

Signed-off-by: Roman Zippel <zippel@linux-m68k.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-26 08:57:02 -08:00
Thomas Gleixner
92127c7a45 [PATCH] hrtimers: optimize softirq runqueues
The hrtimer softirq is called from the timer softirq every tick.  Retrieve the
current time from xtime and wall_to_monotonic instead of calling
base->get_time() for each timer base.  Store the time in the base structure
and provide a hook once clock source abstractions are in place and to keep the
code open for new base clocks.

Based on a patch from: Roman Zippel <zippel@linux-m68k.org>

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-26 08:57:02 -08:00
Stephen Rothwell
3158e9411a [PATCH] consolidate sys32/compat_adjtimex
Create compat_sys_adjtimex and use it an all appropriate places.

Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Arnd Bergmann <arnd@arndb.de>
Acked-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-26 08:56:57 -08:00
Con Kolivas
e655a250d5 [PATCH] swswsup: return correct load_image error
If there's an error in load_image() we should return that without checking
snapshot_image_loaded.

Signed-off-by: Con Kolivas <kernel@kolivas.org>
Acked-by: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-26 08:56:55 -08:00
Ingo Molnar
cd7b24bb18 [PATCH] warn if free_irq() is called from IRQ context
Warn if free_irq() is called in IRQ context - free_irq() can execute /proc
VFS work, which must not be done in IRQ context.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-26 08:56:53 -08:00
Eric Sesterhenn
910dea7fdd BUG_ON() Conversion in kernel/fork.c
this changes if() BUG(); constructs to BUG_ON() which is
cleaner, contains unlikely() and can better optimized away.

Signed-off-by: Eric Sesterhenn <snakebyte@gmx.de>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
2006-03-26 18:29:26 +02:00
Linus Torvalds
1b9a391736 Merge branch 'audit.b3' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/audit-current
* 'audit.b3' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/audit-current: (22 commits)
  [PATCH] fix audit_init failure path
  [PATCH] EXPORT_SYMBOL patch for audit_log, audit_log_start, audit_log_end and audit_format
  [PATCH] sem2mutex: audit_netlink_sem
  [PATCH] simplify audit_free() locking
  [PATCH] Fix audit operators
  [PATCH] promiscuous mode
  [PATCH] Add tty to syscall audit records
  [PATCH] add/remove rule update
  [PATCH] audit string fields interface + consumer
  [PATCH] SE Linux audit events
  [PATCH] Minor cosmetic cleanups to the code moved into auditfilter.c
  [PATCH] Fix audit record filtering with !CONFIG_AUDITSYSCALL
  [PATCH] Fix IA64 success/failure indication in syscall auditing.
  [PATCH] Miscellaneous bug and warning fixes
  [PATCH] Capture selinux subject/object context information.
  [PATCH] Exclude messages by message type
  [PATCH] Collect more inode information during syscall processing.
  [PATCH] Pass dentry, not just name, in fsnotify creation hooks.
  [PATCH] Define new range of userspace messages.
  [PATCH] Filter rule comparators
  ...

Fixed trivial conflict in security/selinux/hooks.c
2006-03-25 09:24:53 -08:00
Linus Torvalds
1e8c573933 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bunk/trivial
* git://git.kernel.org/pub/scm/linux/kernel/git/bunk/trivial: (21 commits)
  BUG_ON() Conversion in drivers/video/
  BUG_ON() Conversion in drivers/parisc/
  BUG_ON() Conversion in drivers/block/
  BUG_ON() Conversion in sound/sparc/cs4231.c
  BUG_ON() Conversion in drivers/s390/block/dasd.c
  BUG_ON() Conversion in lib/swiotlb.c
  BUG_ON() Conversion in kernel/cpu.c
  BUG_ON() Conversion in ipc/msg.c
  BUG_ON() Conversion in block/elevator.c
  BUG_ON() Conversion in fs/coda/
  BUG_ON() Conversion in fs/binfmt_elf_fdpic.c
  BUG_ON() Conversion in input/serio/hil_mlc.c
  BUG_ON() Conversion in md/dm-hw-handler.c
  BUG_ON() Conversion in md/bitmap.c
  The comment describing how MS_ASYNC works in msync.c is confusing
  rcu: undeclared variable used in documentation
  fix typos "wich" -> "which"
  typo patch for fs/ufs/super.c
  Fix simple typos
  tabify drivers/char/Makefile
  ...
2006-03-25 08:41:09 -08:00
Roman Zippel
5ddcfa878d [PATCH] remove pps support
This removes the support for pps.  It's completely unused within the kernel
and is basically in the way for further cleanups.  It should be easier to
readd proper support for it after the rest has been converted to NTP4
(where the pps mechanisms are quite different from NTP3 anyway).

Signed-off-by: Roman Zippel <zippel@linux-m68k.org>
Cc: Adrian Bunk <bunk@stusta.de>
Cc: john stultz <johnstul@us.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-25 08:23:02 -08:00
Dimitri Sivanich
f516342745 [PATCH] Add SA_PERCPU_IRQ flag support
Add support for SA_PERCPU_IRQ (only mmtimer.c uses this at this stage).

Signed-off-by: Dimitri Sivanich <sivanich@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-25 08:23:01 -08:00
Eric Dumazet
d74beb9f33 [PATCH] Use unsigned int types for a faster bsearch
This patch avoids arithmetic on 'signed' types that are slower than
'unsigned'.  This saves space and cpu cycles.

size of kernel/sys.o before the patch (gcc-3.4.5)

    text    data     bss     dec     hex filename
   10924     252       4   11180    2bac kernel/sys.o

size of kernel/sys.o after the patch
    text    data     bss     dec     hex filename
   10903     252       4   11159    2b97 kernel/sys.o

I noticed that gcc-4.1.0 (from Fedora Core 5) even uses idiv instruction for
(a+b)/2 if a and b are signed.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-25 08:23:01 -08:00
Ashok Raj
34f361ade2 [PATCH] Check if cpu can be onlined before calling smp_prepare_cpu()
- Moved check for online cpu out of smp_prepare_cpu()

- Moved default declaration of smp_prepare_cpu() to kernel/cpu.c

- Removed lock_cpu_hotplug() from smp_prepare_cpu() to around it, since
  its called from cpu_up() as well now.

- Removed clearing from cpu_present_map during cpu_offline as it breaks
  using cpu_up() directly during a subsequent online operation.

Signed-off-by: Ashok Raj <ashok.raj@intel.com>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: "Li, Shaohua" <shaohua.li@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-25 08:23:01 -08:00
Eric Dumazet
231bed2058 [PATCH] No need to protect current->group_info in sys_getgroups(), in_group_p() and in_egroup_p()
While doing some benchmarks of an Apache/PHP SMP server, I noticed high
oprofile numbers in in_group_p() and _atomic_dec_and_lock().

rank  percent
  1     4.8911 % __link_path_walk
  2     4.8503 % __d_lookup
*3     4.2911 % _atomic_dec_and_lock
  4     3.9307 % __copy_to_user_ll
  5     4.9004 % sysenter_past_esp
*6     3.3248 % in_group_p

It appears that in_group_p() does an uncessary

get_group_info(current->group_info); /* atomic_inc() */
  ... /* access current->group_info */
put_group_info(current->group_info); /* _atomic_dec_and_lock */

It is not necessary to do this, because the current task holds a reference
on its own group_info, and this reference cannot change during the lookup.

This patch deletes the get_group_info()/put_group_info() pair from
sys_getgroups(), in_group_p() and in_egroup_p() functions.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Cc: Tim Hockin <thockin@hockin.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-25 08:22:58 -08:00
Andrew Morton
05eeae208d [PATCH] find_task_by_pid() needs tasklist_lock
A couple of places are forgetting to take it.

The kswapd case is probably unimportant.  keventd_create_kthread() was racy.

The whole thing is a bit flakey: you start a kernel thread, get its pid from
kernel_thread() then look up its task_struct.

a) It assumes that pid recycling takes a "long" time.

b) We get a task_struct but no reference was taken on it.  The owner of the
   kswapd and kthread task_struct*'s must assume that the new thread won't
   exit unexpectedly.  Because if it does, they're left holding dead memory
   and any attempt to control or stop that task will crash.

Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-25 08:22:57 -08:00
Chris Wright
12b5989be1 [PATCH] refactor capable() to one implementation, add __capable() helper
Move capable() to kernel/capability.c and eliminate duplicate
implementations.  Add __capable() function which can be used to check for
capabiilty of any process.

Signed-off-by: Chris Wright <chrisw@sous-sol.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-25 08:22:56 -08:00
Bryan Holty
501f2499b8 [PATCH] IRQ: prevent enabling of previously disabled interrupt
This fix prevents re-disabling and enabling of a previously disabled
interrupt.  On an SMP system with irq balancing enabled; If an interrupt is
disabled from within its own interrupt context with disable_irq_nosync and is
also earmarked for processor migration, the interrupt is blindly moved to the
other processor and enabled without regard for its current "enabled" state.
If there is an interrupt pending, it will unexpectedly invoke the irq handler
on the new irq owning processor (even though the irq was previously disabled)

The more intuitive fix would be to invoke disable_irq_nosync and
enable_irq, but since we already have the desc->lock from __do_IRQ, we
cannot call them directly.  Instead we can use the same logic to disable
and enable found in disable_irq_nosync and enable_irq, with regards to the
desc->depth.

This now prevents a disabled interrupt from being re-disabled, and more
importantly prevents a disabled interrupt from being incorrectly enabled on
a different processor.

Signed-off-by: Bryan Holty <lgeek@frontiernet.net>
Cc: Andi Kleen <ak@muc.de>
Cc: "Luck, Tony" <tony.luck@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-25 08:22:55 -08:00
Andrew Morton
c777ac5594 [PATCH] irq: uninline migration functions
Uninline some massive IRQ migration functions.  Put them in the new
kernel/irq/migration.c.

Cc: Andi Kleen <ak@muc.de>
Cc: "Luck, Tony" <tony.luck@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-25 08:22:55 -08:00
Adrian Bunk
9871728b75 [PATCH] kernel/params.c: make param_array() static
param_array() in kernel/params.c can now become static.

Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-25 08:22:52 -08:00
Rusty Russell
8d3b33f67f [PATCH] Remove MODULE_PARM
MODULE_PARM was actually breaking: recent gcc version optimize them out as
unused.  It's time to replace the last users, which are generally in the
most unloved drivers anyway.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-25 08:22:52 -08:00
Thomas Gleixner
7d99b7d634 [PATCH] Validate and sanitze itimer timeval from userspace
According to the specification the timevals must be validated and an
errorcode -EINVAL returned in case the timevals are not in canonical form.
This check was never done in Linux.

The pre 2.6.16 code converted invalid timevals silently.  Negative timeouts
were converted by the timeval_to_jiffies conversion to the maximum timeout.

hrtimers and the ktime_t operations expect timevals in canonical form.
Otherwise random results might happen on 32 bits machines due to the
optimized ktime_add/sub operations.  Negative timeouts are treated as
already expired.  This might break applications which work on pre 2.6.16.

To prevent random behaviour and API breakage the timevals are checked and
invalid timevals sanitized in a simliar way as the pre 2.6.16 code did.

Invalid timevals are reported with a per boot limited number of kernel
messages so applications which use this misfeature can be corrected.

After a grace period of one year the sanitizing should be replaced by a
correct validation check.  This is also documented in
Documentation/feature-removal-schedule.txt

The validation and sanitizing is done inside do_setitimer so all callers
(sys_setitimer, compat_sys_setitimer, osf_setitimer) are catched.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-25 08:22:49 -08:00
Thomas Gleixner
c08b8a4910 [PATCH] sys_alarm() unsigned signed conversion fixup
alarm() calls the kernel with an unsigend int timeout in seconds.  The
value is stored in the tv_sec field of a struct timeval to setup the
itimer.  The tv_sec field of struct timeval is of type long, which causes
the tv_sec value to be negative on 32 bit machines if seconds > INT_MAX.

Before the hrtimer merge (pre 2.6.16) such a negative value was converted
to the maximum jiffies timeout by the timeval_to_jiffies conversion.  It's
not clear whether this was intended or just happened to be done by the
timeval_to_jiffies code.

hrtimers expect a timeval in canonical form and treat a negative timeout as
already expired.  This breaks the legitimate usage of alarm() with a
timeout value > INT_MAX seconds.

For 32 bit machines it is therefor necessary to limit the internal seconds
value to avoid API breakage.  Instead of doing this in all implementations
of sys_alarm the duplicated sys_alarm code is moved into a common function
in itimer.c

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-25 08:22:48 -08:00
Andrew Morton
185ae6d7a3 [PATCH] timer irq driven soft watchdog fix
I seem to have lost this hunk in yesterday's patch.  It brings the
coming-online CPU's softlockup timer up to date so we don't get false-positive
tripups during CPU hot-add.

Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-25 08:22:48 -08:00
Eric Sesterhenn
6978c7052f BUG_ON() Conversion in kernel/cpu.c
this changes if() BUG(); constructs to BUG_ON() which is
cleaner, contains unlikely() and can better optimized away.

Signed-off-by: Eric Sesterhenn <snakebyte@gmx.de>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
2006-03-24 18:45:21 +01:00
Andrew Morton
f125b56113 [PATCH] fix build error if CONFIG_SYSFS=n
uevent_seqnum and uevent_helper are only defined if CONFIG_HOTPLUG=y,
CONFIG_NET=n.

(I stole this back from Greg's tree - it makes allnoconfig work).

Cc: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-24 07:33:31 -08:00
Davi Arnaut
24277dda3a [PATCH] strndup_user: convert module
Change hand-coded userspace string copying to strndup_user.

Signed-off-by: Davi Arnaut <davi.arnaut@gmail.com>
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-24 07:33:31 -08:00
Ingo Molnar
6687a97d40 [PATCH] timer-irq-driven soft-watchdog, cleanups
Make the softlockup detector purely timer-interrupt driven, removing
softirq-context (timer) dependencies.  This means that if the softlockup
watchdog triggers, it has truly observed a longer than 10 seconds
scheduling delay of a SCHED_FIFO prio 99 task.

(the patch also turns off the softlockup detector during the initial bootup
phase and does small style fixes)

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-24 07:33:30 -08:00
Sergey Vlasov
6a4d11c2ab [PATCH] Fix module refcount leak in __set_personality()
If the change of personality does not lead to change of exec domain,
__set_personality() returned without releasing the module reference
acquired by lookup_exec_domain().

Signed-off-by: Sergey Vlasov <vsu@altlinux.ru>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-24 07:33:30 -08:00
Andrew Morton
d3561f78fd [PATCH] RLIMIT_CPU: document wrong return value
Document the fact that setrlimit(RLIMIT_CPU) doesn't return error codes when
it should.  I don't think we can fix this without a 2.7.x..

Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Ulrich Weigand <uweigand@de.ibm.com>
Cc: Cliff Wickman <cpw@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-24 07:33:30 -08:00
Andrew Morton
e0661111e5 [PATCH] RLIMIT_CPU: fix handling of a zero limit
At present the kernel doesn't honour an attempt to set RLIMIT_CPU to zero
seconds.  But the spec says it should, and that's what 2.4.x does.

Fixing this for real would involve some complexity (such as adding a new
it-has-been-set flag to the task_struct, and testing that everwhere, instead
of overloading the value of it_prof_expires).

Given that a 2.4 kernel won't actually send the signal until one second has
expired anyway, let's just handle this case by treating the caller's
zero-seconds as one second.

Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Ulrich Weigand <uweigand@de.ibm.com>
Cc: Cliff Wickman <cpw@sgi.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-24 07:33:30 -08:00
Andrew Morton
ec9e16bacd [PATCH] sys_setrlimit() cleanup
- Whitespace cleanups

- Make that expression comprehensible.

There's a potential logic change here: we do the "is it_prof_expires equal to
zero" test after converting it to seconds, rather than doing the comparison
between raw cputime_t's.

But given that it's in units of seconds anyway, that shouldn't change
anything.

Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Ulrich Weigand <uweigand@de.ibm.com>
Cc: Cliff Wickman <cpw@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-24 07:33:30 -08:00
John Z. Bohach
2ea1c5392c [PATCH] console_setup() depends (wrongly?) on CONFIG_PRINTK
It appears that console_setup() code only gets compiled into the kernel if
CONFIG_PRINTK is enabled.  One detrimental side-effect of this is that
serial8250_console_setup() never gets invoked when CONFIG_PRINTK is not
set, resulting in baud rate not being read/parsed from command line (i.e.
console=ttyS0,115200n8 is ignored, at least the baud rate part...)

Attached patch moves console_setup() code from inside

#ifdef CONFIG_PRINTK

to outside (in printk.c), removing dependence on said config. option.

Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-24 07:33:27 -08:00
Paul Jackson
29afd49b72 [PATCH] cpuset: remove useless local variable initialization
Remove a useless variable initialization in cpuset __cpuset_zone_allowed().
 The local variable 'allowed' is unconditionally set before use, later on
in the code, so does not need to be initialized.

Not that it seems to matter to the code generated any, as the compiler
optimizes out the superfluous assignment anyway.

Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-24 07:33:24 -08:00
Paul Jackson
151a44202d [PATCH] cpuset: don't need to mark cpuset_mems_generation atomic
Drop the atomic_t marking on the cpuset static global
cpuset_mems_generation.  Since all access to it is guarded by the global
manage_mutex, there is no need for further serialization of this value.

Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-24 07:33:24 -08:00
Paul Jackson
8488bc359d [PATCH] cpuset: remove unnecessary NULL check
Remove a no longer needed test for NULL cpuset pointer, with a little
comment explaining why the test isn't needed.

Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-24 07:33:23 -08:00
Paul Jackson
c61afb181c [PATCH] cpuset memory spread slab cache optimizations
The hooks in the slab cache allocator code path for support of NUMA
mempolicies and cpuset memory spreading are in an important code path.  Many
systems will use neither feature.

This patch optimizes those hooks down to a single check of some bits in the
current tasks task_struct flags.  For non NUMA systems, this hook and related
code is already ifdef'd out.

The optimization is done by using another task flag, set if the task is using
a non-default NUMA mempolicy.  Taking this flag bit along with the
PF_SPREAD_PAGE and PF_SPREAD_SLAB flag bits added earlier in this 'cpuset
memory spreading' patch set, one can check for the combination of any of these
special case memory placement mechanisms with a single test of the current
tasks task_struct flags.

This patch also tightens up the code, to save a few bytes of kernel text
space, and moves some of it out of line.  Due to the nested inlines called
from multiple places, we were ending up with three copies of this code, which
once we get off the main code path (for local node allocation) seems a bit
wasteful of instruction memory.

Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-24 07:33:23 -08:00
Paul Jackson
825a46af5a [PATCH] cpuset memory spread basic implementation
This patch provides the implementation and cpuset interface for an alternative
memory allocation policy that can be applied to certain kinds of memory
allocations, such as the page cache (file system buffers) and some slab caches
(such as inode caches).

The policy is called "memory spreading." If enabled, it spreads out these
kinds of memory allocations over all the nodes allowed to a task, instead of
preferring to place them on the node where the task is executing.

All other kinds of allocations, including anonymous pages for a tasks stack
and data regions, are not affected by this policy choice, and continue to be
allocated preferring the node local to execution, as modified by the NUMA
mempolicy.

There are two boolean flag files per cpuset that control where the kernel
allocates pages for the file system buffers and related in kernel data
structures.  They are called 'memory_spread_page' and 'memory_spread_slab'.

If the per-cpuset boolean flag file 'memory_spread_page' is set, then the
kernel will spread the file system buffers (page cache) evenly over all the
nodes that the faulting task is allowed to use, instead of preferring to put
those pages on the node where the task is running.

If the per-cpuset boolean flag file 'memory_spread_slab' is set, then the
kernel will spread some file system related slab caches, such as for inodes
and dentries evenly over all the nodes that the faulting task is allowed to
use, instead of preferring to put those pages on the node where the task is
running.

The implementation is simple.  Setting the cpuset flags 'memory_spread_page'
or 'memory_spread_cache' turns on the per-process flags PF_SPREAD_PAGE or
PF_SPREAD_SLAB, respectively, for each task that is in the cpuset or
subsequently joins that cpuset.  In subsequent patches, the page allocation
calls for the affected page cache and slab caches are modified to perform an
inline check for these flags, and if set, a call to a new routine
cpuset_mem_spread_node() returns the node to prefer for the allocation.

The cpuset_mem_spread_node() routine is also simple.  It uses the value of a
per-task rotor cpuset_mem_spread_rotor to select the next node in the current
tasks mems_allowed to prefer for the allocation.

This policy can provide substantial improvements for jobs that need to place
thread local data on the corresponding node, but that need to access large
file system data sets that need to be spread across the several nodes in the
jobs cpuset in order to fit.  Without this patch, especially for jobs that
might have one thread reading in the data set, the memory allocation across
the nodes in the jobs cpuset can become very uneven.

A couple of Copyright year ranges are updated as well.  And a couple of email
addresses that can be found in the MAINTAINERS file are removed.

Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-24 07:33:22 -08:00
Paul Jackson
8a39cc60bf [PATCH] cpuset use combined atomic_inc_return calls
Replace pairs of calls to <atomic_inc, atomic_read>, with a single call
atomic_inc_return, saving a few bytes of source and kernel text.

Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-24 07:33:22 -08:00
Paul Jackson
7b5b9ef0e1 [PATCH] cpuset cleanup not not operators
Since the test_bit() bit operator is boolean (return 0 or 1), the double not
"!!" operations needed to convert a scalar (zero or not zero) to a boolean are
not needed.

Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-24 07:33:22 -08:00
Paul E. McKenney
95c3832272 [PATCH] rcutorture: tag success/failure line with module parameters
A long-running rcutorture test can overflow dmesg, so that the line
containing the module parameters is lost.  Although it is usually possible
to retrieve this information from the log files, it is much better to just
tag it onto the final success/failure line so that it may be easily found.
This patch does just that.

Signed-off-by: "Paul E. McKenney" <paulmck@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-24 07:33:22 -08:00
Jan Beulich
a4a6198b80 [PATCH] tvec_bases too large for per-cpu data
With internal Xen-enabled kernels we see the kernel's static per-cpu data
area exceed the limit of 32k on x86-64, and even native x86-64 kernels get
fairly close to that limit.  I generally question whether it is reasonable
to have data structures several kb in size allocated as per-cpu data when
the space there is rather limited.

The biggest arch-independent consumer is tvec_bases (over 4k on 32-bit
archs, over 8k on 64-bit ones), which now gets converted to use dynamically
allocated memory instead.

Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-24 07:33:21 -08:00
Oleg Nesterov
caa9ee771d [PATCH] rcu_process_callbacks: don't cli() while testing ->nxtlist
__rcu_process_callbacks() disables interrupts to protect itself from
call_rcu() which adds new entries to ->nxtlist.

However we can check "->nxtlist != NULL" with interrupts enabled, we can't
get "false positives" because call_rcu() can only change this condition
from 0 to 1.

Tested with rcutorture.ko.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Acked-by: Dipankar Sarma <dipankar@in.ibm.com>
Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-24 07:33:20 -08:00
Bart Samwel
cba9f33d13 [PATCH] Range checking in do_proc_dointvec_(userhz_)jiffies_conv
When (integer) sysctl values are in either seconds or centiseconds, but
represented internally as jiffies, the allowable value range is decreased.
This patch adds range checks to the conversion routines.

For values in seconds: maximum LONG_MAX / HZ.

For values in centiseconds: maximum (LONG_MAX / HZ) * USER_HZ.

(BTW, does anyone else feel that an interface in seconds should not be
accepting negative values?)

Signed-off-by: Bart Samwel <bart@samwel.tk>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-24 07:33:20 -08:00
Bart Samwel
ed5b43f15a [PATCH] Represent laptop_mode as jiffies internally
Make that the internal value for /proc/sys/vm/laptop_mode is stored as
jiffies instead of seconds.  Let the sysctl interface do the conversions,
instead of doing on-the-fly conversions every time the value is used.

Add a description of the fact that laptop_mode doubles as a flag and a
timeout to the comment above the laptop_mode variable.

Signed-off-by: Bart Samwel <bart@samwel.tk>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-24 07:33:20 -08:00
Bart Samwel
f6ef943813 [PATCH] Represent dirty_*_centisecs as jiffies internally
Make that the internal values for:

/proc/sys/vm/dirty_writeback_centisecs
/proc/sys/vm/dirty_expire_centisecs

are stored as jiffies instead of centiseconds.  Let the sysctl interface do
the conversions with full precision using clock_t_to_jiffies, instead of
doing overflow-sensitive on-the-fly conversions every time the values are
used.

Cons: apparent precision loss if HZ is not a multiple of 100, because of
conversion back and forth.  This is a common problem for all sysctl values
that use proc_dointvec_userhz_jiffies.  (There is only one other in-tree
use, in net/core/neighbour.c.)

Signed-off-by: Bart Samwel <bart@samwel.tk>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-24 07:33:20 -08:00
Andrew Morton
36f574135e [PATCH] free_uid() locking improvement
Reduce lock hold times in free_uid().

Cc: Ingo Molnar <mingo@elte.hu>
Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-24 07:33:20 -08:00
Jens Axboe
2056a782f8 [PATCH] Block queue IO tracing support (blktrace) as of 2006-03-23
Signed-off-by: Jens Axboe <axboe@suse.de>
2006-03-23 20:00:26 +01:00
Tom Zanussi
6dac40a7ce [PATCH] relay: consolidate sendfile() and read() code
Signed-off-by: Jens Axboe <axboe@suse.de>
2006-03-23 19:58:45 +01:00
Jens Axboe
221415d762 [PATCH] relay: add sendfile() support
Signed-off-by: Jens Axboe <axboe@suse.de>
2006-03-23 19:57:55 +01:00
Jens Axboe
b86ff981a8 [PATCH] relay: migrate from relayfs to a generic relay API
Original patch from Paul Mundt, sysfs parts removed by me since they
were broken.

Signed-off-by: Jens Axboe <axboe@suse.de>
2006-03-23 19:56:55 +01:00
Adrian Bunk
2178426d26 [PATCH] kernel/rcupdate.c: make two structs static
This patch makes two needlessly global structs static.

Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-23 07:38:17 -08:00
Oleg Nesterov
ee25e96fcd [PATCH] BUILD_LOCK_OPS: cleanup preempt_disable() usage
This patch changes the code from:

	preempt_disable();
	for (;;) {
		...
		preempt_disable();
	}
to:
	for (;;) {
		preempt_disable();
		...
	}

which seems more clean to me and saves a couple of bytes for
each function.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-23 07:38:16 -08:00
Andrew Morton
dd287796d6 [PATCH] pause_on_oops command line option
Attempt to fix the problem wherein people's oops reports scroll off the screen
due to repeated oopsing or to oopses on other CPUs.

If this happens the user can reboot with the `pause_on_oops=<seconds>' option.
It will allow the first oopsing CPU to print an oops record just a single
time.  Second oopsing attempts, or oopses on other CPUs will cause those CPUs
to enter a tight loop until the specified number of seconds have elapsed.

The patch implements the infrastructure generically in the expectation that
architectures other than x86 will find it useful.

Cc: Dave Jones <davej@codemonkey.org.uk>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-23 07:38:16 -08:00
Ingo Molnar
91368d73e4 [PATCH] make bug messages more consistent
Consolidate all kernel bug printouts to begin with the "BUG: " string.
Makes it easier to find them in large bootup logs.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-23 07:38:16 -08:00
Oleg Nesterov
a26fd335b4 [PATCH] sigprocmask: kill unneeded temp var
Cleanup, remove unneeded double copying of current->blocked.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-23 07:38:15 -08:00
Ashutosh Naik
6389a38511 [PATCH] kernel/module.c Semaphore to Mutex Conversion for module_mutex
This patch converts the module_mutex semaphore to a mutex.

Signed-off-by: Ashutosh Naik <ashutosh.naik@gmail.com>
Cc: Arjan van de Ven <arjan@infradead.org>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-23 07:38:14 -08:00
Ingo Molnar
7a7d1cf954 [PATCH] sem2mutex: kprobes
Semaphore to mutex conversion.

The conversion was generated via scripts, and the result was validated
automatically via a script as well.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-23 07:38:12 -08:00
Ingo Molnar
70522e121a [PATCH] sem2mutex: tty
Semaphore to mutex conversion.

The conversion was generated via scripts, and the result was validated
automatically via a script as well.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: Russell King <rmk@arm.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-23 07:38:11 -08:00
Arjan van de Ven
97d1f15b7e [PATCH] sem2mutex: kernel/
Semaphore to mutex conversion.

The conversion was generated via scripts, and the result was validated
automatically via a script as well.

Signed-off-by: Arjan van de Ven <arjan@infradead.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-23 07:38:10 -08:00
Ingo Molnar
9331b3157c [PATCH] convert kernel/rcupdate.c:rcu_barrier_sema to mutex
Convert kernel/rcupdate's rcu_barrier_sema to mutex.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Acked-by: "Paul E. McKenney" <paulmck@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-23 07:38:10 -08:00
Ingo Molnar
3d3f26a7ba [PATCH] kernel/cpuset.c, mutex conversion
convert cpuset.c's callback_sem and manage_sem to mutexes.
Build and boot tested by Ingo.
Build, boot, unit and stress tested by pj.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-23 07:38:10 -08:00
Ravikiran G Thirumalai
2dd0ebcd2a [PATCH] Avoid taking global tasklist_lock for single threadedprocess at getrusage()
Avoid taking the global tasklist_lock when possible, if a process is single
threaded during getrusage().  Any avoidance of tasklist_lock is good for
NUMA boxes (and possibly for large SMPs).  Thanks to Oleg Nesterov for
review and suggestions.

Signed-off-by: Nippun Goel <nippung@calsoftinc.com>
Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org>
Signed-off-by: Shai Fultheim <shai@scalex86.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-23 07:38:09 -08:00
Eric Dumazet
0c9e63fd38 [PATCH] Shrinks sizeof(files_struct) and better layout
1) Reduce the size of (struct fdtable) to exactly 64 bytes on 32bits
   platforms, lowering kmalloc() allocated space by 50%.

2) Reduce the size of (files_struct), using a special 32 bits (or
   64bits) embedded_fd_set, instead of a 1024 bits fd_set for the
   close_on_exec_init and open_fds_init fields.  This save some ram (248
   bytes per task) as most tasks dont open more than 32 files.  D-Cache
   footprint for such tasks is also reduced to the minimum.

3) Reduce size of allocated fdset.  Currently two full pages are
   allocated, that is 32768 bits on x86 for example, and way too much.  The
   minimum is now L1_CACHE_BYTES.

UP and SMP should benefit from this patch, because most tasks will touch
only one cache line when open()/close() stdin/stdout/stderr (0/1/2),
(next_fd, close_on_exec_init, open_fds_init, fd_array[0 ..  2] being in the
same cache line)

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-23 07:38:09 -08:00
Luca Tettamanti
9b238205ba [PATCH] swsusp: add s2ram ioctl to userland interface
Add the SNAPSHOT_S2RAM ioctl to the snapshot device.

This ioctl allows a userland application to make the system (previously frozen
with the SNAPSHOT_FREE ioctl) enter the S3 state without freezing processes
and disabling nonboot CPUs for the second time.

This will allow us to implement the suspend-to-disk-and-RAM (STDR)
functionality in the userland suspend tools.

Signed-off-by: Luca Tettamanti <kronos.it@gmail.com>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-23 07:38:08 -08:00
Rafael J. Wysocki
94c188d329 [PATCH] swsusp: let userland tools switch console on suspend
Remove the console-switching code from the suspend part of the swsusp userland
interface and let the userland tools switch the console.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@suse.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-23 07:38:08 -08:00
Shaohua Li
e4e4d66556 [PATCH] swsusp: drain high mem pages
Highmem could be in pcp list as well.

Signed-off-by: Shaohua Li<shaohua.li@intel.com>
Acked-by: Pavel Machek <pavel@ucw.cz>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-23 07:38:08 -08:00
Rafael J. Wysocki
fc558a7496 [PATCH] swsusp: finally solve mysqld problem
This patch from Pavel moves userland freeze signals handling into more logical
place.  It now hits even with mysqld running.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Pavel Machek <pavel@suse.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-23 07:38:08 -08:00
Pavel Machek
ce6ed29f31 [PATCH] suspend: make progress printing prettier
Combination of printk/pr_debug led to <7> in the middle of the line, and we
printed way too many dots.

Signed-off-by: Pavel Machek <pavel@suse.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-23 07:38:08 -08:00
Rafael J. Wysocki
02aaeb9b95 [PATCH] swsusp: freeze user space processes first
Allow swsusp to freeze processes successfully under heavy load by freezing
userspace processes before kernel threads.

[Thanks to Nigel Cunningham <nigel@suspend2.net> for suggesting the
way to go.]

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@suse.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-23 07:38:08 -08:00
Rafael J. Wysocki
6e1819d615 [PATCH] swsusp: userland interface
This patch introduces a user space interface for swsusp.

The interface is based on a special character device, called the snapshot
device, that allows user space processes to perform suspend and resume-related
operations with the help of some ioctls and the read()/write() functions.
 Additionally it allows these processes to allocate free swap pages from a
selected swap partition, called the resume partition, so that they know which
sectors of the resume partition are available to them.

The interface uses the same low-level system memory snapshot-handling
functions that are used by the built-it swap-writing/reading code of swsusp.

The interface documentation is included in the patch.

The patch assumes that the major and minor numbers of the snapshot device will
be 10 (ie.  misc device) and 231, the registration of which has already been
requested.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-23 07:38:07 -08:00
Pavel Machek
543cc27d09 [PATCH] swsusp: documentation updates
Update suspend-to-RAM documentation with new machines, and makes message
when processes can't be stopped little clearer.  (In one case, waiting
longer actually did help).

From: "Rafael J. Wysocki" <rjw@sisk.pl>

  Warn in the documentation that data may be lost if there are some
  filesystems mounted from USB devices before suspend.

  [Thanks to Alan Stern for providing the answer to the question in the
  Q:-A: part.]

Signed-off-by: Pavel Machek <pavel@suse.cz>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-23 07:38:07 -08:00
Randy Dunlap
74c7e2efbe [PATCH] kernel/power: move externs to header files
Move externs from C source files to header files.

Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-23 07:38:07 -08:00
Rafael J. Wysocki
61159a314b [PATCH] swsusp: separate swap-writing/reading code
Move the swap-writing/reading code of swsusp to a separate file.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-23 07:38:07 -08:00
Rafael J. Wysocki
f577eb30af [PATCH] swsusp: low level interface
Introduce the low level interface that can be used for handling the
snapshot of the system memory by the in-kernel swap-writing/reading code of
swsusp and the userland interface code (to be introduced shortly).

Also change the way in which swsusp records the allocated swap pages and,
consequently, simplifies the in-kernel swap-writing/reading code (this is
necessary for the userland interface too).  To this end, it introduces two
helper functions in mm/swapfile.c, so that the swsusp code does not refer
directly to the swap internals.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-23 07:38:07 -08:00
Andrew Morton
2b322ce210 [PATCH] revert "swsusp: fix breakage with swap on lvm"
This was a temporary thing for 2.6.16.

Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-23 07:38:07 -08:00
Anton Blanchard
e9028b0ff2 [PATCH] fix scheduler deadlock
We have noticed lockups during boot when stress testing kexec on ppc64.
Two cpus would deadlock in scheduler code trying to grab already taken
spinlocks.

The double_rq_lock code uses the address of the runqueue to order the
taking of multiple locks.  This address is a per cpu variable:

	if (rq1 < rq2) {
		spin_lock(&rq1->lock);
		spin_lock(&rq2->lock);
	} else {
		spin_lock(&rq2->lock);
		spin_lock(&rq1->lock);
	}

On the other hand, the code in wake_sleeping_dependent uses the cpu id
order to grab locks:

	for_each_cpu_mask(i, sibling_map)
		spin_lock(&cpu_rq(i)->lock);

This means we rely on the address of per cpu data increasing as cpu ids
increase.  While this will be true for the generic percpu implementation it
may not be true for arch specific implementations.

One way to solve this is to always take runqueues in cpu id order. To do
this we add a cpu variable to the runqueue and check it in the
double runqueue locking functions.

Signed-off-by: Anton Blanchard <anton@samba.org>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-23 07:38:02 -08:00
Linus Torvalds
2e6e33bab6 Merge git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc
* git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc: (78 commits)
  [PATCH] powerpc: Add FSL SEC node to documentation
  [PATCH] macintosh: tidy-up driver_register() return values
  [PATCH] powerpc: tidy-up of_register_driver()/driver_register() return values
  [PATCH] powerpc: via-pmu warning fix
  [PATCH] macintosh: cleanup the use of i2c headers
  [PATCH] powerpc: dont allow old RTC to be selected
  [PATCH] powerpc: make powerbook_sleep_grackle static
  [PATCH] powerpc: Fix warning in add_memory
  [PATCH] powerpc: update mailing list addresses
  [PATCH] powerpc: Remove calculation of io hole
  [PATCH] powerpc: iseries: Add bootargs to /chosen
  [PATCH] powerpc: iseries: Add /system-id, /model and /compatible
  [PATCH] powerpc: Add strne2a() to convert a string from EBCDIC to ASCII
  [PATCH] powerpc: iseries: Make more stuff static in platforms/iseries/mf.c
  [PATCH] powerpc: iseries: Remove pointless iSeries_(restart|power_off|halt)
  [PATCH] powerpc: iseries: mf related cleanups
  [PATCH] powerpc: Replace platform_is_lpar() with a firmware feature
  [PATCH] powerpc: trivial: Cleanup whitespace in cputable.h
  [PATCH] powerpc: Remove unused iommu_off logic from pSeries_init_early()
  [PATCH] powerpc: Unconfuse htab_bolt_mapping() callers
  ...
2006-03-22 22:20:46 -08:00
Linus Torvalds
2152f85366 Merge master.kernel.org:/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6
* master.kernel.org:/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6: (138 commits)
  [SCSI] libata: implement minimal transport template for ->eh_timed_out
  [SCSI] eliminate rphy allocation in favour of expander/end device allocation
  [SCSI] convert mptsas over to end_device/expander allocations
  [SCSI] allow displaying and setting of cache type via sysfs
  [SCSI] add scsi_mode_select to scsi_lib.c
  [SCSI] 3ware 9000 add big endian support
  [SCSI] qla2xxx: update MAINTAINERS
  [SCSI] scsi: move target_destroy call
  [SCSI] fusion - bump version
  [SCSI] fusion - expander hotplug suport in mptsas module
  [SCSI] fusion - exposing raid components in mptsas
  [SCSI] fusion - memory leak, and initializing fields
  [SCSI] fusion - exclosure misspelled
  [SCSI] fusion - cleanup mptsas event handling functions
  [SCSI] fusion - removing target_id/bus_id from the VirtDevice structure
  [SCSI] fusion - static fix's
  [SCSI] fusion - move some debug firmware event debug msgs to verbose level
  [SCSI] fusion - loginfo header update
  [SCSI] add scsi_reprobe_device
  [SCSI] megaraid_sas: fix extended timeout handling
  ...
2006-03-22 10:47:24 -08:00
Andrew Morton
78eef01b0f [PATCH] on_each_cpu(): disable local interrupts
When on_each_cpu() runs the callback on other CPUs, it runs with local
interrupts disabled.  So we should run the function with local interrupts
disabled on this CPU, too.

And do the same for UP, so the callback is run in the same environment on both
UP and SMP.  (strictly it should do preempt_disable() too, but I think
local_irq_disable is sufficiently equivalent).

Also uninlines on_each_cpu().  softirq.c was the most appropriate file I could
find, but it doesn't seem to justify creating a new file.

Oh, and fix up that comment over (under?) x86's smp_call_function().  It
drives me nuts.

Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-22 07:53:59 -08:00
Eric W. Biederman
06f9d4f94a [PATCH] unshare: Error if passed unsupported flags
A bare bones trivial patch to ensure we always get -EINVAL on the
unsupported cases for sys_unshare.  If this goes in before 2.6.16 it allows
us to forward compatible with future applications using sys_unshare.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: JANAK DESAI <janak@us.ibm.com>
Cc: <stable@kerenl.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-22 07:53:55 -08:00
Mike Galbraith
9430d58e34 [PATCH] sched: remove sleep_avg multiplier
Remove the sleep_avg multiplier.  This multiplier was necessary back when
we had 10 seconds of dynamic range in sleep_avg, but now that we only have
one second, it causes that one second to be compressed down to 100ms in
some cases.  This is particularly noticeable when compiling a kernel in a
slow NFS mount, and I believe it to be a very likely candidate for other
recently reported network related interactivity problems.

In testing, I can detect no negative impact of this removal.

Signed-off-by: Mike Galbraith <efault@gmx.de>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-22 07:53:54 -08:00
James Bottomley
d04cdb6421 Merge ../linux-2.6 2006-03-21 13:05:45 -06:00
Greg Kroah-Hartman
03e88ae1b1 [PATCH] fix module sysfs files reference counting
The module files, refcnt, version, and srcversion did not properly
increment the owner's module reference count, allowing the modules to
be removed while the files were open, causing oopses.

This patch fixes this, and also fixes the problem that the version and
srcversion files were not showing up, unless CONFIG_MODULE_UNLOAD was
enabled, which is not correct.

Cc: Nathan Lynch <ntl@pobox.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2006-03-20 13:42:58 -08:00
Greg Kroah-Hartman
01ca70dca5 [PATCH] add EXPORT_SYMBOL_GPL_FUTURE() to RCU subsystem
As the RCU symbols are going to be changed to GPL in the near future,
lets warn users that this is going to happen.

Cc: Paul McKenney <paulmck@us.ibm.com>
Acked-by: Dipankar Sarma <dipankar@in.ibm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2006-03-20 13:42:58 -08:00
Greg Kroah-Hartman
9f28bb7e1d [PATCH] add EXPORT_SYMBOL_GPL_FUTURE()
This patch adds the ability to mark symbols that will be changed in the
future, so that kernel modules that don't include MODULE_LICENSE("GPL")
and use the symbols, will be flagged and printed out to the system log.

Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2006-03-20 13:42:58 -08:00
Sam Ravnborg
3fd6805f4d [PATCH] Clean up module.c symbol searching logic
Signed-off-by: Sam Ravnborg <sam@ravnborg.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2006-03-20 13:42:58 -08:00
Jun'ichi Nomura
51107301b6 [PATCH] kobject: fix build error if CONFIG_SYSFS=n
Moving uevent_seqnum and uevent_helper to kobject_uevent.c
because they are used even if CONFIG_SYSFS=n
while kernel/ksysfs.c is built only if CONFIG_SYSFS=y,

Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2006-03-20 13:42:57 -08:00
Amy Griffis
71e1c784b2 [PATCH] fix audit_init failure path
Make audit_init() failure path handle situations where the audit_panic()
action is not AUDIT_FAIL_PANIC (default is AUDIT_FAIL_PRINTK).  Other uses
of audit_sock are not reached unless audit's netlink message handler is
properly registered.  Bug noticed by Peter Staubach.

Signed-off-by: Amy Griffis <amy.griffis@hp.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-03-20 14:08:55 -05:00
lorenzo@gnu.org
bf45da97a4 [PATCH] EXPORT_SYMBOL patch for audit_log, audit_log_start, audit_log_end and audit_format
Hi,

This is a trivial patch that enables the possibility of using some auditing
functions within loadable kernel modules (ie. inside a Linux Security Module).

_

Make the audit_log_start, audit_log_end, audit_format and audit_log
interfaces available to Loadable Kernel Modules, thus making possible
the usage of the audit framework inside LSMs, etc.

Signed-off-by: <Lorenzo Hernández García-Hierro <lorenzo@gnu.org>>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-03-20 14:08:55 -05:00
Ingo Molnar
5a0bbce58b [PATCH] sem2mutex: audit_netlink_sem
Semaphore to mutex conversion.

The conversion was generated via scripts, and the result was validated
automatically via a script as well.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: David Woodhouse <dwmw2@infradead.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-03-20 14:08:55 -05:00
Ingo Molnar
4023e02080 [PATCH] simplify audit_free() locking
Simplify audit_free()'s locking: no need to lock a task that we are tearing
down.  [the extra locking also caused false positives in the lock
validator]

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: David Woodhouse <dwmw2@infradead.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-03-20 14:08:55 -05:00
Dustin Kirkland
d9d9ec6e2c [PATCH] Fix audit operators
Darrel Goeddel initiated a discussion on IRC regarding the possibility
of audit_comparator() returning -EINVAL signaling an invalid operator.

It is possible when creating the rule to assure that the operator is one
of the 6 sane values.  Here's a snip from include/linux/audit.h  Note
that 0 (nonsense) and 7 (all operators) are not valid values for an
operator.

...

/* These are the supported operators.
 *      4  2  1
 *      =  >  <
 *      -------
 *      0  0  0         0       nonsense
 *      0  0  1         1       <
 *      0  1  0         2       >
 *      0  1  1         3       !=
 *      1  0  0         4       =
 *      1  0  1         5       <=
 *      1  1  0         6       >=
 *      1  1  1         7       all operators
 */
...

Furthermore, prior to adding these extended operators, flagging the
AUDIT_NEGATE bit implied !=, and otherwise == was assumed.

The following code forces the operator to be != if the AUDIT_NEGATE bit
was flipped on.  And if no operator was specified, == is assumed.  The
only invalid condition is if the AUDIT_NEGATE bit is off and all of the
AUDIT_EQUAL, AUDIT_LESS_THAN, and AUDIT_GREATER_THAN bits are
on--clearly a nonsensical operator.

Now that this is handled at rule insertion time, the default -EINVAL
return of audit_comparator() is eliminated such that the function can
only return 1 or 0.

If this is acceptable, let's get this applied to the current tree.

:-Dustin

--

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
(cherry picked from 9bf0a8e137040f87d1b563336d4194e38fb2ba1a commit)
2006-03-20 14:08:55 -05:00
Steve Grubb
a6c043a887 [PATCH] Add tty to syscall audit records
Hi,

>From the RBAC specs:

FAU_SAR.1.1 The TSF shall provide the set of authorized
RBAC administrators with the capability to read the following
audit information from the audit records:

<snip>
(e) The User Session Identifier or Terminal Type

A patch adding the tty for all syscalls is included in this email.
Please apply.

Signed-off-by: Steve Grubb <sgrubb@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-03-20 14:08:55 -05:00
Steve Grubb
5d3301088f [PATCH] add/remove rule update
Hi,

The following patch adds a little more information to the add/remove rule message emitted
by the kernel.

Signed-off-by: Steve Grubb <sgrubb@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-03-20 14:08:55 -05:00
Amy Griffis
93315ed6dd [PATCH] audit string fields interface + consumer
Updated patch to dynamically allocate audit rule fields in kernel's
internal representation.  Added unlikely() calls for testing memory
allocation result.

Amy Griffis wrote:     [Wed Jan 11 2006, 02:02:31PM EST]
> Modify audit's kernel-userspace interface to allow the specification
> of string fields in audit rules.
>
> Signed-off-by: Amy Griffis <amy.griffis@hp.com>

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
(cherry picked from 5ffc4a863f92351b720fe3e9c5cd647accff9e03 commit)
2006-03-20 14:08:54 -05:00
David Woodhouse
d884596f44 [PATCH] Minor cosmetic cleanups to the code moved into auditfilter.c
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
2006-03-20 14:08:54 -05:00
David Woodhouse
fe7752bab2 [PATCH] Fix audit record filtering with !CONFIG_AUDITSYSCALL
This fixes the per-user and per-message-type filtering when syscall
auditing isn't enabled.

[AV: folded followup fix from the same author]

Signed-off-by: David Woodhouse <dwmw2@infradead.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-03-20 14:08:54 -05:00
Dustin Kirkland
7306a0b9b3 [PATCH] Miscellaneous bug and warning fixes
This patch fixes a couple of bugs revealed in new features recently
added to -mm1:
* fixes warnings due to inconsistent use of const struct inode *inode
* fixes bug that prevent a kernel from booting with audit on, and SELinux off
  due to a missing function in security/dummy.c
* fixes a bug that throws spurious audit_panic() messages due to a missing
  return just before an error_path label
* some reasonable house cleaning in audit_ipc_context(),
  audit_inode_context(), and audit_log_task_context()

Signed-off-by: Dustin Kirkland <dustin.kirkland@us.ibm.com>
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
2006-03-20 14:08:54 -05:00
Dustin Kirkland
8c8570fb8f [PATCH] Capture selinux subject/object context information.
This patch extends existing audit records with subject/object context
information. Audit records associated with filesystem inodes, ipc, and
tasks now contain SELinux label information in the field "subj" if the
item is performing the action, or in "obj" if the item is the receiver
of an action.

These labels are collected via hooks in SELinux and appended to the
appropriate record in the audit code.

This additional information is required for Common Criteria Labeled
Security Protection Profile (LSPP).

[AV: fixed kmalloc flags use]
[folded leak fixes]
[folded cleanup from akpm (kfree(NULL)]
[folded audit_inode_context() leak fix]
[folded akpm's fix for audit_ipc_perm() definition in case of !CONFIG_AUDIT]

Signed-off-by: Dustin Kirkland <dustin.kirkland@us.ibm.com>
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-03-20 14:08:54 -05:00
Dustin Kirkland
c8edc80c8b [PATCH] Exclude messages by message type
- Add a new, 5th filter called "exclude".
    - And add a new field AUDIT_MSGTYPE.
    - Define a new function audit_filter_exclude() that takes a message type
      as input and examines all rules in the filter.  It returns '1' if the
      message is to be excluded, and '0' otherwise.
    - Call the audit_filter_exclude() function near the top of
      audit_log_start() just after asserting audit_initialized.  If the
      message type is not to be audited, return NULL very early, before
      doing a lot of work.
[combined with followup fix for bug in original patch, Nov 4, same author]
[combined with later renaming AUDIT_FILTER_EXCLUDE->AUDIT_FILTER_TYPE
and audit_filter_exclude() -> audit_filter_type()]

Signed-off-by: Dustin Kirkland <dustin.kirkland@us.ibm.com>
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-03-20 14:08:54 -05:00
Amy Griffis
73241ccca0 [PATCH] Collect more inode information during syscall processing.
This patch augments the collection of inode info during syscall
processing. It represents part of the functionality that was provided
by the auditfs patch included in RHEL4.

Specifically, it:

- Collects information for target inodes created or removed during
  syscalls.  Previous code only collects information for the target
  inode's parent.

- Adds the audit_inode() hook to syscalls that operate on a file
  descriptor (e.g. fchown), enabling audit to do inode filtering for
  these calls.

- Modifies filtering code to check audit context for either an inode #
  or a parent inode # matching a given rule.

- Modifies logging to provide inode # for both parent and child.

- Protect debug info from NULL audit_names.name.

[AV: folded a later typo fix from the same author]

Signed-off-by: Amy Griffis <amy.griffis@hp.com>
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-03-20 14:08:53 -05:00
Amy Griffis
f38aa94224 [PATCH] Pass dentry, not just name, in fsnotify creation hooks.
The audit hooks (to be added shortly) will want to see dentry->d_inode
too, not just the name.

Signed-off-by: Amy Griffis <amy.griffis@hp.com>
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
2006-03-20 14:08:53 -05:00
Steve Grubb
90d526c074 [PATCH] Define new range of userspace messages.
The attached patch updates various items for the new user space
messages. Please apply.

Signed-off-by: Steve Grubb <sgrubb@redhat.com>
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
2006-03-20 14:08:53 -05:00
Dustin Kirkland
b63862f465 [PATCH] Filter rule comparators
Currently, audit only supports the "=" and "!=" operators in the -F
filter rules.

This patch reworks the support for "=" and "!=", and adds support
for ">", ">=", "<", and "<=".

This turned out to be a pretty clean, and simply process.  I ended up
using the high order bits of the "field", as suggested by Steve and Amy.
This allowed for no changes whatsoever to the netlink communications.
See the documentation within the patch in the include/linux/audit.h
area, where there is a table that explains the reasoning of the bitmask
assignments clearly.

The patch adds a new function, audit_comparator(left, op, right).
This function will perform the specified comparison (op, which defaults
to "==" for backward compatibility) between two values (left and right).
If the negate bit is on, it will negate whatever that result was.  This
value is returned.

Signed-off-by: Dustin Kirkland <dustin.kirkland@us.ibm.com>
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
2006-03-20 14:08:53 -05:00
Randy Dunlap
b0dd25a826 [PATCH] AUDIT: kerneldoc for kernel/audit*.c
- add kerneldoc for non-static functions;
- don't init static data to 0;
- limit lines to < 80 columns;
- fix long-format style;
- delete whitespace at end of some lines;

(chrisw: resend and update to current audit-2.6 tree)

Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Chris Wright <chrisw@osdl.org>
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
2006-03-20 14:08:53 -05:00
Jason Baron
7e7f8a036b [PATCH] make vm86 call audit_syscall_exit
hi,

The motivation behind the patch below was to address messages in
/var/log/messages such as:

Jan 31 10:54:15 mets kernel: audit(:0): major=252 name_count=0: freeing
multiple contexts (1)
Jan 31 10:54:15 mets kernel: audit(:0): major=113 name_count=0: freeing
multiple contexts (2)

I can reproduce by running 'get-edid' from:
http://john.fremlin.de/programs/linux/read-edid/.

These messages come about in the log b/c the vm86 calls do not exit via
the normal system call exit paths and thus do not call
'audit_syscall_exit'. The next system call will then free the context for
itself and for the vm86 context, thus generating the above messages. This
patch addresses the issue by simply adding a call to 'audit_syscall_exit'
from the vm86 code.

Besides fixing the above error messages the patch also now allows vm86
system calls to become auditable. This is useful since strace does not
appear to properly record the return values from sys_vm86.

I think this patch is also a step in the right direction in terms of
cleaning up some core auditing code. If we can correct any other paths
that do not properly call the audit exit and entries points, then we can
also eliminate the notion of context chaining.

I've tested this patch by verifying that the log messages no longer
appear, and that the audit records for sys_vm86 appear to be correct.
Also, 'read_edid' produces itentical output.

thanks,

-Jason

Signed-off-by: Jason Baron <jbaron@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-03-20 14:08:53 -05:00
Al Viro
afc847b7dd [PATCH] don't do exit_io_context() until we know we won't be doing any IO
testcase:

mount /dev/sdb10 /mnt
touch /mnt/tmp/b
umount /mnt
mount /dev/sdb10 /mnt
rm /mnt/tmp/b </mnt/tmp/b
umount /mnt

and watch blkdev_ioc line in /proc/slabinfo.  Vanilla kernel leaks.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-03-18 18:33:46 -05:00
Oleg Nesterov
2d61b86775 [PATCH] disable unshare(CLONE_VM) for now
sys_unshare() does mmput(new_mm).  This is not enough if we have
mm->core_waiters.

This patch is a temporary fix for soon to be released 2.6.16.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
[ Checked with Uli: "I'm not planning to use unshare(CLONE_VM).  It's
  not needed for any functionality planned so far.  What we (as in Red
  Hat) need unshare() for now is the filesystem side." ]
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-18 10:49:36 -08:00
Roman Zippel
a0a0c28c1a [PATCH] posix-timers: fix requeue accounting when signal is ignored
When the posix-timer signal is ignored then the timer is rearmed by the
callback function.  The requeue pending accounting has to be fixed up else
the state might be wrong.

Signed-off-by: Roman Zippel <zippel@linux-m68k.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-17 07:51:25 -08:00
Christoph Lameter
67890d7084 [PATCH] time_interpolator: add __read_mostly
The pointer to the current time interpolator and the current list of time
interpolators are typically only changed during bootup.  Adding
__read_mostly takes them away from possibly hot cachelines.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-17 07:51:25 -08:00
Eric W. Biederman
e0e8eb54d8 [PATCH] unshare: Use rcu_assign_pointer when setting sighand
The sighand pointer only needs the rcu_read_lock on the
read side.  So only depending on task_lock protection
when setting this pointer is not enough.  We also need
a memory barrier to ensure the initialization is seen first.

Use rcu_assign_pointer as it does this for us, and clearly
documents that we are setting an rcu readable pointer.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Acked-by: Paul E. McKenney <paulmck@us.ibm.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-17 07:46:59 -08:00
Paul Mackerras
23dd640112 Merge ../linux-2.6 2006-03-17 12:01:19 +11:00
James Bottomley
f33b5d783b Merge ../linux-2.6 2006-03-14 14:18:01 -06:00
GOTO Masanori
f9a3879abf [PATCH] Fix sigaltstack corruption among cloned threads
This patch fixes alternate signal stack corruption among cloned threads
with CLONE_SIGHAND (and CLONE_VM) for linux-2.6.16-rc6.

The value of alternate signal stack is currently inherited after a call of
clone(...  CLONE_SIGHAND | CLONE_VM).  But if sigaltstack is set by a
parent thread, and then if multiple cloned child threads (+ parent threads)
call signal handler at the same time, some threads may be conflicted -
because they share to use the same alternative signal stack region.
Finally they get sigsegv.  It's an undesirable race condition.  Note that
child threads created from NPTL pthread_create() also hit this conflict
when the parent thread uses sigaltstack, without my patch.

To fix this problem, this patch clears the child threads' sigaltstack
information like exec().  This behavior follows the SUSv3 specification.
In SUSv3, pthread_create() says "The alternate stack shall not be inherited
(when new threads are initialized)".  It means that sigaltstack should be
cleared when sigaltstack memory space is shared by cloned threads with
CLONE_SIGHAND.

Note that I chose "if (clone_flags & CLONE_SIGHAND)" line because:
  - If clone_flags line is not existed, fork() does not inherit sigaltstack.
  - CLONE_VM is another choice, but vfork() does not inherit sigaltstack.
  - CLONE_SIGHAND implies CLONE_VM, and it looks suitable.
  - CLONE_THREAD is another candidate, and includes CLONE_SIGHAND + CLONE_VM,
    but this flag has a bit different semantics.
I decided to use CLONE_SIGHAND.

[ Changed to test for CLONE_VM && !CLONE_VFORK after discussion --Linus ]

Signed-off-by: GOTO Masanori <gotom@sanori.org>
Cc: Roland McGrath <roland@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Acked-by: Linus Torvalds <torvalds@osdl.org>
Cc: Ulrich Drepper <drepper@redhat.com>
Cc: Jakub Jelinek <jakub@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-14 07:57:17 -08:00
Christoph Hellwig
7cd9013be6 [PATCH] remove __put_task_struct_cb export again
The patch '[PATCH] RCU signal handling' [1] added an export for
__put_task_struct_cb, a put_task_struct helper newly introduced in that
patch.  But the put_task_struct couldn't be used modular previously as
__put_task_struct wasn't exported.  There are not callers of it in modular
code, and it shouldn't be exported because we don't want drivers to hold
references to task_structs.

This patch removes the export and folds __put_task_struct into
__put_task_struct_cb as there's no other caller.

[1] http://www2.kernel.org/git/gitweb.cgi?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=e56d090310d7625ecb43a1eeebd479f04affb48b

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Paul E. McKenney <paulmck@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-11 09:19:34 -08:00
Paul Mackerras
5164501794 Merge ../linux-2.6 2006-03-09 14:32:05 +11:00
Dipankar Sarma
529bf6be5c [PATCH] fix file counting
I have benchmarked this on an x86_64 NUMA system and see no significant
performance difference on kernbench.  Tested on both x86_64 and powerpc.

The way we do file struct accounting is not very suitable for batched
freeing.  For scalability reasons, file accounting was
constructor/destructor based.  This meant that nr_files was decremented
only when the object was removed from the slab cache.  This is susceptible
to slab fragmentation.  With RCU based file structure, consequent batched
freeing and a test program like Serge's, we just speed this up and end up
with a very fragmented slab -

llm22:~ # cat /proc/sys/fs/file-nr
587730  0       758844

At the same time, I see only a 2000+ objects in filp cache.  The following
patch I fixes this problem.

This patch changes the file counting by removing the filp_count_lock.
Instead we use a separate percpu counter, nr_files, for now and all
accesses to it are through get_nr_files() api.  In the sysctl handler for
nr_files, we populate files_stat.nr_files before returning to user.

Counting files as an when they are created and destroyed (as opposed to
inside slab) allows us to correctly count open files with RCU.

Signed-off-by: Dipankar Sarma <dipankar@in.ibm.com>
Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-08 14:14:01 -08:00
Dipankar Sarma
21a1ea9eb4 [PATCH] rcu batch tuning
This patch adds new tunables for RCU queue and finished batches.  There are
two types of controls - number of completed RCU updates invoked in a batch
(blimit) and monitoring for high rate of incoming RCUs on a cpu (qhimark,
qlowmark).

By default, the per-cpu batch limit is set to a small value.  If the input
RCU rate exceeds the high watermark, we do two things - force quiescent
state on all cpus and set the batch limit of the CPU to INTMAX.  Setting
batch limit to INTMAX forces all finished RCUs to be processed in one shot.
 If we have more than INTMAX RCUs queued up, then we have bigger problems
anyway.  Once the incoming queued RCUs fall below the low watermark, the
batch limit is set to the default.

Signed-off-by: Dipankar Sarma <dipankar@in.ibm.com>
Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-08 14:14:01 -08:00
Ingo Molnar
81c29a857d [PATCH] idle threads should have a sane ->timestamp value
Idle threads should have a sane ->timestamp value, to avoid init kernel
thread(s) from inheriting it and causing miscalculations in
try_to_wake_up().

Reported-by: Mike Galbraith <efault@gmx.de>.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-08 14:14:00 -08:00
Atsushi Nemoto
5aee405c66 [PATCH] time: add barrier after updating jiffies_64
Add a compiler barrier so that we don't read jiffies before updating
jiffies_64.

Signed-off-by: Atsushi Nemoto <anemo@mba.ocn.ne.jp>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Paul Mackerras <paulus@samba.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-06 18:40:44 -08:00
Tony Lindgren
69239749e1 [PATCH] fix next_timer_interrupt() for hrtimer
Also from Thomas Gleixner <tglx@linutronix.de>

Function next_timer_interrupt() got broken with a recent patch
6ba1b91213 as sys_nanosleep() was moved to
hrtimer.  This broke things as next_timer_interrupt() did not check hrtimer
tree for next event.

Function next_timer_interrupt() is needed with dyntick (CONFIG_NO_IDLE_HZ,
VST) implementations, as the system can be in idle when next hrtimer event
was supposed to happen.  At least ARM and S390 currently use
next_timer_interrupt().

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-06 18:40:44 -08:00
Linus Torvalds
8ba7b0a14b Add early-boot-safety check to cond_resched()
Just to be safe, we should not trigger a conditional reschedule during
the early boot sequence.  We've historically done some questionable
early on, and the safety warnings in __might_sleep() are generally
turned off during that period, so there might be problems lurking.

This affects CONFIG_PREEMPT_VOLUNTARY, which takes over might_sleep() to
cause a voluntary conditional reschedule.

Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-06 17:38:49 -08:00