linux

Author	SHA1	Message	Date
Ingo Molnar	23bdd703a5	sched: do not set softirqs to nice +19 do not set softirqs to nice +19. _If_ for whatever reason we missed to process some high-prio softirq and woke up ksoftirqd, we should give it a fair chance to actually get some work done, even if the system is under load. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-07-09 18:52:00 +02:00
Ingo Molnar	43ae34cb4c	sched: scheduler debugging, core scheduler debugging core: implement /proc/sched_debug and /proc/<PID>/sched files for scheduler debugging. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-07-09 18:52:00 +02:00
Ingo Molnar	77e54a1f88	sched: add CFS debug sysctls add CFS debug sysctls: only tweakable if SCHED_DEBUG is enabled. This allows for faster debugging of scheduler problems. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-07-09 18:52:00 +02:00
Ingo Molnar	b2cfba19f6	sched: remove unused rq types from sched.c remove unused rq types from sched.c, now that we switched over to CFS. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-07-09 18:52:00 +02:00
Ingo Molnar	634fa8c97c	sched: remove interactivity types remove now unused interactivity-heuristics related defined and types of the old scheduler. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-07-09 18:52:00 +02:00
Ingo Molnar	dff06c157b	sched: clean up include files in sched.c clean up include files in sched.c, they were still old-style <asm/>. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-07-09 18:52:00 +02:00
Balbir Singh	172ba844a8	sched: update delay-accounting to use CFS's precise stats update delay-accounting to use CFS's precise stats. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-07-09 18:52:00 +02:00
Ingo Molnar	1b9f19c212	sched: turn on the use of unstable events make use of sched-clock-unstable events. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-07-09 18:51:59 +02:00
Ingo Molnar	bb29ab2686	sched: x86, track TSC-unstable events track TSC-unstable events and propagate it to the scheduler code. Also allow sched_clock() to be used when the TSC is unstable, the rq_clock() wrapper creates a reliable clock out of it. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-07-09 18:51:59 +02:00
Ingo Molnar	dd41f596cd	sched: cfs core code apply the CFS core code. this change switches over the scheduler core to CFS's modular design and makes use of kernel/sched_fair/rt/idletask.c to implement Linux's scheduling policies. thanks to Andrew Morton and Thomas Gleixner for lots of detailed review feedback and for fixlets. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Dmitry Adamushko <dmitry.adamushko@gmail.com> Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>	2007-07-09 18:51:59 +02:00
Ingo Molnar	f3479f10c5	sched: remove the sleep-bonus interactivity code remove the sleep-bonus interactivity code from the core scheduler. scheduling policy is implemented in the policy modules, and CFS does not need such type of heuristics. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-07-09 18:51:59 +02:00
Ingo Molnar	c18a17329b	sched: remove expired_starving() remove the expired_starving() heuristics from the core scheduler. CFS does not need it, and this did not really work well in practice anyway, due to the rq->nr_running multiplier to STARVATION_LIMIT. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-07-09 18:51:59 +02:00
Ingo Molnar	f2ac58ee61	sched: remove sleep_type remove the sleep_type heuristics from the core scheduler - scheduling policy is implemented in the scheduling-policy modules. (and CFS does not use this type of sleep-type heuristics) Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-07-09 18:51:59 +02:00
Ingo Molnar	45bf76df48	sched: cfs, add load-calculation methods add the new load-calculation methods of CFS. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-07-09 18:51:59 +02:00
Ingo Molnar	14531189f0	sched: clean up __normal_prio() position clean up: move __normal_prio() in head of normal_prio(). no code changed. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-07-09 18:51:59 +02:00
Ingo Molnar	71f8bd4600	sched: cleanup: move dequeue/enqueue_task() cleanup: move dequeue/enqueue_task() to a more logical place, to not split up __normal_prio()/normal_prio(). Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-07-09 18:51:59 +02:00
Ingo Molnar	c24d20dbef	sched: move around resched_task() move resched_task()/resched_cpu() into the 'public interfaces' section of sched.c, for use by kernel/sched_fair/rt/idletask.c Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-07-09 18:51:59 +02:00
Ingo Molnar	e05606d330	sched: clean up the rt priority macros clean up the rt priority macros, pointed out by Andrew Morton. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-07-09 18:51:59 +02:00
Ingo Molnar	138a8aeb5b	sched: add cfs_rq ops add the set_task_cfs_rq() abstraction needed by CONFIG_FAIR_GROUP_SCHED. (not activated yet) Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-07-09 18:51:58 +02:00
Ingo Molnar	41b86e9c51	sched: make posix-cpu-timers use CFS's accounting information update the posix-cpu-timers code to use CFS's CPU accounting information. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-07-09 18:51:58 +02:00
Ingo Molnar	20d315d42a	sched: add rq_clock()/__rq_clock() add rq_clock()/__rq_clock(), a robust wrapper around sched_clock(), used by CFS. It protects against common type of sched_clock() problems (caused by hardware): time warps forwards and backwards. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-07-09 18:51:58 +02:00
Ingo Molnar	6aa645ea5f	sched: cfs rq data types add the CFS rq data types to sched.c. (the old scheduler fields are still intact, they are removed by a later patch) Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-07-09 18:51:58 +02:00
Ingo Molnar	fa72e9e484	sched: cfs core, kernel/sched_idletask.c add kernel/sched_idletask.c - which implements the idle thread scheduling class. This further simplifies sched.c (under CFS), for example a number of 'if (p == rq->idle)' type of special-cases can be removed from sched.c, and schedule() gets simpler too. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-07-09 18:51:58 +02:00
Ingo Molnar	bb44e5d1c6	sched: cfs core, kernel/sched_rt.c add kernel/sched_rt.c: SCHED_FIFO/SCHED_RR support. The behavior and semantics of SCHED_FIFO/SCHED_RR tasks is unchanged. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-07-09 18:51:58 +02:00
Ingo Molnar	bf0f6f24a1	sched: cfs core, kernel/sched_fair.c add kernel/sched_fair.c - which implements the bulk of CFS's behavioral changes for SCHED_OTHER tasks. see Documentation/sched-design-CFS.txt about details. Authors: Ingo Molnar <mingo@elte.hu> Dmitry Adamushko <dmitry.adamushko@gmail.com> Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Mike Galbraith <efault@gmx.de> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Dmitry Adamushko <dmitry.adamushko@gmail.com> Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>	2007-07-09 18:51:58 +02:00
Ingo Molnar	425e0968a2	sched: move code into kernel/sched_stats.h create sched_stats.h and move sched.c schedstats code into it. This cleans up sched.c a bit. no code changes are caused by this patch. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-07-09 18:51:58 +02:00
Ingo Molnar	1df21055e3	sched: add init_idle_bootup_task() add the init_idle_bootup_task() callback to the bootup thread, unused at the moment. (CFS will use it to switch the scheduling class of the boot thread to the idle class) Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-07-09 18:51:58 +02:00
Ingo Molnar	f64f61145a	sched: remove sched_exit() remove sched_exit(): the elaborate dance of us trying to recover timeslices given to child tasks never really worked. CFS does not need it either. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-07-09 18:51:58 +02:00
Ingo Molnar	c65cc87052	sched: uninline set_task_cpu() uninline set_task_cpu(): CFS will add more code to it. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-07-09 18:51:58 +02:00
Ingo Molnar	0437e109e1	sched: zap the migration init / cache-hot balancing code the SMP load-balancer uses the boot-time migration-cost estimation code to attempt to improve the quality of balancing. The reason for this code is that the discrete priority queues do not preserve the order of scheduling accurately, so the load-balancer skips tasks that were running on a CPU 'recently'. this code is fundamental fragile: the boot-time migration cost detector doesnt really work on systems that had large L3 caches, it caused boot delays on large systems and the whole cache-hot concept made the balancing code pretty undeterministic as well. (and hey, i wrote most of it, so i can say it out loud that it sucks ;-) under CFS the same purpose of cache affinity can be achieved without any special cache-hot special-case: tasks are sorted in the 'timeline' tree and the SMP balancer picks tasks from the left side of the tree, thus the most cache-cold task is balanced automatically. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-07-09 18:51:57 +02:00
Ingo Molnar	d15bcfdbe1	sched: rename idle_type/SCHED_IDLE enum idle_type (used by the load-balancer) clashes with the SCHED_IDLE name that we want to introduce. 'CPU_IDLE' instead of 'SCHED_IDLE' is more descriptive as well. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-07-09 18:51:57 +02:00
Thomas Gleixner	746976a301	NTP: remove clock_was_set() call to prevent deadlock The clock_was_set() call in seconds_overflow() which happens only when leap seconds are inserted / deleted is wrong in two aspects: 1. it results in a call to on_each_cpu() with interrupts disabled 2. it is potential deadlock source vs. call_lock in smp_call_function() The only possible side effect of the removal might be, that an absolute CLOCK_REALTIME timer fires 1 second too late, in the rare case of leap second deletion and an absolute CLOCK_REALTIME timer which expires in the affected time frame. It will never fire too early. This was probably observed by the reporter of a June 30th -> July 1st hang: http://lkml.org/lkml/2007/7/3/103 A similar problem was observed by Dave Jones, who provided a screen shot with a lockdep back trace, which allowed to analyse the problem. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-03 13:54:27 -07:00
Rafael J. Wysocki	2391dae3e3	PM: introduce set_target method in pm_ops Commit `52ade9b3b9` changed the suspend code ordering to execute pm_ops->prepare() after the device model per-device .suspend() calls in order to fix some ACPI-related issues. Unfortunately, it broke the at91 platform which assumed that pm_ops->prepare() would be called before suspending devices. at91 used pm_ops->prepare() to get notified of the target system sleep state, so that it could use this information while suspending devices. However, with the current suspend code ordering pm_ops->prepare() is called too late for this purpose. Thus, at91 needs an additional method in 'struct pm_ops' that will be used for notifying the platform of the target system sleep state. Moreover, in the future such a method will also be needed by ACPI. This patch adds the .set_target() method to 'struct pm_ops' and makes the suspend code call it, if implemented, before executing the device model per-device .suspend() calls. It also modifies the at91 code to use pm_ops->set_target() instead of pm_ops->prepare(). Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: David Brownell <dbrownell@users.sourceforge.net> Cc: Pavel Machek <pavel@ucw.cz> Cc: Johannes Berg <johannes@sipsolutions.net> Cc: Len Brown <lenb@kernel.org> Cc: Greg KH <greg@kroah.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-07-01 12:29:44 -07:00
Masami Hiramatsu	a66e356c04	relayfs: fix overwrites When I use relayfs with "overwrite" mode, read() still sets incorrect number of consumed bytes. Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Acked-by: Tom Zanussi <zanussi@us.ibm.com> Acked-by: David Wilder <dwilder@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-06-28 11:38:18 -07:00
David Wilder	8d62fdebda	relay file read: start-pos fix Fix a bug in the relay read interface causing the number of consumed bytes to be set incorrectly. Signed-off-by: Tom Zanussi <zanussi@us.ibm.com> Signed-off-by: David Wilder <dwilder@us.ibm.com> Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-06-28 11:34:54 -07:00
Thomas Gleixner	a06381fec7	FUTEX: Restore the dropped ERSCH fix The return value of futex_find_get_task() needs to be -ESRCH in case that the search fails. This was part of the original futex fixes and got accidentally dropped, when the futex-tidy-up patch was split out. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@elte.hu> Cc: Stable Team <stable@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-06-24 12:08:53 -07:00
Tony Jones	7b018b2888	audit: fix oops removing watch if audit disabled Removing a watched file will oops if audit is disabled (auditctl -e 0). To reproduce: - auditctl -e 1 - touch /tmp/foo - auditctl -w /tmp/foo - auditctl -e 0 - rm /tmp/foo (or mv) Signed-off-by: Tony Jones <tonyj@suse.de> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-06-24 08:59:12 -07:00
Christoph Lameter	92c4ca5c3a	sched: fix next_interval determination in idle_balance() The intervals of domains that do not have SD_BALANCE_NEWIDLE must be considered for the calculation of the time of the next balance. Otherwise we may defer rebalancing forever. Siddha also spotted that the conversion of the balance interval to jiffies is missing. Fix that to. From: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> also continue the loop if !(sd->flags & SD_LOAD_BALANCE). Tested-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> It did in fact trigger under all three of mainline, CFS, and -rt including CFS -- see below for a couple of emails from last Friday giving results for these three on the AMD box (where it happened) and on a single-quad NUMA-Q system (where it did not, at least not with such severity). Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-06-24 08:59:11 -07:00
Cedric Le Goater	4e71e474c7	fix refcounting of nsproxy object when unshared When a namespace is unshared, a refcount on the previous nsproxy is abusively taken, leading to a memory leak of nsproxy objects. Signed-off-by: Cedric Le Goater <clg@fr.ibm.com> Cc: Badari Pulavarty <pbadari@us.ibm.com> Cc: Herbert Poetzl <herbert@13thfloor.at> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-06-24 08:59:10 -07:00
Thomas Gleixner	58229a1899	posix-timers: Prevent softirq starvation by small intervals and SIG_IGN posix-timers which deliver an ignored signal are currently rearmed in the timer softirq: This is necessary because the timer needs to be delivered again when SIG_IGN is removed. This is not a problem, when the interval is reasonable. With high resolution timers enabled one might arm a posix timer with a very small interval and ignore the signal. This might lead to a softirq starvation when the interval is so small that the timer is requeued onto the softirq pending list right away. This problem was pointed out by Jan Kiszka. Thanks Jan ! The correct solution would be to stop the timer, when the signal is ignored and rearm it when SIG_IGN is removed. Unfortunately this requires modification in sigaction and involves non trivial sighand locking. It's too late in the release cycle for such a change. For now we just keep the timer running and enforce that the timer only fires every jiffie. This does not break anything as we keep the overrun counter correct. It adds a little inaccuracy to the timer_gettime() interface, but... The more complex change is necessary anyway to fix another short coming of the current implementation, which I discovered while looking at this problem: A pending signal is discarded when SIG_IGN is set. In case that a posixtimer signal is pending then it is discarded as well, but when SIG_IGN is removed later nothing rearms the timer. This is not new, it's that way since posix timers have been merged. So nothing to worry about right now. I have a working solution to fix all of this, but the impact is too large for both stable and 2.6.22. I'm going to send it out for review in the next days. This should go into 2.6.21.stable as well. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Ingo Molnar <mingo@elte.hu> Cc: Jan Kiszka <jan.kiszka@web.de> Cc: Ulrich Drepper <drepper@redhat.com> Cc: Stable Team <stable@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-06-21 15:57:04 -07:00
Linus Torvalds	fa490cfd15	Fix possible runqueue lock starvation in wait_task_inactive() Miklos Szeredi reported very long pauses (several seconds, sometimes more) on his T60 (with a Core2Duo) which he managed to track down to wait_task_inactive()'s open-coded busy-loop. He observed that an interrupt on one core tries to acquire the runqueue-lock but does not succeed in doing so for a very long time - while wait_task_inactive() on the other core loops waiting for the first core to deschedule a task (which it wont do while spinning in an interrupt handler). This rewrites wait_task_inactive() to do all its waiting optimistically without any locks taken at all, and then just double-check the end result with the proper runqueue lock held over just a very short section. If there were races in the optimistic wait, of a preemption event scheduled the process away, we simply re-synchronize, and start over. So the code now looks like this: repeat: /* Unlocked, optimistic looping! */ rq = task_rq(p); while (task_running(rq, p)) cpu_relax(); /* Get the real values */ rq = task_rq_lock(p, &flags); running = task_running(rq, p); array = p->array; task_rq_unlock(rq, &flags); /* Check them.. */ if (unlikely(running)) { cpu_relax(); goto repeat; } /* Preempted away? Yield if so.. */ if (unlikely(array)) { yield(); goto repeat; } Basically, that first "while()" loop is done entirely without any locking at all (and doesn't check for the case where the target process might have been preempted away), and so it's possibly "incorrect", but we don't really care. Both the runqueue used, and the "task_running()" check might be the wrong tests, but they won't oops - they just mean that we could possibly get the wrong results due to lack of locking and exit the loop early in the case of a race condition. So once we've exited the loop, we then get the proper (and careful) rq lock, and check the running/runnable state _safely_. And if it turns out that our quick-and-dirty and unsafe loop was wrong after all, we just go back and try it all again. (The patch also adds a lot of comments, which is the actual bulk of it all, to make it more obvious why we can do these things without holding the locks). Thanks to Miklos for all the testing and tracking it down. Tested-by: Miklos Szeredi <miklos@szeredi.hu> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-06-18 11:52:55 -07:00
Ingo Molnar	a0f98a1cb7	sched: fix SysRq-N (normalize RT tasks) Gene Heskett reported the following problem while testing CFS: SysRq-N is not always effective in normalizing tasks back to SCHED_OTHER. The reason for that turns out to be the following bug: - normalize_rt_tasks() uses for_each_process() to iterate through all tasks in the system. The problem is, this method does not iterate through all tasks, it iterates through all thread groups. The proper mechanism to enumerate over all threads is to use a do_each_thread() + while_each_thread() loop. Reported-by: Gene Heskett <gene.heskett@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-06-18 11:52:55 -07:00
Benjamin Herrenschmidt	caec4e8dc8	Fix signalfd interaction with thread-private signals Don't let signalfd dequeue private signals off other threads (in the case of things like SIGILL or SIGSEGV, trying to do so would result in undefined behaviour on who actually gets the signal, since they are force unblocked). Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Acked-by: Davide Libenzi <davidel@xmailserver.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-06-18 10:18:32 -07:00
Thomas Gleixner	bd197234b0	Revert "futex_requeue_pi optimization" This reverts commit `d0aa7a70bf`. It not only introduced user space visible changes to the futex syscall, it is also non-functional and there is no way to fix it proper before the 2.6.22 release. The breakage report ( http://lkml.org/lkml/2007/5/12/17 ) went unanswered, and unfortunately it turned out that the concept is not feasible at all. It violates the rtmutex semantics badly by introducing a virtual owner, which hacks around the coupling of the user-space pi_futex and the kernel internal rt_mutex representation. At the moment the only safe option is to remove it fully as it contains user-space visible changes to broken kernel code, which we do not want to expose in the 2.6.22 release. The patch reverts the original patch mostly 1:1, but contains a couple of trivial manual cleanups which were necessary due to patches, which touched the same area of code later. Verified against the glibc tests and my own PI futex tests. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Ingo Molnar <mingo@elte.hu> Acked-by: Ulrich Drepper <drepper@redhat.com> Cc: Pierre Peiffer <pierre.peiffer@bull.net> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-06-18 09:48:41 -07:00
Rafael J. Wysocki	2f41dddbbd	swsusp: Fix userland interface Fix oops caused by 'cat /dev/snapshot', reported by Arkadiusz Miskiewicz, and make it impossible to thaw tasks with the help of the swsusp userland interface while there is a snapshot image ready to save. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Pavel Machek <pavel@ucw.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-06-16 13:16:15 -07:00
Paul Jackson	3e903e7b16	cpuset: zero malloc - fix for old cpusets The cpuset code to present a list of tasks using a cpuset to user space could write to an array that it had kmalloc'd, after a kmalloc request of zero size. The problem was that the code didn't check for writes past the allocated end of the array until -after- the first write. This is a race condition that is likely rare -- it would only show up if a cpuset went from being empty to having a task in it, during the brief time between the allocation and the first write. Prior to roughly 2.6.22 kernels, this was also a benign problem, because a zero kmalloc returned a few usable bytes anyway, and no harm was done with the bogus write. With the 2.6.22 kernel changes to make issue a warning if code tries to write to the location returned from a zero size allocation, this problem is no longer benign. This cpuset code would occassionally trigger that warning. The fix is trivial -- check before storing into the array, not after, whether the array is big enough to hold the store. Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: "Serge E. Hallyn" <serue@us.ibm.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Dave Hansen <haveblue@us.ibm.com> Cc: Herbert Poetzl <herbert@13thfloor.at> Cc: Kirill Korotaev <dev@openvz.org> Cc: Paul Menage <menage@google.com> Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com> Cc: Christoph Lameter <clameter@sgi.com> Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-06-16 13:16:15 -07:00
Alexey Kuznetsov	778e9a9c3e	pi-futex: fix exit races and locking problems 1. New entries can be added to tsk->pi_state_list after task completed exit_pi_state_list(). The result is memory leakage and deadlocks. 2. handle_mm_fault() is called under spinlock. The result is obvious. 3. results in self-inflicted deadlock inside glibc. Sometimes futex_lock_pi returns -ESRCH, when it is not expected and glibc enters to for(;;) sleep() to simulate deadlock. This problem is quite obvious and I think the patch is right. Though it looks like each "if" in futex_lock_pi() got some stupid special case "else if". :-) 4. sometimes futex_lock_pi() returns -EDEADLK, when nobody has the lock. The reason is also obvious (see comment in the patch), but correct fix is far beyond my comprehension. I guess someone already saw this, the chunk: if (rt_mutex_trylock(&q.pi_state->pi_mutex)) ret = 0; is obviously from the same opera. But it does not work, because the rtmutex is really taken at this point: wake_futex_pi() of previous owner reassigned it to us. My fix works. But it looks very stupid. I would think about removal of shift of ownership in wake_futex_pi() and making all the work in context of process taking lock. From: Thomas Gleixner <tglx@linutronix.de> Fix 1) Avoid the tasklist lock variant of the exit race fix by adding an additional state transition to the exit code. This fixes also the issue, when a task with recursive segfaults is not able to release the futexes. Fix 2) Cleanup the lookup_pi_state() failure path and solve the -ESRCH problem finally. Fix 3) Solve the fixup_pi_state_owner() problem which needs to do the fixup in the lock protected section by using the in_atomic userspace access functions. This removes also the ugly lock drop / unqueue inside of fixup_pi_state() Fix 4) Fix a stale lock in the error path of futex_wake_pi() Added some error checks for verification. The -EDEADLK problem is solved by the rtmutex fixups. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Ingo Molnar <mingo@elte.hu> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Ulrich Drepper <drepper@redhat.com> Cc: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-06-08 17:23:34 -07:00
Thomas Gleixner	1a539a8728	rt-mutex: fix chain walk early wakeup bug Alexey Kuznetsov found some problems in the pi-futex code. One of the root causes is: When a wakeup happens, we do not to stop the chain walk so we follow a not longer relevant locking chain. Drop out when this happens. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Ingo Molnar <mingo@elte.hu> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> Cc: Ulrich Drepper <drepper@redhat.com> Cc: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-06-08 17:23:34 -07:00
Thomas Gleixner	c0d1d2bf5a	rt-mutex: fix stale return value Alexey Kuznetsov found some problems in the pi-futex code. The major problem is a stale return value in rt_mutex_slowlock(): When the pi chain walk returns -EDEADLK, but the waiter was woken up during the phases where the locks were dropped, the rtmutex could be acquired, but due to the stale return value -EDEADLK returned to the caller. Reset the return value in the retry path. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Ingo Molnar <mingo@elte.hu> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> Cc: Ulrich Drepper <drepper@redhat.com> Cc: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-06-08 17:23:34 -07:00
Roland McGrath	b74d0deb96	Restrict clearing TIF_SIGPENDING This patch should get a few birds. It prevents sigaction calls from clearing TIF_SIGPENDING in other threads, which could leak -ERESTART. And It fixes ptrace_stop not to clear it, which done at the syscall exit stop could leak -ERESTART. It probably removes the harm from signalfd, at least assuming it never calls dequeue_signal on kernel threads that might have used block_all_signals. Signed-off-by: Roland McGrath <roland@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-06-07 08:52:15 -07:00


2210 Commits